Sampling to Efficiently Train Bilingual Neural Network
Language Models
PRESENTER:
Colin Cherry
NRC
ABSTRACT:
The neural network joint model of translation (NNJM) combines source and
target context in a 15-gram, feed-forward neural network language model to
produce a powerful translation feature. However, its softmax top layer
means that probability and gradient calculations require a sum over the
entire output vocabulary, resulting in very slow maximum likelihood
estimation (MLE) training. This has led some groups to train using Noise
Contrastive Estimation (NCE), which sidesteps this sum by sampling from the
output vocabulary.
We carry out the first direct comparison of MLE and NCE training
objectives for the NNJM, showing that MLE significantly outperforms NCE on
large-scale Arabic-English and Chinese-English translation tasks. We also
show that this performance drop can be avoided by using a recently proposed
translation noise distribution. In addition to these translation-specific
results, this talk will include a tutorial on Noise Contrastive
Estimation, which is a generally useful technique for efficient training
of any log-linear model with a large output space.
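
As a rough preview of that tutorial material, the numpy sketch below
contrasts the full-softmax MLE loss, whose cost grows with the output
vocabulary, against the NCE loss, which scores only the observed word plus
k sampled noise words. It is a minimal illustration, not the speaker's
implementation: the dimensions are toy-sized, the noise distribution is a
made-up unigram-style q, and the model is assumed to be self-normalized
(partition function treated as 1), as is standard in NCE-trained language
models.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy sizes; real NNJM vocabularies are far larger, which is the point.
    vocab_size, hidden_dim, k_noise = 10_000, 64, 100

    # Output-layer parameters of the softmax (log-linear) top layer:
    # one weight vector and one bias per output word.
    W = rng.normal(scale=0.01, size=(vocab_size, hidden_dim))
    b = np.zeros(vocab_size)

    # A unigram-style noise distribution q(w); a translation noise
    # distribution would plug in here in exactly the same way.
    q = rng.random(vocab_size)
    q /= q.sum()

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def mle_loss(hidden, target):
        # Full softmax: score and normalize over the entire vocabulary.
        scores = W @ hidden + b               # O(|V| * d) work per example
        log_z = np.log(np.exp(scores).sum())  # the expensive sum
        return -(scores[target] - log_z)

    def nce_loss(hidden, target, k=k_noise):
        # NCE: score only the observed word plus k sampled noise words and
        # classify data vs. noise; assumes a self-normalized model (Z ~ 1).
        noise = rng.choice(vocab_size, size=k, p=q)
        words = np.concatenate(([target], noise))
        scores = W[words] @ hidden + b[words]  # O((k + 1) * d) work
        delta = scores - np.log(k * q[words])  # log p_model(w) - log(k q(w))
        p_data = sigmoid(delta)
        return -(np.log(p_data[0]) + np.log(1.0 - p_data[1:]).sum())

    # One toy example: a hidden vector from the network body, a target word id.
    h, w = rng.normal(size=hidden_dim), 42
    print(mle_loss(h, w), nce_loss(h, w))

The only change needed to swap in a different noise distribution, such as
the translation noise distribution mentioned above, is how the noise words
are drawn from q.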