Cross-lingual word embeddings for low-resource and morphologically-rich languages
PRESENTER:
Ali Hakimi Parizi
University of New Brunswick
ABSTRACT:
Despite recent advances in natural language processing, state-of-the-art
methods still struggle with low-resource and morphologically-rich
languages. These methods are data-hungry, and due
to the scarcity of training data for low-resource and morphologically-rich
languages, developing NLP tools for them is a challenging task. Approaches
for forming cross-lingual embeddings and transferring knowledge from a
resource-rich to a low-resource language have emerged to overcome the lack
of training data. Although cross-lingual methods have improved markedly in
recent years, they still have limitations that have not been properly
addressed. One important limitation is the out-of-vocabulary
word (OOV) problem, i.e., words that occur in a document being processed,
but that the model did not observe during training. The OOV problem is
more severe for low-resource languages, since relatively little training
data is available for them, and for morphologically-rich languages, since
many of their word forms are likely never observed in the training data.
Approaches to learning
sub-word embeddings have been proposed to address the OOV problem
in monolingual models, but most prior work has not considered sub-word
embeddings in cross-lingual models. The hypothesis of this work is that
sub-word information can be leveraged to overcome the OOV problem
in low-resource and morphologically-rich languages. This work shows the
effectiveness of
sub-word information in the cross-lingual space and how it can be employed
to overcome the OOV problem. Moreover, it presents a novel cross-lingual
word representation method that incorporates sub-word information
during training, both to learn a better cross-lingual shared space and
to better represent OOV words in that space. This method is particularly
suitable for low-resource scenarios, a claim supported by a series of
experiments on bilingual lexicon induction, monolingual word similarity,
and a downstream task, document classification. More specifically,
bilingual lexicon induction experiments on twelve low-resource and
morphologically-rich languages show that the method is suitable for
low-resource settings.
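
To make concrete how sub-word information can give an OOV word a representation, the following is a minimal Python sketch of composing a word vector from character n-grams, in the style of common monolingual sub-word embedding models. It is not the method presented in this work. The names `char_ngrams`, `bucket`, `embed`, and `NGRAM_TABLE`, the dimensionality, the bucket count, and the n-gram range are all illustrative assumptions, and the n-gram vectors are random stand-ins for trained ones.

```python
import hashlib
import math
import random

DIM = 8         # embedding dimensionality (illustrative assumption)
BUCKETS = 1000  # number of hashed n-gram buckets (illustrative assumption)


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def bucket(ngram):
    """Map an n-gram to a stable bucket id (md5 keeps runs reproducible)."""
    digest = hashlib.md5(ngram.encode("utf-8")).hexdigest()
    return int(digest, 16) % BUCKETS


# Stand-in for a trained n-gram embedding table: random vectors only.
_rng = random.Random(0)
NGRAM_TABLE = [[_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
               for _ in range(BUCKETS)]


def embed(word):
    """Average the n-gram vectors of a word; works for unseen words too."""
    grams = char_ngrams(word)
    if not grams:
        return [0.0] * DIM
    vec = [0.0] * DIM
    for gram in grams:
        row = NGRAM_TABLE[bucket(gram)]
        for k in range(DIM):
            vec[k] += row[k]
    return [v / len(grams) for v in vec]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
```

Because `embed` depends only on character n-grams, a word form that never appeared in training still receives a vector, and inflected forms that share n-grams with known forms reuse the same table rows. With a trained table in place of the random one, such vectors could then be compared with `cosine` in a shared space.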