DATE: Tue, May 18, 2021
TIME: 1 pm
PLACE: On Zoom
TITLE: Cross-lingual word embeddings for low-resource and morphologically-rich languages
PRESENTER: Ali Hakimi Parizi
University of New Brunswick
ABSTRACT:

Despite recent advances in natural language processing, state-of-the-art methods still fall short on problems involving low-resource and morphologically-rich languages. These methods are data-hungry, and because training data for low-resource and morphologically-rich languages is scarce, developing NLP tools for them is a challenging task. Approaches for forming cross-lingual embeddings and transferring knowledge from a resource-rich language to a low-resource language have emerged to overcome the lack of training data. Although recent years have brought major improvements in cross-lingual methods, these methods still have limitations that have not been properly addressed.

An important problem is the out-of-vocabulary (OOV) word problem, i.e., words that occur in a document being processed but that the model did not observe during training. The OOV problem is more significant for low-resource languages, since relatively little training data is available for them, and for morphologically-rich languages, since a considerable proportion of their word forms is unlikely to appear in the training data. Approaches to learning sub-word embeddings have been proposed to address the OOV problem in monolingual models, but most prior work has not considered sub-word embeddings in cross-lingual models.

The hypothesis is that sub-word information can be leveraged to overcome the OOV problem in low-resource and morphologically-rich languages. This work shows the effectiveness of sub-word information in the cross-lingual space and how it can be employed to overcome the OOV problem. Moreover, it presents a novel cross-lingual word representation method that incorporates sub-word information during training to learn a better cross-lingual shared space and to better represent OOVs in that space. The method is particularly suitable for low-resource scenarios, a claim supported by a series of experiments on bilingual lexicon induction, monolingual word similarity, and a downstream task, document classification. More specifically, its suitability for low-resource languages is shown by conducting bilingual lexicon induction on twelve low-resource and morphologically-rich languages.
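
ILLUSTRATION:

As background on how sub-word information can represent OOV words, the Python sketch below shows a fastText-style approach: an unseen word is assigned the average of the embeddings of its character n-grams. This is only a toy illustration under assumed toy embeddings, not the presenter's method; the function names and vector values are hypothetical.

import numpy as np

def char_ngrams(word, n_min=3, n_max=5):
    # Extract character n-grams with boundary markers, fastText-style.
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def oov_vector(word, ngram_vectors, dim=4):
    # Average the embeddings of the word's n-grams that were seen in training.
    # An OOV word still gets a representation as long as some of its n-grams
    # were observed.
    vecs = [ngram_vectors[g] for g in char_ngrams(word) if g in ngram_vectors]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

# Toy n-gram embeddings standing in for vectors learned during training
# (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
ngram_vectors = {g: rng.normal(size=4) for g in char_ngrams("unforgettable")}

# "forgettably" was never seen as a whole word, but it shares n-grams with
# "unforgettable", so it still receives a non-zero vector.
print(oov_vector("forgettably", ngram_vectors))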