DATE: Thu, Oct 5, 2017
TIME: 1 pm
PLACE: SITE 5084
TITLE: Beyond Word2Vec: Learning from Unobserved Patterns
PRESENTER: Behrouz Haji Soleimani
Dalhousie University
ABSTRACT:

In recent years there has been increasing interest in learning compact representations (i.e., embeddings) for sets of input data points. In Natural Language Processing, embeddings are used to learn vector representations of text at different levels of granularity, from words and sentences to paragraphs and documents. Pre-trained word vectors can be used as inputs in almost any NLP application (e.g., sentiment analysis). In this talk, we will give a brief overview of some successful word embedding algorithms and their underlying characteristics, covering both neural word embeddings and spectral methods. We examine the importance of "negative samples", the unobserved or insignificant word-context co-occurrences, and show how they can be used effectively to improve the distribution of words in the latent space. We introduce two new word embedding algorithms that aim to overcome the weaknesses of existing methods by exploiting the full capacity of negative samples, resulting in a better distribution of words in the embedded space. We train our algorithms on Wikipedia articles and show that they outperform state-of-the-art methods on various word similarity tasks.
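For context, the sketch below illustrates the skip-gram with negative sampling (SGNS) objective popularized by word2vec, the baseline the talk builds on: each observed word-context pair is pulled together in the vector space while a handful of randomly drawn "negative" contexts are pushed away. This is a minimal illustration only; the vocabulary size, dimensionality, learning rate, and uniform sampling distribution are assumptions for the sketch, not details of the presenter's new algorithms.

# Minimal sketch of skip-gram with negative sampling (SGNS).
# All sizes and hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1000   # toy vocabulary size (assumption)
DIM = 50            # embedding dimensionality (assumption)
K = 5               # negative samples per observed pair (assumption)
LR = 0.025          # learning rate (assumption)

# Separate input (word) and output (context) embedding matrices.
W_in = rng.normal(scale=0.1, size=(VOCAB_SIZE, DIM))
W_out = rng.normal(scale=0.1, size=(VOCAB_SIZE, DIM))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(word, context, unigram_probs):
    """One SGD step: pull an observed (word, context) pair together and
    push K randomly drawn 'negative' contexts away from the word."""
    negatives = rng.choice(VOCAB_SIZE, size=K, p=unigram_probs)

    v = W_in[word]
    # Positive pair: ascend the gradient of log sigmoid(v . u_context).
    u_pos = W_out[context]
    g = 1.0 - sigmoid(v @ u_pos)
    grad_v = g * u_pos
    W_out[context] += LR * g * v

    # Negative pairs: ascend the gradient of log sigmoid(-v . u_neg).
    for n in negatives:
        u_neg = W_out[n]
        g = -sigmoid(v @ u_neg)
        grad_v += g * u_neg
        W_out[n] += LR * g * v

    W_in[word] += LR * grad_v

# Toy usage with a uniform negative-sampling distribution (assumption;
# word2vec actually uses unigram counts raised to the 3/4 power).
probs = np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)
sgns_step(word=3, context=42, unigram_probs=probs)

In practice word2vec draws only a small random sample of negatives per pair; the abstract above argues that this signal can be exploited more fully, which is what the talk's two new algorithms set out to do.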
BIOGRAPHY:

Behrouz is currently a PhD student at Dalhousie University and has been working as a Research Assistant at the Institute for Big Data Analytics since 2013. He received his MSc degree in Computer Science from the University of Tehran in 2011. He completed an internship at Bell Canada and is now an AI intern at Kinaxis. His current research focuses on dimensionality reduction, representation learning, deep learning, and natural language processing. He also has experience in geospatial data analysis, computer vision, and time-series analysis.