Beyond Word2Vec: Learning from Unobserved Patterns
PRESENTER:
Behrouz Haji Soleimani
Dalhousie University
ABSTRACT:
In recent years there has been increasing interest in learning compact
representations (i.e., embeddings) for a set of input data points. In
Natural Language Processing, embeddings are used to learn vector
representations for text at different levels of granularity, from words
and sentences to paragraphs and documents. Pre-trained word vectors can
be used as inputs in almost any NLP application (e.g., sentiment
analysis). In this talk, we will briefly review some successful word
embedding algorithms and their underlying characteristics, covering both
neural word embeddings and spectral methods. We examine the importance
of "negative samples", the unobserved or insignificant word-context
co-occurrences, and show how they can be used effectively to improve the
distribution of words in the latent space. We introduce two new word
embedding algorithms that aim to overcome the weaknesses of existing
methods; they exploit the full capacity of negative samples and yield a
better distribution of words in the embedded space. We train our
algorithms on Wikipedia articles and show that they outperform
state-of-the-art methods on various word similarity tasks.
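The abstract does not spell out the two new methods, but the role of
negative samples can be made concrete with the skip-gram
negative-sampling (SGNS) objective from word2vec, which the talk builds
on. Below is a minimal sketch in Python/NumPy of the per-pair SGNS loss;
the function and variable names are illustrative assumptions, not the
speaker's implementation.

# Minimal sketch of the skip-gram negative-sampling (SGNS) loss from
# word2vec. Names are illustrative, not the speaker's actual code.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_vec, c_vec, neg_vecs):
    """Loss for one observed (word, context) pair plus k negative samples.

    w_vec    : embedding of the center word, shape (d,)
    c_vec    : embedding of the observed context word, shape (d,)
    neg_vecs : embeddings of k sampled "negative" contexts, shape (k, d)
    """
    # Observed pair: increase its dot product (similarity).
    pos = -np.log(sigmoid(w_vec @ c_vec))
    # Unobserved pairs: decrease their dot products.
    neg = -np.log(sigmoid(-(neg_vecs @ w_vec))).sum()
    return pos + neg

# Toy usage: d = 50 dimensions, k = 5 randomly drawn negatives.
d, k = 50, 5
w = rng.normal(scale=0.1, size=d)
c = rng.normal(scale=0.1, size=d)
negs = rng.normal(scale=0.1, size=(k, d))
print(sgns_loss(w, c, negs))

In standard SGNS, only a few negatives per pair are sampled from a
smoothed unigram distribution; the methods in the talk are described as
exploiting the full capacity of negative samples rather than a small
random subset.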
BIOGRAPHY:
Behrouz is a PhD student at Dalhousie University and has been working as
a Research Assistant at the Institute for Big Data Analytics since 2013.
He received his MSc degree in Computer Science from the University of
Tehran in 2011. He has completed an internship at Bell Canada and is now
an AI intern at Kinaxis. His current research focuses mainly on
dimensionality reduction, representation learning, deep learning, and
natural language processing. He also has experience in geospatial data
analysis, computer vision, and time-series analysis.