Dec 3, 2014

DATE:	Wed, Dec 3, 2014
TIME:	12:00 pm
PLACE:	Council Room (SITE 5-084)
TITLE:	Topic Modeling of Short Social Messages
PRESENTER:	Kenton White, Girih Inc.
ABSTRACT: Topic modeling discovers the abstract topics that occur in a collection of documents. Latent Dirichlet Allocation (LDA), perhaps the most popular topic modeling algorithm, uses the statistical occurrence of words in each document to infer a distribution of topics across the document collection. Techniques of this kind assume that each document is a mixture of related topics. A collection of short social messages (SSMs), such as Tweets, breaks this assumption: each SSM covers a single topic, and the collection as a whole may not be a mixture of related topics. Instead, I explore using Non-Negative Matrix Factorization (NMF) to model topics in SSMs. NMF segments SSMs into topics based on inferred similarities between their authors, using author identity from the social graph. Working with a corpus of 801,943 Tweets collected from Ottawa, ON in August 2014, I compare the topics extracted by LDA and NMF. I will show how NMF can learn to extract local news topics from Twitter streams.
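
For readers unfamiliar with NMF, the following is a minimal sketch of the basic factorization only, not the presenter's method: it applies plain NMF (Lee-Seung multiplicative updates) to a small message-by-term count matrix and does not use author identity or the social graph. The vocabulary, the six toy messages, the function names, and the choice of two topics are all hypothetical and exist only for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))

def nmf(v, k, iterations=200, seed=0, eps=1e-9):
    """Factor non-negative V (n x m) into W (n x k) and H (k x m)
    using multiplicative updates; all entries stay non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H)
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(m)]
             for a in range(k)]
        # W <- W * (V H^T) / (W H H^T)
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(k)]
             for i in range(n)]
    return w, h

# Hypothetical toy data: rows are messages, columns are term counts.
vocab = ["bus", "transit", "strike", "hockey", "senators", "game"]
v = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1],
]

w, h = nmf(v, k=2)

# Each row of H weights the vocabulary for one topic; each row of W
# gives one message's loading on the topics.
for topic, weights in enumerate(h):
    top_terms = sorted(zip(weights, vocab), reverse=True)[:3]
    print(topic, [term for _, term in top_terms])
for message, loadings in enumerate(w):
    print(message, "-> topic", loadings.index(max(loadings)))
```

In this sketch each message is assigned to the single topic with the largest loading in W, which mirrors the single-topic-per-message view described in the abstract. A real corpus would call for a sparse matrix library rather than nested lists.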