DATE: Wed, Dec 3, 2014
TIME: 12:00 pm
PLACE: Council Room (SITE 5-084)
TITLE: Topic Modeling of Short Social Messages
PRESENTER: Kenton White
Girih Inc.
ABSTRACT:

Topic modeling discovers the abstract topics that occur in a collection of documents. Latent Dirichlet Allocation (LDA), perhaps the most popular topic modeling algorithm, uses the statistical occurrence of words in each document to infer a topic distribution across the document collection. Such techniques assume that each document is a mixture of related topics. A collection of short social messages (SSMs), such as Tweets, breaks this assumption: each SSM addresses a single topic, and the collection as a whole may not be a mixture of related topics. As an alternative, I explore using Non-Negative Matrix Factorization (NMF) to model topics in SSMs. NMF segments SSMs into topics based on inferred similarities of the authors, using author identity from the social graph. Working with a corpus of 801,943 Tweets collected from Ottawa, ON in August 2014, I compare the topics extracted by LDA and NMF. I will show how NMF can learn to extract local news topics from Twitter streams.