DATE: Thursday, April 4, 2013
TIME: 3:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Text Representation and General Topic Annotation via Latent Dirichlet Allocation (LDA)
PRESENTER: Amir H. Razavi
University of Ottawa
ABSTRACT:

In this session, we present a document representation and general topic annotation method that can be applied to corpora of short documents such as social media texts or news articles. The method applies Latent Dirichlet Allocation (LDA) at its core to infer the corpus's major topics, which are then used for document representation. Each document is automatically assigned one or more topic clusters. The representation we propose operates at multiple levels of granularity, obtained by varying the number of topics, which improves classification performance. Further document annotation is done by projecting the topics extracted and assigned by LDA onto a small set of generic categories. The translation from topical clusters to generic categories is done manually; notably, the number of topical clusters that must be mapped by hand is far smaller than the number of corpus postings that would normally need to be annotated to build training and test sets manually. We show that the accuracy of annotation done through this method is about 80%, which is comparable to inter-annotator agreement on similar tasks. Additionally, by applying LDA, the corpus entries are represented as low-dimensional vectors that can be fed into many supervised or unsupervised machine learning algorithms which, in practice, cannot be applied to conventional high-dimensional text representations.
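The pipeline described above can be sketched as follows. This is a minimal illustration using scikit-learn's LDA implementation; the talk does not specify an implementation, and the tiny corpus, topic counts, and topic-to-category mapping here are hypothetical stand-ins for the corpus-scale analysis the abstract describes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy corpus of short "postings" (illustrative only).
docs = [
    "the game ended with a late goal and the team won the match",
    "the striker scored twice as the team won the cup final",
    "parliament passed the budget after a long debate on taxes",
    "the minister defended the new tax policy in parliament",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)

# Multiple granularities: fit one LDA model per topic count.
representations = {}
for n_topics in (2, 4):
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    # Each row is a low-dimensional topic-proportion vector for one document.
    representations[n_topics] = lda.fit_transform(X)

coarse = representations[2]

# Manual projection from topic clusters to generic categories:
# only n_topics labels are needed, not one label per document.
topic_to_category = {0: "sports", 1: "politics"}  # hypothetical mapping
doc_categories = [topic_to_category[t] for t in coarse.argmax(axis=1)]
```

The key economy is in the last step: a human maps a handful of topic clusters to generic categories once, instead of labeling every posting in the corpus.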