DATE: Tuesday, Mar. 07, 2006
TIME: 2:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Active Learning for Text Classification
PRESENTER: Yimin Ma
University of Ottawa
ABSTRACT:

Text classification is the problem of automatically assigning text documents to one or more categories. Traditionally, machine learning research has assumed that the class distribution in the training data is reasonably balanced, but this is not always the case in practice. In our work, we focus on text classification when the training data is highly imbalanced, and we aim for high recall on the minority class. We applied the feature selection method proposed by George Forman to prevent predictive features of the majority class from hiding the useful features of the minority class. Also, because manually labeling training instances is expensive, we applied active learning to minimize the number of training instances that need to be labeled. We studied several active learning methods and identified the one most appropriate for our scenario. We also incorporated a document density factor when calculating the utility in active learning, and our experiments show that it helps to improve recall for the minority class.
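
The abstract does not say how the density factor enters the utility, so the following is only a minimal sketch of one common way to do it, not the presenter's method. It assumes a binary classifier that outputs a minority-class probability, non-negative document vectors (e.g. term counts), uncertainty measured as closeness of that probability to 0.5, and density measured as the average cosine similarity to the other unlabeled documents. The function names and the `beta` exponent are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def uncertainty(p_minority):
    """1.0 when the classifier is undecided (p = 0.5), 0.0 when it is certain."""
    return 1.0 - abs(2.0 * p_minority - 1.0)


def density(index, vectors):
    """Average similarity of one unlabeled document to all the others.

    Assumes non-negative vectors, so the result lies in [0, 1].
    """
    others = [v for j, v in enumerate(vectors) if j != index]
    if not others:
        return 1.0
    return sum(cosine(vectors[index], v) for v in others) / len(others)


def select_query(probs, vectors, beta=1.0):
    """Return (index of the document to label next, list of utilities).

    utility = uncertainty * density ** beta; beta = 0 gives plain
    uncertainty sampling. This weighting is an assumption of the sketch.
    """
    utilities = [
        uncertainty(p) * density(i, vectors) ** beta
        for i, p in enumerate(probs)
    ]
    best = max(range(len(utilities)), key=utilities.__getitem__)
    return best, utilities


if __name__ == "__main__":
    # Toy pool: documents 0 and 1 are near-duplicates, document 2 is an outlier.
    vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    probs = [0.6, 0.9, 0.5]  # hypothetical minority-class probabilities
    print(select_query(probs, vectors)[0])            # density-weighted choice
    print(select_query(probs, vectors, beta=0.0)[0])  # uncertainty-only choice
```

In this toy pool, plain uncertainty sampling would query the isolated document, while the density-weighted utility prefers an uncertain document that resembles others in the pool, on the assumption that its label is more informative about the remaining unlabeled data.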