DATE: Tuesday, Nov. 14, 2006
TIME: 2:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Text Classification for Highly Skewed Data
PRESENTER: Yimin Ma
University of Ottawa
ABSTRACT:

Many real-life text classification tasks involve highly skewed class distributions. Recent studies have handled this problem in different ways. The most intuitive approach is to re-sample the training set so that the class distribution becomes relatively balanced. Another approach is to address the data bias through feature selection in the pre-processing step.
In this study, we investigate the behavior of some popular feature selection methods when the data is highly skewed. We propose Modified-BNS, an improved version of the BNS feature selection method that uses the smoothed class distribution as the feature ratio. We also study how feature selection followed by under-sampling of the majority class influences the performance of different classifiers. Our experimental results show that Modified-BNS combined with under-sampling significantly improves the performance of Naïve Bayes.
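
For readers unfamiliar with the two ingredients named in the abstract, the following is a minimal sketch of the standard Bi-Normal Separation (BNS) score and of random under-sampling of the majority class. It is not the Modified-BNS method from the talk, whose details the abstract does not give. The function names, the 0/1 label encoding, and the 0.0005 clipping bound (a value commonly used with BNS to keep the inverse normal CDF finite) are assumptions made for illustration.

```python
import random
from statistics import NormalDist

# Assumed clipping bound; keeps inv_cdf away from 0 and 1, where it is undefined.
_EPS = 0.0005


def bns_score(tp, fp, pos, neg):
    """Standard BNS for one feature: |F^-1(tpr) - F^-1(fpr)|.

    tp / fp: positive / negative documents that contain the feature.
    pos / neg: total positive / negative documents.
    F^-1 is the inverse CDF of the standard normal distribution.
    """
    tpr = min(max(tp / pos, _EPS), 1 - _EPS)
    fpr = min(max(fp / neg, _EPS), 1 - _EPS)
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))


def undersample_majority(docs, labels, seed=0):
    """Randomly drop majority-class documents until both classes are the same size.

    labels are assumed to be 0 or 1; the original document order is preserved.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [docs[i] for i in kept], [labels[i] for i in kept]
```

In a pipeline of the kind the abstract describes, features would first be ranked by a score such as bns_score, the top-ranked ones kept, and the training set then passed through undersample_majority before a classifier is trained.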