DATE: Wednesday, May 23, 2012
TIME: 3:00 pm
PLACE: Council Room (SITE 5-084)
TITLE: Document Clustering with Dual Supervision
PRESENTER: Evangelos Milios
Dalhousie University
ABSTRACT:

Academic researchers commonly maintain a personal library of papers, which they would like to organize according to their own needs. Clustering techniques are often employed to achieve this goal by grouping the document collection into different topics. Unsupervised clustering requires no user effort, but it produces a single universal output that may not satisfy individual users. Document clustering therefore needs user input as guidance to generate personalized clusters. Semi-supervised clustering incorporates prior information and has the potential to produce customized clusters. Traditional semi-supervised clustering relies on user supervision in the form of labeled instances or pairwise instance constraints. However, alternative forms of user supervision exist, such as labeling features. The joint use of document-level and feature-level supervision has been called dual supervision. First, we explore a framework that uses feature supervision for feature selection, in which the user indicates whether a feature is useful for clustering. Second, we enhance semi-supervised clustering with feature supervision using feature re-weighting. Third, we propose a unified framework that combines document supervision and feature supervision through seeding. The newly proposed algorithms are evaluated using oracles and are shown to produce clusters that better match a single user's point of view than document clustering without any supervision or with document supervision alone. Finally, we conduct a user study to confirm that different users have different understandings of the same document collection and prefer personalized clusters. At the same time, we demonstrate that document clustering with dual supervision is able to produce good personalized clusters even with noisy user input. Dual supervision is also shown to work better for personalization than any single form of supervision.

This is joint research with Yeming Hu and Jamie Blustein.