DATE: Monday, Nov 22, 2010
TIME: 3:30 pm
PLACE: Council Room (SITE 5-084)
TITLE: Accelerating K-Means Revisited - Sparse Data Vectors
PRESENTER: Andrew McPherson
IT Research and Development, CSEC
ABSTRACT:

In 2003 Charles Elkan reported a considerable gain in speed for K-Means clustering using the triangle inequality. In this talk we present further work specifically targeting the case where the entities being clustered are represented by very sparse vectors in a very high-dimensional space. This situation arises naturally when clustering a large set of documents represented by bag-of-words vectors. We are concerned only with speed, and we concentrate on small numbers of clusters, as used in iterative clustering schemes such as X-Means.
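
For readers unfamiliar with the idea, the following is a minimal sketch of the triangle-inequality test behind Elkan's speedup, applied to a single sparse point. It is not the method presented in the talk. It assumes a sparse vector stored as a dict of index to value and dense centers stored as lists; the names sparse_dot, dist and assign are hypothetical. If d(c, c') >= 2 d(x, c), then d(x, c') >= d(x, c), so the distance from x to c' need not be computed.

```python
import math

def sparse_dot(x, c):
    # Dot product touching only the nonzero entries of the sparse vector x.
    return sum(v * c[i] for i, v in x.items())

def dist(x, x_sq, c, c_sq):
    # ||x - c|| via ||x||^2 - 2 x.c + ||c||^2, so the cost depends on nnz(x).
    d2 = x_sq - 2.0 * sparse_dot(x, c) + c_sq
    return math.sqrt(max(d2, 0.0))  # clamp small negative rounding errors

def assign(x, centers):
    # Return (index of nearest center, distance, number of skipped centers).
    k = len(centers)
    c_sq = [sum(v * v for v in c) for c in centers]
    # Center-to-center distances; in a full K-Means loop these would be
    # computed once per iteration, not once per point.
    cc = [[math.sqrt(sum((a - b) ** 2 for a, b in zip(centers[i], centers[j])))
           for j in range(k)] for i in range(k)]
    x_sq = sum(v * v for v in x.values())
    best, best_d, skipped = 0, dist(x, x_sq, centers[0], c_sq[0]), 0
    for j in range(1, k):
        # Triangle inequality: center j cannot be closer than the current best.
        if cc[best][j] >= 2.0 * best_d:
            skipped += 1
            continue
        d = dist(x, x_sq, centers[j], c_sq[j])
        if d < best_d:
            best, best_d = j, d
    return best, best_d, skipped

centers = [[0.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
x = {1: 0.9}  # sparse vector with a single nonzero entry
best, best_d, skipped = assign(x, centers)
print(best, skipped)
```

With a small number of clusters the center-to-center table is tiny, so the test costs little compared with a distance computation.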