Feb. 20, 2004

DATE:	Friday, Feb. 20, 2004
TIME:	2:00 pm
PLACE:	Council Room (SITE 5-084)
TITLE:	DBRS: A Heuristic Density-Based Spatial Clustering Method with Random Sampling
PRESENTER:	Howard J. Hamilton Laboratory for Computational Discovery Department of Computer Science University of Regina
ABSTRACT: When analyzing spatial databases or other datasets with spatial attributes, one frequently wants to cluster the data according to spatial attributes. In this presentation, we will describe a novel density-based spatial clustering method called DBRS. We will begin with a description of clustering methods in general. Then we will explain the density- based approach to clustering, wherein clusters are allowed to have arbitrary shapes as long as they satisfy a minimum density requirement. Next we will discuss the type of datasets we often encounter in our work with large spatial datasets. Then we will describe the DBRS algorithm, and the results of a series of experiments evaluating its effectiveness. Finally, we will present our conclusions concerning the applicability of DBRS to spatial clustering problems. The DBRS algorithm can identify clusters of widely varying shapes, clusters of varying densities, clusters that depend on non-spatial attributes, and approximate clusters in very large databases. DBRS achieves these results by repeatedly picking an unclassified point at random and examining its neighborhood. If the neighborhood is sparsely populated or the purity of the points in the neighborhood is too low, the point is classified as noise. Otherwise, if any point in the neighborhood is part of a known cluster, this neighborhood is joined to that cluster, i.e., all points in the neighborhood are classified as being part of the known cluster. If neither of these two possibilities applies, a new cluster is begun with this neighborhood. Since the neighbors of the selected point are not examined further, DBRS scales well on dense clusters. A heuristic is proposed for approximate clustering in very large databases. With this heuristic, the run time can be significantly reduced by assuming that a probabilistically controlled number of points are noise.