DATE: Thursday, November 29, 2012
TIME: 4:00 pm
PLACE: Council Room (SITE 5-084)
TITLE: Data exploration and visual clustering of large datasets by using Multidimensional Scaling
PRESENTER:Witold Dzwinel
AGH University of Science and Technology, Institute of Computer Science, Krakow, Poland
ABSTRACT:

Data mining of large amount of data requires computationally efficient and reliable machine learning tools. Blind selection and parameterization of classification/clustering schemes too often yield misleading results. Interactive visualization of data enables better and faster penetration of parameter space and choosing the best match of classification and clustering procedures and their parameters to the type of data being explored. Multidimensional scaling (MDS) is a good candidate to be a methodological framework of interactive visualization. It allows for reconstructing the topology of an original data space in a target 3-D (or 2-D) Euclidean space. We show that by representing every data object as a point particle in 3(2)-D dissipative environment, the Newtonian evolution of initially random configuration of particles leads to a stationary state with a minimal potential energy, which corresponds to the reconstruction error. However, this robust heuristics, similarly as other MDS algorithms, suffers O(M2) memory and time complexity. This fact disables interactive visualization of data sets consisting of M~105 data objects. We show that by employing incomplete dissimilarity matrix D, which represents the topology of the original data space, and by matching its structure to GPU processing mechanisms we are able to visualize interactively datasets consisting of M~105 objects on a workstation equipped with a fast GPU board. Only a small increase of the error as compared to the original MDS procedure is observed.We compare the timings obtained on GPU environment for a few MDS methods including Glimmer, which is currently considered to be the most efficient algorithm used for visualization of huge datasets. We show that though our algorithm is a little bit slower than Glimmer it achieves considerably lower value of reconstruction error yielding better 3-D representation of original data. Moreover, owning to all the advantages of particle method, such as the possibility of interactive control on the simulation parameters and reconstruction error formula, we show that our approach allows for much deeper exploration of data than its counterparts.