IDM 2002 Grant Report

Order, Spacing, and Clustering in Visual Exploration of Large Scale Data
Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department
Worcester Polytechnic Institute

Contact Information:

Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department, Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5671 (Ward), (508) 831-5815 (Rundensteiner)
Fax: (508) 831-5776
E-mail: [matt, rundenst]@cs.wpi.edu

WWW Page:

http://davis.wpi.edu/~xmdv

Project Award Information:

Award Number: IIS-0119276
Duration: 9/01/2001 - 8/31/2004
Title: Order, Spacing, and Clustering in Visual Exploration of Large Scale Data

Keywords:

visualization, data mining, hierarchical data management, high dimensional data analysis

Project Summary:

This project involves the development of interactive visualization and data management techniques for the exploration of data sets characterized by very high dimensionality and data type heterogeneity. This will be accomplished by applying multi-resolution strategies across the dimensions of a data set as well as within individual dimensions containing nominal or categorical values. For visualization, the tasks will involve the design and development of methods for determining good ordering, spacing, and clustering of attributes and dimensions, and augmenting several existing multivariate visualization methods to allow variable spacing and resolution in each space (inter-attribute, intra-attribute, inter-record). We also plan to develop ordering and spacing schemes to emphasize strong correlations within data sets, either between dimensions or between individual records. For interaction, the goal will be the investigation, development, and assessment of tools for intuitive navigation and view modification within the three spaces. Interactive, user-guided reclustering tools will be developed to split and group data and dimensions based on user observations, thus allowing users input into the process of locating the most important features of high-dimensional complex data. Finally, for data management, the tasks will involve research into high-dimensional indexing, multi-resolution data view management, query processing and optimization, as well as caching and prefetching strategies to enable efficient exploration of large, complex data repositories.

Publications and Products:

1.: Y. Fua, E. A. Rundensteiner, and M. O. Ward, ``Hierarchical Parallel Coordinates for Visualizing Large Multivariate Data Sets,'' Proc. Visualization '99, pp. 43-50 (October, 1999).
2.: Y. Fua, M. O. Ward, and E. A. Rundensteiner, ``Structure-based brushes: a mechanism for navigating hierarchically organized data and information spaces,'' IEEE Trans. Visualization and Computer Graphics, Vol. 6, No. 2, pp. 150-159 (April, 2000).
3.: J. Yang, M. Ward, and E. Rundensteiner, ``Interactive hierarchical displays: a general framework for visualization and exploration of large multivariate data sets,'' accepted for publication, Computers and Graphics, 2002.
4.: J. Yang, M. Ward, and E. Rundensteiner, ``Hierarchical exploration of large multivariate data sets,'' in Data Visualization: the State of the Art, (F. Post, ed.), in press, 2002.
5.: J. Yang, M. Ward, and E. Rundensteiner, ``Visual hierarchical dimension reduction for exploration of high dimensional data sets,'' submitted to IEEE Symposium on Information Visualization (March, 2002).
6.: J. Yang, M. Ward, and E. Rundensteiner, ``InterRing: an interactive tool for visually navigating and manipulating hierarchical structures,'' submitted to IEEE Symposium on Information Visualization (March, 2002).
7.: XmdvTool 5.0 released to the public domain (Spring, 2002), with support for hierarchical versions of all visualization tools, distortion techniques in screen, data, and structure space for focus + context exploration, and optimized indexing and querying with the Oracle back-end.

Project Impact:

This project currently contributes to the educational development of three graduate students, two of whom are female. The software under development has been presented to students in a graduate course (CS563, Advanced Topics in Graphics). Hierarchical data management techniques, another focus of this project, have been presented in another graduate course (CS561, Advanced Database Systems). Furthermore, several research groups have downloaded and are working with current and past releases of the software package, which is available in the public domain, including source code.

Goals, Objectives, and Targeted Activities:

The stages of the project are as follows:

1.: Investigate different measures of correlation, distance, and similarity between dimensions of a data set in order to more compactly and appropriately convey the relevant information content to the user. Incorporate selected measures into algorithms for clustering, ordering, and spacing data dimensions.
2.: Investigate different measures of distance and correlation between values within one nominal data dimension, and integrate selected measures into algorithms for clustering and ordering of elements in all non-numeric data fields.
3.: Investigate and develop methods to enhance multivariate visualizations to incorporate automated ordering and spacing of dimensions.
4.: Investigate and develop methods to graphically depict meta-dimensions (clusters of dimensions) and enhance visualization tools to include this functionality.
5.: Design and develop techniques to visually depict clusters of nominal variable values, and implement them within XmdvTool, along with methods for variable spacing and ordering.
6.: Design interactive tools for effective and intuitive exploration of data that has been hierarchically structured within a dimension (for nominals), between dimensions, and between data records.
7.: Design and develop interactive tools for user-guided reclustering, ordering, and spacing of hierarchically structured information, and apply them to the three hierarchical structures.
8.: Develop the database management infrastructure needed to support rapid restructuring and hierarchical navigation of data sets containing large numbers of records, large numbers of variables, and nominal attributes with potentially significant implicit relationships.
9.: Evaluate all of the above, as appropriate, via a combination of computational benchmarking, performance studies, usability testing, and domain-specific case studies.
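
As an illustration of stages 1 and 3, the following is a minimal sketch of how a similarity measure between dimensions could drive automated ordering and spacing. It is not the method used in XmdvTool: the choice of absolute Pearson correlation as the similarity measure, the greedy nearest-neighbor ordering, and all function names and sample data are assumptions made for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant dimension carries no correlation information
    return cov / math.sqrt(vx * vy)

def dimension_distance(x, y):
    """Assumed distance: 0 for perfectly (anti)correlated, 1 for uncorrelated."""
    return 1.0 - abs(pearson(x, y))

def order_dimensions(columns, start):
    """Greedy ordering: repeatedly append the dimension closest to the last one."""
    order = [start]
    remaining = [name for name in columns if name != start]
    while remaining:
        last = columns[order[-1]]
        nearest = min(remaining,
                      key=lambda name: dimension_distance(last, columns[name]))
        order.append(nearest)
        remaining.remove(nearest)
    return order

def axis_positions(columns, order, min_gap=0.1):
    """Space adjacent axes in proportion to their distance (with a minimum gap)."""
    positions = [0.0]
    for a, b in zip(order, order[1:]):
        gap = max(min_gap, dimension_distance(columns[a], columns[b]))
        positions.append(positions[-1] + gap)
    return positions

# Hypothetical data set: four dimensions, five records each.
columns = {
    "height": [1, 2, 3, 4, 5],
    "noise":  [2, 9, 1, 7, 3],
    "weight": [2, 4, 6, 8, 10.5],
    "depth":  [5, 4, 3, 2, 1],
}
order = order_dimensions(columns, "height")
positions = axis_positions(columns, order)
```

With this ordering, strongly correlated dimensions end up adjacent and close together, while weakly related ones are pushed apart. A greedy pass is only a heuristic; finding an optimal ordering is a harder combinatorial problem.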

Project References:

Links to publications regarding this project, all software developed, and case studies highlighting the utilization of the software on a variety of datasets can be found at the project web site http://davis.wpi.edu/~xmdv.

Area Background:

Visualization is the graphical presentation of data and information for the purposes of communicating results, verifying hypotheses, and qualitative exploration. It has long been a standard tool to assist statistical and scientific analysis and is becoming an increasingly important component in database and data mining activities, both for its ability to provide rich overviews and to permit users to rapidly detect patterns and outliers. The process of visualizing data consists of mapping selected data fields to specific graphical components or their attributes, such as position, size, or color, in such a way that data features of interest may be readily perceived, classified, and measured by the user.
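
The mapping step described above can be sketched as follows. This is an illustrative example only: the field names, output ranges, and the `scale` and `map_records` helpers are hypothetical and are not taken from any particular visualization tool.

```python
def scale(value, lo, hi, out_lo, out_hi):
    """Linearly map value from [lo, hi] to [out_lo, out_hi]."""
    if hi == lo:
        return (out_lo + out_hi) / 2  # degenerate range: use the midpoint
    t = (value - lo) / (hi - lo)
    return out_lo + t * (out_hi - out_lo)

def map_records(records, x_field, size_field, color_field):
    """Map three data fields to position, size, and gray level of a glyph."""
    fields = (x_field, size_field, color_field)
    ranges = {f: (min(r[f] for r in records), max(r[f] for r in records))
              for f in fields}
    glyphs = []
    for r in records:
        glyphs.append({
            "x": scale(r[x_field], *ranges[x_field], 0.0, 1.0),
            "radius": scale(r[size_field], *ranges[size_field], 2.0, 10.0),
            "gray": scale(r[color_field], *ranges[color_field], 0.0, 1.0),
        })
    return glyphs

# Hypothetical records with three numeric fields.
records = [
    {"mpg": 10, "hp": 200, "wt": 4000},
    {"mpg": 30, "hp": 100, "wt": 2000},
]
glyphs = map_records(records, "mpg", "hp", "wt")
```

Each record becomes one glyph whose horizontal position, radius, and gray level encode one field apiece, so a viewer can compare records along three attributes at once.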

Most visualization techniques developed to date work most effectively with data sets with small numbers of dimensions (fewer than 20) containing only numeric data. However, data sets today commonly exceed hundreds of dimensions, and often contain non-numeric fields. Clearly, there are many challenging problems in visualizing more complex data sets.

In another vein, while database technology is quite sufficient to store large numbers of heterogeneous records with many fields, the operations most efficiently supported are those required for transaction management, and not necessarily those required for interactive exploration. Techniques such as multi-resolution indexing, semantic caching, and adaptive prefetching are essential to enable real-time access to the data of most interest to the particular user performing a specific analysis task.
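
To make the caching and prefetching idea concrete, here is a minimal sketch of a cache that sits between an interactive display and a slower data store. It is a plain least-recently-used cache with caller-supplied prefetch hints, a much simpler stand-in for the semantic caching and adaptive prefetching named above. The `PrefetchingCache` class, the `fetch` callback (standing in for a database query), and the cluster keys are all hypothetical.

```python
from collections import OrderedDict

class PrefetchingCache:
    """LRU cache that also loads items the user is likely to request next."""

    def __init__(self, fetch, capacity=4):
        self.fetch = fetch          # assumed: callable that queries the data store
        self.capacity = capacity
        self.store = OrderedDict()  # key -> value, oldest entry first
        self.hits = 0
        self.misses = 0

    def _load(self, key):
        """Return the value for key, fetching and evicting as needed."""
        if key in self.store:
            self.store.move_to_end(key)  # mark as most recently used
            return self.store[key]
        value = self.fetch(key)
        self.store[key] = value
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
        return value

    def get(self, key, neighbors=()):
        """Return the value for key, then prefetch the given neighbor keys."""
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
        value = self._load(key)
        for n in neighbors:
            if n not in self.store:
                self._load(n)
        return value

# Hypothetical usage: keys name clusters in a hierarchy; when the user views
# a cluster, its children are prefetched in anticipation of a drill-down.
queries = []

def fetch(key):
    queries.append(key)
    return "summary of " + key

cache = PrefetchingCache(fetch, capacity=4)
cache.get("root", neighbors=["a", "b"])  # miss; also loads children a and b
cache.get("a")                           # hit; no new query is issued
```

In an exploration tool the neighbor hints would come from the navigation model, for example the children or siblings of the cluster currently in view.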

Area References:

1.: M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande, J. Myllymaki, and K. Wenger, DEVise: Integrated Querying and Visual Exploration of Large Datasets, Proc. 1997 ACM SIGMOD International Conference on Management of Data, May, 1997.
2.: Cleveland, W., Visualizing Data, Hobart Press, Summit, NJ, 1993.
3.: Nielson, G. M., Hagen, H., and Muller, M. (eds.), Scientific Visualization: Overviews, Methodologies, Techniques, IEEE Computer Society Press, Los Alamitos, CA, 1997.

Potential Related Projects:

Multivariate data visualization, Hierarchical data management and analysis, Caching and prefetching.

Matthew Ward
2002-04-15