Hierarchical Visualization Techniques for Data Mining
Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department
Worcester Polytechnic Institute

Contact Information:

Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department, Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5671 (Ward), (508) 831-5815 (Rundensteiner)
Fax: (508) 831-5776
E-mail: [matt, rundenst]@cs.wpi.edu

WWW Page:

http://davis.wpi.edu/~xmdv

Project Award Information:

Award Number: IIS-9732897
Duration: 9/01/1998 - 8/31/2001
Title: Hierarchical Visualization Techniques for Data Mining

Keywords:

visualization, data mining, hierarchical data management

Project Summary:

This project involves the development of interactive visualization and data management techniques for the exploration of very large multivariate data sets. The approach consists of extending several multivariate data visualization techniques currently implemented in an existing visualization tool (XmdvTool, developed at WPI) to support hierarchical views of the data, with support for focusing and drill-down using data-driven and structure-driven brushes. A variety of interactive tools for visual data exploration are being designed, including tools for navigating the data hierarchy, for switching smoothly between different views of the same data, and for highlighting phenomena in the data. In order to provide such visual capabilities to the user for interactive exploration over potentially huge data sets with acceptable performance, we are also exploring necessary data management issues, such as hierarchical data structures, query optimization algorithms, indexing techniques, and caching strategies.

Publications and Products:

1. M. O. Ward, "XmdvTool: Integrating Multiple Methods for Visualizing Multivariate Data," Proc. Visualization '94, pp. 326-333 (1994).
2. M. O. Ward and A. R. Martin, "High Dimensional Brushing for Interactive Exploration of Multivariate Data," Proc. Visualization '95, pp. 271-278 (1995).
3. Y. Fua, E. A. Rundensteiner, and M. O. Ward, "Hierarchical Parallel Coordinates for Visualizing Large Multivariate Data Sets," Proc. Visualization '99, pp. 43-50 (October, 1999).
4. Y. Fua, M. O. Ward, and E. A. Rundensteiner, "Navigating Hierarchies with Structure-Based Brushes," Proc. IEEE Symposium on Information Visualization, pp. 58-64 (October, 1999). An extended version of this paper has been invited for publication in a special issue of IEEE Transactions on Visualization and Computer Graphics.
5. D. Stroe, E. Rundensteiner, and M. Ward, "MinMax Trees: Efficient Relational Operation Support for Hierarchical Data Exploration," submitted to ACM Transactions on Database Systems (November, 1999).
6. D. Stroe, E. Rundensteiner, and M. Ward, "Scalable Visual Hierarchy Exploration," submitted to DEXA 2000 (February, 2000).
7. XmdvTool 4.0, released to the public domain (October, 1999), with support for hierarchical parallel coordinates and structure-based brushes.

Project Impact:

This project has contributed to the educational development of five graduate students to date, two of whom are female. The software under development has been presented to students in a graduate course (CS563, Advanced Topics in Graphics). Hierarchical data management techniques, another focus of this project, have been presented in another graduate course (CS561, Advanced Database Systems). Furthermore, several research groups have begun experimenting with the newly released software for their large-scale visualization projects, including Oak Ridge National Labs, the University of Manitoba, the Maui High Performance Computing Center, the University of Maryland, and the Royal Institute of Technology in Stockholm.

Goals, Objectives, and Targeted Activities:

The stages of the project are as follows:

1. Identify, design, and implement algorithms for hierarchical partitioning and/or clustering of large multivariate data sets.
2. Design and implement extended versions of existing multivariate visualization techniques to convey statistical summarizations of selected subtrees.
3. Design and implement strategies for managing and querying large hierarchical dynamic data sets using relational database technology.
4. Study and develop techniques for efficiently computing summarizations (aggregates) of subtrees of the hierarchy, both a priori and on the fly during re-grouping operations by the user.
5. Develop memory management, caching, and prefetching strategies to enable interactive hierarchical exploration even over large disk-resident data sets.
6. Design and implement interactive tools to allow focus, drill-down, consolidation, and other exploratory operations through direct and indirect manipulation.
7. Evaluate and refine the visualization, data management, and interactive exploration tools using both real and synthetic data sets.
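
As an illustration of the subtree summarization and drill-down stages listed above, the following is a minimal sketch in Python. It is not the XmdvTool implementation: the ClusterNode class, the summarize and cut functions, and the sample records are all hypothetical, and the hierarchy is assumed to be given rather than computed by a clustering algorithm. Each internal node is assumed to have at least one child.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class ClusterNode:
    """One node of a hypothetical cluster hierarchy over multivariate records."""
    children: List["ClusterNode"] = field(default_factory=list)
    record: Optional[Sequence[float]] = None          # set only on leaf nodes
    count: int = 0                                    # number of records below this node
    mean: List[float] = field(default_factory=list)   # per-dimension mean
    lo: List[float] = field(default_factory=list)     # per-dimension minimum
    hi: List[float] = field(default_factory=list)     # per-dimension maximum


def summarize(node: ClusterNode) -> None:
    """Fill in count, mean, and extents bottom-up (post-order traversal)."""
    if node.record is not None:
        node.count = 1
        node.mean = list(node.record)
        node.lo = list(node.record)
        node.hi = list(node.record)
        return
    for child in node.children:
        summarize(child)
    node.count = sum(c.count for c in node.children)
    dims = range(len(node.children[0].mean))
    # The parent's mean is the count-weighted mean of its children's means.
    node.mean = [sum(c.mean[d] * c.count for c in node.children) / node.count
                 for d in dims]
    node.lo = [min(c.lo[d] for c in node.children) for d in dims]
    node.hi = [max(c.hi[d] for c in node.children) for d in dims]


def cut(node: ClusterNode, depth: int) -> List[ClusterNode]:
    """Return the nodes to display at a given level of detail (drill-down depth)."""
    if depth == 0 or not node.children:
        return [node]
    shown: List[ClusterNode] = []
    for child in node.children:
        shown.extend(cut(child, depth - 1))
    return shown


# Hypothetical two-dimensional records grouped into two clusters.
leaves = [ClusterNode(record=r)
          for r in [(1.0, 10.0), (3.0, 20.0), (5.0, 60.0), (7.0, 30.0)]]
root = ClusterNode(children=[ClusterNode(children=leaves[:2]),
                             ClusterNode(children=leaves[2:])])
summarize(root)
overview = cut(root, 1)   # two cluster summaries instead of four records
```

Because each parent summary is derived only from its children's summaries, a re-grouping operation in this sketch would need to recompute only the nodes on the path from the changed subtree to the root.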

Project References:

Links to publications regarding this project, all software developed, and case studies highlighting the utilization of the software on a variety of datasets can be found at the project web site http://davis.wpi.edu/~xmdv.

Area Background:

Visualization is the graphical presentation of data and information for the purposes of communicating results, verifying hypotheses, and qualitative exploration. It has long been a standard tool to assist statistical and scientific analysis and is becoming an increasingly important component in database and data mining activities, both for its ability to provide rich overviews and to permit users to rapidly detect patterns and outliers. The process of visualizing data consists of mapping selected data fields to specific graphical components or their attributes, such as position, size, or color, in such a way that data features of interest may be readily perceived, classified, and measured by the user.
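
The mapping step described above can be sketched in a few lines of Python. This is an illustrative example only: the function names, the field names, the canvas size, and the radius and gray-level ranges are assumptions chosen for the sketch, not part of any tool discussed here.

```python
from typing import Dict, List


def normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5 for _ in values]   # constant field: place everything mid-range
    return [(v - lo) / (hi - lo) for v in values]


def map_to_glyphs(records: List[Dict[str, float]],
                  x_field: str, y_field: str,
                  size_field: str, color_field: str,
                  width: float = 400.0, height: float = 300.0) -> List[Dict[str, float]]:
    """Map four data fields to position, size, and gray level of one glyph per record."""
    xs = normalize([r[x_field] for r in records])
    ys = normalize([r[y_field] for r in records])
    sizes = normalize([r[size_field] for r in records])
    colors = normalize([r[color_field] for r in records])
    glyphs = []
    for x, y, s, c in zip(xs, ys, sizes, colors):
        glyphs.append({
            "x": x * width,
            "y": (1.0 - y) * height,     # screen y grows downward
            "radius": 2.0 + 8.0 * s,     # radius between 2 and 10 pixels
            "gray": int(round(255 * c)), # gray level between 0 and 255
        })
    return glyphs


# Hypothetical records with four numeric fields.
records = [
    {"a": 0.0, "b": 0.0, "c": 1.0, "d": 10.0},
    {"a": 5.0, "b": 10.0, "c": 2.0, "d": 20.0},
    {"a": 10.0, "b": 20.0, "c": 3.0, "d": 30.0},
]
glyphs = map_to_glyphs(records, "a", "b", "c", "d")
```

Swapping which field drives which graphical attribute changes what the viewer can perceive most readily, which is why the choice of mapping matters.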

Most visualization techniques developed to date focus on the display of relatively small data sets (fewer than 10,000 records). However, data sets today commonly exceed millions or tens of millions of records. One important area of research is thus the development of database technology that lets interactive visual exploration scale to such large data repositories. First, the question must be addressed of how to model data sets larger than main memory using database technology (such as the relational model). Second, visual exploration operations (such as zoom, distortion, and similarity search) must be translated into query specifications that can be processed efficiently over such database representations. Third, interaction-driven optimization strategies are needed that exploit features of the interface to develop special-purpose database query optimization techniques, such as clustering and indexing. Lastly, dynamic techniques for information access and interaction that are aware of user patterns represent a critical technology for providing database support for visualization. Specific tasks may include effective memory management, speculative prefetching, and dynamic aggregation and approximation.
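
To make the idea of translating a visual operation into a query concrete, here is a minimal sketch using Python's standard sqlite3 module. The table name, the column names, the index, and the build_zoom_query helper are hypothetical. The sketch assumes a rectangular zoom region and trusted column names, since SQL identifiers cannot be passed as bound parameters.

```python
import sqlite3
from typing import Dict, List, Tuple


def build_zoom_query(table: str,
                     extents: Dict[str, Tuple[float, float]]) -> Tuple[str, List[float]]:
    """Translate a rectangular zoom region into a parameterized range query.

    `extents` maps a column name to the (low, high) bounds currently visible.
    Table and column names are assumed to come from trusted code, not from users.
    """
    clauses: List[str] = []
    params: List[float] = []
    for column, (lo, hi) in extents.items():
        clauses.append(f'"{column}" BETWEEN ? AND ?')
        params.extend([lo, hi])
    sql = f'SELECT * FROM "{table}"'
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# Hypothetical in-memory table standing in for a large disk-resident data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, x REAL, y REAL)")
conn.executemany("INSERT INTO records (x, y) VALUES (?, ?)",
                 [(0.1, 0.2), (0.5, 0.5), (0.9, 0.4), (0.55, 0.95)])
# An index on the zoomed dimensions lets the database avoid a full scan.
conn.execute("CREATE INDEX idx_records_xy ON records (x, y)")

# The user zooms to x in [0.4, 0.6] and y in [0.0, 0.6].
sql, params = build_zoom_query("records", {"x": (0.4, 0.6), "y": (0.0, 0.6)})
visible = conn.execute(sql, params).fetchall()
```

Only the records inside the zoom region are fetched, so the amount of data moved into main memory depends on the size of the view rather than on the size of the data set.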

Area References:

1. M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande, J. Myllymaki, and K. Wenger, DEVise: Integrated Querying and Visual Exploration of Large Datasets, Proc. 1997 ACM SIGMOD International Conference on Management of Data, May, 1997.
2. Cleveland, W., Visualizing Data, Hobart Press, Summit, NJ, 1993.
3. Nielson, G. M., Hagen, H., and Muller, M. (eds.), Scientific Visualization: Overviews, Methodologies, Techniques, IEEE Computer Society Press, Los Alamitos, CA, 1997.

Potential Related Projects:

Multivariate data visualization, Hierarchical data management and analysis, Caching and prefetching.

Matthew Ward
2000-02-29