Visualization Techniques for Data Mining
Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department
Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5671 (Ward), (508) 831-5815 (Rundensteiner)
Fax: (508) 831-5776
E-mail: [matt, rundenst]@cs.wpi.edu
Project Award Information:
Award Number: IIS-9732897
Duration: 9/01/1998 - 8/31/2001
Title: Hierarchical Visualization Techniques for Data Mining
Keywords: visualization, data mining, hierarchical data management
This project involves the development of interactive visualization and data management techniques for the exploration of very large multivariate data sets. The approach consists of extending several multivariate data visualization techniques currently implemented in an existing visualization tool (XmdvTool, developed at WPI) to support hierarchical views of the data, with support for focusing and drill-down using data-driven and structure-driven brushes. A variety of interactive tools for visual data exploration are being designed, including tools for navigating the data hierarchy, for switching smoothly between different views of the same data, and for highlighting phenomena in the data. To deliver these visual capabilities at interactive speeds over potentially huge data sets, we are also investigating the underlying data management issues, such as hierarchical data structures, query optimization algorithms, indexing techniques, and caching strategies.
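As a rough illustration of what a hierarchical view with drill-down can look like, the following minimal sketch builds a binary hierarchy over multivariate records and returns one aggregate per node at a chosen depth. It is not XmdvTool's algorithm: the median-split partitioning and the names `ClusterNode`, `build_hierarchy`, and `view_at_depth` are assumptions introduced only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Record = Tuple[float, ...]


@dataclass
class ClusterNode:
    """One node of a hypothetical hierarchy over multivariate records."""
    records: List[Record]
    children: List["ClusterNode"] = field(default_factory=list)

    def mean(self) -> Record:
        """Per-dimension mean of all records summarized by this node."""
        n = len(self.records)
        dims = len(self.records[0])
        return tuple(sum(r[d] for r in self.records) / n for d in range(dims))


def build_hierarchy(records, min_size=2, dim=0):
    """Recursively split records at the median, cycling through dimensions.

    The median split is an assumption made for brevity; any hierarchical
    clustering method could be substituted.
    """
    min_size = max(1, min_size)  # guard against non-terminating splits
    node = ClusterNode(records=list(records))
    if len(records) <= min_size:
        return node
    ordered = sorted(records, key=lambda r: r[dim])
    mid = len(ordered) // 2
    next_dim = (dim + 1) % len(records[0])
    node.children = [
        build_hierarchy(ordered[:mid], min_size, next_dim),
        build_hierarchy(ordered[mid:], min_size, next_dim),
    ]
    return node


def view_at_depth(root, depth):
    """Return (mean, count) for each node at the given depth.

    Leaves shallower than the requested depth are reported as they are,
    so every record is always represented exactly once.
    """
    if depth == 0 or not root.children:
        return [(root.mean(), len(root.records))]
    aggregates = []
    for child in root.children:
        aggregates.extend(view_at_depth(child, depth - 1))
    return aggregates


data = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
root = build_hierarchy(data, min_size=1)
overview = view_at_depth(root, 0)   # one aggregate for the whole data set
drilled = view_at_depth(root, 1)    # two aggregates after one drill-down step
```

Increasing the depth corresponds to drilling down: the overview shows a single summary, while deeper levels trade screen clutter for detail.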
Publications and Products:
This project has contributed to the educational development of five graduate students to date, two of whom are female. The software under development has been presented to students in a graduate course (CS563, Advanced Topics in Graphics). Hierarchical data management techniques, another focus of this project, have been presented in another graduate course (CS561, Advanced Database Systems). Furthermore, several research groups have begun experimenting with the newly released software for their large-scale visualization projects, including Oak Ridge National Labs, the University of Manitoba, the Maui High Performance Computing Center, the University of Maryland, and the Royal Institute of Technology in Stockholm.
Goals, Objectives, and Targeted Activities:
The stages of the project are as follows:
Links to publications regarding this project, all software developed, and case studies highlighting the utilization of the software on a variety of datasets can be found at the project web site http://davis.wpi.edu/~xmdv.
Visualization is the graphical presentation of data and information for the purposes of communicating results, verifying hypotheses, and qualitative exploration. It has long been a standard tool to assist statistical and scientific analysis and is becoming an increasingly important component in database and data mining activities, both for its ability to provide rich overviews and to permit users to rapidly detect patterns and outliers. The process of visualizing data consists of mapping selected data fields to specific graphical components or their attributes, such as position, size, or color, in such a way that data features of interest may be readily perceived, classified, and measured by the user.
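The mapping step described above can be sketched in a few lines. The example below assumes a simple linear scaling of each data field onto the range of a visual attribute; the function names, field names, and output ranges are hypothetical and chosen only for illustration.

```python
def linear_scale(value, lo, hi, out_lo, out_hi):
    """Map value from the data range [lo, hi] onto [out_lo, out_hi]."""
    if hi == lo:
        # Degenerate data range: place everything at the midpoint.
        return (out_lo + out_hi) / 2.0
    t = (value - lo) / (hi - lo)
    return out_lo + t * (out_hi - out_lo)


def map_to_glyphs(records, mapping, ranges):
    """Turn data records into glyph descriptions.

    records: list of dicts, one per data record
    mapping: visual attribute name -> data field name
    ranges:  visual attribute name -> (min, max) of that attribute
    """
    # Compute the data extent of every field that is actually displayed.
    extents = {
        fld: (min(r[fld] for r in records), max(r[fld] for r in records))
        for fld in mapping.values()
    }
    glyphs = []
    for r in records:
        glyph = {}
        for attr, fld in mapping.items():
            lo, hi = extents[fld]
            out_lo, out_hi = ranges[attr]
            glyph[attr] = linear_scale(r[fld], lo, hi, out_lo, out_hi)
        glyphs.append(glyph)
    return glyphs


# Hypothetical records: weight drives x, mpg drives y, horsepower drives size.
cars = [
    {"mpg": 10, "weight": 2000, "hp": 50},
    {"mpg": 30, "weight": 4000, "hp": 150},
]
glyphs = map_to_glyphs(
    cars,
    mapping={"x": "weight", "y": "mpg", "size": "hp"},
    ranges={"x": (0, 400), "y": (0, 300), "size": (2, 12)},
)
```

Changing the `mapping` dictionary is all it takes to assign a different field to position, size, or any other attribute, which is one reason this stage is usually kept separate from rendering.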
Most visualization techniques developed to date focus on the display of relatively small data sets (fewer than 10,000 records). However, data sets today commonly exceed millions or tens of millions of records. One important area of research is thus the development of database technology that makes interactive visual exploration of such large data repositories scalable. The first question to be addressed is how to model data sets larger than main memory using database technology (such as the relational model). The second is how to translate different visual exploration operations (such as zooming, distortion, and similarity search) into query specifications that can be processed efficiently over such database representations. Third, interaction-driven optimization strategies are needed that exploit the features of the interface to develop special-purpose database query optimization techniques, such as clustering and indexing. Lastly, dynamic techniques for information access and interaction that are aware of user patterns represent a critical technology for providing database support for visualization. Specific tasks may include effective memory management, speculative prefetching, and dynamic aggregation and approximation.
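To make the idea of translating a visual operation into a query concrete, here is a minimal sketch using Python's standard `sqlite3` module. It assumes a zoom is expressed as a rectangular viewport and that the data live in a relational table; the table `points`, its columns, and the helper `zoom_to_query` are hypothetical names for this example, not part of any system described here.

```python
import sqlite3


def zoom_to_query(table, x_col, y_col, viewport):
    """Translate a zoom rectangle into a parameterized SQL range query.

    viewport is (x_min, x_max, y_min, y_max). Table and column names are
    interpolated directly, so they must come from trusted code, not users.
    """
    x_min, x_max, y_min, y_max = viewport
    sql = (
        f"SELECT {x_col}, {y_col} FROM {table} "
        f"WHERE {x_col} BETWEEN ? AND ? AND {y_col} BETWEEN ? AND ? "
        f"ORDER BY {x_col}"
    )
    return sql, (x_min, x_max, y_min, y_max)


# Small in-memory stand-in for a data set that would not fit in memory.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)", [(i, i * i) for i in range(10)]
)
# An index on the zoomed column lets the database avoid a full table scan.
conn.execute("CREATE INDEX idx_points_x ON points (x)")

# The user zooms into the region 2 <= x <= 5, 0 <= y <= 20.
sql, params = zoom_to_query("points", "x", "y", (2, 5, 0, 20))
rows = conn.execute(sql, params).fetchall()
```

Only the records inside the viewport are fetched, so the cost of a zoom depends on the size of the visible region rather than on the size of the whole table. A prefetching strategy could issue the same kind of query for a slightly enlarged viewport in anticipation of the next pan or zoom.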
Potential Related Projects:
Multivariate data visualization, Hierarchical data management and analysis, Caching and prefetching.