Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department, Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5671 (Ward), (508) 831-5815 (Rundensteiner)
Fax: (508) 831-5776
E-mail: [matt, rundenst]@cs.wpi.edu
Project Award Information:
Award Number: IIS-0119276
Duration: 9/01/2001 - 8/31/2004
Title: Order, Spacing, and Clustering in Visual Exploration of Large Scale Data
Keywords: visualization, data mining, hierarchical data management, high dimensional data analysis
This project involves the development of interactive visualization and data management techniques for the exploration of data sets characterized by very high dimensionality and data type heterogeneity. This will be accomplished by applying multi-resolution strategies across the dimensions of a data set as well as within individual dimensions containing nominal or categorical values. For visualization, the tasks will involve the design and development of methods for determining good ordering, spacing, and clustering of attributes and dimensions, and augmenting several existing multivariate visualization methods to allow variable spacing and resolution in each space (inter-attribute, intra-attribute, inter-record). We also plan to develop ordering and spacing schemes to emphasize strong correlations within data sets, either between dimensions or between individual records. For interaction, the goal will be the investigation, development, and assessment of tools for intuitive navigation and view modification within the three spaces. Interactive, user-guided reclustering tools will be developed to split and group data and dimensions based on user observations, thus allowing users input into the process of locating the most important features of high-dimensional complex data. Finally, for data management, the tasks will involve research into high-dimensional indexing, multi-resolution data view management, query processing and optimization, as well as caching and prefetching strategies to enable efficient exploration of large, complex data repositories.
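As one illustration of an ordering scheme that emphasizes correlations between dimensions, the following minimal Python sketch greedily arranges dimensions so that strongly correlated ones end up adjacent. It is not the project's algorithm: the function names `pearson` and `order_dimensions`, the choice of absolute Pearson correlation as the similarity measure, and the greedy nearest-neighbor strategy are all assumptions made for illustration, and the sketch handles numeric dimensions only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant dimension carries no correlation information
    return cov / sqrt(var_x * var_y)


def order_dimensions(columns):
    """Greedy ordering (hypothetical helper): columns maps name -> values.

    Starts from the most strongly correlated pair, then repeatedly appends
    the remaining dimension most correlated with the last one placed.
    """
    names = list(columns)
    if len(names) < 2:
        return names
    # Similarity = absolute correlation, so strong negative links also count.
    sim = {
        (a, b): abs(pearson(columns[a], columns[b]))
        for a in names
        for b in names
        if a != b
    }
    first, second = max(sim, key=sim.get)
    order = [first, second]
    remaining = [n for n in names if n not in order]
    while remaining:
        last = order[-1]
        nxt = max(remaining, key=lambda n: sim[(last, n)])
        order.append(nxt)
        remaining.remove(nxt)
    return order


if __name__ == "__main__":
    data = {
        "a": [1, 2, 3, 4, 5],
        "noise": [3, 1, 4, 1, 5],
        "c": [2, 4, 6, 8, 10],
    }
    # "a" and "c" are perfectly correlated, so they are placed side by side.
    print(order_dimensions(data))
```

A greedy pass like this is cheap but not optimal. Finding the best ordering over all dimensions is a combinatorial problem, so a heuristic of some kind is a natural starting point when the number of dimensions is in the hundreds.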
Publications and Products:
This project currently contributes to the educational development of three graduate students, two of whom are female. The software under development has been presented to students in a graduate course (CS563, Advanced Topics in Graphics). Hierarchical data management techniques, another focus of this project, have been presented in another graduate course (CS561, Advanced Database Systems). Furthermore, several research groups have downloaded and are working with current and past releases of the software package, which is available, including source code, in the public domain.
Goals, Objectives, and Targeted Activities:
The stages of the project are as follows:
Links to publications regarding this project, all software developed, and case studies highlighting the utilization of the software on a variety of datasets can be found at the project web site.
Visualization is the graphical presentation of data and information for the purposes of communicating results, verifying hypotheses, and qualitative exploration. It has long been a standard tool to assist statistical and scientific analysis and is becoming an increasingly important component in database and data mining activities, both for its ability to provide rich overviews and to permit users to rapidly detect patterns and outliers. The process of visualizing data consists of mapping selected data fields to specific graphical components or their attributes, such as position, size, or color, in such a way that data features of interest may be readily perceived, classified, and measured by the user.
Most visualization techniques developed to date work best on data sets with a small number of dimensions (fewer than 20) containing only numeric data. However, data sets today commonly exceed hundreds of dimensions and often contain non-numeric fields. Visualizing these more complex data sets therefore poses many challenging problems.
In another vein, while database technology is quite sufficient to store large numbers of heterogeneous records with many fields, the operations most efficiently supported are those required for transaction management, and not necessarily those required for interactive exploration. Techniques such as multi-resolution indexing, semantic caching, and adaptive prefetching are essential to enable real-time access to the data of most interest to the particular user performing a specific analysis task.
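To make the idea of caching combined with prefetching concrete, here is a minimal Python sketch of a small least-recently-used cache that also guesses the next request from the user's last navigation step. It is only an illustration under stated assumptions: the class name `PrefetchingCache`, the integer page keys, the `fetch` callback standing in for a database query, and the "repeat the last step" prediction rule are all hypothetical and are not taken from the project's software.

```python
from collections import OrderedDict


class PrefetchingCache:
    """LRU cache over integer-keyed pages with one-step-ahead prefetching."""

    def __init__(self, fetch, capacity=8):
        self.fetch = fetch          # assumed callback: key -> page of data
        self.capacity = capacity
        self.pages = OrderedDict()  # oldest entry first, newest last
        self.last_key = None
        self.hits = 0
        self.misses = 0

    def _store(self, key, value):
        self.pages[key] = value
        self.pages.move_to_end(key)
        while len(self.pages) > self.capacity:
            self.pages.popitem(last=False)  # evict least recently used

    def get(self, key):
        if key in self.pages:
            self.hits += 1
            self.pages.move_to_end(key)
            value = self.pages[key]
        else:
            self.misses += 1
            value = self.fetch(key)
            self._store(key, value)
        # Prediction rule (an assumption): the user keeps moving by the
        # same step, so load the page one step further along in advance.
        if self.last_key is not None and key != self.last_key:
            predicted = key + (key - self.last_key)
            if predicted not in self.pages:
                self._store(predicted, self.fetch(predicted))
        self.last_key = key
        return value


if __name__ == "__main__":
    cache = PrefetchingCache(fetch=lambda k: k * 10, capacity=4)
    for page in (0, 1, 2, 3):
        cache.get(page)
    # Pages 2 and 3 were prefetched before they were requested.
    print(cache.hits, cache.misses)
```

In a real system the prefetch would run in the background rather than inside `get`, and the prediction would draw on richer navigation history. The sketch only shows how a steady browsing pattern turns later requests into cache hits.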
Potential Related Projects:
Multivariate data visualization, Hierarchical data management and analysis, Caching and prefetching.