Order, Spacing, and Clustering in Visual Exploration
of Large Scale Data
Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department
Worcester Polytechnic Institute
Contact Information:
Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department, Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5671 (Ward), (508) 831-5815 (Rundensteiner)
Fax: (508) 831-5776
E-mail: [matt, rundenst]@cs.wpi.edu
URL: http://www.cs.wpi.edu/~[matt, rundenst]
WWW Page:
http://davis.wpi.edu/~xmdv
Project Award Information:
Award Number: IIS-0119276
Duration: 9/01/2001 - 8/31/2004
Title: Order, Spacing, and Clustering in Visual Exploration of Large Scale Data
Keywords:
visualization, data mining, hierarchical data management, high dimensional
data analysis
Project Summary:
This project involves the development of interactive visualization
and data management techniques for the exploration of data sets characterized
by very high dimensionality and data type heterogeneity. This will be
accomplished by applying multi-resolution strategies across the
dimensions of a data set as well as within individual dimensions
containing nominal or categorical values. For visualization, the tasks will
involve the design and development of methods for determining good
ordering, spacing, and clustering of attributes and dimensions, and
augmenting several existing multivariate visualization methods to
allow variable spacing and resolution in each space (inter-attribute,
intra-attribute, inter-record). We also plan to develop ordering and
spacing schemes to emphasize strong correlations within data sets,
either between dimensions or between individual records. For
interaction, the goal will be the investigation, development, and
assessment of tools for intuitive navigation and view modification
within the three spaces. Interactive, user-guided reclustering tools
will be developed to split and group data and dimensions based on user
observations, thus giving users input into the process of locating
the most important features of high-dimensional complex data. Finally,
for data management, the tasks will involve research into high-dimensional
indexing, multi-resolution data view management, query processing and
optimization, as well as caching and prefetching strategies to enable
efficient exploration of large, complex data repositories.
Publications and Products:
- Y. Fua, E. A. Rundensteiner, and M. O. Ward, ``Hierarchical Parallel
Coordinates for Visualizing Large Multivariate Data Sets,''
Proc. Visualization '99, pp. 43-50, October, 1999.
- Y. Fua, M. O. Ward, and E. A. Rundensteiner, ``Structure-based brushes:
a mechanism for navigating hierarchically organized data and information
spaces,'' IEEE Trans. Visualization and Computer Graphics, Vol. 6, No. 2,
pp. 150-159, April, 2000.
- M. Ward,
``A taxonomy of glyph placement strategies for multidimensional data
visualization,'' Information Visualization, Vol. 1, pp. 194-210, 2002.
- J. Yang, M. Ward, and E. Rundensteiner, ``Hierarchical exploration of
large multivariate data sets,'' in
Data Visualization: the State of the Art, (F. Post, ed.), 2002.
- J. Yang, M. Ward, and E. Rundensteiner, ``InterRing: an interactive tool
for visually navigating and manipulating hierarchical structures,''
Proc. Information Visualization, pp. 77-84, October, 2002.
- J. Yang, M. Ward, E. Rundensteiner, and S. Huang, ``Visual hierarchical
dimension reduction for exploration of high dimensional data sets,''
Proc. Eurographics Visualization Symposium (VisSym '03), pp. 19-28, 2003.
- P. Doshi, E. Rundensteiner, and M. Ward,
``Prefetching for visual data exploration,''
Proc. Database Systems for Advanced Applications '03, March, 2003.
- P. Doshi, G. Rosario, E. Rundensteiner, and M. Ward, ``A Strategy Selection
Framework for Adaptive Prefetching in Data Visualization,'' 15th International
Conference on Scientific and Statistical Database Management
(SSDBM 2003), pp. 107-116, July, 2003.
- J. Yang, W. Peng, M. Ward, and E. Rundensteiner,
``Interactive hierarchical dimension ordering, spacing, and filtering for
exploration of high dimensional data sets,''
Proc. InfoVis '03, pp. 105-112, October, 2003.
- G. Rosario, E. Rundensteiner, D. Brown, and M. Ward,
``Mapping nominal values to numbers for effective visualization,''
Proc. InfoVis '03, pp. 113-120, October, 2003.
- J. Yang, M. Ward, and E. Rundensteiner, ``Interactive hierarchical
displays: a general framework for visualization and exploration of large
multivariate data sets,'' Computers and Graphics, Vol. 27, No. 2, pp. 265-283,
2003.
- M. Ward, W. Peng, and X. Wang,
``Hierarchical visual data mining for large-scale data,''
Computational Statistics, Vol. 19, pp. 147-158, 2004.
- M. Ward,
``Finding needles in large-scale multivariate data haystacks,''
IEEE Computer Graphics and Applications, Vol. 24, No. 5, pp. 16-19, 2004.
- M. Ward and J. Yang,
``Interaction spaces in data and information visualization,''
Proc. Joint Eurographics - IEEE TCVG Symposium on Visualization, May, 2004.
- G. Rosario, M. Ward, E. Rundensteiner, D. Brown, and S. Huang,
``Mapping nominal values to numbers for effective visualization,''
Information Visualization (in press), 2004.
- W. Peng, M. Ward, and E. Rundensteiner,
``Clutter reduction in multi-dimensional data visualization using dimension
reordering,''
IEEE Symposium on Information Visualization (accepted for publication), 2004.
- J. Yang, A. Patro, S. Huang, N. Mehta, M. Ward, and E. Rundensteiner,
``Value and relation display for interactive exploration of high dimensional
datasets,''
IEEE Symposium on Information Visualization (accepted for publication), 2004.
- XmdvTool 6.0 was released to the public domain (Fall, 2003), with support for
hierarchical versions of all visualization tools, distortion techniques in
screen, data, and structure space for focus + context exploration, and
optimized indexing and querying with the Oracle back-end. The next release,
planned for Fall, 2004, will include support for nominal variables, a new
class of visualization techniques (pixel-oriented), and extensive support
for management and analysis of data dimensions.
Project Impact:
This project has contributed to the educational development
of seven graduate students, three of whom are female.
The software under development has been presented to students in a
graduate course (CS563, Advanced Topics in Graphics). Hierarchical data
management techniques, another focus of this project, have been
presented in another graduate course (CS561, Advanced Database
Systems). Furthermore, several research groups have downloaded and are
working with current and past releases of the software package, which is
available in the public domain, including source code.
Goals, Objectives, and Targeted Activities:
The stages of the project are as follows:
- Investigate different measures of correlation, distance, and similarity
between dimensions of a data set in order to more compactly and
appropriately convey the relevant information content to the user. Incorporate selected measures into
algorithms for clustering, ordering, and spacing data dimensions.
- Investigate different measures of distance and correlation
between values within one nominal data dimension, and integrate selected measures into algorithms for
clustering and ordering of elements in all non-numeric data fields.
- Investigate and develop methods to enhance multivariate visualizations
to incorporate automated ordering and spacing of dimensions.
- Investigate and develop methods to graphically depict meta-dimensions
(clusters of dimensions) and enhance visualization tools to include this
functionality.
- Design and develop techniques to visually depict clusters of nominal
variable values, and implement them, along with methods for variable spacing
and ordering, within XmdvTool.
- Design interactive tools for effective and intuitive exploration of
data that has been hierarchically structured within a dimension (for
nominals), between dimensions, and between data records.
- Design and develop interactive tools for user-guided reclustering,
ordering, and spacing of hierarchically structured information, and apply
them to the three hierarchical structures.
- Develop the database management infrastructure needed
to support rapid restructuring and hierarchical navigation of data sets
containing large numbers of records, large numbers of variables, and
nominal attributes with potentially significant implicit relationships.
- Evaluate all of the above as appropriate via a combination of computational
benchmarking, performance studies, usability testing, and
domain-specific case studies.
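As an illustration of the first and third stages above, the following is a minimal sketch of correlation-based ordering and spacing of dimensions. It is not the XmdvTool implementation: the choice of Pearson correlation as the similarity measure, the greedy nearest-neighbor ordering, and all function names and sample data are assumptions made for this example only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # constant column: treat as uncorrelated (an assumption)
    return cov / sqrt(vx * vy)

def order_dimensions(columns):
    """Greedy ordering: start from the first dimension and repeatedly append
    the unplaced dimension with the highest |r| to the last placed one."""
    names = list(columns)
    order = [names[0]]
    remaining = set(names[1:])
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining),
                   key=lambda d: abs(pearson(columns[last], columns[d])))
        order.append(best)
        remaining.remove(best)
    return order

def spacing(columns, order):
    """Gap between adjacent axes, proportional to dissimilarity 1 - |r|."""
    return [1.0 - abs(pearson(columns[a], columns[b]))
            for a, b in zip(order, order[1:])]

# Hypothetical data set: four dimensions, five records each.
data = {
    "a": [1, 2, 3, 4, 5],
    "b": [5, 1, 4, 2, 3],
    "c": [2, 4, 6, 8, 10],
    "d": [5, 4, 3, 1, 2],
}
order = order_dimensions(data)
gaps = spacing(data, order)
print(order)
print([round(g, 3) for g in gaps])
```

In this sketch strongly correlated dimensions end up adjacent and close together, so the display emphasizes them. A greedy pass is only one heuristic; other similarity measures or ordering strategies could be substituted without changing the overall structure.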
Project References:
Links to publications regarding this project, all software developed, and case studies
highlighting the utilization of the software on a variety of datasets can be found at
the project web site, http://davis.wpi.edu/~xmdv.
Area Background:
Visualization is the graphical presentation of data and information for the
purposes of communicating results, verifying hypotheses, and qualitative
exploration. It has long been a standard tool to assist statistical and
scientific analysis and is becoming an increasingly important component in database
and data mining activities, both for its ability to provide rich overviews
and to permit users to rapidly detect patterns and outliers.
The process of visualizing data consists of mapping selected data
fields to specific graphical components or their attributes, such as
position, size, or color, in such a way that data features of interest may
be readily perceived, classified, and measured by the user.
Most visualization techniques developed to date work most effectively with
data sets with small numbers of dimensions (fewer than 20) containing only
numeric data.
However, data sets today commonly exceed hundreds of dimensions, and often
contain non-numeric fields. Clearly, there are many challenging problems
in visualizing more complex data sets.
In another vein, while database technology is quite sufficient to store
large numbers of heterogeneous records with many fields, the operations
most efficiently supported are those required for transaction management,
and not necessarily the interactive exploration process. Techniques such
as multi-resolution indexing, semantic caching, and adaptive prefetching
are essential to enable real-time access to the data of most interest to
the particular user performing a specific analysis task.
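To make the idea of prefetching for interactive exploration concrete, here is a minimal sketch of a block cache that guesses the next request from the direction of the user's last move. It is an illustration only and not the strategy used in this project: the class name `PrefetchingCache`, the integer block ids, the `fetch` loader, and the LRU eviction policy are all assumptions.

```python
from collections import OrderedDict

class PrefetchingCache:
    """LRU cache over integer-indexed data blocks with direction-based prefetch."""

    def __init__(self, fetch, capacity=4):
        self.fetch = fetch            # hypothetical loader: block id -> data
        self.capacity = capacity
        self.blocks = OrderedDict()   # block id -> data, least recently used first
        self.last = None              # previously requested block id
        self.hits = 0
        self.misses = 0

    def _load(self, block_id):
        """Return a block, fetching it and evicting the LRU block if needed."""
        if block_id in self.blocks:
            self.blocks.move_to_end(block_id)
            return self.blocks[block_id]
        data = self.fetch(block_id)
        self.blocks[block_id] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
        return data

    def get(self, block_id):
        """Serve a user request, then prefetch one block in the same direction."""
        if block_id in self.blocks:
            self.hits += 1
        else:
            self.misses += 1
        data = self._load(block_id)
        if self.last is not None and block_id != self.last:
            step = 1 if block_id > self.last else -1
            self._load(block_id + step)  # extrapolate the last move
        self.last = block_id
        return data

# Hypothetical usage: a user pans steadily through blocks 0..3.
cache = PrefetchingCache(fetch=lambda i: f"block-{i}", capacity=4)
for i in [0, 1, 2, 3]:
    cache.get(i)
print(cache.hits, cache.misses, list(cache.blocks))
```

Once a direction is established, each later request in that direction is served from the cache. An adaptive scheme would choose among several such predictors based on observed user behavior; this sketch hard-codes one.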
Area References:
1. M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande,
J. Myllymaki, and K. Wenger, ``DEVise: Integrated Querying and Visual
Exploration of Large Datasets,'' Proc. 1997 ACM SIGMOD International
Conference on Management of Data, May, 1997.
2. W. Cleveland, Visualizing Data, Hobart Press, Summit, NJ, 1993.
3. G. M. Nielson, H. Hagen, and M. Muller (eds.), Scientific Visualization:
Overviews, Methodologies, Techniques, IEEE Computer Society Press,
Los Alamitos, CA, 1997.
Potential Related Projects:
Multivariate data visualization,
Hierarchical data management and analysis,
Caching and prefetching.
Matthew Ward
2004-09-10