Order, Spacing, and Clustering in Visual Exploration
of Large Scale Data
Matthew O. Ward and Elke A. Rundensteiner
Computer Science Department
Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5671 (Ward), (508) 831-5815 (Rundensteiner)
Fax: (508) 831-5776
E-mail: [matt, rundenst]@cs.wpi.edu
URL: http://www.cs.wpi.edu/~[matt, rundenst]
Project Award Information:
Award Number: IIS-0119276
Duration: 9/01/2001 - 8/31/2004
Title: Order, Spacing, and Clustering in Visual Exploration of Large Scale Data
Visualization, data mining, hierarchical data management, high dimensional
This project involves the development of interactive visualization and
data management techniques for the exploration of data sets characterized
by very high dimensionality and data type heterogeneity. This will be accomplished
by applying multi-resolution strategies across the dimensions of a data
set as well as within individual dimensions containing nominal or categorical
values. For visualization, the tasks will involve the design and development
of methods for determining good ordering, spacing, and clustering of attributes
and dimensions, and augmenting several existing multivariate visualization
methods to allow variable spacing and resolution in each space (inter-attribute,
intra-attribute, inter-record). We also plan to develop ordering and spacing
schemes to emphasize strong correlations within data sets, either between
dimensions or between individual records. For interaction, the goal will
be the investigation, development, and assessment of tools for intuitive
navigation and view modification within the three spaces. Interactive,
user-guided reclustering tools will be developed to split and group data
and dimensions based on user observations, thus giving users input into
the process of locating the most important features of high-dimensional
complex data. Finally, for data management, the tasks will involve research
into high-dimensional indexing, multi-resolution data view management,
query processing and optimization, as well as caching and prefetching strategies
to enable efficient exploration of large, complex data repositories.
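As a rough illustration of the ordering and spacing idea described above, the following minimal sketch orders dimensions so that strongly correlated ones end up adjacent, and derives a gap between neighboring axes from their correlation. It is not the project's algorithm: the choice of Pearson correlation, the greedy nearest-neighbor strategy, the `1 - |r|` gap formula, and the names `pearson`, `order_dimensions`, and `axis_gaps` are all assumptions made for this example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def order_dimensions(columns):
    """Greedy ordering (an assumed heuristic): start from the first dimension,
    then repeatedly append the unplaced dimension whose absolute correlation
    with the most recently placed dimension is largest."""
    names = list(columns)
    order = [names[0]]
    remaining = names[1:]
    while remaining:
        last = order[-1]
        best = max(remaining,
                   key=lambda d: abs(pearson(columns[last], columns[d])))
        order.append(best)
        remaining.remove(best)
    return order


def axis_gaps(columns, order):
    """Gap between adjacent axes, 1 - |r|: correlated axes sit closer together."""
    return [1.0 - abs(pearson(columns[a], columns[b]))
            for a, b in zip(order, order[1:])]


# Hypothetical toy data set with four numeric dimensions.
columns = {
    "x": [1, 2, 3, 4, 5],
    "noise": [3, 5, 1, 4, 2],
    "y": [1, 2, 3, 5, 4],
    "z": [2, 1, 4, 3, 5],
}
order = order_dimensions(columns)   # ['x', 'y', 'z', 'noise']
gaps = axis_gaps(columns, order)    # one gap per adjacent pair of axes
```

A greedy pass like this is cheap but can miss the best overall ordering; it is shown only because it makes the link between a similarity measure and the resulting layout easy to see.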
Publications and Products:
Y. Fua, E. A. Rundensteiner, and M. O. Ward, ``Hierarchical Parallel Coordinates
for Visualizing Large Multivariate Data Sets,'' Proc. Visualization '99,
pp. 43-50, October, 1999.
Y. Fua, M. O. Ward, and E. A. Rundensteiner, ``Structure-based brushes:
a mechanism for navigating hierarchically organized data and information
spaces,'' IEEE Trans. Visualization and Computer Graphics, Vol. 6, No.
2, pp. 150-159, April, 2000.
M. Ward, ``A taxonomy of glyph placement strategies for multidimensional
data visualization,'' Information Visualization, Vol. 1, pp. 194-210, 2002.
J. Yang, M. Ward, and E. Rundensteiner, ``Hierarchical exploration of large
multivariate data sets,'' in Data Visualization: the State of the Art,
(F. Post, ed.), 2002.
J. Yang, M. Ward, and E. Rundensteiner, ``InterRing: an interactive tool
for visually navigating and manipulating hierarchical structures,'' Proc.
Information Visualization, pp. 77-84, October, 2002.
J. Yang, M. Ward, E. Rundensteiner, and S. Huang, ``Visual hierarchical
dimension reduction for exploration of high dimensional data sets,'' Proc.
Eurographics Visualization Symposium (VisSym '03), pp. 19-28, 2003.
P. Doshi, E. Rundensteiner, and M. Ward, ``Prefetching for visual data
exploration,'' Proc. Database Systems for Advanced Applications '03, March,
P. Doshi, G. Rosario, E. Rundensteiner, and M. Ward, ``A strategy selection
framework for adaptive prefetching in data visualization,'' Proc. 15th International
Conference on Scientific and Statistical Database Management (SSDBM 2003),
pp. 107-116, July, 2003.
J. Yang, W. Peng, M. Ward, and E. Rundensteiner, ``Interactive hierarchical
dimension ordering, spacing, and filtering for exploration of high dimensional
data sets,'' Proc. InfoVis '03, pp. 105-112, October, 2003.
G. Rosario, E. Rundensteiner, D. Brown, and M. Ward, ``Mapping nominal
values to numbers for effective visualization,'' Proc. InfoVis '03, pp.
113-120, October, 2003.
J. Yang, M. Ward, and E. Rundensteiner, ``Interactive hierarchical displays:
a general framework for visualization and exploration of large multivariate
data sets,'' Computers and Graphics, Vol. 27, No. 2, pp. 265-283, 2003.
M. Ward, W. Peng, and X. Wang, ``Hierarchical visual data mining for large-scale
data,'' Computational Statistics, Vol. 19, pp. 147-158, 2004.
M. Ward, ``Finding needles in large-scale multivariate data haystacks,''
IEEE Computer Graphics and Applications, Vol. 24, No. 5, pp. 16-19, 2004.
M. Ward and J. Yang, ``Interaction spaces in data and information visualization,''
Proc. Joint Eurographics - IEEE TCVG Symposium on Visualization, May, 2004.
G. Rosario, M. Ward, E. Rundensteiner, D. Brown, and S. Huang, ``Mapping
nominal values to numbers for effective visualization,'' Information Visualization
(in press), 2004.
W. Peng, M. Ward, and E. Rundensteiner, ``Clutter reduction in multi-dimensional
data visualization using dimension reordering,'' IEEE Symposium on Information
Visualization (accepted for publication), 2004.
J. Yang, A. Patro, S. Huang, N. Mehta, M. Ward, and E. Rundensteiner, ``Value
and relation display for interactive exploration of high dimensional datasets,''
IEEE Symposium on Information Visualization (accepted for publication),
XmdvTool 6.0 released to the public domain (Fall, 2003), with support for
hierarchical versions of all visualization tools, distortion techniques
in screen, data, and structure space for focus + context exploration, and
optimized indexing and querying with the Oracle back-end. The next release
is planned for Fall, 2004 and will include support for nominal variables, a new
class of visualization techniques (pixel-oriented), and extensive support
for management and analysis of data dimensions.
This project has contributed to the educational development of seven
graduate students, three of whom are female. The software under development
has been presented to students in a graduate course (CS563, Advanced Topics
in Graphics). Hierarchical data management techniques have been presented
in another graduate course (CS561, Advanced Database Systems). Furthermore,
several research groups have downloaded and are working with current and
past releases of the software package, which is available in the public
domain, including source code.
Goals, Objectives, and Targeted Activities:
The stages of the project are as follows:
Investigate different measures of correlation, distance, and similarity
between dimensions of a data set in order to more compactly and appropriately
convey the relevant information content to the user. Incorporate selected
measures into algorithms for clustering, ordering, and spacing data dimensions.
Investigate different measures of distance and correlation between values
within one nominal data dimension, and integrate selected measures into
algorithms for clustering and ordering of elements in all non-numeric data
Investigate and develop methods to enhance multivariate visualizations
to incorporate automated ordering and spacing of dimensions.
Investigate and develop methods to graphically depict meta-dimensions (clusters
of dimensions) and enhance visualization tools to include this functionality.
Design and develop techniques to visually depict clusters of nominal variable
values, and implement, along with methods for variable spacing and ordering,
Design interactive tools for effective and intuitive exploration of data
that has been hierarchically structured within a dimension (for nominals),
between dimensions, and between data records.
Design and develop interactive tools for user-guided reclustering, ordering,
and spacing of hierarchically structured information, and apply them to
the three hierarchical structures.
Develop an XML model for management of meta-data about the domains of dimensions
and their mapping to display features, including appropriate tools for
producing such mappings as well as manipulating them.
Develop the database management infrastructure needed to support rapid
restructuring and hierarchical navigation of data sets containing large
numbers of records, large numbers of variables, and nominal attributes
with potentially significant implicit relationships.
Evaluate all the above as appropriate via a combination of computational
benchmarking, performance studies, usability testing, as well as domain-specific
Links to publications regarding this project, all software developed,
and case studies highlighting the utilization of the software on a variety
of datasets can be found at the project web site http://davis.wpi.edu/~xmdv.
Visualization is the graphical presentation of data and information
for the purposes of communicating results, verifying hypotheses, and qualitative
exploration. It has long been a standard tool to assist statistical and
scientific analysis and is becoming an increasingly important component
in database and data mining activities, both for its ability to provide
rich overviews and to permit users to rapidly detect patterns and outliers.
The process of visualizing data consists of mapping selected data fields
to specific graphical components or their attributes, such as position,
size, or color, in such a way that data features of interest may be readily
perceived, classified, and measured by the user.
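The mapping step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not XmdvTool code: the function names `normalize` and `map_to_glyphs`, the min-max scaling, the size range, and the sample records are all hypothetical.

```python
def normalize(values):
    """Min-max scale values to [0, 1]; a constant field maps to 0.5."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.5 if span == 0 else (v - lo) / span for v in values]


def map_to_glyphs(records, mapping, size_range=(2.0, 12.0)):
    """Map selected data fields to graphical attributes.

    records: list of dicts, one per data record.
    mapping: dict with keys 'x', 'y', 'size', 'gray', each naming a field.
    Returns one glyph description (a dict) per record.
    """
    scaled = {attr: normalize([r[field] for r in records])
              for attr, field in mapping.items()}
    s_min, s_max = size_range
    glyphs = []
    for i in range(len(records)):
        glyphs.append({
            "x": scaled["x"][i],                                  # position in [0, 1]
            "y": scaled["y"][i],                                  # position in [0, 1]
            "size": s_min + scaled["size"][i] * (s_max - s_min),  # marker size
            "gray": round(255 * scaled["gray"][i]),               # gray level 0..255
        })
    return glyphs


# Hypothetical records: each field is assigned to one graphical attribute.
records = [
    {"mpg": 10, "hp": 200, "weight": 4000, "year": 70},
    {"mpg": 30, "hp": 100, "weight": 2000, "year": 80},
    {"mpg": 20, "hp": 150, "weight": 3000, "year": 75},
]
glyphs = map_to_glyphs(
    records, {"x": "mpg", "y": "hp", "size": "weight", "gray": "year"})
```

Changing the `mapping` dictionary is all it takes to try a different assignment of fields to attributes, which is the kind of choice that determines whether a feature of interest is easy to perceive.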
Most visualization techniques developed to date work most effectively
with data sets with small numbers of dimensions (fewer than 20) containing
only numeric data. However, data sets today commonly exceed hundreds of
dimensions, and often contain non-numeric fields. Clearly, there are many
challenging problems in visualizing more complex data sets.
In another vein, while database technology is quite sufficient to store
large numbers of heterogeneous records with many fields, the operations
most efficiently supported are those required for transaction management,
and not necessarily the interactive exploration process. Techniques such
as multi-resolution indexing, semantic caching, and adaptive prefetching
are essential to enable real-time access to the data of most interest to
the particular user performing a specific analysis task.
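To make the caching and prefetching idea concrete, here is a minimal sketch of a cache that, on each request, also loads the items the user is likely to ask for next. It is an illustration only: the class name `PrefetchingCache`, the least-recently-used eviction policy, the "prefetch the children of the current node" strategy, and the toy hierarchy are assumptions for this example and are not taken from the project's system.

```python
from collections import OrderedDict


class PrefetchingCache:
    """LRU cache that prefetches likely follow-up requests (assumed strategy)."""

    def __init__(self, fetch, predict_next, capacity=4):
        self.fetch = fetch                # key -> data; stands in for a database query
        self.predict_next = predict_next  # key -> keys likely to be requested next
        self.capacity = capacity
        self.store = OrderedDict()        # cached items, least recently used first
        self.hits = 0
        self.misses = 0

    def _put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        while len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used item

    def get(self, key):
        if key in self.store:
            self.hits += 1
            self.store.move_to_end(key)
            value = self.store[key]
        else:
            self.misses += 1
            value = self.fetch(key)
            self._put(key, value)
        # Prefetch predicted requests. A real system would do this during
        # idle time; here it runs inline to keep the sketch short.
        for nxt in self.predict_next(key):
            if nxt not in self.store:
                self._put(nxt, self.fetch(nxt))
        return value


# Hypothetical cluster hierarchy: each node above the leaf level has two children.
def children(node):
    return [node + "/0", node + "/1"] if node.count("/") < 2 else []


def load_cluster(node):
    return "summary of " + node  # placeholder for an expensive query


cache = PrefetchingCache(load_cluster, children, capacity=4)
cache.get("root")       # miss; children of root are prefetched
cache.get("root/0")     # hit, because it was prefetched
cache.get("root/0/1")   # hit, because it was prefetched
```

In this drill-down sequence only the first request has to wait for the data source. The benefit depends entirely on how well `predict_next` matches the user's actual navigation, which is why selecting the prediction strategy adaptively matters.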
Potential Related Projects:
M. Livny, R. Ramakrishnan, K. Beyer, G. Chen, D. Donjerkovic, S. Lawande,
J. Myllymaki, and K. Wenger, DEVise: Integrated Querying and Visual Exploration
of Large Datasets. Proc. 1997 ACM SIGMOD International Conference on Management
of Data, May, 1997.
Cleveland, W., Visualizing Data, Hobart Press, Summit, NJ, 1993.
Nielson, G. M., Hagen, H., and Muller, M. (eds.), Scientific Visualization:
Overviews, Methodologies, Techniques, IEEE Computer Society Press, Los
Alamitos, CA, 1997.
Multivariate data visualization, Hierarchical data management and analysis,
Caching and prefetching.