We have been exchanging ideas in targeted meetings with several individuals
at Fidelity Corporation about the problems they face in detecting interesting
trends (nuggets) in their data, and in managing such knowledge over
time. This may lead to a more in-depth collaboration as the project
progresses, since we are developing technology that could potentially be
applied to some of their challenges. However, Fidelity Corporation appears
to be more interested in 'products' they can purchase than in longer-term
fundamental research projects, so no in-depth collaboration has emerged
from these interactions to date.
Activities and findings:
Research and Education Activities:
This project focuses on the creation, management, and exploration of discoveries
during exploratory data analysis. These discoveries, which we term 'nuggets',
can be clusters, anomalies, trends, associations, and other components
of the reasoning and decision-making process. There are five major tasks
in this research effort:
(1.) Nugget Modeling and Storage: Technology for the modeling and management
of nuggets, their complex interrelationships, and their supporting or
refuting relationships with the relevant data will be developed.
(2.) Nugget Discovery and Capture: Three methods for nugget
generation will be developed: explicit identification and confirmation
by the user, implicit capture based on analysis of user logs, and automated
discovery using statistical and data mining techniques.
(3.) Nugget Lifespan Management: Computational and interactive visual
methods to enable analysts to efficiently validate, annotate, classify,
organize, and purge nuggets will be devised.
(4.) Nugget-Supported Visual Exploration: Visual representations of
hypotheses, evidence, nuggets, and the data associated with them will
help analysts explore their data, manage their discoveries, and organize
their reasoning processes.
(5.) Assessment: Following user-centered design principles, we will
ensure that end-users participate in the design, development, and testing,
to help ensure that the resulting software tools are both useful and
usable to the targeted audiences.
During this first year of the grant we have focused on the following:
1. Nugget extraction via user-guided selection on data views. Using common
multivariate visualization techniques, analysts can interactively brush
over data subsets of interest to create nuggets. These can be annotated
to simplify later recall of appropriate nuggets and thus ease their use
in hypothesis formation and confirmation.
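To make the mechanism concrete, the following is a minimal sketch of brush-based nugget capture over a simple tabular dataset; the Nugget class and brush_to_nugget function are hypothetical names for illustration, not the XmdvTool API.

import numpy as np
from dataclasses import dataclass

@dataclass
class Nugget:
    """A brushed data subset plus an analyst-supplied annotation."""
    row_ids: np.ndarray   # indices of the brushed records
    ranges: dict          # per-dimension (low, high) brush bounds
    annotation: str = ""

def brush_to_nugget(data, ranges, annotation=""):
    """Turn an interactive brush (per-dimension value ranges) into a nugget."""
    mask = np.ones(len(data), dtype=bool)
    for dim, (lo, hi) in ranges.items():
        mask &= (data[:, dim] >= lo) & (data[:, dim] <= hi)
    return Nugget(np.flatnonzero(mask), ranges, annotation)

# Example: brush dimensions 0 and 2 of a 4-dimensional dataset.
data = np.random.rand(1000, 4)
n = brush_to_nugget(data, {0: (0.2, 0.5), 2: (0.7, 0.9)}, "candidate cluster")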
2. Nugget extraction via automated capture during data exploration.
Using common multivariate visualization techniques, analysts can interactively
explore data subsets of interest. Our system will then monitor their navigation
and data brushing activities, and will automatically create nuggets of
potential interest. Interest here is predicated on the number of times
an analyst re-examines a given data subset and the length of time spent
with that data. These automatically extracted nuggets can then again be annotated,
as above, to simplify recall of appropriate nuggets.
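The scoring heuristic can be sketched as follows; the linear combination of revisit count and dwell time, and the threshold, are illustrative assumptions rather than our tuned parameters.

from collections import defaultdict
import time

class InterestMonitor:
    """Tracks how often and how long an analyst examines each data subset."""
    def __init__(self, revisit_weight=1.0, dwell_weight=0.1):
        self.visits = defaultdict(int)     # subset key -> revisit count
        self.dwell = defaultdict(float)    # subset key -> seconds spent
        self.revisit_weight = revisit_weight
        self.dwell_weight = dwell_weight
        self._current = None
        self._entered = None

    def enter(self, subset_key):
        """Called when the analyst navigates to or brushes a subset."""
        self.leave()
        self._current, self._entered = subset_key, time.monotonic()
        self.visits[subset_key] += 1

    def leave(self):
        if self._current is not None:
            self.dwell[self._current] += time.monotonic() - self._entered
            self._current = None

    def interesting_subsets(self, threshold=5.0):
        """Subsets whose combined revisit/dwell score merits auto-capture."""
        score = lambda k: (self.revisit_weight * self.visits[k]
                           + self.dwell_weight * self.dwell[k])
        return [k for k in self.visits if score(k) >= threshold]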
3. Nugget extraction via selection on model views. We've created an initial
set of model-space visualizations based on linear regression, conveying
for every combination of model parameters how well the data fits the model.
Promising models are easily identified, and the parameters can be either
manually or automatically refined to increase the accuracy of the model.
Another variant on this approach is to represent multivariate clusters
as N-dimensional objects and combine the visualizations of the data with
that of the object shape characteristics.
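For the simple linear case, the model-space view can be thought of as an error surface over all (slope, intercept) combinations, as in this sketch; the grid bounds and the RMSE fit measure are illustrative assumptions.

import numpy as np

def model_space(x, y, slopes, intercepts):
    """Return an RMSE surface over every (slope, intercept) combination."""
    A, B = np.meshgrid(slopes, intercepts, indexing="ij")
    # residuals for every model in the grid against every data point
    pred = A[..., None] * x + B[..., None]           # shape: (|a|, |b|, n)
    return np.sqrt(((pred - y) ** 2).mean(axis=-1))  # lower = better fit

x = np.linspace(0, 1, 200)
y = 2.0 * x + 0.5 + np.random.normal(0, 0.05, x.shape)
errs = model_space(x, y, np.linspace(0, 4, 100), np.linspace(-1, 2, 100))
best = np.unravel_index(errs.argmin(), errs.shape)  # promising model stands out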
4. Nugget organization and management. We have developed methods for nugget
clustering, refinement, and pruning to help avoid overloading
the analyst with too many nuggets. To date, these methods have focused
on the simple nuggets extracted via automatic selection in the data space.
Such nugget cleanup is most essential within this context, because the
nuggets are continuously generated by our system - yet many of them tend
to be rather similar by design. Hence, consolidation of multiple similar
nuggets into one representative nugget is employed to reduce the nugget
space, keeping it manageable.
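A minimal sketch of the consolidation step follows, assuming nuggets are represented by the sets of row ids they cover; the Jaccard measure and the 0.8 threshold are illustrative stand-ins for our actual similarity functions.

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

def consolidate(nuggets, threshold=0.8):
    """Replace groups of near-duplicate nuggets with one representative."""
    representatives = []
    for n in sorted(nuggets, key=len, reverse=True):
        if all(jaccard(n, r) < threshold for r in representatives):
            representatives.append(n)   # novel enough to keep
    return representatives

# Example: three overlapping auto-generated nuggets collapse to two.
auto = [set(range(0, 100)), set(range(2, 102)), set(range(500, 600))]
print(len(consolidate(auto)))  # -> 2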
5. Evaluation Using Small User Study. To verify the usefulness of nuggets
for facilitating exploration, we have extended XMDVtool with the services
of nugget extraction, consolidation, and maintenance (NMS) to provide
nugget support during visual data exploration. We then conducted a small
preliminary user study with the goal of comparing users' efficiency and
accuracy when solving tasks with and without the help of NMS. Specifically,
in our study we randomly divided 12 users, all WPI students, into 4 groups
of 3 users each, and asked them to finish the same 5 knowledge discovery
tasks (each based on 3 real datasets), some groups with and some without
the support of NMS services. Our study confirmed that NMS can indeed improve
users' time efficiency when solving knowledge discovery tasks. Our preliminary
evaluation also shows that NMS enhances users' accuracy in finishing these
tasks. The details of this user study can be found in (Di Yang, MS Thesis).
6. Evaluation using Case Study. Furthermore, we have also begun to evaluate
our model-view technology for aiding the discovery of hypotheses (i.e.,
models) about the data. In particular, we observed that an analyst is
aided in detecting which among possibly many models may be the most suitable
fit for a given data set, including, when a single model cannot achieve
the desired match level, which subsets of the overall data set may best
be matched by which model.
During this second year of the grant we have focused on the following:
7. Nugget extraction via space partitioning. We decompose space into
bins or cells based on segmenting each dimension into some number of non-overlapping
regions. A given bin is either empty, predominantly one class, or some
mixture. We then merge bins that are adjacent in data space into hyperboxes
with a consistent label. This can accommodate arbitrarily shaped regions,
as several overlapping hyperboxes can have the same label.
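A simplified sketch of this partition-and-merge scheme follows; the bin count, the purity rule, and the use of bounding boxes of flood-filled components are illustrative simplifications of the actual algorithm.

import numpy as np
from collections import deque

def partition_and_merge(points, labels, bins=8):
    dims = points.shape[1]
    edges = [np.linspace(points[:, d].min(), points[:, d].max(), bins + 1)
             for d in range(dims)]
    cell_of = tuple(np.clip(np.digitize(points[:, d], edges[d]) - 1, 0, bins - 1)
                    for d in range(dims))
    cell_labels = {}
    for i, cell in enumerate(zip(*cell_of)):
        cell_labels.setdefault(cell, []).append(labels[i])
    # Keep only cells whose points all share one class (purity rule).
    pure = {c: ls[0] for c, ls in cell_labels.items() if len(set(ls)) == 1}

    hyperboxes, seen = [], set()
    for start, lab in pure.items():
        if start in seen:
            continue
        # Flood-fill axis-adjacent cells carrying the same class label.
        comp, queue = [], deque([start])
        seen.add(start)
        while queue:
            c = queue.popleft()
            comp.append(c)
            for d in range(dims):
                for step in (-1, 1):
                    nb = c[:d] + (c[d] + step,) + c[d + 1:]
                    if pure.get(nb) == lab and nb not in seen:
                        seen.add(nb)
                        queue.append(nb)
        lo = tuple(min(c[d] for c in comp) for d in range(dims))
        hi = tuple(max(c[d] for c in comp) for d in range(dims))
        hyperboxes.append((lab, lo, hi))  # one labeled hyperbox (in bin indices)
    return hyperboxes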
8. Visualization of nugget space at several levels of abstraction. Each
of the bins above can be visualized as an entity that has a location in
N dimensions as well as a class ID (unless it is a mixture, in which case
it is treated as unclassified).
as unclassified). Different layouts can be used to convey different relationships
between bins, including ordered by class, centered around a selected bin,
or using a spring layout based on representatives from each class. Nuggets
formed by combining adjacent bins of the same class can be shown using
a variant on star glyphs using an MDS layout. Pattern-specific
visualizations are used to convey characteristics of specific patterns
(e.g., cluster, association rules, user-selected visual patterns) derived
via different methods.
9. Linkages between abstraction levels. Using edge bundle techniques,
we link each entity at one level of abstraction with its components at
lower levels of abstraction as well as its parent pattern. This 4-level
connection between data and models allows users to explore data in a wide
range of ways, using a variety of techniques to extract patterns and/or models.
10. Visualizing classifier space. Using prototype-based classifiers,
we have studied ways of visualizing relationships between classifiers
as well as using existing classifiers to generate new ones that inherit
characteristics of the originals. This can result
in new classifiers without the need to return to the raw data, while at
the same time giving better performance on the dataset being analyzed.
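The flavor of this can be sketched with a nearest-prototype classifier; the blending scheme shown (pulling each prototype toward the closest same-class prototype of the other classifier) is one illustrative way to generate a new classifier without returning to the raw data, not necessarily the scheme we studied.

import numpy as np

class PrototypeClassifier:
    def __init__(self, prototypes, classes):
        self.prototypes = np.asarray(prototypes)  # (k, d) prototype vectors
        self.classes = np.asarray(classes)        # (k,) class labels

    def predict(self, X):
        d = np.linalg.norm(X[:, None, :] - self.prototypes[None, :, :], axis=-1)
        return self.classes[d.argmin(axis=1)]     # nearest prototype wins

def blend(c1, c2, alpha=0.5):
    """New classifier: each c1 prototype pulled toward the nearest
    same-class prototype of c2 by a factor alpha."""
    protos = []
    for p, cls in zip(c1.prototypes, c1.classes):
        same = c2.prototypes[c2.classes == cls]
        if len(same) == 0:
            protos.append(p)
            continue
        q = same[np.linalg.norm(same - p, axis=1).argmin()]
        protos.append((1 - alpha) * p + alpha * q)
    return PrototypeClassifier(np.array(protos), c1.classes.copy())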
11. Reusing results of association rule mining. We are studying ways
to ascertain relationships between association rules that can be used
to estimate relations on subsets of data without recomputing the association
rules. This approximation can lead to significant performance increases
without significant loss of accuracy.
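The core idea can be sketched as follows, assuming itemset supports have been precomputed per partition of the data; the estimate is exact when the queried subset is a union of partitions and approximate otherwise. Function names are hypothetical.

def estimate_support(itemset, partitions, selected):
    """partitions: {pid: {'size': n, 'support': {frozenset: count}}}
    selected: iterable of partition ids forming the user's subset."""
    count = sum(partitions[p]['support'].get(itemset, 0) for p in selected)
    size = sum(partitions[p]['size'] for p in selected)
    return count / size if size else 0.0

def estimate_confidence(antecedent, consequent, partitions, selected):
    """Confidence of antecedent -> consequent on the selected subset,
    derived from the precomputed per-partition supports."""
    sup_a = estimate_support(antecedent, partitions, selected)
    sup_ab = estimate_support(antecedent | consequent, partitions, selected)
    return sup_ab / sup_a if sup_a else 0.0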
During the third year of the grant we have focused on the following:
12. Nugget Discovery Process: We designed a visual subgroup mining system
supporting a closed-loop analysis that involves both data mining and visual
analysis in one coherent process. Users can perform data mining as
the first step for extracting patterns on multivariate data sets, and
then visually analyze the results and the corresponding data.
and the corresponding data. Inspired by the visual exploration, further
refinement of the data mining query leads to the next cycle of visual
exploration and analysis.
13. Nugget Modeling and Representation: We proposed a representation
of the mining results in an understandable form. In addition to storage
benefits, this representation is easy for analysts
to understand, and can be directly shown using common multivariate visualization
techniques.
14. A 4-level visual and structure model: Our structure model allows
users to explore the data space at four different levels of abstraction:
instances, cells, nuggets, and clusters. For each level of this nugget
space, we designed a view in which users are able to explore and select
items to visualize. In particular, the nugget-level mining results are
represented as regular hyper-box-shaped regions, which can be easily
understood and compactly stored. The connections between the different
layers are shown based on the user's cursor position. The layout strategies
help users make sense of the relationships between the extracted patterns.
15. We implemented the above techniques in an integrated system called
Nugget Browser within XmdvTool, our freeware multivariate data
visualization tool. Case studies suggest that our visualization techniques
are effective in discovering patterns in multivariate datasets. In the
coming year, we intend to continue this research by conducting a more
comprehensive user study to assess the usability of the technology, and
identify possible places for improvement.
16. Automated Nugget Extraction and Refinement Techniques. We tackle
the new problem of interactive mining of localized association rules.
That is, we provide an analyst the ability to select an arbitrary subset
of data and efficiently mine association rules specific to that subset.
For this, we designed a preprocess-once-query-many (POQM) approach. Namely,
the data set is first preprocessed in an off-line manner to extract relevant
features in a global fashion, and then the actual user query, customized
to a data subset, is processed at run-time by exploiting this preprocessed
knowledge.
16a. For this, a nugget store, called L-FIST, is designed that uses a
novel itemset-based data partitioning that enables compact storage of
the itemsets and the underlying data subsets. This store maintains precomputed
meta-data about rule-related properties of the data subsets.
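A heavily simplified sketch of what such a store can hold follows: each frequent itemset mapped to the ids of the transactions containing it, so a localized query intersects tid-sets instead of rescanning the data. The actual L-FIST layout, with its itemset-based partitioning, is considerably more compact.

from itertools import combinations

def build_store(transactions, min_count=2, max_len=3):
    """transactions: {tid: frozenset(items)} -> {itemset: set(tids)}."""
    store = {}
    for tid, items in transactions.items():
        for k in range(1, max_len + 1):
            for iset in combinations(sorted(items), k):
                store.setdefault(frozenset(iset), set()).add(tid)
    # Keep only itemsets frequent enough to be worth materializing.
    return {i: tids for i, tids in store.items() if len(tids) >= min_count}

def local_support(store, itemset, subset_tids):
    """Support of an itemset within a user-selected subset of transactions."""
    tids = store.get(frozenset(itemset), set())
    return len(tids & subset_tids) / len(subset_tids) if subset_tids else 0.0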
16b. Nugget Extraction Plan Modeling. We introduce an algebraic approach
towards modeling alternate strategies for localized association rule mining.
We define the mining tasks as a pipeline of algebraic operators and apply
the principles of query optimization towards tackling mining localized
rules. In addition to the straightforward solution of running a mining algorithm
over the user-chosen subset, we design five POQM plans for answering
a localized association rule mining query by leveraging this pre-computed
knowledge maintained in this off-line L-FIST nugget store.
16c. Nugget Extraction Optimizer. Our cost analysis demonstrates that
none of these plan types outperforms the others for all possible query
scenarios. Rather, the costs of these plans depend on several key factors
including the user-selected thresholds, the data subsets, and the L-FIST
store. We present our analytical evaluation of the execution costs for
each of the alternative plans based on cost models. A runtime query optimizer
is then also designed to select the fastest alternative to process a given
user request based on our cost models.
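The optimizer's selection logic reduces to evaluating each plan's cost model for the query at hand and running the cheapest, as in this sketch; the toy cost functions are illustrative, not our actual five plan models.

def choose_plan(plans, query):
    """plans: list of (name, cost_fn, exec_fn); cost_fn(query) -> estimate."""
    name, _, exec_fn = min(plans, key=lambda p: p[1](query))
    return name, exec_fn(query)

# Example with two toy plans whose relative cost depends on subset size:
plans = [
    ("rescan-subset", lambda q: 10.0 * q["subset_size"], lambda q: "mined"),
    ("reuse-store",   lambda q: 500.0 + 0.1 * q["subset_size"], lambda q: "reused"),
]
print(choose_plan(plans, {"subset_size": 20}))    # small subset: rescan wins
print(choose_plan(plans, {"subset_size": 5000}))  # large subset: reuse wins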
16d. Experimental Evaluation. An experimental evaluation using several
different data sets (IBM Quest and PUMSB dataset benchmarks) is conducted
to assess the relative benefits and effectiveness of each of the proposed
extraction methods. Initial guidelines for selecting among the methods
are established. Yet we expect to complete our evaluation using additional
data sets and experiments in the coming year to draw final conclusions.
17. Time-Series Patterns. We investigated the use of N-grams in the analysis
of time-series data, mapping each N-gram to a point in N-dimensional shape
space. We then used glyphs as a means of displaying each N-gram, and
PCA to lay out the glyphs, resulting in clusters and paths that made finding
patterns in the time-series much easier. A linked brushing mechanism
connected to a time-line view allows users to see distributions and evolutions
of patterns. If we treat each N-gram as a nugget (i.e., a nugget formed
based on time rather than data values), we can use some of the same types
of exploration techniques we have developed for other nugget types.
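A minimal sketch of the N-gram mapping follows: slide a length-N window over the series, treat each window as a point in N-dimensional shape space, and project with PCA for layout; the window length and the synthetic series are illustrative.

import numpy as np

def ngrams(series, n):
    """All overlapping length-n subsequences, shape (len-n+1, n)."""
    return np.lib.stride_tricks.sliding_window_view(series, n)

def pca_layout(points, k=2):
    """Project points onto their top-k principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T   # 2-D coordinates for the glyphs

series = np.sin(np.linspace(0, 20, 500)) + np.random.normal(0, 0.05, 500)
grams = ngrams(series, 8)
coords = pca_layout(grams)       # similar shapes cluster together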
Last but not least, in the no-cost extension year, we anticipate completing
the above tasks as well as exploring the following research directions.
One, we plan to look into the research question of localized neighborhood-driven
pattern mining. In particular, we propose to explore pointwise
visualization and exploration techniques for visual multivariate analysis.
The general idea is that any local pattern extracted using the neighborhood
around a focal point could be explored in a point-wise manner. That is,
each local pattern could be extracted based on a regression model and
the relationships between the focal point and its neighbors. Such a system
would enable an analyst to explore sensitivity information at individual
data points, while layout strategies applied to local patterns could reveal
which neighbors are of potential interest. Following the idea of subgroup
mining, we plan
to employ a statistical method to assign each local pattern an outlier
factor, so that users can quickly identify anomalous local patterns that
deviate from the global pattern. Users can also compare the local pattern
with the global pattern both visually and statistically. Appropriate
visualizations would need to be designed to integrate the local pattern
into the original attribute space so as to reveal the distribution of the data.
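Since this is planned work, the following is only a sketch of the intended computation: fit a local linear model around each point and flag points whose local slope deviates strongly from the global trend; the windowed neighborhood and the z-score outlier factor are assumptions, not a finalized design.

import numpy as np

def local_slopes(x, y, k=20):
    """Slope of a local linear fit in each point's neighborhood along x."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    slopes = np.empty(len(x))
    for i in range(len(x)):
        lo, hi = max(0, i - k // 2), min(len(x), i + k // 2 + 1)
        slopes[i] = np.polyfit(x[lo:hi], y[lo:hi], 1)[0]  # local trend
    return x, y, slopes

def outlier_factor(slopes):
    """z-score of each local slope against the distribution of all slopes."""
    return np.abs(slopes - slopes.mean()) / (slopes.std() + 1e-12)

x = np.random.rand(300)
y = 3 * x + np.where(x > 0.8, -6 * (x - 0.8), 0) + np.random.normal(0, 0.05, 300)
_, _, s = local_slopes(x, y)
flags = outlier_factor(s) > 2    # anomalous local patterns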
We plan to also finalize our design for the hypothesis view, which allows
analysts to organize (manually and semi-automatically) the nuggets relevant
to a particular hypothesis.
Finally, while we have performed evaluations on each of the components
of the system as they have been developed, we need to continue this process.
In particular, we have yet to perform the expert evaluations that we
had planned, due to some delays in recruiting appropriate domain experts.
Training and Development:
In year 1, three Ph.D. students were supported in part under this grant.
Di Yang was supported partially in the first year, focusing on the
development of automated techniques for the extraction of nuggets, in
particular density-based clusters. He conducted experimental evaluations
using real data sets, including data from Mitre Corporation. Di then
switched his attention to the management of streams and was thereafter
supported on a different grant. Zhenyu Guo, who started on the project
in September 2008, has focused on interactive nugget extraction for linear
regression models. Abhishek Mukherji, having replaced Di Yang, started
on the project in May 2009 and has focused on automated methods for nugget
extraction as well as management techniques for generalized nuggets.
In year 2, Guo and Mukherji have continued their exploration of nuggets,
including extraction, analysis, and visualization. Guo has extended his
work on nuggets formed from linear regression analysis to include nuggets
that result from prototype-based classifiers. He has developed a number
of interactive visualizations for exploring nuggets at multiple levels
of abstraction. Mukherji has focused on nuggets resulting from association
rule mining, and in particular, is interested in efficient mechanisms
to use meta-information so as to respond to parameterized requests over subsets
of the data to efficiently extract local rules.
Three undergraduate REU students have been supported under this grant.
Each started attending our research group meetings in the spring, and
started receiving funding in May of the year of their employment. They
each have participated full-time in our project during the summer. Initially,
each successfully ported visualizations from our old architecture (C++/Tcl/Tk
under Visual Studio) to our new architecture (C++/Qt under Eclipse).
They then shifted their focus to other research activities, in particular:
general color management methodologies (Jason Stasik), real data source
capture and integration (Dan Spitz), and dynamic brushing (Nik Deapen).
In year 3, Guo and Mukherji have made steady progress on expanding our
capabilities to extract, manage, and analyze nuggets. Guo has completed
his first version of the Nugget Browser, with multiple linked views at
different abstractions. A rich assortment of layout strategies has been
developed and tested, as well as some innovations in fiber bundles for
linking the views. Mukherji has moved on to study the representation
and integration of heterogeneous nugget types, and in particular, how
nuggets of one type can be used to refine nuggets of other types.
In the K12 REK project, supported by NSF in 2008/2009 and led by one of
the PIs, we worked with K12 students on small research projects with the
goal of increasing their awareness of and interest in science and
technology. In this outreach context, we made an effort to expose these
K12 students to the visual exploration technology studied and developed
as part of this NSF research grant.
Other Specific Products:
XmdvTool 8.0 Version [Released October 20, 2010]
Source & Binary Releases
We have released XmdvTool 8.0; the Windows and Linux/Unix versions
can be found at SourceForge, linked off our project page.
The salient features of XmdvTool 8.0 are as follows:
New software architecture: The new system is based on the information
visualization reference model (or visualization pipeline) developed by
Ed Chi. For more details on our extensions to this pipeline, namely, our
Operator-Centric Design Patterns for Information Visualization, please
see our research paper in VDA 2010.
New development environment: We have ported XmdvTool to Eclipse using
Qt for the UI to enhance portability.
Multiple views: Users can open multiple datasets at once, and observe each
dataset in multiple sub windows with different visualizations. These windows
can be tiled and/or cascaded.
Color strategy: With a new color strategy dialog, users can assign colors
to data points based on data values or different orderings. We support
sequential, diverging, and qualitative color maps based on Cynthia Brewer's
ColorBrewer schemes.
CSV file support: We enable users to open comma-separated values (CSV)
files directly in XmdvTool, in addition to the XmdvTool native file format.
This software has already been released as freeware on our XMDV project
web site. Copies of most papers, as well as the code, documentation, and
datasets, are available there.
Contributions within Discipline:
Within the visualization field, we have developed a new approach to visual
data analysis by creating views of the model space that are interactively
linked to the corresponding data views. Thus one can indicate a particular
model over thousands of possibilities and see which data fits or doesn't
fit the model. This is a powerful tool in situations where multiple distinct
phenomena are present in the data. The analyst can thus interactively
segment the data based on the fit of the models. In year 1 this has focused
on linear regression models, while year 2 has focused on classifiers (association
rules and prototype-based classification), and year 3 has focused on neighborhood
techniques/sensitivity analysis and time-series patterns.
In year 1 we also contributed to the data modeling and management field
by creating a new representation of high dimensional objects that we call
generalized hyper-cylinders. This compact representation can be used to
represent cluster shapes in a descriptive manner, and is useful not only
for visualizing the cluster but also computing changes in clusters and
even specifying queries on high dimensional data.
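As one plausible minimal reading of the idea (an axis segment between two N-dimensional endpoints plus a radius), the compactness of the representation can be sketched as follows; our actual generalized hyper-cylinders support richer shapes.

import numpy as np
from dataclasses import dataclass

@dataclass
class HyperCylinder:
    a: np.ndarray    # one endpoint of the axis in N-D
    b: np.ndarray    # the other endpoint
    radius: float

    def contains(self, p):
        """Membership test: distance from p to the axis segment <= radius."""
        ab = self.b - self.a
        t = np.clip(np.dot(p - self.a, ab) / np.dot(ab, ab), 0.0, 1.0)
        return np.linalg.norm(p - (self.a + t * ab)) <= self.radius

hc = HyperCylinder(np.zeros(5), np.ones(5), 0.3)  # a 5-D elongated cluster
print(hc.contains(np.full(5, 0.5)))               # point on the axis -> True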
In year 2 we explored representations that can be used to seamlessly analyze
nuggets extracted via different mechanisms (automated, manual). For example,
this allows us to compare the results of different clustering algorithms,
of different association rule mining methods, and of subsets of data
isolated via interactive visual analysis. It is also a useful mechanism for combining
results from multiple analysts working on the same dataset.
In addition, we have also developed nearness functions for efficiently
comparing nuggets that accurately capture the intuition of humans (as
verified via a case study) on the closeness of these concepts both in
terms of query specification as well as implied data
content. Several algorithms for implementing these functions efficiently
have been developed, and then employed for nugget consolidation and cleanup.
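A minimal sketch of such a nearness function follows, blending specification overlap with implied-content overlap; the equal weighting and the interval-overlap measure are illustrative assumptions.

def range_overlap(r1, r2):
    """Mean per-dimension interval overlap of two {dim: (lo, hi)} specs."""
    dims = set(r1) | set(r2)
    scores = []
    for d in dims:
        (a1, b1), (a2, b2) = r1.get(d, (0, 0)), r2.get(d, (0, 0))
        inter = max(0.0, min(b1, b2) - max(a1, a2))
        union = max(b1, b2) - min(a1, a2)
        scores.append(inter / union if union else 0.0)
    return sum(scores) / len(scores) if scores else 0.0

def nearness(n1_ranges, n1_rows, n2_ranges, n2_rows, w=0.5):
    """Blend query-specification overlap with implied data-content overlap."""
    all_rows = n1_rows | n2_rows
    content = len(n1_rows & n2_rows) / len(all_rows) if all_rows else 0.0
    return w * range_overlap(n1_ranges, n2_ranges) + (1 - w) * content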
In year 3 we created a new multi-view framework for visual exploration
of nuggets at different levels of abstraction, including the raw data,
bins in a discretized version of the data space, hyperboxes, and views
specific to the extraction process (e.g., clusters, rules, user-identified).
We also extended our work to include time-series data, using short, potentially
overlapping subsequences of data to represent a shape in N-D space. These
'temporal nuggets' can then be the focus of analysis in terms of similarities
and variations in shapes, and can be used to locate repeated and unusual
patterns in the data.
Providing technology for discovering nuggets and patterns within general
data spaces has the potential to lead to contributions in multiple disciplines,
by providing analysts with tools that allow them to conduct their scientific
explorations in a more effective manner.
As indicated earlier, three Ph.D. students and three undergraduate (REU)
students have been trained in state-of-the-art technology as part of this
project.
We distribute the software generated by our research to the public domain
on a regular basis. Researchers at several universities and research
labs use our tools for their work, and educators at numerous schools use
our software in their courses. We also provide a repository of data sets
that we've collected and posted on our web site; many researchers in
visualization, data mining, and statistics have used and continue to use
these data sets.
Exploratory data analysis touches nearly every aspect of our society from
medicine to manufacturing to homeland security. Interactive visualization
of data, models, and reasoning processes has been recognized as a critical
technology in all these fields. Over the years, techniques we have developed
have been integrated into commercial visualization tools, such as Tableau
and Spotfire, which are being used in a wide range of disciplines.
Special Requirements for Annual Project Report: