Project Reporting ANNUAL REPORT FOR AWARD # 0812027

Worcester Polytech Inst
III-COR-Small: Managing Discoveries in Visual Analytics

Participant Individuals:
CoPrincipal Investigator(s) : Elke A Rundensteiner
Graduate student(s) : Zhenyu Guo; Abhishek Mukherji
Research Experience for Undergraduates(s) : Jason Stasik; Daniel Spitz
Graduate student(s) : Di Yang
Research Experience for Undergraduates(s) : Nicholas Deapen

Partner Organizations:

Other collaborators:

We have been exchanging ideas in targeted meetings with several individuals
at Fidelity Corporation about the problems they face in detecting interesting
trends (nuggets) in their data, and how to manage such knowledge over
time. This may lead to a more in-depth collaboration
as the project progresses, and we are developing technology that could
potentially be applied to address some of their challenges. However, Fidelity
Corporation appears to be more interested in 'products' they can purchase
than in longer-term fundamental research projects, so no in-depth collaboration
has emerged to date from these interactions.

Activities and findings:

Research and Education Activities: 
This project focuses on the creation, management, and exploration of discoveries during exploratory data analysis. These discoveries, which we term 'nuggets', can be clusters, anomalies, trends, associations, and other components of the reasoning and decision-making process. There are five major tasks in this research effort: (1.) Nugget Modeling and Storage: Technology for the modeling and management of nuggets, their complex interrelationships, and their supporting or refuting relationships with the relevant data will be developed. (2.) Nugget Discovery and Capture: Three methods for nugget generation will be developed: explicit identification and confirmation by the user, implicit capture based on analysis of user logs, and automated discovery using statistical and data mining techniques. (3.) Nugget Lifespan Management: Computational and interactive visual methods to enable analysts to efficiently validate, annotate, classify, organize, and purge nuggets will be devised. (4.) Nugget-Supported Visual Exploration: Visual representations of hypotheses, evidence, nuggets, and the data associated with them will help analysts explore their data, manage their discoveries, and organize their reasoning processes. (5.) Assessment: Following user-centered design principles, we will ensure that end-users participate in the design, development, and testing, so that the resulting software tools are both useful and usable to the targeted audiences.

Findings:
During the first year of the grant, we focused on the following:
1. Nugget extraction via user-guided selection on data views. Using common multivariate visualization techniques, analysts can interactively brush over data subsets of interest to create nuggets. These can be annotated to simplify recall of appropriate nuggets and thus simplify their use in hypothesis formation and confirmation.
2. Nugget extraction via automated extraction during data exploration. Using common multivariate visualization techniques, analysts can interactively explore data subsets of interest. Our system monitors their navigation and data brushing activities and automatically creates nuggets of potential interest. Interest here is predicated on the number of times an analyst re-examines a given data subset and the length of time spent with that data. These automatically extracted nuggets can then be annotated, as above, to simplify recall of appropriate nuggets.
3. Nugget extraction via selection on model views. We have created an initial set of model-space visualizations based on linear regression, conveying for every combination of model parameters how well the data fits the model. Promising models are easily identified, and the parameters can be either manually or automatically refined to increase the accuracy of the model. Another variant on this approach is to represent multivariate clusters as N-dimensional objects and combine the visualizations of the data with that of the object shape characteristics.
4. Nugget organization and management. We have developed methods for clustering, refining, and pruning nuggets to help avoid overloading the analyst with too many nuggets. To date, these methods have focused on the simple nuggets extracted via automatic selection in the data space. Such nugget cleanup is most essential within this context, because the nuggets are continuously generated by our system, yet many of them tend to be rather similar by design. Hence, consolidation of multiple similar nuggets into one representative nugget is employed to reduce the nugget space, keeping it manageable.
5. Evaluation using a small user study. To verify the usefulness of nuggets for facilitating exploration, we have extended XmdvTool with the services of nugget extraction, consolidation, and maintenance (NMS) to provide nugget support during visual data exploration. We then conducted a small preliminary user study with the goal of comparing users' efficiency and accuracy when solving tasks with and without the help of NMS. Specifically, we randomly divided 12 users, all WPI students, into 4 groups of 3 users each and asked them to finish the same 5 knowledge discovery tasks (each based on 3 real datasets), some groups with and some groups without the support of NMS services. Our study confirmed that NMS can indeed improve users' time efficiency when solving knowledge discovery tasks. Our preliminary evaluation also shows that NMS enhances users' accuracy in finishing these tasks. The details of this user study can be found in (Di Yang, MS Thesis document).
6. Evaluation using a case study. We have also begun to evaluate our model-view technology for aiding the discovery of hypotheses (i.e., models) about the data. In particular, we observed that an analyst is aided in detecting which among possibly many models may be the most suitable fit for a given data set and, if a single model cannot achieve the desired match level, which subsets of the overall data set may best be matched by which model.
During the second year of the grant, we focused on the following:
7. Nugget extraction via space partitioning. We decompose space into bins or cells by segmenting each dimension into some number of non-overlapping regions. A given bin is either empty, predominantly one class, or some mixture. We then merge bins that are adjacent in data space into hyperboxes with a consistent label. This can accommodate arbitrarily shaped regions, as several overlapping hyperboxes can have the same label.
8. Visualization of nugget space at several levels of abstraction. Each of the bins above can be visualized as an entity that has a location in N dimensions as well as a class ID (unless it is a mixture, in which case it is treated as unclassified). Different layouts can be used to convey different relationships between bins, including ordered by class, centered around a selected bin, or using a spring layout based on representatives from each class. Nuggets formed by combining adjacent bins of the same class can be shown using a variant on star glyphs using an MDS layout. Pattern-specific visualizations are used to convey characteristics of specific patterns (e.g., clusters, association rules, user-selected visual patterns) derived via different methods.
9. Linkages between abstraction levels. Using edge bundle techniques, we link each entity at one level of abstraction with its components at lower levels of abstraction as well as its parent pattern. This 4-level connection between data and models allows users to explore data in a wide range of ways, using a variety of techniques to extract patterns and/or models.
10. Visualizing classifier space. Using prototype-based classifiers, we have studied ways of visualizing relationships between classifiers as well as using existing classifiers to generate new classifiers with characteristics of the classifiers used in the generation. This can result in new classifiers without the need to return to the raw data, while at the same time giving better performance on the dataset being analyzed.
11. Reusing results of association rule mining. We are studying ways to ascertain relationships between association rules that can be used to estimate relations on subsets of data without recomputing the association rules. This approximation can lead to significant performance increases without significant loss of accuracy.
During the third year of the grant, we focused on the following:
12. Nugget Discovery Process. We designed a visual subgroup mining system supporting a closed-loop analysis that involves both data mining and visual analysis in one coherent process. Users can perform data mining as the first step for extracting patterns on multivariate data sets, and then visually analyze the results and the corresponding data. Inspired by the visual exploration, further refinement of the data mining query leads to the next cycle of visual exploration and analysis.
13. Nugget Modeling and Representation. We proposed a representation of the mining results in an understandable form. In addition to storage benefits, this representation is easy for analysts to understand and can be directly shown using common multivariate visualization approaches.
14. A 4-level visual and structure model. Our structure model allows users to explore the data space at four different levels of abstraction: instances, cells, nuggets, and clusters. For each level of this nugget space, we designed a view in which users are able to explore and select items to visualize. In particular, the nugget-level mining results are represented as regular hyperbox-shaped regions, which can be easily understood, visualized, and compactly stored. The connections between the different layers are shown based on the user's cursor position. The layout strategies help users make sense of the relationships between the extracted patterns.
15. We implemented the above techniques in an integrated system called Nugget Browser in XmdvTool, our freeware multivariate data visualization tool. Case studies suggest that our visualization techniques are effective in discovering patterns in multivariate datasets. In the coming year, we intend to continue this research by conducting a more comprehensive user study to assess the usability of the technology and identify possible places for improvement.
16. Automated Nugget Extraction and Refinement Techniques. We tackle the new problem of interactive mining of localized association rules. That is, we give an analyst the ability to select an arbitrary subset of data and efficiently mine association rules specific to that subset. For this, we design a preprocess-once-query-many (POQM) approach. First, the data set is preprocessed off-line to extract relevant features in a global fashion. Second, the actual user query, customized to a data subset, is processed at run-time by exploiting this preprocessed store.
16a. Nugget store. A nugget store, called L-FIST, is designed that uses a novel itemset-based data partitioning that enables compact storage of the itemsets and the underlying data subsets. This store maintains precomputed meta-data about rule-related properties of the data subsets.
16b. Nugget Extraction Plan Modeling. We introduce an algebraic approach towards modeling alternate strategies for localized association rule mining. We define the mining tasks as a pipeline of algebraic operators and apply the principles of query optimization to mining localized rules. In addition to the straightforward solution of running a mining algorithm over the user-chosen subset, we design five POQM plans for answering a localized association rule mining query by leveraging the pre-computed knowledge maintained in the off-line L-FIST nugget store.
16c. Nugget Extraction Optimizer. Our cost analysis demonstrates that none of these plan types outperforms the others for all possible query scenarios. Rather, the costs of these plans depend on several key factors, including the user-selected thresholds, the data subsets, and the L-FIST store. We present our analytical evaluation of the execution costs for each of the alternative plans based on cost models. A runtime query optimizer is then designed to select the fastest alternative to process a given user request based on our cost models.
16d. Experimental Evaluation. An experimental evaluation using several different data sets (IBM Quest and PUMSB dataset benchmarks) is conducted to assess the relative benefits and effectiveness of each of the proposed extraction methods. Initial guidelines for selecting among the methods are established. We expect to complete our evaluation using additional data sets and experiments in the coming year to draw final conclusions.
17. Time-Series Patterns. We investigated the use of N-grams in the analysis of time-series data, mapping each N-gram to a point in N-dimensional shape space. We then used glyphs as a means of displaying each N-gram, and PCA to lay out the glyphs, resulting in clusters and paths that made finding patterns in the time-series much easier. A linked brushing mechanism connected to a time-line view allows users to see distributions and evolutions of patterns. If we treat each N-gram as a nugget (i.e., the nugget is formed based on time, rather than data values), we can use some of the same types of exploration techniques we have developed for other nugget types.
Last but not least, in the no-cost extension year, we anticipate completing the above tasks as well as exploring the following research questions. First, we plan to look into localized, neighborhood-driven pattern mining. In particular, we propose to explore pointwise visualization and exploration techniques for visual multivariate analysis. The general idea is that any local pattern extracted using the neighborhood around a focal point could be explored in a point-wise manner. That is, each local pattern could be extracted based on a regression model and the relationships between the focal point and its neighbors. Such a system would enable an analyst to explore sensitivity information at individual data points. Layout strategies applied to local patterns could reveal which neighbors are of potential interest. Following the idea of subgroup mining, we plan to employ a statistical method to assign each local pattern an outlier factor, so that users can quickly identify anomalous local patterns that deviate from the global pattern. Users can also compare the local pattern with the global pattern both visually and statistically. Appropriate visualizations would need to be designed to integrate the local pattern into the original attribute space so as to reveal the distribution of the data. We also plan to finalize our design for the hypothesis view, which allows analysts to organize (manually and semi-automatically) the nuggets relevant to a particular hypothesis. Finally, while we have performed evaluations on each of the components of the system as they have been developed, we need to continue this process. In particular, we have yet to perform the expert evaluations that we had planned, due to some delays in recruiting appropriate domain experts.
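
The pointwise local-pattern idea planned for the extension year can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the project's implementation: the neighborhood is taken to be the k nearest neighbors in Euclidean distance, the local pattern is an ordinary least-squares fit, and the outlier factor is a standardized distance between local and global coefficients. The report does not specify any of these choices, and all function names are hypothetical.

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares fit of y ~ X @ w + b; returns the coefficients w (intercept dropped)."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[:-1]

def local_patterns(X, y, k=10):
    """For every focal point, fit a regression on the point and its k nearest neighbors.

    Row i of the result holds the local coefficients (sensitivities) at point i.
    """
    local = np.empty((len(X), X.shape[1]))
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        idx = np.argsort(dist)[:k + 1]  # the focal point itself plus k neighbors
        local[i] = fit_linear(X[idx], y[idx])
    return local

def outlier_factors(local, global_coef):
    """Assumed outlier factor: distance of each local coefficient vector from the
    global one, measured in standard deviations of the local coefficients."""
    spread = local.std(axis=0)
    spread[spread == 0] = 1.0  # avoid division by zero when all local fits agree
    return np.linalg.norm((local - global_coef) / spread, axis=1)
```

Sorting the points by the returned factor would surface the focal points whose local trend departs most from the global trend, which is the kind of quick identification of anomalous local patterns the plan describes.
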

Training and Development:
In year 1, three Ph.D. students were supported in part under this grant. Di Yang was partially supported in the first year, focusing on the development of automated techniques for the extraction of nuggets, in particular density-based clusters. Di Yang conducted experimental evaluations using real data sets, including data from Mitre Corporation. Di then switched his attention to the management of streams and was thereafter supported on a different grant. Zhenyu Guo, who started on the project in September 2008, has focused on interactive nugget extraction for linear regression models. Abhishek Mukherji, who replaced Di Yang and started on the project in May 2009, has focused on automated methods for nugget extraction as well as management techniques for generalized nuggets.
In year 2, Guo and Mukherji continued their exploration of nuggets, including extraction, analysis, and visualization. Guo extended his work on nuggets formed from linear regression analysis to include nuggets that result from prototype-based classifiers. He developed a number of interactive visualizations for exploring nuggets at multiple levels of abstraction. Mukherji focused on nuggets resulting from association rule mining and, in particular, on efficient mechanisms that use meta-information to respond to parameterized requests over subsets of the data and efficiently extract local rules.
Three undergraduate REU students have been supported under this grant. Each started attending our research group meetings in the spring and started receiving funding in May of the year of their employment. Each participated full-time in our project during the summer. Initially, each successfully ported visualizations from our old architecture (C++/Tcl/Tk under Visual Studio) to our new architecture (C++/Qt under Eclipse). They then shifted their focus to other research activities, in particular: general color management methodologies (Jason Stasik), real data source capture and integration (Dan Spitz), and dynamic brushing (Nik Deapen).
In year 3, Guo and Mukherji made steady progress on expanding our capabilities to extract, manage, and analyze nuggets. Guo completed his first version of the Nugget Browser, with multiple linked views at different abstractions. A rich assortment of layout strategies has been developed and tested, as well as some innovations in fiber bundles for linking the views. Mukherji has moved on to study the representation and integration of heterogeneous nugget types and, in particular, how nuggets of one type can be used to refine nuggets of other types.

Outreach Activities:
In the K12 REK project of one of the PIs, supported by NSF in 2008/2009, we have worked with K12 students on small research projects with the goal of increasing their awareness of and interest in science and technology. In this outreach context, we have made an effort to expose these K12 students to visual exploration technology as studied and developed under this NSF research grant.

Journal Publications:
D. Yang, Z. Xie, E. Rundensteiner, and M. Ward, "Managing discoveries in the visual analytics process", ACM SIGKDD Explorations (special issue on Visual Analytics), vol. 9, (2007), p. 22. Published
D. Yang, Z. Xie, E. Rundensteiner, and M. Ward, "Nugget discovery in visual exploration environments by query consolidation", Proc. ACM Conference on Information and Knowledge Management, (2007), p. 603. Published
M. Ward and Z. Guo, "Generalized hyper-cylinders: a mechanism for modeling and visualizing N-D objects", Proc. Dagstuhl Seminar on Scientific Visualization 2007; published in Scientific Visualization: Advanced Concepts 2010, vol. 1, (2010), p. 1. Published
Z. Guo, M. Ward, and E. Rundensteiner, "Model Space Visualization for Multivariate Linear Trend Discovery", Proc. IEEE Symposium on Visual Analytics Science and Technology, (2009). Published
Di Yang, Elke A. Rundensteiner, Matthew O. Ward, "Analysis Guided Visual Exploration of Multivariate Data", IEEE Symposium on Visual Analytics, Science and Technology (VAST), (2007), p. 1. Published
Xie, Z., Guo, Z., Ward, M.O., and Rundensteiner, E.A., "Operator-centric design patterns for information visualization software", Proc. SPIE-IS&T Electronic Imaging, Visualization and Data Analysis, vol. 7530, (2010), p. 75300. Published
Abhishek Mukherji and Elke A. Rundensteiner and Matthew O. Ward, "Achieving High Freshness and Optimal Throughput in CPU-limited Execution of Multi-Join Continuous Queries", British National Conference on Databases (BNCOD 2011), vol. 1, (2011), p. 1. Accepted
Matthew O. Ward and Zhenyu Guo, "Visual Exploration of Time-Series Data with Shape Space Projections", Eurographics / IEEE Symposium on Visualization 2011 (EuroVis 2011), vol. 30, no. 3 (to appear), (2011), p. 1. Published
Zaixian Xie, Matthew O. Ward, Elke A. Rundensteiner, "Visual Exploration of Stream Pattern Changes Using a Data-Driven Framework", Proc. of 6th International Symposium on Visual Computing, pp 522-532, Nov.29 - Dec. 1, 2010 (Lecture Notes in Computer Scien, vol. 1, (2010), p. 522. Published
Z. Guo, M. Ward, and E. Rundensteiner, "Nugget Browser: Visual Subgroup Mining and Statistical Significance Discovery in Multivariate Datasets", Proc. Int. Conference on Information Visualization (IV2011), (2011). Accepted
Z. Guo, M. Ward, E. Rundensteiner, and C. Ruiz, "Pointwise Local Pattern Exploration for Sensitivity Analysis", IEEE Conf. on Visual Analytics Science and Technology (VAST 2011), (2011). Submitted

Book(s) of other one-time publications(s):

Other Specific Products:

Software (or netware)
XmdvTool 8.0 Version [Released October 20, 2010] 

Source & Binary Releases

We have released XmdvTool 8.0, and the Windows and Linux/Unix versions
of it can be found at SourceForge, linked from our project page. 

The salient features of XmdvTool 8.0 are as follows:

New software architecture: The new system is based on the information
visualization reference model (or visualization pipeline) developed by
Ed Chi. For more details on our extensions to this pipeline, namely, our
Operator-Centric Design Patterns for Information Visualization, please
see our research paper in VDA 2010. 

New development environment: We have ported XmdvTool to Eclipse using
Qt for the UI to enhance portability. 

Multiple views: Users can open multiple datasets at once, and observe each
dataset in multiple sub-windows with different visualizations. These windows
can be tiled and/or cascaded. 

Color strategy: With a new color strategy dialog, users can assign colors
to datapoints based on data values or different orderings. We support
sequential, diverging, and qualitative color maps based on Cynthia Brewer's
work.
 
CSV file support: We enable users to open comma-separated values (CSV)
files directly in XmdvTool, in addition to the XmdvTool native file format
(.okc). 

This software has already been released as freeware on our XMDV project
webpage:

http://davis.wpi.edu/xmdv/

Internet Dissemination:

http://davis.wpi.edu/~xmdv

This is the project web site.  Copies of most papers, as well as the code,
documentation, and datasets, are available here.

Contributions:

Contributions within Discipline:

Within the visualization field, we have developed a new approach to visual
data analysis by creating views of the model space that are interactively
linked to the corresponding data views. Thus one can select a particular
model among thousands of possibilities and see which data fits or doesn't
fit the model. This is a powerful tool in situations where multiple distinct
phenomena are present in the data. The analyst can thus interactively
segment the data based on the fit of the models.  In year 1 this work focused
on linear regression models, year 2 focused on classifiers (association
rules and prototype-based classification), and year 3 focused on neighborhood
techniques/sensitivity analysis and time-series patterns.
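
As a rough illustration of the model-space idea for the linear regression case, the following Python sketch scores a grid of candidate lines against 2-D data and returns the data subset that fits a selected model. The tolerance-based fit measure, the grid sampling, and all names are assumptions made for this sketch; the report does not state how goodness of fit is computed in XmdvTool.

```python
def fit_fraction(points, slope, intercept, tol=0.5):
    """Fraction of (x, y) points lying within `tol` of the line y = slope*x + intercept."""
    hits = sum(1 for x, y in points if abs(y - (slope * x + intercept)) <= tol)
    return hits / len(points)

def model_space(points, slopes, intercepts, tol=0.5):
    """Goodness of fit for every sampled (slope, intercept) combination.

    The returned dict is what a model-space view would display: one value per model.
    """
    return {(a, b): fit_fraction(points, a, b, tol)
            for a in slopes for b in intercepts}

def points_fitting(points, slope, intercept, tol=0.5):
    """Data subset selected by picking one model; the linked data view would highlight it."""
    return [(x, y) for x, y in points
            if abs(y - (slope * x + intercept)) <= tol]

# Two linear phenomena mixed in one dataset (synthetic data, for illustration only).
trend_a = [(i * 10 / 29, i * 10 / 29) for i in range(30)]        # y = x
trend_b = [(i * 10 / 19, -(i * 10 / 19) + 10) for i in range(20)]  # y = -x + 10
space = model_space(trend_a + trend_b, [-1.0, 0.0, 1.0], [0.0, 10.0])
best_model = max(space, key=space.get)
```

Selecting the best-scoring model and then the second-best would segment this synthetic dataset into its two trends, mirroring the interactive segmentation described above.
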

In year 1 we also contributed to the data modeling and management field
by creating a new representation of high dimensional objects that we call
generalized hyper-cylinders. This compact representation can be used to
represent cluster shapes in a descriptive manner, and is useful not only
for visualizing the cluster but also computing changes in clusters and
even specifying queries on high dimensional data.

In year 2 we explored representations that can be used to seamlessly analyze
nuggets extracted via different mechanisms (automated, manual).  For example,
this allows us to compare the results of different clustering algorithms
with different association rule mining methods with subsets of data isolated
via interactive visual analysis.  It is also a useful mechanism for combining
results from multiple analysts working on the same dataset.

In addition, we have developed nearness functions for efficiently
comparing nuggets. These functions accurately capture human intuition (as
verified via a case study) about the closeness of nuggets, both in
terms of query specification and in terms of implied data content.
Several algorithms for implementing these functions efficiently
have been developed and then employed for nugget consolidation and cleanup.
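
One way to picture a nearness function with both a query component and a data component is the Python sketch below. It is a hypothetical formulation, not the project's definition: it assumes a nugget is a hyper-rectangular range query plus the set of record ids it selects, measures query nearness as per-dimension interval overlap, measures data nearness as Jaccard similarity, and consolidates greedily. The weight and threshold values are arbitrary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Nugget:
    ranges: tuple       # one (low, high) interval per dimension: the query specification
    records: frozenset  # ids of the selected records: the implied data content

def query_nearness(a, b):
    """Mean per-dimension ratio of interval intersection to the enclosing interval."""
    ratios = []
    for (lo1, hi1), (lo2, hi2) in zip(a.ranges, b.ranges):
        inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
        hull = max(hi1, hi2) - min(lo1, lo2)
        ratios.append(inter / hull if hull > 0 else 1.0)
    return sum(ratios) / len(ratios)

def data_nearness(a, b):
    """Jaccard similarity of the record sets selected by the two nuggets."""
    union = a.records | b.records
    return len(a.records & b.records) / len(union) if union else 1.0

def nearness(a, b, weight=0.5):
    """Weighted combination of query and data nearness (weight is an assumed knob)."""
    return weight * query_nearness(a, b) + (1 - weight) * data_nearness(a, b)

def consolidate(nuggets, threshold=0.7):
    """Greedy cleanup: keep a nugget only if no kept representative is near it."""
    representatives = []
    for nugget in nuggets:
        if all(nearness(nugget, rep) < threshold for rep in representatives):
            representatives.append(nugget)
    return representatives
```

With this formulation, two nuggets created by brushing nearly the same region collapse into one representative, while a nugget over a distant region is kept.
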

In year 3 we created a new multi-view framework for visual exploration
of nuggets at different levels of abstraction, including the raw data,
bins in a discretized version of the data space, hyperboxes, and views
specific to the extraction process (e.g., clusters, rules, user-identified).
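
The step from raw data to bins to hyperboxes can be sketched in Python as follows. This is a simplified illustration under assumptions: equal-width bins per dimension, a purity threshold to decide when a bin is "predominantly one class", and a greedy box-growing pass in place of whatever merging strategy the Nugget Browser actually uses. All names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

def bin_records(records, labels, bins_per_dim):
    """Map each record to a grid cell; returns {cell index tuple: Counter of class labels}."""
    dims = len(records[0])
    lo = [min(r[d] for r in records) for d in range(dims)]
    hi = [max(r[d] for r in records) for d in range(dims)]
    cells = defaultdict(Counter)
    for rec, label in zip(records, labels):
        cell = []
        for d in range(dims):
            width = (hi[d] - lo[d]) / bins_per_dim or 1.0
            cell.append(min(int((rec[d] - lo[d]) / width), bins_per_dim - 1))
        cells[tuple(cell)][label] += 1
    return cells

def label_cells(cells, purity=0.9):
    """Give a cell a class ID if one class dominates; mixtures stay unclassified (None)."""
    labeled = {}
    for cell, counts in cells.items():
        label, top = counts.most_common(1)[0]
        labeled[cell] = label if top / sum(counts.values()) >= purity else None
    return labeled

def grow_hyperbox(cell_labels, seed):
    """Greedily extend a box of cells from `seed` while each added slab has the seed's label."""
    label = cell_labels[seed]
    if label is None:
        raise ValueError("seed cell must be classified")
    lo, hi = list(seed), list(seed)
    grown = True
    while grown:
        grown = False
        for d in range(len(seed)):
            for new, side in ((lo[d] - 1, lo), (hi[d] + 1, hi)):
                slab = [range(lo[k], hi[k] + 1) for k in range(len(seed))]
                slab[d] = [new]
                # Empty cells are absent from the dict, so they stop the growth.
                if all(cell_labels.get(c) == label for c in product(*slab)):
                    side[d] = new
                    grown = True
    return tuple(lo), tuple(hi)
```

Calling `grow_hyperbox` from different seed cells can yield overlapping boxes with the same label, which is how box-shaped nuggets can still cover an arbitrarily shaped region.
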
 We also extended our work to include time-series data, using short, potentially
overlapping subsequences of data to represent a shape in N-D space.  These
'temporal nuggets' can then be the focus of analysis in terms of similarities
and variations in shapes, and can be used to locate repeated and unusual
patterns in the data.
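
A minimal Python sketch of the temporal-nugget pipeline is given below. It assumes that "shape" means a subsequence with its mean and magnitude removed, and it uses an SVD-based PCA for the 2-D layout; the normalization and the function names are assumptions for illustration, since the report only states that N-grams are mapped to N-dimensional shape space and laid out with PCA.

```python
import numpy as np

def ngrams(series, n):
    """All overlapping length-n subsequences of the series, one per row."""
    series = np.asarray(series, dtype=float)
    return np.stack([series[i:i + n] for i in range(len(series) - n + 1)])

def to_shape_space(windows):
    """Remove each window's mean and magnitude so that only its shape remains."""
    centered = windows - windows.mean(axis=1, keepdims=True)
    scale = np.linalg.norm(centered, axis=1, keepdims=True)
    scale[scale == 0] = 1.0  # flat windows map to the origin
    return centered / scale

def pca_layout(points, dims=2):
    """Project points onto their first `dims` principal components."""
    centered = points - points.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:dims].T

def temporal_nuggets(series, n=3):
    """2-D positions at which the glyph for each N-gram would be drawn."""
    return pca_layout(to_shape_space(ngrams(series, n)))
```

Subsequences with the same shape land on the same position regardless of their level or amplitude, so repeated patterns form clusters and unusual ones stand apart in the layout.
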


Contributions to Other Disciplines:
 Providing technology for discovering nuggets and patterns within general
data spaces has the potential to lead to contributions in multiple disciplines,
by providing analysts with tools that allow them to conduct their scientific
explorations in a more effective manner.

Contributions to Education and Human Resources:
As indicated earlier, three Ph.D. students and three undergraduate (REU) students
have been trained in state-of-the-art technology as part of this project
effort.

Contributions to Resources for Science and Technology:
We distribute the software generated by our research to the public domain
on a regular basis.  Researchers at several universities and research
labs use our tools for their work, and educators at numerous schools use
our software in their courses.  We also provide a repository of data sets
that we have collected and posted on our web site; many researchers in visualization,
data mining, and statistics have used and continue to use these data sets.

Contributions Beyond Science and Engineering:
 Exploratory data analysis touches nearly every aspect of our society from
medicine to manufacturing to homeland security.  Interactive visualization
of data, models, and reasoning processes has been recognized as a critical
technology in all these fields.  Over the years, techniques we have developed
have been integrated into commercial visualization tools, such as Tableau
and Spotfire, which are being used in a wide range of disciplines.

Conference Proceedings:
Yang, D;Rundensteiner, EA;Ward, MO, "Analysis guided visual exploration of multivariate data", IEEE Symposium on Visual Analytics Science and Technology, OCT 30-NOV 01, 2007, VAST: IEEE SYMPOSIUM ON VISUAL ANALYTICS SCIENCE AND TECHNOLOGY 2007, PROCEEDINGS, : 83-90 2007

Special Requirements for Annual Project Report:


Categories for which nothing is reported:
Participants: Partner organizations
Products: Book or other one-time publication
Special Reporting Requirements
Animal, Human Subjects, Biohazards

