Partner Organizations:
Other collaborators:
We have been exchanging ideas in targeted meetings with several individuals at Fidelity Corporation about the problems they face in detecting interesting trends (nuggets) in their data, and in managing such knowledge over time. This may lead to a more in-depth collaboration as the project progresses, and we are developing technology that could potentially be applied to some of their challenges. However, Fidelity Corporation appears to be more interested in 'products' they can purchase than in longer-term fundamental research projects, so no in-depth collaboration has emerged from these interactions to date.
Activities and findings:
Research and Education Activities: This project focuses on the creation, management, and exploration of discoveries made during exploratory data analysis. These discoveries, which we term 'nuggets', can be clusters, anomalies, trends, associations, and other components of the reasoning and decision-making process. There are five major tasks in this research effort:
(1) Nugget Modeling and Storage: Technology will be developed for modeling and managing nuggets, their complex interrelationships, and their supporting or refuting relationships with the relevant data (a toy sketch of one possible nugget record follows this list).
(2) Nugget Discovery and Capture: Three methods for nugget generation will be developed: explicit identification and confirmation by the user, implicit capture based on analysis of user logs, and automated discovery using statistical and data mining techniques.
(3) Nugget Lifespan Management: Computational and interactive visual methods will be devised to enable analysts to efficiently validate, annotate, classify, organize, and purge nuggets.
(4) Nugget-Supported Visual Exploration: Visual representations of hypotheses, evidence, nuggets, and the data associated with them will help analysts explore their data, manage their discoveries, and organize their reasoning processes.
(5) Assessment: Following user-centered design principles, we will ensure that end-users participate in the design, development, and testing, to help ensure that the resulting software tools are both useful and usable to the targeted audiences.
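To make the notion of a nugget record from Task (1) concrete, below is a toy sketch of how such a record might be modeled, assuming a hyperbox extent per dimension; all names and fields here are illustrative assumptions, not the project's actual schema.

```python
# A minimal sketch of a "nugget" record (Task 1). The fields and the
# hyperbox representation are hypothetical, chosen for illustration.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Nugget:
    nugget_id: int
    kind: str                                    # e.g., "cluster", "trend", "association"
    ranges: Dict[str, Tuple[float, float]]       # per-dimension extent in data space
    annotation: str = ""                         # analyst-supplied note (Task 3)
    supports: List[int] = field(default_factory=list)  # ids of nuggets this supports
    refutes: List[int] = field(default_factory=list)   # ids of nuggets this refutes

    def contains(self, point: Dict[str, float]) -> bool:
        """Check whether a data point falls inside this nugget's extent."""
        return all(lo <= point[d] <= hi for d, (lo, hi) in self.ranges.items())

# Example: a trend nugget over two attributes, linked to another nugget.
n = Nugget(1, "trend", {"mpg": (25.0, 40.0), "weight": (1500.0, 2500.0)},
           annotation="light cars with high mileage", supports=[2])
print(n.contains({"mpg": 30.0, "weight": 2000.0}))  # True
```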
Findings: During the first year of the grant we focused on the following:
1. Nugget extraction via user-guided selection on data views. Using common multivariate visualization techniques, analysts can interactively brush over data subsets of interest to create nuggets. These can be annotated to simplify recall of the appropriate nuggets, and thus to simplify their use in hypothesis formation and confirmation.
2. Nugget extraction via automated capture during data exploration. As analysts use common multivariate visualization techniques to interactively explore data subsets of interest, our system monitors their navigation and data brushing activities and automatically creates nuggets of potential interest. Interest here is predicated on the number of times an analyst re-examines a given data subset and the length of time spent with that data. These automatically extracted nuggets can then be annotated, as above, to simplify recall of appropriate nuggets.
3. Nugget extraction via selection on model views. We have created an initial set of model-space visualizations based on linear regression, conveying for every combination of model parameters how well the data fits the model (a minimal sketch of this idea follows). Promising models are easily identified, and the parameters can be refined either manually or automatically to increase the accuracy of the model. Another variant of this approach represents multivariate clusters as N-dimensional objects and combines visualizations of the data with visualizations of the objects' shape characteristics.
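The following minimal sketch illustrates the parameter-grid idea from item 3 for the simplest case of a one-predictor linear model y = a*x + b: the fit of every (a, b) combination is evaluated over the data, producing a grid that a model-space view would render visually (here we only print its minimum). The synthetic data, grid ranges, and error measure are all assumptions for illustration.

```python
# Sketch of a model-space fit grid for y = a*x + b: one fit-quality
# value per (slope, intercept) cell; low-error cells mark promising models.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, 200)   # synthetic data, true model a=2, b=1

slopes = np.linspace(0, 4, 81)        # candidate values for parameter a
intercepts = np.linspace(-3, 5, 81)   # candidate values for parameter b

# Mean squared error of each candidate model over the data set;
# fit[i, j] corresponds to the model (slopes[i], intercepts[j]).
A, B = np.meshgrid(slopes, intercepts, indexing="ij")
fit = ((y[None, None, :] - (A[..., None] * x + B[..., None])) ** 2).mean(axis=-1)

i, j = np.unravel_index(fit.argmin(), fit.shape)
print(f"best grid model: y = {slopes[i]:.2f}*x + {intercepts[j]:.2f}, MSE = {fit[i, j]:.3f}")
```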
4. Nugget organization and management. We have developed methods for nugget clustering, refinement, and pruning to help avoid overloading the analyst with too many nuggets. To date, these methods have focused on the simple nuggets extracted via automatic selection in the data space. Such nugget cleanup is most essential in this context because the nuggets are continuously generated by our system, yet many of them tend to be rather similar by design. Hence, multiple similar nuggets are consolidated into one representative nugget to keep the nugget space manageable.
5. Evaluation using a small user study. To verify the usefulness of nuggets for facilitating exploration, we extended XmdvTool with nugget extraction, consolidation, and maintenance services (NMS) to provide nugget support during visual data exploration. We then conducted a small preliminary user study to compare users' efficiency and accuracy when solving tasks with and without the help of NMS. Specifically, we randomly divided 12 users, all WPI students, into 4 groups of 3, and asked them to finish the same 5 knowledge discovery tasks (each based on 3 real datasets), some groups with and some without the support of NMS services. Our study confirmed that NMS can indeed improve users' time efficiency when solving knowledge discovery tasks. Our preliminary evaluation also showed that NMS enhances users' accuracy on these tasks. The details of this user study can be found in (Di Yang, MS Thesis document).
6. Evaluation using a case study. We have also begun to evaluate our model-view technology for aiding the discovery of hypotheses (i.e., models) about the data. In particular, we observed that an analyst is aided in detecting which among possibly many models may be the most suitable fit for a given data set, including, when a single model cannot achieve the desired match level, which subsets of the overall data set are best matched by which model.
During the second year of the grant we focused on the following:
7. Nugget extraction via space partitioning. We decompose the data space into bins or cells by segmenting each dimension into some number of non-overlapping regions. A given bin is either empty, predominantly one class, or some mixture. We then merge bins that are adjacent in data space into hyperboxes with a consistent label (a rough sketch of this binning-and-merging step appears after item 13). This accommodates arbitrarily shaped regions, as several overlapping hyperboxes can share the same label.
8. Visualization of the nugget space at several levels of abstraction. Each of the bins above can be visualized as an entity that has a location in N dimensions as well as a class ID (unless it is a mixture, in which case it is treated as unclassified). Different layouts can be used to convey different relationships between bins, including ordering by class, centering around a selected bin, or a spring layout based on representatives from each class. Nuggets formed by combining adjacent bins of the same class can be shown using a variant of star glyphs with an MDS layout. Pattern-specific visualizations convey the characteristics of specific patterns (e.g., clusters, association rules, user-selected visual patterns) derived via different methods.
9. Linkages between abstraction levels. Using edge bundling techniques, we link each entity at one level of abstraction with its components at lower levels of abstraction as well as with its parent pattern. This 4-level connection between data and models allows users to explore data in a wide range of ways, using a variety of techniques to extract patterns and/or models.
10. Visualizing classifier space. Using prototype-based classifiers, we have studied ways of visualizing relationships between classifiers, as well as of using existing classifiers to generate new classifiers with characteristics of the classifiers used in the generation. This can yield new classifiers without the need to return to the raw data, while at the same time giving better performance on the dataset being analyzed.
11. Reusing results of association rule mining. We are studying ways to ascertain relationships between association rules that can be used to estimate relations on subsets of data without recomputing the association rules. This approximation can lead to significant performance gains without significant loss of accuracy.
During the third year of the grant we focused on the following:
12. Nugget discovery process. We designed a visual subgroup mining system supporting a closed-loop analysis that combines data mining and visual analysis in one coherent process. Users can perform data mining as a first step to extract patterns from multivariate data sets, and then visually analyze the results and the corresponding data. Insights from the visual exploration drive further refinement of the data mining query, which leads to the next cycle of visual exploration and analysis.
13. Nugget modeling and representation. We proposed a representation of mining results in an understandable form. In addition to its storage benefits, this representation is easy for analysts to understand and can be shown directly using common multivariate visualization approaches.
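The sketch below illustrates the binning-and-merging step from items 7 and 8 under simplifying assumptions: equal-width binning, dominant-class labeling of occupied bins, and a greedy merge of same-label bins along one axis. It is an illustrative stand-in, not the project's actual algorithm.

```python
# Sketch of space partitioning (item 7): discretize, label bins by
# dominant class, then merge adjacent same-label bins into hyperboxes.
import numpy as np
from collections import Counter

def bin_labels(points, classes, bins_per_dim, lo, hi):
    """Map each occupied bin (as an index tuple) to its dominant class."""
    idx = np.floor((points - lo) / (hi - lo) * bins_per_dim).astype(int)
    idx = np.clip(idx, 0, bins_per_dim - 1)
    per_bin = {}
    for cell, c in zip(map(tuple, idx.tolist()), classes):
        per_bin.setdefault(cell, []).append(c)
    return {cell: Counter(cs).most_common(1)[0][0] for cell, cs in per_bin.items()}

def merge_along_axis(labeled, axis):
    """Greedily merge runs of same-label bins adjacent along one axis."""
    boxes, used = [], set()
    for cell in sorted(labeled):
        if cell in used:
            continue
        run, nxt = [cell], list(cell)
        while True:
            nxt[axis] += 1
            t = tuple(nxt)
            if t in used or labeled.get(t) != labeled[cell]:
                break
            run.append(t)
        used.update(run)
        boxes.append((run[0], run[-1], labeled[cell]))  # (min corner, max corner, label)
    return boxes

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, (300, 2))
cls = (pts[:, 0] + pts[:, 1] > 1.0).astype(int)     # synthetic 2-class data
labeled = bin_labels(pts, cls, bins_per_dim=6, lo=0.0, hi=1.0)
print(merge_along_axis(labeled, axis=0)[:3])
```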
14. A 4-level visual and structural model. Our structural model allows users to explore the data space at four different levels of abstraction: instances, cells, nuggets, and clusters. For each level of this nugget space, we designed a view in which users can explore and select items to visualize. In particular, the nugget-level mining results are represented as regular hyperbox-shaped regions, which can be easily understood, visualized, and compactly stored. The connections between the different layers are shown based on the user's cursor position, and the layout strategies help users make sense of the relationships between the extracted patterns.
15. We implemented the above techniques in an integrated system called Nugget Browser within XmdvTool, our freeware multivariate data visualization tool. Case studies suggest that our visualization techniques are effective in discovering patterns in multivariate datasets. In the coming year, we intend to continue this research by conducting a more comprehensive user study to assess the usability of the technology and identify places for improvement.
16. Automated nugget extraction and refinement techniques. We tackle the new problem of interactive mining of localized association rules; that is, we provide an analyst with the ability to select an arbitrary subset of the data and efficiently mine association rules specific to that subset. For this, we designed a preprocess-once-query-many (POQM) approach: first, the data set is preprocessed off-line to extract relevant features in a global fashion; second, the actual user query, customized to a data subset, is processed at run-time by exploiting this preprocessed store (a simplified sketch of this two-phase idea follows item 16d).
16a. For this, a nugget store, called L-FIST, is designed that uses a novel itemset-based data partitioning, enabling compact storage of the itemsets and the underlying data subsets. This store maintains precomputed meta-data about rule-related properties of the data subsets.
16b. Nugget extraction plan modeling. We introduce an algebraic approach to modeling alternate strategies for localized association rule mining. We define the mining tasks as a pipeline of algebraic operators and apply the principles of query optimization to the mining of localized rules. In addition to the straightforward solution of running a mining algorithm over the user-chosen subset, we designed five POQM plans for answering a localized association rule mining query by leveraging the pre-computed knowledge maintained in the off-line L-FIST nugget store.
16c. Nugget extraction optimizer. Our cost analysis demonstrates that none of these plan types outperforms the others for all possible query scenarios. Rather, the costs of these plans depend on several key factors, including the user-selected thresholds, the data subsets, and the L-FIST store. We present an analytical evaluation of the execution costs of each alternative plan based on cost models. A run-time query optimizer is then designed to select the fastest alternative for a given user request based on these cost models.
16d. Experimental evaluation. An experimental evaluation using several data sets (the IBM Quest and PUMSB benchmarks) was conducted to assess the relative benefits and effectiveness of each of the proposed extraction methods. Initial guidelines for selecting among the methods were established.
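The following highly simplified sketch conveys the two-phase preprocess-once-query-many idea from item 16: counts of small itemsets are precomputed per data partition off-line, and a localized query over any selection of partitions is then answered by aggregating those counts rather than rescanning the data. The partitioning scheme, store layout, and thresholds here are illustrative assumptions and do not reflect the actual L-FIST design.

```python
# Sketch of preprocess-once-query-many for localized association rules.
from itertools import combinations
from collections import Counter

def precompute(partitions, max_size=2):
    """Off-line step: per-partition counts of itemsets up to max_size."""
    store = []
    for part in partitions:
        counts = Counter()
        for txn in part:
            for k in range(1, max_size + 1):
                counts.update(combinations(sorted(txn), k))
        store.append((len(part), counts))
    return store

def localized_frequent(store, selected, min_support):
    """Run-time step: frequent itemsets over the selected partitions only."""
    total, merged = 0, Counter()
    for i in selected:
        n, counts = store[i]
        total += n
        merged.update(counts)
    return {s: c / total for s, c in merged.items() if c / total >= min_support}

parts = [
    [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}],   # partition 0
    [{"b", "c"}, {"c"}, {"a", "c"}],             # partition 1
]
store = precompute(parts)                         # build once, off-line
print(localized_frequent(store, selected=[1], min_support=0.5))
```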
We expect to complete this evaluation with additional data sets and experiments in the coming year in order to draw final conclusions.
17. Time-series patterns. We investigated the use of N-grams in the analysis of time-series data, mapping each N-gram to a point in N-dimensional shape space. We then used glyphs to display each N-gram and PCA to lay out the glyphs, resulting in clusters and paths that made finding patterns in the time series much easier (a small sketch of this mapping appears at the end of this section). A linked brushing mechanism connected to a time-line view allows users to see distributions and evolutions of patterns. If we treat each N-gram as a nugget (i.e., a nugget formed based on time rather than data values), we can apply some of the same exploration techniques we have developed for other nugget types.
Last but not least, in the no-cost extension year, we anticipate completing the above tasks as well as exploring the following research questions. First, we plan to investigate localized, neighborhood-driven pattern mining. In particular, we propose to explore pointwise visualization and exploration techniques for visual multivariate analysis. The general idea is that any local pattern extracted using the neighborhood around a focal point can be explored in a pointwise manner; that is, each local pattern can be extracted based on a regression model and the relationships between the focal point and its neighbors. Such a system would enable an analyst to explore sensitivity information at individual data points, while layout strategies applied to local patterns could reveal which neighbors are of potential interest. Following the idea of subgroup mining, we plan to employ a statistical method to assign each local pattern an outlier factor, so that users can quickly identify anomalous local patterns that deviate from the global pattern. Users can also compare the local pattern with the global pattern both visually and statistically. Appropriate visualizations would need to be designed to integrate the local pattern into the original attribute space so as to reveal the distribution of the data. We also plan to finalize our design for the hypothesis view, which allows analysts to organize (manually and semi-automatically) the nuggets relevant to a particular hypothesis. Finally, while we have evaluated each component of the system as it was developed, we need to continue this process; in particular, we have yet to perform the planned expert evaluations, due to delays in recruiting appropriate domain experts.
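Returning to item 17, the sketch below shows the core mapping under stated assumptions: overlapping windows of length N are mean-centered to emphasize shape, treated as points in N-dimensional shape space, and projected to 2-D via PCA for glyph layout. The window length and preprocessing choices are illustrative.

```python
# Sketch of the time-series N-gram mapping (item 17): windows of a
# series become points in N-D shape space, laid out in 2-D via PCA.
import numpy as np

def ngrams(series, n):
    """All overlapping windows of length n, centered to emphasize shape."""
    w = np.lib.stride_tricks.sliding_window_view(series, n).astype(float)
    return w - w.mean(axis=1, keepdims=True)

def pca_2d(X):
    """Project rows of X onto their first two principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

t = np.arange(300)
series = np.sin(t / 10.0) + 0.1 * np.random.default_rng(2).normal(size=300)
grams = ngrams(series, n=8)         # each row is one temporal "nugget"
layout = pca_2d(grams)              # 2-D positions for glyph placement
print(grams.shape, layout.shape)    # (293, 8) (293, 2)
```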
Training and Development: In year 1, three Ph.D. students were supported in part under this grant. Di Yang was supported partially in the first year, focusing on the development of automated techniques for the extraction of nuggets, in particular density-based clusters. He conducted experimental evaluations using real data sets, including data from Mitre Corporation. Di then switched his attention to the management of streams and was thereafter supported on a different grant. Zhenyu Guo, who started on the project in September 2008, has focused on interactive nugget extraction for linear regression models. Abhishek Mukherji, who replaced Di Yang, started on the project in May 2009 and has focused on automated methods for nugget extraction as well as management techniques for generalized nuggets. In year 2, Guo and Mukherji continued their exploration of nuggets, including extraction, analysis, and visualization. Guo extended his work on nuggets formed from linear regression analysis to include nuggets that result from prototype-based classifiers, and developed a number of interactive visualizations for exploring nuggets at multiple levels of abstraction. Mukherji focused on nuggets resulting from association rule mining, and in particular on efficient mechanisms that use meta-information to respond to parameterized requests over subsets of the data and efficiently extract local rules. Three undergraduate REU students have been supported under this grant. Each started attending our research group meetings in the spring and began receiving funding in May of the year of their employment; each participated full-time in our project during the summer. Initially, each successfully ported visualizations from our old architecture (C++/Tcl/Tk under Visual Studio) to our new architecture (C++/Qt under Eclipse). They then shifted their focus to other research activities, in particular general color management methodologies (Jason Stasik), real data source capture and integration (Dan Spitz), and dynamic brushing (Nik Deapen). In year 3, Guo and Mukherji made steady progress on expanding our capabilities to extract, manage, and analyze nuggets. Guo completed the first version of the Nugget Browser, with multiple linked views at different abstractions; a rich assortment of layout strategies has been developed and tested, along with some innovations in using fiber bundles to link the views. Mukherji has moved on to study the representation and integration of heterogeneous nugget types, and in particular how nuggets of one type can be used to refine nuggets of other types.
Outreach Activities: In the K12 REK project, supported by NSF in 2008/2009 and led by one of the PIs, we worked with K12 students on small research projects with the goal of increasing their awareness of and interest in science and technology. In this outreach context, we made an effort to expose these K12 students to the visual exploration technology studied and developed as part of this NSF grant.
Journal Publications:
Other Specific Products:
XmdvTool 8.0 Version [Released October 20, 2010] Source & Binary Releases: We have released XmdvTool 8.0; the Windows and Linux/Unix versions can be found at SourceForge, linked off our project page. The salient features of XmdvTool 8.0 are as follows:
New software architecture: The new system is based on the information visualization reference model (or visualization pipeline) developed by Ed Chi (a toy sketch of this pattern appears after this list). For more details on our extensions to this pipeline, namely our Operator-Centric Design Patterns for Information Visualization, please see our research paper in VDA 2010.
New development environment: We have ported XmdvTool to Eclipse, using Qt for the UI to enhance portability.
Multiple views: Users can open multiple datasets at once and observe each dataset in multiple sub-windows with different visualizations. These windows can be tiled and/or cascaded.
Color strategy: With a new color strategy dialog, users can assign colors to data points based on data values or different orderings. We support sequential, diverging, and qualitative color maps based on Cynthia Brewer's work.
CSV file support: Users can open comma-separated values (CSV) files directly in XmdvTool, in addition to the XmdvTool native file format (.okc).
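As an aside, the operator-centric pipeline pattern mentioned above can be conveyed with a toy sketch: a chain of operators, each consuming and producing a data table, with views sitting at the end of the composable pipeline. The version below is in Python purely for illustration (XmdvTool itself is C++/Qt), and all operators shown are hypothetical.

```python
# Toy sketch of an operator-centric visualization pipeline: each
# operator maps a table to a table, so stages compose freely.
from typing import Callable, Dict, List

Table = List[Dict[str, float]]
Operator = Callable[[Table], Table]

def make_filter(pred) -> Operator:
    """Operator that keeps only rows satisfying a predicate."""
    return lambda table: [row for row in table if pred(row)]

def make_normalize(col: str) -> Operator:
    """Operator that rescales one column to [0, 1]."""
    def op(table: Table) -> Table:
        vals = [r[col] for r in table]
        lo, hi = min(vals), max(vals)
        return [{**r, col: (r[col] - lo) / ((hi - lo) or 1.0)} for r in table]
    return op

def run_pipeline(table: Table, ops: List[Operator]) -> Table:
    for op in ops:            # each operator transforms the table in turn
        table = op(table)
    return table

data = [{"mpg": 18.0}, {"mpg": 30.0}, {"mpg": 24.0}]
out = run_pipeline(data, [make_filter(lambda r: r["mpg"] > 20), make_normalize("mpg")])
print(out)   # normalized rows with mpg > 20, ready for a view stage
```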
This software has already been released as freeware on our XMDV project webpage: http://davis.wpi.edu/xmdv/
http://davis.wpi.edu/~xmdv
This is the project web site. Copies of most papers, as well as the code, documentation, and datasets, are available here.
Contributions:
Contributions within Discipline:
Within the visualization field, we have developed a new approach to visual data analysis by creating views of the model space that are interactively linked to the corresponding data views. One can thus indicate a particular model among thousands of possibilities and see which data does or does not fit that model. This is a powerful tool in situations where multiple distinct phenomena are present in the data, as the analyst can interactively segment the data based on the fit of the models. In year 1 this focused on linear regression models, in year 2 on classifiers (association rules and prototype-based classification), and in year 3 on neighborhood techniques/sensitivity analysis and time-series patterns.
In year 1 we also contributed to the data modeling and management field by creating a new representation of high-dimensional objects that we call generalized hyper-cylinders. This compact representation can describe cluster shapes and is useful not only for visualizing a cluster but also for computing changes in clusters and even specifying queries on high-dimensional data.
In year 2 we explored representations that can be used to seamlessly analyze nuggets extracted via different mechanisms (automated, manual). For example, this allows us to compare the results of different clustering algorithms and association rule mining methods against subsets of data isolated via interactive visual analysis. It is also a useful mechanism for combining results from multiple analysts working on the same dataset. In addition, we developed nearness functions for efficiently comparing nuggets that accurately capture human intuition (as verified via a case study) about the closeness of these concepts, both in terms of query specification and implied data content (a toy sketch of one such measure appears below). Several algorithms for implementing these functions efficiently have been developed and then employed for nugget consolidation and cleanup.
In year 3 we created a new multi-view framework for the visual exploration of nuggets at different levels of abstraction, including the raw data, bins in a discretized version of the data space, hyperboxes, and views specific to the extraction process (e.g., clusters, rules, user-identified patterns). We also extended our work to time-series data, using short, potentially overlapping subsequences of the data to represent shapes in N-dimensional space. These 'temporal nuggets' can then be the focus of analysis in terms of similarities and variations in shape, and can be used to locate repeated and unusual patterns in the data.
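As one illustration, below is a toy sketch of a plausible nearness measure for hyperbox-shaped nuggets, using a Jaccard-style per-dimension interval overlap. The project's actual nearness functions were validated against human judgment; this particular formula is only an assumption for illustration.

```python
# Sketch of a nearness function for hyperbox nuggets: average of
# per-dimension intersection-over-union of the boxes' extents.
def overlap_1d(a, b):
    """Length of intersection over length of union for two intervals."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    inter = max(0.0, hi - lo)
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 1.0

def nearness(box_a, box_b):
    """Average per-dimension interval overlap between two hyperboxes."""
    dims = box_a.keys() & box_b.keys()
    return sum(overlap_1d(box_a[d], box_b[d]) for d in dims) / len(dims)

a = {"mpg": (20.0, 30.0), "weight": (1500.0, 2500.0)}
b = {"mpg": (25.0, 35.0), "weight": (2000.0, 3000.0)}
print(round(nearness(a, b), 3))   # higher means more similar; usable for consolidation
```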
Technology for discovering nuggets and patterns within general data spaces has the potential to contribute to multiple disciplines by giving analysts tools to conduct their scientific explorations more effectively.
As indicated earlier, three Ph.D. students and three undergraduate (REU) students have been trained in state-of-the-art technology as part of this project.
We distribute the software generated by our research to the public on a regular basis. Researchers at several universities and research labs use our tools in their work, and educators at numerous schools use our software in their courses. We also provide a repository of data sets that we have collected and posted on our web site; many researchers in visualization, data mining, and statistics have used and continue to use these data sets.
Exploratory data analysis touches nearly every aspect of our society from medicine to manufacturing to homeland security. Interactive visualization of data, models, and reasoning processes has been recognized as a critical technology in all these fields. Over the years, techniques we have developed have been integrated into commercial visualization tools, such as Tableau and Spotfire, which are being used in a wide range of disciplines.
Conference Proceedings:
Special Requirements for Annual Project Report: