Project Reporting ANNUAL REPORT FOR AWARD # 1117139

Worcester Polytech Inst
CGV: Small: Model-Driven Visual Analytics on Streams

Participant Individuals:
Principal Investigator(s) : Matthew O Ward
Co-Principal Investigator(s) : Elke A Rundensteiner; Huong N Higgins
Graduate student(s) : Kaiyu Zhao; Xika Lin; Maryam Hasan

Partner Organizations:

Other collaborators:

We have had discussions with Frank Hoy, director of the WPI Entrepreneurship
Program, about meeting with  members of his consortium to discuss the
technology we're developing and about soliciting their input.  

These discussions are continuing. Our joint conclusion, however, was that
the first assessments should be undertaken by the co-PI on the project who
works in our targeted domain area, namely, business and fraud. Under this
plan, we are identifying and preparing real-world datasets of actual fraud
companies from official data sources. At a later stage, we will also invite
others in the larger WPI community to join our discussions and to help
guide and evaluate the technology, both to assess its effectiveness and
utility and to assure that it is maximally useful for the targeted purposes.

Activities and findings:

Research and Education Activities: 
The objective of this proposed research is to design, develop, and assess visual analytics technology to support risk assessment and monitoring in an environment characterized by a) high-speed streaming data as well as vast archives of historical data, b) numerous competing or complementary models for capturing patterns of interest, and c) significant societal importance for achieving decisions swiftly. Our target applications are financial risk and fraud analysis, though the resulting technology should be broadly applicable across other domains and problems. We originally broke the project into 5 major tasks:

1. Model Formation: creating computational methods for extracting patterns and computing models based on streaming data.
2. Model Management: developing methods to annotate and fine-tune models, both automatically and with human involvement.
3. Model Change Analysis: creating mechanisms to monitor how well the models fit the data over time and to determine when models should be modified or replaced.
4. Model-Driven History Analytics: using compressed versions of historical data to allow users to find and analyze past patterns that are similar to current data and model characteristics.
5. Assessment: involving domain experts in all stages of design, development, and evaluation.

In fact, it was the 5th task, continuous involvement of a domain expert, that has led to some modifications in our initially designed set of tasks. Rather than emphasizing primarily streaming data, we have ended up focusing on time-series data at a coarser granularity (annual, quarterly, monthly, weekly, daily), as this tends to be the resolution at which fraud and risk are best analyzed. Thus a significant part of our activity during year one was to study algorithms and issues within time-series analysis as well as the types of models in current use within the financial analytics community.
In particular, we have designed and implemented a number of methods for measuring and visualizing similarities among multivariate time series datasets. Given that each company has hundreds of values associated with it at each time period, we investigated ways of searching this space for interesting patterns.

One activity involved designing sequences of views, each with different data and/or models as well as different visual mappings, and having the user identify patterns of interest by selecting them on the screen. We then identified the companies that were most frequently selected as interesting (usually because their behavior, as reflected in the data, diverged from the main population). In each view, the interestingness of each company from previous views is conveyed in the data, and thus companies in the proximity of one or more interesting points could become classified as interesting. This manual search, selection, and scoring inspired us to think more about the importance of associations across multiple models.

Another major activity involved the study of ensemble models, where multiple models are used in the analysis of a given dataset. We're interested in how to visualize these multiple models, which may number in the dozens or even hundreds, as well as how to choose which ones are performing best on different regions of the data space. This has led to the study and development of alternate partitioning strategies for the dataset (both automated and user-guided) so that for each partition, one can make a decision about which models are doing well and which aren't.
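The per-partition comparison of ensemble members can be illustrated with a minimal sketch. This is not the project's implementation: the function name best_model_per_partition, the use of mean absolute error as the quality measure, and the toy records and models are all assumptions made for illustration only.

```python
from collections import defaultdict

def best_model_per_partition(records, partition_of, models):
    """Pick, for each partition, the model with the lowest mean absolute error.

    records      : list of (features, target) pairs
    partition_of : function mapping features to a partition key (hypothetical)
    models       : dict of model name -> prediction function (hypothetical)
    """
    # errors[partition][model] collects absolute errors of that model there.
    errors = defaultdict(lambda: defaultdict(list))
    for features, target in records:
        key = partition_of(features)
        for name, predict in models.items():
            errors[key][name].append(abs(predict(features) - target))

    best = {}
    for key, per_model in errors.items():
        mean_err = {name: sum(v) / len(v) for name, v in per_model.items()}
        best[key] = min(mean_err, key=mean_err.get)
    return best

# Toy data, invented for the example: two partitions, two candidate models.
records = [
    ({"size": "small", "x": 1.0}, 2.0),
    ({"size": "small", "x": 2.0}, 4.0),
    ({"size": "large", "x": 1.0}, 6.0),
    ({"size": "large", "x": 2.0}, 7.0),
]
models = {
    "double": lambda f: 2.0 * f["x"],
    "plus_five": lambda f: f["x"] + 5.0,
}
best = best_model_per_partition(records, lambda f: f["size"], models)
print(best)
```

In this toy case a different model wins in each partition, which is the situation that motivates choosing models per region of the data space rather than globally. A user-guided partitioning would simply supply a different partition_of function.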

Findings:
Key findings from our work in year 1 include:

1. There are distinct differences among domain analysts regarding global and local analysis. Some are looking only for 'big picture' results, that is, findings that are applicable across an entire population, while others are interested in very local phenomena that hold only among a small set of very similar objects (for example, companies). We have designed several customized views that support both types of analysis, both through effective grouping and through compact indicators of the relative contribution of an object to a given group's impact factor. We have also developed a number of supporting interactions on these views that let an analyst swiftly explore the displayed groups by re-grouping and adjusting indicators.

2. We have learned about different classes of fraudulent behaviors as observed in the company context. There are many flavors of fraudulent behavior, including both under-reporting and over-reporting. There are also many factors that can lead to a company's bankruptcy that aren't always included in the data. This limits the confidence with which a purely data-driven automated analysis can make an assessment. Clearly, a qualified analyst who knows the domain well must be involved in the analysis process. Moreover, the results of analysis achieved with our tool would help to whittle down a large number of companies to a small number of suspicious companies that then should be further explored by collecting other types of data that may only be obtainable by actually visiting the company and getting access to its accounting books.

3. Given that local contextualized versus global models can lead to completely distinct observations, we have designed index structures that produce a compact representation of diverse models. This model index can be used to efficiently extract and analyze rather diverse submodels at interaction speed.

Training and Development:
One Ph.D. student has been involved in this project on a full-time basis from the very start of the project (for 1 year now). He has been studying a broad range of topics, including financial analytics, time-series analysis and visualization, and ensemble models in data mining. He has also greatly improved his understanding of R, Qt, and Eclipse, as well as his C++ programming skills. Two additional Ph.D. students have been or are becoming involved part-time in this project. As a first step, they are learning the software development technology we are utilizing, in particular, C++, Eclipse, etc. One of them has focused on compact representations for models that can be efficiently extracted based on different model parameters, while the other is exploring alternate (unstructured) stream data sources for extracting additional knowledge that could be exploited for the identification of fraudulent behaviors.

Outreach Activities:
None to date.

Journal Publications:

Book(s) or other one-time publication(s):
Kaiyu Zhao, Matthew Ward, Elke Rundensteiner, and Huong Higgins, "Progressive Grouping: Using Multiple Models and Views to Discover Similar Objects", (2012). Conference paper. Currently in revision for resubmission to another venue, EuroVis.
D. Yang, E. Rundensteiner, and M. Ward, "Summarization and Matching of Density-Based Clusters in Streaming Environments", PVLDB, 5(2):121-132, (2012). Conference Proceedings. Accepted.
Zhenyu Guo, Matthew O. Ward, Elke A. Rundensteiner, and C. Ruiz, "Evaluation of a Pointwise Local Visual Exploration Method", Tsinghua Science and Technology, Special Issue on Visualization and Computer Graphics, (2012). Proceedings. Accepted in June 2012.

Other Specific Products:


Contributions:

Contributions within Discipline:

We have developed novel methods for combining measures of interestingness
for data objects across multiple views of the data and models extracted
from the data, i.e., applying the notion of model ensembles in an innovative
manner and in a new, important context (fraud).

We have developed new ways of visualizing degrees of similarity among
data objects based on their interestingness scores, along with linked
interactions among data and model views.
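
One way to picture the combination of interestingness across views is the
following minimal sketch. It is an illustration under stated assumptions,
not the method as implemented in the project: the function update_scores,
the 2-D view coordinates, the radius, and the neighbor weight of 0.5 are
all hypothetical choices.

```python
import math

def update_scores(scores, positions, selected, radius=1.0, neighbor_weight=0.5):
    """Carry interestingness scores forward through one more view.

    scores    : dict of company -> accumulated score from previous views
    positions : dict of company -> (x, y) layout position in this view
    selected  : set of companies the analyst selected in this view
    Selected companies gain 1.0; unselected companies lying within `radius`
    of any selected company gain `neighbor_weight` (an assumed rule).
    """
    new = dict(scores)
    for name, pos in positions.items():
        if name in selected:
            new[name] = new.get(name, 0.0) + 1.0
        elif any(math.dist(pos, positions[s]) <= radius for s in selected):
            new[name] = new.get(name, 0.0) + neighbor_weight
        else:
            new.setdefault(name, 0.0)
    return new

# Two toy views of the same three companies; positions differ per view
# because each view uses different data, models, or visual mappings.
views = [
    ({"A": (0.0, 0.0), "B": (0.5, 0.0), "C": (5.0, 5.0)}, {"A"}),
    ({"A": (1.0, 1.0), "B": (4.0, 4.0), "C": (4.2, 4.0)}, {"A", "C"}),
]
scores = {}
for positions, selected in views:
    scores = update_scores(scores, positions, selected)

ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

In this sketch, company B is never selected directly but still accumulates
score because it lies near a selected company in both views, which mirrors
the idea that objects in the proximity of interesting points can themselves
become classified as interesting.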



Contributions to Other Disciplines:
With help from our domain expert in fraud, we have started to apply the
prototyped technology within 2 areas of the financial analytics discipline.


The first is to study patterns of bankruptcy over time to identify candidate
companies whose data patterns are similar to those of bankrupt companies
and which thus could be classified as having a higher risk than other
companies in the data set.  For this we have selected, extracted, and
cleaned several data sets from the domain with actual bankruptcy cases.

The second is to use databases of companies convicted of fraudulent financial
practices to identify regions in the data space (over time) that are characteristic
of companies later found to have committed fraud.  Companies occupying
similar regions in the data space, i.e., exhibiting similar properties,
are considered candidates for investigation in this approach.
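
The idea of flagging companies that occupy regions of the data space near
known cases can be sketched as follows. This is a simplified illustration
only: it ignores the time dimension, uses standardized Euclidean distance
as an assumed similarity measure, and the function rank_candidates, the
company names, and the feature values are invented for the example.

```python
import math

def rank_candidates(features, known_cases):
    """Rank unlabeled companies by distance to the nearest known case.

    features    : dict of company -> tuple of numeric attributes
    known_cases : set of companies with a known outcome (e.g. fraud)
    Attributes are z-score standardized so no single one dominates.
    """
    names = list(features)
    dims = len(next(iter(features.values())))
    means = [sum(features[n][d] for n in names) / len(names) for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((features[n][d] - means[d]) ** 2 for n in names) / len(names)
        stds.append(math.sqrt(var) or 1.0)  # guard against constant columns
    z = {n: [(features[n][d] - means[d]) / stds[d] for d in range(dims)]
         for n in names}

    candidates = [n for n in names if n not in known_cases]
    nearest = {n: min(math.dist(z[n], z[k]) for k in known_cases)
               for n in candidates}
    return sorted(candidates, key=nearest.get)

# Toy values, invented for illustration; "F1" plays the known case.
features = {
    "F1": (0.9, 0.1),
    "X": (0.8, 0.2),
    "Y": (0.1, 0.9),
    "Z": (0.2, 0.8),
}
ranked = rank_candidates(features, {"F1"})
print(ranked)
```

Companies at the head of the ranking would be candidates for closer
investigation by an analyst; the ranking alone is not evidence of fraud.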

Contributions to Education and Human Resources:
 One Ph.D. student, Kaiyu Zhao, has been involved in this project from
the very start.  He has been making good progress, and will have 2 papers
submitted during 2012.  

We have also been working with other students throughout this time period,
involving them on a part-time basis while they take courses, work
on other milestones in our graduate program, and learn the foundations.
The first student left after completing his M.S. degree. We continue to
identify other students and to involve them in order to bring them up to
speed and gauge their abilities and depth of interest. Their involvement has
allowed the students to learn new software engineering skills and get
exposed to research by attending our weekly research meetings and participating
in our discussions.  We expect to bring one of the students full-time
into this project going forward.

Contributions to Resources for Science and Technology:
 Nothing to date, but our plan is to release the resulting code into the
public domain to allow others to build on it as well as use it for instructional
purposes within courses on data visualization and financial analytics.

Contributions Beyond Science and Engineering:
Detection of fraudulent behavior of companies or other entities has the
potential to be of significant societal benefit, affecting individuals
who might otherwise suffer the financial and other forms of misfortune
that such fraud typically brings.


Conference Proceedings:

Special Requirements for Annual Project Report:


Categories for which nothing is reported:
Participants: Partner organizations
Products: Journal Publications
Products: Other Specific Product
Products: Internet Dissemination
Conference Proceedings
Special Reporting Requirements
Animal, Human Subjects, Biohazards

