We have had discussions with Frank Hoy, director of the WPI Entrepreneurship
Program, about meeting with members of his consortium to discuss the
technology we're developing and about soliciting their input.
These discussions are continuing. Our joint conclusion, however, was to undertake
the first assessments with the co-PI on the project, whose expertise lies in our
targeted domain area, namely, business and fraud. In line with this plan, we have
begun to identify and prepare real-world datasets of actual fraud companies from
official data sources. At a later stage, we will also work with others in the
larger WPI community as participants who join our discussions, help guide and
evaluate the technology, and assess its effectiveness to ensure it is maximally
useful for the targeted purposes.
Activities and findings:
Research and Education Activities:
The objective of this proposed research is to design, develop, and assess
visual analytics technology to support risk assessment and monitoring
in an environment characterized by a) high-speed streaming data as well
as vast archives of historical data, b) numerous competing or complementary
models for capturing patterns of interest, and c) significant societal importance for achieving
decisions swiftly. Our target applications are financial risk and fraud
analysis, though the resulting technology should be broadly applicable
across other domains and problems.
We originally broke the project into 5 major tasks:
1. Model Formation: creation of computational methods for extracting patterns
and computing models based on streaming data.
2. Model Management: developing methods to annotate and fine tune models,
both automatically and with human involvement.
3. Model Change Analysis: creating mechanisms to monitor how well the
models fit the data over time and to determine when models should be modified.
4. Model-Driven History Analytics: using compressed versions of historical
data, allow users to find and analyze past patterns that are similar to
current data and model characteristics.
5. Assessment: involve domain experts in all stages of design, development, and evaluation.
In fact, it was the 5th task, continuous involvement of a domain expert,
that has led to some modifications in our initially designed set of tasks.
Rather than emphasizing primarily streaming data, we have ended up
focusing on time-series data at a coarser granularity (annual, quarterly,
monthly, weekly, daily), as this tends to be the resolution at which fraud
and risk are best analyzed.
Thus a significant part of our activity during year one was to study algorithms
and issues within time-series analysis as well as the types of models
in current use within the financial analytics community. In particular,
we have designed and implemented a number of methods for measuring and
visualizing similarities among multivariate time series datasets.
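For illustration, the sketch below shows one simple way such a similarity can be computed: each variable of a company's time series is z-normalized and the per-variable Euclidean distances are averaged. The names and the specific measure are illustrative assumptions, not the exact code in our prototype.

// Sketch: similarity between two multivariate time series (illustrative only;
// not necessarily the exact measure implemented in our prototype).
// Each series is a vector of time steps, each holding D variable values.
#include <cmath>
#include <cstddef>
#include <iostream>
#include <vector>

using Series = std::vector<std::vector<double>>;  // [time][variable]

// Z-normalize one variable (column) of a series in place.
static void zNormalizeColumn(Series& s, std::size_t d) {
    double mean = 0.0;
    for (const auto& row : s) mean += row[d];
    mean /= s.size();
    double var = 0.0;
    for (const auto& row : s) var += (row[d] - mean) * (row[d] - mean);
    double sd = std::sqrt(var / s.size());
    if (sd == 0.0) sd = 1.0;  // constant column: leave centered values at 0
    for (auto& row : s) row[d] = (row[d] - mean) / sd;
}

// Average per-variable Euclidean distance between two equal-length series.
double seriesDistance(Series a, Series b) {
    std::size_t T = a.size(), D = a[0].size();
    for (std::size_t d = 0; d < D; ++d) { zNormalizeColumn(a, d); zNormalizeColumn(b, d); }
    double total = 0.0;
    for (std::size_t d = 0; d < D; ++d) {
        double sum = 0.0;
        for (std::size_t t = 0; t < T; ++t) {
            double diff = a[t][d] - b[t][d];
            sum += diff * diff;
        }
        total += std::sqrt(sum);
    }
    return total / D;  // smaller value = more similar behavior over time
}

int main() {
    Series companyA = {{1.0, 10.0}, {2.0, 12.0}, {3.0, 15.0}};
    Series companyB = {{1.1, 11.0}, {2.1, 13.0}, {2.9, 14.0}};
    std::cout << "distance = " << seriesDistance(companyA, companyB) << "\n";
}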
Given that each company has hundreds of values associated with it at each
time period, we investigated ways of searching this space for interesting
patterns. One activity involved designing sequences of views, each with
different data and/or models as well as different visual mappings, and
having the user identify patterns of interest by selecting them on the
screen. We then identified the companies that were most frequently selected
as interesting (usually because their behavior, as reflected in the data,
diverged from the main population). In each view, the interestingness
that each company has accumulated from previous views is carried along with
the data, and thus companies in the proximity of one or more interesting
points can themselves become classified as interesting. This manual search,
selection, and scoring process inspired us to think more about the importance
of associations across views.
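A minimal sketch of this kind of score propagation follows, under simplifying assumptions (hypothetical names, and a plain screen-space proximity rule standing in for the actual view logic): user selections raise a company's interestingness, and companies near highly scored points in the current view inherit part of that score.

// Sketch of proximity-based interestingness propagation (hypothetical names;
// a simplified stand-in for the view-to-view scoring used in our prototype).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct PlottedCompany {
    double x, y;    // position of the company in the current view
    double score;   // interestingness accumulated from previous views
};

// User selections add to the accumulated score.
void applyUserSelection(std::vector<PlottedCompany>& pts,
                        const std::vector<std::size_t>& selected,
                        double boost = 1.0) {
    for (std::size_t i : selected) pts[i].score += boost;
}

// Companies near highly scored points inherit a fraction of that score.
void propagateByProximity(std::vector<PlottedCompany>& pts,
                          double radius, double fraction = 0.5) {
    std::vector<double> gain(pts.size(), 0.0);
    for (std::size_t i = 0; i < pts.size(); ++i) {
        for (std::size_t j = 0; j < pts.size(); ++j) {
            if (i == j) continue;
            double dx = pts[i].x - pts[j].x, dy = pts[i].y - pts[j].y;
            if (std::sqrt(dx * dx + dy * dy) <= radius)
                gain[i] = std::max(gain[i], fraction * pts[j].score);
        }
    }
    for (std::size_t i = 0; i < pts.size(); ++i) pts[i].score += gain[i];
}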
Another major activity involved the study of ensemble models, where multiple
models are used in the analysis of a given dataset. We're interested
in how to visualize these multiple models, which may number in the dozens
or even hundreds, as well as how to choose which ones are performing best
on different regions of the data space. This has led to the study and
development of alternate partitioning strategies for the dataset (both
automated and user-guided) so that for each partition, one can make a
decision about which models are doing well and which aren't.
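To make the partition-based selection concrete, the sketch below (hypothetical types, simplified relative to our prototype) scores each ensemble member on each partition and records which model performs best there.

// Sketch: pick the best-performing model for each data partition
// (hypothetical types; simplified relative to our prototype).
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct Point { std::vector<double> features; double target; };

// A "model" here is any predictor from features to a value.
using Model = std::function<double(const std::vector<double>&)>;

// Mean squared error of one model on one partition.
double mse(const Model& m, const std::vector<Point>& partition) {
    double sum = 0.0;
    for (const auto& p : partition) {
        double err = m(p.features) - p.target;
        sum += err * err;
    }
    return partition.empty() ? 0.0 : sum / partition.size();
}

// For each partition (automated or user-guided), return the index of the
// ensemble member with the lowest error on that partition.
std::vector<std::size_t> bestModelPerPartition(
        const std::vector<Model>& ensemble,
        const std::vector<std::vector<Point>>& partitions) {
    std::vector<std::size_t> best(partitions.size(), 0);
    for (std::size_t p = 0; p < partitions.size(); ++p) {
        double bestErr = std::numeric_limits<double>::max();
        for (std::size_t m = 0; m < ensemble.size(); ++m) {
            double e = mse(ensemble[m], partitions[p]);
            if (e < bestErr) { bestErr = e; best[p] = m; }
        }
    }
    return best;
}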
Key findings from our work in year 1 include:
1. There are distinct differences among domain analysts regarding global
and local analysis. Some are looking only for 'big picture' results -
findings that are applicable across an entire population, while others
are interested in very local phenomena, namely, patterns that hold only among
a small set of very similar objects (for example, companies). We have designed
several customized views that support both types of analysis, both through
effective grouping and through compact indicators of the relative contribution
of an object to a given group's impact factor. We have also developed a number
of supporting interactions on these views that enable an analyst to swiftly
analyze and explore the displayed groups by re-grouping them and adjusting
the indicators.
2. We have learned about different classes of fraudulent behaviors as
observed in the company context. There are many flavors of fraudulent
behavior, including both under-reporting and over-reporting. There are
also many factors that can lead to a company's bankruptcy that aren't
always included in the data. This makes a purely data-driven automated
analysis somewhat limited in the confidence with which one can make an assessment.
Clearly, a qualified analyst who knows the domain well needs to
be involved in the analysis process. Moreover, the results of analysis
achieved with our tool would help to whittle down a large number of companies
to a small number of suspicious companies that should then be further
explored - by collecting other types of data that may only be obtained by
actually visiting the company and gaining access to its accounting books.
3. Given that local contextualized versus global models can lead to completely
distinct observations, we have designed index structures that produce
a compact representation of diverse models. This model index can be used
to efficiently extract and analyze rather diverse submodels at interaction time.
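As a rough illustration only (the names and structure are assumptions, not the actual index design), such an index can be thought of as a map from a partition key to a compact submodel summary, so the submodel relevant to the analyst's current focus can be retrieved without rescanning the data.

// Sketch: a compact index over many submodels, keyed by data partition
// (hypothetical structure; the actual index design is more elaborate).
#include <cstddef>
#include <unordered_map>
#include <utility>
#include <vector>

struct SubModel {
    std::vector<double> coefficients;  // compact summary, e.g. linear weights
    double fitError;                   // how well this submodel fits its partition
};

class ModelIndex {
public:
    void insert(std::size_t partitionId, SubModel m) {
        index_[partitionId] = std::move(m);
    }
    // Retrieve the submodel for the partition the analyst is focused on,
    // without rescanning the underlying data.
    const SubModel* lookup(std::size_t partitionId) const {
        auto it = index_.find(partitionId);
        return it == index_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::size_t, SubModel> index_;
};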
Training and Development:
One Ph.D. student has been involved in this project on a full-time basis
from the very start of the project (for 1 year now). He has been studying
a broad range of topics, including financial analytics, time-series analysis
and visualization, and ensemble models in data mining. He has also greatly
improved his understanding of R, Qt, and Eclipse, as well as his C++ programming skills.
Two additional Ph.D. students have been or are becoming involved part-time
in this project. To start, they learn about the software development
technology we are utilizing, in particular, C++, Eclipse, etc. One of
them has focused on compact representations for models that can be efficiently
extracted based on different model parameters, while the other is exploring
alternate (unstructured) stream data sources for extracting additional
knowledge that could be exploited for the identification of fraudulent behavior.
None to date.
Other Specific Products:
Contributions within Discipline:
We have developed novel methods for combining measures of interestingness
for data objects across multiple views of the data and models extracted
from the data, i.e., applying the notion of model ensembles in an innovative
manner and in a new, important context (fraud).
We have developed new ways of visualizing degrees of similarity among
data objects based on their interestingness scores, along with linked
interactions among data and model views.
With help from our domain expert in fraud, we have started to apply the
prototyped technology within 2 areas of the financial analytics discipline.
The first is to study patterns of bankruptcy over time to identify candidate
companies whose behavior, as reflected in the data, is similar to that of bankrupt
companies and which thus could be classified as having a higher risk than other
companies in the data set. For this we have selected, extracted, and
cleaned several data sets from the domain with actual bankruptcy cases.
The second is to use databases of companies convicted of fraudulent financial
practices to identify regions in the data space (over time) that are characteristic
of companies later found to have committed fraud. Companies occupying
similar regions in the data space, i.e., exhibiting similar properties,
are considered candidates for investigation in this approach.
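The sketch below (hypothetical names, and a strong simplification of the actual approach) illustrates this second use case: a company is flagged as a candidate for investigation if its feature vector lies within a chosen distance of at least one company known to have committed fraud.

// Sketch: flag companies whose feature vectors lie close to known fraud cases
// (hypothetical names; a simplification of the region-based approach above).
#include <cmath>
#include <cstddef>
#include <vector>

using Features = std::vector<double>;

static double euclidean(const Features& a, const Features& b) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Returns the indices of candidate companies that fall within 'threshold'
// of at least one known fraudulent company in the data space.
std::vector<std::size_t> flagCandidates(const std::vector<Features>& candidates,
                                        const std::vector<Features>& knownFraud,
                                        double threshold) {
    std::vector<std::size_t> flagged;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        for (const auto& f : knownFraud) {
            if (euclidean(candidates[i], f) <= threshold) {
                flagged.push_back(i);
                break;
            }
        }
    }
    return flagged;
}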
One Ph.D. student, Kaiyu Zhao, has been involved in this project from
the very start. He has been making good progress, and will have 2 papers
submitted during 2012.
We have also been working with other students throughout this time period
-- involving them on a part-time basis while they take courses, work
on other milestones in our graduate program, and learn the foundations.
The first student left after completing his M.S. degree. We continue to
identify other students and involve them in order to bring them up to speed
and gauge their abilities and depth of interest. Their involvement has
allowed the students to learn new software engineering skills and get
exposed to research by attending our weekly research meetings and participating
in our discussions. We expect to bring one of the students full-time
into this project going forward.
Nothing to date, but our plan is to release the resulting code into the
public domain to allow others to build on it as well as use it for instructional
purposes within courses on data visualization and financial analytics.
Detection of fraudulent behavior of companies or other entities has the
potential to be of significant societal benefit, affecting individuals
who otherwise might suffer from the financial and other forms of misfortune
that such fraud typically brings.
Special Requirements for Annual Project Report: