Partner Organizations:
Other collaborators:
We have had discussions with Frank Hoy, director of the WPI Entrepreneurship Program, about meeting with members of his consortium to discuss the technology we are developing and to solicit their input. These discussions are continuing. Our joint conclusion, however, was that the first assessments should be undertaken by the co-PI on the project whose expertise lies in our targeted domain area, namely business and fraud. Given this plan, we are identifying and preparing real-world datasets of actual fraud companies from official data sources. At a later stage, we will also work with others in the larger WPI community, inviting them to join our discussions and to help guide and evaluate the technology, both to assess its effectiveness and utility and to ensure it is maximally useful for the targeted purposes.
Activities and findings:
Research and Education Activities: The objective of this proposed research is to design, develop, and assess visual analytics technology to support risk assessment and monitoring in an environment characterized by a) high-speed streaming data as well as vast archives of historical data, b) numerous competing or complementary models for capturing patterns of interest, and c) significant societal importance for achieving decisions swiftly. Our target applications are financial risk and fraud analysis, though the resulting technology should be broadly applicable across other domains and problems. We originally broke the project into 5 major tasks: 1. Model Formation: creating computational methods for extracting patterns and computing models based on streaming data. 2. Model Management: developing methods to annotate and fine-tune models, both automatically and with human involvement. 3. Model Change Analysis: creating mechanisms to monitor how well the models fit the data over time and determine when models should be modified or replaced. 4. Model-Driven History Analytics: using compressed versions of historical data to allow users to find and analyze past patterns that are similar to current data and model characteristics. 5. Assessment: involving domain experts in all stages of design, development, and evaluation. In fact, it was the 5th task, continuous involvement of a domain expert, that led to some modifications in our initially designed set of tasks. Rather than emphasizing primarily streaming data, we have ended up focusing on time-series data at a coarser granularity (annual, quarterly, monthly, weekly, daily), as this tends to be the resolution at which fraud and risk are best analyzed. Thus a significant part of our activity during year one was to study algorithms and issues within time-series analysis as well as the types of models in current use within the financial analytics community.
In particular, we have designed and implemented a number of methods for measuring and visualizing similarities among multivariate time series datasets. Given that each company has hundreds of values associated with it at each time period, we investigated ways of searching this space for interesting patterns. One activity involved designing sequences of views, each with different data and/or models as well as different visual mappings, and having the user identify patterns of interest by selecting them on the screen. We then identified the companies that were most frequently selected as interesting (usually because their behavior, as reflected in the data, diverged from the main population). In each view, the interestingness of each company from previous views is conveyed in the data, and thus companies in the proximity of one or more interesting points could become classified as interesting. This manual search, selection, and scoring inspired us to think more about the importance of associations across multiple models. Another major activity involved the study of ensemble models, where multiple models are used in the analysis of a given dataset. We're interested in how to visualize these multiple models, which may number in the dozens or even hundreds, as well as how to choose which ones are performing best on different regions of the data space. This has led to the study and development of alternate partitioning strategies for the dataset (both automated and user-guided) so that for each partition, one can make a decision about which models are doing well and which aren't.
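The per-partition model assessment described above can be illustrated with a minimal sketch. It is not the project's implementation: the function name, the input layout, and the use of mean absolute error as the quality measure are all assumptions made for illustration, and the partition labels are taken as given (whether produced automatically or by the user).

```python
from collections import defaultdict


def best_model_per_partition(partition_ids, actuals, predictions):
    """Pick, for each partition, the model with the lowest mean absolute error.

    partition_ids: partition label for each observation (hypothetical input)
    actuals:       observed value for each observation
    predictions:   dict mapping model name -> list of predicted values
    Returns a dict mapping partition label -> (best model name, its error).
    """
    # errors[partition][model] collects absolute errors for that pair.
    errors = defaultdict(lambda: defaultdict(list))
    for i, part in enumerate(partition_ids):
        for name, preds in predictions.items():
            errors[part][name].append(abs(preds[i] - actuals[i]))

    best = {}
    for part, per_model in errors.items():
        mean_err = {name: sum(e) / len(e) for name, e in per_model.items()}
        winner = min(mean_err, key=mean_err.get)
        best[part] = (winner, mean_err[winner])
    return best


if __name__ == "__main__":
    parts = ["a", "a", "b", "b"]
    actual = [1, 2, 10, 20]
    preds = {"m1": [1, 2, 0, 0], "m2": [5, 5, 10, 20]}
    print(best_model_per_partition(parts, actual, preds))
```

In this toy input, one model fits partition "a" well and the other fits partition "b" well, which is the situation that motivates choosing models per partition rather than globally.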
Findings: Key findings from our work in year 1 include: 1. There are distinct differences among domain analysts regarding global and local analysis. Some are looking only for 'big picture' results, that is, findings that are applicable across an entire population, while others are interested in very local phenomena that hold only among a small set of very similar objects (for example, companies). We have designed several customized views that support both types of analysis, both through effective grouping and through compact indicators of the relative contribution of an object to a given group's impact factor. We have also developed a number of supporting interactions on these views that let an analyst swiftly analyze and explore the displayed groups by re-grouping and adjusting indicators. 2. We have learned about different classes of fraudulent behaviors as observed in the company context. There are many flavors of fraudulent behavior, including both under-reporting and over-reporting. There are also many factors that can lead to a company's bankruptcy that aren't always included in the data. This limits the confidence with which a purely data-driven automated analysis can make an assessment. Clearly, a qualified analyst who knows the domain well must be involved in the analysis process. Moreover, the results of analysis achieved with our tool would help to winnow a large number of companies down to a small number of suspicious companies that should then be explored further by collecting other types of data that may only be obtained by actually visiting the company and getting access to its accounting books. 3. Given that local contextualized versus global models can lead to completely distinct observations, we have designed index structures that produce a compact representation of diverse models. This model index can be used to efficiently extract and analyze quite diverse submodels at interactive speed.
Training and Development: One Ph.D. student has been involved in this project on a full-time basis from the very start of the project (for 1 year now). He has been studying a broad range of topics, including financial analytics, time-series analysis and visualization, and ensemble models in data mining. He has also greatly improved his understanding of R, Qt, and Eclipse, as well as his C++ programming skills. Two additional Ph.D. students have been or are becoming involved part-time in this project. As a first step, they are learning the software development technology we are utilizing, in particular C++, Eclipse, etc. One of them has focused on compact representations for models that can be efficiently extracted based on different model parameters, while the other is exploring alternate (unstructured) stream data sources for extracting additional knowledge that could be exploited for the identification of fraudulent behaviors.
Outreach Activities: None to date.
Journal Publications:
Other Specific Products:
Contributions within Discipline:
We have developed novel methods for combining measures of interestingness for data objects across multiple views of the data and models extracted from the data, i.e., applying the notion of model ensembles in an innovative manner and in a new, important context (fraud). We have developed new ways of visualizing degrees of similarity among data objects based on their interestingness scores, along with linked interactions among data and model views.
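One simple way to combine interestingness across views is sketched below. It is an illustrative assumption, not the method developed in the project: each view contributes the set of companies the analyst selected, and a company's score is the fraction of views in which it was selected. The function names and the tie-breaking rule are hypothetical.

```python
from collections import Counter


def interestingness_scores(selections_per_view, all_companies):
    """Score each company by the fraction of views in which it was selected.

    selections_per_view: list of iterables, one per view, holding the
                         companies the analyst marked as interesting
    all_companies:       every company in the dataset
    """
    counts = Counter()
    for selected in selections_per_view:
        # A company counts at most once per view.
        counts.update(set(selected))
    n_views = len(selections_per_view)
    return {c: (counts[c] / n_views if n_views else 0.0) for c in all_companies}


def rank_interesting(scores, top_k):
    """Return the top_k companies, highest score first, ties broken by name."""
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_k]


if __name__ == "__main__":
    views = [["A", "B"], ["B"], ["B", "C"]]
    scores = interestingness_scores(views, ["A", "B", "C", "D"])
    print(rank_interesting(scores, 2))
```

A fuller version would also raise the score of companies that lie near already-interesting points in a view; that proximity step is omitted here because it depends on the view's layout.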
With help from our domain expert in fraud, we have started to apply the prototyped technology within 2 areas of the financial analytics discipline. The first is to study patterns of bankruptcy over time to identify candidate companies whose data patterns are similar to those of bankrupt companies and which thus could be classified as having a higher risk than other companies in the data set. For this we have selected, extracted, and cleaned several data sets from the domain with actual bankruptcy cases. The second is to use databases of companies convicted of fraudulent financial practices to identify regions in the data space (over time) that are characteristic of companies later found to have committed fraud. Companies occupying similar regions in the data space, i.e., exhibiting similar properties, are considered candidates for investigation in this approach.
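The idea of flagging companies whose data patterns resemble those of bankrupt companies can be sketched as a nearest-reference ranking over multivariate time series. This is a minimal illustration under stated assumptions: the distance (Euclidean distance per period, averaged over periods), the function names, and the input layout are all hypothetical, and real financial data would need its features normalized first.

```python
import math


def series_distance(a, b):
    """Mean per-period Euclidean distance between two multivariate series.

    a, b: lists of per-period feature vectors covering the same periods.
    """
    if len(a) != len(b):
        raise ValueError("series must cover the same number of periods")
    total = 0.0
    for va, vb in zip(a, b):
        total += math.sqrt(sum((x - y) ** 2 for x, y in zip(va, vb)))
    return total / len(a)


def rank_by_risk(candidates, bankrupt):
    """Order candidates by distance to their closest known-bankrupt company.

    candidates, bankrupt: dicts mapping company name -> multivariate series.
    Returns (name, distance) pairs, closest (highest risk) first.
    """
    scored = {
        name: min(series_distance(series, ref) for ref in bankrupt.values())
        for name, series in candidates.items()
    }
    return sorted(scored.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    bankrupt = {"X": [[0, 0], [0, 0]]}
    candidates = {"near": [[0, 1], [0, 1]], "far": [[3, 4], [3, 4]]}
    print(rank_by_risk(candidates, bankrupt))
```

A ranking of this kind only produces candidates for investigation; as noted in the findings, a domain analyst still has to judge each case.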
One Ph.D. student, Kaiyu Zhao, has been involved in this project from the very start. He has been making good progress and will have 2 papers submitted during 2012. We have also been working with other students throughout this time period, involving them on a part-time basis while they take courses, work on other milestones in our graduate program, and learn the foundations. The first such student left after completing his M.S. degree. We continue to identify other students and to involve them, both to bring them up to speed and to gauge their abilities and depth of interest. Their involvement has allowed the students to learn new software engineering skills and to gain exposure to research by attending our weekly research meetings and participating in our discussions. We expect to bring one of the students full-time into this project going forward.
Nothing to date, but our plan is to release the resulting code into the public domain to allow others to build on it as well as use it for instructional purposes within courses on data visualization and financial analytics.
Detection of fraudulent behavior by companies or other entities has the potential to be of significant societal benefit, protecting individuals who might otherwise suffer the financial and other forms of misfortune that such fraud typically brings.
Conference Proceedings:
Special Requirements for Annual Project Report: