Contact us:

Email: d s r g {at} cs wpi edu

Group meeting:
Time: Tuesdays 12-1pm
Place: Fuller Labs 311
All are welcome to attend!


Computer Science Department
Data Science Program
Worcester Polytechnic Institute
100 Institute Road
Worcester, MA 01609


Data Science Research Group (DSRG)

Welcome to the WPI Data Science Research Group. We are a group of faculty, researchers, and students working on database projects. Our group focuses on research and project work related to very large database and information systems in support of advanced applications in business, engineering, and the sciences, including large-scale data analytics, scientific data management, annotation and provenance management, and multi-dimensional query processing and optimization. Ongoing projects include intelligent event analytics, scalable data stream processing systems, map-reduce technologies, biological databases, stream mining and discovery, large-scale visual information exploration, medical process tracking, and distributed heterogeneous information sources, to name just a few. We strive to build software systems that evaluate the feasibility of our innovations and demonstrate their usefulness by applying them to real problems.

Current projects:

Data Integration
MATTERS: Massachusetts Technology, Talent, and Economic Reporting System

The Massachusetts Technology, Talent, and Economic Reporting System (MATTERS) is an online analytics dashboard powered by a dynamic data integration infrastructure. By extracting data sets from various public government data sites, the system allows users to quickly access, analyze, and visualize a number of key factors impacting the economic competitiveness of US states. This project is a collaboration between the Massachusetts High Technology Council (MHTC) and Worcester Polytechnic Institute. Under the supervision of Professor Elke Rundensteiner, students at WPI have worked with experts from the high-tech industry, research organizations, and higher-education institutions to develop this tool.

Complex event stream processing
CEA: Complex Event Analytics

Recent advances in hardware and software have enabled the capture of many kinds of measurements in a wide range of fields. Applications that generate rapid, continuous, and large volumes of event streams include readings from sensors used in physics, biology, and chemistry experiments, weather sensors, health sensors, network sensors, online auctions, credit card operations, financial tickers, web server log records, and more. Given these developments, the world is poised for a sea change in the variety, scale, and importance of applications that can be envisioned based on the real-time analysis and exploitation of such event streams for decision making, from dynamic traffic management and environmental monitoring to health care. Clearly, the ability to infer relevant patterns from these event streams in real time so as to make near-instantaneous yet informed decisions, henceforth called complex event analytics, is absolutely crucial for these mission-critical applications.

HIT: Hierarchical Instantiating Timed automaton

Real-time reactive applications, from supply chain tracking to health care data analytics, have increasingly gained in importance and complexity. To facilitate the specification of involved event-based application semantics, we introduce a novel model, HIT, that finds a middle ground between a specification composed of a large set of low-level queries and a high-level graphical workflow description. The workflow is captured by the Hierarchical Instantiating Timed automaton (HIT), while succinct queries are formulated within its states, which provide valuable context for launching query execution. HIT models an arbitrary number of event-driven sequential or concurrent hierarchical processes, as required for realizing complex real-world applications, using a succinct yet expressive specification. The effectiveness of HIT is illustrated by a full case study of an auction scenario.
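To give a rough flavor of the state-based workflow idea, the following minimal sketch drives a flat automaton over an invented auction event stream; the states, events, and transitions are illustrative only, and the sketch omits HIT's hierarchy, timing constraints, and dynamic instantiation:

```python
# Minimal flat-automaton sketch; the auction states, events, and
# transitions are invented for illustration, not HIT's actual model.
TRANSITIONS = {
    ("open", "bid"): "open",              # bids keep the auction open
    ("open", "close_auction"): "closed",
    ("closed", "settle"): "settled",
}

def run(events, state="open"):
    """Drive the automaton with an event sequence; unknown events are ignored."""
    for event in events:
        state = TRANSITIONS.get((state, event), state)
    return state

print(run(["bid", "bid", "close_auction", "settle"]))  # ends in "settled"
```

In the full HIT model, each state would additionally carry the queries to execute while it is active, and states could instantiate nested sub-automata.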

Annotations in relational databases
PrefNotes: A Framework for Personalized Annotation Management in Relational Databases

Annotations play a key role in understanding and describing data, and annotation management has become an integral component of most emerging applications, such as scientific databases. Scientists need to exchange not only data but also their thoughts, comments, and annotations on the data. Annotations represent comments, the lineage of the data, descriptions, and much more. Therefore, several annotation management techniques have been proposed to handle annotations efficiently and abstractly. However, with the increasing scale of collaboration and the extensive use of annotations among users and scientists, the number and size of the annotations may far exceed the size of the original data itself. Among the many existing annotations, different users have different preferences, and only a small number of annotations may be of interest to each user. Current annotation management techniques report all annotations to users without taking their preferences into account. We propose PrefNotes, a framework for personalized annotation propagation in relational databases. PrefNotes captures users' preferences and profiles and personalizes annotation propagation at query time by reporting the top-k most relevant annotations (per tuple) for each user. PrefNotes supports both static and dynamic profiles for each user. We propose three variants of the top-k operator, namely the fixed, proportional, and approximate proportional operators, which differ in their cost model and accuracy. PrefNotes is implemented inside PostgreSQL.
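As a toy illustration of the fixed top-k operator's behavior, the sketch below ranks each tuple's annotations against a keyword profile and keeps only the k best; the scoring function, annotation text, and profile are all invented for illustration and are not PrefNotes' actual preference model:

```python
from heapq import nlargest

def preference_score(annotation, profile):
    """Toy relevance score: count profile keywords the annotation mentions."""
    text = annotation.lower()
    return sum(1 for kw in profile if kw in text)

def top_k_annotations(annotations_per_tuple, profile, k):
    """Fixed top-k: report the k most relevant annotations per tuple."""
    return {
        tuple_id: nlargest(k, notes, key=lambda a: preference_score(a, profile))
        for tuple_id, notes in annotations_per_tuple.items()
    }

# Invented sample data: one tuple carrying three annotations.
annotations = {
    "gene_17": ["verified against curated lineage data",
                "possible transcription error in source",
                "see provenance record 42"],
}
profile = ["provenance", "lineage"]
print(top_k_annotations(annotations, profile, k=2))
```

The proportional variant would instead scale k with the number of annotations on each tuple, trading a fixed output size for fairness across heavily and lightly annotated tuples.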

InsightNotes: Supporting Annotations Beyond Propagation In Relational Databases

Scientific database systems provide backbone support to various scientific applications. In these applications, an efficient and effective annotation management mechanism is vital for sharing knowledge and establishing a collaborative environment among end-users and scientists. Annotations may represent comments on the data, provenance or lineage information, and highlights on conflicting or erroneous values. The extensive use of annotations and the expanding scale of collaboration may cause the size of the annotations to far exceed the size of the original data, making it extremely difficult for end-users to extract the useful insights and valuable knowledge hidden within the annotations. In this project, we propose the InsightNotes system, an advanced annotation management system over relational databases, for exploiting annotations in novel ways through summarization, mining, and ranking techniques, with the objective of reporting concise and meaningful representations instead of the raw annotations. InsightNotes also addresses the query processing challenges involved in building and querying such complex representations.

Scalable Data Mining Technologies
A Framework for Analyzing Text Data Streams in Social Microblogging Networks

An enormous amount of data exists at massive scale, either static or in the form of data streams, and it contains interesting and useful information. For instance, social microblogging sites such as Twitter carry large volumes of messages, some of which contain valuable information about a wide variety of real-world events. Analyzing such a data stream presents significant opportunities as well as challenges. This research project focuses on online identification of emerging trends and topics of discussion, and explores the evolution of those topics over time. Identifying trending topics on Twitter in real time is a challenging problem due to the fast evolution and large scale of the unstructured data. To tackle these challenges, the system should: (1) process the data in a single pass; (2) use a mining algorithm fast and scalable enough to handle massive data in real time; (3) execute the mining method incrementally, in an online fashion; and (4) handle outliers and the evolution of the data. The continuous evolution of data makes it essential to quickly identify new trends.
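A textbook building block for the single-pass, bounded-memory requirements above is the Misra-Gries summary, which tracks candidate frequent terms using a fixed number of counters; this is a standard sketch shown for illustration, not the project's own algorithm:

```python
class MisraGries:
    """Single-pass frequency summary with at most `capacity` counters."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.counters = {}

    def update(self, term):
        """Process one stream item incrementally, in constant bounded memory."""
        if term in self.counters:
            self.counters[term] += 1
        elif len(self.counters) < self.capacity:
            self.counters[term] = 1
        else:
            # Decrement every counter; drop any that reach zero.
            for t in list(self.counters):
                self.counters[t] -= 1
                if self.counters[t] == 0:
                    del self.counters[t]

    def candidates(self):
        """Terms whose true frequency may exceed n / (capacity + 1)."""
        return dict(self.counters)

# Invented toy stream of hashtag-like terms.
stream = ["storm"] * 6 + ["game", "vote", "storm", "game", "vote"]
mg = MisraGries(capacity=2)
for term in stream:
    mg.update(term)
print(mg.candidates())
```

The summary never stores more than `capacity` counters regardless of stream length, which is what makes this style of sketch usable on unbounded streams.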

XMDV: Visual Exploration Support for Data Mining and Discovery

XmdvTool is a public-domain software package for the interactive visual exploration of multivariate data sets. It is available on all major platforms, including Unix, Linux, macOS, and Windows. XmdvTool is developed using Qt and Eclipse CDT. It supports five methods for displaying flat-form data and hierarchically clustered data: (1) Scatterplots, (2) Star Glyphs, (3) Parallel Coordinates, (4) Dimensional Stacking, and (5) Pixel-oriented Display.

XmdvTool also supports a variety of interaction modes and tools, including brushing in screen, data, and structure spaces; zooming, panning, and distortion techniques; and the masking and reordering of dimensions. Univariate display and graphical summarization, via tree-maps and modified Tukey box plots, are also supported. Finally, color themes and user-customizable color assignments permit tailoring the aesthetics to the user.

XmdvTool has been applied to a wide range of application areas, some of which are highlighted in our Case Studies. Some of these domains include remote sensing, financial, geochemical, census, and simulation data. We are always looking for new applications, so if you've had some success with the system in your domain, we'd love to hear from you. See our contact page and join our user group if you'd like to contribute something or get further information.

Distributed Scalable Outliers in Big Data

Distance-based outlier detection is a popular and fundamental task in data analysis. However, its potentially quadratic time complexity impedes its usefulness on large-scale data. We address this issue and propose a distributed algorithm that scales to large data, embracing MapReduce, a popular shared-nothing distributed platform. This work outperforms existing solutions by (1) reducing the replica transportation cost between different physical machines and (2) guaranteeing load balancing under data skew.
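The quadratic-time baseline referred to above can be stated concretely: a point is a distance-based outlier if fewer than k other points lie within radius r of it. The brute-force sketch below, with invented sample data, illustrates the definition whose all-pairs cost motivates the distributed design:

```python
import math

def distance_based_outliers(points, r, k):
    """Naive O(n^2) detector: flag points with fewer than k neighbors within r."""
    outliers = []
    for p in points:
        neighbors = sum(
            1 for q in points
            if q is not p and math.dist(p, q) <= r
        )
        if neighbors < k:
            outliers.append(p)
    return outliers

# Invented 2-D data: a tight cluster plus one stray point.
cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1)]
stray = [(5.0, 5.0)]
print(distance_based_outliers(cluster + stray, r=0.5, k=2))  # -> [(5.0, 5.0)]
```

A distributed version partitions the points across machines and replicates only the boundary points each partition needs, which is exactly where the replica transportation cost mentioned above arises.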

PARAS: A Parameter Space Framework for Online Association Mining

Association rule mining is known to be computationally intensive, yet real-time decision-making applications are increasingly intolerant of delays. In this project, we introduce a parameter space model called PARAS. PARAS enables efficient rule mining by compactly maintaining the final rulesets. The PARAS model is based on the notion of stable-region abstractions that form a coarse-granularity ruleset space. Based on new insights into the redundancy relationships among rules, PARAS establishes a surprisingly compact representation of complex redundancy relationships while enabling efficient redundancy resolution at query time. Besides classical rule mining requests, the PARAS model supports three novel classes of exploratory queries. Using the proposed PSpace index, these exploratory query classes can all be answered with near-real-time responsiveness. Our experimental evaluation using several benchmark datasets demonstrates that PARAS achieves two to five orders of magnitude improvement over state-of-the-art approaches in online association rule mining.
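A tiny brute-force example, with invented transactions, shows why the ruleset is a function of the (support, confidence) query point: moving the thresholds changes which rules qualify, and it is this space of outcomes that PARAS precomputes into stable regions rather than re-mining per query:

```python
from itertools import combinations

# Invented market-basket transactions for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

def support(itemset):
    """Fraction of transactions containing the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules(min_support, min_confidence):
    """Brute-force pairwise rules a -> b meeting both thresholds."""
    items = set().union(*transactions)
    found = []
    for a, b in combinations(sorted(items), 2):
        pair = {a, b}
        if support(pair) >= min_support:
            confidence = support(pair) / support({a})
            if confidence >= min_confidence:
                found.append((a, b, confidence))
    return found

# Moving the query point across a region boundary changes the ruleset:
print(rules(min_support=0.5, min_confidence=0.6))  # three rules qualify
print(rules(min_support=0.5, min_confidence=0.8))  # none qualify
```

Within a stable region the ruleset is constant, so an index over region boundaries answers any (support, confidence) query without touching the transactions again.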

Scalable data stream processing
QueryMesh: A Novel Paradigm for Query Processing

Technological advances in positioning, sensor, and monitoring technology drive data acquisition devices to generate massive streams of data. The goal of this research is to develop a new class of high-performance stream data management systems capable of coping with infinite data arriving in large volumes under near-real-time response requirements. The proposed query processing paradigm, termed the multi-route query mesh model (QM), overcomes a major limitation of current query optimizers, both static and streaming alike: the assignment of a single "best" query execution plan to all input data. That approach, based on the strong assumption of data uniformity, can result in substandard performance for possibly all data items. Instead, query mesh adopts a processing structure composed of a data classifier and a multiple-route plan infrastructure. Different learning models can be plugged in as the classifier logic of the QM model. Given the complexity of the QM solution space, cost-based search heuristics are designed to efficiently find high-quality query meshes. QM is adaptive, supporting the detection and incremental modification of the QM classifier and its routes. The intellectual merit lies in the design, development, and evaluation of a novel multi-route paradigm for stream query processing, a middle ground between the two current extremes of single-plan and route-less solutions. Experimental studies compare query mesh to state-of-the-art solutions. QM impacts society by facilitating a wide range of stream-centric applications, including medical out-patient monitoring, emergency management, and business intelligence processing, and by integrating project activities with education.
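Schematically, the multi-route idea can be sketched as a classifier that assigns each incoming tuple to the execution route best suited to tuples of its kind; the two routes and the threshold rule below are invented purely for illustration:

```python
def route_selective_first(tuple_):
    """Hypothetical plan tuned for tuples where a selective filter pays off."""
    return ("selective-first", tuple_)

def route_join_first(tuple_):
    """Hypothetical plan tuned for tuples better served by joining early."""
    return ("join-first", tuple_)

def classifier(tuple_):
    # A plugged-in learning model would go here; a trivial rule stands in.
    return route_selective_first if tuple_["price"] > 100 else route_join_first

def query_mesh(stream):
    """Per-tuple routing instead of one global plan for all input data."""
    for t in stream:
        plan = classifier(t)
        yield plan(t)

stream = [{"price": 250}, {"price": 30}]
for routed in query_mesh(stream):
    print(routed)
```

The adaptivity described above would correspond to retraining the classifier and revising the set of routes as the data distribution drifts.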

RAINDROP: XQueries Over XML Streams (Automaton Meets Algebra)

As XML becomes popular, more and more stream sources exist in XML format. Typical XML stream applications include XML message brokers for B2B message-oriented middleware servers and the selective dissemination of information, such as personalized newspaper delivery. The general goal of Raindrop is to tackle the challenges of stream processing that are specific to XML, in particular processing XQuery, a standard XML query language, over XML streams.

It is important to note that, unlike tuple-based or object-based data streams, XML streams are more appropriately modeled as a sequence of primitive tokens, such as a start tag, an end tag, or a PCDATA item. A token, however, is not self-contained: whereas a tuple is a self-contained structure whose semantics are completely determined by its own values, a token lacks semantics without the context provided by other tokens in the stream. Structural pattern retrieval, one of the three functionalities in an XQuery (the other two being filtering and restructuring), must first be performed on these non-self-contained tokens to compose self-contained objects.

While the automata model is naturally suited for pattern matching on tokenized XML streams, the algebraic model is a well-established technique in database systems for set-oriented processing of self-contained data units, i.e., tuples. However, neither automata models nor algebraic models are well-equipped to handle both computation paradigms. The goal of the Raindrop project is to accommodate these two paradigms within one uniform algebraic framework, thus taking advantage of both. In our query model, both tokenized data and self-contained tuples are supported in a uniform manner. Query plans can thus be flexibly rewritten, using algebra-like equivalence rules, to change which computations are done over tokenized data versus tuples. The Raindrop system has four levels of abstraction in its framework, namely the semantics-focused plan, the stream logical plan, the stream physical plan, and the execution plan. Various optimization techniques are provided at each level.
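The token-versus-tuple distinction can be made concrete with an event-based (SAX) parse, where start tags, end tags, and PCDATA arrive as separate events and must be assembled into self-contained tuples; the element names below are invented for the example:

```python
import xml.sax

class BidExtractor(xml.sax.ContentHandler):
    """Assemble non-self-contained tokens into self-contained tuples."""

    def __init__(self):
        super().__init__()
        self.in_price = False
        self.current = ""
        self.tuples = []                   # self-contained results

    def startElement(self, name, attrs):   # a start-tag token arrives
        if name == "price":
            self.in_price = True
            self.current = ""

    def characters(self, content):         # a PCDATA token arrives
        if self.in_price:
            self.current += content

    def endElement(self, name):            # an end-tag token completes the pattern
        if name == "price":
            self.in_price = False
            self.tuples.append({"price": float(self.current)})

handler = BidExtractor()
xml.sax.parseString(b"<bids><price>12.5</price><price>40</price></bids>", handler)
print(handler.tuples)
```

Note that no single callback sees a complete price: the value only becomes a usable tuple once start tag, character data, and end tag have all been correlated, which is the structural pattern retrieval step described above.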

CAPE: Continuous Adaptive Processing Engine

The growth of electronic commerce and the widespread use of sensor networks have created demand for online processing and monitoring applications, giving rise to a new class of query processing over continuously generated data streams. Traditional database techniques, which assume data to be bounded as well as statically stored and indexed, are largely incapable of handling these new applications, and so Continuous Query (CQ) systems have appeared. CQ systems must be adaptive to properly manage their available resources in the face of data streams with widely varying arrival rates and a constantly changing set of standing user queries. No a priori optimization algorithm can succeed given such variability. The CAPE project aims to propose a novel architecture for a CQ system that (1) incorporates adaptability at all levels of query processing and (2) incorporates a dynamic metadata model used to help optimize all levels of query processing.

The CAPE project aims to provide novel techniques for processing large numbers of concurrent continuous queries with the required Quality of Service (QoS). Because of the dynamic nature of query registration and stream behavior, we are designing heterogeneous-grained adaptivity for CAPE that exploits dynamic metadata at all levels of continuous query processing, including query operator execution, memory allocation, operator scheduling, query plan structuring, and query plan distribution among multiple machines. We will (1) design an extensible dynamic metadata model; (2) design adaptive algorithms for each layer of query processing that exploit the available metadata; (3) develop QoS specification models for capturing resource usage; (4) incorporate a hierarchical interaction model for coordinating adaptation at different levels within the CQ system; and (5) design a family of metadata-exploiting optimization techniques.

Undergraduate projects
Interactive web-based dashboard for the Massachusetts High Tech Council

The goal of this project is to build, and analyze the effects of, an interactive web-based dashboard for the Massachusetts High Tech Council, a pro-technology advocacy and lobbying organization. We conducted a survey of Massachusetts High Technology Council (MHTC) members about the perceived effectiveness of the dashboard, as well as a usability study of the dashboard prototype to test its ease of use. This allowed us to better understand the impact of technology on policy making.

Students: Nilesh C Patel, Stefan Gvozdenovic, Theodore J Meyer