Research Projects
RLD: Robust Distributed Stream Processing
Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Designed and implemented RLD - a comprehensive solution that provides nearly optimal query performance under load fluctuations without suffering from the performance penalty caused by load migration. Proposed polynomial time technique to generate a robust logical and physical solution with a probabilistic bound on the space coverage. RLD consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.- C. Lei, E. A. Rundensteiner, and J. D. Guttman. "Robust Distributed Stream Processing". ICDE 2013. (pdf)
QueryMesh: The Multi-Route Query Model and Infrastructure
In traditional and stream database systems alike, the optimizer typically picks a single query plan for all data based on overall data statistics. However, many have observed that real-life datasets tend to have non-uniform distributions. Selecting a single query plan may result in ineffective query execution for possibly large portions of the actual data. Query Mesh (or QM) is a practical alternative to this state-of-the-art in query proce-ssing. The main idea of QM is to compute multiple routes (i.e., query plans), each design-ed for a particular subset of the data with distinct statistical properties. A classifer model is induced and used to assign the best route to process new data tuples based upon their data characteristics. In this study we focused in depth on the compile-time (STATIC) QM optimization problem (i.e., on how to produce QM assuming relatively stable statistics).The major contributions of this work are:
- R. V. Nehme, K. Works, E. A. Rundensteiner, and E. Bertino, 2009. ''Query mesh: multi-route query processing technology''. Proc. VLDB Endow. 2, 2 (Aug. 2009), 1530-1533. (pdf)
- R. V. Nehme. ''Efficient query processing for rich and diverse real-time data''.Diss. Purdue University, 2009. (pdf)
- R. V. Nehme, K. Works, C. Lei, E. A. Rundensteiner, and E. Bertino "Multi-Route Query Processing and Optimization", JCSS 79 (2013), 312-329. (pdf)
Query Mesh: Adaptive Online Route Selection
In real-life applications, different subsets of data may have distinct statistical properties, e.g., various websites may have diverse visitation rates, different categories of stocks may have dissimilar price fluctuation patterns. For such applications, it can be fruitful to eliminate the commonly made single execution plan assumption and instead execute a query using several plans, each optimally serving a subset of data with particular statistical properties. Furthermore, in dynamic environments, data properties may change continuously, thus calling for adaptivity. The intriguing question is: can we have an execution strategy that (1) is plan-based to leverage on all the benefits of traditional plan-based systems, (2) supports multiple plans each customized for different subset of data, and yet (3) is as adaptive as “plan-less” sys- tems like Eddies? While the recently proposed Query Mesh (QM) approach provides a foundation for such an execution paradigm, it does not address the question of adaptivity required for highly dynamic environments. In this work, we fill this gap by proposing a Self-Tuning Query Mesh (ST-QM) – an adaptive solution for content-based multi-plan execution engines. The major contributions of this work are:- R. V. Nehme, E. A. Rundensteiner, and E. Bertino. ''Self-tuning query mesh for adaptive multi-route query processing'' EDBT '09, 803-814. (pdf)
- R. V. Nehme. ''Efficient query processing for rich and diverse real-time data''.Diss. Purdue University, 2009. (pdf)
Constraint-Exploiting Adaptive Stream Processing Engine
Data stream management systems processing long-running queries over large volumes of stream data must typically deliver time-critical responses. Since data in these applications is generated on the fly, meta knowledge about cardinality and data value arrival patterns is typically not obtainable when compile time query optimization decisions are made. Also, no pre-built index is available to be exploited for query processing. Therefore, traditional query optimization strategies, which heavily rely on pre-built indices, become inapplicable. As observed by S. Babu and J. Widom, in some applications meta knowledge on data values may become available as streaming data is generated. In this study we focused in depth on the problem of finding a semantic query optimization (SQO) approach that utilizes dynamic substream metadata at runtime to find a more efficient query plan than the one selected at compilation time. The major contributions of this work are:- L. Ding, K. Works, and E. A. Rundensteiner. "Semantic Stream Query Optimization Exploiting Dynamic Metadata" ICDE '11. (pdf)
Multi-Resource Constraint Query Optimization
This project focuses on techniques that generate robust query plans that adhere to both CPU and memory restrictions. Polynomial time techniques to generate "near-optimal" continuous query plans are also proposed and experimentally compared.- Y. Zhu, V. Raghavan and E. A. Rundensteiner, "A New Look At Generating Multi-Join Continuous Query Plans: A Qualified Plan Generation Problem", Data and Knowledge Engineering (DKE Journal), (pdf).
- V. Raghavan, Y. Zhu, E. A. Rundensteiner, and D. Dougherty, "Multi-Join Continuous Query Optimization: Covering the Spectrum of Linear, Acyclic and Cyclic Queries", British National Conference on Databases (BNCOD) 2009, (pdf).
Query Relaxation via Space Partitioning and Mapping Functions
Although a large proportion of queries used everyday have implicit or explicit cardinality constraints, database engines have minimal support for ensuring query cardinality. This leads to two important prob lems: the empty result problem and the few/many problem. The former problem arises when queries against the database simply return no results, while the latter is a more general problem where the number of results returned is either too large or insufficient for the user's purpose. This inability to produce results of the expected cardinality requires users to undertake a frustrating trial-and-error process t o formulate queries that may eventually return the desired number of results. Without prior knowledge of the precise data in the database, this process is laborious, and can waste a large amount of system resources during repeated query execution. Generating queries meeting particular cardinality constraint s is also essential in testing databases since specifically structured queries are required to thorough ly evaluate new database technologies. A solution to the cardinality assurance problem is query relaxation - the selective loosening of a set of query predicates to meet the required cardinality. However, query relaxation is an NP-hard problem p osing several challenges: (1) there are an exponential number of combinations of query predicates that can be relaxed; (2) the degree of relaxation required is not known beforehand; and (3) queries need to be re-executed to ensure the desired cardinality. In this project we will investigate new techniques t o relax queries by utilizing the principles of mapping functions and space partitioning. Additionally, the problems of formulating query recommendations and generating database testing queries is explored.- M. Vartak, V. Raghavan, E. Rundensteiner, "QRelX: Generating Meaningful Queries that Provide Cardina lity Assurance", Demo Paper, SIGMOD 2010, (pdf).
- M. Vartak, V. Raghavan, and E. Rundensteiner, "Recommendation-based Query Relaxation via Space Parti tioning", First Place Winner, ACM Student Research Competition, Grace Hopper Celebration of Women in Computing, Oct 2009.
- M. Vartak, V. Raghavan, E. Rundensteiner, Recommendation-based Query Relaxation via Mapping Functio ns and Space Partitioning", Winner, ACM SIGMOD Undergraduate Poster Competition, June 2009.
Indexing Support for Adaptive Multi-Route Query Processing
Adaptive multi-route query processing (AMR) is a recently emerging paradigm for processing stream queries in highly fluctuating environments. AMR dynamically routes batches of tuples to operators in the query network based on routing criteria and up-to-date system statistics. In the context of AMR systems, indexing, a core technology for efficient stream processing, has received little attention. Indexing in AMR systems is demanding as indices must adapt to serve continuously evolving query paths while maintaining index content under high volumes of data. Our Adaptive Multi-Route Index (AMRI) employs a bitmap time-partitioned design that while being versatile in serving a diverse ever changing workload of multiple query access patterns remains lightweight in terms of maintenance and storage requirements. In addition, our AMRI index design and migration strategies seek to met the indexing needs of both older partially serviced and newer incoming search requests. The major contributions of this work are:- K. Works, E. A. Rundensteiner, and E. Agu, ''Index Tuning for Adaptive Multi-Route Data Stream Systems'', International Workshop on Scalable Stream Processing Systems (SSPS) held in conjunction with IPDPS 2010 (pdf) (slides).
- K. Works, E. A. Rundensteiner, and E. Agu, "Index tuning for adaptive multi-route data stream systems'', Worcester Polytechnic Institute, Technical Report WPI-CS-TR-10-05, 2010. (pdf)
- K. Works, E. A. Rundensteiner, and E. Agu, "Optimizing Adaptive Multi-Route Query Processing via Time-Partitioned Indices'', JCSS 79 (2013)330-348.
Preferential Resource Allocation in Stream Processing Systems
In electronic monitoring applications (or EMAs) the required response time for objects monitored varies depending on the associated risk of the objects and the current system load. EMAs process queries online over large volumes of data arriving from high-speed data streams with fluctuating arrival rates for days, months, or even indefinitely. At times EMAs may not be able to process the sleuth of incoming data and thus to monitor all objects. The stringent requirement of processing incoming tuples within their response time demands efficient resource allocation. In addition to support complex queries requires large operator states. Thus, EMAs can be extremely resource intensive. To further complicate matters, to determine the significance of tuples may introduce additional costs at a time when resources are already at a premium. During such times it is crucial for an EMA to consider the significance of tuples when allocating resources and to proactively pull certain tuples forward ahead of others within the query pipeline. In this study we focused in depth on exploring solutions that support preferential resource allocation in stream processing systems. The major contributions of this work are:- K. Works, and E. A. Rundensteiner. "The Proactive Promotion Engine" ICDE '11. (pdf)
- K. Works, and E. A. Rundensteiner. "Proactive Promotion: Preferential Resource Allocation in Stream Processing Systems", in submission.