Research Topics
Query Mesh: The Multi-Route Query Model and Infrastructure
In traditional and stream database systems alike, the optimizer typically picks a single query plan for all data based on overall data statistics. However, many have observed that real-life datasets tend to have non-uniform distributions. Selecting a single query plan may result in ineffective query execution for possibly large portions of the actual data. Query Mesh (or QM) is a practical alternative to this state-of-the-art in query processing. The main idea of QM is to compute multiple routes (i.e., query plans), each design-ed for a particular subset of the data with distinct statistical properties. A classifer model is induced and used to assign the best route to process new data tuples based upon their data characteristics. In this study we focused in depth on the compile-time (STATIC) QM optimization problem (i.e., on how to produce QM assuming relatively stable statistics).The major contributions of this work are:
- R. V. Nehme, K. Works, E. A. Rundensteiner, and E. Bertino, 2009. ''Query mesh: multi-route query processing technology''. VLDB 2009. (pdf)
- R. V. Nehme. ''Efficient query processing for rich and diverse real-time data''. Diss. Purdue University, 2009. (pdf)
- R. V. Nehme, K. Works, C. Lei, E. A. Rundensteiner, and E. Bertino "Multi-Route Query Processing and Optimization", Journal of Computer and System Sciences 2013. (pdf)
Query Mesh: Adaptive Online Route Selection
In real-life applications, different subsets of data may have distinct statistical properties, e.g., various websites may have diverse visitation rates, different categories of stocks may have dissimilar price fluctuation patterns. For such applications, it can be fruitful to eliminate the commonly made single execution plan assumption and instead execute a query using several plans, each optimally serving a subset of data with particular statistical properties. Furthermore, in dynamic environments, data properties may change continuously, thus calling for adaptivity. The intriguing question is: can we have an execution strategy that (1) is plan-based to leverage on all the benefits of traditional plan-based systems, (2) supports multiple plans each customized for different subset of data, and yet (3) is as adaptive as “plan-less” sys- tems like Eddies? While the recently proposed Query Mesh (QM) approach provides a foundation for such an execution paradigm, it does not address the question of adaptivity required for highly dynamic environments. In this work, we fill this gap by proposing a Self-Tuning Query Mesh (ST-QM) – an adaptive solution for content-based multi-plan execution engines. The major contributions of this work are:- R. V. Nehme, E. A. Rundensteiner, and E. Bertino. ''Self-tuning query mesh for adaptive multi-route query processing'' EDBT 2009. (pdf)
- R. V. Nehme. ''Efficient query processing for rich and diverse real-time data''.Diss. Purdue University, 2009. (pdf)
Query Mesh: Cost-Aware Multi-Route Optimizer
State-of-the-art optimizers produce one single optimal query plan for all stream data, in spite of such a singleton plan typically being sub-optimal or even poor for highly correlated data. Recently a new stream processing paradigm, called multi-route approach, has emerged as a promising approach for tackling this problem. Multi-route first divides data streams into several partitions and then creates a separate query plan for each combination of partitions. Unfortunately current approaches suffer from severe shortcomings, in particular, the lack of an effective partitioning strategy and the prohibitive query optimization expense. In this study, we focused on the first practical multi-route optimizer named correlation-aware multi-route stream query optimizer (or CMR) that solves both problems. By exploiting both intra- and inter-stream correlations of streams, CMR produces effective partitions without having to undertake repeated expensive query plan generation. The produced partitions not only are best served by distinct optimal query plans, but also leverage the partition-driven pruning opportunity. Our major contributions of this work are:- L. Cao and E. Rundensteiner, "High Performance Stream Query Processing With Correlation-Aware Partitioning", VLDB 2013
Constraint-Exploiting Adaptive Stream Processing Engine
Data stream management systems processing long-running queries over large volumes of stream data must typically deliver time-critical responses. Since data in these applications is generated on the fly, meta knowledge about cardinality and data value arrival patterns is typically not obtainable when compile time query optimization decisions are made. Also, no pre-built index is available to be exploited for query processing. Therefore, traditional query optimization strategies, which heavily rely on pre-built indices, become inapplicable. As observed by S. Babu and J. Widom, in some applications meta knowledge on data values may become available as streaming data is generated. In this study we focused in depth on the problem of finding a semantic query optimization (SQO) approach that utilizes dynamic substream metadata at runtime to find a more efficient query plan than the one selected at compilation time. The major contributions of this work are:- K. Works. "Dissertation: Targeted Prioritized Processing in Overloaded Data Stream Systems". Worcester Polytechnic Institute. 2014
- L. Ding, K. Works, and E. A. Rundensteiner. "Semantic Stream Query Optimization Exploiting Dynamic Metadata" ICDE 2011. (pdf)
Multi-Resource Constraint Query Optimization
This project focuses on techniques that generate robust query plans that adhere to both CPU and memory restrictions. Polynomial time techniques to generate "near-optimal" continuous query plans are also proposed and experimentally compared.- V. Raghavan and E. Rundensteiner, "CAQE: A Contract Driven Approach to Processing Concurrent Decision Support Queries" EDBT 2014
- Y. Zhu, V. Raghavan and E. A. Rundensteiner, "A New Look At Generating Multi-Join Continuous Query Plans: A Qualified Plan Generation Problem", Data and Knowledge Engineering (DKE Journal), (pdf).
- V. Raghavan, Y. Zhu, E. A. Rundensteiner, and D. Dougherty, "Multi-Join Continuous Query Optimization: Covering the Spectrum of Linear, Acyclic and Cyclic Queries", British National Conference on Databases (BNCOD) 2009, (pdf).
Query Relaxation via Space Partitioning and Mapping Functions
Although a large proportion of queries used everyday have implicit or explicit cardinality constraints, database engines have minimal support for ensuring query cardinality. This leads to two important prob lems: the empty result problem and the few/many problem. The former problem arises when queries against the database simply return no results, while the latter is a more general problem where the number of results returned is either too large or insufficient for the user's purpose. This inability to produce results of the expected cardinality requires users to undertake a frustrating trial-and-error process t o formulate queries that may eventually return the desired number of results. Without prior knowledge of the precise data in the database, this process is laborious, and can waste a large amount of system resources during repeated query execution. Generating queries meeting particular cardinality constraint s is also essential in testing databases since specifically structured queries are required to thorough ly evaluate new database technologies. A solution to the cardinality assurance problem is query relaxation - the selective loosening of a set of query predicates to meet the required cardinality. However, query relaxation is an NP-hard problem p osing several challenges: (1) there are an exponential number of combinations of query predicates that can be relaxed; (2) the degree of relaxation required is not known beforehand; and (3) queries need to be re-executed to ensure the desired cardinality. In this project we will investigate new techniques t o relax queries by utilizing the principles of mapping functions and space partitioning. Additionally, the problems of formulating query recommendations and generating database testing queries is explored.- M. Vartak, V. Raghavan, E. Rundensteiner, "QRelX: Generating Meaningful Queries that Provide Cardina lity Assurance", Demo Paper, SIGMOD 2010, (pdf).
- M. Vartak, V. Raghavan, and E. Rundensteiner, "Recommendation-based Query Relaxation via Space Parti tioning", First Place Winner, ACM Student Research Competition, Grace Hopper Celebration of Women in Computing, Oct 2009.
- M. Vartak, V. Raghavan, E. Rundensteiner, Recommendation-based Query Relaxation via Mapping Functio ns and Space Partitioning", Winner, ACM SIGMOD Undergraduate Poster Competition, June 2009.
Indexing Support for Adaptive Multi-Route Query Processing
Adaptive multi-route query processing (AMR) is a recently emerging paradigm for processing stream queries in highly fluctuating environments. AMR dynamically routes batches of tuples to operators in the query network based on routing criteria and up-to-date system statistics. In the context of AMR systems, indexing, a core technology for efficient stream processing, has received little attention. Indexing in AMR systems is demanding as indices must adapt to serve continuously evolving query paths while maintaining index content under high volumes of data. Our Adaptive Multi-Route Index (AMRI) employs a bitmap time-partitioned design that while being versatile in serving a diverse ever changing workload of multiple query access patterns remains lightweight in terms of maintenance and storage requirements. In addition, our AMRI index design and migration strategies seek to met the indexing needs of both older partially serviced and newer incoming search requests. The major contributions of this work are:- K. Works, E. A. Rundensteiner, and E. Agu, "Optimizing adaptive multi-route query processing via time-partitioned indices" Journal of Computer and System Sciences 2013
- K. Works, E. A. Rundensteiner, and E. Agu, ''Index Tuning for Adaptive Multi-Route Data Stream Systems'', International Workshop on Scalable Stream Processing Systems (SSPS) held in conjunction with IPDPS 2010 (pdf) (slides).
- K. Works, E. A. Rundensteiner, and E. Agu, "Index tuning for adaptive multi-route data stream systems'', Worcester Polytechnic Institute, Technical Report WPI-CS-TR-10-05, 2010. (pdf)
Redoop: Scalable Recurring Processing
The growing demand for large-scale data analytics ranging from online advertisement placement, log processing, to fraud detection, has led to the design of highly scalable data-intensive computing infrastructures such as the Hadoop platform. Recurring queries, repeatedly being executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Despite their importance, the plain Hadoop along with its state-of-art extensions lack built-in support for recurring queries. In this work, we present the Redoop system, an extension of the Hadoop framework, designed to fill in this void. Redoop supports recurring queries as firstclass citizen in Hadoop without sacrificing any of its core features. More importantly, Redoop deploys innovative window-aware optimization techniques for recurring query execution including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop retains the fault-tolerance of MapReduce via automatic cache recovery and task re-execution support.- C. Lei, E. Rundensteiner and M. Eltabakh. "Redoop: Supporting Recurring Queries in Hadoop". EDBT 2014. (pdf)
- C. Lei, Z. Zhuang, E. Rundensteiner and M. Eltabakh. "Redoop Infrastructure for Recurring Big Data Queries". VLDB 2014.
RLD: Robust Distributed Stream Processing
Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Designed and implemented RLD - a comprehensive solution that provides nearly optimal query performance under load fluctuations without suffering from the performance penalty caused by load migration. Proposed polynomial time technique to generate a robust logical and physical solution with a probabilistic bound on the space coverage. RLD consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.- C. Lei, E. A. Rundensteiner, and J. D. Guttman. "Robust Distributed Stream Processing". ICDE 2013.
- C. Lei, E. A. Rundensteiner, and J. D. Guttman. "Robust Distributed Query Processing for Streaming Data". TODS 2014.
Preferential Resource Allocation in Stream Processing Systems
In electronic monitoring applications (or EMAs) the required response time for objects monitored varies depending on the associated risk of the objects and the current system load. EMAs process queries online over large volumes of data arriving from high-speed data streams with fluctuating arrival rates for days, months, or even indefinitely. At times EMAs may not be able to process the sleuth of incoming data and thus to monitor all objects. The stringent requirement of processing incoming tuples within their response time demands efficient resource allocation. In addition to support complex queries requires large operator states. Thus, EMAs can be extremely resource intensive. To further complicate matters, to determine the significance of tuples may introduce additional costs at a time when resources are already at a premium. During such times it is crucial for an EMA to consider the significance of tuples when allocating resources and to proactively pull certain tuples forward ahead of others within the query pipeline. In this study we focused in depth on exploring solutions that support preferential resource allocation in stream processing systems. The major contributions of this work are:- K. Works, and E. A. Rundensteiner. "The Proactive Promotion Engine" ICDE 2011. (pdf)
- K. Works, and E. A. Rundensteiner. "Preferential Resource Allocation in Stream Processing Systems", International Journal of Cooperative Information Systems 2014
- K. Works and E. Rundeisteiner. "Reliable Aggregation over Prioritized Data Streams". Transactions on Large-Scale Data and Knowledge-Centered Systems 2014
- K. Works and E. Rundensteiner. "Utilizing Dynamic Precedence Criteria To Ensure the Production of Critical Results from Big Data Streams". BigData 2014
Best paper award for NSF-supported research at the Big Data Conference, Harvard University 2014.