Query Mesh: Adaptive Multi-Route Stream Query Processing

Research Topics

Query Mesh: The Multi-Route Query Model and Infrastructure

In traditional and stream database systems alike, the optimizer typically picks a single query plan for all data based on overall data statistics. However, many have observed that real-life datasets tend to have non-uniform distributions. Selecting a single query plan may result in ineffective query execution for possibly large portions of the actual data. Query Mesh (or QM) is a practical alternative to this state-of-the-art in query processing. The main idea of QM is to compute multiple routes (i.e., query plans), each design-ed for a particular subset of the data with distinct statistical properties. A classifer model is induced and used to assign the best route to process new data tuples based upon their data characteristics. In this study we focused in depth on the compile-time (STATIC) QM optimization problem (i.e., on how to produce QM assuming relatively stable statistics).
The major contributions of this work are:

characterization of QM search space and its complexity

description of QM search cost model

design of heuristics that navigate complex search space to produce good QM solutions

description of physical architecture of QM system including infrastructure and ruster construction

evaluation of heuristics in terms of both plan quality and performance of multi-plan optimization

R. V. Nehme, K. Works, E. A. Rundensteiner, and E. Bertino, 2009. ''Query mesh: multi-route query processing technology''. VLDB 2009. (pdf)
R. V. Nehme. ''Efficient query processing for rich and diverse real-time data''. Diss. Purdue University, 2009. (pdf)
R. V. Nehme, K. Works, C. Lei, E. A. Rundensteiner, and E. Bertino "Multi-Route Query Processing and Optimization", Journal of Computer and System Sciences 2013. (pdf)

Query Mesh: Adaptive Online Route Selection

In real-life applications, different subsets of data may have distinct statistical properties, e.g., various websites may have diverse visitation rates, different categories of stocks may have dissimilar price fluctuation patterns. For such applications, it can be fruitful to eliminate the commonly made single execution plan assumption and instead execute a query using several plans, each optimally serving a subset of data with particular statistical properties. Furthermore, in dynamic environments, data properties may change continuously, thus calling for adaptivity. The intriguing question is: can we have an execution strategy that (1) is plan-based to leverage on all the benefits of traditional plan-based systems, (2) supports multiple plans each customized for different subset of data, and yet (3) is as adaptive as “plan-less” sys- tems like Eddies? While the recently proposed Query Mesh (QM) approach provides a foundation for such an execution paradigm, it does not address the question of adaptivity required for highly dynamic environments. In this work, we fill this gap by proposing a Self-Tuning Query Mesh (ST-QM) – an adaptive solution for content-based multi-plan execution engines. The major contributions of this work are:

Online QM monitoring: We abstract the adaptivity of a multi-plan solution QM as a concept drift problem. Our approach, based on monitoring and detection of concept drifts, can discard many insignificant adaptivity cases early, and thus minimize the adaptivity overhead.

Online QM concept drift detection: We present algorithms to efficiently determine virtual and real concept drifts in QM used to determine if and how the execution strategy should be adapted.

Actuation of the QM migration: The key feature : The key feature of our adaptive method is that all logical transformations to the current execution solution are translated into a single physical operation the change of the classifier, without effecting the rest of the execution infrastructure. This makes physical adaptivity extremely lightweight.

R. V. Nehme, E. A. Rundensteiner, and E. Bertino. ''Self-tuning query mesh for adaptive multi-route query processing'' EDBT 2009. (pdf)
R. V. Nehme. ''Efficient query processing for rich and diverse real-time data''.Diss. Purdue University, 2009. (pdf)

Query Mesh: Cost-Aware Multi-Route Optimizer

State-of-the-art optimizers produce one single optimal query plan for all stream data, in spite of such a singleton plan typically being sub-optimal or even poor for highly correlated data. Recently a new stream processing paradigm, called multi-route approach, has emerged as a promising approach for tackling this problem. Multi-route first divides data streams into several partitions and then creates a separate query plan for each combination of partitions. Unfortunately current approaches suffer from severe shortcomings, in particular, the lack of an effective partitioning strategy and the prohibitive query optimization expense. In this study, we focused on the first practical multi-route optimizer named correlation-aware multi-route stream query optimizer (or CMR) that solves both problems. By exploiting both intra- and inter-stream correlations of streams, CMR produces effective partitions without having to undertake repeated expensive query plan generation. The produced partitions not only are best served by distinct optimal query plans, but also leverage the partition-driven pruning opportunity. Our major contributions of this work are:

We introduce correlation indicators that reliably determine the applicability of the multi-route approach for a given workload.

We design a sample-based partitioning algorithm that discovers effective intra-stream partitions in sub-linear sample complexity.

We propose a partition optimization approach that based on the inter-stream correlation indicator, is shown to successfully exploit the join pair pruning opportunity.

Our experiments confirm that our CMR optimizer integrating the above strategies into one comprehensive solution framework outperforms the existing alternatives up to an order of magnitude.

L. Cao and E. Rundensteiner, "High Performance Stream Query Processing With Correlation-Aware Partitioning", VLDB 2013

Constraint-Exploiting Adaptive Stream Processing Engine

Data stream management systems processing long-running queries over large volumes of stream data must typically deliver time-critical responses. Since data in these applications is generated on the fly, meta knowledge about cardinality and data value arrival patterns is typically not obtainable when compile time query optimization decisions are made. Also, no pre-built index is available to be exploited for query processing. Therefore, traditional query optimization strategies, which heavily rely on pre-built indices, become inapplicable. As observed by S. Babu and J. Widom, in some applications meta knowledge on data values may become available as streaming data is generated. In this study we focused in depth on the problem of finding a semantic query optimization (SQO) approach that utilizes dynamic substream metadata at runtime to find a more efficient query plan than the one selected at compilation time. The major contributions of this work are:

Semantic Stream Query Optimization: We are the first to explore plan level semantic stream query optimization that exploits dynamic metadata. In particular, we identify four herald enabled semantic query optimization opportunities that that, once applied, guarantee performance gains.

Online Optimization Method: To minimize the optimization overhead, we develop PredSAT an efficient constraint reasoning algorithm based on classic satisfiability reasoning theory. PredSAT is guaranteed to identify all four herald-driven optimization opportunities incrementally at runtime.

Query Execution Paradigm: To achieve multiple concurrent logical plans with one single physical plan, we propose a novel query execution paradigm employing multi-modal operators with runtime configuration logic. This paradigm eliminates any replication of operator states or inter-operator queues, guarantees instantaneous application of herald-driven query optimizations, requires zero plan migration effort, and naturally supports highly flexible adaptive execution.

K. Works. "Dissertation: Targeted Prioritized Processing in Overloaded Data Stream Systems". Worcester Polytechnic Institute. 2014
L. Ding, K. Works, and E. A. Rundensteiner. "Semantic Stream Query Optimization Exploiting Dynamic Metadata" ICDE 2011. (pdf)

Multi-Resource Constraint Query Optimization

This project focuses on techniques that generate robust query plans that adhere to both CPU and memory restrictions. Polynomial time techniques to generate "near-optimal" continuous query plans are also proposed and experimentally compared.

V. Raghavan and E. Rundensteiner, "CAQE: A Contract Driven Approach to Processing Concurrent Decision Support Queries" EDBT 2014
Y. Zhu, V. Raghavan and E. A. Rundensteiner, "A New Look At Generating Multi-Join Continuous Query Plans: A Qualified Plan Generation Problem", Data and Knowledge Engineering (DKE Journal), (pdf).
V. Raghavan, Y. Zhu, E. A. Rundensteiner, and D. Dougherty, "Multi-Join Continuous Query Optimization: Covering the Spectrum of Linear, Acyclic and Cyclic Queries", British National Conference on Databases (BNCOD) 2009, (pdf).

Query Relaxation via Space Partitioning and Mapping Functions

Although a large proportion of queries used everyday have implicit or explicit cardinality constraints, database engines have minimal support for ensuring query cardinality. This leads to two important prob lems: the empty result problem and the few/many problem. The former problem arises when queries against the database simply return no results, while the latter is a more general problem where the number of results returned is either too large or insufficient for the user's purpose. This inability to produce results of the expected cardinality requires users to undertake a frustrating trial-and-error process t o formulate queries that may eventually return the desired number of results. Without prior knowledge of the precise data in the database, this process is laborious, and can waste a large amount of system resources during repeated query execution. Generating queries meeting particular cardinality constraint s is also essential in testing databases since specifically structured queries are required to thorough ly evaluate new database technologies. A solution to the cardinality assurance problem is query relaxation - the selective loosening of a set of query predicates to meet the required cardinality. However, query relaxation is an NP-hard problem p osing several challenges: (1) there are an exponential number of combinations of query predicates that can be relaxed; (2) the degree of relaxation required is not known beforehand; and (3) queries need to be re-executed to ensure the desired cardinality. In this project we will investigate new techniques t o relax queries by utilizing the principles of mapping functions and space partitioning. Additionally, the problems of formulating query recommendations and generating database testing queries is explored.

M. Vartak, V. Raghavan, E. Rundensteiner, "QRelX: Generating Meaningful Queries that Provide Cardina lity Assurance", Demo Paper, SIGMOD 2010, (pdf).
M. Vartak, V. Raghavan, and E. Rundensteiner, "Recommendation-based Query Relaxation via Space Parti tioning", First Place Winner, ACM Student Research Competition, Grace Hopper Celebration of Women in Computing, Oct 2009.
M. Vartak, V. Raghavan, E. Rundensteiner, Recommendation-based Query Relaxation via Mapping Functio ns and Space Partitioning", Winner, ACM SIGMOD Undergraduate Poster Competition, June 2009.

This particular effort has in part also been supported by the Computing Research Association's Committee on the Status of Women in Computing Research (CRA-W) as a "Collaborative Research Experience for Undergraduates (CREU)" as listed in CREU Final Reports. This research also was part of Manasi Vartak's Major Qualifying project in Computer Sciences under Prof. Rundensteiner's guidance (CS MQP project). Mansi was awarded the best CS MQP award from WPI for this research for 2010 (Provost Best Computer Science MQP Award 2010). Throughout this period, Manasi was being awarded several additional awards, including winner of the undergraduate poster session at the Grace Hopper Women in Computing conference held in Tucson, AZ (2009).

Indexing Support for Adaptive Multi-Route Query Processing

Adaptive multi-route query processing (AMR) is a recently emerging paradigm for processing stream queries in highly fluctuating environments. AMR dynamically routes batches of tuples to operators in the query network based on routing criteria and up-to-date system statistics. In the context of AMR systems, indexing, a core technology for efficient stream processing, has received little attention. Indexing in AMR systems is demanding as indices must adapt to serve continuously evolving query paths while maintaining index content under high volumes of data. Our Adaptive Multi-Route Index (AMRI) employs a bitmap time-partitioned design that while being versatile in serving a diverse ever changing workload of multiple query access patterns remains lightweight in terms of maintenance and storage requirements. In addition, our AMRI index design and migration strategies seek to met the indexing needs of both older partially serviced and newer incoming search requests. The major contributions of this work are:

An index design that serves workloads composed of multiple access patterns while remaining efficiently maintainable in highly dynamic DSMS.

Design of index assessment methods that reduce the system resources required by index assessment by eliminating statistics while maintaining the integrity of the index configuration.

We prove that the effect of statistic compaction on the potential loss of quality of the selected index configuration can be bounded by a preset constant.

We explore the question of whether utilizing a single or multiple index configurations is best at improving the throughput of all search requests in the system regardless of their proximity to completion

K. Works, E. A. Rundensteiner, and E. Agu, "Optimizing adaptive multi-route query processing via time-partitioned indices" Journal of Computer and System Sciences 2013
K. Works, E. A. Rundensteiner, and E. Agu, ''Index Tuning for Adaptive Multi-Route Data Stream Systems'', International Workshop on Scalable Stream Processing Systems (SSPS) held in conjunction with IPDPS 2010 (pdf) (slides).
K. Works, E. A. Rundensteiner, and E. Agu, "Index tuning for adaptive multi-route data stream systems'', Worcester Polytechnic Institute, Technical Report WPI-CS-TR-10-05, 2010. (pdf)

Redoop: Scalable Recurring Processing

The growing demand for large-scale data analytics ranging from online advertisement placement, log processing, to fraud detection, has led to the design of highly scalable data-intensive computing infrastructures such as the Hadoop platform. Recurring queries, repeatedly being executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of these analytic applications. Despite their importance, the plain Hadoop along with its state-of-art extensions lack built-in support for recurring queries. In this work, we present the Redoop system, an extension of the Hadoop framework, designed to fill in this void. Redoop supports recurring queries as firstclass citizen in Hadoop without sacrificing any of its core features. More importantly, Redoop deploys innovative window-aware optimization techniques for recurring query execution including adaptive window-aware data partitioning, window-aware task scheduling, and inter-window caching mechanisms. Redoop retains the fault-tolerance of MapReduce via automatic cache recovery and task re-execution support.

C. Lei, E. Rundensteiner and M. Eltabakh. "Redoop: Supporting Recurring Queries in Hadoop". EDBT 2014. (pdf)
C. Lei, Z. Zhuang, E. Rundensteiner and M. Eltabakh. "Redoop Infrastructure for Recurring Big Data Queries". VLDB 2014.

RLD: Robust Distributed Stream Processing

Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Designed and implemented RLD - a comprehensive solution that provides nearly optimal query performance under load fluctuations without suffering from the performance penalty caused by load migration. Proposed polynomial time technique to generate a robust logical and physical solution with a probabilistic bound on the space coverage. RLD consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.

C. Lei, E. A. Rundensteiner, and J. D. Guttman. "Robust Distributed Stream Processing". ICDE 2013.
C. Lei, E. A. Rundensteiner, and J. D. Guttman. "Robust Distributed Query Processing for Streaming Data". TODS 2014.

Preferential Resource Allocation in Stream Processing Systems

In electronic monitoring applications (or EMAs) the required response time for objects monitored varies depending on the associated risk of the objects and the current system load. EMAs process queries online over large volumes of data arriving from high-speed data streams with fluctuating arrival rates for days, months, or even indefinitely. At times EMAs may not be able to process the sleuth of incoming data and thus to monitor all objects. The stringent requirement of processing incoming tuples within their response time demands efficient resource allocation. In addition to support complex queries requires large operator states. Thus, EMAs can be extremely resource intensive. To further complicate matters, to determine the significance of tuples may introduce additional costs at a time when resources are already at a premium. During such times it is crucial for an EMA to consider the significance of tuples when allocating resources and to proactively pull certain tuples forward ahead of others within the query pipeline. In this study we focused in depth on exploring solutions that support preferential resource allocation in stream processing systems. The major contributions of this work are:

Query Language and Algebra: We developed the promotion continuous query language (P-CQL) that supports the specification of multi-tiered monitoring criteria as part of a query. We designed the proactive promotion query algebra that supports the assignment and propagation of tuple significance.

Online Query Optimization: Due to the large complexity of the search space, locating the optimal query plan that will preferentially allocate resouces is not feasible in practice. To address this issue, we propose an optimization strategy that prunes the search space to efficiently locate the optimal query plan.

Query Execution Architecture: To promote specific tuples at runtime given the current system conditions, we efficiently adjusts when, which, and where tuples are preferentially dedicated resources in the query plan. Our architecture is configurable online, so that how resources are allocated can be adjusted without requiring any plan infrastructure changes.

K. Works, and E. A. Rundensteiner. "The Proactive Promotion Engine" ICDE 2011. (pdf)
K. Works, and E. A. Rundensteiner. "Preferential Resource Allocation in Stream Processing Systems", International Journal of Cooperative Information Systems 2014
K. Works and E. Rundeisteiner. "Reliable Aggregation over Prioritized Data Streams". Transactions on Large-Scale Data and Knowledge-Centered Systems 2014
K. Works and E. Rundensteiner. "Utilizing Dynamic Precedence Criteria To Ensure the Production of Critical Results from Big Data Streams". BigData 2014
Best paper award for NSF-supported research at the Big Data Conference, Harvard University 2014.