Data Warehouse Maintenance over Dynamic Distributed Information Sources

Elke A. Rundensteiner
Computer Science Department, Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5815, Fax: (508) 831-5776
E-mail: rundenst@cs.wpi.edu

WWW Page
http://davis.wpi.edu/dsrg/EVE

Project Award Information
Award Number: IIS-9988776; Duration: 9/20/2000 - 9/19/2003. Title: Data Warehouse Maintenance over Dynamic Distributed Information Sources

Keywords
Derived Databases, Information Integration, Data Warehousing, View Derivation and Synchronization, Schema Evolution, Interoperability.

Project Summary
As digital repositories of information spring up everywhere and interconnectivity between computers around the world is established, the construction and maintenance of data warehouses (views) over such distributed data sources has been recognized as being of great importance for modern applications. However, data sources continuously evolve by modifying not only their content but also their query capabilities and schema, and by joining or leaving the environment. This research addresses the critical and timely problem of how to rewrite view definitions synchronously with the schema changes of data sources, coined the view synchronization problem. The Evolvable View Environment (EVE) solution is based on a preference model for view evolution and a meta model for capturing data source interrelationships. Algorithms that exploit different types of meta data and view evolution preferences to synchronize views are designed. Measures for ranking alternative view rewritings based on their quality are established. To assure survivability in any dynamic environment, a strategy for coordinating view synchronization under schema changes and the more traditional view maintenance under data changes is incorporated. Implementation of the EVE software system serves as a proof of concept and as an experimental test bed. Experimental evaluations are conducted to assess performance, applicability, and quality of the rewritings. A case study on applying the technology to web and E-commerce applications helps to determine the applicability as well as the limitations of the technology. This project will not only advance the state of the art in data warehousing, a core area of database technology, but also provide techniques and software tools for simplifying access to large sets of dynamic distributed data sources.

Publications and Products
The EVE prototype system, which has been the foundation for this project, was demonstrated at ACM SIGMOD'99, and a new system, Dyda, was presented at ACM SIGMOD'2001 (May 2001).

The project's publications are available at http://davis.wpi.edu/dsrg/.

Project Impact
Impact on Human Resources. This project has partially funded several Ph.D. students in the database research group: Andreas Koeller, Xin Zhang, and Songting Chen. Andreas Koeller graduated with a Ph.D. in Dec 2001 and has accepted a faculty position at Montclair State University, NJ. Several Master's students and undergraduate students have also been involved, including Brian Murphy, Bin Liu, and Jehad El-Sabbad.
Impact on education and curriculum development at all levels.
Impact. Sustainable integration of data sources that survives even the evolution and migration of those sources, including the transformation of their schema, is a critical problem faced by the software industry. Our project promises to provide automated solutions to this problem.

Goals, Objectives, and Targeted Activities
Selected Accomplishments
The construction and maintenance of data warehouses (views) over distributed data sources has been recognized as being of great importance for modern applications. However, such modern data sources are often dynamic; that is, they modify not only their content but also their query capabilities and schema, and they join or leave the environment. This research addresses the critical and timely problem of how to rewrite view definitions synchronously with the schema changes of data sources, coined the view synchronization problem.
EVE Framework. The Evolvable View Environment (EVE) solution is based on a preference model for view evolution and a meta model for capturing data source interrelationships. Some initial algorithms that exploit different types of meta data and view evolution preferences to synchronize views have been designed. Measures for ranking alternative view rewritings based on both their maintenance cost and their quality have been established.
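As a rough illustration of what synchronizing a view with a schema change can mean, the sketch below rewrites a view's attribute list after one source attribute is deleted. It is not the EVE algorithm: the function name `synchronize`, the `replacements` map (standing in for meta data about source interrelationships) and the `dispensable` set (standing in for view evolution preferences) are all hypothetical names chosen for this example.

```python
def synchronize(view_attrs, deleted, replacements, dispensable):
    """Rewrite a view's attribute list after `deleted` disappears from a source.

    replacements: hypothetical meta data mapping an attribute to an equivalent
                  attribute offered by another source.
    dispensable:  hypothetical preference set of attributes the view may drop.
    Returns the rewritten attribute list, or None if the view cannot be kept.
    """
    rewritten = []
    for attr in view_attrs:
        if attr != deleted:
            rewritten.append(attr)                 # unaffected attribute
        elif attr in replacements:
            rewritten.append(replacements[attr])   # substitute an equivalent
        elif attr in dispensable:
            continue                               # preference allows dropping
        else:
            return None                            # view cannot be preserved
    return rewritten


view = ["Customer.name", "Customer.phone", "Order.total"]
replaced = synchronize(view, "Customer.phone",
                       {"Customer.phone": "Contact.phone"}, set())
dropped = synchronize(view, "Order.total", {}, {"Order.total"})
undefined = synchronize(view, "Customer.name", {}, set())
```

Here the first call substitutes an equivalent attribute, the second drops a dispensable one, and the third reports that no legal rewriting exists under the given preferences.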
TxnWrap Wrapper Architecture. To assure scalability and survivability in any dynamic environment, a transaction-based strategy for coordinating view synchronization under schema changes and the more traditional view maintenance under data changes has been proposed. This transactional approach uses the concept of a "DWMS_Transaction" to encapsulate the complete data warehouse maintenance process. With the help of an additional level of materialization in special-purpose source wrappers, we propose a multiversion concurrency control strategy that guarantees a consistent view of the information source space inside each DWMS_Transaction, thus removing the maintenance anomaly problem. This integrated solution, called "TxnWrap", now achieves at least strong consistency of data warehouse (DW) maintenance even under schema changes. TxnWrap is complementary to previous maintenance algorithms for data updates (DUs) and schema changes (SCs), because it removes concurrency considerations from these maintenance algorithms. Our approach also makes few assumptions about the cooperation of information sources. We have implemented a first prototype of the TxnWrap solution and successfully plugged it into our existing data warehousing testbed at WPI. Experiments indicate its performance benefits in terms of robust, steady behavior as update frequency increases.
TxnWrap Optimization via Parallel Scheduler. We have optimized the TxnWrap solution towards dynamic data integration in several ways. First, we have developed a storage space optimization for each information source wrapper by filtering the wrapper database based on both selection and projection conditions. We have also applied version-based clean-up to each wrapper once a change has been committed to the data warehouse. Second, we have developed a parallel scheduler for handling the maintenance process upon notification of a source update. This significantly improves the performance of the overall system, allowing us to better exploit the computational resources available at each source instead of sequentially spacing out the maintenance process.
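The idea of reading a consistent source state from a versioned wrapper can be sketched as follows. This is a minimal illustration of multiversion reads in general, not the TxnWrap implementation; the class `VersionedWrapper`, its methods, and the integer version counter are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedWrapper:
    """Hypothetical source wrapper that keeps every tuple version it has seen."""
    # Each row is [key, value, created_at, deleted_at]; deleted_at is None if live.
    rows: list = field(default_factory=list)
    clock: int = 0

    def insert(self, key, value):
        self.clock += 1
        self.rows.append([key, value, self.clock, None])
        return self.clock

    def delete(self, key):
        self.clock += 1
        for row in self.rows:
            if row[0] == key and row[3] is None:
                row[3] = self.clock      # mark as deleted, keep the old version
        return self.clock

    def snapshot(self, as_of):
        """Return the source state visible at version `as_of`."""
        return {key: value
                for key, value, created, deleted in self.rows
                if created <= as_of and (deleted is None or deleted > as_of)}


wrapper = VersionedWrapper()
v1 = wrapper.insert("p1", "pen")
v2 = wrapper.insert("p2", "pencil")
v3 = wrapper.delete("p1")            # a later, concurrent source update

state_at_v2 = wrapper.snapshot(v2)   # what a transaction started at v2 reads
state_at_v3 = wrapper.snapshot(v3)
```

A maintenance transaction that began at version `v2` keeps reading `state_at_v2` even after the delete at `v3`, which is the property that lets the maintenance logic ignore concurrent updates.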
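The storage optimization mentioned above, keeping in the wrapper only what the view's selection and projection conditions require, can be sketched as below. The function `filter_for_view`, the sample rows, and the price predicate are hypothetical and serve only to illustrate the filtering step.

```python
def filter_for_view(rows, needed_columns, predicate):
    """Keep only rows that pass the view's selection, and only projected columns."""
    return [{col: row[col] for col in needed_columns}
            for row in rows
            if predicate(row)]


source_rows = [
    {"id": 1, "name": "pen", "price": 2, "stock": 40},
    {"id": 2, "name": "desk", "price": 150, "stock": 3},
    {"id": 3, "name": "lamp", "price": 30, "stock": 12},
]

# Assumed view: SELECT id, price FROM source WHERE price >= 30
wrapper_rows = filter_for_view(source_rows, ["id", "price"],
                               lambda row: row["price"] >= 30)
```

Only two of the three rows and two of the four columns are retained, so the wrapper stores less data than the full source relation.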

Targeted Accomplishments


Continued TxnWrap Optimization via Batching. We plan to continue to optimize our TxnWrap solution. We have found that in some applications it is not critical to refresh the data warehouse after each and every update. We will therefore explore the application of batching strategies to our system. The novelty lies in how to handle batching when a single batch contains a mixture of data updates and schema changes. We expect that this will significantly improve the performance of the overall system, allowing us to better exploit the computational resources available at each source instead of processing the maintenance for each individual update.
Integration of Schematically Heterogeneous Data Sources. We have begun to address the integration of sources that are semantically equivalent (i.e., whose states can be mapped onto each other by an isomorphism) but schematically heterogeneous. While two such data sources may capture the same information, one database may model the information as tuples (data) while the other may store it in attribute or relation names (schema). After initial solutions involving ad-hoc programs, declarative mechanisms for supporting such powerful source restructuring have been devised recently, for example an SQL query language extension called SchemaSQL. We are exploring the integration of such sources into our system via special SchemaSQL wrappers. More importantly, the problem of maintaining such restructuring views over schematically heterogeneous sources, once established, must be explored. We are developing strategies for the incremental maintenance of such schema-restructuring views; we plan to implement them, assess their performance through comparative studies, and integrate the final wrappers into our data warehousing system. Experimental studies assessing the efficiency of such a system will also be undertaken.
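One simple way to think about batching a mixed update stream is sketched below: consecutive data updates are grouped, while each schema change stays a unit of its own and acts as a batch boundary. This is an assumed policy for illustration only, not the strategy the project will adopt; the function `batch_updates` and the "DU"/"SC" tags on the log entries are hypothetical.

```python
from itertools import groupby


def batch_updates(updates):
    """Group consecutive data updates (DU); keep each schema change (SC) separate."""
    batches = []
    for kind, group in groupby(updates, key=lambda update: update[0]):
        items = list(group)
        if kind == "DU":
            # One batch for the whole run of data updates.
            batches.append(("DU", [payload for _, payload in items]))
        else:
            # Each schema change is processed individually.
            batches.extend(("SC", [payload]) for _, payload in items)
    return batches


log = [
    ("DU", "insert r1"),
    ("DU", "insert r2"),
    ("SC", "drop attribute A"),
    ("DU", "delete r1"),
]
batches = batch_updates(log)
```

With this policy the four log entries are processed in three maintenance steps rather than four, and no data update is reordered across a schema change.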
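The difference between storing information as data and storing it in the schema can be made concrete with a small sketch. It uses plain Python rather than SchemaSQL syntax; the function names and the sales example are hypothetical.

```python
def fold_into_schema(rows):
    """Turn (item, region, sales) tuples into one record per item
    whose attribute names are the region values."""
    wide = {}
    for item, region, sales in rows:
        wide.setdefault(item, {})[region] = sales
    return wide


def unfold_into_data(wide):
    """Inverse restructuring: attribute names become data values again."""
    return sorted((item, region, sales)
                  for item, columns in wide.items()
                  for region, sales in columns.items())


# Source A models regions as data values.
tall = [("pen", "east", 10), ("pen", "west", 7), ("desk", "east", 2)]

# Source B models the same information with regions as attribute names.
wide = fold_into_schema(tall)
round_trip = unfold_into_data(wide)
```

Because the two functions invert each other on this input, the two representations carry the same information, which is the sense in which such sources are semantically equivalent but schematically heterogeneous.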

Project References
See above "Products" category for a listing of recent project reports, software and demonstrations. Those and related products can also be downloaded and/or viewed from our Database Systems Research Group homepage: http://davis.wpi.edu/dsrg

Area Background
The area of this project is database views, in particular as studied for relational database systems. Views are named stored queries. View mechanisms serve the purposes of database customization, security and access rights, and information derivation and integration. One important issue for database views is maintenance, that is, the incremental modification of a materialized view whenever the underlying data source is updated.
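Incremental maintenance can be illustrated with a small materialized aggregate view that is adjusted per update instead of being recomputed from the source. This is a generic textbook-style sketch; the class `SumView` and its methods are hypothetical names.

```python
class SumView:
    """Materialized view: total amount per customer, maintained incrementally."""

    def __init__(self):
        self.groups = {}  # customer -> (running sum, number of contributing rows)

    def apply(self, customer, amount, sign):
        """Apply one source update: sign=+1 for an insert, sign=-1 for a delete."""
        total, count = self.groups.get(customer, (0, 0))
        total += sign * amount
        count += sign
        if count == 0:
            self.groups.pop(customer, None)   # last row of the group is gone
        else:
            self.groups[customer] = (total, count)

    def totals(self):
        return {customer: total for customer, (total, _) in self.groups.items()}


view = SumView()
view.apply("ann", 10, +1)
view.apply("ann", 5, +1)
view.apply("bob", 3, +1)
view.apply("ann", 10, -1)      # delete ann's first order
after_delete = view.totals()
view.apply("bob", 3, -1)       # bob's only order is deleted
after_second_delete = view.totals()
```

The row count per group is kept alongside the sum so that a group is removed only when its last contributing row is deleted, not merely when its sum happens to reach zero.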

Area References

Potential Related Projects. Derived Views, Database Query Languages.