Data Warehouse Maintenance over Dynamic Distributed Information Sources

Elke A. Rundensteiner
Computer Science Department, Worcester Polytechnic Institute
100 Institute Rd, Worcester, MA 01609
Phone: (508) 831-5815, Fax: (508) 831-5776
E-mail: rundenst@cs.wpi.edu

WWW Page
http://davis.wpi.edu/dsrg/EVE

Project Award Information
Award Number: IIS-9988776; Duration: 9/20/2000 - 9/19/2003. Title: Data Warehouse Maintenance over Dynamic Distributed Information Sources

Keywords
Derived Databases, Information Integration, Data Warehousing, View Derivation and Synchronization, Schema Evolution, Interoperability.

Project Summary
As digital repositories of information spring up everywhere and interconnectivity between computers around the world is established, the construction and maintenance of data warehouses (views) over such distributed data sources has been recognized as being of great importance for modern applications. However, data sources continuously evolve: they modify not only their content but also their query capabilities and schema, and they join or leave the environment. This research addresses the critical and timely problem of how to rewrite view definitions synchronously with the schema changes of data sources, which we call the view synchronization problem. The Evolvable View Environment (EVE) solution is based on a preference model for view evolution and a meta model for capturing data source interrelationships. Algorithms that exploit different types of meta data and view evolution preferences to synchronize views are designed. Measures for ranking alternative view rewritings based on their quality are established. To assure survivability in any dynamic environment, a strategy for coordinating view synchronization under schema changes and the more traditional view maintenance under data changes is incorporated. Implementation of the EVE software system serves as a proof of concept and as an experimental test bed. Experimental evaluations are conducted to assess performance, applicability, and quality of the rewritings. A case study on applying the technology to web and E-commerce applications helps to determine the applicability as well as the limitations of the technology. In summary, this project will advance the state of the art in data warehousing, a core area of database technology, and its benefits are potentially far-reaching because it provides techniques and software tools for simplifying access to large sets of dynamic distributed data sources.

Publications and Products
A prototype system called EVE, which has been the foundation for this project, was demonstrated at ACM SIGMOD'99, and a new prototype system called Dyda will be presented at ACM SIGMOD'2001 (May 2001). Its source code is expected to be released on our website in Summer 2001.

Selected recent publications can all be retrieved from http://davis.wpi.edu/dsrg/.

Project Impact
Impact on Human Resources. This project has partially funded several Ph.D. students in my database research group: Andreas Koeller, Xin Zhang, and Songting Chen. Some Master's students and undergraduate students, such as Brian Murphy, have also been involved.
Impact on education and curriculum development at all levels. This project has enriched education at the undergraduate level by providing small projects in which we can actively involve undergraduate students via REUs and directed study projects. It has also enhanced our graduate courses at WPI, e.g., the Advanced Database course (CS561) as well as the Special Topics course on Web Databases (CS525).
Industry. We have had several interactions and exchanges of ideas related to this project with others, most notably with Dr. Arnon Rosenthal at Mitre Corporation.
Impact. Sustainable integration of data sources that survives even the evolution and migration of those sources, including the transformation of their schemas, is a critical problem faced by the software industry. Our project promises to provide automated solutions to this problem.

Goals, Objectives, and Targeted Activities
Selected Accomplishments
The construction and maintenance of data warehouses (views) over distributed data sources has been recognized as being of great importance for modern applications. However, such modern data sources are often dynamic; that is, they modify not only their content but also their query capabilities and schema, and they join or leave the environment. This research addresses the critical and timely problem of how to rewrite view definitions synchronously with the schema changes of data sources, which we call the view synchronization problem.
EVE Framework. The Evolvable View Environment (EVE) solution is based on a preference model for view evolution and a meta model for capturing data source interrelationships. Some initial algorithms that exploit different types of meta data and view evolution preferences to synchronize views have been designed. Measures for ranking alternative view rewritings based on both their maintenance cost and their quality have been established.
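As an illustration of how evolution preferences and source interrelationships can drive a view rewriting, the following minimal Python sketch handles a single schema change, the deletion of one source attribute. The names `ViewAttr`, `dispensable`, `replaceable`, and the `replacements` mapping (a stand-in for the meta model) are assumptions made for this example; they are not the interfaces of the EVE system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewAttr:
    source: str        # "Relation.attribute" that the view column is drawn from
    dispensable: bool  # preference: the column may be dropped from the view
    replaceable: bool  # preference: the column may be taken from another source


def synchronize(view, deleted, replacements):
    """Rewrite `view` (a list of ViewAttr) after source attribute `deleted` disappears.

    `replacements` maps a source attribute to an equivalent attribute elsewhere
    (hypothetical stand-in for the meta model of source interrelationships).
    Returns the rewritten attribute list, or None if the view cannot be preserved.
    """
    rewritten = []
    for attr in view:
        if attr.source != deleted:
            rewritten.append(attr)  # unaffected column, keep as-is
        elif attr.replaceable and deleted in replacements:
            # Substitute the equivalent attribute, keeping the same preferences.
            rewritten.append(
                ViewAttr(replacements[deleted], attr.dispensable, attr.replaceable)
            )
        elif attr.dispensable:
            continue  # the view definer accepts losing this column
        else:
            return None  # indispensable column with no legal replacement
    return rewritten


view = [
    ViewAttr("Customer.name", dispensable=False, replaceable=True),
    ViewAttr("Customer.phone", dispensable=True, replaceable=False),
]
meta = {"Customer.name": "Client.full_name"}
print(synchronize(view, "Customer.name", meta))
```

A fuller treatment would generate several legal rewritings and rank them by quality and maintenance cost; this sketch returns only the first one it finds.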
TxnWrap Wrapper Architecture. To assure scalability and survivability in any dynamic environment, a transaction-based strategy for coordinating view synchronization under schema changes and the more traditional view maintenance under data changes has been proposed. This transactional approach uses the concept of a "DWMS_Transaction" to encapsulate the complete data warehouse maintenance process. With the help of an additional level of materialization in special-purpose source wrappers, we propose a multiversion concurrency control strategy that guarantees a consistent view of the data in the information source space inside each DWMS_Transaction, thus removing the maintenance anomaly problem. This integrated solution, called "TxnWrap", now achieves at least strong consistency of DW maintenance even under schema changes. TxnWrap is complementary to previous maintenance algorithms for data updates (DUs) and schema changes (SCs), because it removes concurrency considerations from these maintenance algorithms. Our approach also makes few cooperation assumptions about the information sources. We have implemented a first prototype of the TxnWrap solution and succeeded in plugging it into our existing data warehousing testbed at WPI. Experiments to assess its performance will need to be conducted in the coming year.
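The role of the extra materialization in the wrapper can be sketched as a small multiversion store: every source update receives a version number, and a maintenance transaction reads the source state as of the update it is processing, regardless of later concurrent updates. This is a simplified illustration under assumed names (`VersionedWrapper`, `apply_update`, `read`, `snapshot`), not the TxnWrap implementation.

```python
from collections import defaultdict


class VersionedWrapper:
    """Keeps every version of each tuple so that a maintenance transaction
    can read the source state as of a chosen version."""

    def __init__(self):
        self._versions = defaultdict(list)  # key -> [(version, value or None)]
        self._clock = 0                     # last version number handed out

    def apply_update(self, key, value):
        """Record a source update (value=None marks a delete); return its version."""
        self._clock += 1
        self._versions[key].append((self._clock, value))
        return self._clock

    def read(self, key, as_of):
        """Return the value of `key` visible at version `as_of`, or None."""
        visible = None
        for version, value in self._versions.get(key, []):
            if version > as_of:
                break  # histories are appended in version order
            visible = value
        return visible

    def snapshot(self, as_of):
        """Return all tuples that exist at version `as_of`."""
        result = {}
        for key in self._versions:
            value = self.read(key, as_of)
            if value is not None:
                result[key] = value
        return result


wrapper = VersionedWrapper()
v1 = wrapper.apply_update("t1", ("Smith", "Boston"))
v2 = wrapper.apply_update("t1", ("Smith", "Worcester"))  # concurrent later update
# A transaction handling update v1 still sees the state that v1 produced.
print(wrapper.snapshot(v1))
```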

Targeted Accomplishments


TxnWrap Optimization. We plan to optimize the TxnWrap solution towards dynamic data integration in several ways. First, we plan to reduce the storage space used by each information source wrapper, since at the moment it stores all versions of tuples modified at the information source. We will explore filtering of the wrapper database based on selection and projection conditions, as well as version-based clean-up once a change has been committed to the data warehouse. Second, we intend to develop a parallel scheduler for handling the maintenance process upon notification of a source update. This should significantly improve the performance of the overall system, allowing us to better exploit the computational resources available at each source instead of sequentially pacing the maintenance process.
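One possible form of version-based clean-up is sketched below in Python. It assumes that once the warehouse has committed all maintenance up to version `committed`, no later maintenance transaction needs a source state older than that version, so each tuple's history can be cut down to the newest version at or before `committed` plus everything after it. The function name `prune_versions` and the dictionary layout are hypothetical.

```python
def prune_versions(versions, committed):
    """Drop tuple versions that can no longer be read.

    `versions` maps a tuple key to a list of (version, value) pairs sorted by
    version. For each key, keep the newest entry with version <= committed
    (it is still the visible state at `committed`) and all later entries.
    """
    pruned = {}
    for key, history in versions.items():
        older = [entry for entry in history if entry[0] <= committed]
        newer = [entry for entry in history if entry[0] > committed]
        kept = older[-1:] + newer
        if kept:
            pruned[key] = kept
    return pruned


store = {
    "t1": [(1, "a"), (2, "b"), (5, "c")],
    "t2": [(3, "x")],
}
print(prune_versions(store, committed=4))
```

Filtering by the view's selection and projection conditions would be a separate step applied before versions are stored at all.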
Integration of Schematically Heterogeneous Data Sources. We plan to address the integration of sources that are semantically equivalent (i.e., whose states can be mapped onto each other by an isomorphism) but schematically heterogeneous. While two such data sources may capture the same information, one database may model the information as tuples (data) while the other may store it in attribute or relation names (schema). After initial solutions involving ad-hoc programs, declarative mechanisms for supporting such powerful source restructuring have been devised recently, for example an SQL query language extension called SchemaSQL. We now want to explore the integration of such sources into our system via special SchemaSQL wrappers. More importantly, the problem of maintaining such restructuring views over schematically heterogeneous sources, once established, must be explored. We plan to develop strategies for incremental maintenance of such schema-restructuring views, to implement them, to assess their performance by comparative studies, and to integrate the final wrappers into our data warehousing system.
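The difference between storing information as data and storing it in attribute names can be illustrated with a small restructuring example. The sketch below is plain Python, not SchemaSQL syntax; the function names and the stock-price sample data are invented for illustration. `pivot_to_schema` turns the values of one column into attribute names, and `unpivot_to_data` maps them back, showing that both representations carry the same information.

```python
def pivot_to_schema(rows, key_attr, name_attr, value_attr):
    """Turn the data values found in `name_attr` into attribute names."""
    result = {}
    for row in rows:
        out = result.setdefault(row[key_attr], {key_attr: row[key_attr]})
        out[str(row[name_attr])] = row[value_attr]
    return list(result.values())


def unpivot_to_data(rows, key_attr, name_attr, value_attr):
    """Inverse restructuring: turn attribute names back into data values."""
    flat = []
    for row in rows:
        for attr, value in row.items():
            if attr != key_attr:
                flat.append({key_attr: row[key_attr], name_attr: attr, value_attr: value})
    return flat


# Source A models the ticker symbol as data ...
as_data = [
    {"date": "2001-03-01", "ticker": "IBM", "price": 100},
    {"date": "2001-03-01", "ticker": "SUNW", "price": 20},
    {"date": "2001-03-02", "ticker": "IBM", "price": 101},
]
# ... while source B would model each ticker symbol as an attribute name.
as_schema = pivot_to_schema(as_data, "date", "ticker", "price")
print(as_schema)
print(unpivot_to_data(as_schema, "date", "ticker", "price") == as_data)
```

Incremental maintenance of such a view is harder than in the ordinary relational case because a data update at one source (a new ticker symbol) can appear as a schema change (a new attribute) in the restructured view.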

Project References
See the "Publications and Products" section above for recent project reports, software, and demonstrations. Those and related products can also be downloaded and/or viewed from our Database Systems Research Group homepage: http://davis.wpi.edu/dsrg

Area Background
The area of this project is database views, in particular as studied for relational database systems. Views are named stored queries. View mechanisms serve the purposes of database customization, security and access rights, and information derivation and integration. One important issue for database views is maintenance, that is, the incremental modification of the database view, if materialized, whenever the underlying data source is updated.
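Incremental maintenance can be illustrated with a minimal Python sketch of a materialized selection-projection view. Instead of recomputing the view from the source, each source insert or delete adjusts the stored result directly; a per-tuple derivation count handles the case where several source rows project onto the same view tuple. The class and method names are assumptions made for this example.

```python
from collections import Counter


class MaterializedView:
    """Materialized SELECT columns FROM source WHERE predicate(row)."""

    def __init__(self, predicate, columns):
        self.predicate = predicate
        self.columns = columns
        self.rows = Counter()  # projected tuple -> number of source rows deriving it

    def _project(self, row):
        return tuple(row[c] for c in self.columns)

    def on_insert(self, row):
        """Propagate a source insert to the view."""
        if self.predicate(row):
            self.rows[self._project(row)] += 1

    def on_delete(self, row):
        """Propagate a source delete; drop the view tuple when its count reaches zero."""
        if self.predicate(row):
            key = self._project(row)
            self.rows[key] -= 1
            if self.rows[key] <= 0:
                del self.rows[key]

    def contents(self):
        return set(self.rows)


view = MaterializedView(lambda r: r["price"] > 10, ["name"])
view.on_insert({"name": "widget", "price": 20})
view.on_insert({"name": "widget", "price": 30})
view.on_insert({"name": "gadget", "price": 5})   # filtered out by the predicate
view.on_delete({"name": "widget", "price": 20})  # one derivation of "widget" remains
print(view.contents())
```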

Area References

Potential Related Projects. Derived Views, Database Query Languages.