WWW Page
http://davis.wpi.edu/dsrg/EVE
Project Award Information
Award Number: IIS-9988776; Duration: 9/20/2000 - 9/19/2003.
Title: Data Warehouse Maintenance over Dynamic Distributed Information Sources
Keywords
Derived Databases, Information Integration,
Data Warehousing, View Derivation and Synchronization, Schema
Evolution, Interoperability.
Project Summary
As digital repositories of information are springing up everywhere and
interconnectivity between computers around the world is being
established, the construction and maintenance of data warehouses
(views) over such distributed data sources has been recognized as
being of great importance for modern applications. However, data sources
continuously evolve by modifying not only their content but also their
query capabilities and schema and by joining or leaving the
environment. This research addresses this critical and timely problem
of how to rewrite view definitions synchronously with the schema
changes of data sources, coined the view synchronization problem. The
Evolvable View Environment (EVE) solution is based on a preference
model for view evolution and a meta model for capturing data source
interrelationships. Algorithms that exploit different types of meta
data and view evolution preferences to synchronize views are designed.
Measures for ranking alternative view rewritings based on their
quality are established. To assure survivability in any dynamic
environment, a strategy for coordinating view synchronization under
schema changes and the more traditional view maintenance under data
changes is incorporated. Implementation of the EVE software system
serves as a proof of concept and as an experimental test bed.
Experimental evaluations are conducted to assess performance,
applicability, and quality of the rewritings. A case study on
applying the technology to web and E-commerce applications helps to
determine the applicability as well as the limitation of the
technology. This project will not only advance the
state of the art in data warehousing, a core area of database
technology, but will also provide
techniques and software tools for simplifying access to large sets of
dynamic distributed data sources.
Publications and Products
The publications, also at
http://davis.wpi.edu/dsrg/, include:
Project Impact
A prototype system, EVE,
which has been the foundation for this project,
was demonstrated at ACM SIGMOD'99,
and a new system, Dyda,
was presented at ACM SIGMOD'2001 (May 2001).
Impact on Human Resources.
This project has partially funded several Ph.D. students
in the database research groups: Andreas Koeller,
Xin Zhang, and Songting Chen.
Andreas Koeller graduated with a Ph.D. in Dec 2001
and has since accepted a faculty position at MontClaire University, NY.
Several Master's
students and undergraduate students have also been involved,
including Brian Murphy, Bin Liu, and Jehad El-Sabbad.
Impact on education and curriculum development at
all levels.
Impact.
Sustainable integration of
data sources that survives even
the evolution and migration of those sources, including
the transformation of their schemas,
is a critical problem faced by the software industry. Our
project promises to provide automated solutions to this problem.
Goals, Objectives, and Targeted Activities
Targeted Accomplishments
Project References
Area Background
Area References
Potential Related Projects. Derived Views, Database Query Languages.
Selected Accomplishments
The construction and maintenance of data warehouses
(views) over distributed data sources has been recognized as being of
great importance for modern applications. However,
such modern data sources
are often dynamic; that is, they modify not only their content but also their
query capabilities and schema, and they join or leave the
environment. This research addresses this critical and timely problem
of how to rewrite view definitions synchronously with the schema
changes of data sources, coined the view synchronization problem.
EVE Framework.
The Evolvable View Environment (EVE) solution is based on a preference
model for view evolution and a meta model for capturing data source
interrelationships. Some initial
algorithms that exploit different types of meta
data and view evolution preferences to synchronize views
have been designed.
Measures
for ranking alternative view rewritings based on both their
maintenance cost and their
quality have been established.
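As a rough illustration of view synchronization, the Python sketch
below rewrites a view after a source attribute is deleted. All names
in it (ViewAttr, View, synchronize, the dispensable and replaceable
flags, and the equivalents map that stands in for the meta model) are
hypothetical and are not EVE's actual interfaces. The sketch assumes
only that meta knowledge can name a substitute attribute and that
preferences state whether an attribute may be dropped or replaced.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ViewAttr:
    source: str          # "Relation.attribute" that the view selects
    dispensable: bool    # preference: may the view drop this attribute?
    replaceable: bool    # preference: may a substitute be used instead?


@dataclass
class View:
    name: str
    attrs: List[ViewAttr]


def synchronize(view: View, deleted: str,
                equivalents: Dict[str, str]) -> Optional[View]:
    """Return a rewritten view after `deleted` disappears, or None if
    no rewriting satisfies the view's preferences."""
    new_attrs: List[ViewAttr] = []
    for attr in view.attrs:
        if attr.source != deleted:
            new_attrs.append(attr)              # unaffected attribute
        elif attr.replaceable and deleted in equivalents:
            # Meta knowledge names an equivalent attribute elsewhere.
            new_attrs.append(ViewAttr(equivalents[deleted],
                                      attr.dispensable, attr.replaceable))
        elif attr.dispensable:
            continue                            # drop it from the view
        else:
            return None                         # view cannot be preserved
    return View(view.name, new_attrs)


view = View("CustomerInfo", [
    ViewAttr("Customer.name", dispensable=False, replaceable=False),
    ViewAttr("Customer.phone", dispensable=True, replaceable=True),
])
meta = {"Customer.phone": "Directory.phone"}    # assumed meta knowledge
rewritten = synchronize(view, "Customer.phone", meta)
print([a.source for a in rewritten.attrs])
```

When several substitutes exist, each would yield a different legal
rewriting; ranking those alternatives by quality and maintenance cost
is outside the scope of this sketch.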
TxnWrap Wrapper Architecture.
To assure scalability and survivability in any dynamic
environment, a transaction-based
strategy for coordinating view synchronization under
schema changes and the more traditional view maintenance under data
changes has been proposed.
This transactional approach
uses the concept of a "DWMS_Transaction" to encapsulate
the complete data warehouse maintenance process. With the help of an
additional level of materialization in special-purpose source
wrappers, we propose a multiversion concurrency control strategy that
guarantees a consistent view of the information source space data
inside each DWMS_Transaction, thus removing the maintenance anomaly
problem. This integrated solution, called "TxnWrap", achieves at
least strong consistency of DW maintenance even under schema
changes. TxnWrap is complementary to previous maintenance algorithms
for data updates (DUs) and schema changes (SCs), because it removes
concurrency considerations from
these maintenance algorithms. Our approach also makes few
cooperation assumptions about the information sources. We have implemented
a first prototype of the
TxnWrap solution and successfully plugged it into our
existing data warehousing testbed at WPI. Experiments
indicate its performance benefits in terms of robust, steady
behavior as the update frequency increases.
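The multiversion idea can be sketched in a few lines of Python. This
is a minimal illustration under stated assumptions, not the TxnWrap
implementation: the class VersionedWrapper and its methods
apply_update and read_as_of are hypothetical names, tuples are keyed
by an assumed primary key, and a value of None marks a deletion. The
point is that a maintenance transaction can read the source state as
of the version that triggered it, regardless of later updates.

```python
from typing import Dict, List, Optional, Tuple


class VersionedWrapper:
    """Keeps every version of each source tuple, keyed by primary key."""

    def __init__(self) -> None:
        self.version = 0
        # key -> list of (version, value); a value of None marks a delete
        self.history: Dict[str, List[Tuple[int, Optional[dict]]]] = {}

    def apply_update(self, key: str, value: Optional[dict]) -> int:
        """Record a source update and return the version it created."""
        self.version += 1
        self.history.setdefault(key, []).append((self.version, value))
        return self.version

    def read_as_of(self, version: int) -> Dict[str, dict]:
        """Return the source state as it was at `version`."""
        state: Dict[str, dict] = {}
        for key, versions in self.history.items():
            visible = [val for ver, val in versions if ver <= version]
            if visible and visible[-1] is not None:
                state[key] = visible[-1]
        return state


w = VersionedWrapper()
v1 = w.apply_update("c1", {"name": "Ann", "city": "Boston"})
v2 = w.apply_update("c1", {"name": "Ann", "city": "Worcester"})
v3 = w.apply_update("c1", None)        # the tuple is deleted at the source

# A maintenance transaction triggered by the first update still sees the
# state at v1, even though the source has changed twice since then.
print(w.read_as_of(v1))
print(w.read_as_of(v3))
```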
TxnWrap Optimization via Parallel Scheduler
We have optimized the
TxnWrap solution towards dynamic data integration
in several ways. First, we have developed a
storage space optimization for each information source wrapper
that filters the
wrapper database based on both selection and projection conditions.
We have also applied
version-based clean-up to each wrapper once a change has been committed
to the data warehouse.
Second, we have developed a parallel scheduler
for handling the maintenance process upon
notification of a source update. This significantly improves
the performance of the overall system, allowing us to better exploit
the computational resources available at each source
instead of running the
maintenance process sequentially.
Continued TxnWrap Optimization via Batching.
We plan to continue to optimize our
TxnWrap solution.
We have found that in some applications
it is not critical to refresh the data warehouse after
each and every update. We will hence explore
the application of batching strategies to our
system. The novel aspect will be how to handle batching
when given a mixture of data updates and schema changes
in one batch.
We expect that this will significantly improve
the performance of the overall system, allowing us to better exploit
the computational resources available at each source
instead of processing the
maintenance for each individual update.
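One simple batching policy, given here only as an assumption for
illustration and not as the strategy the project will adopt, is to
group consecutive data updates (DU) into one batch and to treat each
schema change (SC) as its own step, so that a batch never mixes tuples
from before and after a schema change. The function make_batches and
the (kind, payload) encoding are hypothetical.

```python
from itertools import groupby
from typing import List, Tuple

Update = Tuple[str, str]  # (kind, payload); kind is "DU" or "SC"


def make_batches(queue: List[Update]) -> List[List[Update]]:
    """Group consecutive data updates; keep each schema change alone."""
    batches: List[List[Update]] = []
    for kind, run in groupby(queue, key=lambda u: u[0]):
        updates = list(run)
        if kind == "DU":
            batches.append(updates)                 # one pass for the run
        else:
            batches.extend([u] for u in updates)    # one step per change
    return batches


queue = [
    ("DU", "insert r1"),
    ("DU", "insert r2"),
    ("SC", "drop attribute phone"),
    ("DU", "delete r1"),
]
for batch in make_batches(queue):
    print(batch)
```

Under this policy the four queued updates are processed in three
maintenance steps instead of four.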
Integration of Schematically Heterogeneous
Data Sources.
We have begun to address the issue of integration of
sources that are semantically equivalent (i.e., whose states can be
mapped onto each other by an isomorphism) but schematically
heterogeneous. While two such data sources may capture the same
information, one database may model the information as tuples
(data) while the other may store it in attribute or relation
names (schema). After initial solutions involving ad-hoc
programs, declarative mechanisms for supporting such powerful source
restructuring have been devised recently, for example a SQL query
language extension called SchemaSQL.
We are exploring the system integration of
such sources into our system via special SchemaSQL wrappers.
More importantly, the problem of maintaining such
restructuring views over schematically heterogeneous sources,
once established, must be explored.
We are developing strategies for incremental
maintenance of such schema-restructuring views;
we plan to implement them, assess their performance through comparative
studies, and integrate the final wrappers into our
data warehousing system.
Experimental studies assessing the efficiency of such
a system will also be undertaken.
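To make the notion of schematic heterogeneity concrete, the sketch
below shows the same price information modeled two ways: once with
store names as data (tuples) and once with store names as attribute
names (schema). It is written in plain Python rather than SchemaSQL,
and the functions fold and unfold as well as the sample data are
hypothetical. That unfold inverts fold is what makes the two
representations equivalent in information content.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, float]          # (product, store, price)
Wide = Dict[str, Dict[str, float]]    # product -> {store name: price}


def fold(rows: List[Row]) -> Wide:
    """Move store names from data values into attribute names."""
    wide: Wide = {}
    for product, store, price in rows:
        wide.setdefault(product, {})[store] = price
    return wide


def unfold(wide: Wide) -> List[Row]:
    """Move store names from attribute names back into data values."""
    return sorted((product, store, price)
                  for product, columns in wide.items()
                  for store, price in columns.items())


rows = [
    ("pen", "StoreA", 1.5),
    ("pen", "StoreB", 1.25),
    ("ink", "StoreA", 4.0),
]
wide = fold(rows)
print(wide)
print(unfold(wide) == sorted(rows))
```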
See above "Products" category for a listing of recent
project reports, software and demonstrations. Those
and related products can also be downloaded and/or viewed
from our Database Systems Research Group homepage:
http://davis.wpi.edu/dsrg
The area of this project is database views,
in particular as studied for relational database systems.
Views are named stored queries.
View mechanisms serve the purpose of database customization,
security and access rights, and information derivation
and integration.
One important issue for database views
is maintenance, that is,
the incremental modification of the database
view, if materialized, whenever the underlying
data source is updated.
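A minimal sketch of incremental maintenance for a materialized
selection view follows. It assumes set semantics (no duplicate rows)
and a single base relation; the class MaterializedView, its methods
on_insert and on_delete, and the sample rows are hypothetical. Each
source update is applied to the stored view directly instead of
recomputing the view from the base data.

```python
from typing import Callable, Iterable, Set, Tuple

Row = Tuple[str, str]  # (name, city)


class MaterializedView:
    """Stores the rows of a selection view and updates them in place."""

    def __init__(self, predicate: Callable[[Row], bool],
                 base: Iterable[Row]) -> None:
        self.predicate = predicate
        # The only full computation: evaluate the view once at creation.
        self.rows: Set[Row] = {r for r in base if predicate(r)}

    def on_insert(self, row: Row) -> None:
        if self.predicate(row):       # only qualifying rows enter the view
            self.rows.add(row)

    def on_delete(self, row: Row) -> None:
        self.rows.discard(row)        # no effect if the row was not in view


def in_worcester(row: Row) -> bool:
    return row[1] == "Worcester"


base = {("Ann", "Boston"), ("Bob", "Worcester")}
view = MaterializedView(in_worcester, base)

# Apply two source updates and propagate each one incrementally.
base.add(("Cy", "Worcester"))
view.on_insert(("Cy", "Worcester"))
base.remove(("Bob", "Worcester"))
view.on_delete(("Bob", "Worcester"))

print(sorted(view.rows))
```

The incrementally maintained rows match what a full recomputation over
the updated base relation would return.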