Project Overview
Much meta-information about sources, such as the semantics of schema
objects, functional dependencies, or relationships between different
data sources, is not always explicitly available for an integration
effort. Often, only the schema is known (or can be queried), and the data
can be queried through some kind of interface (e.g., using a query
language such as SQL). However, many other kinds of meta-information
about sources are beneficial for performing meaningful information
integration.
One important class of meta-information is constraints that restrict
the possible states of an information source. Such constraints are
useful for determining relationships between sources and thus for
information integration. Some integration systems perform a
semi-automatic integration using such constraints. Such approaches,
but also manual or (hypothetical) fully automatic integration systems,
would benefit greatly from the availability of as much meta-information
as possible about the sources to be integrated. The manual search for
such constraints is tedious and often not feasible, in particular when
many related information sources are available or when large relations
(with many attributes) are to be compared for interrelationships.
Therefore, the question of whether meta-information can be discovered
automatically in otherwise unknown data sources is relevant.
Clearly, the existence of a constraint cannot be inferred from
an inspection of the data, as any suspected constraint may only hold
temporarily or accidentally. However, it is possible to gather
evidence for the existence of a suspected constraint (a process
usually referred to as discovery) by identifying
patterns in the data of the sources. The assumption
underlying this discovery is that if a large part of a database
supports a hypothesis about a constraint and there is no evidence to
refute the hypothesis, then this constraint is likely to exist in the
database.
While some types of constraint discovery have been studied to some
extent (for example, functional dependencies and various key
constraints), one important class of constraints, namely inclusion
dependencies, has received little treatment in the literature.
Inclusion dependencies express subset-relationships between databases
and are thus important indicators for redundancies between data
sources. Therefore, the discovery of inclusion dependencies is
important in the context of information integration. While inference
rules on such dependencies were derived early on (for example, by
Casanova (PODS'82) and by Mitchell (PODS'83)), the discovery of
inclusion dependencies has not yet been treated thoroughly. A paper by
Kantola et al. (Intl. J. of Inf. Sys. 1992) provides some initial ideas and a
rough complexity bound.
rough complexity bound. It has also been argued in the literature
that a new database normal form based on inclusion dependencies (IDNF)
would be beneficial for some applications. Furthermore, there is some
work on the discovery of related web sites, in which web sites with
their hyperlinks are modelled as graphs.
Description
Inclusion dependencies are rules between two tables R and S of the
form R[X] ⊆ S[Y], with X and Y attribute sets of R and S,
respectively. The discovery of inclusion dependencies is an
NP-complete problem, as Kantola et al. (Intl. J. of
Inf. Sys. 1992) have shown. The number of
potential inclusion dependencies between two tables is exponential in
the number of attributes in the two relations; however, the number of
interesting inclusion dependencies (i.e., dependencies that
are not subsumed by others or that actually express some semantic
relationship between the tables) is often quite small. Therefore, the
problem of finding those ``interesting'' dependencies is difficult,
but solvable as we will show.
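To make the size of this search space concrete, the candidate dependencies can be counted directly. The count below is a standard combinatorial argument, stated under the assumption that a candidate of arity k pairs k distinct attributes of one table with k distinct attributes of the other (the pairing itself matters, but the order in which the pairs are listed does not):

    \[
      \sum_{k=1}^{\min(n,m)} \binom{n}{k}\,\binom{m}{k}\,k!
    \]

Here n and m are the numbers of attributes of the two tables: for a fixed arity k, one chooses k attributes on each side and one of k! ways to pair them. Already for n = m = 10 the sum exceeds 200 million candidates, which illustrates the exponential growth mentioned above.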
In this dissertation, we propose an automated process for the
discovery of inclusion dependencies in databases under three
assumptions:
- The data model of the databases in question must include the concept of
attributes (``columns'' of data), cannot have complex objects (such as nested
relations), and cannot have pointers or cross-references between data
objects.
- The data sources must provide the names and possibly types of their schema
elements (i.e., the schema must be available).
- The data sources must support the testing of a given
inclusion dependency, for example through some query language (a small
sketch of such a test follows after this list).
No further requirements are made of the data sources.
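As an illustration of the third assumption, the following is a minimal sketch of how a single candidate inclusion dependency could be tested through a SQL interface. Python's sqlite3 module is used here purely for concreteness; the table and attribute names are hypothetical, and the actual queries issued by FIND may look different.

    import sqlite3

    def ind_holds(conn, r_table, r_attrs, s_table, s_attrs):
        """Test the candidate inclusion dependency r_table[r_attrs] <= s_table[s_attrs]."""
        r_cols = ", ".join(r_attrs)
        s_cols = ", ".join(s_attrs)
        # A single counter-example row is enough to refute the candidate:
        # look for a value combination in r_table that is missing from s_table.
        query = (
            f"SELECT 1 FROM (SELECT {r_cols} FROM {r_table} "
            f"EXCEPT SELECT {s_cols} FROM {s_table}) LIMIT 1"
        )
        return conn.execute(query).fetchone() is None

    # Hypothetical usage: does every dept.manager_id also occur as an employee.id?
    # conn = sqlite3.connect("example.db")
    # ind_holds(conn, "dept", ["manager_id"], "employee", ["id"])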
Under these assumptions, our algorithm, named FIND, will discover
inclusion dependencies between two given data sources. Since the
general problem is NP-complete, a full solution cannot always be
computed in reasonable time. Therefore, the FIND algorithm will first
attempt to discover all such dependencies if the problem size is small
enough. For problems that exceed a certain size, it will apply
heuristics to either discover the largest inclusion dependency between
the data sources or, for very large problems, at least find some large
dependency. The algorithm will report whether it has found the complete
or only a partial solution.
Our approach uses a mapping of the inclusion dependency discovery
problem to the clique-finding problem in graphs and
k-uniform hypergraphs. As the clique-finding problem itself is
also NP-complete, additional heuristics are used when the problem size
exceeds certain limits to restrict the size of the graphs involved and
find a partial or complete solution to the problem.
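To give a flavour of this mapping (for the plain graph case only, not the k-uniform hypergraph generalization), the sketch below builds a compatibility graph whose nodes are the valid unary inclusion dependencies, i.e. attribute pairs, and whose edges connect two nodes whenever the corresponding binary dependency also holds; maximal cliques of this graph are then candidates for higher-arity dependencies that still have to be validated against the data. The helper ind_holds is the test sketched earlier, and networkx is assumed only as an off-the-shelf clique finder, not as part of FIND itself.

    from itertools import combinations
    import networkx as nx  # assumed third-party library; any clique finder would do

    def candidate_inds(conn, r_table, r_attrs, s_table, s_attrs):
        # Nodes: unary inclusion dependencies r_table[a] <= s_table[b] that hold.
        nodes = [(a, b) for a in r_attrs for b in s_attrs
                 if ind_holds(conn, r_table, [a], s_table, [b])]

        g = nx.Graph()
        g.add_nodes_from(nodes)

        # Edges: two unary dependencies are compatible if they use distinct
        # attributes on both sides and the combined binary dependency also holds.
        for (a1, b1), (a2, b2) in combinations(nodes, 2):
            if a1 != a2 and b1 != b2 and ind_holds(
                    conn, r_table, [a1, a2], s_table, [b1, b2]):
                g.add_edge((a1, b1), (a2, b2))

        # Each maximal clique is a candidate higher-arity inclusion dependency.
        # Pairwise validity does not imply validity of the full combination, so
        # these candidates must still be checked against the data.
        return list(nx.find_cliques(g))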