Discovery of Inclusion Dependencies in Databases
Much meta information about sources, such as the semantics of schema
objects, functional dependencies, or relationships between different
data sources, is not always explicitly available for an integration
effort. Often, only the schema is known (or can be queried) and data
can be queried through some kind of interface (e.g. using a query
language such as SQL). However, many other kinds of meta information
about sources are beneficial to perform meaningful information
One important class of meta information are
constraints that restrict the possible states of an information
source. Such constraints are useful in the determination of
relationships between sources and thus for information
integration. Some integration systems perform a semi-automatic
integration using such constraints. Such approaches, but also
manual or (hypothetical) fully automatic integration systems, would
benefit greatly from the availability of as much meta-information as
possible about the sources to be integrated. The manual search for
such constraints is tedious and often not possible. This is true in
particular when many related information sources are available or when
large relations (with many attributes) are to be compared for
interrelationships. Therefore, the question whether it is possible to
automatically discover meta-information in otherwise unknown
data sources is relevant.
Clearly, the existence of a constraint cannot be inferred from
an inspection of the data, as any suspected constraint may only hold
temporarily or accidentally. However, it is possible to gather
evidence for the existence of a suspected constraint (a process
usually referred to as
discovery) by the determination of
patterns in the data of data sources. The assumption
underlying this discovery is that if a large part of a database
supports a hypothesis about a constraint and there is no evidence to
refute the hypothesis, then this constraint is likely to exist in the
While some types of constraint discovery have been studied to some
extent (for example, functional dependencies and various key
constraints, one important class of constraints, namely inclusion
dependencies, has received little treatment in the literature.
Inclusion dependencies express subset-relationships between databases
and are thus important indicators for redundancies between data
sources. Therefore, the discovery of inclusion dependencies is
important in the context of information integration. While inference
rules on such dependencies have been derived early (for example, by
Casanova (PODS'82) and by Mitchell (PODS'83), the discovery of
inclusion dependencies has not yet been treated thoroughly. A paper by
Kantola (Intl. J. of Inf.Sys. 1992) provides some initial ideas and a
rough complexity bound. It has also been argued in the literature
that a new database normal form based on inclusion dependencies (IDNF)
would be beneficial for some applications. Furthermore, there is some
work on the discovery of related web sites, in which web sites with
their hyperlinks are modelled as graphs.
Inclusion dependencies are rules between two tables and of the
, with and attribute sets. The
discovery of inclusion dependencies is an NP-complete problem, as
KANTOLA et al. [Intl. J. of
Inf. Sys. 1992] have shown. The number of
potential inclusion dependencies between two tables is exponential in
the number of attributes in the two relations, however the number of
interesting inclusion dependencies (i.e., such dependencies that
are not subsumed by others or that actually express some semantic
relationship between tables) is often quite small. Therefore, the
problem of finding those ``interesting'' dependencies is difficult,
but solvable as we will show.
In this dissertation, we propose an automated process for the
discovery of inclusion dependencies in databases under three
No further requirements are made of the data sources.
- The data model of the databases in question must include the concept of
attributes (``columns'' of data), cannot have complex objects (such as nested
relations), and cannot have pointers or cross-references between data
- The data sources must provide the names and possibly types of their schema
elements (i.e., the schema must be available).
- The data sources must support the testing of a given
inclusion dependency, for example through some query language.
Under these assumptions, our algorithm, named FIND, will discover
inclusion dependencies between two given data sources. Since the
general problem is NP-complete, a full solution cannot always be
found. Therefore, the FIND algorithm will first attempt to discover
all such dependencies, if the problem size is small enough. For
problems that exceed a certain size it will apply heuristics to either
discover the largest inclusion dependency between the data
sources or at least find some large dependency for very large
problems. The algorithm will report whether it has found the complete
or only a partial solution.
Our approach uses a mapping of the inclusion dependency discovery
problem to the clique-finding problem in graphs and
k-uniform hypergraphs. As the clique finding problem itself is
also NP-complete, additional heuristics are used when the problem size
exceeds certain limits to restrict the size of the graphs involved and
find a partial or complete solution to the problem.
- We give a complete theory of the discovery of
- An algorithm for an exact
solution of the problem for smaller problem sizes
For larger problem sizes, and data with certain properties (many
duplicates), the simple algorithm does not perfom sufficiently
well. Here, we apply heuristics that are derived from efficiently
computable properties of databases (domains, number of distinct
values, attribute value distributions)
have implemented all algorithms and heuristics in Java with
JDBC-connected relational databases
- Experimental Evaluation
We have performed extensive experiments on our implementation of the
algorithm. The experiments show the feasibility of the approach and
prove that even for relations of 80 attributes and 100,000 tuples, the
discovery of inclusion dependencies is possible within reasonable time
(on the order of magnitude of a few hours).
This project is in part funded by the National Science Foundation
under grant Award Number: IIS-9988776 and the project title "Data
Warehouse Maintenance over Dynamic Distributed Information Sources".
This NSF project time period ranges from Oct/1/2000 - Sept/30/2003.
The source code (in Java) is available here as a tarball (gzipped tar-file). If you
want to make use of it, please let us
know as it may be difficult to get this code to work on another
machine. The system requirements are: Java 1.1, Java Treebuilder
(JTB), JavaCC, JDBC (with the appropriate database class library), and
Oracle or another SQL-database with a similar SQL
dialect. Documentation is mainly restricted to Javadoc-comments in the
code which can be translated into html using javadoc. The
shell-scripts for compilation are for Linux bash.
The data sets we used for testing are from the UC Irvine KDD Archive. We loaded all
files into a relational database.
As this project has been concluded very recently, there are no
publications yet. A comprehensive description will be available in my
dissertation, to appear soon....
Related Work (Selection)
Marco A. Casanova, Ronald Fagin, and Christos H. Papadimitriou.
Inclusion dependencies and their interaction with functional
In Proceedings of ACM Conference on Principles of Database
Systems (PODS), pages 171-176, 1982.
Junghoo Cho, Narayanan Shivakumar, and Hector Garcia-Molina.
Finding replicated Web collections.
SIGMOD Record (ACM Special Interest Group on Management of
Data), 29(2):355-366, 2000.
Michael R. Genesereth, Arthur M. Keller, and Oliver M. Duschka.
Infomaster: An information integration system.
SIGMOD Record (ACM Special Interest Group on Management of
Data), 26(2):539ff., 1997.
R. L. Graham, M. Grötschel, and L. Lovász, editors.
Handbook of Combinatorics.
Elsevier/MIT Press, 1995.
M. Kantola, H. Mannila, K. J. Räihä, and H. Siirtola.
Discovering functional and inclusion dependencies in relational
International J. Of Intelligent Systems, 7:591-607, 1992.
T. Kirk, A. Y. Levy, Y. Sagiv, and D. Srivastava.
The Information Manifold.
Proceedings of the AAAI Spring Symposium on Information
Gathering in Distributed Heterogeneous Environments, Stanford, California,
J. A. Larson, S. B. Navathe, and R. Elmasri.
A theory of attribute equivalence in databases with applications to
IEEE Transactions on Software Engineering, 15(4):449-463,
Mark Levene and Millist W. Vincent.
Justification for inclusion dependency normal form.
IEEE Transactions on Knowledge and Data Engineering (TKDE), 12,
John C. Mitchell.
Inference rules for functional and inclusion dependencies.
In Proceedings of ACM Symposium on Principles of Database
Systems, pages 58-69, Atlanta, Georgia, 21-23 March 1983.
I. Savnik and P. A. Flach.
Bottom-up induction of functional dependencies from relations.
In G. Piatetsky-Shapiro, editor, Proc. of AAAI-93 Workshop:
Knowledge Discovery in Databases, pages 174-185, July 1993.
created by: Andreas Koeller. Last modified
Oct. 31, 2001