An important question in the construction of web wrappers is whether a
wrapper can be reused for other web sites and schemas. We tried to
address this issue in this project.
The main steps involved in the construction of the wrapper were:
1. model web site in the relational model (define schema)
2. examine capabilities of underlying web search forms and define
SQL parser that extracts queries adapted to search forms from
arbitrary SQL
3. examine returned HTML-results, define HTML parser(s) for results
and map to records in relational model
4. generate parsers with JavaCC and JJTree
5. write wrapper code and call parsers
6. store results and run queries against database
7. define and code user interface (API)
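The flow of these steps can be pictured with a minimal Java sketch. It is an illustration only, not the project's code: the names WrapperPipeline, SiteAdapter, toSearchString, parseResults and fetchPage are all hypothetical, the HTTP call is replaced by a plain function, and an in-memory list with a predicate stands in for storing the results in a database and querying them there.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical sketch of the wrapper flow; all names are invented for illustration. */
final class WrapperPipeline {

    /** The site-specific parts, written once per web site. */
    interface SiteAdapter {
        /** Reduce an SQL pattern to a search string the site's form accepts. */
        String toSearchString(String sqlPattern);

        /** Turn one returned HTML page into records (column name -> value). */
        List<Map<String, String>> parseResults(String html);
    }

    private final SiteAdapter site;
    private final Function<String, String> fetchPage; // search string -> HTML page

    WrapperPipeline(SiteAdapter site, Function<String, String> fetchPage) {
        this.site = site;
        this.fetchPage = fetchPage;
    }

    /** Call the site, parse the page, then apply the full condition locally. */
    List<Map<String, String>> query(String sqlPattern,
                                    Predicate<Map<String, String>> fullCondition) {
        String search = site.toSearchString(sqlPattern);
        String html = fetchPage.apply(search);
        List<Map<String, String>> records = site.parseResults(html);

        // Stand-in for "store results and run queries against database".
        List<Map<String, String>> answer = new ArrayList<>();
        for (Map<String, String> record : records) {
            if (fullCondition.test(record)) {
                answer.add(record);
            }
        }
        return answer;
    }
}
```

In this arrangement only the SiteAdapter implementation would change from one web site to the next; the query method stays the same.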
For a new web site, steps 1 through 4 have to be repeated. Step 4
is automatic, which leaves steps 1, 2, and 3 to be adapted to each new
site. The main issues involved in these steps are:
- Find a stable and general relational schema that best fits the
data returned from the web site. This is not hard as long as the web
site itself does not change its capabilities significantly. Most
information returned in HTML pages can be modeled in a few tables
(as in this example).
- Defining the SQL parser is the hardest step in the process. The
main problems are not in the translation of arbitrary SQL to simple
queries but (1) using as much of the query capability that the web
site offers as possible (i.e., not issuing queries that are too simple
and return overly large result sets) and (2) translating the pattern
matching syntax of the underlying RDBMS to the syntax of the web
search engine.
In the current example, we translate only simple Oracle "like"
queries into search strings for the DBLP site. DBLP supports only
substring matching without special search characters, so searching
with the usual '?' and '*' operators is not supported. Other web sites
may support these or other capabilities (e.g., the '+' operator in
general web search engines such as Excite). We have not addressed this
pattern matching issue in this project.
Making maximum use of a site's query capability is another issue
that could be addressed further. What is needed is a general way of
describing the searches a web site allows, together with a grammar
that translates SQL into these search forms.
- Defining the HTML parsers is not a severe problem: the example
suggests that modeling an HTML page with a specialized HTML parser is
not too difficult.
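Issue (1) under the SQL parser, using as much of a site's query capability as possible, can be illustrated with a small Java sketch. It is only one possible reading, not something the project implements: each search form is described by the set of columns it can restrict, and the planner picks the form that accepts the most conditions of the query, leaving the rest to be checked locally. The class names, the form names and the column names in the usage note are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical capability description; not taken from the project. */
final class SearchFormPlanner {

    /** One search form of a site and the columns it can restrict. */
    static final class SearchForm {
        final String name;
        final Set<String> searchableColumns;

        SearchForm(String name, Set<String> searchableColumns) {
            this.name = name;
            this.searchableColumns = searchableColumns;
        }
    }

    /** Which form to call, what is sent to it, and what is left for local filtering. */
    static final class Plan {
        final SearchForm form;
        final Map<String, String> pushed;
        final Map<String, String> residual;

        Plan(SearchForm form, Map<String, String> pushed, Map<String, String> residual) {
            this.form = form;
            this.pushed = pushed;
            this.residual = residual;
        }
    }

    /**
     * Picks the form that accepts the most conditions (column -> search value).
     * Returns null if the list of forms is empty. On a tie the earlier form wins.
     */
    static Plan plan(Map<String, String> conditions, List<SearchForm> forms) {
        Plan best = null;
        for (SearchForm form : forms) {
            Map<String, String> pushed = new LinkedHashMap<>();
            Map<String, String> residual = new LinkedHashMap<>();
            for (Map.Entry<String, String> c : conditions.entrySet()) {
                if (form.searchableColumns.contains(c.getKey())) {
                    pushed.put(c.getKey(), c.getValue());
                } else {
                    residual.put(c.getKey(), c.getValue());
                }
            }
            if (best == null || pushed.size() > best.pushed.size()) {
                best = new Plan(form, pushed, residual);
            }
        }
        return best;
    }
}
```

For a query with conditions on the made-up columns author and year, and two made-up forms that search by title and by author respectively, the planner would choose the author form, send the author condition to the site, and keep the year condition for the local database query.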
These three steps have to be executed for each new wrapper that is
written. More work on defining grammars for these purposes could be
beneficial. The remaining software in the wrapper (Java code) can
largely be reused. Currently, the only knowledge about the web site
that is hardcoded is the translation of pattern matching
expressions.
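To make the pattern matching point concrete, here is a hedged Java sketch of how an SQL "like" pattern could be reduced to a plain substring search. It is not the translation used in the project. It assumes the standard SQL wildcards '%' (any sequence) and '_' (any single character), ignores the ESCAPE clause, and relies on the idea that a substring search for the longest literal fragment returns a superset of the wanted rows, which a local check against the full pattern then narrows down. The names LikePattern, longestLiteral and toRegex are hypothetical.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for a site that offers substring search only. */
final class LikePattern {

    /**
     * Longest run of literal characters in a LIKE pattern. Every string that
     * matches the pattern contains this run, so searching for it loses nothing.
     */
    static String longestLiteral(String like) {
        String best = "";
        for (String part : like.split("[%_]")) {
            if (part.length() > best.length()) {
                best = part;
            }
        }
        return best;
    }

    /** Equivalent regular expression, used to re-check the returned rows locally. */
    static Pattern toRegex(String like) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : like.toCharArray()) {
            if (c == '%' || c == '_') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '%' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }
}
```

Keeping such a translation in one small class per site would confine the site-specific pattern knowledge to a single place. A site with richer search syntax would need a different implementation of the first method only.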
Andreas Koeller
Mon May 10 13:40:38 EDT 1999