Motivation

As everyone knows, there is a lot of information out on the web. But as everyone has experienced, having information available does not necessarily mean it is available conveniently or in the form you want. The DBLP database is a case in point. It features an enormous collection of computer science publications, but its search can only handle simple queries on a single author name or title. Those of us of the fairly computer-literate and technology-pampered sort want more. I mean, if Oracle can do this and Oracle can do that, why can't this web site? That's where our wrapper comes in. The wrapper loads tables from the DBLP site into an actual Oracle database, where users can have all the query complexity they want. Implementation consists of building the virtual database in Oracle, then simply querying that database.

Building the Virtual Database

--What Tables?

The main question here is which tables to retrieve, i.e., given an arbitrarily complex SQL query, what is the smallest number of views necessary to answer it? This is an interesting research topic in itself. Unfortunately, attempts to prove that an algorithm is the most efficient in this sense led instead to a proof that no such general proof is possible. Regardless of the academic efficiency question, there are SQL parsers available that do the job of finding some subset of tables sufficient to answer your query. The parser we use accepts an SQL query at construction and has a method that, on compilation, returns a vector of strings. The strings that start with the token "Submit" identify the tables to be retrieved.
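
As a rough illustration of that last step, filtering the parser's output might look like the sketch below. The class name SubmitFilter and the exact layout of the strings (for example "Submit Name = Rundensteiner") are assumptions made for the example; all we rely on is that the relevant strings start with the token "Submit".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the wrapper's actual code.
final class SubmitFilter {
    private static final String TOKEN = "Submit";

    /** Keeps only the strings that start with "Submit" and strips the token. */
    static List<String> submitEntries(List<String> parserOutput) {
        List<String> entries = new ArrayList<>();
        for (String line : parserOutput) {
            String trimmed = line.trim();
            if (trimmed.startsWith(TOKEN)) {
                // Assumed layout: "Submit <attribute> = <value>"
                entries.add(trimmed.substring(TOKEN.length()).trim());
            }
        }
        return entries;
    }
}
```

Since java.util.Vector implements List, the vector returned by the parser could be passed to such a method directly.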

--Querying the Web Site

Now that we know the tables we want are marked with "Submit", we simply grab the attribute before the equals sign and send the value that follows it to the search page for the corresponding table. For example, the string Name = Rundensteiner would be parsed as: submit "Rundensteiner" to the HTML page that searches Authors, since Name is a field in Author. The actual retrieval is fairly straightforward using Java's URL class, which provides methods that do all the work of opening a connection to a web page. Through that connection a PrintWriter can submit data to the page, and the HTML that the server sends back can be read as an input stream.
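
A minimal sketch of this step follows, using only standard java.net and java.io classes. The class name SiteQuery, the form field name passed to formBody, and the use of a form-encoded POST body are assumptions for illustration; the actual field names depend on the search form being wrapped.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UnsupportedEncodingException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLEncoder;

// Hypothetical helper, not the wrapper's actual code.
final class SiteQuery {
    /** Splits "Name = Rundensteiner" into {"Name", "Rundensteiner"}. */
    static String[] splitCondition(String condition) {
        int eq = condition.indexOf('=');
        if (eq < 0) {
            throw new IllegalArgumentException("no '=' in: " + condition);
        }
        return new String[] {
            condition.substring(0, eq).trim(),
            condition.substring(eq + 1).trim()
        };
    }

    /** Builds a form-encoded body; the field name is an assumption. */
    static String formBody(String fieldName, String value) {
        try {
            return URLEncoder.encode(fieldName, "UTF-8") + "="
                 + URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /** Sends the body to the search page and returns a reader over the HTML reply. */
    static BufferedReader submit(URL searchPage, String body) throws IOException {
        URLConnection conn = searchPage.openConnection();
        conn.setDoOutput(true); // allow writing a request body
        try (PrintWriter out = new PrintWriter(conn.getOutputStream())) {
            out.print(body);
        }
        return new BufferedReader(new InputStreamReader(conn.getInputStream()));
    }
}
```

With these pieces, the condition Name = Rundensteiner would be split into its attribute and value, the attribute would select the Author search page, and the value would be sent to that page via submit.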

(A Quirk of & Work-Around for this Web Site)

Although the DBLP site does not support complex queries, it does offer a few things of its own. When a search matches multiple authors, for instance, the site returns a list of hyperlinks to the tables associated with those authors, instead of all tables of each author as a strict query request would imply. This is nice for human users but not for our parser, which depends on getting back input streams for exactly the tables it asked for. Our work-around is a static method, ParseList.getURLStreams(), which parses the HTML response stream and determines whether it is a list of hyperlinks or a straight-up result table. If it is a list of hyperlinks, the method follows each link and returns a vector of URLStreams which the parser can query to get its tables.
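
The sketch below shows one way such a check could be written. It is not the ParseList code itself: the class name LinkListSketch, the regular expression, and the heuristic (a page containing links but no table element is treated as a list of hyperlinks) are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the idea behind ParseList.getURLStreams().
final class LinkListSketch {
    private static final Pattern HREF = Pattern.compile(
        "<a\\s+[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Collects the targets of all double-quoted href attributes on anchor tags. */
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    /** Assumed heuristic: links present and no table element means a link list. */
    static boolean isLinkList(String html) {
        return !html.toLowerCase().contains("<table")
            && !extractLinks(html).isEmpty();
    }

    /** Follows each link, resolved against the page's URL, and opens its stream. */
    static List<InputStream> openAll(URL page, List<String> links) throws IOException {
        List<InputStream> streams = new ArrayList<>();
        for (String link : links) {
            streams.add(new URL(page, link).openStream());
        }
        return streams;
    }
}
```

A real response page may need a stricter test than this, for example if result tables and link lists can appear on the same page.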

--Extracting Table Data from the Web Site's Response into the Oracle Database

Here we build upon the HTML parser which comes with JavaCC. We add a couple of rules to recognize authors, titles, homepages, etc., and copy the values of those fields into a vector of results once they are recognized. The extracted data is then bound to prepared SQL INSERT statements, which are executed against the Oracle database.
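
The insert step can be sketched with plain JDBC as follows. The class name InsertSketch and the table and column names used with it are assumptions; the real schema is the one defined for the virtual database.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not the wrapper's actual code.
final class InsertSketch {
    /** Builds e.g. "INSERT INTO Author (Name, Homepage) VALUES (?, ?)". */
    static String insertSql(String table, List<String> columns) {
        String cols = String.join(", ", columns);
        String marks = String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + table + " (" + cols + ") VALUES (" + marks + ")";
    }

    /** Binds each extracted row to the prepared statement and executes it. */
    static void insertRows(Connection conn, String table, List<String> columns,
                           List<List<String>> rows) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(insertSql(table, columns))) {
            for (List<String> row : rows) {
                for (int i = 0; i < row.size(); i++) {
                    ps.setString(i + 1, row.get(i)); // JDBC parameters are 1-based
                }
                ps.executeUpdate();
            }
        }
    }
}
```

Binding values through placeholders rather than concatenating them into the SQL text also avoids trouble with quotes in author names and titles.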

Querying the Database

Once the virtual database is built, the original SQL statement can be executed against it. The resulting data is extracted from the result set and used to build a JTable which is displayed to the user.
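
One possible way to turn a result set into table data is sketched below, using the standard DefaultTableModel from Swing. The class name ResultTableSketch and the split into two methods are assumptions for illustration, not a description of the wrapper's own code.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.Vector;
import javax.swing.table.DefaultTableModel;

// Hypothetical helper, not the wrapper's actual code.
final class ResultTableSketch {
    /** Wraps column names and row data in a Swing table model. */
    static DefaultTableModel buildModel(Vector<String> names,
                                        Vector<Vector<Object>> rows) {
        return new DefaultTableModel(rows, names);
    }

    /** Reads every row of the result set into a table model. */
    static DefaultTableModel toModel(ResultSet rs) throws SQLException {
        ResultSetMetaData meta = rs.getMetaData();
        int n = meta.getColumnCount();
        Vector<String> names = new Vector<>();
        for (int c = 1; c <= n; c++) { // JDBC columns are 1-based
            names.add(meta.getColumnLabel(c));
        }
        Vector<Vector<Object>> rows = new Vector<>();
        while (rs.next()) {
            Vector<Object> row = new Vector<>();
            for (int c = 1; c <= n; c++) {
                row.add(rs.getObject(c));
            }
            rows.add(row);
        }
        return buildModel(names, rows);
    }
}
```

The model returned by toModel can be handed to the JTable constructor, and the table placed in a JScrollPane so that the column headers are shown.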

