We believe that a general method for extracting well-structured information (such as relational data) from an HTML result page is not yet possible, since no standard for designing such pages exists (this should change once XML gains some level of acceptance). Therefore, our aim is to extract data from specific HTML pages rather than from arbitrary ones. Fortunately, the HTML primitives most often used to return information after a query, such as lists and tables, are conceptually close to relations in the relational model. We can therefore focus our work on pages containing such primitives.
Most result pages are generated automatically, so the structure of the lists and tables in such pages is independent of the actual query sent. This is an important assumption: a result page in which two rows of the same table have different formats is difficult to handle. The assumption allows us to filter the result data by excluding the undesirable HTML meta-information from it. Moreover, a single character string in the result text sometimes contains information that has to be stored in more than one column of the relational table. Homogeneity of the data is again highly desirable in order to split the string properly, although it is not always achievable. An example of a successful split is a pages field in an HTML document containing strings like ``234-256'', which can be parsed into a start and an end page number.
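As an illustration of the filtering and splitting steps described above, the following is a minimal sketch in Python, not the implementation used in our project. It assumes that every result row is a `<tr>` element whose cells are `<td>` or `<th>` elements, and that a pages field has the form `start-end`. The names `TableRowExtractor` and `split_pages`, as well as the sample markup, are hypothetical and chosen only for this example.

```python
import re
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collects the text of each table cell, grouped by row.

    Assumption: rows are <tr> elements and cells are <td>/<th> elements.
    All other markup (e.g. <b>, <a>) is dropped; only its text is kept.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def _close_cell(self):
        # Store the current cell with whitespace runs collapsed.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def _close_row(self):
        self._close_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._close_row()   # tolerate a missing </tr>
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._close_cell()  # tolerate a missing </td>
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag in ("tr", "table"):
            self._close_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


# Assumed format of a pages field: two integers separated by a hyphen.
PAGE_RANGE = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s*$")


def split_pages(field):
    """Return (start, end) as integers, or None if the field does not match."""
    match = PAGE_RANGE.match(field)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical result page with one well-formed and one irregular pages field.
sample = """
<table>
<tr><td><b>A Sample Title</b></td><td>1997</td><td>234-256</td></tr>
<tr><td>Another Title</td><td>1998</td><td>n/a</td></tr>
</table>
"""

parser = TableRowExtractor()
parser.feed(sample)
parser.close()
for title, year, pages in parser.rows:
    print(title, year, split_pages(pages))
```

Returning `None` for a field that does not match the expected pattern makes the non-homogeneous case explicit, so that such rows can be handled separately instead of being split incorrectly.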
In our application we have both HTML lists and HTML tables. We want to identify the result rows based on their HTML tags and parse them according to our relational data model. The special case in which the author name is not specified precisely enough to match a single record must also be considered. We have to distinguish two cases: the result page is either a list of candidate authors or a list of one author's papers. For our project this is possible because authors and papers are returned in different formats: tables versus lists. In general, some way has to be found to distinguish the differently structured HTML pages that are returned as query results.
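One simple way to make this distinction is to count the structural tags on the page. The sketch below is only an illustration under stated assumptions: it assumes that a page dominated by table rows is an author list and a page dominated by list items is a paper list. This mapping, the labels `"authors"`, `"papers"` and `"unknown"`, and the names `TagCounter` and `classify_result_page` are hypothetical, so the labels are passed as parameters and can be swapped for a different source.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts table rows (<tr>) and list items (<li>) on a page."""

    def __init__(self):
        super().__init__()
        self.counts = {"tr": 0, "li": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case, so <TR> also matches.
        if tag in self.counts:
            self.counts[tag] += 1


def classify_result_page(html, table_kind="authors", list_kind="papers"):
    """Label a result page by its dominant structure.

    Assumption: table-based pages are of kind `table_kind` and list-based
    pages are of kind `list_kind`. Pages with neither structure, or with
    equally many rows and items, are labeled "unknown".
    """
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    rows, items = counter.counts["tr"], counter.counts["li"]
    if rows == items:
        return "unknown"
    return table_kind if rows > items else list_kind


# Hypothetical result pages.
author_page = "<table><tr><td>Smith, J.</td></tr><tr><td>Smith, K.</td></tr></table>"
paper_page = "<ul><li>First paper<li>Second paper</ul>"

print(classify_result_page(author_page))
print(classify_result_page(paper_page))
```

A tag count is a coarse heuristic. A page that uses tables only for layout would be misclassified, so a real source may need a more specific test, such as checking for a particular header row.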