DeTable – A Wrapper for Complex HTML Tables

Aim

The main challenges in this project are as follows:

  • Developing a parser for HTML tables that detects structure information
  • Detecting header cells
  • Detecting vertically and/or horizontally spanned cells
  • Disassembling spanned cells and cloning the contained information, where appropriate
  • Applying a single-pass method, for which a dynamic table structure is used
  • Accepting and handling missing tags

 


Table 1: Normalized complex table. Input for DeTable. Blank cells are either empty or covered by a spanned cell above or to the left.

| cause | | | drug of choice | dosage |
| adults | Gonococcus | | Ceftriaxone | 1g IM, single dose |
| | | | | lavage infected eye |
| | Chlamydia | | Azithromycin | 1g orally single dose |
| | | | or | |
| | | | Doxycycline | 100 mg orally twice a day for 7 days |
| children | Gonococcus | Children who weigh < 45 kg | Ceftriaxone | 125 mg IM, single dose |
| | | Children who weigh > 45 kg | | same treatment as adults |
| | Chlamydia | Children who weigh < 45 kg | Erythromycin base | 50 mg/kg/day orally in 4 divided doses for 10-14 days |
| | | Children under 8 years old who weigh > 45 kg | Azithromycin | 1 gm orally, single dose |
| | | Children 8 years old or older | Azithromycin | 1 gm orally, single dose |
| | | | or | |
| | | | Doxycycline | 100 mg orally, twice a day for 7 days |
| Neonates | Ophthalmia neonatorum (Caused by N. gonorrhoeae) | | Ceftriaxone | 25-50 mg/kg IV or IM, single dose, not to exceed 125 mg |
| | Chlamydia | | Erythromycin | 50 mg/kg/day orally in 4 divided doses for 10-14 days |

 

 


Table 2: De-normalized table including redundant information. Output of DeTable in HTML format.

| cause | cause | cause | drug of choice | dosage |
| adults | Gonococcus | | Ceftriaxone | 1g IM, single dose |
| adults | Gonococcus | | | lavage infected eye |
| adults | Chlamydia | | Azithromycin | 1g orally single dose |
| adults | Chlamydia | | or | or |
| adults | Chlamydia | | Doxycycline | 100 mg orally twice a day for 7 days |
| children | Gonococcus | Children who weigh < 45 kg | Ceftriaxone | 125 mg IM, single dose |
| children | Gonococcus | Children who weigh > 45 kg | | same treatment as adults |
| children | Chlamydia | Children who weigh < 45 kg | Erythromycin base | 50 mg/kg/day orally in 4 divided doses for 10-14 days |
| children | Chlamydia | Children under 8 years old who weigh > 45 kg | Azithromycin | 1 gm orally, single dose |
| children | Chlamydia | Children 8 years old or older | Azithromycin | 1 gm orally, single dose |
| children | Chlamydia | Children 8 years old or older | or | or |
| children | Chlamydia | Children 8 years old or older | Doxycycline | 100 mg orally, twice a day for 7 days |
| Neonates | Ophthalmia neonatorum (Caused by N. gonorrhoeae) | | Ceftriaxone | 25-50 mg/kg IV or IM, single dose, not to exceed 125 mg |
| Neonates | Chlamydia | | Erythromycin | 50 mg/kg/day orally in 4 divided doses for 10-14 days |

 

Status
finished

The main purpose of this student project is to build a wrapper that transforms complex HTML tables into an XML format. A table is considered complex if it contains spanned cells.

Many existing wrappers apply Information Extraction (IE) methods to semi-structured data such as HTML files. One drawback of many of these wrappers is that they cannot handle complex tables. A complex table structures information and avoids repeating redundant values, i.e., the table is displayed in a normalized format. The advantage is a more concise layout, which suits HTML, as it is mainly designed for presentation to human users. To support computer-based processing, however, the information in these tables has to be de-normalized to allow faster access to each record. In addition, to enable more efficient processing, the de-normalized table is represented not only in HTML but also in an XML format.

The procedure is as follows: each spanned cell is disassembled, and its information is stored in the resulting cells. To do so, the application reads the HTML file and parses the full HTML source code, including text and attributes. The focus is on the COLSPAN and ROWSPAN attributes of the table cell tags (<TD> and <TH>), because they indicate spanned cells. The application breaks a spanned cell up into the number of cells defined by COLSPAN and ROWSPAN and clones the information into these cells. Depending on the header cells (configured via command-line arguments or detected by the <TH> tag), cloning can be suppressed for certain cells.
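
The disassembly step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not DeTable's code: the function name denormalize, the (text, rowspan, colspan) tuple layout, and the clone predicate are all hypothetical, and the span values in the sample rows are guessed from Table 2 rather than taken from the original HTML. The grid is a dictionary that grows as cells are placed, which stands in for the dynamic table structure mentioned above.

```python
def denormalize(rows, clone=lambda cell: True):
    """Expand spanned cells into a rectangular grid of strings.

    rows:  list of rows; each cell is a (text, rowspan, colspan) tuple.
    clone: predicate; when it returns False for a cell, only the top-left
           position keeps the text and the other covered positions stay empty.
    """
    grid = {}    # (row, col) -> text; filled as cells are placed
    n_cols = 0
    for r, row in enumerate(rows):
        c = 0
        for cell in row:
            text, rowspan, colspan = cell
            while (r, c) in grid:  # skip positions covered by spans from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    first = dr == 0 and dc == 0
                    grid[(r + dr, c + dc)] = text if first or clone(cell) else ""
            c += colspan
            n_cols = max(n_cols, c)
    # Row spans that run past the last row are clipped.
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(len(rows))]


# Sample input; the span values are assumptions made for illustration.
rows = [
    [("cause", 1, 3), ("drug of choice", 1, 1), ("dosage", 1, 1)],
    [("adults", 3, 1), ("Chlamydia", 3, 1), ("", 3, 1),
     ("Azithromycin", 1, 1), ("1g orally single dose", 1, 1)],
    [("or", 1, 2)],
    [("Doxycycline", 1, 1), ("100 mg orally twice a day for 7 days", 1, 1)],
]

for line in denormalize(rows):
    print(line)
```

With these assumed spans, the result matches the header row and the three adults/Chlamydia rows of Table 2, including the cloned "or" cell. Passing a predicate such as clone=lambda cell: cell[0] != "cause" leaves the two cloned header positions empty, which is one way to model suppressed cloning.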

The output of the program can be stored as both an HTML file and an XML file. The structure of the latter is defined by a DTD, which is included in the package.
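
To give an idea of what an XML rendering of a de-normalized table might look like, the following sketch serializes a rectangular grid with Python's standard xml.etree.ElementTree module. The element and attribute names (table, row, cell, header) and the function grid_to_xml are invented for illustration; the actual structure is defined by the DTD in the package and will most likely differ.

```python
import xml.etree.ElementTree as ET


def grid_to_xml(grid, header_rows=1):
    """Serialize a list of rows (lists of strings) to an XML string.

    The element names used here are hypothetical, not those of DeTable's DTD.
    """
    root = ET.Element("table")
    for r, row in enumerate(grid):
        # Mark the leading rows as header rows (assumed convention).
        flag = "yes" if r < header_rows else "no"
        row_el = ET.SubElement(root, "row", {"header": flag})
        for text in row:
            ET.SubElement(row_el, "cell").text = text
    # ElementTree escapes characters such as "<" in cell text.
    return ET.tostring(root, encoding="unicode")


print(grid_to_xml([
    ["cause", "cause", "cause", "drug of choice", "dosage"],
    ["children", "Gonococcus", "Children who weigh < 45 kg",
     "Ceftriaxone", "125 mg IM, single dose"],
]))
```

Because every record carries its full set of cause values after de-normalization, each row element can be processed on its own without looking at neighboring rows.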

Downloads