Linguistic Parsing of Structured Documents

Salah Ait-Mokhtar, Eva Banik, Veronika Lux
This report shows how taking document structure into account improves parsing performance. We restrict the linguistic analysis to technical documents and consider one specific structure in a single markup language: lists in HTML documents. First, we establish a typology of lists based on a corpus study. Then, after describing a transformation process that creates documents with uniform list markup, we show how the list tags can be incorporated into a XIP grammar and how they enhance performance at every level of parsing.
Xerox Technical Report
