Split-Correctness in Information Extraction
by Johannes Doleschal, Benny Kimelfeld, Wim Martens, Yoav Nahshon, and
Frank Neven
2018
Abstract
Programs for extracting structured information from text, namely information
extractors, often operate separately on document segments obtained from a
generic splitting operation such as sentences, paragraphs, k-grams, HTTP
requests, and so on. Automated detection of this behavior of extractors,
which we refer to as split-correctness, would allow text analysis systems to
devise query plans that evaluate segments in parallel, accelerating the
processing of large documents. Other applications include the incremental
evaluation on dynamic content, where re-evaluation of information extractors
can be restricted to revised segments, and debugging, where developers of
information extractors are informed when different semantic components may
cross segment boundaries.
We propose a new formal framework for split-correctness within the formalism
of document spanners. Our preliminary analysis studies the complexity of
split-correctness over regular spanners. We also discuss different variants of
split-correctness, for instance, in the presence of black-box extractors with
so-called split constraints.
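
To make the idea concrete, here is a minimal sketch in Python of what it means for an extractor to behave the same on a whole document as on its segments. It is an illustration under stated assumptions, not the paper's formal framework: the names `sentence_splitter`, `regex_extractor`, `evaluate_on_segments`, and `agrees_on` are hypothetical, spans are half-open character intervals, and Python's `re.finditer` (which returns only non-overlapping matches) stands in for a regular spanner. The check covers a single document, whereas split-correctness is a property over all documents.

```python
import re
from typing import Callable, List, Set, Tuple

Span = Tuple[int, int]  # half-open character interval [start, end)


def sentence_splitter(doc: str) -> List[Span]:
    """Toy splitter (assumption): a segment is a run of non-period
    characters followed by an optional period."""
    return [m.span() for m in re.finditer(r"[^.]+\.?", doc)]


def regex_extractor(pattern: str) -> Callable[[str], Set[Span]]:
    """Toy extractor (assumption): spans of non-overlapping regex matches."""
    compiled = re.compile(pattern)

    def extract(doc: str) -> Set[Span]:
        return {m.span() for m in compiled.finditer(doc)}

    return extract


def evaluate_on_segments(
    extract: Callable[[str], Set[Span]],
    split: Callable[[str], List[Span]],
    doc: str,
) -> Set[Span]:
    """Run the extractor on each segment separately and shift the
    resulting spans back to document coordinates."""
    result: Set[Span] = set()
    for start, end in split(doc):
        for s, e in extract(doc[start:end]):
            result.add((start + s, start + e))
    return result


def agrees_on(
    extract: Callable[[str], Set[Span]],
    split: Callable[[str], List[Span]],
    doc: str,
) -> bool:
    """True if whole-document and per-segment evaluation coincide on doc."""
    return extract(doc) == evaluate_on_segments(extract, split, doc)


doc = "Call 555-1234. Or 555-9876."
within_sentence = regex_extractor(r"\d{3}-\d{4}")   # never crosses a period
crossing = regex_extractor(r"\d{4}\. Or")           # spans two sentences

within_ok = agrees_on(within_sentence, sentence_splitter, doc)
crossing_ok = agrees_on(crossing, sentence_splitter, doc)
```

On this document the first extractor agrees with its per-segment evaluation, while the second does not, because its only match crosses a sentence boundary and is lost once the document is split. A check of this kind can only refute the property on a given input; establishing it for every document is the decision problem whose complexity the abstract refers to.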
arXiv:1810.03367v1