Unified Pretraining Framework for Document Understanding
by
Jiuxiang Gu, Jason Kuen, Vlad I. Morariu, Handong Zhao, Nikolaos Barmpalios, Rajiv Jain, Ani Nenkova, Tong Sun
2022
Abstract
Document intelligence automates the extraction of information from documents
and supports many business applications. Recent self-supervised learning
methods on large-scale unlabeled document datasets have opened up promising
directions towards reducing annotation efforts by training models with
self-supervised objectives. However, most of the existing document pretraining
methods are still language-dominated. We present UDoc, a new unified
pretraining framework for document understanding. UDoc is designed to support
most document understanding tasks, extending the Transformer to take multimodal
embeddings as input. Each input element is composed of words and visual
features from a semantic region of the input document image. An important
feature of UDoc is that it learns a generic representation by making use of
three self-supervised losses, encouraging the representation to model
sentences, learn similarities, and align modalities. Extensive empirical
analysis demonstrates that the pretraining procedure learns better joint
representations and leads to improvements in downstream tasks.
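The abstract states that each Transformer input element combines the words and visual features of one semantic region, but gives no implementation details. The following is a minimal NumPy sketch of how such a multimodal input element might be formed; all names, dimensions, and the sum-fusion choice are hypothetical, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not specified in the abstract).
VOCAB, TEXT_DIM, VIS_DIM, HIDDEN = 100, 32, 48, 64

# Toy parameters standing in for a learned word-embedding table
# and learned projections into a shared hidden space.
word_emb = rng.normal(size=(VOCAB, TEXT_DIM))
W_text = rng.normal(size=(TEXT_DIM, HIDDEN))
W_vis = rng.normal(size=(VIS_DIM, HIDDEN))

def region_embedding(word_ids, visual_feat):
    """Fuse the words of one semantic region with that region's
    visual features into a single multimodal input element."""
    text = word_emb[word_ids].mean(axis=0) @ W_text  # pooled word embeddings
    vis = visual_feat @ W_vis                        # projected visual feature
    return text + vis                                # joint embedding for the Transformer

# One region: three word ids plus a CNN-style visual feature vector.
elem = region_embedding(np.array([5, 17, 42]), rng.normal(size=VIS_DIM))
print(elem.shape)  # (64,)
```

In a real model the pooling, projections, and fusion would be learned end-to-end under the three self-supervised losses the abstract describes; this sketch only illustrates the shape of the per-region multimodal input.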
Archived Files and Locations
application/pdf, 9.2 MB (file_db55xwpi5vajbpuy6bu5oox2ba)
arxiv.org (repository) · web.archive.org (webarchive)
arXiv: 2204.10939v1