AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
by Xinsong Zhang and Hang Li (2020)
Abstract
Pre-trained language models such as BERT have exhibited remarkable
performances in many tasks in natural language understanding (NLU). The tokens
in the models are usually fine-grained in the sense that for languages like
English they are words or sub-words and for languages like Chinese they are
characters. In English, for example, there are multi-word expressions which
form natural lexical units and thus the use of coarse-grained tokenization also
appears to be reasonable. In fact, both fine-grained and coarse-grained
tokenizations have advantages and disadvantages for the learning of pre-trained
language models. In this paper, we propose a novel pre-trained language model,
referred to as AMBERT (A Multi-grained BERT), on the basis of both fine-grained
and coarse-grained tokenizations. For English, AMBERT takes both the sequence
of words (fine-grained tokens) and the sequence of phrases (coarse-grained
tokens) as input after tokenization, employs one encoder for processing the
sequence of words and the other encoder for processing the sequence of the
phrases, utilizes shared parameters between the two encoders, and finally
creates a sequence of contextualized representations of the words and a
sequence of contextualized representations of the phrases. Experiments have
been conducted on benchmark datasets for Chinese and English, including CLUE,
GLUE, SQuAD and RACE. The results show that AMBERT outperforms the existing
best-performing models in almost all cases; in particular, the improvements
are significant for Chinese.
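The dual-granularity architecture described above can be illustrated with a minimal sketch (assuming PyTorch; the class name, vocabulary sizes, and dimensions are illustrative assumptions, not the authors' implementation). The abstract's two encoders with shared parameters are realized here as a single Transformer encoder module applied to both the fine-grained and the coarse-grained token sequences; positional embeddings and the pre-training objectives are omitted for brevity.

# Minimal sketch of the multi-grained encoding idea; not the paper's code.
import torch
import torch.nn as nn

class AmbertSketch(nn.Module):
    def __init__(self, fine_vocab, coarse_vocab, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        # Separate embedding tables for fine-grained tokens (words/characters)
        # and coarse-grained tokens (phrases/words).
        self.fine_embed = nn.Embedding(fine_vocab, d_model)
        self.coarse_embed = nn.Embedding(coarse_vocab, d_model)
        # One Transformer encoder whose parameters are shared across the two
        # granularities, mirroring the shared-parameter encoders in the abstract.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.shared_encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, fine_ids, coarse_ids):
        # Encode both views with the shared encoder, producing one sequence of
        # contextualized representations per granularity.
        fine_repr = self.shared_encoder(self.fine_embed(fine_ids))
        coarse_repr = self.shared_encoder(self.coarse_embed(coarse_ids))
        return fine_repr, coarse_repr

# Toy usage: one sentence tokenized into 8 fine-grained and 5 coarse-grained tokens.
model = AmbertSketch(fine_vocab=30522, coarse_vocab=50000)
fine_ids = torch.randint(0, 30522, (1, 8))
coarse_ids = torch.randint(0, 50000, (1, 5))
fine_repr, coarse_repr = model(fine_ids, coarse_ids)
print(fine_repr.shape, coarse_repr.shape)  # torch.Size([1, 8, 256]) torch.Size([1, 5, 256])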
Archived Files and Locations
application/pdf, 2.6 MB
arxiv.org (repository); web.archive.org (webarchive)
arXiv: 2008.11869v3