SAS: Self-Augmented Strategy for Language Model Pre-training
by Yifei Xu, Jingqiao Zhang, Ru He, Liangzhu Ge, Chao Yang, Cheng Yang, Ying Nian Wu
2021
Abstract
The core of a self-supervised learning method for pre-training language
models includes the design of appropriate data augmentation and corresponding
pre-training task(s). Most data augmentations in language model pre-training
are context-independent. The seminal contextualized augmentation recently
proposed by ELECTRA requires a separate generator, which leads to extra
computation cost as well as the challenge of adjusting the capability of the
generator relative to that of the other model component(s). We propose a
self-augmented strategy (SAS) that uses a single forward pass through the model
to augment the input data for model training in the next epoch. Essentially,
our strategy eliminates the separate generator network and uses only one network
to both generate the data augmentation and undertake two pre-training tasks (the
masked language modeling (MLM) task and the replaced token detection (RTD) task)
jointly, which naturally avoids the challenge of adjusting the generator's
capability and reduces the computation cost. Additionally, SAS is a general
strategy that can seamlessly incorporate new techniques emerging recently or in
the future, such as the disentangled attention mechanism proposed by the DeBERTa
model. Our experiments show that SAS outperforms ELECTRA and other
state-of-the-art models on the GLUE tasks with the same or lower computation
cost.
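
To make the joint objective described above concrete, the following is a
minimal PyTorch-style sketch, not the authors' released code: a single encoder
carries both an MLM head and an RTD head, tokens sampled from the MLM head in
one epoch serve as the "self-augmented" (replaced) inputs of the next, and the
two losses are summed. The class and head names, the loss weight `lam`, and the
use of a [MASK] id for the very first epoch are all illustrative assumptions.

    # Hypothetical sketch (assumptions noted above), not the authors' code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, HIDDEN, MASK_ID = 1000, 64, 0  # toy sizes; real values would differ

    class SASModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.embed = nn.Embedding(VOCAB, HIDDEN)
            layer = nn.TransformerEncoderLayer(HIDDEN, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.mlm_head = nn.Linear(HIDDEN, VOCAB)  # predicts original tokens
            self.rtd_head = nn.Linear(HIDDEN, 1)      # original vs. replaced

        def forward(self, ids):
            h = self.encoder(self.embed(ids))
            return self.mlm_head(h), self.rtd_head(h).squeeze(-1)

    def sas_step(model, orig_ids, mask, prev_samples, lam=50.0):
        """One joint MLM+RTD step on inputs augmented by the previous epoch.

        orig_ids:     (B, T) original token ids
        mask:         (B, T) bool, positions selected for corruption
        prev_samples: (B, T) tokens sampled from the MLM head in the previous
                      epoch (the first epoch could simply use MASK_ID here)
        """
        # Self-augmented input: previous-epoch samples replace masked positions.
        inputs = torch.where(mask, prev_samples, orig_ids)
        mlm_logits, rtd_logits = model(inputs)

        # MLM: recover the original token at corrupted positions only.
        mlm_loss = F.cross_entropy(mlm_logits[mask], orig_ids[mask])

        # RTD: every position is labeled replaced (1) or original (0).
        replaced = (inputs != orig_ids).float()
        rtd_loss = F.binary_cross_entropy_with_logits(rtd_logits, replaced)

        # Samples from this same forward pass become next epoch's augmentation,
        # so no separate generator network is needed.
        with torch.no_grad():
            next_samples = torch.distributions.Categorical(logits=mlm_logits).sample()

        return mlm_loss + lam * rtd_loss, next_samples

Under these assumptions, a training loop would store `next_samples` for each
example and feed them back as `prev_samples` in the following epoch, so the
single network both produces the contextualized augmentation and is trained
on it.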
Archived Files and Locations
application/pdf, 1.0 MB — arXiv: 2106.07176v2 (arxiv.org repository; web.archive.org archive)