Visformer: The Vision-friendly Transformer
by
Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, Qi Tian
2021
Abstract
The past year has witnessed the rapid development of applying the Transformer
module to vision problems. While some researchers have demonstrated that
Transformer-based models enjoy a favorable ability to fit data, there is a
growing body of evidence showing that these models suffer from over-fitting,
especially when the training data is limited. This paper offers an empirical
study that performs step-by-step operations to gradually transition a
Transformer-based model into a convolution-based model. The results we obtain
during the transition process deliver useful messages for improving visual
recognition. Based on these observations, we propose a new architecture named
Visformer, abbreviated from 'Vision-friendly Transformer'. With the same
computational complexity, Visformer outperforms both Transformer-based and
convolution-based models in terms of ImageNet classification accuracy, and its
advantage becomes more significant when the model complexity is lower or the
training set is smaller. The code is available
at https://github.com/danczs/Visformer.
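As a rough illustration only (not the paper's actual steps; see the repository above for the real implementation), the sketch below shows the kind of step-by-step transition the abstract describes: a block whose token mixer can be switched between Transformer-style self-attention and a convolution, with the normalization layer swapped accordingly. All names (HybridBlock, use_conv) are hypothetical.

```python
# Hypothetical sketch of a Transformer-to-convolution transition step.
# Not the paper's method: block/parameter names are invented for illustration.
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    def __init__(self, dim, num_heads=8, use_conv=False):
        super().__init__()
        self.use_conv = use_conv
        if use_conv:
            # Convolution-based mixing over the 2D feature map.
            self.norm = nn.BatchNorm2d(dim)
            self.mixer = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=8)
        else:
            # Transformer-style mixing over flattened patch tokens.
            self.norm = nn.LayerNorm(dim)
            self.mixer = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, C, H, W)
        if self.use_conv:
            return x + self.mixer(self.norm(x))
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)   # (B, H*W, C)
        t = self.norm(tokens)
        attn_out, _ = self.mixer(t, t, t)       # self-attention
        return x + attn_out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(2, 64, 14, 14)
print(HybridBlock(64, use_conv=False)(x).shape)  # attention variant
print(HybridBlock(64, use_conv=True)(x).shape)   # convolution variant
```

Replacing such blocks one at a time, and measuring accuracy after each swap, is one way to carry out the gradual transition the study performs.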
Archived Files and Locations
application/pdf, 500.3 kB
arxiv.org (repository) | web.archive.org (webarchive)
arXiv: 2104.12533v3