Improving Vision Transformers for Incremental Learning
by Pei Yu, Yinpeng Chen, Ying Jin, Zicheng Liu (2022)
Abstract
This paper studies using Vision Transformers (ViT) in class incremental
learning. Surprisingly, naive application of ViT to replace convolutional
neural networks (CNNs) results in performance degradation. Our analysis reveals
three issues with naively using ViT: (a) ViT converges very slowly when the
number of classes is small, (b) ViT exhibits more bias towards new classes
than CNN-based models do, and (c) the proper learning rate for ViT is too low
to learn a good classifier. Based on this analysis, we show these issues can
be simply addressed with existing techniques: a convolutional stem, balanced
finetuning to correct bias, and a higher learning rate for the classifier. Our
simple solution, named ViTIL (ViT for Incremental Learning), achieves a new
state of the art for all three class incremental learning setups by a clear
margin, providing a strong baseline for the research community. For instance,
on ImageNet-1000, ViTIL achieves 69.20% top-1 accuracy under the protocol of
500 initial classes with 5 incremental steps (100 new classes each),
outperforming LUCIR+DDE by 1.69%. Under the more challenging protocol of 10
incremental steps (100 new classes each), our method outperforms PODNet by
7.27% (65.13% vs. 57.86%).
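
The three fixes named in the abstract lend themselves to a short
illustration. Below is a minimal PyTorch sketch, assuming a standard ViT
backbone; the names (ConvStem, make_optimizer, balanced_sampler), the channel
widths, and the learning-rate values are illustrative assumptions, not the
paper's released implementation.

# Hedged sketch of the three fixes; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler


class ConvStem(nn.Module):
    """Convolutional stem replacing ViT's single-stride patchify layer.

    A small stack of stride-2 3x3 convolutions yields the same 16x
    downsampling as a 16x16 patch embedding but tends to converge faster
    when the initial class count is small (issue (a) in the abstract).
    """

    def __init__(self, in_chans: int = 3, embed_dim: int = 768):
        super().__init__()
        dims = [in_chans, 64, 128, 256, 512]
        layers = []
        for i in range(4):  # four stride-2 convs: 224px input -> 14x14 tokens
            layers += [
                nn.Conv2d(dims[i], dims[i + 1], kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(dims[i + 1]),
                nn.ReLU(inplace=True),
            ]
        layers.append(nn.Conv2d(dims[-1], embed_dim, kernel_size=1))
        self.stem = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stem(x)                      # (B, embed_dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)   # (B, N, embed_dim) token sequence


def balanced_sampler(labels: list[int]) -> WeightedRandomSampler:
    """Class-balanced sampler for a short finetuning phase that corrects
    the bias towards new classes (issue (b)); weights are inverse
    class frequency."""
    labels_t = torch.tensor(labels)
    counts = torch.bincount(labels_t)
    weights = 1.0 / counts[labels_t].float()
    return WeightedRandomSampler(weights, num_samples=len(labels), replacement=True)


def make_optimizer(backbone: nn.Module, classifier: nn.Module) -> torch.optim.Optimizer:
    """Give the classifier a higher learning rate than the ViT backbone
    (issue (c)); the 100x ratio here is an assumption for illustration."""
    return torch.optim.SGD(
        [
            {"params": backbone.parameters(), "lr": 1e-4},
            {"params": classifier.parameters(), "lr": 1e-2},
        ],
        momentum=0.9,
        weight_decay=5e-4,
    )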
arXiv:2112.06103v2