
Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

Fine-tuning pre-trained language models such as BERT has become an effective approach in natural language processing (NLP) and yields state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-training tasks, and leveraging external data and knowledge; the fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation. The self-ensemble mechanism builds the teacher model from the checkpoints stored in an experience pool. To transfer knowledge from the teacher model to the student model efficiently, we further use knowledge distillation, which we call self-distillation because the distilled knowledge comes from the model itself across the time dimension. Experiments on the GLUE benchmark and the Text Classification benchmark show that our proposed approach significantly improves the adaptation of BERT without any external data or knowledge. We conduct exhaustive experiments to investigate the efficiency of the self-ensemble and self-distillation mechanisms, and our proposed approach achieves a new state-of-the-art result on the SNLI dataset.
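To make the two mechanisms concrete, the following is a minimal PyTorch sketch, assuming a generic classification model called as model(inputs) -> logits. The pool size K, the distillation weight LAMBDA, the MSE loss on logits, and the helper make_teacher are illustrative choices made for this sketch, not the paper's exact recipe; an alternative is to ensemble the checkpoints' predictions rather than their parameters.

```python
# Sketch only: self-ensemble (parameter-averaged teacher from a checkpoint pool)
# plus self-distillation (student matches the teacher's logits).
# Assumptions: PyTorch; `student(inputs)` returns classification logits;
# K and LAMBDA are illustrative hyper-parameters, not values from the paper.
import copy
from collections import deque

import torch
import torch.nn.functional as F

K = 5          # size of the "experience pool" of recent checkpoints (illustrative)
LAMBDA = 1.0   # weight of the self-distillation term (illustrative)


def make_teacher(student, pool):
    """Self-ensemble: average the parameters of the checkpoints in the pool."""
    teacher = copy.deepcopy(student)
    with torch.no_grad():
        for name, param in teacher.named_parameters():
            param.copy_(torch.stack([ckpt[name] for ckpt in pool]).mean(dim=0))
    return teacher.eval()


def fine_tune(student, loader, optimizer, num_steps):
    pool = deque(maxlen=K)  # experience pool of recent student checkpoints
    student.train()
    for step, (inputs, labels) in zip(range(num_steps), loader):
        # Snapshot the current student weights and rebuild the averaged teacher.
        pool.append({n: p.detach().clone() for n, p in student.named_parameters()})
        teacher = make_teacher(student, pool)

        logits = student(inputs)                 # student predictions
        with torch.no_grad():
            teacher_logits = teacher(inputs)     # self-ensembled teacher predictions

        ce = F.cross_entropy(logits, labels)     # supervised cross-entropy
        kd = F.mse_loss(logits, teacher_logits)  # self-distillation: match teacher logits
        loss = ce + LAMBDA * kd

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

A practical implementation would likely snapshot checkpoints at a coarser granularity and cache the averaged parameters instead of deep-copying the model at every step; the sketch favors clarity over efficiency.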



Author information

Corresponding author: Xi-Peng Qiu.

Supplementary Information

ESM 1 (PDF 397 KB)


About this article


Cite this article

Xu, YG., Qiu, XP., Zhou, LG. et al. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. J. Comput. Sci. Technol. 38, 853–866 (2023). https://doi.org/10.1007/s11390-021-1119-0



