
Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation

  • Regular Paper
  • Journal of Computer Science and Technology

Abstract

Fine-tuning pre-trained language models such as BERT has become an effective approach in natural language processing (NLP) and yields state-of-the-art results on many downstream tasks. Recent studies on adapting BERT to new tasks mainly focus on modifying the model structure, re-designing the pre-training tasks, and leveraging external data and knowledge; the fine-tuning strategy itself has yet to be fully explored. In this paper, we improve the fine-tuning of BERT with two effective mechanisms: self-ensemble and self-distillation. The self-ensemble mechanism builds the teacher model from the checkpoints stored in an experience pool. To transfer knowledge from the teacher model to the student model efficiently, we further use knowledge distillation, which we call self-distillation because the distilled knowledge comes from the model itself across the time dimension. Experiments on the GLUE benchmark and the Text Classification benchmark show that our proposed approach significantly improves the adaptation of BERT without any external data or knowledge. We conduct exhaustive experiments to investigate the efficiency of the self-ensemble and self-distillation mechanisms, and our proposed approach achieves a new state-of-the-art result on the SNLI dataset.
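To make the two mechanisms concrete, the following is a minimal PyTorch sketch, assuming a generic classification model called as model(inputs) -> logits. The pool size K, the distillation weight LAMBDA, the MSE loss on logits, and the helper make_teacher are illustrative choices made for this sketch, not the paper's exact recipe; an alternative is to ensemble the checkpoints' predictions rather than their parameters.

```python
# Sketch only: self-ensemble (parameter-averaged teacher from a checkpoint pool)
# plus self-distillation (student matches the teacher's logits).
# Assumptions: PyTorch; `student(inputs)` returns classification logits;
# K and LAMBDA are illustrative hyper-parameters, not values from the paper.
import copy
from collections import deque

import torch
import torch.nn.functional as F

K = 5          # size of the "experience pool" of recent checkpoints (illustrative)
LAMBDA = 1.0   # weight of the self-distillation term (illustrative)


def make_teacher(student, pool):
    """Self-ensemble: average the parameters of the checkpoints in the pool."""
    teacher = copy.deepcopy(student)
    with torch.no_grad():
        for name, param in teacher.named_parameters():
            param.copy_(torch.stack([ckpt[name] for ckpt in pool]).mean(dim=0))
    return teacher.eval()


def fine_tune(student, loader, optimizer, num_steps):
    pool = deque(maxlen=K)  # experience pool of recent student checkpoints
    student.train()
    for step, (inputs, labels) in zip(range(num_steps), loader):
        # Snapshot the current student weights and rebuild the averaged teacher.
        pool.append({n: p.detach().clone() for n, p in student.named_parameters()})
        teacher = make_teacher(student, pool)

        logits = student(inputs)                 # student predictions
        with torch.no_grad():
            teacher_logits = teacher(inputs)     # self-ensembled teacher predictions

        ce = F.cross_entropy(logits, labels)     # supervised cross-entropy
        kd = F.mse_loss(logits, teacher_logits)  # self-distillation: match teacher logits
        loss = ce + LAMBDA * kd

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

A practical implementation would likely snapshot checkpoints at a coarser granularity and cache the averaged parameters instead of deep-copying the model at every step; the sketch favors clarity over efficiency.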



Author information

Corresponding author: Xi-Peng Qiu.

Supplementary Information

ESM 1 (PDF 397 KB)


About this article


Cite this article

Xu, YG., Qiu, XP., Zhou, LG. et al. Improving BERT Fine-Tuning via Self-Ensemble and Self-Distillation. J. Comput. Sci. Technol. 38, 853–866 (2023). https://doi.org/10.1007/s11390-021-1119-0



