On the Value of Head Labels in Multi-Label Text Classification

Published: 26 March 2024

Abstract

A formidable challenge in multi-label text classification (MLTC) is that labels often exhibit a long-tailed distribution, which typically prevents deep MLTC models from achieving satisfactory performance. To alleviate this problem, most existing solutions attempt to improve tail-label performance through sampling or by introducing extra knowledge. Data-rich head labels, though more trustworthy, have not received the attention they deserve. In this work, we propose a multi-stage training framework that exploits both model- and feature-level knowledge from the head labels to improve the representation and generalization ability of MLTC models. Moreover, we theoretically prove the superiority of our framework design over alternative designs. Comprehensive experiments on widely used MLTC datasets demonstrate that the proposed framework substantially outperforms state-of-the-art methods, highlighting the value of head labels in MLTC.
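The abstract names the idea (transfer model- and feature-level knowledge from data-rich head labels) but does not spell out the training procedure. The minimal sketch below illustrates one plausible reading as a two-stage regime in PyTorch: stage one fits an encoder and classifier on head labels only, and stage two reuses that head-trained encoder while fine-tuning on the full label set. Everything here is an assumption for exposition, not the paper's actual method: the split_head_tail helper, the 20% head fraction, the plain BCE losses, and the learning rates are all hypothetical placeholders.

    import torch
    import torch.nn as nn

    # Illustrative only: the helper name, the 20% head fraction, and the
    # two-stage schedule below are assumptions, not taken from the paper.

    def split_head_tail(label_freqs, head_fraction=0.2):
        """Mark the most frequent labels as 'head' labels."""
        order = torch.argsort(label_freqs, descending=True)
        n_head = max(1, int(head_fraction * label_freqs.numel()))
        head_mask = torch.zeros(label_freqs.numel(), dtype=torch.bool)
        head_mask[order[:n_head]] = True
        return head_mask

    class MLTCModel(nn.Module):
        """A text encoder (e.g., BERT-style) with a per-label linear head."""
        def __init__(self, encoder, hidden_dim, num_labels):
            super().__init__()
            self.encoder = encoder
            self.classifier = nn.Linear(hidden_dim, num_labels)

        def forward(self, x):
            return self.classifier(self.encoder(x))  # (batch, num_labels) logits

    def stage_one(model, head_mask, loader, epochs=3, lr=2e-5):
        """Stage 1: train encoder + classifier on head labels only.
        `loader` yields (inputs, multi-hot float targets)."""
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        bce = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            for x, y in loader:
                loss = bce(model(x)[:, head_mask], y[:, head_mask])
                opt.zero_grad()
                loss.backward()
                opt.step()

    def stage_two(model, loader, epochs=3, lr=1e-5):
        """Stage 2: keep the head-trained encoder (model-level knowledge)
        and fine-tune on the full label set at a lower learning rate."""
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        bce = nn.BCEWithLogitsLoss()
        for _ in range(epochs):
            for x, y in loader:
                loss = bce(model(x), y)
                opt.zero_grad()
                loss.backward()
                opt.step()

Other readings are equally plausible (e.g., freezing the encoder in stage two, re-initializing the classifier between stages, or distilling head-label features into tail-label classifiers); the abstract alone does not pin down the transfer mechanism.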



• Published in

  ACM Transactions on Knowledge Discovery from Data, Volume 18, Issue 5
  June 2024, 699 pages
  ISSN: 1556-4681
  EISSN: 1556-472X
  DOI: 10.1145/3613659

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 26 March 2024
      • Online AM: 5 February 2024
      • Accepted: 24 January 2024
      • Revised: 14 December 2023
      • Received: 26 May 2022
Published in TKDD Volume 18, Issue 5


      Qualifiers

      • research-article
• Article Metrics

  • Downloads (last 12 months): 156
  • Downloads (last 6 weeks): 59
