Abstract
A formidable challenge in multi-label text classification (MLTC) is that the labels often exhibit a long-tailed distribution, which typically prevents deep MLTC models from achieving satisfactory performance. To alleviate this problem, most existing solutions attempt to improve tail performance by means of sampling or by introducing extra knowledge. Data-rich head labels, though more trustworthy, have not received the attention they deserve. In this work, we propose a multi-stage training framework that exploits both model- and feature-level knowledge from the head labels to improve the representation and generalization ability of MLTC models. Moreover, we theoretically prove the superiority of our framework design over alternatives. Comprehensive experiments on widely used MLTC datasets demonstrate that the proposed framework substantially outperforms state-of-the-art methods, highlighting the value of head labels in MLTC.
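To make the long-tailed setting concrete, the sketch below partitions a label set into head and tail groups by training frequency, which is the premise the framework builds on. This is an illustrative toy, not the paper's actual procedure; the `head_fraction` threshold and the helper name `split_head_tail` are assumptions for the example.

```python
from collections import Counter

def split_head_tail(label_lists, head_fraction=0.2):
    """Partition labels by frequency: the most frequent `head_fraction`
    of labels form the data-rich head, the remainder the tail.
    (Illustrative helper; the threshold is an assumption, not from the paper.)"""
    freq = Counter(label for labels in label_lists for label in labels)
    ranked = [label for label, _ in freq.most_common()]
    k = max(1, int(len(ranked) * head_fraction))
    return set(ranked[:k]), set(ranked[k:])

# Toy long-tailed label assignments for six documents:
# "sports" appears 4 times, the other labels far less often.
docs = [["sports"], ["sports", "news"], ["sports"], ["news"],
        ["sports", "finance"], ["art"]]
head, tail = split_head_tail(docs, head_fraction=0.25)
print(head)  # {'sports'}
```

A multi-stage scheme in the spirit of the abstract would first train on the head partition to obtain reliable representations, then reuse that model (and its features) when training on the full label set.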
On the Value of Head Labels in Multi-Label Text Classification