Abstract
Data stream classification is an important research direction in data mining. However, multi-class imbalance, concept drift, and a variable class imbalance ratio in data streams can severely degrade the performance of classification models, and the high cost of sample labeling remains a persistent research concern. To address these problems, an online active learning method for multi-class imbalanced data streams (OALM-MI) is proposed. First, a comprehensive sample weighting method based on cross-entropy and margin values weights each incoming sample according to its classification difficulty and importance, with the aim of enhancing the classifier's ability to learn important samples. Second, a comprehensive weighting and updating strategy for the ensemble classifiers is introduced, which combines mean square error, improved square error, recall, and each classifier's weight in the previous sliding window of samples. Third, an adaptive window is used to detect and handle concept drift, enabling better adaptation to changes in the data stream during learning. Finally, a margin matrix label request strategy based on the class imbalance ratio selects samples for labeling according to their imbalance ratio and classification difficulty, which provides more learning opportunities for minority class samples and important samples. Comprehensive experiments on 12 synthetic and six real data streams against seven state-of-the-art algorithms showed that OALM-MI achieved the highest performance in terms of recall, precision, F1-score, Kappa, and G-mean.
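The two difficulty signals named in the abstract, cross-entropy and margin, can be illustrated with a minimal Python sketch. The definitions of `cross_entropy` and `margin` below are the standard ones for a predicted class-probability vector. The function `sample_weight`, its parameter `alpha`, and the way the two signals are combined are hypothetical and are not taken from the paper, whose exact weighting formula is not given in the abstract.

```python
import math


def cross_entropy(probs, true_label):
    """Negative log-probability the classifier assigns to the true class."""
    eps = 1e-12  # guard against log(0)
    return -math.log(max(probs[true_label], eps))


def margin(probs, true_label):
    """True-class probability minus the largest probability of any other class.

    Values near 1 mean a confident correct prediction; negative values
    mean the sample is misclassified.
    """
    best_other = max(p for i, p in enumerate(probs) if i != true_label)
    return probs[true_label] - best_other


def sample_weight(probs, true_label, alpha=0.5):
    """Hypothetical combination of both signals (an assumption, not OALM-MI).

    Both signals are mapped to [0, 1], where larger means harder, and
    blended with alpha, so the returned weight lies in [1, 2].
    """
    ce_difficulty = 1.0 - math.exp(-cross_entropy(probs, true_label))
    margin_difficulty = (1.0 - margin(probs, true_label)) / 2.0
    return 1.0 + alpha * ce_difficulty + (1.0 - alpha) * margin_difficulty


if __name__ == "__main__":
    easy = [0.9, 0.05, 0.05]  # confidently correct for class 0
    hard = [0.2, 0.7, 0.1]    # class 0 sample predicted as class 1
    print("easy:", sample_weight(easy, 0))
    print("hard:", sample_weight(hard, 0))
```

Under these assumptions a misclassified or low-confidence sample receives a larger weight than a confidently correct one, which matches the stated goal of giving difficult samples more influence on training.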
Funding
This work was supported by the National Natural Science Foundation of China (62062004) and the Ningxia Natural Science Foundation Project (2022AAC03279).
Author information
Contributions
AL completed the main work, the coding of the model, experiments, and the writing of the main paper; MH reviewed the paper and provided guidance on experiments and funding support; DM completed some experiments and the production of experimental diagrams; and ZG and SL participated in the coordination of the study and reviewed the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Human or animal rights
With the unanimous consent of all authors, we confirm that this paper concerns only research on a machine learning algorithm and does not involve human participants and/or animals. All data are open source and do not involve the interests of others.
About this article
Cite this article
Li, A., Han, M., Mu, D. et al. Online active learning method for multi-class imbalanced data stream. Knowl Inf Syst 66, 2355–2391 (2024). https://doi.org/10.1007/s10115-023-02027-w
DOI: https://doi.org/10.1007/s10115-023-02027-w