
Online active learning method for multi-class imbalanced data stream

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Data stream classification is an important research direction in data mining. However, issues such as multi-class imbalance, concept drift, and a variable class imbalance ratio can greatly degrade the performance of classification models, and the high cost of sample labeling remains a central concern. To address these problems, an online active learning method for multi-class imbalanced data streams (OALM-MI) is proposed. First, a comprehensive sample weighting method based on cross-entropy and margin values weights each incoming sample according to its classification difficulty and importance, enhancing the classifier's ability to learn from important samples. Second, a comprehensive weighting and updating strategy for the ensemble classifiers is introduced, which combines mean square error, an improved square error, recall, and the classifier weights from the previous sliding window to weight and update the classifiers. Third, an adaptive window is used to detect and handle concept drift, allowing the model to adapt to changes in the data stream during learning. Finally, a margin-matrix label request strategy based on the class imbalance ratio requests labels for samples according to their imbalance ratio and classification difficulty, providing more learning opportunities for minority-class and important samples. Comprehensive experiments on 12 synthetic and six real data streams against seven state-of-the-art algorithms show that OALM-MI achieves the best performance in terms of recall, precision, F1-score, Kappa, and G-mean.
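
The abstract describes four components: cross-entropy/margin sample weighting, ensemble weighting, adaptive-window drift handling, and an imbalance-aware label request strategy. As a rough illustration only, and not the authors' OALM-MI implementation (whose details appear in the full paper), the short Python sketch below shows one plausible way to combine a cross-entropy/margin sample weight with an imbalance-aware margin threshold for deciding when to request a label; all function names, the threshold formula, and the toy numbers are hypothetical.

    # Illustrative sketch only; not the OALM-MI algorithm from the paper.
    import numpy as np

    def sample_weight(probs: np.ndarray, true_class: int) -> float:
        """Combine cross-entropy (classification difficulty) with the margin
        between the two largest predicted probabilities (uncertainty)."""
        eps = 1e-12
        cross_entropy = -np.log(probs[true_class] + eps)
        top2 = np.sort(probs)[-2:]
        margin = top2[1] - top2[0]          # small margin -> uncertain sample
        return cross_entropy + (1.0 - margin)

    def request_label(probs: np.ndarray, class_counts: np.ndarray,
                      base_threshold: float = 0.3) -> bool:
        """Imbalance-aware active-learning query: the smaller the predicted
        class's share of the stream so far, the more readily a label is asked for."""
        pred = int(np.argmax(probs))
        ratio = class_counts[pred] / max(class_counts.sum(), 1)
        top2 = np.sort(probs)[-2:]
        margin = top2[1] - top2[0]
        # raise the margin threshold for under-represented classes
        threshold = base_threshold * (1.0 + (1.0 - ratio))
        return margin < threshold

    # Toy usage on one incoming sample
    probs = np.array([0.15, 0.55, 0.30])     # classifier's class probabilities
    counts = np.array([900, 80, 20])         # per-class counts seen so far
    print(request_label(probs, counts), sample_weight(probs, true_class=2))

In this toy setting, the uncertain prediction on a rare class triggers a label request, and the misclassified minority-class sample receives a large weight, mirroring the general intent described in the abstract.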


Notes

  1. https://github.com/Waikato/moa

  2. https://www.kaggle.com/datasets/cicdataset/cicids2017

  3. https://archive.ics.uci.edu/datasets

  4. http://www.keel.es/

Funding

This work was supported by the National Natural Science Foundation of China (62062004) and the Ningxia Natural Science Foundation Project (2022AAC03279).

Author information

Contributions

AL completed the main work, including the coding of the model, the experiments, and the writing of the paper; MH reviewed the paper, provided guidance on the experiments, and secured funding; DM carried out some of the experiments and produced the experimental diagrams; ZG and SL participated in coordinating the study and reviewed the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Meng Han.

Ethics declarations

Competing interests

The authors declare no competing interests.

Human or animal rights

With the unanimous consent of all authors, we confirm that this paper concerns research on a machine learning algorithm and does not involve human participants or animals. All data are open source and do not affect the interests of others.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Li, A., Han, M., Mu, D. et al. Online active learning method for multi-class imbalanced data stream. Knowl Inf Syst 66, 2355–2391 (2024). https://doi.org/10.1007/s10115-023-02027-w

