research-article

AppClassNet: a commercial-grade dataset for application identification research

Published: 06 September 2022
Editorial Notes

The authors have requested minor, non-substantive changes to the Version of Record and, in accordance with ACM policies, a Corrected Version of Record was published on September 22, 2022. For reference purposes, the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

The recent success of Artificial Intelligence (AI) is rooted in several concomitant factors, namely theoretical progress coupled with an abundance of data and computing power. Large companies can take advantage of a deluge of data that is typically withheld from the research community due to privacy or business sensitivity concerns, and this is particularly true for networking data. Therefore, the lack of high-quality data is often recognized as one of the main factors currently preventing networking research from fully leveraging the potential of AI methodologies.

Following numerous requests from the scientific community, we release AppClassNet, a commercial-grade dataset for benchmarking traffic classification and management methodologies. AppClassNet is significantly larger than the datasets generally available to the academic community in terms of both the number of samples and the number of classes, and reaches scales similar to the popular ImageNet dataset commonly used in the computer vision literature. To avoid leaking user- and business-sensitive information, we appropriately anonymized the dataset, while empirically showing that it still represents a relevant benchmark for algorithmic research. In this paper, we describe the public dataset and our anonymization process. We hope that AppClassNet can be instrumental for other researchers to address more complex commercial-grade problems in the broad field of traffic classification and management.

Published in

ACM SIGCOMM Computer Communication Review, Volume 52, Issue 3 (July 2022), 27 pages
ISSN: 0146-4833
DOI: 10.1145/3561954

Copyright © 2022. Copyright is held by the owner/author(s).

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery, New York, NY, United States