Editorial Notes
The authors have requested minor, non-substantive changes to the Version of Record and, in accordance with ACM policies, a Corrected Version of Record was published on September 22, 2022. For reference purposes, the VoR may still be accessed via the Supplemental Material section on this page.
Abstract
The recent success of Artificial Intelligence (AI) is rooted in several concomitant factors, namely theoretical progress coupled with an abundance of data and computing power. Large companies can take advantage of a deluge of data that is typically withheld from the research community due to privacy or business-sensitivity concerns, and this is particularly true for networking data. As a result, the lack of high-quality data is often recognized as one of the main factors currently preventing networking research from fully leveraging the potential of AI methodologies.
Following numerous requests from the scientific community, we release AppClassNet, a commercial-grade dataset for benchmarking traffic classification and management methodologies. AppClassNet is significantly larger than the datasets generally available to the academic community in terms of both the number of samples and classes, and reaches scales similar to the popular ImageNet dataset commonly used in the computer vision literature. To avoid leaking user- and business-sensitive information, we suitably anonymized the dataset, while empirically showing that it still represents a relevant benchmark for algorithmic research. In this paper, we describe the public dataset and our anonymization process. We hope that AppClassNet can help other researchers address more complex commercial-grade problems in the broad field of traffic classification and management.
Supplemental Material
Version of Record for "AppClassNet: a commercial-grade dataset for application identification research" by Wang et al., ACM SIGCOMM Computer Communication Review, Volume 52, Issue 3 (SIGCOMM CCR 52:3).