AppClassNet: a commercial-grade dataset for application identification research

Authors: Chao Wang, Alessandro Finamore, Lixuan Yang, Kevin Fauvel, Dario Rossi

Affiliation (all authors): Huawei Technologies France SASU

ACM SIGCOMM Computer Communication Review, Volume 52, Issue 3, July 2022, pp. 19–27. https://doi.org/10.1145/3561954.3561958

Published: 06 September 2022

Editorial Notes

The authors have requested minor, non-substantive changes to the Version of Record and, in accordance with ACM policies, a Corrected Version of Record was published on September 22, 2022. For reference purposes, the VoR may still be accessed via the Supplemental Material section on this page.

Abstract

The recent success of Artificial Intelligence (AI) is rooted in several concomitant factors, namely theoretical progress coupled with an abundance of data and computing power. Large companies can take advantage of a deluge of data that is typically withheld from the research community due to privacy or business-sensitivity concerns, and this is particularly true for networking data. As a result, the lack of high-quality data is often recognized as one of the main factors currently preventing networking research from fully leveraging the potential of AI methodologies.

Following numerous requests from the scientific community, we release AppClassNet, a commercial-grade dataset for benchmarking traffic classification and management methodologies. AppClassNet is significantly larger than the datasets generally available to the academic community, in terms of both the number of samples and the number of classes, and reaches a scale similar to that of the popular ImageNet dataset commonly used in the computer vision literature. To avoid leaking user- and business-sensitive information, we suitably anonymized the dataset, while showing empirically that it remains a relevant benchmark for algorithmic research. In this paper, we describe the public dataset and our anonymization process. We hope that AppClassNet can help other researchers address more complex, commercial-grade problems in the broad field of traffic classification and management.

Supplemental Material

3561958-vor.pdf (535.1 KB)

Version of Record for "AppClassNet: a commercial-grade dataset for application identification research" by Wang et al., ACM SIGCOMM Computer Communication Review, Volume 52, Issue 3 (SIGCOMM CCR 52:3).

Index Terms

AppClassNet: a commercial-grade dataset for application identification research
1. Computing methodologies
  1. Machine learning
2. Networks
  1. Network performance evaluation
    1. Network measurement

Published in
ACM SIGCOMM Computer Communication Review, Volume 52, Issue 3
July 2022
27 pages
ISSN: 0146-4833
DOI: 10.1145/3561954
Editor: Steve Uhlig, Queen Mary, Univ. of London
Copyright © 2022. Copyright is held by the owner/author(s).
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
application identification
machine learning
neural networks
open dataset
supervised learning
traffic classification
Qualifiers
- research-article