Counteracting Concept Drift by Learning with Future Malware Predictions
arXiv - CS - Cryptography and Security. Pub Date: 2024-04-14, DOI: arxiv-2404.09352
Branislav Bosansky, Lada Hospodkova, Michal Najman, Maria Rigaki, Elnaz Babayeva, Viliam Lisy

The accuracy of deployed malware-detection classifiers degrades over time due to changes in data distributions and increasing discrepancies between training and testing data. This phenomenon is known as concept drift. While concept drift can have various causes in general, new malicious files are created by malware authors with the clear intention of avoiding detection. This intent opens the possibility of predicting such future samples. Including predicted samples in the training data should consequently increase the accuracy of the classifiers on new testing data. We compare two methods for predicting future samples: (1) adversarial training and (2) generative adversarial networks (GANs). The first method explicitly seeks adversarial examples against the classifier, which are then used as part of the training data. Similarly, GANs also generate synthetic training data. We use GANs to learn changes in data distributions across different time periods of the training data and then apply these changes to generate samples that could appear in the testing data. We compare these prediction methods on two different datasets: (1) the public Ember dataset and (2) an internal dataset of files incoming to Avast. We show that while adversarial training yields more robust classifiers, this method is not a good predictor of future malware in general. This is in contrast with previously reported positive results in different domains (including natural language processing and spam detection). On the other hand, we show that GANs can be successfully used as predictors of future malware. We specifically examine malware families that exhibit significant changes in their data distributions over time, and the experimental results confirm that GAN-based predictions can significantly improve the accuracy of the classifier on new, previously unseen data.
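
The adversarial-training idea mentioned in the abstract (find examples that fool the current classifier, then add them to the training data) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the detector is a toy logistic regression on synthetic feature vectors, the perturbation is an FGSM-style signed-gradient step, and the names `train_logreg`, `fgsm_examples`, and `adversarial_training` are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(w, b, x):
    # Probability that feature vector x is malicious (label 1).
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_logreg(X, y, lr=0.1, epochs=100):
    """Batch gradient descent on the log loss; returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = predict(w, b, x) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def sign(v):
    return (v > 0) - (v < 0)

def fgsm_examples(X, y, w, b, eps=0.5):
    """Shift each sample by eps along the sign of its loss gradient (assumed attack)."""
    X_adv = []
    for x, t in zip(X, y):
        err = predict(w, b, x) - t  # derivative of the log loss w.r.t. the logit
        X_adv.append([xj + eps * sign(err * wj) for xj, wj in zip(x, w)])
    return X_adv

def log_loss(X, y, w, b):
    total = 0.0
    for x, t in zip(X, y):
        p = min(max(predict(w, b, x), 1e-12), 1 - 1e-12)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(y)

def adversarial_training(X, y, rounds=3, eps=0.5):
    """Repeatedly attack the current model and retrain on the augmented set."""
    X_aug, y_aug = list(X), list(y)
    w, b = train_logreg(X_aug, y_aug)
    for _ in range(rounds):
        X_aug += fgsm_examples(X, y, w, b, eps)  # adversarial copies keep their labels
        y_aug += list(y)
        w, b = train_logreg(X_aug, y_aug)
    return w, b, X_aug, y_aug

# Synthetic stand-in data: 50 "benign" and 50 "malicious" 3-dimensional vectors.
random.seed(0)
X = [[random.gauss(-1, 1) for _ in range(3)] for _ in range(50)] + \
    [[random.gauss(1, 1) for _ in range(3)] for _ in range(50)]
y = [0] * 50 + [1] * 50
w, b, X_aug, y_aug = adversarial_training(X, y)
```

The sketch only shows the augmentation loop. It says nothing about whether such perturbed samples resemble future malware, which is the question the abstract reports a negative answer to.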

Updated: 2024-04-16