A Big Data architecture for early identification and categorization of dark web sites
Future Generation Computer Systems (IF 7.5), Pub Date: 2024-03-20, DOI: 10.1016/j.future.2024.03.025
Javier Pastor-Galindo, Hông-Ân Sandlin, Félix Gómez Mármol, Gérôme Bovet, Gregorio Martínez Pérez

The dark web has become notorious for its association with illicit activities, and there is a growing need for systems that automate the monitoring of this space. This paper proposes an end-to-end scalable architecture for the continuous early identification of new Tor sites and the daily analysis of their content. The solution is built on an open-source Big Data stack for data serving with Kubernetes, Kafka, Kubeflow, and MinIO. It continuously discovers onion addresses in different sources (threat intelligence, code repositories, web-Tor gateways, and Tor repositories), downloads the HTML from Tor, deduplicates the content using MinHash LSH, and categorizes it with BERTopic modeling (SBERT embedding, UMAP dimensionality reduction, HDBSCAN document clustering, and c-TF-IDF topic keywords). In 93 days, the system identified 80,049 onion services and characterized 90% of them, addressing the challenge of Tor volatility. A disproportionate amount of repeated content is found, with only 6.1% unique sites. From the HTML files of the dark sites, 31 distinct low-level topics are extracted, manually labeled, and grouped into 11 high-level topics. The five most popular include sexual and violent content, repositories and search engines, carding, cryptocurrencies, and marketplaces. During the experiments, we identified 14 sites with 13,946 clones that shared a suspiciously similar mirroring rate per day, suggesting an extensive common phishing network. Among the related works, this study is the most representative characterization of onion services based on topics to date.
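
To make the deduplication step more concrete, the following is a minimal sketch of MinHash with LSH banding in plain Python. It is not the authors' implementation: the abstract gives no parameters, so the shingle size, signature length (`NUM_PERM`), band count (`BANDS`), similarity threshold, and all function names are assumptions chosen for illustration, and seeded `blake2b` hashes stand in for random permutations.

```python
import hashlib
import re
from collections import defaultdict

# All parameters below are illustrative assumptions, not values from the paper.
NUM_PERM = 64               # length of each MinHash signature
BANDS = 16                  # number of LSH bands
ROWS = NUM_PERM // BANDS    # signature rows per band


def shingles(text, k=3):
    """Return the set of k-word shingles of the lowercased text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """64-bit hash of a token under a given seed (stands in for a permutation)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash(tokens):
    """MinHash signature: for each seed, the minimum hash over all tokens."""
    if not tokens:
        return None
    return tuple(min(_hash(seed, t) for t in tokens) for seed in range(NUM_PERM))


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / NUM_PERM


def band_keys(sig):
    """Split a signature into BANDS bucket keys of ROWS values each."""
    return [(b, sig[b * ROWS:(b + 1) * ROWS]) for b in range(BANDS)]


def deduplicate(pages, threshold=0.8):
    """Map each address to the first-seen address with near-identical content.

    pages: dict of address -> page text. Only representatives are indexed,
    so every clone points at a page that was kept as unique.
    """
    buckets = defaultdict(list)   # band key -> representative addresses
    signatures = {}               # representative address -> signature
    representative = {}
    for address, text in pages.items():
        sig = minhash(shingles(text))
        if sig is None:           # empty page: treat as its own representative
            representative[address] = address
            continue
        candidates = set()
        for key in band_keys(sig):
            candidates.update(buckets[key])
        match = next(
            (c for c in sorted(candidates)
             if estimate_jaccard(sig, signatures[c]) >= threshold),
            None,
        )
        if match is None:
            signatures[address] = sig
            for key in band_keys(sig):
                buckets[key].append(address)
            representative[address] = address
        else:
            representative[address] = match
    return representative
```

Calling `deduplicate({"a.onion": text_a, "b.onion": text_b})` returns a dictionary in which clones map to the address of the page they duplicate, so the share of unique sites is the fraction of addresses that map to themselves. Banding keeps the comparison cost low because a page is only compared against representatives that share at least one band, rather than against every page seen so far.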

Last updated: 2024-03-20