Partial Network Partitioning
ACM Transactions on Computer Systems (IF 1.5), Pub Date: 2022-12-19, DOI: https://dl.acm.org/doi/10.1145/3576192
Basil Alkhatib, Sreeharsha Udayashankar, Sara Qunaibi, Ahmed Alquraan, Mohammed Alfatafta, Wael Al-Manasrah, Alex Depoutovitch, Samer Al-Kiswany

We present an extensive study of partial network partitioning. Partial network partitions disrupt communication between some, but not all, nodes in a cluster. First, we conduct a comprehensive study of system failures caused by this fault in 13 popular systems. Our study reveals that the studied failures are catastrophic (e.g., they lead to data loss), manifest easily, and are mainly due to design flaws. Our analysis identifies vulnerabilities in core system mechanisms, including scheduling, membership management, and ZooKeeper-based configuration management.

Second, we dissect the design of nine popular systems and identify four principled approaches for tolerating partial partitions. Unfortunately, our analysis shows that the implemented fault-tolerance techniques are inadequate for modern systems: they either patch a particular mechanism or lead to a complete cluster shutdown, even when alternative network paths exist.

Finally, our findings motivate us to build Nifty, a transparent communication layer that masks partial network partitions. Nifty builds an overlay between nodes to detour packets around partial partitions. Nifty also provides a way for applications to optimize their operation during a partial partition. We demonstrate the benefit of this approach by integrating Nifty with VoltDB, HDFS, and Kafka.
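
The idea of detouring around a partial partition can be illustrated with a minimal sketch. This is not Nifty's implementation; the function name `find_route`, the link-list representation, and the three-node cluster are assumptions made purely for illustration. The sketch runs a breadth-first search over the links that still work and returns a shortest multi-hop route between two nodes whose direct link has failed.

```python
from collections import deque


def find_route(links, src, dst):
    """Return a shortest hop list from src to dst over working links, or None.

    `links` is an iterable of (node, node) pairs that can still communicate
    directly. This is a hypothetical helper, not part of Nifty.
    """
    # Build an undirected adjacency map from the working links.
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # Standard breadth-first search, remembering each node's predecessor.
    parent = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(neighbours.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # No alternative path: src and dst are fully partitioned.


# Partial partition: the A-B link has failed, but both nodes still reach C.
working_links = [("A", "C"), ("B", "C")]
print(find_route(working_links, "A", "B"))  # ['A', 'C', 'B']
```

In this example, traffic from A to B can be relayed through C. If no node bridges the two sides, the function returns `None`, which corresponds to a complete rather than a partial partition between that pair.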




Updated: 2022-12-20