Surviving switch failures in cloud datacenters
ACM SIGCOMM Computer Communication Review (IF 2.8), Pub Date: 2021-05-10, DOI: 10.1145/3464994.3464996
Rachee Singh, Muqeet Mukhtar, Ashay Krishna, Aniruddha Parkhi, Jitendra Padhye, David Maltz

Switch failures can hamper access to client services, cause link congestion, and blackhole network traffic. In this study, we examine the nature of switch failures in the datacenters of a large commercial cloud provider through the lens of survival theory. We study a cohort of over 180,000 switches with a variety of hardware and software configurations and find that datacenter switches have a 98% likelihood of functioning uninterrupted for more than 3 months after deployment in production. However, switch survival rates vary significantly with hardware and software: the switches of one vendor are twice as likely to fail as those of the others. We attribute the majority of switch failures to hardware impairments and unplanned power losses. We find that the in-house switch operating system, SONiC, boosts the survival likelihood of switches in datacenters by 1% by eliminating switch failures caused by software bugs in vendor switch OSes.
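
To make the survival-theory framing concrete, the sketch below computes a survival curve with the Kaplan-Meier estimator, a standard tool in survival analysis. The abstract does not say which estimator the authors used, so treat this as an illustration only. The function name `kaplan_meier` and the toy cohort are hypothetical and are not data from the paper.

```python
from collections import Counter


def kaplan_meier(durations, failed):
    """Kaplan-Meier survival estimate.

    durations: days each switch was observed in production.
    failed:    True if the switch failed at that time, False if it was
               still running when observation ended (censored).
    Returns a list of (time, survival_probability) at each failure time.
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # failures and censored units leaving at t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk switches surviving time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Hypothetical toy cohort of 10 switches (not data from the paper).
durations = [10, 30, 45, 90, 90, 120, 120, 120, 120, 120]
failed = [True, False, True, False, False,
          False, False, False, False, False]

curve = kaplan_meier(durations, failed)
for t, s in curve:
    print(f"day {t}: estimated survival {s:.4f}")
```

Censoring matters here: a switch that is still healthy when observation ends contributes to the at-risk count up to that point, but is never counted as a failure.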

Updated: 2021-05-10