Network states-aware collective communication optimization
Cluster Computing (IF 4.4) Pub Date: 2024-03-10, DOI: 10.1007/s10586-024-04330-9
Jingyuan Wang, Tianhai Zhao, Yunlan Wang

Message Passing Interface (MPI) is the de facto standard for parallel programming, and collective operations in MPI are widely utilized by numerous scientific applications. The efficiency of these collective operations greatly impacts the performance of parallel applications. With the increasing scale and heterogeneity of HPC systems, the network environment has become more complex. Network states vary widely and dynamically between node pairs, which makes it more difficult to design efficient collective communication algorithms. In this paper, we propose a method to optimize collective operations by using real-time measured network states, focusing specifically on the binomial tree algorithm. Our approach employs a low-overhead method to measure network states and constructs a low-latency binomial tree based on the measurement results. Additionally, we take into account the differences between the two underlying MPI point-to-point communication protocols, eager and rendezvous, and design tailored binomial tree construction algorithms for each protocol. We have implemented hierarchical MPI_Bcast, MPI_Reduce, MPI_Gather and MPI_Scatter, using our network states-aware binomial tree algorithm at the inter-node level. The benchmark results demonstrate that our algorithm effectively improves performance for small and medium messages when compared with the default binomial tree algorithm in Open MPI. Specifically, for MPI_Bcast, we observe an average performance improvement of over 15.5% when the message size is less than 64KB. Similarly, for MPI_Reduce, there is an average performance improvement of over 12.1% when the message size is below 2KB. In addition, there is an average performance improvement of over 10% for MPI_Gather when the message size ranges from 64B to 512B. For MPI_Scatter, our algorithm achieves a performance improvement only for certain message sizes.
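For context, the sketch below outlines the classic binomial-tree broadcast built from MPI point-to-point calls, i.e. the fixed-topology baseline that Open MPI's default algorithm is based on and that the network states-aware tree replaces at the inter-node level. It is a minimal sketch; the function name binomial_bcast and the virtual-rank layout are illustrative assumptions, not the authors' implementation.

#include <mpi.h>

/* Illustrative binomial-tree broadcast over MPI point-to-point calls.
 * Ranks are shifted so the root acts as virtual rank 0; each process
 * first receives from its parent, then forwards to its children at
 * decreasing bit positions. This is the fixed-topology baseline, not
 * the paper's network states-aware construction. */
static int binomial_bcast(void *buf, int count, MPI_Datatype dtype,
                          int root, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int vrank = (rank - root + size) % size;  /* virtual rank: root -> 0 */
    int mask = 1;

    /* Receive phase: the lowest set bit of vrank identifies the parent. */
    while (mask < size) {
        if (vrank & mask) {
            int parent = ((vrank - mask) + root) % size;
            MPI_Recv(buf, count, dtype, parent, 0, comm, MPI_STATUS_IGNORE);
            break;
        }
        mask <<= 1;
    }

    /* Send phase: forward the data to each child below the parent bit. */
    mask >>= 1;
    while (mask > 0) {
        if (vrank + mask < size) {
            int child = ((vrank + mask) + root) % size;
            MPI_Send(buf, count, dtype, child, 0, comm);
        }
        mask >>= 1;
    }
    return MPI_SUCCESS;
}

With 8 ranks and root 0, this yields the tree 0 -> {4, 2, 1}, 4 -> {6, 5}, 2 -> {3}, 6 -> {7}, completing in log2(P) communication rounds regardless of link quality; the paper keeps the binomial structure but decides the rank ordering from measured network states and the eager/rendezvous protocol in use.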




Updated: 2024-03-11