Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL
arXiv - CS - Hardware Architecture Pub Date : 2024-03-27 , DOI: arxiv-2403.18374
Marius Meyer, Tobias Kenter, Lucian Petrica, Kenneth O'Brien, Michaela Blott, Christian Plessl

Most FPGA boards in the HPC domain are well suited for parallel scaling because they directly integrate versatile, high-throughput network ports. However, using these network capabilities is often challenging and error-prone because the whole network stack and the communication patterns have to be implemented and managed on the FPGAs. This approach also involves a conceptual trade-off between the performance potential of improved communication and the resource cost of the communication infrastructure, since the resources it occupies on the FPGAs could otherwise be used for computation. In this work, we investigate this trade-off. First, we use synthetic benchmarks to evaluate the configuration options of the communication framework ACCL and their impact on communication latency and throughput. We then use our findings to implement a shallow water simulation whose scalability depends heavily on low-latency communication. With a suitable configuration of ACCL, the simulation shows good scaling behavior up to all 48 FPGAs installed in the system. Overall, the results show that both the availability of inter-FPGA communication frameworks and the configurability of the framework and network stack are crucial for achieving the best application performance with low-latency communication.

Updated: 2024-03-28