Characterizing Multi-Chip GPU Data Sharing
ACM Transactions on Architecture and Code Optimization (IF 1.6). Pub Date: 2023-10-20. DOI: 10.1145/3629521
Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout

Multi-chip GPU systems are critical to scaling performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs, though, is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses that can congest the inter-chip links and degrade overall system performance. This paper characterizes the shared data set in multi-chip GPUs in terms of (1) truly versus falsely shared data, (2) how the shared data set scales with input size, (3) along which dimensions the shared data set scales, and (4) how sensitive the shared data set is to the input's characteristics, i.e., node degree and connectivity in graph workloads. We observe significant variety in scaling behavior across workloads: some workloads feature a shared data set that scales linearly with input size, while others feature sublinear scaling (following a \(\sqrt{2}\) or \(\sqrt[3]{2}\) relationship). We further demonstrate how the shared data set affects the optimum last-level cache organization (memory-side versus SM-side) in multi-chip GPUs, as well as the optimum memory page allocation and thread scheduling policies. Sensitivity analyses demonstrate that these insights hold across a broad design space.
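
Two points in the abstract benefit from a concrete reading. First, false sharing arises when the granularity of data placement is coarser than the granularity of access, so multiple chips touch the same unit of data while accessing disjoint parts of it. Second, the sublinear scaling factors can be read as growth rates: if doubling the input multiplies the shared data set by \(\sqrt{2}\) (or \(\sqrt[3]{2}\)), the shared footprint grows roughly as \(n^{1/2}\) (or \(n^{1/3}\)) in the input size \(n\). The sketch below illustrates one plausible way to classify blocks from a per-chip access trace; the 128-byte block size, 4-byte word granularity, and the classify helper are illustrative assumptions for this example, not the paper's actual granularity or tooling.

```python
# Illustrative sketch only (not the paper's tool): classify memory blocks in a
# per-chip access trace as private, truly shared, or falsely shared. A block is
# shared when more than one chip touches it; sharing is *true* when some word
# of the block is touched by multiple chips, and *false* when the chips access
# only disjoint words of the same block. Sizes below are assumptions.
from collections import defaultdict

BLOCK_BYTES = 128   # assumed sharing granularity (e.g., a cache line)
WORD_BYTES = 4      # assumed access granularity

def classify(trace):
    """trace: iterable of (chip_id, byte_address) pairs.
    Returns (private, truly_shared, falsely_shared) block-ID sets."""
    chips_per_word = defaultdict(lambda: defaultdict(set))  # block -> word -> chips
    for chip, addr in trace:
        block, offset = divmod(addr, BLOCK_BYTES)
        chips_per_word[block][offset // WORD_BYTES].add(chip)
    private, truly, falsely = set(), set(), set()
    for block, per_word in chips_per_word.items():
        chips = set().union(*per_word.values())
        if len(chips) < 2:
            private.add(block)
        elif any(len(c) > 1 for c in per_word.values()):
            truly.add(block)    # at least one word touched by multiple chips
        else:
            falsely.add(block)  # chips touch only disjoint words
    return private, truly, falsely

# Chips 0 and 1 touch disjoint words of block 0 (false sharing) but the same
# word of block 1 (true sharing); block 2 is private to chip 0.
trace = [(0, 0), (1, 4), (0, 128), (1, 128), (0, 256)]
print(classify(trace))   # ({2}, {1}, {0})
```

Measuring the sizes of the truly and falsely shared sets at several input sizes would then suffice to estimate the scaling exponent: the \(\sqrt{2}\)-per-doubling behavior, for instance, corresponds to a slope of roughly 0.5 on a log-log plot of shared-set size versus input size.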




Updated: 2023-10-21