PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2023-12-14, DOI: 10.1145/3624569
Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong
Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial for achieving optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded to GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems pose two significant optimization challenges: data transfer bottlenecks, and problem splitting and scheduling across multiple workers (GPUs) with distinct memories. We demonstrate that current multi-GPU BLAS methods for tackling these challenges target very specific problem and data characteristics, resulting in serious performance degradation for any slightly deviating workload. Additionally, an even more critical decision is omitted because it cannot be addressed with current scheduler-based approaches: determining which devices should be used for a given routine invocation. To address these issues, we propose a model-based approach: using performance estimation to provide problem-specific autotuning at runtime. We integrate this autotuning into an end-to-end BLAS framework named PARALiA. This framework couples autotuning with an optimized task scheduler, leading to near-optimal data distribution and performance-aware resource utilization. We evaluate PARALiA on an HPC testbed with 8 NVIDIA V100 GPUs, improving the average performance of GEMM by 1.7× and energy efficiency by 2.5× over the state-of-the-art on a large and diverse dataset, and demonstrating the adaptability of our performance-aware approach to future heterogeneous systems.
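The abstract's core idea — using performance estimation to decide which devices to use for a routine invocation — can be illustrated with a minimal sketch. The model below is a hypothetical simplification, not the PARALiA implementation: it predicts each device's time for its share of a GEMM as transfer time plus compute time (from an assumed interconnect bandwidth `bw` and compute peak `peak`), splits work proportionally to throughput, and picks the device subset with the smallest predicted makespan.

```python
# Hypothetical model-based device selection for GEMM C = A*B (M x K by K x N).
# All device parameters and the cost model are illustrative assumptions.
from itertools import combinations

def predict_time(flops, bytes_moved, dev):
    """Simple linear model: data-transfer time plus compute time on one device."""
    return bytes_moved / dev["bw"] + flops / dev["peak"]

def select_devices(M, N, K, devices, elem_size=8):
    """Return (predicted makespan, device subset) minimizing predicted time."""
    flops = 2.0 * M * N * K                       # standard GEMM flop count
    bytes_moved = elem_size * (M*K + K*N + M*N)   # A and B in, C out
    best = None
    for r in range(1, len(devices) + 1):
        for subset in combinations(devices, r):
            # Split work proportionally to compute throughput; each device
            # also pays the transfer cost for its share of the data.
            total_peak = sum(d["peak"] for d in subset)
            makespan = max(
                predict_time(flops * d["peak"] / total_peak,
                             bytes_moved * d["peak"] / total_peak, d)
                for d in subset)
            if best is None or makespan < best[0]:
                best = (makespan, subset)
    return best

# Illustrative device parameters (flops/s and bytes/s), not measured values.
devices = [
    {"name": "gpu0", "peak": 7.0e12, "bw": 12e9},
    {"name": "gpu1", "peak": 7.0e12, "bw": 12e9},
    {"name": "gpu2", "peak": 3.5e12, "bw": 6e9},   # slower GPU on a slower link
]

t, chosen = select_devices(4096, 4096, 4096, devices)
print([d["name"] for d in chosen], f"predicted {t*1e3:.1f} ms")
```

The key point the sketch captures is that the best device set is problem-dependent: for small problems, transfer overheads can make using fewer (or better-connected) devices faster, which is exactly the decision the paper argues scheduler-only approaches cannot make.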




Updated: 2023-12-14