SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation
arXiv - CS - Hardware Architecture, Pub Date: 2024-03-25, DOI: arxiv-2403.16863
Guoliang He, Eiko Yoneki

Large language models (LLMs) have become a significant workload since their emergence. However, they are also computationally expensive, as they have billions of parameters and are trained on massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compiler-generated ones, so that hardware resources are utilized as fully as possible. In this work, we explore the possibility of GPU native instruction optimization to push CUDA kernels further toward peak performance. In contrast to prior work, we adopt an automatic optimization approach: we define a search space of possible GPU native instruction schedules and then apply stochastic search to perform the optimization. Experiments show that SIP can further improve CUDA kernel throughput by automatically discovering better GPU native instruction schedules, and the optimized schedules were validated with 10 million test samples.
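The abstract's core loop, stochastic search over a space of instruction schedules, can be illustrated with a minimal sketch. This is not the authors' implementation: the `toy_cost` function, the opcode strings, and the adjacent-swap perturbation are all hypothetical stand-ins (a real system would mutate GPU native (SASS) instructions subject to data dependencies and benchmark the resulting kernel on hardware).

```python
import random

def perturb(schedule):
    """One hypothetical move in the search space: swap two adjacent
    instructions (a real SASS reorderer must preserve data dependencies)."""
    s = list(schedule)
    i = random.randrange(len(s) - 1)
    s[i], s[i + 1] = s[i + 1], s[i]
    return s

def stochastic_search(schedule, cost, steps=1000):
    """Greedy stochastic search: keep a perturbation only if it lowers
    cost. Here `cost` stands in for benchmarking the rewritten kernel."""
    best, best_cost = list(schedule), cost(schedule)
    for _ in range(steps):
        candidate = perturb(best)
        c = cost(candidate)
        if c < best_cost:
            best, best_cost = candidate, c
    return best, best_cost

def toy_cost(s):
    """Toy objective: issue memory loads (LDG) early so their latency
    can overlap with compute; lower position indices are better."""
    return sum(i for i, op in enumerate(s) if op.startswith("LDG"))

sched = ["FFMA", "LDG.0", "FFMA", "LDG.1", "FFMA"]
best, best_cost = stochastic_search(sched, toy_cost, steps=500)
```

In the paper's setting the cost function is measured kernel throughput and the optimized schedules are checked for correctness against test samples; the sketch only shows the accept/reject search structure.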

Updated: 2024-03-26