Highly Concurrent Latency-tolerant Register Files for GPUs, ACM Transactions on Computer Systems

Highly Concurrent Latency-tolerant Register Files for GPUs
ACM Transactions on Computer Systems (IF 1.5), Pub Date: 2021-01-04, DOI: 10.1145/3419973
Mohammad Sadrosadati ₁ , Amirhossein Mirhosseini ₂ , Ali Hajiabadi ₃ , Seyed Borna Ehsani ₃ , Hajar Falahati ₁ , Hamid Sarbazi-Azad ₄ , Mario Drumond ₅ , Babak Falsafi ₅ , Rachata Ausavarungnirun ₆ , Onur Mutlu ₇

Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file that reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache. In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts while prefetching the registers could greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
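To make the bank-conflict observation concrete, here is a minimal Python sketch of why renumbering can shorten a working-set prefetch. It is an illustration only and not the article's algorithm: the bank mapping (register number modulo the bank count), the one-register-per-bank-per-cycle cost model, the bank count of 4, and the greedy heuristic in the hypothetical `renumber` function are all assumptions made for this example.

```python
from collections import Counter

NUM_BANKS = 4  # assumed bank count, chosen only for illustration


def bank_of(reg, num_banks=NUM_BANKS):
    """Assumed mapping: register r is stored in bank r mod num_banks."""
    return reg % num_banks


def prefetch_cycles(working_set, num_banks=NUM_BANKS):
    """Cycles needed to prefetch a working set, assuming each bank can
    deliver one register per cycle (the most loaded bank sets the cost)."""
    if not working_set:
        return 0
    per_bank = Counter(bank_of(r, num_banks) for r in working_set)
    return max(per_bank.values())


def renumber(intervals, num_banks=NUM_BANKS):
    """Greedy renumbering heuristic (hypothetical, not from the article).

    Walks the intervals in order and gives every register that has no new
    number yet the smallest free number whose bank is currently the least
    loaded within that interval. Returns a dict old_number -> new_number.
    """
    all_regs = sorted({r for ws in intervals for r in ws})
    free = list(range(len(all_regs)))  # new register numbers still unused
    mapping = {}
    for ws in intervals:
        # Bank load contributed by registers already renamed earlier.
        load = Counter(bank_of(mapping[r], num_banks) for r in ws if r in mapping)
        for r in sorted(ws):
            if r in mapping:
                continue
            new = min(free, key=lambda n: (load[bank_of(n, num_banks)], n))
            free.remove(new)
            mapping[r] = new
            load[bank_of(new, num_banks)] += 1
    return mapping
```

Under these assumptions, a working set such as `[0, 4, 8, 12]` maps entirely to bank 0 and takes four cycles to prefetch, whereas after `renumber([[0, 4, 8, 12], [1, 5, 9, 13]])` each of the two working sets is spread over all four banks and takes one cycle.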

Updated: 2021-01-04
