A Quantitative Study of Locality in GPU Caches for Memory-Divergent Workloads
International Journal of Parallel Programming (IF 1.5), Pub Date: 2022-04-01, DOI: 10.1007/s10766-022-00729-2
Sohan Lal, Ben Juurlink, Bogaraju Sharatchandra Varma

Abstract
GPUs are capable of delivering peak performance in TFLOPs; however, peak performance is often difficult to achieve due to several performance bottlenecks. Memory divergence is one such bottleneck: it makes locality harder to exploit, causes cache thrashing and a high miss rate, and thereby impedes GPU performance. As data locality is crucial for performance, there have been several efforts to exploit data locality in GPUs. However, there is a lack of quantitative analysis of data locality that could pave the way for optimizations. In this paper, we quantitatively study data locality and its limits in GPUs at different granularities. We show that, in contrast to previous studies, there is significantly higher inter-warp locality at the L1 data cache for memory-divergent workloads. We further show that about 50% of the cache capacity and other scarce resources, such as NoC bandwidth, are wasted due to data over-fetch caused by memory divergence. While the low spatial utilization of cache lines justifies a sectored cache design that fetches only the sectors of a cache line needed by a request, our limit study reveals lost spatial locality, for which additional memory requests are needed to fetch the other sectors of the same cache line. This lost spatial locality presents opportunities for further optimizing the cache design.
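To make concrete what the abstract means by a memory-divergent access pattern, below is a minimal CUDA sketch; it is not taken from the paper, and the kernel name, stride constant, and array sizes are illustrative assumptions. Neighbouring threads of a warp load through an index array, so a single 32-thread warp may touch many distinct 128-byte cache lines while using only a few bytes of each, which is the kind of low cache-line utilization and data over-fetch the paper quantifies.

```cuda
// Minimal sketch (not from the paper): a memory-divergent gather kernel.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gather(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Indirect (divergent) load: adjacent threads may read from
        // unrelated cache lines, so one warp can require up to 32
        // different 128-byte lines instead of 1 for a coalesced access,
        // fetching far more data than it actually uses.
        out[i] = in[idx[i]];
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    int *idx;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    cudaMallocManaged(&idx, n * sizeof(int));
    // Scattered indices (illustrative): a coalesced pattern would be idx[i] = i.
    for (int i = 0; i < n; ++i) { in[i] = 1.0f; idx[i] = (i * 1031) % n; }
    gather<<<(n + 255) / 256, 256>>>(in, idx, out, n);
    cudaDeviceSynchronize();
    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out); cudaFree(idx);
    return 0;
}
```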

Updated: 2022-04-01