Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory
arXiv - CS - Hardware Architecture. Pub Date: 2024-03-14. DOI: arXiv:2403.09358
Jeongmin Hong, Sungjun Cho, Geonwoo Park, Wonhyuk Yang, Young-Ho Gong, Gwangsun Kim

We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM can for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache must be carefully designed to address the latency and BW limitations of the SCM while minimizing cost overhead and accounting for the GPU's characteristics. Because the massive number of GPU threads can thrash the DRAM cache, we first propose an SCM-aware DRAM cache bypass policy for GPUs that considers the multi-dimensional characteristics of GPU memory accesses with SCM to bypass DRAM for data with low performance utility. In addition, to reduce DRAM cache probes and increase effective DRAM BW at minimal cost, we propose a Configurable Tag Cache (CTC) that repurposes part of the L2 cache to hold DRAM cacheline tags; the L2 capacity used for the CTC can be adjusted by users for adaptability. Furthermore, to minimize DRAM cache probe traffic from CTC misses, our Aggregated Metadata-In-Last-column (AMIL) DRAM cache organization co-locates all DRAM cacheline tags in a single column within a row. AMIL also retains full ECC protection, unlike the Tag-And-Data (TAD) organization of prior DRAM caches. Additionally, we propose SCM throttling to curtail power and exploit SCM's SLC/MLC modes to adapt to the workload's memory footprint. While our techniques apply to different DRAM and SCM devices, we focus on a Heterogeneous Memory Stack (HMS) organization that stacks SCM dies on top of DRAM dies for high performance. Compared to HBM, HMS improves performance by up to 12.5x (2.9x overall) and reduces energy by up to 89.3% (48.1% overall). Compared to prior works, we reduce DRAM cache probe and SCM write traffic by 91-93% and 57-75%, respectively.
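
To make the CTC-then-AMIL tag-lookup flow concrete, here is a minimal Python sketch. The cacheline size, row geometry, class names, and naive eviction are all hypothetical, not the authors' implementation; the abstract only specifies that the CTC is a user-adjustable slice of L2 holding DRAM-cacheline tags, and that AMIL packs all of a row's tags into that row's last column so one CTC miss can be resolved with a single column access.

    # Minimal sketch (hypothetical parameters) of a CTC-then-AMIL tag lookup.

    CACHELINE = 128        # bytes per DRAM cacheline (assumed)
    LINES_PER_ROW = 31     # data columns per DRAM row; the last column is
                           # assumed to hold the aggregated tags (AMIL)

    class ConfigurableTagCache:
        """A slice of L2 repurposed to hold DRAM-cacheline tags (user-set capacity)."""
        def __init__(self, capacity_rows):
            self.capacity_rows = capacity_rows
            self.entries = {}                    # row index -> list of tags

        def lookup(self, row):
            return self.entries.get(row)         # None means CTC miss

        def fill(self, row, tags):
            if len(self.entries) >= self.capacity_rows:
                self.entries.pop(next(iter(self.entries)))  # naive FIFO eviction
            self.entries[row] = tags

    def lookup_tag(addr, ctc, amil_tag_columns):
        """Return the DRAM-cache tag covering addr, probing DRAM only on CTC miss."""
        row = addr // (CACHELINE * LINES_PER_ROW)
        col = (addr // CACHELINE) % LINES_PER_ROW
        tags = ctc.lookup(row)
        if tags is None:
            # CTC miss: one DRAM access to the row's last column returns the
            # tags of *all* cachelines in the row, which are then cached in L2.
            tags = amil_tag_columns[row]
            ctc.fill(row, tags)
        return tags[col]  # compared against addr's tag to decide hit vs. SCM access

The design point this illustrates is that a CTC hit avoids the DRAM-cache tag probe entirely, while a CTC miss amortizes one probe across every cacheline in the row.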

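The SLC/MLC adaptation can be sketched in the same spirit. The only facts assumed are standard ones: an SLC-mode cell stores one bit (faster, lower capacity) while an MLC-mode cell stores two, so SLC mode halves the SCM capacity. The function name and threshold logic below are illustrative, not the paper's policy.

    def choose_scm_mode(footprint_bytes, mlc_capacity_bytes):
        """Pick SCM cell mode from the workload's memory footprint (illustrative).

        SLC stores 1 bit/cell and is faster; MLC stores 2 bits/cell, doubling
        capacity at higher latency. Prefer SLC whenever the footprint fits.
        """
        slc_capacity_bytes = mlc_capacity_bytes // 2
        return "SLC" if footprint_bytes <= slc_capacity_bytes else "MLC"

    # e.g., a 40 GiB footprint on a 64 GiB (MLC) stack must fall back to MLC,
    # since SLC mode would offer only 32 GiB:
    print(choose_scm_mode(40 << 30, 64 << 30))   # -> "MLC"
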
Updated: 2024-03-15