Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access
arXiv - CS - Hardware Architecture. Pub Date: 2024-04-17, DOI: arXiv:2404.11044
Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang

The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than those of local DRAM. For applications to achieve acceptable performance on far memory, a high degree of memory-level parallelism (MLP) is needed to tolerate the long access latency. While modern out-of-order processors can exploit a certain degree of MLP, they are constrained by resource limitations and hardware complexity. The key obstacle is the synchronous memory access semantics of traditional load/store instructions, which occupy critical hardware resources for the full duration of the access; longer far memory latencies exacerbate this limitation. This paper proposes a set of Asynchronous Memory Access Instructions (AMI) and a supporting functional unit, the Asynchronous Memory Access Unit (AMU), inside a contemporary out-of-order core. AMI separates memory request issuing from response handling to reduce resource occupation. Additionally, the AMU architecture supports up to several hundred asynchronous memory requests by re-purposing a portion of the L2 cache as scratchpad memory (SPM) to provide sufficient temporary storage. Together with a coroutine-based programming framework, this scheme achieves significantly higher MLP for hiding far memory latencies. Evaluation with a cycle-accurate simulation shows AMI achieves a 2.42x speedup on average for memory-bound benchmarks with 1 us of additional far memory latency. The AMU sustains over 130 outstanding requests, yielding a 26.86x speedup for GUPS (random access) at 5 us latency. These results demonstrate how the proposed techniques mitigate far memory performance impacts through explicit MLP expression and latency adaptation.
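The core idea, decoupling request issue from response handling so that hundreds of requests can be in flight at once, can be illustrated with a simple latency-overlap model. This is a sketch under an assumed cost model (1-cycle issue, batched overlap), not the paper's actual AMI/AMU interface:

```python
# Back-of-the-envelope model of why split-phase (asynchronous) memory
# access raises memory-level parallelism (MLP). The function names and
# cost model are illustrative assumptions, not the paper's AMI/AMU ISA.
import math

def sync_cycles(num_requests: int, latency: int) -> int:
    """Blocking loads: each request occupies core resources for the
    full round-trip latency, so requests effectively serialize."""
    return num_requests * latency

def async_cycles(num_requests: int, latency: int, max_outstanding: int) -> int:
    """Split-phase loads: issuing costs 1 cycle per request, and up to
    `max_outstanding` in-flight requests overlap one latency window."""
    batches = math.ceil(num_requests / max_outstanding)
    return num_requests + batches * latency

# Example: 1000 random accesses, 1000-cycle far-memory latency
# (~1 us at 1 GHz), and 130 outstanding requests as in the evaluation.
slow = sync_cycles(1000, 1000)        # 1000 * 1000 = 1,000,000 cycles
fast = async_cycles(1000, 1000, 130)  # 1000 issue + 8 * 1000 = 9,000 cycles
speedup = slow / fast                 # latency hiding via overlap
```

Under this toy model the asynchronous scheme is over 100x faster, which conveys the intuition behind the paper's GUPS results: when accesses are independent, throughput is bounded by outstanding-request capacity rather than per-access latency.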

Updated: 2024-04-18