Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching
arXiv - CS - Operating Systems | Pub Date: 2023-12-23 | DOI: arxiv-2401.06362
Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna

Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overhead of these models results in high inference latency, limiting their feasibility as practical prefetchers. To close this gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled yet highly accurate attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher composed of a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates inference of the large model by 170x and of the distilled model by 9.4x. DART has latency and storage costs comparable to the state-of-the-art rule-based prefetcher BO, yet surpasses it by 6.1% in IPC improvement, resulting in a 37.6% speed-up. DART also outperforms the state-of-the-art NN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in IPC improvement, primarily due to its low prefetching latency.

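The abstract's central idea is converting a model's matrix multiplications into table lookups. Below is a minimal, illustrative sketch of one standard way to do this (product-quantization-style codebooks learned with k-means), not the authors' DART implementation; function names such as `build_tables` and `lookup_matmul`, the codebook sizes, and the k-means training step are all assumptions for illustration.

```python
# Sketch: approximate y = x @ W with per-subspace codebooks and precomputed
# lookup tables, so inference needs only nearest-codeword search, table reads,
# and additions instead of a dense matrix multiplication.
import numpy as np

def build_tables(W, X_train, n_subspaces=4, n_codes=16, iters=20):
    """For each input subspace, learn a small codebook (plain k-means) and
    precompute the dot products of every codeword with the matching block of W."""
    d_in, d_out = W.shape
    sub = d_in // n_subspaces
    rng = np.random.default_rng(0)
    codebooks, tables = [], []
    for s in range(n_subspaces):
        Xs = X_train[:, s * sub:(s + 1) * sub]                 # training subvectors
        centers = Xs[rng.choice(len(Xs), n_codes, replace=False)].copy()
        for _ in range(iters):                                 # naive k-means
            dist = ((Xs[:, None, :] - centers[None]) ** 2).sum(-1)
            assign = dist.argmin(1)
            for k in range(n_codes):
                pts = Xs[assign == k]
                if len(pts):
                    centers[k] = pts.mean(0)
        codebooks.append(centers)
        # tables[s][k, j] = centers[k] . W[s-th input block, column j]
        tables.append(centers @ W[s * sub:(s + 1) * sub, :])
    return codebooks, tables

def lookup_matmul(x, codebooks, tables):
    """Approximate x @ W using one codeword lookup per subspace plus additions."""
    sub = len(x) // len(codebooks)
    y = np.zeros(tables[0].shape[1])
    for s, (cb, tb) in enumerate(zip(codebooks, tables)):
        xs = x[s * sub:(s + 1) * sub]
        code = ((cb - xs) ** 2).sum(1).argmin()                # nearest codeword
        y += tb[code]                                          # one table row
    return y

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.normal(size=(32, 8))
    X_train = rng.normal(size=(1000, 32))
    cb, tb = build_tables(W, X_train)
    x = rng.normal(size=32)
    print("exact      :", (x @ W)[:4])
    print("table-based:", lookup_matmul(x, cb, tb)[:4])
```

Once the tables are built offline, the online cost per layer is a handful of nearest-codeword searches and table-row additions, which is the kind of arithmetic reduction the abstract reports; the paper's hierarchy of tables and its distillation step go well beyond this toy example.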
Updated: 2023-12-23