Novel adaptive quantization methodology for 8-bit floating-point DNN training
Design Automation for Embedded Systems (IF 1.4) Pub Date: 2024-02-16, DOI: 10.1007/s10617-024-09282-2
Mohammad Hassani Sadi , Chirag Sudarshan , Norbert Wehn

Training Deep Neural Networks (DNNs) incurs a high energy cost, and off-chip memory accesses account for a major portion of the overall energy consumption. The number of off-chip memory transactions can be reduced by quantizing the data words to a low bit-width (e.g., 8 bits). However, low-bit-width data formats suffer from a limited dynamic range, resulting in reduced accuracy. In this paper, a novel DNN training methodology with a quantized 8-bit Floating Point (FP8) data format is presented, which adapts to the required dynamic range on the fly. Our methodology relies on varying the bias value of the FP8 format to fit its dynamic range to the range required by the DNN parameters and input feature maps. The range fitting during training is performed adaptively by an online statistical-analysis hardware unit without stalling the compute units or their data accesses. Our approach is compatible with any DNN compute core without major modifications to the architecture. We propose to integrate the new FP8 quantization unit into the memory controller: the FP32 data from the compute core are converted to FP8 in the memory controller before being written to DRAM and converted back after being read from DRAM. Our results show that using the 8-bit data format instead of 32-bit reduces the DRAM access energy by 3.07×. The accuracy loss of the proposed 8-bit quantized training is ≈ 1% for various networks on image and natural language processing datasets.
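To make the bias-adaptation idea concrete, below is a minimal NumPy sketch of a bias-adaptive FP8 round trip, as the proposed memory-controller unit would perform around a DRAM write/read. The field split (1 sign / 4 exponent / 3 mantissa bits), the `choose_bias` heuristic, and all function names are assumptions for illustration; the abstract only states that the bias is varied to fit the tensor's dynamic range based on online statistics.

```python
import numpy as np

# Assumed FP8 field split: 1 sign, 4 exponent, 3 mantissa bits.
# The paper's exact format parameters are not given in the abstract.
EXP_BITS, MAN_BITS = 4, 3
EXP_LEVELS = 1 << EXP_BITS  # 16 exponent codes

def choose_bias(x: np.ndarray) -> int:
    """Stand-in for the online statistical analysis: pick the bias so the
    largest exponent code covers the tensor's maximum magnitude."""
    max_exp = int(np.floor(np.log2(np.max(np.abs(x)) + 1e-30)))
    return (EXP_LEVELS - 1) - max_exp

def quantize_fp8(x: np.ndarray, bias: int) -> np.ndarray:
    """Simulate the FP32 -> FP8 -> FP32 round trip through DRAM."""
    sign = np.sign(x)
    mag = np.abs(x)
    # Per-value exponent, clamped to the FP8 range implied by the bias.
    e = np.floor(np.log2(np.where(mag > 0, mag, 1.0)))
    e = np.clip(e, -bias, (EXP_LEVELS - 1) - bias)
    scale = 2.0 ** e
    # Round the significand to MAN_BITS fractional bits.
    man = np.round(mag / scale * (1 << MAN_BITS)) / (1 << MAN_BITS)
    return sign * man * scale

# Usage: a small-magnitude tensor gets a large bias, shifting the FP8
# dynamic range down to where the values actually live.
x = (np.random.randn(1024) * 0.01).astype(np.float32)
bias = choose_bias(x)
xq = quantize_fp8(x, bias)
print("bias:", bias, "max abs error:", np.max(np.abs(x - xq)))
```

The key design point the sketch illustrates: the bias only shifts which window of exponents the 8-bit format covers, so adapting it per tensor recovers dynamic range without spending extra bits.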




Updated: 2024-02-16