Smart-DNN+: A Memory-efficient Neural Networks Compression Framework for the Model Inference
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2023-10-26, DOI: 10.1145/3617688
Donglei Wu 1, Weihao Yang 1, Xiangyu Zou 1, Wen Xia 2, Shiyi Li 1, Zhenbo Hu 1, Weizhe Zhang 3, Binxing Fang 3

Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a DNN typically requires a memory footprint of hundreds of megabytes, making it challenging to deploy on resource-constrained platforms such as mobile devices and IoT. Although mainstream DNN compression techniques such as pruning, distillation, and quantization can reduce the memory overhead of model parameters during DNN inference, they suffer from three limitations: (i) a low model compression ratio for lightweight DNN structures with little redundancy, (ii) potential degradation of model inference accuracy, and (iii) an inadequate memory compression ratio caused by ignoring the layerwise property of DNN inference. To address these issues, we propose a lightweight, memory-efficient DNN inference framework called Smart-DNN+, which significantly reduces the memory cost of DNN inference without degrading model quality. Specifically, ① Smart-DNN+ applies a layerwise binary quantizer with a remapping mechanism to greatly reduce the model size by quantizing the typical 32-bit floating-point DNN weights to 1-bit signs, layer by layer. To maintain model quality, ② Smart-DNN+ employs a bucket encoder that keeps the quantization error in compressed form by encoding multiple similar floating-point residuals into the same integer bucket ID. When running the compressed DNN on the user's device, ③ Smart-DNN+ uses a partial decompression strategy to greatly reduce the required memory overhead: it first loads the compressed DNN into memory and then dynamically decompresses the data required for model inference layer by layer.
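
The compression steps ① and ② can be pictured with a short sketch. The following NumPy snippet is a minimal illustration, not the authors' implementation: the per-layer mean-magnitude remapping factor and the fixed bucket width are assumptions made here for clarity.

```python
# Minimal sketch of layerwise 1-bit quantization with a remapping factor (①)
# plus bucket encoding of the floating-point residuals (②).
# NOT the paper's implementation; scale choice and bucket width are assumptions.
import numpy as np

def binary_quantize_layer(weights: np.ndarray, bucket_width: float = 1e-2):
    """Quantize one layer's 32-bit float weights to 1-bit signs plus
    bucket-encoded residuals (integer bucket IDs)."""
    signs = np.signbit(weights)                 # 1 bit per weight
    scale = np.mean(np.abs(weights))            # assumed remapping factor
    reconstructed = np.where(signs, -scale, scale)
    residual = weights - reconstructed          # floating-point quantization error
    # Similar residuals map to the same integer bucket ID, which keeps the
    # error stream small and highly compressible.
    bucket_ids = np.round(residual / bucket_width).astype(np.int16)
    return signs, scale, bucket_ids

def dequantize_layer(signs, scale, bucket_ids, bucket_width: float = 1e-2):
    """Recover an approximation of the original layer weights."""
    base = np.where(signs, -scale, scale)
    return (base + bucket_ids.astype(np.float32) * bucket_width).astype(np.float32)

# Example: quantize a random "layer" and check the reconstruction error.
w = np.random.randn(1024, 1024).astype(np.float32) * 0.05
s, k, b = binary_quantize_layer(w)
w_hat = dequantize_layer(s, k, b)
print("max abs error:", np.max(np.abs(w - w_hat)))  # bounded by bucket_width / 2
```

Encoding the residuals as small integer IDs rather than raw floats is what keeps the error information compressible while bounding the reconstruction error by half the bucket width.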

Experimental results on popular DNNs and datasets demonstrate that Smart-DNN+ achieves 0.17%–0.92% lower memory costs at lower runtime overheads than the state of the art, without degrading inference accuracy. Moreover, Smart-DNN+ can accelerate inference by up to 2.04× relative to the conventional DNN inference workflow.
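
The memory benefit of step ③ comes from never holding more than one decompressed layer at a time. The sketch below illustrates that workflow under the same assumptions as the snippet above (it reuses the hypothetical dequantize_layer helper and toy dense layers); it is not the paper's actual inference engine.

```python
# Hedged sketch of layer-by-layer partial decompression during inference (③):
# the compressed model stays resident, and each layer's weights are
# decompressed only for the duration of its forward pass.
import numpy as np

def infer(compressed_layers, x, bucket_width=1e-2):
    """compressed_layers: list of (signs, scale, bucket_ids) tuples, one per layer."""
    activation = x
    for signs, scale, bucket_ids in compressed_layers:
        # Decompress only the current layer: peak memory is roughly the
        # compressed model plus one layer's uncompressed weights.
        w = dequantize_layer(signs, scale, bucket_ids, bucket_width)
        activation = np.maximum(activation @ w, 0.0)  # toy dense layer + ReLU
        del w                                         # release before the next layer
    return activation

# Usage with the helpers from the previous sketch:
layers = [binary_quantize_layer(np.random.randn(64, 64).astype(np.float32) * 0.05)
          for _ in range(3)]
y = infer(layers, np.random.randn(1, 64).astype(np.float32))
```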




Updated: 2023-10-28