An Instruction Set Architecture for Machine Learning
ACM Transactions on Computer Systems (IF 1.5), Pub Date: 2019-08-13, DOI: 10.1145/3331469
Yunji Chen, Huiying Lan, Zidong Du, Shaoli Liu, Jinhua Tao, Dong Han, Tao Luo, Qi Guo, Ling Li, Yuan Xie, Tianshi Chen

Machine Learning (ML) is a family of models that learn from data to improve performance on a given task. ML techniques, especially the recently revived neural networks (deep neural networks), have proven effective for a broad range of applications. Conventionally, ML techniques are executed on general-purpose processors (such as CPUs and GPGPUs), which are usually not energy efficient because they invest excessive hardware resources in flexibly supporting various workloads. Consequently, application-specific hardware accelerators have recently been proposed to improve energy efficiency. However, such accelerators were designed for small sets of ML techniques sharing similar computational patterns, and they adopt complex, informative instructions (control signals) that directly correspond to the high-level functional blocks of an ML technique (such as the layers of a neural network), or even to an ML technique as a whole. Although straightforward and easy to implement for a limited set of similar ML techniques, the lack of agility in such instruction sets prevents these accelerator designs from supporting a variety of different ML techniques with sufficient flexibility and efficiency. In this article, based on a comprehensive analysis of existing NN techniques, we first propose a novel domain-specific Instruction Set Architecture (ISA) for NN accelerators, called Cambricon, which is a load-store architecture integrating scalar, vector, matrix, logical, data-transfer, and control instructions. We then extend the application scope of Cambricon from NN to broader ML techniques. We also propose an assembly language, an assembler, and a runtime to support programming with Cambricon, especially for large-scale ML problems. Our evaluation over a total of 16 representative yet distinct ML techniques demonstrates that Cambricon exhibits strong descriptive capacity over a broad range of ML techniques and provides higher code density than general-purpose ISAs such as x86, MIPS, and GPGPU.
Compared to the latest state-of-the-art NN accelerator design DaDianNao [7] (which can accommodate only three types of NN techniques), our Cambricon-based accelerator prototype, implemented in TSMC 65nm technology, incurs only negligible latency/power/area overheads while versatilely covering 10 different NN benchmarks and 7 other ML benchmarks. Compared to the recent prevalent ML accelerator PuDianNao, our Cambricon-based accelerator supports all of these ML techniques as well as the 10 NNs, with only an approximately 5.1% performance loss.
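The abstract names the six instruction classes Cambricon integrates (scalar, vector, matrix, logical, data-transfer, and control). As a rough illustration of how such a load-store, register-based instruction stream could express one sigmoid MLP layer, the sketch below interprets a hypothetical Cambricon-flavored program. All opcode mnemonics and operand conventions here are assumptions for demonstration only, not the paper's actual encoding.

```python
# Illustrative sketch only: opcode names and operand semantics are
# assumptions in the spirit of a scalar/vector/matrix load-store ISA,
# not Cambricon's actual instruction encoding.
import numpy as np

def execute(program, regs):
    """Interpret a tiny Cambricon-style instruction stream over a register file."""
    for op, dst, *src in program:
        if op == "MMV":        # matrix-multiply-vector: dst = M @ v
            regs[dst] = regs[src[0]] @ regs[src[1]]
        elif op == "VAV":      # vector-add-vector: dst = a + b
            regs[dst] = regs[src[0]] + regs[src[1]]
        elif op == "VMS":      # vector-multiply-scalar: dst = a * s
            regs[dst] = regs[src[0]] * src[1]
        elif op == "VEXP":     # element-wise exponential: dst = exp(a)
            regs[dst] = np.exp(regs[src[0]])
        elif op == "VAS":      # vector-add-scalar: dst = a + s
            regs[dst] = regs[src[0]] + src[1]
        elif op == "SDV":      # scalar-divide-vector: dst = s / a
            regs[dst] = src[1] / regs[src[0]]
        else:
            raise ValueError(f"unknown opcode {op}")
    return regs

# One fully connected layer with sigmoid: y = 1 / (1 + exp(-(W @ x + b)))
layer = [
    ("MMV",  "t", "W", "x"),
    ("VAV",  "t", "t", "b"),
    ("VMS",  "t", "t", -1.0),
    ("VEXP", "t", "t"),
    ("VAS",  "t", "t", 1.0),
    ("SDV",  "y", "t", 1.0),
]

rng = np.random.default_rng(0)
regs = {"W": rng.standard_normal((4, 3)),
        "x": rng.standard_normal(3),
        "b": rng.standard_normal(4)}
out = execute(layer, regs)["y"]
ref = 1.0 / (1.0 + np.exp(-(regs["W"] @ regs["x"] + regs["b"])))
assert np.allclose(out, ref)
```

The point of the sketch is the granularity argument from the abstract: each instruction is a small, reusable primitive rather than a monolithic "layer" or "network" command, so the same six opcodes can be recombined to describe many different ML techniques.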
