FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler
ACM Transactions on Architecture and Code Optimization (IF 1.6), Pub Date: 2023-12-14, DOI: 10.1145/3629523
Ziaul Choudhury, Anish Gulati, Suresh Purini

The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSAs) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving into an alternative to custom ASICs for designing DSAs because of their low power consumption and high degree of parallelism. DSA design on FPGAs requires careful calibration of the FPGA compute and memory resources to achieve optimal throughput.

Hardware Description Languages (HDLs) like Verilog have traditionally been used to design FPGA hardware. HDLs are not geared towards any domain, and the user has to put in much effort to describe the hardware at the register transfer level. Domain Specific Languages (DSLs) and compilers have recently been used to weave together handwritten HDL templates targeting a specific domain. Recent efforts have designed DSAs with image-processing DSLs targeting FPGAs. Image computations in the DSL are lowered to pre-existing templates or lower-level languages like HLS-C. This approach requires expensive FPGA re-flashing for every new workload. In contrast to this fixed-function hardware approach, overlays are gaining traction. An overlay is a DSA resembling a processor: it is synthesized and flashed on the FPGA once but is flexible enough to process a broad class of computations through soft reconfiguration. Less work has been reported in the context of image-processing overlays. Image processing algorithms vary in size and shape, ranging from simple blurring operations to complex pyramid systems. The primary challenge in designing an image-processing overlay is maintaining flexibility in mapping different algorithms.

This paper proposes a DSL-based overlay accelerator called FlowPix for image processing applications. The DSL programs are expressed as pipelines, with each stage representing a computational step in the overall algorithm. We implement 15 image-processing benchmarks using FlowPix on a Virtex-7-690t FPGA. The benchmarks range from simple blur operations to complex pipelines like Lucas-Kanade optical flow. We compare FlowPix against existing DSL-to-FPGA frameworks like Hetero-Halide and the Vitis Vision library, which generate fixed-function hardware. On most benchmarks, we see up to 25% degradation in latency with approximately a 1.7x to 2x increase in FPGA LUT consumption. Our ability to execute any benchmark without incurring the high costs of hardware synthesis, place-and-route, and FPGA re-flashing justifies the slight performance loss and increased resource consumption that we experience. FlowPix achieves an average frame rate of 170 FPS on HD frames of 1920 × 1080 pixels in the implemented benchmarks.
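
To put the reported frame rate in perspective, the short Python sketch below converts it into an average pixel rate and a per-frame time budget. It is only back-of-the-envelope arithmetic on the two figures stated in the abstract (170 FPS, 1920 × 1080 frames). The function names are hypothetical, and the calculation assumes every pixel of every frame is processed exactly once, which the abstract does not state.

```python
# Back-of-the-envelope figures derived from the abstract's reported numbers.
# Assumption (not from the paper): each frame is processed in full, once.

WIDTH, HEIGHT = 1920, 1080   # HD frame size stated in the abstract
AVG_FPS = 170                # average frame rate stated in the abstract


def pixels_per_second(width: int, height: int, fps: float) -> float:
    """Average number of pixels processed per second at the given frame rate."""
    return width * height * fps


def frame_time_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps


if __name__ == "__main__":
    rate = pixels_per_second(WIDTH, HEIGHT, AVG_FPS)
    print(f"{rate / 1e6:.1f} Mpixel/s")                  # 352.5 Mpixel/s
    print(f"{frame_time_ms(AVG_FPS):.2f} ms per frame")  # 5.88 ms per frame
```

Under these assumptions, the average corresponds to roughly 352.5 million pixels per second, or a little under 6 ms per HD frame. Because 170 FPS is an average across the benchmarks, individual benchmarks may sit above or below this figure.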



Updated: 2023-12-14