-
PQA: Exploring the Potential of Product Quantization in DNN Hardware Acceleration ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-18 Ahmed F. AbouElhamayed, Angela Cui, Javier Fernandez-Marques, Nicholas D. Lane, Mohamed S. Abdelfattah
Conventional multiply-accumulate (MAC) operations have long dominated computation time for deep neural networks (DNNs), especially convolutional neural networks (CNNs). Recently, product quantization (PQ) has been applied to these workloads, replacing MACs with memory lookups to pre-computed dot products. To better understand the efficiency tradeoffs of product-quantized DNNs (PQ-DNNs), we create a
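The MAC-to-lookup substitution at the heart of PQ can be sketched in a few lines of plain Python. The sizes, codebooks, and weights below are illustrative placeholders, not taken from the paper:

```python
import random

random.seed(0)

# Illustrative sizes (not from the paper): 16-dim inputs split into
# 4 subvectors, each quantized against a codebook of 8 prototypes.
D, N_SUB, K = 16, 4, 8
d = D // N_SUB
codebooks = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(K)]
             for _ in range(N_SUB)]
weights = [random.gauss(0, 1) for _ in range(D)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Offline: pre-compute the dot product of every prototype with the
# matching slice of the weight vector (this table replaces the MACs).
tables = [[dot(codebooks[s][j], weights[s * d:(s + 1) * d]) for j in range(K)]
          for s in range(N_SUB)]

def pq_dot(x):
    """Approximate dot(x, weights) with table lookups instead of MACs."""
    acc = 0.0
    for s in range(N_SUB):
        sub = x[s * d:(s + 1) * d]
        # Encode: index of the nearest prototype for this subvector.
        code = min(range(K),
                   key=lambda j: sum((a - b) ** 2
                                     for a, b in zip(codebooks[s][j], sub)))
        acc += tables[s][code]  # memory lookup, no multiply-accumulate
    return acc
```

If an input subvector coincides with a prototype, the lookup is exact; otherwise the quantization error of the encoding step carries through to the dot product.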
-
Toward FPGA Intellectual Property (IP) Encryption from Netlist to Bitstream ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-12 Daniel Hutchings, Adam Taylor, Jeffrey Goeders
Current IP encryption methods offered by FPGA vendors use an approach where the IP is decrypted during the CAD flow, and remains unencrypted in the bitstream. Given the ease of accessing modern bitstream-to-netlist tools, encrypted IP is vulnerable to inspection and theft from the IP user. While the entire bitstream can be encrypted, this is done by the user, and is not a mechanism to protect confidentiality
-
HierCGRA: A Novel Framework for Large-Scale CGRA with Hierarchical Modeling and Automated Design Space Exploration ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-08 Sichao Chen, Chang Cai, Su Zheng, Jiangnan Li, Guowei Zhu, Jingyuan Li, Yazhou Yan, Yuan Dai, Wenbo Yin, Lingli Wang
Coarse-grained reconfigurable arrays (CGRAs) are promising design choices in computation-intensive domains since they can strike a balance between energy efficiency and flexibility. A typical CGRA comprises processing elements (PEs) that can execute operations in applications and interconnections between them. Nevertheless, most CGRAs suffer from the ineffectiveness of supporting flexible architecture
-
R-Blocks: an Energy-Efficient, Flexible, and Programmable CGRA ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-08 Barry de Bruin, Kanishkan Vadivel, Mark Wijtvliet, Pekka Jääskeläinen, Henk Corporaal
Emerging data-driven applications in the embedded, e-Health, and internet of things (IoT) domain require complex on-device signal analysis and data reduction to maximize energy efficiency on these energy-constrained devices. Coarse-grained reconfigurable architectures (CGRAs) have been proposed as a good compromise between flexibility and energy efficiency for ultra-low power (ULP) signal processing
-
DANSEN: Database Acceleration on Native Computational Storage by Exploiting NDP ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-04 Sajjad Tamimi, Arthur Bernhardt, Florian Stock, Ilia Petrov, Andreas Koch
This paper introduces DANSEN, the hardware accelerator component for neoDBMS, a full-stack computational storage system designed to manage on-device execution of database queries/transactions as a Near-Data Processing (NDP)-operation. The proposed system enables Database Management Systems (DBMS) to offload NDP-operations to the storage while maintaining control over data through a native storage interface
-
Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-04 Hongzheng Chen, Jiahao Zhang, Yixiao Du, Shaojie Xiang, Zichao Yue, Niansong Zhang, Yaohui Cai, Zhiru Zhang
Recent advancements in large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads. While hardware accelerators for Transformer-based models have been extensively studied, the majority of existing approaches rely on temporal architectures that reuse hardware units for different network layers and operators. However
-
HLPerf: Demystifying the Performance of HLS-based Graph Neural Networks with Dataflow Architectures ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-02 Chenfeng Zhao, Clayton J. Faber, Roger D. Chamberlain, Xuan Zhang
The development of FPGA-based applications using HLS is fraught with performance pitfalls and large design space exploration times. These issues are exacerbated when the application is complicated and its performance is dependent on the input data set, as is often the case with graph neural network approaches to machine learning. Here, we introduce HLPerf, an open-source, simulation-based performance
-
PTME: A Regular Expression Matching Engine Based on Speculation and Enumerative Computation on FPGA ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-04-01 Mingqian Sun, Guangwei Xie, Fan Zhang, Wei Guo, Xitian Fan, Tianyang Li, Li Chen, Jiayu Du
Fast regular expression matching is an essential task for deep packet inspection. In previous works, regular expression matching engines on FPGA struggled to achieve an ideal balance between resource consumption and throughput. Speculation and enumerative computation exploit the statistical properties of deterministic finite automata, allowing for more efficient pattern matching. Existing related
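The enumerative idea can be illustrated without any hardware detail: a later chunk of the input is run from every possible DFA state in parallel, and the correct continuation is selected once the earlier chunk's end state is known. A toy sketch with a hypothetical DFA (not from the paper):

```python
def run(dfa, state, chunk):
    """Step a DFA (dict: state -> {symbol -> state}) through a chunk."""
    for ch in chunk:
        state = dfa[state][ch]
    return state

def enumerative_match(dfa, start, text, split):
    """Match two chunks 'in parallel': the tail chunk is run from every
    state (enumeration); the right result is selected once the head's end
    state is known. On FPGA, the per-state runs are concurrent units."""
    head, tail = text[:split], text[split:]
    tail_map = {s: run(dfa, s, tail) for s in dfa}  # enumerate all states
    return tail_map[run(dfa, start, head)]

# Toy DFA recognizing the substring "ab" (state 2 is an accepting sink).
DFA = {0: {'a': 1, 'b': 0},
       1: {'a': 1, 'b': 2},
       2: {'a': 2, 'b': 2}}
```

The result always equals sequential matching; the statistical properties mentioned in the abstract determine how cheaply the enumeration can be pruned in practice.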
-
Design and implementation of hardware-software architecture based on hashes for SPHINCS+ ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-27 Jonathan López-Valdivieso, René Cumplido
Advances in quantum computing have posed a future threat to today’s cryptography. With the advent of these quantum computers, security could be compromised. Therefore, the National Institute of Standards and Technology (NIST) has issued a request for proposals to standardize algorithms for post-quantum cryptography (PQC), which is considered difficult to solve for both classical and quantum computers
-
FADO: Floorplan-Aware Directive Optimization Based on Synthesis and Analytical Models for High-Level Synthesis Designs on Multi-Die FPGAs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-20 Linfeng Du, Tingyuan Liang, Xiaofeng Zhou, Jinming Ge, Shangkun Li, Sharad Sinha, Jieru Zhao, Zhiyao Xie, Wei Zhang
Multi-die FPGAs are widely adopted for large-scale accelerators, but optimizing high-level synthesis designs on these FPGAs faces two challenges. First, the delay caused by die-crossing nets creates an NP-hard floorplanning problem. Second, traditional directive optimization cannot consider resource constraints on each die or the timing issue incurred by the die-crossings. Furthermore, the high algorithmic
-
Designing an IEEE-compliant FPU that supports configurable precision for soft processors ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-15 Chris Keilbart, Yuhui Gao, Martin Chua, Eric Matthews, Steven J.E. Wilton, Lesley Shannon
Field Programmable Gate Arrays (FPGAs) are commonly used to accelerate floating-point (FP) applications. Although researchers have extensively studied FPGA FP implementations, existing work has largely focused on standalone operators and frequency-optimized designs. These works are not suitable for FPGA soft processors which are more sensitive to latency, impose a lower frequency ceiling, and require
-
L-FNNG: Accelerating Large-Scale KNN Graph Construction on CPU-FPGA Heterogeneous Platform ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-14 Chaoqiang Liu, Xiaofei Liao, Long Zheng, Yu Huang, Haifeng Liu, Yi Zhang, Haiheng He, Haoyan Huang, Jingyi Zhou, Hai Jin
Due to the high complexity of constructing exact k-nearest neighbor graphs, approximate construction has become a popular research topic. The NN-Descent algorithm is one of the representative in-memory algorithms. To effectively handle large datasets, existing state-of-the-art solutions combine the divide-and-conquer approach and the NN-Descent algorithm, where large datasets are divided into multiple
-
XVDPU: A High-Performance CNN Accelerator on the Versal Platform Powered by the AI Engine ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-13 Xijie Jia, Yu Zhang, Guangdong Liu, Xinlin Yang, Tianyu Zhang, Jia Zheng, Dongdong Xu, Zhuohuan Liu, Mengke Liu, Xiaoyang Yan, Hong Wang, Rongzhang Zheng, Li Wang, Dong Li, Satyaprakash Pareek, Jian Weng, Lu Tian, Dongliang Xie, Hong Luo, Yi Shan
Today, convolutional neural networks (CNNs) are widely used in computer vision applications. However, the trends of higher accuracy and higher resolution generate larger networks. The requirements of computation or I/O are the key bottlenecks. In this article, we propose XVDPU: the AI Engine (AIE)-based CNN accelerator on Versal chips to meet heavy computation requirements. To resolve the IO bottleneck
-
ExHiPR: Extended High-Level Partial Reconfiguration for Fast Incremental FPGA Compilation ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-13 Yuanlong Xiao, Dongjoon Park, Zeyu Jason Niu, Aditya Hota, André Dehon
Partial Reconfiguration (PR) is a key technique in the application design on modern FPGAs. However, current PR tools heavily rely on the developer to manually conduct PR module definition, floorplanning, and flow control at a low level. The existing PR tools do not consider High-Level-Synthesis languages either, which are of great interest to software developers. We propose HiPR, an open-source framework
-
GraphScale: Scalable Processing on FPGAs for HBM and Large Graphs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-13 Jonas Dann, Daniel Ritter, Holger Fröning
Recent advances in graph processing on FPGAs promise to alleviate performance bottlenecks with irregular memory access patterns. Such bottlenecks challenge performance for a growing number of important application areas like machine learning and data analytics. While FPGAs denote a promising solution through flexible memory hierarchies and massive parallelism, we argue that current graph processing
-
The Open-source DeLiBA2 Hardware/Software Framework for Distributed Storage Accelerators ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-13 Babar Khan, Carsten Heinz, Andreas Koch
With the trend towards ever larger “big data” applications, many of the gains achievable by using specialized compute accelerators become diminished due to the growing I/O overheads. While there have been several research efforts into computational storage and FPGA implementations of the NVMe interface, to our knowledge, there have been only very limited efforts to move larger parts of the Linux block
-
Design, Calibration, and Evaluation of Real-time Waveform Matching on an FPGA-based Digitizer at 10 GS/s ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-13 Jens Trautmann, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich
Digitizing side-channel signals at high sampling rates produces huge amounts of data, while side-channel analysis techniques only need those specific trace segments containing Cryptographic Operations (COs). For detecting these segments, waveform-matching techniques have been established comparing the signal with a template of the CO’s characteristic pattern. Real-time waveform matching requires highly
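As a software analogue, the matching step amounts to a sliding dot product between the digitized trace and the CO template, with a detection wherever the score peaks. A minimal sketch (a streaming hardware matcher would update these scores sample by sample rather than recomputing windows):

```python
def match_scores(signal, template):
    """Sliding dot-product score of a template against a signal."""
    m = len(template)
    return [sum(s * t for s, t in zip(signal[i:i + m], template))
            for i in range(len(signal) - m + 1)]

def best_offset(signal, template):
    """Offset at which the template matches the signal best."""
    scores = match_scores(signal, template)
    return max(range(len(scores)), key=scores.__getitem__)
```

At 10 GS/s the point of the FPGA design is to evaluate these window scores in real time so that only the matching trace segments are stored.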
-
DONGLE 2.0: Direct FPGA-Orchestrated NVMe Storage for HLS ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-03-05 Linus Y. Wong, Jialiang Zhang, Jing (Jane) Li
Rapid growth in data size poses significant computational and memory challenges to data processing. FPGA accelerators and near-storage processing have emerged as compelling solutions for tackling the growing computational and memory requirements. Many FPGA-based accelerators have shown to be effective in processing large data sets by leveraging the storage capability of either host-attached or FPGA-attached
-
ScalaBFS2: A High Performance BFS Accelerator on an HBM-enhanced FPGA Chip ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-29 Kexin Li, Shaoxian Xu, Zhiyuan Shao, Ran Zheng, Xiaofei Liao, Hai Jin
The introduction of High Bandwidth Memory (HBM) to the FPGA chip makes it possible for an FPGA-based accelerator to leverage the huge memory bandwidth of HBM to improve its performance when implementing a specific algorithm. This is especially true for the Breadth-First Search (BFS) algorithm, which demands high bandwidth when accessing the graph data stored in memory. Different from traditional FPGA-DRAM
-
AxOMaP: Designing FPGA-based Approximate Arithmetic Operators using Mathematical Programming ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-19 Siva Satyendra Sahoo, Salim Ullah, Akash Kumar
With the increasing application of machine learning (ML) algorithms in embedded systems, there is a rising necessity to design low-cost computer arithmetic for these resource-constrained systems. As a result, emerging models of computation, such as approximate and stochastic computing, that leverage the inherent error-resilience of such algorithms are being actively explored for implementing ML inference
-
Eciton: Very Low-power Recurrent Neural Network Accelerator for Real-time Inference at the Edge ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-12 Jeffrey Chen, Sang-Woo Jun, Sehwan Hong, Warrick He, Jinyeong Moon
This article presents Eciton, a very low-power recurrent neural network accelerator for time series data within low-power edge sensor nodes, achieving real-time inference with a power consumption of 17 mW under load. Eciton reduces memory and chip resource requirements via 8-bit quantization and hard sigmoid activation, allowing the accelerator as well as the recurrent neural network model parameters
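The hard sigmoid mentioned in the abstract replaces the transcendental sigmoid with a clipped line, needing only one multiply, one add, and two comparisons in hardware. A common form uses slope 0.2 and offset 0.5 (an assumption here; the paper may use different constants):

```python
def hard_sigmoid(x):
    """Piecewise-linear sigmoid surrogate: clip(0.2*x + 0.5, 0, 1)."""
    return max(0.0, min(1.0, 0.2 * x + 0.5))
```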
-
Introduction to the FPL 2021 Special Section ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-12 Diana Göhringer, Georgios Keramidas, Akash Kumar
-
An Efficient FPGA-based Depthwise Separable Convolutional Neural Network Accelerator with Hardware Pruning ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-12 Zhengyan Liu, Qiang Liu, Shun Yan, Ray C. C. Cheung
Convolutional neural networks (CNNs) have been widely deployed in computer vision tasks. However, the computation and resource intensive characteristics of CNN bring obstacles to its application on embedded systems. This article proposes an efficient inference accelerator on Field Programmable Gate Array (FPGA) for CNNs with depthwise separable convolutions. To improve the accelerator efficiency, we
-
Evaluating the Impact of Using Multiple-Metal Layers on the Layout Area of Switch Blocks for Tile-Based FPGAs in FinFET 7nm ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-12 Sajjad Rostami Sani, Andy Ye
A new area model for estimating the layout area of switch blocks is introduced in this work. The model is based on a realistic layout strategy. As a result, it not only takes into consideration the active area that is needed to construct a switch block but also the number of metal layers available and the actual dimensions of these metals. The model assigns metal layers to the routing tracks in a way
-
An All-digital Compute-in-memory FPGA Architecture for Deep Learning Acceleration ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-12 Yonggen Li, Xin Li, Haibin Shen, Jicong Fan, Yanfeng Xu, Kejie Huang
Field Programmable Gate Array (FPGA) is a versatile and programmable hardware platform, which makes it a promising candidate for accelerating Deep Neural Networks (DNNs). However, FPGA’s computing energy efficiency is low due to the domination of energy consumption by interconnect data movement. In this article, we propose an all-digital Compute-in-memory FPGA architecture for deep learning acceleration
-
Exploring FPGA Switch-Blocks without Explicitly Listing Connectivity Patterns ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-12 Stefan Nikolić, Paolo Ienne
Increased lower metal resistance makes physical aspects of Field-Programmable Gate Array (FPGA) switch-blocks more relevant than before. The need to navigate a design space where each individual switch can have significant impact on the FPGA’s performance in turn makes automated switch-pattern exploration techniques increasingly appealing. However, most existing exploration techniques have a fundamental
-
High-Efficiency Compressor Trees for Latest AMD FPGAs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-02-10 Konstantin J. Hoßfeld, Hans Jakob Damsgaard, Jari Nurmi, Michaela Blott, Thomas B. Preußer
High-fan-in dot product computations are ubiquitous in highly relevant application domains, such as signal processing and machine learning. Particularly, the diverse set of data formats used in machine learning poses a challenge for flexible efficient design solutions. Ideally, a dot product summation is composed from a carry-free compressor tree followed by a terminal carry-propagate addition. On
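The carry-free structure the abstract refers to can be illustrated with 3:2 carry-save compressors, which reduce three addends to two without propagating carries between bit positions; only the final two-operand add is carry-propagate. A bit-parallel software sketch (illustrative, not the paper's mapping to AMD FPGA primitives):

```python
def csa(a, b, c):
    """3:2 compressor on integers: three addends -> (sum, carry) with
    a + b + c == sum + carry and no carry chain between bit positions."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def compressor_tree_sum(addends):
    """Reduce many addends with CSAs, then one carry-propagate add."""
    addends = list(addends)
    while len(addends) > 2:
        a, b, c, *rest = addends
        s, carry = csa(a, b, c)
        addends = rest + [s, carry]
    return sum(addends)  # the single terminal carry-propagate addition
```

Each compression step removes one operand, so an n-input sum needs n-2 compressors before the terminal addition.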
-
A Hardware Accelerator for the Semi-Global Matching Stereo Algorithm: An Efficient Implementation for the Stratix V and Zynq UltraScale+ FPGA Technology ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 John Kalomiros, John Vourvoulakis, Stavros Vologiannidis
The semi-global matching stereo algorithm is a top performing algorithm in stereo vision. The recursive nature of the computations involved in this algorithm introduces an inherent data dependency problem, hindering the progressive computations of disparities at pixel clock. In this work, a novel hardware implementation of the semi-global matching algorithm is presented. A hardware structure of parallel
-
Designing Deep Learning Models on FPGA with Multiple Heterogeneous Engines ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Miguel Reis, Mário Véstias, Horácio Neto
Deep learning models are becoming more complex and heterogeneous with new layer types to improve their accuracy. This brings a considerable challenge to the designers of accelerators of deep neural networks. There have been several architectures and design flows to map deep learning models on hardware, but they are limited to a particular model and/or layer types. Also, the architectures generated
-
A Partitioned CAM Architecture with FPGA Acceleration for Binary Descriptor Matching ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Parastoo Soleimani, David W. Capson, Kin Fun Li
An efficient architecture for image descriptor matching that uses a partitioned content-addressable memory (CAM)-based approach is proposed. CAM is frequently used in high-speed content-matching applications. However, due to its lack of functionality to support approximate matching, conventional CAM is not directly useful for image descriptor matching. Our modifications improve the CAM architecture
-
Tailor: Altering Skip Connections for Resource-Efficient Inference ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Olivia Weng, Gabriel Marcano, Vladimir Loncar, Alireza Khodamoradi, Abarajithan G, Nojan Sheybani, Andres Meza, Farinaz Koushanfar, Kristof Denolf, Javier Mauricio Duarte, Ryan Kastner
Deep neural networks use skip connections to improve training convergence. However, these skip connections are costly in hardware, requiring extra buffers and increasing on- and off-chip memory utilization and bandwidth requirements. In this article, we show that skip connections can be optimized for hardware when tackled with a hardware-software codesign approach. We argue that while a network’s skip
-
Programmable Analog System Benchmarks Leading to Efficient Analog Computation Synthesis ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Jennifer Hasler, Cong Hao
This effort develops the first rich suite of analog and mixed-signal benchmarks of various sizes and domains, intended for use with contemporary analog and mixed-signal designs and synthesis tools. Benchmarking enables analog-digital co-design exploration as well as extensive evaluation of analog synthesis tools and the generated analog/mixed-signal circuit or device. The goals of this effort are defining
-
FDRA: A Framework for a Dynamically Reconfigurable Accelerator Supporting Multi-Level Parallelism ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Yunhui Qiu, Yiqing Mao, Xuchen Gao, Sichao Chen, Jiangnan Li, Wenbo Yin, Lingli Wang
Coarse-grained reconfigurable architectures (CGRAs) have emerged as promising accelerators due to their high flexibility and energy efficiency. However, existing open source works often lack integration of CGRAs with CPU systems and corresponding toolchains. Moreover, there is rare support for the accelerator instruction pipelining to overlap data communication, computation, and configuration across
-
Strega: An HTTP Server for FPGAs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Fabio Maschi, Gustavo Alonso
The computer architecture landscape is being reshaped by the new opportunities, challenges, and constraints brought by the cloud. On the one hand, high-level applications profit from specialised hardware to boost their performance and reduce deployment costs. On the other hand, cloud providers maximise the CPU time allocated to client applications by offloading infrastructure tasks to hardware accelerators
-
Reprogrammable Non-Linear Circuits Using ReRAM for NN Accelerators ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Rafael Fão De Moura, Luigi Carro
As the massive usage of artificial intelligence techniques spreads in the economy, researchers are exploring new techniques to reduce the energy consumption of Neural Network (NN) applications, especially as the complexity of NNs continues to increase. Using analog Resistive RAM devices to compute matrix-vector multiplication in O(1) time complexity is a promising approach, but it is true that these
-
Automated Buffer Sizing of Dataflow Applications in a High-level Synthesis Workflow ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Alexandre Honorat, Mickaël Dardaillon, Hugo Miomandre, Jean-François Nezan
High-Level Synthesis (HLS) tools are mature enough to provide efficient code generation for computation kernels on FPGA hardware. For more complex applications, multiple kernels may be connected by a dataflow graph. Although some tools, such as Xilinx Vitis HLS, support dataflow directives, they lack efficient analysis methods to compute the buffer sizes between kernels in a dataflow graph. This article
-
Montgomery Multiplication Scalable Systolic Designs Optimized for DSP48E2 ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-27 Louis Noyez, Nadia El Mrabet, Olivier Potin, Pascal Veron
This article describes an extensive study of the use of DSP48E2 Slices in Ultrascale FPGAs to design hardware versions of the Montgomery Multiplication algorithm for the hardware acceleration of modular multiplications. Our fully scalable systolic architectures result in parallelized, DSP48E2-optimized scheduling of operations analogous to the FIOS block variant of the Montgomery Multiplication. We
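For reference, the reduction underlying all Montgomery multipliers (REDC) can be stated in a few lines; the word-serial FIOS scheduling that the article maps onto DSP48E2 slices interleaves these steps, but the arithmetic identity is the same. The parameters below are toy-scale illustrations:

```python
def montgomery_setup(n, k):
    """Pre-compute constants for odd modulus n and R = 2**k > n."""
    R = 1 << k
    n_prime = (-pow(n, -1, R)) % R   # n * n' == -1 (mod R)
    return R, n_prime

def redc(t, n, k, R, n_prime):
    """Montgomery reduction: returns t * R^{-1} mod n, for t < n*R."""
    m = ((t & (R - 1)) * n_prime) & (R - 1)
    u = (t + m * n) >> k             # exact division by R
    return u - n if u >= n else u

def mont_mul(a, b, n, k, R, n_prime):
    """a*b mod n: to Montgomery form, multiply + reduce, convert back."""
    a_bar = (a * R) % n              # done once per operand in practice
    b_bar = (b * R) % n
    prod_bar = redc(a_bar * b_bar, n, k, R, n_prime)
    return redc(prod_bar, n, k, R, n_prime)
```

Because REDC replaces the trial division of a classical modular reduction with masks, shifts, and multiplies, it maps naturally onto DSP-slice multiply-add datapaths.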
-
High Throughput FPGA-Based Object Detection via Algorithm-Hardware Co-Design ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-15 Anupreetham Anupreetham, Mohamed Ibrahim, Mathew Hall, Andrew Boutros, Ajay Kuzhively, Abinash Mohanty, Eriko Nurvitadhi, Vaughn Betz, Yu Cao, Jae-Sun Seo
Object detection and classification is a key task in many computer vision applications such as smart surveillance and autonomous vehicles. Recent advances in deep learning have significantly improved the quality of results achieved by these systems, making them more accurate and reliable in complex environments. Modern object detection systems make use of lightweight convolutional neural networks (CNNs)
-
A Hardware Design Framework for Computer Vision Models Based on Reconfigurable Devices ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2024-01-15 Zimeng Fan, Wei Hu, Fang Liu, Dian Xu, Hong Guo, Yanxiang He, Min Peng
In computer vision, the joint development of the algorithm and computing dimensions cannot be separated. Models and algorithms are constantly evolving, while hardware designs must adapt to new or updated algorithms. Reconfigurable devices are recognized as important platforms for computer vision applications because of their reconfigurability. There are two typical design approaches: customized and
-
CSAIL2019 Crypto-Puzzle Solver Architecture ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-29 Sergey Gribok, Bogdan Pasca, Martin Langhammer
The CSAIL2019 time-lock puzzle is an unsolved cryptographic challenge introduced by Ron Rivest in 2019, replacing the solved LCS35 puzzle. Solving these types of puzzles requires large amounts of intrinsically sequential computations, with each iteration performing a very large (3072-bit for CSAIL2019) modular multiplication operation. The complexity of each iteration is several times greater than
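The intrinsically sequential workload is t iterated modular squarings, i.e., computing 2^(2^t) mod n: each squaring depends on the previous result, so the only route to a faster solution is a faster single squaring, which is what such a solver accelerates. A toy-scale sketch with a tiny modulus (the CSAIL2019 modulus is 3072 bits and its factorization is secret):

```python
import math

# Toy RSA-style modulus with KNOWN factors, used here only to verify the
# result; in the real puzzle p and q are secret, which is precisely what
# forces the long sequential computation.
p, q = 1019, 1021
n = p * q
t = 10_000

x = 2
for _ in range(t):        # t strictly sequential modular squarings
    x = (x * x) % n

# Shortcut available only to whoever knows the factorization:
# reduce the exponent 2^t modulo lambda(n) = lcm(p-1, q-1).
lam = math.lcm(p - 1, q - 1)
shortcut = pow(2, pow(2, t, lam), n)
```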
-
AEKA: FPGA Implementation of Area-Efficient Karatsuba Accelerator for Ring-Binary-LWE-based Lightweight PQC ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-11 Tianyou Bao, Pengzhou He, Jiafeng Xie, H S. Jacinto
Lightweight PQC-related research and development have gradually gained attention from the research community recently. The Ring-Binary-Learning-with-Errors (RBLWE)-based encryption scheme (RBLWE-ENC) is a promising lightweight PQC built on small parameter sets to fit related applications, but those parameters do not favor deploying popular fast algorithms such as the number theoretic transform. To solve this problem, in this
-
Introduction to the Special Section on FCCM 2022 ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-05 Jing Li, Martin Herbordt
No abstract available.
-
CHIP-KNNv2: A Configurable and High-Performance K-Nearest Neighbors Accelerator on HBM-based FPGAs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-05 Kenneth Liu, Alec Lu, Kartik Samtani, Zhenman Fang, Licheng Guo
The k-nearest neighbors (KNN) algorithm is an essential algorithm in many applications, such as similarity search, image classification, and database query. With the rapid growth in the dataset size and the feature dimension of each data point, processing KNN becomes more compute and memory hungry. Most prior studies focus on accelerating the computation of KNN using the abundant parallel resource
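The kernel in question is, at its core, an exhaustive distance computation followed by a top-k selection; a brute-force reference version makes the compute and memory pattern explicit (how the accelerator streams points from HBM banks is, of course, the paper's contribution, not shown here):

```python
import heapq
import math

def knn(query, points, k):
    """Brute-force k-nearest neighbors: all pairwise distances to the
    query, then a top-k selection, returning indices nearest-first."""
    dists = ((math.dist(query, p), i) for i, p in enumerate(points))
    return [i for _, i in heapq.nsmallest(k, dists)]
```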
-
TAPA: A Scalable Task-parallel Dataflow Programming Framework for Modern FPGAs with Co-optimization of HLS and Physical Design ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-05 Licheng Guo, Yuze Chi, Jason Lau, Linghao Song, Xingyu Tian, Moazin Khatti, Weikang Qiao, Jie Wang, Ecenur Ustun, Zhenman Fang, Zhiru Zhang, Jason Cong
In this article, we propose TAPA, an end-to-end framework that compiles a C++ task-parallel dataflow program into a high-frequency FPGA accelerator. Compared to existing solutions, TAPA has two major advantages. First, TAPA provides a set of convenient APIs that allows users to easily express flexible and complex inter-task communication structures. Second, TAPA adopts a coarse-grained floorplanning
-
High-efficiency TRNG Design Based on Multi-bit Dual-ring Oscillator ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-05 Yingchun Lu, Yun Yang, Rong Hu, Huaguo Liang, Maoxiang Yi, Huang Zhengfeng, Yuanming Ma, Tian Chen, Liang Yao
Unpredictable true random numbers are required in security technology fields such as information encryption, key generation, mask generation for anti-side-channel analysis, algorithm initialization, and so on. At present, true random number generators (TRNGs) cannot supply random bits quickly enough because of their low-speed bit generation. Therefore, it is necessary to design a faster TRNG. This work presents
-
Covert-channels in FPGA-enabled SmartSSDs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-12-04 Theodoros Trochatos, Anthony Etim, Jakub Szefer
Cloud computing providers today offer access to a variety of devices, which users can rent and access remotely in a shared setting. Among these devices are SmartSSDs, which are solid-state disks (SSDs) augmented with an FPGA, enabling users to instantiate custom circuits within the FPGA, including potentially malicious circuits for power and temperature measurement. Normally, cloud users have no remote
-
Across Time and Space: Senju’s Approach for Scaling Iterative Stencil Loop Accelerators on Single and Multiple FPGAs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-11-29 Emanuele Del Sozzo, Davide Conficconi, Kentaro Sano
Stencil-based applications play an essential role in high-performance systems as they occur in numerous computational areas, such as partial differential equation solving. In this context, Iterative Stencil Loops (ISLs) represent a prominent and well-known algorithmic class within the stencil domain. Specifically, ISL-based calculations iteratively apply the same stencil to a multi-dimensional point
-
On the Malicious Potential of Xilinx’ Internal Configuration Access Port (ICAP) ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-11-17 Nils Albartus, Maik Ender, Jan-Niklas Möller, Marc Fyrbiak, Christof Paar, Russell Tessier
FPGAs have become increasingly popular in computing platforms. With recent advances in bitstream format reverse engineering, the scientific community has widely explored static FPGA security threats. For example, it is now possible to convert a bitstream to a netlist, revealing design information, and apply modifications to the static bitstream based on this knowledge. However, a systematic study of
-
HyBNN: Quantifying and Optimizing Hardware Efficiency of Binary Neural Networks ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-11-07 Geng Yang, Jie Lei, Zhenman Fang, Yunsong Li, Jiaqing Zhang, Weiying Xie
Binary neural network (BNN), where both the weight and the activation values are represented with one bit, provides an attractive alternative to deploy highly efficient deep learning inference on resource-constrained edge devices. However, our investigation reveals that, to achieve satisfactory accuracy gains, state-of-the-art (SOTA) BNNs, such as FracBNN and ReActNet, usually have to incorporate various
-
Constraint-Aware Multi-Technique Approximate High-Level Synthesis for FPGAs ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-10-09 Marcos T. Leipnitz, Gabriel L. Nazar
Numerous approximate computing (AC) techniques have been developed to reduce the design costs in error-resilient application domains, such as signal and multimedia processing, data mining, machine learning, and computer vision, to trade-off computation accuracy with area and power savings or performance improvements. Selecting adequate techniques for each application and optimization target is complex
-
FPGA-based Deep Learning Inference Accelerators: Where Are We Standing? ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-10-09 Anouar Nechi, Lukas Groth, Saleh Mulhem, Farhad Merchant, Rainer Buchty, Mladen Berekovic
Recently, artificial intelligence applications have become part of almost all emerging technologies around us. Neural networks, in particular, have shown significant advantages and have been widely adopted over other approaches in machine learning. In this context, high processing power is deemed a fundamental challenge and a persistent requirement. Recent solutions facing such a challenge deploy hardware
-
RapidStream 2.0: Automated Parallel Implementation of Latency–Insensitive FPGA Designs Through Partial Reconfiguration ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Licheng Guo, Pongstorn Maidee, Yun Zhou, Chris Lavin, Eddie Hung, Wuxi Li, Jason Lau, Weikang Qiao, Yuze Chi, Linghao Song, Yuanlong Xiao, Alireza Kaviani, Zhiru Zhang, Jason Cong
Field-programmable gate arrays (FPGAs) require a much longer compilation cycle than conventional computing platforms such as CPUs. In this article, we shorten the overall compilation time by co-optimizing the HLS compilation (C-to-RTL) and the back-end physical implementation (RTL-to-bitstream). We propose a split compilation approach based on the pipelining flexibility at the HLS level, which allows
-
Topgun: An ECC Accelerator for Private Set Intersection ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Guiming Wu, Qianwen He, Jiali Jiang, Zhenxiang Zhang, Yuan Zhao, Yinchao Zou, Jie Zhang, Changzheng Wei, Ying Yan, Hui Zhang
Elliptic Curve Cryptography (ECC), one of the most widely used asymmetric cryptographic algorithms, has been deployed in Transport Layer Security (TLS) protocol, blockchain, secure multiparty computation, and so on. As one of the most secure ECC curves, Curve25519 is employed by some secure protocols, such as TLS 1.3 and Diffie-Hellman Private Set Intersection (DH-PSI) protocol. High-performance implementation
-
An FPGA Accelerator for Genome Variant Calling ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Tiancheng Xu, Scott Rixner, Alan L. Cox
In genome analysis, it is often important to identify variants from a reference genome. However, identifying variants that occur with low frequency can be challenging, as it is computationally intensive to do so accurately. LoFreq is a widely used program that is adept at identifying low-frequency variants. This article presents a design framework for an FPGA-based accelerator for LoFreq. In particular
-
Resource Sharing in Dataflow Circuits ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Lana Josipović, Axel Marmet, Andrea Guerrieri, Paolo Ienne
To achieve resource-efficient hardware designs, high-level synthesis (HLS) tools share (i.e., time-multiplex) functional units among operations of the same type. This optimization is typically performed in conjunction with operation scheduling to ensure the best possible unit usage at each point in time. Dataflow circuits have emerged as an alternative HLS approach to efficiently handle irregular and
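The sharing the abstract describes, time-multiplexing one functional unit among several operations of the same type, can be sketched as a toy scheduler (a generic illustration with hypothetical names, not the paper's dataflow-circuit mechanism):

```python
def share_units(ops, num_units):
    """Greedily assign each operation (with an earliest-start cycle) to a
    time slot on one of `num_units` shared units; returns {op: (unit, cycle)}."""
    busy = {}       # (unit, cycle) -> op occupying that slot
    schedule = {}
    for op, ready in sorted(ops.items(), key=lambda kv: kv[1]):
        cycle = ready
        while True:
            for unit in range(num_units):
                if (unit, cycle) not in busy:
                    busy[(unit, cycle)] = op
                    schedule[op] = (unit, cycle)
                    break
            else:
                cycle += 1  # all units occupied this cycle: delay the op
                continue
            break
    return schedule

# Four multiplies ready at cycle 0, but only two multiplier units:
# two ops execute at cycle 0, the other two are delayed to cycle 1.
sched = share_units({"m0": 0, "m1": 0, "m2": 0, "m3": 0}, num_units=2)
```

The trade-off the abstract alludes to is visible here: halving the unit count costs an extra cycle of latency for the displaced operations, and a scheduler must weigh that cost against the area saved.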
-
Parallelising Control Flow in Dynamic-scheduling High-level Synthesis ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Jianyi Cheng, Lana Josipović, John Wickerson, George A. Constantinides
Recently, there has been a trend toward using high-level synthesis (HLS) tools to generate dynamically scheduled hardware. The generated hardware is made up of components connected using handshake signals, which schedule the components at runtime as inputs become available. Such approaches promise superior performance on “irregular” source programs, such as those whose control flow depends
-
Logic Shrinkage: Learned Connectivity Sparsification for LUT-Based Neural Networks ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Erwei Wang, Marie Auffret, Georgios-Ilias Stavrou, Peter Y. K. Cheung, George A. Constantinides, Mohamed S. Abdelfattah, James J. Davis
Field-programmable gate array (FPGA)–specific deep neural network (DNN) architectures using native lookup tables (LUTs) as independently trainable inference operators have been shown to achieve favorable area-accuracy and energy-accuracy trade-offs. The first work in this area, LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In this article, we propose the learned optimization
-
A Reconfigurable Architecture for Real-time Event-based Multi-Object Tracking ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-09-01 Yizhao Gao, Song Wang, Hayden Kwok-Hay So
Although advances in event-based machine vision algorithms have demonstrated unparalleled capabilities in performing some of the most demanding tasks, their implementations under stringent real-time and power constraints in edge systems remain a major challenge. In this work, a reconfigurable hardware-software architecture called REMOT, which performs real-time event-based multi-object tracking on
-
Increasing the Robustness of TERO-TRNGs Against Process Variation ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-07-27 Christian Skubich, Peter Reichel, Marc Reichenbach
The transition effect ring oscillator (TERO) is a popular design for building entropy sources because it is compact, built from digital elements only, and very well suited to FPGAs. However, it is known to be quite sensitive to process variation. Although such sensitivity is useful for building physical unclonable functions, it interferes with the application as an entropy source. In this article, we investigate
-
BLOOP: Boolean Satisfiability-based Optimized Loop Pipelining ACM Trans. Reconfig. Technol. Syst. (IF 2.3) Pub Date : 2023-07-27 Nicolai Fiege, Peter Zipf
Modulo scheduling is the premier technique for throughput maximization of loops in high-level synthesis by interleaving consecutive loop iterations. The number of clock cycles between data insertions is called the initiation interval (II). For throughput maximization, this value should be as low as possible; therefore, its minimization is the main optimization goal. Despite its long historical existence
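The II lower bound that drives this minimization is standard in modulo scheduling: the resource-constrained minimum II is the per-resource-type ratio of operations to units, and the recurrence-constrained minimum II comes from dependence cycles. A minimal sketch of these generic bounds (the textbook MII computation, not BLOOP's SAT-based formulation):

```python
from math import ceil

def min_ii(op_counts, unit_counts, recurrences):
    """Lower bound on the initiation interval (II) of a modulo schedule.
    op_counts/unit_counts: operations and available units per resource type.
    recurrences: (total latency, dependence distance) per dependence cycle."""
    # ResMII: each resource type can start unit_counts[r] ops per cycle.
    res_mii = max(ceil(op_counts[r] / unit_counts[r]) for r in op_counts)
    # RecMII: a cycle of latency L carried over d iterations forces II >= L/d.
    rec_mii = max((ceil(lat / dist) for lat, dist in recurrences), default=1)
    return max(res_mii, rec_mii)

# 6 multiplies on 2 multipliers -> ResMII = 3; a dependence cycle with
# latency 4 carried over a distance of 2 iterations -> RecMII = 2; MII = 3.
print(min_ii({"mul": 6}, {"mul": 2}, [(4, 2)]))  # -> 3
```

An exact scheduler then searches for a feasible schedule at this II and increases it only when no such schedule exists, which is the optimization problem the abstract describes.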