-
Combining Power and Arithmetic Optimization via Datapath Rewriting arXiv.cs.AR Pub Date : 2024-04-18 Samuel Coward, Theo Drane, Emiliano Morini, George Constantinides
Industrial datapath designers consider dynamic power consumption to be a key metric. Arithmetic circuits contribute a major component of total chip power consumption and are therefore a common target for power optimization. While arithmetic circuit area and dynamic power consumption are often correlated, there is also a tradeoff to consider, as additional gates can be added to explicitly reduce arithmetic
-
Switchable Single/Dual Edge Registers for Pipeline Architecture arXiv.cs.AR Pub Date : 2024-04-18 Suyash Vardhan Singh, Rakeshkumar Mahto
The demand for low-power processing is increasing due to mobile and portable devices. In a processor unit, an adder is an important building block since it is used in Floating Point Units (FPUs) and Arithmetic Logic Units (ALUs). Pipeline techniques are also used extensively to improve the throughput of the processing unit. Implementing a pipeline requires adding a register at each sub-stage that result
-
EN-TensorCore: Advancing TensorCores Performance through Encoder-Based Methodology arXiv.cs.AR Pub Date : 2024-04-18 Qizhe Wu, Yuchen Gui, Zhichen Zeng, Xiaotian Wang, Huawen Liang, Xi Jin
Tensor computations, with matrix multiplication being the primary operation, serve as the fundamental basis for data analysis, physics, machine learning, and deep learning. As the scale and complexity of data continue to grow rapidly, the demand for tensor computations has also increased significantly. To meet this demand, several research institutions have started developing dedicated hardware for
-
Cicero: Addressing Algorithmic and Architectural Bottlenecks in Neural Rendering by Radiance Warping and Memory Optimizations arXiv.cs.AR Pub Date : 2024-04-18 Yu Feng, Zihan Liu, Jingwen Leng, Minyi Guo, Yuhao Zhu
Neural Radiance Field (NeRF) is widely seen as an alternative to traditional physically-based rendering. However, NeRF has not yet seen its adoption in resource-limited mobile systems such as Virtual and Augmented Reality (VR/AR), because it is simply extremely slow. On a mobile Volta GPU, even the state-of-the-art NeRF models generally execute only at 0.8 FPS. We show that the main performance bottlenecks
-
Functionality Locality, Mixture & Control = Logic = Memory arXiv.cs.AR Pub Date : 2024-04-17 Xiangjun Peng
This work contributes new insights and constructs to the field of computer architecture and systems, and these insights are expected to be useful for the broad software stack. First, this work introduces Functionality Locality: functionalities can be changed with a single piece of information, by solely changing the access order. This broadens the scope of
-
Real Time Evolvable Hardware for Optimal Reconfiguration of Cusp-Like Pulse Shapers arXiv.cs.AR Pub Date : 2024-04-17 Juan Lanchares, Oscar Garnica, José L. Risco-Martín, J. Ignacio Hidalgo, J. Manuel Colmenar, Alfredo Cuesta
The design of a cusp-like digital pulse shaper for particle energy measurements requires the definition of four parameters whose values are defined based on the nature of the shaper input signal (timing, noise, ...) provided by a sensor. However, after high doses of radiation, sensors degenerate and their output signals do not meet the original characteristics, which may lead to erroneous measurements
-
Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs arXiv.cs.AR Pub Date : 2024-04-17 Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora
FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10 NX
-
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access arXiv.cs.AR Pub Date : 2024-04-17 Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang
The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree
-
A Dataset for Large Language Model-Driven AI Accelerator Generation arXiv.cs.AR Pub Date : 2024-04-16 Mahmoud Nazzal, Deepak Vungarala, Mehrdad Morsali, Chao Zhang, Arnob Ghosh, Abdallah Khreishah, Shaahin Angizi
In the ever-evolving landscape of Deep Neural Networks (DNN) hardware acceleration, unlocking the true potential of systolic array accelerators has long been hindered by the daunting challenges of expertise and time investment. Large Language Models (LLMs) offer a promising solution for automating code generation which is key to unlocking unprecedented efficiency and performance in various domains
-
Amplifying Main Memory-Based Timing Covert and Side Channels using Processing-in-Memory Operations arXiv.cs.AR Pub Date : 2024-04-17 Konstantinos Kanellopoulos, F. Nisa Bostanci, Ataberk Olgun, A. Giray Yaglikci, Ismail Emir Yuksel, Nika Mansouri Ghiasi, Zulal Bingol, Mohammad Sadrosadati, Onur Mutlu
The adoption of processing-in-memory (PiM) architectures has been gaining momentum because they provide high performance and low energy consumption by alleviating the data movement bottleneck. Yet, the security of such architectures has not been thoroughly explored. The adoption of PiM solutions provides a new way to directly access main memory, which can be potentially exploited by malicious user
-
From a Lossless (~1.5:1) Compression Algorithm for Llama2 7B Weights to Variable Precision, Variable Range, Compressed Numeric Data Types for CNNs and LLMs arXiv.cs.AR Pub Date : 2024-04-16 Vincenzo Liguori
This paper starts with a simple lossless ~1.5:1 compression algorithm for the weights of the Large Language Model (LLM) Llama2 7B [1] that can be implemented in ~200 LUTs in AMD FPGAs, processing over 800 million bfloat16 numbers per second. This framework is then extended to variable precision, variable range, compressed numerical data types that are a user-defined superset of both floats and posits
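The paper's exact algorithm isn't reproduced in this snippet, but the intuition behind why bfloat16 weights compress losslessly can be sketched: the 8 exponent bits of roughly Gaussian-distributed weights cluster in a narrow band, so their zeroth-order entropy is well below 8 bits. The weight distribution and byte-level entropy measurement below are illustrative assumptions, not the paper's scheme.

```python
# Measure the byte-level entropy of synthetic bfloat16 weights. bfloat16 is
# the upper 16 bits of an IEEE-754 float32: b[0] = sign + top exponent bits,
# b[1] = exponent LSB + top 7 mantissa bits. Gaussian weights concentrate
# the exponent byte, so its entropy is far below 8 bits.
import math, random, struct

def bfloat16_bytes(x):
    """Upper two bytes of the IEEE-754 float32 encoding of x."""
    b = struct.pack(">f", x)
    return b[0], b[1]

def entropy_bits(counts, total):
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]  # assumed distribution

hi, lo = {}, {}
for w in weights:
    h, l = bfloat16_bytes(w)
    hi[h] = hi.get(h, 0) + 1
    lo[l] = lo.get(l, 0) + 1

n = len(weights)
h_hi, h_lo = entropy_bits(hi, n), entropy_bits(lo, n)
print(f"entropy: {h_hi:.2f} + {h_lo:.2f} = {h_hi + h_lo:.2f} bits of 16, "
      f"so an entropy coder could reach up to {16 / (h_hi + h_lo):.2f}:1 losslessly")
```

The mantissa-heavy byte stays close to 8 bits of entropy, so most of the achievable ratio comes from the exponent side, consistent with a modest ~1.5:1 bound.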
-
AERO: Adaptive Erase Operation for Improving Lifetime and Performance of Modern NAND Flash-Based SSDs arXiv.cs.AR Pub Date : 2024-04-16 Sungjun Cho, Beomjun Kim, Hyunuk Cho, Gyeongseob Seo, Onur Mutlu, Myungsuk Kim, Jisung Park
This work investigates a new erase scheme in NAND flash memory to improve the lifetime and performance of modern solid-state drives (SSDs). In NAND flash memory, an erase operation applies a high voltage (e.g., > 20 V) to flash cells for a long time (e.g., > 3.5 ms), which degrades cell endurance and potentially delays user I/O requests. While a large body of prior work has proposed various techniques
-
Field-Programmable Gate Array Architecture for Deep Learning: Survey & Future Directions arXiv.cs.AR Pub Date : 2024-04-15 Andrew Boutros, Aman Arora, Vaughn Betz
Deep learning (DL) is becoming the cornerstone of numerous applications both in datacenters and at the edge. Specialized hardware is often necessary to meet the performance requirements of state-of-the-art DL models, but the rapid pace of change in DL models and the wide variety of systems integrating DL make it impossible to create custom computer chips for all but the largest markets. Field-programmable
-
Error Detection and Correction Codes for Safe In-Memory Computations arXiv.cs.AR Pub Date : 2024-04-15 Luca Parrini, Taha Soliman, Benjamin Hettwer, Jan Micha Borrmann, Simranjeet Singh, Ankit Bende, Vikas Rana, Farhad Merchant, Norbert Wehn
In-Memory Computing (IMC) introduces a new paradigm of computation that offers high efficiency in terms of latency and power consumption for AI accelerators. However, the non-idealities and defects of emerging technologies used in advanced IMC can severely degrade the accuracy of inferred Neural Networks (NN) and lead to malfunctions in safety-critical applications. In this paper, we investigate an
-
Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity arXiv.cs.AR Pub Date : 2024-04-15 Cenlin Duan, Jianlei Yang, Yiou Wang, Yikun Wang, Yingjie Qi, Xiaolin He, Bonan Yan, Xueyan Wang, Xiaotao Jia, Weisheng Zhao
Bit-level sparsity in neural network models harbors immense untapped potential. Eliminating redundant calculations of randomly distributed zero-bits significantly boosts computational efficiency. Yet, traditional digital SRAM-PIM architecture, limited by rigid crossbar architecture, struggles to effectively exploit this unstructured sparsity. To address this challenge, we propose Dyadic Block PIM (DB-PIM)
-
Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator arXiv.cs.AR Pub Date : 2024-04-14 Abhishek Tyagi, Reiley Jeyapaul, Chuteng Zhu, Paul Whatmough, Yuhao Zhu
As Neural Processing Units (NPUs) or accelerators are increasingly deployed in a variety of applications, including safety-critical ones such as autonomous vehicles and medical imaging, it is critical to understand the fault-tolerance characteristics of NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU used in embedded and IoT applications. We perform
-
Efficient and accurate neural field reconstruction using resistive memory arXiv.cs.AR Pub Date : 2024-04-15 Yifei Yu, Shaocong Wang, Woyu Zhang, Xinyuan Zhang, Xiuzhe Wu, Yangu He, Jichang Yang, Yue Zhang, Ning Lin, Bo Wang, Xi Chen, Songqi Wang, Xumeng Zhang, Xiaojuan Qi, Zhongrui Wang, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu
Human beings construct perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods
-
PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices arXiv.cs.AR Pub Date : 2024-04-13 Si Ung Noh, Junguk Hong, Chaemin Lim, Seongyeon Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee
Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory (PIM) by associating their memory banks with processing elements (PEs), allowing applications to overcome the data movement bottleneck by offloading memory-intensive operations to the PEs. Many highly parallel applications have been shown to benefit from these PIM-enabled DIMMs, but further speedup is often limited
-
Empowering Malware Detection Efficiency within Processing-in-Memory Architecture arXiv.cs.AR Pub Date : 2024-04-12 Sreenitha Kasarapu, Sathwika Bavikadi, Sai Manoj Pudukotai Dinakarrao
The widespread integration of embedded systems across various industries has facilitated seamless connectivity among devices and bolstered computational capabilities. Despite their extensive applications, embedded systems encounter significant security threats, with one of the most critical vulnerabilities being malicious software, commonly known as malware. In recent times, malware detection techniques
-
Modulo-$(2^{2n}+1)$ Arithmetic via Two Parallel n-bit Residue Channels arXiv.cs.AR Pub Date : 2024-04-12 Ghassem Jaberipur, Bardia Nadimi, Jeong-A Lee
Augmenting the balanced residue number system moduli-set $\{m_1=2^n,m_2=2^n-1,m_3=2^n+1\}$, with the co-prime modulus $m_4=2^{2n}+1$, increases the dynamic range (DR) by around 70%. The Mersenne form of the product $m_2 m_3 m_4=2^{4n}-1$, in the moduli-set $\{m_1,m_2,m_3,m_4\}$, leads to a very efficient reverse converter, based on the New Chinese remainder theorem. However, the double bit-width of the
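The stated dynamic-range claim can be checked directly: the base moduli contribute about $3n$ bits, and adding $m_4=2^{2n}+1$ contributes about $2n$ more, a roughly two-thirds increase. A quick sketch, verifying pairwise co-primality and the Mersenne identity from the abstract (the choice $n=8$ is just an example):

```python
from math import gcd, log2

def dynamic_range_bits(moduli):
    """Bit-width of the dynamic range, i.e. log2 of the product of the moduli."""
    prod = 1
    for m in moduli:
        prod *= m
    return log2(prod)

n = 8
m1, m2, m3 = 2**n, 2**n - 1, 2**n + 1
m4 = 2**(2 * n) + 1
moduli = [m1, m2, m3, m4]

# All moduli must be pairwise co-prime for the Chinese remainder theorem.
assert all(gcd(a, b) == 1 for i, a in enumerate(moduli) for b in moduli[i + 1:])

# The abstract's Mersenne-form identity: m2 * m3 * m4 = 2^(4n) - 1.
assert m2 * m3 * m4 == 2**(4 * n) - 1

base = dynamic_range_bits([m1, m2, m3])   # ~3n bits
ext = dynamic_range_bits(moduli)          # ~5n bits
print(f"DR grows from {base:.1f} to {ext:.1f} bits "
      f"(+{100 * (ext - base) / base:.0f}%)")
```

The growth from ~$3n$ to ~$5n$ bits is a ~67% increase, matching the abstract's "around 70%".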
-
LeapFrog: The Rowhammer Instruction Skip Attack arXiv.cs.AR Pub Date : 2024-04-11 Andrew Adiletta, Caner Tol, Berk Sunar
Since its inception, Rowhammer exploits have rapidly evolved into increasingly sophisticated threats not only compromising data integrity but also the control flow integrity of victim processes. Nevertheless, it remains a challenge for an attacker to identify vulnerable targets (i.e., Rowhammer gadgets), understand the outcome of the attempted fault, and formulate an attack that yields useful results
-
Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design arXiv.cs.AR Pub Date : 2024-04-09 Tanner Andrulis, Ruicong Chen, Hae-Seung Lee, Joel S. Emer, Vivienne Sze
Analog Compute-in-Memory (CiM) accelerators use analog-digital converters (ADCs) to read the analog values that they compute. ADCs can consume significant energy and area, so architecture-level ADC decisions such as ADC resolution or number of ADCs can significantly impact overall CiM accelerator energy and area. Therefore, modeling how architecture-level decisions affect ADC energy and area is critical
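One common first-order way to see why ADC resolution dominates these trade-offs is a Walden-style figure-of-merit model, where energy per conversion grows as $2^{\text{bits}}$. This is a generic textbook model, not the paper's; the FoM constant below is a placeholder.

```python
# First-order ADC cost model: E = FoM * 2^bits per conversion. The FoM value
# is an assumed placeholder, not a number from the paper.
FOM_J_PER_CONV_STEP = 50e-15   # 50 fJ per conversion-step (assumed)

def adc_energy_per_sample(bits):
    """Energy for one conversion under E = FoM * 2^bits."""
    return FOM_J_PER_CONV_STEP * (2 ** bits)

def array_adc_energy(bits, num_adcs, samples_per_adc):
    """Total ADC energy for one CiM array readout."""
    return num_adcs * samples_per_adc * adc_energy_per_sample(bits)

# Doubling resolution from 4b to 8b multiplies per-sample energy by 16x,
# which is why architecture-level choices of resolution and ADC count matter.
e4 = adc_energy_per_sample(4)
e8 = adc_energy_per_sample(8)
print(f"4b: {e4:.2e} J, 8b: {e8:.2e} J, ratio {e8 / e4:.0f}x")
```

Under such a model, halving resolution while doubling the number of ADCs still cuts total readout energy sharply, which is the kind of architecture-level decision the paper sets out to model accurately.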
-
WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads arXiv.cs.AR Pub Date : 2024-04-09 Diya Joseph, Juan Luis Aragón, Joan-Manuel Parcerisa, Antonio Gonzalez
Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in Tile-Based Rendering (TBR) GPUs, where core occupancy remains low for extended durations. To address this challenge, we introduce WaSP, a lightweight warp scheduler
-
Design and implementation of a synchronous Hardware Performance Monitor for a RISC-V space-oriented processor arXiv.cs.AR Pub Date : 2024-04-08 Miguel Jiménez Arribas, Agustín Martínez Hellín, Manuel Prieto Mateo, Iván Gamino del Río, Andrea Fernandez Gallego, Oscar Rodríguez Polo, Antonio da Silva, Pablo Parra, Sebastián Sánchez
The ability to collect statistics about the execution of a program within a CPU is of the utmost importance across all fields of computing since it allows characterizing the timing performance of a program. This capability is even more relevant in safety-critical software systems, where it is mandatory to analyze software timing requirements to ensure the correct operation of the programs. Moreover
-
SRAM-PG: Power Delivery Network Benchmarks from SRAM Circuits arXiv.cs.AR Pub Date : 2024-04-08 Shan Shen, Zhiqiang Liu, Wenjian Yu
Designing the power delivery network (PDN) in very large-scale integrated (VLSI) circuits is increasingly important, especially for today's low-power integrated circuit (IC) design. In order to ensure that the designed PDN enables the low level of voltage drop and noise required for the success of IC design, accurate PDN analysis is in great demand and brings a challenge of computation
-
GDR-HGNN: A Heterogeneous Graph Neural Networks Accelerator Frontend with Graph Decoupling and Recoupling arXiv.cs.AR Pub Date : 2024-04-07 Runzhen Xue, Mingyu Yan, Dengke Han, Yihan Teng, Zhimin Tang, Xiaochun Ye, Dongrui Fan
Heterogeneous Graph Neural Networks (HGNNs) have broadened the applicability of graph representation learning to heterogeneous graphs. However, the irregular memory access pattern of HGNNs leads to the buffer thrashing issue in HGNN accelerators. In this work, we identify an opportunity to address buffer thrashing in HGNN acceleration through an analysis of the topology of heterogeneous graphs. To
-
Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference arXiv.cs.AR Pub Date : 2024-04-06 Mingxuan He, Mithuna Thottethodi, T. N. Vijaykumar
Emerging machine learning (ML) models (e.g., transformers) involve memory pin bandwidth-bound matrix-vector (MV) computation in inference. By avoiding pin crossings, processing in memory (PIM) can improve performance and energy for pin-bound workloads, as evidenced by recent commercial efforts in (digital) PIM. Sparse models can improve performance and energy of inference without losing much accuracy
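The saving that sparse inference offers can be illustrated with a generic compressed-sparse-row (CSR) matrix-vector multiply: only nonzero weights are stored and touched, which is the arithmetic-and-bandwidth reduction sparse PIM designs aim to exploit. This sketch is a standard CSR kernel, not ESPIM's dataflow.

```python
# Generic CSR sparse matrix-vector multiply: work scales with the number of
# nonzeros, not with the dense matrix size.
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[0, 2, 0],
#      [1, 0, 3],
#      [0, 0, 0]]  -> 3 stored nonzeros instead of 9 dense elements
values = [2, 1, 3]
col_idx = [1, 0, 2]
row_ptr = [0, 1, 3, 3]
print(csr_spmv(values, col_idx, row_ptr, [1, 1, 1]))   # [2, 4, 0]
```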
-
VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA arXiv.cs.AR Pub Date : 2024-04-06 Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, Carl Busart
Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive training
-
QED: Scalable Verification of Hardware Memory Consistency arXiv.cs.AR Pub Date : 2024-04-03 Gokulan Ravi, Xiaokang Qiu, Mithuna Thottethodi, T. N. Vijaykumar
Memory consistency model (MCM) issues in out-of-order-issue microprocessor-based shared-memory systems are notoriously non-intuitive and a source of hardware design bugs. Prior hardware verification work is limited to in-order-issue processors, to proving the correctness only of some test cases, or to bounded verification that does not scale in practice beyond 7 instructions across all threads. Because
-
Spin-NeuroMem: A Low-Power Neuromorphic Associative Memory Design Based on Spintronic Devices arXiv.cs.AR Pub Date : 2024-04-03 Siqing Fu, Tiejun Li, Chunyuan Zhang, Sheng Ma, Jianmin Zhang, Lizhou Wu
Biologically-inspired computing models have made significant progress in recent years, but the conventional von Neumann architecture is inefficient for the large-scale matrix operations and massive parallelism required by these models. This paper presents Spin-NeuroMem, a low-power circuit design of Hopfield network for the function of associative memory. Spin-NeuroMem is equipped with energy-efficient
-
NetSmith: An Optimization Framework for Machine-Discovered Network Topologies arXiv.cs.AR Pub Date : 2024-04-02 Conor Green, Mithuna Thottethodi
Over the past few decades, network topology design for general purpose, shared memory multicores has been primarily driven by human experts who use their insights to arrive at network designs that balance the competing goals of performance requirements (e.g., latency, bandwidth) and cost constraints (e.g., router radix, router counts). On the other hand, there have been automatic NoC synthesis methods
-
A Fully-Configurable Open-Source Software-Defined Digital Quantized Spiking Neural Core Architecture arXiv.cs.AR Pub Date : 2024-04-02 Shadi Matinizadeh, Noah Pacik-Nelson, Ioannis Polykretis, Krupa Tishbi, Suman Kumar, M. L. Varshika, Arghavan Mohammadhassani, Abhishek Mishra, Nagarajan Kandasamy, James Shackleford, Eric Gallo, Anup Das
We introduce QUANTISENC, a fully configurable open-source software-defined digital quantized spiking neural core architecture to advance research in neuromorphic computing. QUANTISENC is designed hierarchically using a bottom-up methodology with multiple neurons in each layer and multiple layers in each core. The number of layers and neurons per layer can be configured via software in a top-down methodology
-
Optimizing Offload Performance in Heterogeneous MPSoCs arXiv.cs.AR Pub Date : 2024-04-02 Luca Colagrande, Luca Benini
Heterogeneous multi-core architectures combine a few "host" cores, optimized for single-thread performance, with many small energy-efficient "accelerator" cores for data-parallel processing, on a single chip. Offloading a computation to the many-core acceleration fabric introduces a communication and synchronization cost which reduces the speedup attainable on the accelerator, particularly for small
-
Analyzing the Single Event Upset Vulnerability of Binarized Neural Networks on SRAM FPGAs arXiv.cs.AR Pub Date : 2024-04-02 Ioanna Souvatzoglou, Athanasios Papadimitriou, Aitzan Sari, Vasileios Vlagkoulis, Mihalis Psarakis
Neural Networks (NNs) are increasingly used in the last decade in several demanding applications, such as object detection and classification, autonomous driving, etc. Among different computing platforms for implementing NNs, FPGAs have multiple advantages due to design flexibility and high performance-to-watt ratio. Moreover, approximation techniques, such as quantization, have been introduced, which
-
Block-SSD: A New Block-Based Blocking SSD Architecture arXiv.cs.AR Pub Date : 2024-04-03 Ryan Wong, Arjun Tyagi, Sungjun Cho, Pratik Sampat, Yiqiu Sun
Computer science and related fields (e.g., computer engineering, computer hardware engineering, electrical engineering, electrical and computer engineering, computer systems engineering) often draw inspiration from other fields, areas, and the real world in order to describe topics in their area. One cross-domain example is the idea of a block. The idea of blocks comes in many flavors, including software
-
CAMO: Correlation-Aware Mask Optimization with Modulated Reinforcement Learning arXiv.cs.AR Pub Date : 2024-04-01 Xiaoxiao Liang, Haoyu Yang, Kang Liu, Bei Yu, Yuzhe Ma
Optical proximity correction (OPC) is a vital step to ensure printability in modern VLSI manufacturing. Various OPC approaches based on machine learning have been proposed to pursue performance and efficiency, which are typically data-driven and hardly involve any particular considerations of the OPC problem, leading to potential performance or efficiency bottlenecks. In this paper, we propose CAMO
-
Memristor-Based Lightweight Encryption arXiv.cs.AR Pub Date : 2024-03-29 Muhammad Ali Siddiqi, Jan Andrés Galvan Hernández, Anteneh Gebregiorgis, Rajendra Bishnoi, Christos Strydis, Said Hamdioui, Mottaqiallah Taouil
Next-generation personalized healthcare devices are undergoing extreme miniaturization in order to improve user acceptability. However, such developments make it difficult to incorporate cryptographic primitives using available target technologies since these algorithms are notorious for their energy consumption. Besides, strengthening these schemes against side-channel attacks further adds to the
-
Hot-LEGO: Architect Microfluidic Cooling Equipped 3DICs with Pre-RTL Thermal Simulation arXiv.cs.AR Pub Date : 2024-03-29 Runxi Wang, Jun-Han Han, Mircea Stan, Xinfei Guo
Microfluidic cooling has been recognized as one of the most promising solutions to achieve efficient thermal management for three-dimensional integrated circuits (3DICs). It enables more opportunities to architect 3DICs with different die configurations. It becomes increasingly important to perform thermal analysis in the early design phases to validate the architectural design decisions. This is even
-
Dataflow-Aware PIM-Enabled Manycore Architecture for Deep Learning Workloads arXiv.cs.AR Pub Date : 2024-03-28 Harsh Sharma, Gaurav Narang, Janardhan Rao Doppa, Umit Ogras, Partha Pratim Pande
Processing-in-memory (PIM) has emerged as an enabler for the energy-efficient and high-performance acceleration of deep learning (DL) workloads. Resistive random-access memory (ReRAM) is one of the most promising technologies to implement PIM. However, as the complexity of Deep convolutional neural networks (DNNs) grows, we need to design a manycore architecture with multiple ReRAM-based processing
-
Toward CXL-Native Memory Tiering via Device-Side Profiling arXiv.cs.AR Pub Date : 2024-03-27 Zhe Zhou, Yiqi Chen, Tao Zhang, Yang Wang, Ran Shu, Shuotao Xu, Peng Cheng, Lei Qu, Yongqiang Xiong, Guangyu Sun
The Compute Express Link (CXL) interconnect has provided the ability to integrate diverse memory types into servers via byte-addressable SerDes links. Harnessing the full potential of such heterogeneous memory systems requires efficient memory tiering. However, existing research in this domain has been constrained by low-resolution and high-overhead memory access profiling techniques. To address this
-
Annotating Slack Directly on Your Verilog: Fine-Grained RTL Timing Evaluation for Early Optimization arXiv.cs.AR Pub Date : 2024-03-27 Wenji Fang, Shang Liu, Hongce Zhang, Zhiyao Xie
In digital IC design, compared with post-synthesis netlists or layouts, the early register-transfer level (RTL) stage offers greater optimization flexibility for both designers and EDA tools. However, timing information is typically unavailable at this early stage. Some recent machine learning (ML) solutions propose to predict the total negative slack (TNS) and worst negative slack (WNS) of an entire
-
Merits of Time-Domain Computing for VMM -- A Quantitative Comparison arXiv.cs.AR Pub Date : 2024-03-27 Florian Freye, Jie Lou, Christian Lanius, Tobias Gemmeke
Vector-matrix-multiplication (VMM) accelerators have gained a lot of traction, especially due to the rise of convolutional neural networks (CNNs) and the desire to compute them on the edge. Besides the classical digital approach, analog computing has gone through a renaissance to push energy efficiency further. A more recent approach is called time-domain (TD) computing. In contrast to analog computing
-
Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL arXiv.cs.AR Pub Date : 2024-03-27 Marius Meyer, Tobias Kenter, Lucian Petrica, Kenneth O'Brien, Michaela Blott, Christian Pessl
Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off
-
SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation arXiv.cs.AR Pub Date : 2024-03-25 Guoliang He, Eiko Yoneki
Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compilergenerated ones, so that hardware resources are as fully utilized
-
Partially-Precise Computing Paradigm for Efficient Hardware Implementation of Application-Specific Embedded Systems arXiv.cs.AR Pub Date : 2024-03-25 Mohsen Faryabi, Amir Hossein Moradi
Nowadays, the number of emerging embedded systems is rapidly growing in many application domains, due to recent advances in artificial intelligence and the internet of things. The main inherent characteristic of these application-specific systems is that they are not general-purpose in nature and are basically developed to perform only a particular task, and therefore deal only with a limited and predefined range of
-
Thermal Analysis for NVIDIA GTX480 Fermi GPU Architecture arXiv.cs.AR Pub Date : 2024-03-24 Savinay Nagendra
In this project, we design a four-layer (Silicon|TIM|Silicon|TIM) 3D floor plan for the NVIDIA GTX480 Fermi GPU architecture and compare heat dissipation and power trends for matrix multiplication and Needleman-Wunsch kernels. First, CUDA kernels for the two algorithms are written. These kernels are compiled and executed with the GPGPU Simulator to extract power logs for varying tensor sizes. These power
-
Navigating the Landscape of Distributed File Systems: Architectures, Implementations, and Considerations arXiv.cs.AR Pub Date : 2024-03-23 Xueting Pan, Ziqian Luo, Lisang Zhou
Distributed File Systems (DFS) have emerged as sophisticated solutions for efficient file storage and management across interconnected computer nodes. The main objective of DFS is to achieve flexible, scalable, and resilient file storage management by dispersing file data across multiple interconnected computer nodes, enabling users to seamlessly access and manipulate files distributed across diverse
-
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering arXiv.cs.AR Pub Date : 2024-03-22 Alexandre Valentin Jamet, Georgios Vavouliotis, Daniel A. Jiménez, Lluc Alvarez, Marc Casas
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named
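The building block behind such predictors — a hashed perceptron that sums per-feature weights and trains on a margin — can be sketched in a few lines. This is a generic hashed-perceptron predictor, not TLP itself; the features, table size, and threshold below are illustrative assumptions.

```python
# Toy hashed-perceptron predictor for "will this access go off-chip?":
# each feature hashes into its own weight table, the signed sum gives the
# prediction, and training nudges weights when the prediction is wrong or weak.
TABLE_SIZE = 256
THRESHOLD = 8          # keep training until |sum| exceeds this margin
WEIGHT_MAX = 31        # saturating weights

class PerceptronPredictor:
    def __init__(self, num_features=3):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]

    def _indices(self, pc, addr):
        # Illustrative features: PC, cache-line address, and their XOR.
        feats = [pc, addr >> 6, pc ^ (addr >> 6)]
        return [f % TABLE_SIZE for f in feats]

    def predict(self, pc, addr):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr)))
        return s >= 0, s          # (predicted off-chip?, confidence)

    def train(self, pc, addr, went_off_chip):
        pred, s = self.predict(pc, addr)
        if pred != went_off_chip or abs(s) <= THRESHOLD:
            delta = 1 if went_off_chip else -1
            for t, i in zip(self.tables, self._indices(pc, addr)):
                t[i] = max(-WEIGHT_MAX, min(WEIGHT_MAX, t[i] + delta))

p = PerceptronPredictor()
# In this toy trace, accesses from pc=0x40 to a high address always miss on-chip.
for _ in range(20):
    p.train(0x40, 0xDEAD0000, True)
print(p.predict(0x40, 0xDEAD0000)[0])   # expect True after training
```

TLP chains two such predictors — one for the off-chip decision, one to filter prefetches — but the training loop above is the shared mechanism.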
-
Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems arXiv.cs.AR Pub Date : 2024-03-22 Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang
The advent of Transformers has revolutionized computer vision, offering a powerful alternative to convolutional neural networks (CNNs), especially with the local attention mechanism that excels at capturing local structures within the input and achieves state-of-the-art performance. Processing in-memory (PIM) architecture offers extensive parallelism, low data movement costs, and scalable memory bandwidth
-
Beehive: A Flexible Network Stack for Direct-Attached Accelerators arXiv.cs.AR Pub Date : 2024-03-21 Katie Lim, Matthew Giordano, Theano Stavrinos, Baris Kasikci, Tom Anderson
Accelerators have become increasingly popular in datacenters due to their cost, performance, and energy benefits. Direct-attached accelerators, where the network stack is implemented in hardware and network traffic bypasses the main CPU, can further enhance these benefits. However, modern datacenter software network stacks are complex, with interleaved protocol layers, network management functions
-
Cross-layer Modeling and Design of Content Addressable Memories in Advanced Technology Nodes for Similarity Search arXiv.cs.AR Pub Date : 2024-03-22 Siri Narla, Piyush Kumar, Mohammad Adnaan, Azad Naeemi
In this paper we present a comprehensive design and benchmarking study of Content Addressable Memory (CAM) at the 7nm technology node in the context of similarity search applications. We design CAM cells based on SRAM, spin-orbit torque, and ferroelectric field effect transistor devices and from their layouts extract cell parasitics using state of the art EDA tools. These parasitics are used to develop
-
Accelerating ViT Inference on FPGA through Static and Dynamic Pruning arXiv.cs.AR Pub Date : 2024-03-21 Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna
Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamically
-
HCiM: ADC-Less Hybrid Analog-Digital Compute in Memory Accelerator for Deep Learning Workloads arXiv.cs.AR Pub Date : 2024-03-20 Shubham Negi, Utkarsh Saxena, Deepika Sharma, Kaushik Roy
Analog Compute-in-Memory (CiM) accelerators are increasingly recognized for their efficiency in accelerating Deep Neural Networks (DNN). However, their dependence on Analog-to-Digital Converters (ADCs) for accumulating partial sums from crossbars leads to substantial power and area overhead. Moreover, the high area overhead of ADCs constrains the throughput due to the limited number of ADCs that can
-
System Support for Environmentally Sustainable Computing in Data Centers arXiv.cs.AR Pub Date : 2024-03-19 Fan Chen
Modern data centers suffer from a growing carbon footprint due to insufficient support for environmental sustainability. While hardware accelerators and renewable energy have been utilized to enhance sustainability, addressing Quality of Service (QoS) degradation caused by renewable energy supply and hardware recycling remains challenging: (1) prior accelerators exhibit significant carbon footprints
-
Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic arXiv.cs.AR Pub Date : 2024-03-18 Daniel Gerlinghoff, Benjamin Chen Ming Choong, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo
Recent advancements in neural network quantisation have yielded remarkable outcomes, with three-bit networks reaching state-of-the-art full-precision accuracy on complex tasks. These achievements present valuable opportunities for accelerating neural networks by computing in reduced precision. Implementing such reduced-precision networks on FPGAs can take advantage of bit-level reconfigurability, which is not available on conventional
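The table-lookup idea behind the title can be sketched as follows: with 3-bit weights and 3-bit activations there are only 8 × 8 = 64 possible products, so each multiply can be replaced by a precomputed lookup (on an FPGA, a small LUT-ROM per lane instead of a DSP multiplier). This is a hedged software illustration of the principle; the names and structure are ours, not the paper's.

```python
# 64-entry product table for 3-bit unsigned operands: a multiply becomes
# an array index, which maps naturally onto FPGA soft-logic LUTs.
PROD_LUT = [[w * a for a in range(8)] for w in range(8)]

def lut_mac(weights, activations):
    """Multiply-accumulate via table lookup instead of hardware multipliers."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += PROD_LUT[w][a]
    return acc
```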
-
DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing arXiv.cs.AR Pub Date : 2024-03-16 Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He
Multi-scale deformable attention (MSDeformAttn) has emerged as a key mechanism in various vision tasks, demonstrating clear advantages attributable to multi-scale grid-sampling. However, this newly introduced operator incurs irregular data access and an enormous memory requirement, leading to severe processing-element (PE) underutilization. Meanwhile, existing approaches for attention acceleration cannot be directly applied
-
Strict Partitioning for Sporadic Rigid Gang Tasks arXiv.cs.AR Pub Date : 2024-03-15 Binqi Sun, Tomasz Kloda, Marco Caccamo
The rigid gang task model is based on the idea of executing multiple threads simultaneously on a fixed number of processors to increase efficiency and performance. Although there is extensive literature on global rigid gang scheduling, partitioned approaches have several practical advantages (e.g., task isolation and reduced scheduling overheads). In this paper, we propose a new partitioned scheduling
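A partitioned approach to rigid gang scheduling can be sketched with a simple first-fit heuristic (an assumed baseline, not the paper's strict-partitioning algorithm): each task requires `cores` processors simultaneously with per-core utilization `util`, and it fits on a cluster only if enough cores have spare utilization.

```python
def first_fit_partition(tasks, clusters):
    """tasks: list of (cores, util); clusters: list of core counts.
    Returns a per-task cluster assignment, or None if some task does not fit."""
    load = [[0.0] * c for c in clusters]   # per-core utilization, per cluster
    assign = []
    for cores, util in tasks:
        for ci, cluster in enumerate(load):
            # choose the `cores` least-loaded cores of this cluster
            cand = sorted(range(len(cluster)), key=lambda i: cluster[i])[:cores]
            if len(cand) == cores and all(cluster[i] + util <= 1.0 for i in cand):
                for i in cand:
                    cluster[i] += util
                assign.append(ci)
                break
        else:
            return None   # task fits on no cluster
    return assign
```

Partitioning like this trades some schedulability for the task isolation and low scheduling overhead the abstract cites as the advantages of partitioned approaches.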
-
Taiyi: A high-performance CKKS accelerator for Practical Fully Homomorphic Encryption arXiv.cs.AR Pub Date : 2024-03-15 Shengyu Fan, Xianglong Deng, Zhuoyu Tian, Zhicheng Hu, Liang Chang, Rui Hou, Dan Meng, Mingzhe Zhang
Fully Homomorphic Encryption (FHE), a cryptographic technique enabling computation directly on encrypted data, offers significant security benefits but is hampered by substantial performance overhead. In recent years, a series of accelerator designs have significantly enhanced the performance of FHE applications, bringing them closer to real-world applicability. However, these accelerators face
-
Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory arXiv.cs.AR Pub Date : 2024-03-14 Jeongmin Hong, Sungjun Cho, Geonwoo Park, Wonhyuk Yang, Young-Ho Gong, Gwangsun Kim
We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache needs to be carefully designed to address the latency and bandwidth
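The DRAM-cache arrangement can be illustrated with a minimal direct-mapped model: hot lines are served from DRAM, and a miss fetches the line from SCM and installs it. This sketch only shows the hit/miss mechanics; the paper's contribution concerns latency and bandwidth, which this model does not capture.

```python
class DramCache:
    """Direct-mapped DRAM cache in front of SCM (illustrative model)."""
    def __init__(self, num_sets: int):
        self.tags = [None] * num_sets
        self.hits = self.misses = 0

    def access(self, line_addr: int) -> bool:
        s = line_addr % len(self.tags)
        tag = line_addr // len(self.tags)
        if self.tags[s] == tag:
            self.hits += 1
            return True        # served from DRAM
        self.misses += 1       # fetched from SCM, installed in DRAM
        self.tags[s] = tag
        return False
```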
-
Analytical Heterogeneous Die-to-Die 3D Placement with Macros arXiv.cs.AR Pub Date : 2024-03-14 Yuxuan Zhao, Peiyu Liao, Siting Liu, Jiaxi Jiang, Yibo Lin, Bei Yu
This paper presents an innovative approach to 3D mixed-size placement in heterogeneous face-to-face (F2F) bonded 3D ICs. We propose an analytical framework that utilizes a dedicated density model and a bistratal wirelength model, effectively handling macros and standard cells in a 3D solution space. A novel 3D preconditioner is developed to resolve the topological and physical gap between macros and
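The wirelength objective at the heart of analytical placers is the standard half-perimeter wirelength (HPWL); a bistratal 3D variant would additionally account for which die each cell sits on, which this plain sketch does not model.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin positions:
    the half-perimeter of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))
```

A placer minimizes the sum of `hpwl` over all nets, typically via a smooth differentiable approximation rather than this non-smooth max/min form.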