-
Combining Power and Arithmetic Optimization via Datapath Rewriting arXiv.cs.AR Pub Date : 2024-04-18 Samuel Coward, Theo Drane, Emiliano Morini, George Constantinides
Industrial datapath designers consider dynamic power consumption to be a key metric. Arithmetic circuits contribute a major component of total chip power consumption and are therefore a common target for power optimization. While arithmetic circuit area and dynamic power consumption are often correlated, there is also a tradeoff to consider, as additional gates can be added to explicitly reduce arithmetic
-
Switchable Single/Dual Edge Registers for Pipeline Architecture arXiv.cs.AR Pub Date : 2024-04-18 Suyash Vardhan Singh, Rakeshkumar Mahto
The demand for low-power processing is increasing due to mobile and portable devices. In a processor unit, an adder is an important building block since it is used in Floating Point Units (FPUs) and Arithmetic Logic Units (ALUs). Pipeline techniques are also used extensively to improve the throughput of the processing unit. Implementing a pipeline requires adding a register at each sub-stage that result
-
EN-TensorCore: Advancing TensorCores Performance through Encoder-Based Methodology arXiv.cs.AR Pub Date : 2024-04-18 Qizhe Wu, Yuchen Gui, Zhichen Zeng, Xiaotian Wang, Huawen Liang, Xi Jin
Tensor computations, with matrix multiplication being the primary operation, serve as the fundamental basis for data analysis, physics, machine learning, and deep learning. As the scale and complexity of data continue to grow rapidly, the demand for tensor computations has also increased significantly. To meet this demand, several research institutions have started developing dedicated hardware for
-
Cicero: Addressing Algorithmic and Architectural Bottlenecks in Neural Rendering by Radiance Warping and Memory Optimizations arXiv.cs.AR Pub Date : 2024-04-18 Yu Feng, Zihan Liu, Jingwen Leng, Minyi Guo, Yuhao Zhu
Neural Radiance Field (NeRF) is widely seen as an alternative to traditional physically-based rendering. However, NeRF has not yet seen its adoption in resource-limited mobile systems such as Virtual and Augmented Reality (VR/AR), because it is simply extremely slow. On a mobile Volta GPU, even the state-of-the-art NeRF models generally execute only at 0.8 FPS. We show that the main performance bottlenecks
-
Functionality Locality, Mixture & Control = Logic = Memory arXiv.cs.AR Pub Date : 2024-04-17 Xiangjun Peng
This work contributes new insights and constructs to the field of computer architecture and systems, and these insights are expected to be useful for the broad software stack. First, this work introduces Functionality Locality: functionalities can be changed with a single piece of information, by solely changing the access order. This broadens the scope of
-
Real Time Evolvable Hardware for Optimal Reconfiguration of Cusp-Like Pulse Shapers arXiv.cs.AR Pub Date : 2024-04-17 Juan Lanchares, Oscar Garnica, José L. Risco-Martín, J. Ignacio Hidalgo, J. Manuel Colmenar, Alfredo Cuesta
The design of a cusp-like digital pulse shaper for particle energy measurements requires the definition of four parameters whose values are defined based on the nature of the shaper input signal (timing, noise, ...) provided by a sensor. However, after high doses of radiation, sensors degenerate and their output signals do not meet the original characteristics, which may lead to erroneous measurements
-
Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs arXiv.cs.AR Pub Date : 2024-04-17 Endri Taka, Dimitrios Gourounas, Andreas Gerstlauer, Diana Marculescu, Aman Arora
FPGAs are a promising platform for accelerating Deep Learning (DL) applications, due to their high performance, low power consumption, and reconfigurability. Recently, the leading FPGA vendors have enhanced their architectures to more efficiently support the computational demands of DL workloads. However, the two most prominent AI-optimized FPGAs, i.e., AMD/Xilinx Versal ACAP and Intel Stratix 10 NX
-
Asynchronous Memory Access Unit: Exploiting Massive Parallelism for Far Memory Access arXiv.cs.AR Pub Date : 2024-04-17 Luming Wang, Xu Zhang, Songyue Wang, Zhuolun Jiang, Tianyue Lu, Mingyu Chen, Siwei Luo, Keji Huang
The growing memory demands of modern applications have driven the adoption of far memory technologies in data centers to provide cost-effective, high-capacity memory solutions. However, far memory presents new performance challenges because its access latencies are significantly longer and more variable than local DRAM. For applications to achieve acceptable performance on far memory, a high degree
-
A Dataset for Large Language Model-Driven AI Accelerator Generation arXiv.cs.AR Pub Date : 2024-04-16 Mahmoud Nazzal, Deepak Vungarala, Mehrdad Morsali, Chao Zhang, Arnob Ghosh, Abdallah Khreishah, Shaahin Angizi
In the ever-evolving landscape of Deep Neural Networks (DNN) hardware acceleration, unlocking the true potential of systolic array accelerators has long been hindered by the daunting challenges of expertise and time investment. Large Language Models (LLMs) offer a promising solution for automating code generation which is key to unlocking unprecedented efficiency and performance in various domains
-
Amplifying Main Memory-Based Timing Covert and Side Channels using Processing-in-Memory Operations arXiv.cs.AR Pub Date : 2024-04-17 Konstantinos Kanellopoulos, F. Nisa Bostanci, Ataberk Olgun, A. Giray Yaglikci, Ismail Emir Yuksel, Nika Mansouri Ghiasi, Zulal Bingol, Mohammad Sadrosadati, Onur Mutlu
The adoption of processing-in-memory (PiM) architectures has been gaining momentum because they provide high performance and low energy consumption by alleviating the data movement bottleneck. Yet, the security of such architectures has not been thoroughly explored. The adoption of PiM solutions provides a new way to directly access main memory, which can be potentially exploited by malicious user
-
From a Lossless (~1.5:1) Compression Algorithm for Llama2 7B Weights to Variable Precision, Variable Range, Compressed Numeric Data Types for CNNs and LLMs arXiv.cs.AR Pub Date : 2024-04-16 Vincenzo Liguori
This paper starts with a simple lossless ~1.5:1 compression algorithm for the weights of the Large Language Model (LLM) Llama2 7B [1] that can be implemented in ~200 LUTs in AMD FPGAs, processing over 800 million bfloat16 numbers per second. This framework is then extended to variable precision, variable range, compressed numerical data types that are a user-defined superset of both floats and posits
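The paper's exact algorithm isn't reproduced in this snippet, but the intuition behind why bfloat16 weights compress losslessly can be sketched: the 8 exponent bits of roughly Gaussian-distributed weights cluster in a narrow band, so their zeroth-order entropy is well below 8 bits. The weight distribution and byte-level entropy measurement below are illustrative assumptions, not the paper's scheme.

```python
# Measure the byte-level entropy of synthetic bfloat16 weights. bfloat16 is
# the upper 16 bits of an IEEE-754 float32: b[0] = sign + top exponent bits,
# b[1] = exponent LSB + top 7 mantissa bits. Gaussian weights concentrate
# the exponent byte, so its entropy is far below 8 bits.
import math, random, struct

def bfloat16_bytes(x):
    """Upper two bytes of the IEEE-754 float32 encoding of x."""
    b = struct.pack(">f", x)
    return b[0], b[1]

def entropy_bits(counts, total):
    return -sum(c / total * math.log2(c / total) for c in counts.values() if c)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]  # assumed distribution

hi, lo = {}, {}
for w in weights:
    h, l = bfloat16_bytes(w)
    hi[h] = hi.get(h, 0) + 1
    lo[l] = lo.get(l, 0) + 1

n = len(weights)
h_hi, h_lo = entropy_bits(hi, n), entropy_bits(lo, n)
print(f"entropy: {h_hi:.2f} + {h_lo:.2f} = {h_hi + h_lo:.2f} bits of 16, "
      f"so an entropy coder could reach up to {16 / (h_hi + h_lo):.2f}:1 losslessly")
```

The mantissa-heavy byte stays close to 8 bits of entropy, so most of the achievable ratio comes from the exponent side, consistent with a modest ~1.5:1 bound.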
-
AERO: Adaptive Erase Operation for Improving Lifetime and Performance of Modern NAND Flash-Based SSDs arXiv.cs.AR Pub Date : 2024-04-16 Sungjun Cho, Beomjun Kim, Hyunuk Cho, Gyeongseob Seo, Onur Mutlu, Myungsuk Kim, Jisung Park
This work investigates a new erase scheme in NAND flash memory to improve the lifetime and performance of modern solid-state drives (SSDs). In NAND flash memory, an erase operation applies a high voltage (e.g., > 20 V) to flash cells for a long time (e.g., > 3.5 ms), which degrades cell endurance and potentially delays user I/O requests. While a large body of prior work has proposed various techniques
-
Field-Programmable Gate Array Architecture for Deep Learning: Survey & Future Directions arXiv.cs.AR Pub Date : 2024-04-15 Andrew Boutros, Aman Arora, Vaughn Betz
Deep learning (DL) is becoming the cornerstone of numerous applications both in datacenters and at the edge. Specialized hardware is often necessary to meet the performance requirements of state-of-the-art DL models, but the rapid pace of change in DL models and the wide variety of systems integrating DL make it impossible to create custom computer chips for all but the largest markets. Field-programmable
-
Error Detection and Correction Codes for Safe In-Memory Computations arXiv.cs.AR Pub Date : 2024-04-15 Luca Parrini, Taha Soliman, Benjamin Hettwer, Jan Micha Borrmann, Simranjeet Singh, Ankit Bende, Vikas Rana, Farhad Merchant, Norbert Wehn
In-Memory Computing (IMC) introduces a new paradigm of computation that offers high efficiency in terms of latency and power consumption for AI accelerators. However, the non-idealities and defects of emerging technologies used in advanced IMC can severely degrade the accuracy of inferred Neural Networks (NN) and lead to malfunctions in safety-critical applications. In this paper, we investigate an
-
Towards Efficient SRAM-PIM Architecture Design by Exploiting Unstructured Bit-Level Sparsity arXiv.cs.AR Pub Date : 2024-04-15 Cenlin Duan, Jianlei Yang, Yiou Wang, Yikun Wang, Yingjie Qi, Xiaolin He, Bonan Yan, Xueyan Wang, Xiaotao Jia, Weisheng Zhao
Bit-level sparsity in neural network models harbors immense untapped potential. Eliminating redundant calculations of randomly distributed zero-bits significantly boosts computational efficiency. Yet, traditional digital SRAM-PIM architecture, limited by rigid crossbar architecture, struggles to effectively exploit this unstructured sparsity. To address this challenge, we propose Dyadic Block PIM (DB-PIM)
-
Characterizing Soft-Error Resiliency in Arm's Ethos-U55 Embedded Machine Learning Accelerator arXiv.cs.AR Pub Date : 2024-04-14 Abhishek Tyagi, Reiley Jeyapaul, Chuteng Zhu, Paul Whatmough, Yuhao Zhu
As Neural Processing Units (NPUs) or accelerators are increasingly deployed in a variety of applications, including safety-critical ones such as autonomous vehicles and medical imaging, it is critical to understand the fault-tolerance characteristics of NPUs. We present a reliability study of Arm's Ethos-U55, an important industrial-scale NPU used in embedded and IoT applications. We perform
-
Efficient and accurate neural field reconstruction using resistive memory arXiv.cs.AR Pub Date : 2024-04-15 Yifei Yu, Shaocong Wang, Woyu Zhang, Xinyuan Zhang, Xiuzhe Wu, Yangu He, Jichang Yang, Yue Zhang, Ning Lin, Bo Wang, Xi Chen, Songqi Wang, Xumeng Zhang, Xiaojuan Qi, Zhongrui Wang, Dashan Shang, Qi Liu, Kwang-Ting Cheng, Ming Liu
Human beings construct perception of space by integrating sparse observations into massively interconnected synapses and neurons, offering superior parallelism and efficiency. Replicating this capability in AI finds wide applications in medical imaging, AR/VR, and embodied AI, where input data is often sparse and computing resources are limited. However, traditional signal reconstruction methods
-
PID-Comm: A Fast and Flexible Collective Communication Framework for Commodity Processing-in-DIMM Devices arXiv.cs.AR Pub Date : 2024-04-13 Si Ung Noh, Junguk Hong, Chaemin Lim, Seongyeon Park, Jeehyun Kim, Hanjun Kim, Youngsok Kim, Jinho Lee
Recent dual in-line memory modules (DIMMs) are starting to support processing-in-memory (PIM) by associating their memory banks with processing elements (PEs), allowing applications to overcome the data movement bottleneck by offloading memory-intensive operations to the PEs. Many highly parallel applications have been shown to benefit from these PIM-enabled DIMMs, but further speedup is often limited
-
Empowering Malware Detection Efficiency within Processing-in-Memory Architecture arXiv.cs.AR Pub Date : 2024-04-12 Sreenitha Kasarapu, Sathwika Bavikadi, Sai Manoj Pudukotai Dinakarrao
The widespread integration of embedded systems across various industries has facilitated seamless connectivity among devices and bolstered computational capabilities. Despite their extensive applications, embedded systems encounter significant security threats, with one of the most critical vulnerabilities being malicious software, commonly known as malware. In recent times, malware detection techniques
-
Modulo-$(2^{2n}+1)$ Arithmetic via Two Parallel n-bit Residue Channels arXiv.cs.AR Pub Date : 2024-04-12 Ghassem Jaberipur, Bardia Nadimi, Jeong-A Lee
Augmenting the balanced residue number system moduli-set $\{m_1=2^n,m_2=2^n-1,m_3=2^n+1\}$, with the co-prime modulus $m_4=2^{2n}+1$, increases the dynamic range (DR) by around 70%. The Mersenne form of the product $m_2 m_3 m_4=2^{4n}-1$, in the moduli-set $\{m_1,m_2,m_3,m_4\}$, leads to a very efficient reverse converter, based on the New Chinese remainder theorem. However, the double bit-width of the
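The stated dynamic-range claim can be checked directly: the base moduli contribute about $3n$ bits, and adding $m_4=2^{2n}+1$ contributes about $2n$ more, a roughly two-thirds increase. A quick sketch, verifying pairwise co-primality and the Mersenne identity from the abstract (the choice $n=8$ is just an example):

```python
from math import gcd, log2

def dynamic_range_bits(moduli):
    """Bit-width of the dynamic range, i.e. log2 of the product of the moduli."""
    prod = 1
    for m in moduli:
        prod *= m
    return log2(prod)

n = 8
m1, m2, m3 = 2**n, 2**n - 1, 2**n + 1
m4 = 2**(2 * n) + 1
moduli = [m1, m2, m3, m4]

# All moduli must be pairwise co-prime for the Chinese remainder theorem.
assert all(gcd(a, b) == 1 for i, a in enumerate(moduli) for b in moduli[i + 1:])

# The abstract's Mersenne-form identity: m2 * m3 * m4 = 2^(4n) - 1.
assert m2 * m3 * m4 == 2**(4 * n) - 1

base = dynamic_range_bits([m1, m2, m3])   # ~3n bits
ext = dynamic_range_bits(moduli)          # ~5n bits
print(f"DR grows from {base:.1f} to {ext:.1f} bits "
      f"(+{100 * (ext - base) / base:.0f}%)")
```

The growth from ~$3n$ to ~$5n$ bits is a ~67% increase, matching the abstract's "around 70%".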
-
LeapFrog: The Rowhammer Instruction Skip Attack arXiv.cs.AR Pub Date : 2024-04-11 Andrew Adiletta, Caner Tol, Berk Sunar
Since its inception, Rowhammer exploits have rapidly evolved into increasingly sophisticated threats not only compromising data integrity but also the control flow integrity of victim processes. Nevertheless, it remains a challenge for an attacker to identify vulnerable targets (i.e., Rowhammer gadgets), understand the outcome of the attempted fault, and formulate an attack that yields useful results
-
Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design arXiv.cs.AR Pub Date : 2024-04-09 Tanner Andrulis, Ruicong Chen, Hae-Seung Lee, Joel S. Emer, Vivienne Sze
Analog Compute-in-Memory (CiM) accelerators use analog-digital converters (ADCs) to read the analog values that they compute. ADCs can consume significant energy and area, so architecture-level ADC decisions such as ADC resolution or number of ADCs can significantly impact overall CiM accelerator energy and area. Therefore, modeling how architecture-level decisions affect ADC energy and area is critical
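One common first-order way to see why ADC resolution dominates these trade-offs is a Walden-style figure-of-merit model, where energy per conversion grows as $2^{\text{bits}}$. This is a generic textbook model, not the paper's; the FoM constant below is a placeholder.

```python
# First-order ADC cost model: E = FoM * 2^bits per conversion. The FoM value
# is an assumed placeholder, not a number from the paper.
FOM_J_PER_CONV_STEP = 50e-15   # 50 fJ per conversion-step (assumed)

def adc_energy_per_sample(bits):
    """Energy for one conversion under E = FoM * 2^bits."""
    return FOM_J_PER_CONV_STEP * (2 ** bits)

def array_adc_energy(bits, num_adcs, samples_per_adc):
    """Total ADC energy for one CiM array readout."""
    return num_adcs * samples_per_adc * adc_energy_per_sample(bits)

# Doubling resolution from 4b to 8b multiplies per-sample energy by 16x,
# which is why architecture-level choices of resolution and ADC count matter.
e4 = adc_energy_per_sample(4)
e8 = adc_energy_per_sample(8)
print(f"4b: {e4:.2e} J, 8b: {e8:.2e} J, ratio {e8 / e4:.0f}x")
```

Under such a model, halving resolution while doubling the number of ADCs still cuts total readout energy sharply, which is the kind of architecture-level decision the paper sets out to model accurately.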
-
WaSP: Warp Scheduling to Mimic Prefetching in Graphics Workloads arXiv.cs.AR Pub Date : 2024-04-09 Diya Joseph, Juan Luis Aragón, Joan-Manuel Parcerisa, Antonio Gonzalez
Contemporary GPUs are designed to handle long-latency operations effectively; however, challenges such as core occupancy (number of warps in a core) and pipeline width can impede their latency management. This is particularly evident in Tile-Based Rendering (TBR) GPUs, where core occupancy remains low for extended durations. To address this challenge, we introduce WaSP, a lightweight warp scheduler
-
Design and implementation of a synchronous Hardware Performance Monitor for a RISC-V space-oriented processor arXiv.cs.AR Pub Date : 2024-04-08 Miguel Jiménez Arribas, Agustín Martínez Hellín, Manuel Prieto Mateo, Iván Gamino del Río, Andrea Fernandez Gallego, Oscar Rodríguez Polo, Antonio da Silva, Pablo Parra, Sebastián Sánchez
The ability to collect statistics about the execution of a program within a CPU is of the utmost importance across all fields of computing since it allows characterizing the timing performance of a program. This capability is even more relevant in safety-critical software systems, where it is mandatory to analyze software timing requirements to ensure the correct operation of the programs. Moreover
-
SRAM-PG: Power Delivery Network Benchmarks from SRAM Circuits arXiv.cs.AR Pub Date : 2024-04-08 Shan Shen, Zhiqiang Liu, Wenjian Yu
Designing the power delivery network (PDN) in very large-scale integrated (VLSI) circuits is increasingly important, especially for today's low-power integrated circuit (IC) design. In order to ensure that the designed PDN enables the low level of voltage drop and noise required for the success of IC design, accurate PDN analysis is in great demand and brings a challenge of computation
-
GDR-HGNN: A Heterogeneous Graph Neural Networks Accelerator Frontend with Graph Decoupling and Recoupling arXiv.cs.AR Pub Date : 2024-04-07 Runzhen Xue, Mingyu Yan, Dengke Han, Yihan Teng, Zhimin Tang, Xiaochun Ye, Dongrui Fan
Heterogeneous Graph Neural Networks (HGNNs) have broadened the applicability of graph representation learning to heterogeneous graphs. However, the irregular memory access pattern of HGNNs leads to the buffer thrashing issue in HGNN accelerators. In this work, we identify an opportunity to address buffer thrashing in HGNN acceleration through an analysis of the topology of heterogeneous graphs. To
-
Efficient Sparse Processing-in-Memory Architecture (ESPIM) for Machine Learning Inference arXiv.cs.AR Pub Date : 2024-04-06 Mingxuan He, Mithuna Thottethodi, T. N. Vijaykumar
Emerging machine learning (ML) models (e.g., transformers) involve memory pin bandwidth-bound matrix-vector (MV) computation in inference. By avoiding pin crossings, processing in memory (PIM) can improve performance and energy for pin-bound workloads, as evidenced by recent commercial efforts in (digital) PIM. Sparse models can improve performance and energy of inference without losing much accuracy
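The saving that sparse inference offers can be illustrated with a generic compressed-sparse-row (CSR) matrix-vector multiply: only nonzero weights are stored and touched, which is the arithmetic-and-bandwidth reduction sparse PIM designs aim to exploit. This sketch is a standard CSR kernel, not ESPIM's dataflow.

```python
# Generic CSR sparse matrix-vector multiply: work scales with the number of
# nonzeros, not with the dense matrix size.
def csr_spmv(values, col_idx, row_ptr, x):
    """y = A @ x for A stored in CSR form."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# A = [[0, 2, 0],
#      [1, 0, 3],
#      [0, 0, 0]]  -> 3 stored nonzeros instead of 9 dense elements
values = [2, 1, 3]
col_idx = [1, 0, 2]
row_ptr = [0, 1, 3, 3]
print(csr_spmv(values, col_idx, row_ptr, [1, 1, 1]))   # [2, 4, 0]
```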
-
VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA arXiv.cs.AR Pub Date : 2024-04-06 Sachini Wickramasinghe, Dhruv Parikh, Bingyi Zhang, Rajgopal Kannan, Viktor Prasanna, Carl Busart
Synthetic Aperture Radar (SAR) Automatic Target Recognition (ATR) is a key technique used in military applications like remote-sensing image recognition. Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications, outperforming their CNN counterparts. However, using ViTs for SAR ATR applications is challenging due to (1) standard ViTs require extensive training
-
QED: Scalable Verification of Hardware Memory Consistency arXiv.cs.AR Pub Date : 2024-04-03 Gokulan Ravi, Xiaokang Qiu, Mithuna Thottethodi, T. N. Vijaykumar
Memory consistency model (MCM) issues in out-of-order-issue microprocessor-based shared-memory systems are notoriously non-intuitive and a source of hardware design bugs. Prior hardware verification work is limited to in-order-issue processors, to proving the correctness only of some test cases, or to bounded verification that does not scale in practice beyond 7 instructions across all threads. Because
-
Spin-NeuroMem: A Low-Power Neuromorphic Associative Memory Design Based on Spintronic Devices arXiv.cs.AR Pub Date : 2024-04-03 Siqing Fu, Tiejun Li, Chunyuan Zhang, Sheng Ma, Jianmin Zhang, Lizhou Wu
Biologically-inspired computing models have made significant progress in recent years, but the conventional von Neumann architecture is inefficient for the large-scale matrix operations and massive parallelism required by these models. This paper presents Spin-NeuroMem, a low-power circuit design of Hopfield network for the function of associative memory. Spin-NeuroMem is equipped with energy-efficient
-
NetSmith: An Optimization Framework for Machine-Discovered Network Topologies arXiv.cs.AR Pub Date : 2024-04-02 Conor Green, Mithuna Thottethodi
Over the past few decades, network topology design for general purpose, shared memory multicores has been primarily driven by human experts who use their insights to arrive at network designs that balance the competing goals of performance requirements (e.g., latency, bandwidth) and cost constraints (e.g., router radix, router counts). On the other hand, there have been automatic NoC synthesis methods
-
A Fully-Configurable Open-Source Software-Defined Digital Quantized Spiking Neural Core Architecture arXiv.cs.AR Pub Date : 2024-04-02 Shadi Matinizadeh, Noah Pacik-Nelson, Ioannis Polykretis, Krupa Tishbi, Suman Kumar, M. L. Varshika, Arghavan Mohammadhassani, Abhishek Mishra, Nagarajan Kandasamy, James Shackleford, Eric Gallo, Anup Das
We introduce QUANTISENC, a fully configurable open-source software-defined digital quantized spiking neural core architecture to advance research in neuromorphic computing. QUANTISENC is designed hierarchically using a bottom-up methodology with multiple neurons in each layer and multiple layers in each core. The number of layers and neurons per layer can be configured via software in a top-down methodology
-
Optimizing Offload Performance in Heterogeneous MPSoCs arXiv.cs.AR Pub Date : 2024-04-02 Luca Colagrande, Luca Benini
Heterogeneous multi-core architectures combine a few "host" cores, optimized for single-thread performance, with many small energy-efficient "accelerator" cores for data-parallel processing, on a single chip. Offloading a computation to the many-core acceleration fabric introduces a communication and synchronization cost which reduces the speedup attainable on the accelerator, particularly for small
-
Analyzing the Single Event Upset Vulnerability of Binarized Neural Networks on SRAM FPGAs arXiv.cs.AR Pub Date : 2024-04-02 Ioanna Souvatzoglou, Athanasios Papadimitriou, Aitzan Sari, Vasileios Vlagkoulis, Mihalis Psarakis
Neural Networks (NNs) are increasingly used in the last decade in several demanding applications, such as object detection and classification, autonomous driving, etc. Among different computing platforms for implementing NNs, FPGAs have multiple advantages due to design flexibility and high performance-to-watt ratio. Moreover, approximation techniques, such as quantization, have been introduced, which
-
Block-SSD: A New Block-Based Blocking SSD Architecture arXiv.cs.AR Pub Date : 2024-04-03 Ryan Wong, Arjun Tyagi, Sungjun Cho, Pratik Sampat, Yiqiu Sun
Computer science and related fields (e.g., computer engineering, computer hardware engineering, electrical engineering, electrical and computer engineering, computer systems engineering) often draw inspiration from other fields, areas, and the real world in order to describe topics in their area. One cross-domain example is the idea of a block. The idea of blocks comes in many flavors, including software
-
CAMO: Correlation-Aware Mask Optimization with Modulated Reinforcement Learning arXiv.cs.AR Pub Date : 2024-04-01 Xiaoxiao Liang, Haoyu Yang, Kang Liu, Bei Yu, Yuzhe Ma
Optical proximity correction (OPC) is a vital step to ensure printability in modern VLSI manufacturing. Various OPC approaches based on machine learning have been proposed to pursue performance and efficiency, which are typically data-driven and hardly involve any particular considerations of the OPC problem, leading to potential performance or efficiency bottlenecks. In this paper, we propose CAMO
-
Memristor-Based Lightweight Encryption arXiv.cs.AR Pub Date : 2024-03-29 Muhammad Ali Siddiqi, Jan Andrés Galvan Hernández, Anteneh Gebregiorgis, Rajendra Bishnoi, Christos Strydis, Said Hamdioui, Mottaqiallah Taouil
Next-generation personalized healthcare devices are undergoing extreme miniaturization in order to improve user acceptability. However, such developments make it difficult to incorporate cryptographic primitives using available target technologies since these algorithms are notorious for their energy consumption. Besides, strengthening these schemes against side-channel attacks further adds to the
-
Hot-LEGO: Architect Microfluidic Cooling Equipped 3DICs with Pre-RTL Thermal Simulation arXiv.cs.AR Pub Date : 2024-03-29 Runxi Wang, Jun-Han Han, Mircea Stan, Xinfei Guo
Microfluidic cooling has been recognized as one of the most promising solutions to achieve efficient thermal management for three-dimensional integrated circuits (3DICs). It enables more opportunities to architect 3DICs with different die configurations. It becomes increasingly important to perform thermal analysis in the early design phases to validate the architectural design decisions. This is even
-
Dataflow-Aware PIM-Enabled Manycore Architecture for Deep Learning Workloads arXiv.cs.AR Pub Date : 2024-03-28 Harsh Sharma, Gaurav Narang, Janardhan Rao Doppa, Umit Ogras, Partha Pratim Pande
Processing-in-memory (PIM) has emerged as an enabler for the energy-efficient and high-performance acceleration of deep learning (DL) workloads. Resistive random-access memory (ReRAM) is one of the most promising technologies to implement PIM. However, as the complexity of Deep convolutional neural networks (DNNs) grows, we need to design a manycore architecture with multiple ReRAM-based processing
-
Toward CXL-Native Memory Tiering via Device-Side Profiling arXiv.cs.AR Pub Date : 2024-03-27 Zhe Zhou, Yiqi Chen, Tao Zhang, Yang Wang, Ran Shu, Shuotao Xu, Peng Cheng, Lei Qu, Yongqiang Xiong, Guangyu Sun
The Compute Express Link (CXL) interconnect has provided the ability to integrate diverse memory types into servers via byte-addressable SerDes links. Harnessing the full potential of such heterogeneous memory systems requires efficient memory tiering. However, existing research in this domain has been constrained by low-resolution and high-overhead memory access profiling techniques. To address this
-
Annotating Slack Directly on Your Verilog: Fine-Grained RTL Timing Evaluation for Early Optimization arXiv.cs.AR Pub Date : 2024-03-27 Wenji Fang, Shang Liu, Hongce Zhang, Zhiyao Xie
In digital IC design, compared with post-synthesis netlists or layouts, the early register-transfer level (RTL) stage offers greater optimization flexibility for both designers and EDA tools. However, timing information is typically unavailable at this early stage. Some recent machine learning (ML) solutions propose to predict the total negative slack (TNS) and worst negative slack (WNS) of an entire
-
Merits of Time-Domain Computing for VMM -- A Quantitative Comparison arXiv.cs.AR Pub Date : 2024-03-27 Florian Freye, Jie Lou, Christian Lanius, Tobias Gemmeke
Vector-matrix-multiplication (VMM) accelerators have gained a lot of traction, especially due to the rise of convolutional neural networks (CNNs) and the desire to compute them on the edge. Besides the classical digital approach, analog computing has gone through a renaissance to push energy efficiency further. A more recent approach is called time-domain (TD) computing. In contrast to analog computing
-
Optimizing Communication for Latency Sensitive HPC Applications on up to 48 FPGAs Using ACCL arXiv.cs.AR Pub Date : 2024-03-27 Marius Meyer, Tobias Kenter, Lucian Petrica, Kenneth O'Brien, Michaela Blott, Christian Pessl
Most FPGA boards in the HPC domain are well-suited for parallel scaling because of the direct integration of versatile and high-throughput network ports. However, the utilization of their network capabilities is often challenging and error-prone because the whole network stack and communication patterns have to be implemented and managed on the FPGAs. Also, this approach conceptually involves a trade-off
-
SIP: Autotuning GPU Native Schedules via Stochastic Instruction Perturbation arXiv.cs.AR Pub Date : 2024-03-25 Guoliang He, Eiko Yoneki
Large language models (LLMs) have become a significant workload since their appearance. However, they are also computationally expensive as they have billions of parameters and are trained with massive amounts of data. Thus, recent works have developed dedicated CUDA kernels for LLM training and inference instead of relying on compilergenerated ones, so that hardware resources are as fully utilized
-
Partially-Precise Computing Paradigm for Efficient Hardware Implementation of Application-Specific Embedded Systems arXiv.cs.AR Pub Date : 2024-03-25 Mohsen Faryabi, Amir Hossein Moradi
Nowadays, the number of emerging embedded systems is rapidly growing in many application domains, due to recent advances in artificial intelligence and the internet of things. The main inherent characteristic of these application-specific systems is that they are not general-purpose in nature and are basically developed to perform only a particular task, and therefore deal only with a limited and predefined range of
-
Thermal Analysis for NVIDIA GTX480 Fermi GPU Architecture arXiv.cs.AR Pub Date : 2024-03-24 Savinay Nagendra
In this project, we design a four-layer (Silicon|TIM|Silicon|TIM) 3D floor plan for the NVIDIA GTX480 Fermi GPU architecture and compare heat dissipation and power trends for matrix multiplication and Needleman-Wunsch kernels. First, CUDA kernels for the two algorithms are written. These kernels are compiled and executed with the GPGPU Simulator to extract power logs for varying tensor sizes. These power
-
Navigating the Landscape of Distributed File Systems: Architectures, Implementations, and Considerations arXiv.cs.AR Pub Date : 2024-03-23 Xueting Pan, Ziqian Luo, Lisang Zhou
Distributed File Systems (DFS) have emerged as sophisticated solutions for efficient file storage and management across interconnected computer nodes. The main objective of DFS is to achieve flexible, scalable, and resilient file storage management by dispersing file data across multiple interconnected computer nodes, enabling users to seamlessly access and manipulate files distributed across diverse
-
A Two Level Neural Approach Combining Off-Chip Prediction with Adaptive Prefetch Filtering arXiv.cs.AR Pub Date : 2024-03-22 Alexandre Valentin Jamet, Georgios Vavouliotis, Daniel A. Jiménez, Lluc Alvarez, Marc Casas
To alleviate the performance and energy overheads of contemporary applications with large data footprints, we propose the Two Level Perceptron (TLP) predictor, a neural mechanism that effectively combines predicting whether an access will be off-chip with adaptive prefetch filtering at the first-level data cache (L1D). TLP is composed of two connected microarchitectural perceptron predictors, named
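The building block behind such predictors — a hashed perceptron that sums per-feature weights and trains on a margin — can be sketched in a few lines. This is a generic hashed-perceptron predictor, not TLP itself; the features, table size, and threshold below are illustrative assumptions.

```python
# Toy hashed-perceptron predictor for "will this access go off-chip?":
# each feature hashes into its own weight table, the signed sum gives the
# prediction, and training nudges weights when the prediction is wrong or weak.
TABLE_SIZE = 256
THRESHOLD = 8          # keep training until |sum| exceeds this margin
WEIGHT_MAX = 31        # saturating weights

class PerceptronPredictor:
    def __init__(self, num_features=3):
        self.tables = [[0] * TABLE_SIZE for _ in range(num_features)]

    def _indices(self, pc, addr):
        # Illustrative features: PC, cache-line address, and their XOR.
        feats = [pc, addr >> 6, pc ^ (addr >> 6)]
        return [f % TABLE_SIZE for f in feats]

    def predict(self, pc, addr):
        s = sum(t[i] for t, i in zip(self.tables, self._indices(pc, addr)))
        return s >= 0, s          # (predicted off-chip?, confidence)

    def train(self, pc, addr, went_off_chip):
        pred, s = self.predict(pc, addr)
        if pred != went_off_chip or abs(s) <= THRESHOLD:
            delta = 1 if went_off_chip else -1
            for t, i in zip(self.tables, self._indices(pc, addr)):
                t[i] = max(-WEIGHT_MAX, min(WEIGHT_MAX, t[i] + delta))

p = PerceptronPredictor()
# In this toy trace, accesses from pc=0x40 to a high address always miss on-chip.
for _ in range(20):
    p.train(0x40, 0xDEAD0000, True)
print(p.predict(0x40, 0xDEAD0000)[0])   # expect True after training
```

TLP chains two such predictors — one for the off-chip decision, one to filter prefetches — but the training loop above is the shared mechanism.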
-
Allspark: Workload Orchestration for Visual Transformers on Processing In-Memory Systems arXiv.cs.AR Pub Date : 2024-03-22 Mengke Ge, Junpeng Wang, Binhan Chen, Yingjian Zhong, Haitao Du, Song Chen, Yi Kang
The advent of Transformers has revolutionized computer vision, offering a powerful alternative to convolutional neural networks (CNNs), especially with the local attention mechanism that excels at capturing local structures within the input and achieves state-of-the-art performance. Processing in-memory (PIM) architecture offers extensive parallelism, low data movement costs, and scalable memory bandwidth
-
Beehive: A Flexible Network Stack for Direct-Attached Accelerators arXiv.cs.AR Pub Date : 2024-03-21 Katie Lim, Matthew Giordano, Theano Stavrinos, Baris Kasikci, Tom Anderson
Accelerators have become increasingly popular in datacenters due to their cost, performance, and energy benefits. Direct-attached accelerators, where the network stack is implemented in hardware and network traffic bypasses the main CPU, can further enhance these benefits. However, modern datacenter software network stacks are complex, with interleaved protocol layers, network management functions
-
Cross-layer Modeling and Design of Content Addressable Memories in Advanced Technology Nodes for Similarity Search arXiv.cs.AR Pub Date : 2024-03-22 Siri Narla, Piyush Kumar, Mohammad Adnaan, Azad Naeemi
In this paper we present a comprehensive design and benchmarking study of Content Addressable Memory (CAM) at the 7nm technology node in the context of similarity search applications. We design CAM cells based on SRAM, spin-orbit torque, and ferroelectric field effect transistor devices and from their layouts extract cell parasitics using state of the art EDA tools. These parasitics are used to develop
-
Accelerating ViT Inference on FPGA through Static and Dynamic Pruning arXiv.cs.AR Pub Date : 2024-03-21 Dhruv Parikh, Shouyi Li, Bingyi Zhang, Rajgopal Kannan, Carl Busart, Viktor Prasanna
Vision Transformers (ViTs) have achieved state-of-the-art accuracy on various computer vision tasks. However, their high computational complexity prevents them from being applied to many real-world applications. Weight and token pruning are two well-known methods for reducing complexity: weight pruning reduces the model size and associated computational demands, while token pruning further dynamically
-
HCiM: ADC-Less Hybrid Analog-Digital Compute in Memory Accelerator for Deep Learning Workloads arXiv.cs.AR Pub Date : 2024-03-20 Shubham Negi, Utkarsh Saxena, Deepika Sharma, Kaushik Roy
Analog Compute-in-Memory (CiM) accelerators are increasingly recognized for their efficiency in accelerating Deep Neural Networks (DNN). However, their dependence on Analog-to-Digital Converters (ADCs) for accumulating partial sums from crossbars leads to substantial power and area overhead. Moreover, the high area overhead of ADCs constrains the throughput due to the limited number of ADCs that can
-
System Support for Environmentally Sustainable Computing in Data Centers arXiv.cs.AR Pub Date : 2024-03-19 Fan Chen
Modern data centers suffer from a growing carbon footprint due to insufficient support for environmental sustainability. While hardware accelerators and renewable energy have been utilized to enhance sustainability, addressing Quality of Service (QoS) degradation caused by renewable energy supply and hardware recycling remains challenging: (1) prior accelerators exhibit significant carbon footprints
-
Table-Lookup MAC: Scalable Processing of Quantised Neural Networks in FPGA Soft Logic arXiv.cs.AR Pub Date : 2024-03-18 Daniel Gerlinghoff, Benjamin Chen Ming Choong, Rick Siow Mong Goh, Weng-Fai Wong, Tao Luo
Recent advancements in neural network quantisation have yielded remarkable outcomes, with three-bit networks reaching state-of-the-art full-precision accuracy on complex tasks. These achievements present valuable opportunities for accelerating neural networks by computing in reduced precision. Implementing such reduced-precision networks on FPGAs can take advantage of bit-level reconfigurability, which is not available on conventional
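The table-lookup idea behind the title can be sketched as follows: with 3-bit weights and 3-bit activations there are only 8 × 8 = 64 possible products, so each multiply can be replaced by a precomputed lookup (on an FPGA, a small LUT-ROM per lane instead of a DSP multiplier). This is a hedged software illustration of the principle; the names and structure are ours, not the paper's.

```python
# 64-entry product table for 3-bit unsigned operands: a multiply becomes
# an array index, which maps naturally onto FPGA soft-logic LUTs.
PROD_LUT = [[w * a for a in range(8)] for w in range(8)]

def lut_mac(weights, activations):
    """Multiply-accumulate via table lookup instead of hardware multipliers."""
    acc = 0
    for w, a in zip(weights, activations):
        acc += PROD_LUT[w][a]
    return acc
```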
-
DEFA: Efficient Deformable Attention Acceleration via Pruning-Assisted Grid-Sampling and Multi-Scale Parallel Processing arXiv.cs.AR Pub Date : 2024-03-16 Yansong Xu, Dongxu Lyu, Zhenyu Li, Zilong Wang, Yuzhou Chen, Gang Wang, Zhican Wang, Haomin Li, Guanghui He
Multi-scale deformable attention (MSDeformAttn) has emerged as a key mechanism in various vision tasks, demonstrating clear advantages attributable to multi-scale grid-sampling. However, this newly introduced operator incurs irregular data access and an enormous memory requirement, leading to severe processing-element (PE) underutilization. Meanwhile, existing approaches for attention acceleration cannot be directly applied
-
Strict Partitioning for Sporadic Rigid Gang Tasks arXiv.cs.AR Pub Date : 2024-03-15 Binqi Sun, Tomasz Kloda, Marco Caccamo
The rigid gang task model is based on the idea of executing multiple threads simultaneously on a fixed number of processors to increase efficiency and performance. Although there is extensive literature on global rigid gang scheduling, partitioned approaches have several practical advantages (e.g., task isolation and reduced scheduling overheads). In this paper, we propose a new partitioned scheduling
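A partitioned approach to rigid gang scheduling can be sketched with a simple first-fit heuristic (an assumed baseline, not the paper's strict-partitioning algorithm): each task requires `cores` processors simultaneously with per-core utilization `util`, and it fits on a cluster only if enough cores have spare utilization.

```python
def first_fit_partition(tasks, clusters):
    """tasks: list of (cores, util); clusters: list of core counts.
    Returns a per-task cluster assignment, or None if some task does not fit."""
    load = [[0.0] * c for c in clusters]   # per-core utilization, per cluster
    assign = []
    for cores, util in tasks:
        for ci, cluster in enumerate(load):
            # choose the `cores` least-loaded cores of this cluster
            cand = sorted(range(len(cluster)), key=lambda i: cluster[i])[:cores]
            if len(cand) == cores and all(cluster[i] + util <= 1.0 for i in cand):
                for i in cand:
                    cluster[i] += util
                assign.append(ci)
                break
        else:
            return None   # task fits on no cluster
    return assign
```

Partitioning like this trades some schedulability for the task isolation and low scheduling overhead the abstract cites as the advantages of partitioned approaches.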
-
Taiyi: A high-performance CKKS accelerator for Practical Fully Homomorphic Encryption arXiv.cs.AR Pub Date : 2024-03-15 Shengyu Fan, Xianglong Deng, Zhuoyu Tian, Zhicheng Hu, Liang Chang, Rui Hou, Dan Meng, Mingzhe Zhang
Fully Homomorphic Encryption (FHE), a cryptographic technique enabling computation directly on encrypted data, offers significant security benefits but is hampered by substantial performance overhead. In recent years, a series of accelerator designs have significantly enhanced the performance of FHE applications, bringing them closer to real-world applicability. However, these accelerators face
-
Bandwidth-Effective DRAM Cache for GPUs with Storage-Class Memory arXiv.cs.AR Pub Date : 2024-03-14 Jeongmin Hong, Sungjun Cho, Geonwoo Park, Wonhyuk Yang, Young-Ho Gong, Gwangsun Kim
We propose overcoming the memory capacity limitation of GPUs with high-capacity Storage-Class Memory (SCM) and a DRAM cache. By significantly increasing the memory capacity with SCM, the GPU can capture a larger fraction of the memory footprint than HBM for workloads that oversubscribe memory, achieving high speedups. However, the DRAM cache needs to be carefully designed to address the latency and bandwidth
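The DRAM-cache arrangement can be illustrated with a minimal direct-mapped model: hot lines are served from DRAM, and a miss fetches the line from SCM and installs it. This sketch only shows the hit/miss mechanics; the paper's contribution concerns latency and bandwidth, which this model does not capture.

```python
class DramCache:
    """Direct-mapped DRAM cache in front of SCM (illustrative model)."""
    def __init__(self, num_sets: int):
        self.tags = [None] * num_sets
        self.hits = self.misses = 0

    def access(self, line_addr: int) -> bool:
        s = line_addr % len(self.tags)
        tag = line_addr // len(self.tags)
        if self.tags[s] == tag:
            self.hits += 1
            return True        # served from DRAM
        self.misses += 1       # fetched from SCM, installed in DRAM
        self.tags[s] = tag
        return False
```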
-
Analytical Heterogeneous Die-to-Die 3D Placement with Macros arXiv.cs.AR Pub Date : 2024-03-14 Yuxuan Zhao, Peiyu Liao, Siting Liu, Jiaxi Jiang, Yibo Lin, Bei Yu
This paper presents an innovative approach to 3D mixed-size placement in heterogeneous face-to-face (F2F) bonded 3D ICs. We propose an analytical framework that utilizes a dedicated density model and a bistratal wirelength model, effectively handling macros and standard cells in a 3D solution space. A novel 3D preconditioner is developed to resolve the topological and physical gap between macros and
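The wirelength objective at the heart of analytical placers is the standard half-perimeter wirelength (HPWL); a bistratal 3D variant would additionally account for which die each cell sits on, which this plain sketch does not model.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin positions:
    the half-perimeter of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))
```

A placer minimizes the sum of `hpwl` over all nets, typically via a smooth differentiable approximation rather than this non-smooth max/min form.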