-
SLAP: Segmented Reuse-Time-Label Based Admission Policy for Content Delivery Network Caching ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-23 Ke Liu, Kan Wu, Hua Wang, Ke Zhou, Peng Wang, Ji Zhang, Cong Li
“Learned” admission policies have shown promise in improving Content Delivery Network (CDN) cache performance and lowering operational costs. Unfortunately, existing learned policies are optimized for a few fixed cache sizes, while in reality cache sizes often vary over time in an unpredictable manner. As a result, existing solutions cannot provide consistent benefits in production settings. We present
-
iSwap: A New Memory Page Swap Mechanism for Reducing Ineffective I/O Operations in Cloud Environments ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-23 Zhuohao Wang, Lei Liu, Limin Xiao
This paper proposes iSwap, a new memory page swap mechanism that reduces ineffective I/O swap operations and improves QoS for high-priority applications in cloud environments. iSwap works in the OS kernel. It accurately learns the reuse patterns of memory pages and makes swap decisions accordingly to avoid ineffective operations. In the cases where memory pressure is high
-
An Instruction Inflation Analyzing Framework for Dynamic Binary Translators ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-23 Benyi Xie, Yue Yan, Chenghao Yan, Sicheng Tao, Zhuangzhuang Zhang, Xinyu Li, Yanzhi Lan, Xiang Wu, Tianyi Liu, Tingting Zhang, Fuxin Zhang
Dynamic binary translators (DBTs) are widely used to migrate applications between different instruction set architectures (ISAs). Despite extensive research to improve DBT performance, noticeable overhead remains, preventing near-native performance, especially when translating from complex instruction set computer (CISC) to reduced instruction set computer (RISC). For computational workloads, the main
-
Cost-aware Service Placement and Scheduling in the Edge-Cloud Continuum ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-23 Samuel Rac, Mats Brorsson
The edge-to-data-center computing continuum is the aggregation of computing resources located anywhere between the network edge (e.g., close to 5G antennas) and servers in traditional data centers. Kubernetes is the de facto standard for the orchestration of services in data center environments, where it is very efficient. However, it fails to deliver the same performance when including edge resources
-
Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-23 Feng Xue, Chenji Han, Xinyu Li, Junliang Wu, Tingting Zhang, Tianyi Liu, Yifan Hao, Zidong Du, Qi Guo, Fuxin Zhang
Indirect memory accesses (IMAs, i.e., A[f(B[i])]) are typical memory access patterns in applications such as graph analysis, machine learning, and databases. IMAs are composed of producer-consumer pairs, where the consumers’ memory addresses are derived from the producers’ memory data. Because of this inherent value dependence, IMAs exhibit poor locality, making prefetching ineffective. Hindered by
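The A[f(B[i])] pattern named in the abstract can be pictured with a toy gather; this is a generic sketch, assuming f is an arbitrary value-dependent index function, not anything from Tyche itself:

```python
# Minimal sketch of an indirect memory access (IMA) pattern, A[f(B[i])]:
# B is the producer array, A the consumer array.
def gather_indirect(A, B, f):
    # The consumer address f(B[i]) depends on the *value* loaded from B,
    # so it cannot be predicted from the loop index alone -- this value
    # dependence is what defeats conventional stride prefetchers.
    return [A[f(b)] for b in B]

A = [10, 20, 30, 40, 50]
B = [3, 0, 4, 1]
print(gather_indirect(A, B, lambda v: v))  # -> [40, 10, 50, 20]
```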
-
Winols: A Large-Tiling Sparse Winograd CNN Accelerator on FPGAs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-23 Kunpeng Xie, Ye Lu, Xinyu He, Dezhi Yi, Huijuan Dong, Yao Chen
Convolutional Neural Networks (CNNs) can benefit from the computational reductions provided by the Winograd minimal filtering algorithm and weight pruning. However, harnessing the potential of both methods simultaneously introduces complexity in designing pruning algorithms and accelerators. Prior studies aimed to establish regular sparsity patterns in the Winograd domain, but they were primarily suited
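For reference, the Winograd minimal-filtering idea the abstract builds on can be illustrated with the 1D F(2,3) variant, which computes two outputs of a 3-tap filter with 4 multiplications instead of the direct method's 6; this is a textbook sketch, not the Winols accelerator design:

```python
# 1D Winograd minimal filtering F(2,3): 4 multiplications for 2 outputs.
def winograd_f23(d, g):
    d0, d1, d2, d3 = d          # 4 input samples
    g0, g1, g2 = g              # 3 filter taps
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct_conv(d, g):
    # Direct sliding-window convolution for comparison (6 multiplications).
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [0.5, 1.0, -1.0]
assert winograd_f23(d, g) == direct_conv(d, g)
```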
-
ReSA: Reconfigurable Systolic Array for Multiple Tiny DNN Tensors ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-21 Ching-Jui Lee, Tsung Tai Yeh
Systolic array architectures have significantly accelerated deep neural networks (DNNs). A systolic array comprises multiple processing elements (PEs) that perform multiply-accumulate (MAC) operations. Traditionally, a systolic array can simultaneously execute, at each cycle, an amount of tensor data matching the size of the array. However, hyper-parameters of DNN models differ across each
-
Cross-Core Data Sharing for Energy-Efficient GPUs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-18 Hajar Falahati, Mohammad Sadrosadati, Qiumin Xu, Juan Gómez-Luna, Banafsheh Saber Latibari, Hyeran Jeon, Shaahin Hesaabi, Hamid Sarbazi-Azad, Onur Mutlu, Murali Annavaram, Masoud Pedram
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit the locality in data accesses. L1D misses are costly for GPUs due to
-
Cerberus: Triple Mode Acceleration of Sparse Matrix and Vector Multiplication ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-17 Soojin Hwang, Daehyeon Baek, Jongse Park, Jaehyuk Huh
The multiplication of sparse matrix and vector (SpMV) is one of the most widely used kernels in high-performance computing as well as machine learning acceleration for sparse neural networks. The design space of SpMV accelerators has two axes: algorithm and matrix representation. There have been two widely used algorithms and data representations. Two algorithms, scalar multiplication and dot product
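As background for the algorithm axis the abstract mentions, a minimal dot-product SpMV over the CSR representation might look like this; a generic sketch, not Cerberus itself:

```python
# Dot-product SpMV over CSR: each output element is the dot product of a
# sparse matrix row with the dense vector x.
def spmv_csr(values, col_idx, row_ptr, x):
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        # row r's nonzeros live in values[row_ptr[r] : row_ptr[r+1]]
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] in CSR form:
values  = [1, 2, 3, 4, 5]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
print(spmv_csr(values, col_idx, row_ptr, [1, 1, 1]))  # -> [3, 3, 9]
```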
-
Orchard: Heterogeneous Parallelism and Fine-grained Fusion for Complex Tree Traversals ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-15 Vidush Singhal, Laith Sakka, Kirshanthan Sundararajah, Ryan R. Newton, Milind Kulkarni
Many applications are designed to perform traversals on tree-like data structures. Fusing and parallelizing these traversals enhance the performance of applications. Fusing multiple traversals improves the locality of the application. The runtime of an application can be significantly reduced by extracting parallelism and utilizing multi-threading. Prior frameworks have tried to fuse and parallelize
-
TEA+: A Novel Temporal Graph Random Walk Engine With Hybrid Storage Architecture ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-14 Chengying Huan, Yongchao Liu, Heng Zhang, Shuaiwen Song, Santosh Pandey, Shiyang Chen, Xiangfei Fang, Yue Jin, Baptiste Lepers, Yanjun Wu, Hang Liu
Many real-world networks are characterized by being temporal and dynamic, wherein the temporal information signifies the changes in connections, such as the addition or removal of links between nodes. Employing random walks on these temporal networks is a crucial technique for understanding the structural evolution of such graphs over time. However, existing state-of-the-art sampling methods are designed
-
NEM-GNN - DAC/ADC-less, scalable, reconfigurable, graph and sparsity-aware near-memory accelerator for graph neural networks ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-14 Siddhartha Raman Sundara Raman, Lizy John, Jaydeep P. Kulkarni
Graph neural networks (GNNs) are of great interest in real-life applications such as citation networks and drug discovery, owing to GNNs’ ability to apply machine learning techniques on graphs. GNNs utilize a two-step approach to classify the nodes of a graph into pre-defined categories. The first step uses a combination kernel to perform data-intensive convolution operations with regular memory access
-
xMeta: SSD-HDD-Hybrid Optimization for Metadata Maintenance of Cloud-Scale Object Storage ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-03-13 Yan Chen, Qiwen Ke, Huiba Li, Yongwei Wu, Yiming Zhang
Object storage has been widely used in the cloud. Traditionally, the size of object metadata is much smaller than that of object data, and thus existing object storage systems (like Ceph and Oasis) can place object data and metadata respectively on hard disk drives (HDDs) and solid-state drives (SSDs) to achieve high I/O performance at a low monetary cost. Currently, however, a wide range of cloud
-
The Droplet Search Algorithm for Kernel Scheduling ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-29 Michael Canesche, Vanderson M. Rosario, Edson Borin, Fernando Magno Quintão Pereira
Kernel scheduling is the problem of finding the most efficient implementation for a computational kernel. Identifying this implementation involves experimenting with the parameters of compiler optimizations, such as the size of tiling windows and unrolling factors. This paper shows that it is possible to organize these parameters as points in a coordinate space. The function that maps these points
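The abstract's idea of organizing optimization parameters as points in a coordinate space can be pictured with a toy search. The parameter values and cost function below are hypothetical stand-ins for measured kernel runtimes, and this exhaustive scan is plain grid search, not the paper's Droplet algorithm:

```python
# Treat (tile size, unroll factor) pairs as points in a 2-D coordinate
# space and search for the cheapest point.
import itertools

tile_sizes     = [8, 16, 32, 64]     # hypothetical tiling-window sizes
unroll_factors = [1, 2, 4, 8]        # hypothetical unrolling factors

def cost(point):
    # Stand-in cost model; a real autotuner would time the compiled kernel.
    tile, unroll = point
    return abs(tile - 32) + abs(unroll - 4)

space = list(itertools.product(tile_sizes, unroll_factors))
best = min(space, key=cost)
print(best)  # -> (32, 4)
```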
-
Camouflage: Utility-Aware Obfuscation for Accurate Simulation of Sensitive Program Traces ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-29 Asmita Pal, Keerthana Desai, Rahul Chatterjee, Joshua San Miguel
Trace-based simulation is a widely used methodology for system design exploration. It relies on realistic traces that represent a range of behaviors necessary to be evaluated, containing a lot of information about the application, its inputs and the underlying system on which it was generated. Consequently, generating traces from real-world executions risk leakage of sensitive information. To prevent
-
FASA-DRAM: Reducing DRAM Latency with Destructive Activation and Delayed Restoration ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-23 Haitao Du, Yuhan Qin, Song Chen, Yi Kang
DRAM memory is a performance bottleneck for many applications, due to its high access latency. Previous work has mainly focused on data locality, introducing small-but-fast regions to cache frequently accessed data, thereby reducing the average latency. However, these locality-based designs have three challenges in modern multi-core systems: 1) Inter-application interference leads to random memory
-
Assessing the Impact of Compiler Optimizations on GPUs Reliability ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Fernando Fernandes Dos Santos, Luigi Carro, Flavio Vella, Paolo Rech
Compilers for Graphics Processing Units (GPUs) have evolved to support general-purpose programming languages for multiple architectures. The NVIDIA CUDA Compiler (NVCC) has many compilation levels before generating the machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped onto the underlying hardware; thus, as we show in this article
-
Architectural Support for Sharing, Isolating and Virtualizing FPGA Resources ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-16 Panagiotis Miliadis, Dimitris Theodoropoulos, Dionisios N. Pnevmatikatos, Nectarios Koziris
FPGAs are increasingly popular in cloud environments for their ability to offer on-demand acceleration and improved compute efficiency. Providers would like to increase utilization, by multiplexing customers on a single device, similar to how processing cores and memory are shared. Nonetheless, multi-tenancy still faces major architectural limitations including: a) inefficient sharing of memory interfaces
-
Highly Efficient Self-checking Matrix Multiplication on Tiled AMX Accelerators ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Chandra Sekhar Mummidi, Victor C. Ferreira, Sudarshan Srinivasan, Sandip Kundu
General Matrix Multiplication (GEMM) is a computationally expensive operation that is used in many applications such as machine learning. Hardware accelerators are increasingly popular for speeding up GEMM computation, with Tiled Matrix Multiplication (TMUL) in recent Intel processors being an example. Unfortunately, the TMUL hardware is susceptible to errors, necessitating online error detection.
-
WIPE: A Write-Optimized Learned Index for Persistent Memory ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Zhonghua Wang, Chen Ding, Fengguang Song, Kai Lu, Jiguang Wan, Zhihu Tan, Changsheng Xie, Guokuan Li
Learned Index, which utilizes effective machine learning models to accelerate locating sorted data positions, has gained increasing attention in many big data scenarios. Using efficient learned models, the learned indexes build large nodes and flat structures, thereby greatly improving the performance. However, most of the state-of-the-art learned indexes are designed for DRAM, and there is hence an
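The core learned-index idea the abstract relies on, a model predicting a key's position in sorted data followed by a small bounded search to correct the prediction, can be sketched generically; this is not WIPE's persistent-memory design, and the linear model and error bound are illustrative:

```python
# Learned-index sketch: linear model predicts position, bounded search fixes it.
import bisect

def build_linear_model(keys):
    # Fit position ~ slope * key + intercept by least squares.
    n = len(keys)
    mean_k = sum(keys) / n
    mean_p = (n - 1) / 2
    num = sum((k - mean_k) * (p - mean_p) for p, k in enumerate(keys))
    den = sum((k - mean_k) ** 2 for k in keys)
    slope = num / den
    return slope, mean_p - slope * mean_k

def lookup(keys, model, key, err=4):
    # err is the model's maximum position error (assumed known here).
    slope, intercept = model
    guess = int(round(slope * key + intercept))
    lo = max(0, guess - err)
    hi = min(len(keys), guess + err + 1)
    i = bisect.bisect_left(keys, key, lo, hi)
    return i if i < len(keys) and keys[i] == key else -1

keys = [2 * i for i in range(100)]       # sorted keys 0, 2, ..., 198
model = build_linear_model(keys)
assert lookup(keys, model, 84) == 42     # present key found at its position
assert lookup(keys, model, 85) == -1     # absent key reported missing
```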
-
Coherence Attacks and Countermeasures in Interposer-based Chiplet Systems ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Gino A. Chacon, Charles Williams, Johann Knechtel, Ozgur Sinanoglu, Paul V. Gratz, Vassos Soteriou
Industry is moving towards large-scale hardware systems that bundle processor cores, memories, accelerators, and so on, via 2.5D integration. These components are fabricated separately as chiplets and then integrated using an interposer as an interconnect carrier. This new design style is beneficial in terms of yield and economies of scale, as chiplets may come from various vendors and are relatively
-
A Concise Concurrent B+-Tree for Persistent Memory ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Yan Wei, Zhang Xingjun
Persistent memory (PM) presents a unique opportunity for designing data management systems that offer improved performance, scalability, and instant restart capability. As a widely used data structure for managing data in such systems, B+-Tree must address the challenges presented by PM in both data consistency and device performance. However, existing studies suffer from significant performance degradation
-
An Efficient Hybrid Deep Learning Accelerator for Compact and Heterogeneous CNNs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Fareed Qararyah, Muhammad Waqar Azhar, Pedro Trancoso
Resource-efficient Convolutional Neural Networks (CNNs) are gaining more attention. These CNNs have relatively low computational and memory requirements. A common denominator among such CNNs is having more heterogeneity than traditional CNNs. This heterogeneity is present at two levels: intra-layer type and inter-layer type. Generic accelerators do not capture these levels of heterogeneity, which harms
-
Dedicated Hardware Accelerators for Processing of Sparse Matrices and Vectors: A Survey ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-02-15 Valentin Isaac–Chassande, Adrian Evans, Yves Durand, Frédéric Rousseau
Performance in scientific and engineering applications such as computational physics, algebraic graph problems, or Convolutional Neural Networks (CNNs) is dominated by the manipulation of large sparse matrices—matrices with a large number of zero elements. Specialized software using data formats for sparse matrices has been optimized for the main kernels of interest: SpMV and SpMSpM matrix multiplications
-
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-19 Xueying Wang, Guangli Li, Zhen Jia, Xiaobing Feng, Yida Wang
Low-precision computation has emerged as one of the most effective techniques for accelerating convolutional neural networks and has garnered widespread support on modern hardware. Despite its effectiveness in accelerating convolutional neural networks, low-precision computation has not been commonly applied to fast convolutions, such as the Winograd algorithm, due to numerical issues. In this article
-
Abakus: Accelerating k-mer Counting with Storage Technology ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-18 Lingxi Wu, Minxuan Zhou, Weihong Xu, Ashish Venkat, Tajana Rosing, Kevin Skadron
This work seeks to leverage Processing-with-storage-technology (PWST) to accelerate a key bioinformatics kernel called k-mer counting, which involves processing large files of sequence data on the disk to build a histogram of fixed-size genome sequence substrings and thereby entails prohibitively high I/O overhead. In particular, this work proposes a set of accelerator designs called Abakus that offer
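As background, the k-mer counting kernel itself is simple on the host side; the I/O volume of real genome files is what motivates in-storage designs like Abakus. A minimal sketch with an illustrative sequence:

```python
# k-mer counting: histogram of all length-k substrings of a sequence.
from collections import Counter

def count_kmers(seq, k):
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

counts = count_kmers("ACGACGACG", 3)
print(counts["ACG"], counts["CGA"], counts["GAC"])  # -> 3 2 2
```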
-
WA-Zone: Wear-Aware Zone Management Optimization for LSM-Tree on ZNS SSDs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-18 Linbo Long, Shuiyong He, Jingcheng Shen, Renping Liu, Zhenhua Tan, Congming Gao, Duo Liu, Kan Zhong, Yi Jiang
ZNS SSDs divide the storage space into sequential-write zones, reducing costs of DRAM utilization, garbage collection, and over-provisioning. The sequential-write feature of zones is well-suited for LSM-based databases, where random writes are organized into sequential writes to improve performance. However, the current compaction mechanism of LSM-tree results in widely varying access frequencies (i
-
QoS-pro: A QoS-enhanced Transaction Processing Framework for Shared SSDs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-18 Hao Fan, Yiliang Ye, Shadi Ibrahim, Zhuo Huang, Xingru Li, Weibin Xue, Song Wu, Chen Yu, Xuanhua Shi, Hai Jin
Solid State Drives (SSDs) are widely used in data-intensive scenarios due to their high performance and decreasing cost. However, in shared environments, concurrent workloads can interfere with each other, leading to a violation of Quality of Service (QoS). While QoS mechanisms like fairness guarantees and latency constraints have been integrated into SSDs, existing transaction processing frameworks
-
SAC: An Ultra-Efficient Spin-based Architecture for Compressed DNNs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-19 Yunping Zhao, Sheng Ma, Heng Liu, Libo Huang, Yi Dai
Deep Neural Networks (DNNs) have achieved great progress in academia and industry, but they have become computationally and memory intensive as network depth increases. Previous designs seek breakthroughs at the software and hardware levels to mitigate these challenges. At the software level, neural network compression techniques have effectively reduced network scale and energy consumption. However
-
Efficient Cross-platform Multiplexing of Hardware Performance Counters via Adaptive Grouping ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-19 Tong-Yu Liu, Jianmei Guo, Bo Huang
Collecting sufficient microarchitecture performance data is essential for performance evaluation and workload characterization. There are many events to be monitored in a modern processor while only a few hardware performance monitoring counters (PMCs) can be used, so multiplexing is commonly adopted. However, inefficiency commonly exists in state-of-the-art profiling tools when grouping events for
-
QuCloud+: A Holistic Qubit Mapping Scheme for Single/Multi-programming on 2D/3D NISQ Quantum Computers ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-18 Lei Liu, Xinglei Dou
Qubit mapping for NISQ superconducting quantum computers is essential to fidelity and resource utilization. Existing qubit mapping schemes face challenges, e.g., crosstalk, SWAP overheads, and diverse device topologies, leading to qubit resource underutilization and low fidelity in computing results. This article introduces QuCloud+, a new qubit mapping scheme that tackles these challenges. QuCloud+
-
ISP Agent: A Generalized In-storage-processing Workload Offloading Framework by Providing Multiple Optimization Opportunities ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-19 Seokwon Kang, Jongbin Kim, Gyeongyong Lee, Jeongmyung Lee, Jiwon Seo, Hyungsoo Jung, Yong Ho Song, Yongjun Park
As solid-state drives (SSDs) with sufficient computing power have recently become the dominant devices in modern computer systems, in-storage processing (ISP), which processes data within the storage without transferring it to the host memory, is being utilized in various emerging applications. The main challenge of ISP is to deliver storage data to the offloaded workload. This is difficult because
-
COWS for High Performance: Cost Aware Work Stealing for Irregular Parallel Loop ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-19 Prasoon Mishra, V. Krishna Nandivada
Parallel libraries such as OpenMP distribute the iterations of parallel-for-loops among the threads, using a programmer-specified scheduling policy. While the existing scheduling policies perform reasonably well in the context of balanced workloads, in computations that involve highly imbalanced workloads it is extremely non-trivial to obtain an efficient distribution of work (even using non-static
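The work-stealing idea behind the title can be sketched with per-worker deques; this toy sequential simulation is illustrative only and ignores the cost-awareness COWS adds:

```python
# Toy work-stealing for a parallel loop: each worker owns a deque of
# iterations and steals from a random victim's tail when its own runs dry.
import random
from collections import deque

def run_with_stealing(num_workers, iterations, work):
    items = list(iterations)
    deques = [deque() for _ in range(num_workers)]
    for i, it in enumerate(items):                 # static initial split
        deques[i % num_workers].append(it)
    done, rng = [], random.Random(0)
    while len(done) < len(items):
        for w in range(num_workers):
            if deques[w]:                          # run own work first
                done.append(work(deques[w].popleft()))
            else:                                  # otherwise try one steal
                victim = rng.randrange(num_workers)
                if victim != w and deques[victim]:
                    deques[w].append(deques[victim].pop())

    return done

out = run_with_stealing(4, range(10), lambda i: i * i)
assert sorted(out) == [i * i for i in range(10)]   # every iteration ran once
```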
-
Hardware-hardened Sandbox Enclaves for Trusted Serverless Computing ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-18 Joongun Park, Seunghyo Kang, Sanghyeon Lee, Taehoon Kim, Jongse Park, Youngjin Kwon, Jaehyuk Huh
In cloud-based serverless computing, an application consists of multiple functions provided by mutually distrusting parties. For secure serverless computing, the hardware-based trusted execution environment (TEE) can provide strong isolation among functions. However, not only protecting each function from the host OS and other functions, but also protecting the host system from the functions, is critical
-
Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-18 Tyler Allen, Bennett Cooper, Rong Ge
The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such abstraction and offers a functionally equivalent
-
Rcmp: Reconstructing RDMA-Based Memory Disaggregation via CXL ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2024-01-19 Zhonghua Wang, Yixing Guo, Kai Lu, Jiguang Wan, Daohui Wang, Ting Yao, Huatao Wu
Memory disaggregation is a promising architecture for modern datacenters that separates compute and memory resources into independent pools connected by ultra-fast networks, which can improve memory utilization, reduce cost, and enable elastic scaling of compute and memory resources. However, existing memory disaggregation solutions based on remote direct memory access (RDMA) suffer from high latency
-
Exploring Data Layout for Sparse Tensor Times Dense Matrix on GPUs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-28 Khalid Ahmad, Cris Cecka, Michael Garland, Mary Hall
An important sparse tensor computation is sparse-tensor-dense-matrix multiplication (SpTM), which is used in tensor decomposition and applications. SpTM is a multi-dimensional analog to sparse-matrix-dense-matrix multiplication (SpMM). In this paper, we employ a hierarchical tensor data layout that can unfold a multidimensional tensor to derive a 2D matrix, making it possible to compute SpTM using
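The unfolding idea in the abstract, flattening a sparse 3-D tensor into a 2-D matrix so a standard SpMM kernel applies, can be sketched in COO form; the index layout here is one illustrative choice:

```python
# Unfold a sparse I x J x K tensor along mode 0: nonzero (i, j, k) maps to
# row i, column j*K + k of an I x (J*K) sparse matrix.
def unfold_mode0(coo, J, K):
    # coo: list of (i, j, k, value) nonzeros.
    return [(i, j * K + k, v) for (i, j, k, v) in coo]

def spmm(coo2d, dense, n_rows):
    # result = sparse_matrix @ dense, with dense given as a list of rows.
    n_cols = len(dense[0])
    out = [[0] * n_cols for _ in range(n_rows)]
    for r, c, v in coo2d:
        for t in range(n_cols):
            out[r][t] += v * dense[c][t]
    return out

coo = [(0, 0, 1, 2.0), (1, 1, 0, 3.0)]   # 2 x 2 x 2 sparse tensor
mat = unfold_mode0(coo, J=2, K=2)         # nonzeros of a 2 x 4 matrix
dense = [[1], [1], [1], [1]]              # 4 x 1 dense matrix
print(spmm(mat, dense, n_rows=2))         # -> [[2.0], [3.0]]
```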
-
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-28 Can Firtina, Kamlesh Pillai, Gurpreet S. Kalsi, Bharathwaj Suresh, Damla Senol Cali, Jeremie S. Kim, Taha Shahroodi, Meryem Banu Cavlak, Joël Lindegger, Mohammed Alser, Juan Gómez Luna, Sreenivas Subramoney, Onur Mutlu
Profile hidden Markov models (pHMMs) are widely employed in various bioinformatics applications to identify similarities between biological sequences, such as DNA or protein sequences. In pHMMs, sequences are represented as graph structures, where states and edges capture modifications (i.e., insertions, deletions, and substitutions) by assigning probabilities to them. These probabilities are subsequently
-
Improving Utilization of Dataflow Unit for Multi-Batch Processing ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-18 Zhihua Fan, Wenming Li, Zhen Wang, Yu Yang, Xiaochun Ye, Dongrui Fan, Ninghui Sun, Xuejun An
Dataflow architectures can achieve much better performance and higher efficiency than general-purpose cores, approaching the performance of a specialized design while retaining programmability. However, advanced application scenarios place higher demands on the hardware in terms of cross-domain and multi-batch processing. In this paper, we propose a unified scale-vector architecture that can work in
-
DAG-Order: An Order-Based Dynamic DAG Scheduling for Real-Time Networks-on-Chip ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-15 Peng Chen, Hui Chen, Weichen Liu, Linbo Long, Wanli Chang, Nan Guan
With the high-performance requirements of safety-critical real-time tasks, many-core platforms with high parallelism are widely utilized, where network-on-chip (NoC) is generally employed for inter-core communication due to its scalability and high efficiency. Unfortunately, NoCs suffer large timing uncertainties from both the highly parallel architecture and the distributed scheduling
-
Critical Data Backup with Hybrid Flash-Based Consumer Devices ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-15 Longfei Luo, Dingcui Yu, Yina Lv, Liang Shi
Hybrid flash-based storage constructed with high-density, low-cost flash memory has become increasingly popular in consumer devices over the last decade. However, its poor reliability is one of the major concerns. To protect critical data and guarantee user experience, some methods have been proposed to improve the reliability of consumer devices with non-hybrid flash storage. However
-
JiuJITsu: Removing Gadgets with Safe Register Allocation for JIT Code Generation ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-15 Zhang Jiang, Ying Chen, Xiaoli Gong, Jin Zhang, Wenwen Wang, Pen-Chung Yew
Code-reuse attacks have the capability to craft malicious instructions from small code fragments, commonly referred to as “gadgets.” These gadgets are generated by JIT (Just-In-Time) engines as integral components of native instructions, with the flexibility to be embedded in various fields, including Displacement. In this article, we introduce a novel approach for potential gadget insertion, achieved
-
Autovesk: Automatic Vectorized Code Generation from Unstructured Static Kernels Using Graph Transformations ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-15 Hayfa Tayeb, Ludovic Paillat, Bérenger Bramas
Leveraging the SIMD capability of modern CPU architectures is mandatory to take full advantage of their increased performance. To exploit this capability, binary executables must be vectorized, either manually by developers or automatically by a tool. For this reason, the compilation research community has developed several strategies for transforming scalar code into a vectorized implementation. However
-
MicroProf: Code-level Attribution of Unnecessary Data Transfer in Microservice Applications ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Syed Salauddin Mohammad Tariq, Lance Menard, Pengfei Su, Probir Roy
The microservice architecture style has gained popularity due to its fault isolation, ease of scaling applications, and developer agility. However, writing applications in the microservice design style has its challenges. Due to their loosely coupled nature, services communicate with one another through standard communication APIs. This incurs significant overhead in the application due to communication
-
gPPM: A Generalized Matrix Operation and Parallel Algorithm to Accelerate the Encoding/Decoding Process of Erasure Codes ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Shiyi Li, Qiang Cao, Shenggang Wan, Wen Xia, Changsheng Xie
Erasure codes are widely deployed in modern storage systems, leading to frequent usage of their encoding/decoding operations. The encoding/decoding process for erasure codes is generally carried out using the parity-check matrix approach. However, this approach is serial and computationally expensive, mainly due to dealing with matrix operations, which results in low encoding/decoding performance.
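For background on the encoding the abstract discusses, the simplest matrix view is single XOR parity over GF(2), where each parity byte is a linear combination of the data bytes; real deployments (and gPPM's target codes) use general erasure codes such as Reed-Solomon over larger Galois fields:

```python
# XOR parity as a degenerate erasure code: parity = XOR of all data chunks.
from functools import reduce

def encode_parity(chunks):
    # Byte-wise XOR across equal-length chunks (GF(2) sum of each column).
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))

def recover(surviving, parity):
    # Any single lost data chunk is the XOR of the parity with the survivors.
    return encode_parity(surviving + [parity])

data = [b"abcd", b"efgh", b"ijkl"]
parity = encode_parity(data)
assert recover([data[0], data[2]], parity) == data[1]   # lost chunk rebuilt
```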
-
PARALiA: A Performance Aware Runtime for Auto-tuning Linear Algebra on Heterogeneous Systems ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Petros Anastasiadis, Nikela Papadopoulou, Georgios Goumas, Nectarios Koziris, Dennis Hoppe, Li Zhong
Dense linear algebra operations appear very frequently in high-performance computing (HPC) applications, rendering their performance crucial to achieve optimal scalability. As many modern HPC clusters contain multi-GPU nodes, BLAS operations are frequently offloaded on GPUs, necessitating the use of optimized libraries to ensure good performance. Unfortunately, multi-GPU systems are accompanied by
-
RACE: An Efficient Redundancy-aware Accelerator for Dynamic Graph Neural Network ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Hui Yu, Yu Zhang, Jin Zhao, Yujian Liao, Zhiying Huang, Donghao He, Lin Gu, Hai Jin, Xiaofei Liao, Haikun Liu, Bingsheng He, Jianhui Yue
Dynamic Graph Neural Network (DGNN) has recently attracted a significant amount of research attention from various domains, because most real-world graphs are inherently dynamic. Despite many research efforts, for DGNN, existing hardware/software solutions still suffer significantly from redundant computation and memory access overhead, because they need to irregularly access and recompute all graph
-
Mapi-Pro: An Energy Efficient Memory Mapping Technique for Intermittent Computing ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Satya Jaswanth Badri, Mukesh Saini, Neeraj Goel
Battery-less technology evolved to replace battery usage in space, deep mines, and other environments to reduce cost and pollution. Non-volatile memory (NVM) based processors were explored for saving the system state during a power failure. Such devices have a small SRAM and large non-volatile memory. To make the system energy efficient, we need to use SRAM efficiently. So we must select some portions
-
Multiply-and-Fire: An Event-Driven Sparse Neural Network Accelerator ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Miao Yu, Tingting Xiang, Venkata Pavan Kumar Miriyala, Trevor E. Carlson
Deep neural network inference has become a vital workload for many systems from edge-based computing to data centers. To reduce the performance and power requirements for deep neural networks (DNNs) running on these systems, pruning is commonly used as a way to maintain most of the accuracy of the system while significantly reducing the workload requirements. Unfortunately, accelerators designed for
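The pruning step the abstract refers to is commonly magnitude-based; a minimal sketch (threshold selection and tie handling vary across real implementations):

```python
# Magnitude pruning: zero the k smallest-magnitude weights for a target
# sparsity. Ties at the threshold may zero slightly more than k weights.
def magnitude_prune(weights, sparsity):
    k = int(len(weights) * sparsity)       # number of weights to drop
    threshold = sorted(abs(w) for w in weights)[k - 1] if k else None
    return [0.0 if k and abs(w) <= threshold else w for w in weights]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.02]
print(magnitude_prune(w, 0.5))  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```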
-
FlowPix: Accelerating Image Processing Pipelines on an FPGA Overlay using a Domain Specific Compiler ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-12-14 Ziaul Choudhury, Anish Gulati, Suresh Purini
The exponential performance growth guaranteed by Moore’s law has started to taper in recent years. At the same time, emerging applications like image processing demand heavy computational performance. These factors inevitably lead to the emergence of domain-specific accelerators (DSA) to fill the performance void left by conventional architectures. FPGAs are rapidly evolving towards becoming an alternative
-
Extension VM: Interleaved Data Layout in Vector Memory ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-11-07 Dunbo Zhang, Qingjie Lang, Ruoxi Wang, Li Shen
Vector architectures are widely employed in processors for neural networks, signal processing, and high-performance computing; however, their performance is limited by inefficient column-major memory access. The column-major access limitation originates from the unsuitable mapping of multidimensional data structures to two-dimensional vector memory spaces. In addition, the traditional data layout
-
High-performance Deterministic Concurrency Using Lingua Franca ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-26 Christian Menard, Marten Lohstroh, Soroush Bateni, Matthew Chorlian, Arthur Deng, Peter Donovan, Clément Fournier, Shaokai Lin, Felix Suchert, Tassilo Tanneberger, Hokeun Kim, Jeronimo Castrillon, Edward A. Lee
Actor frameworks and similar reactive programming techniques are widely used for building concurrent systems. They promise to be efficient and scale well to a large number of cores or nodes in a distributed system. However, they also expose programmers to nondeterminism, which often makes implementations hard to understand, debug, and test. The recently proposed reactor model is a promising alternative
-
Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-26 Jiangsu Du, Jiazhi Jiang, Jiang Zheng, Hongbin Zhang, Dan Huang, Yutong Lu
Transformer models have emerged as a leading approach in the field of natural language processing (NLP) and are increasingly being deployed in production environments. Graphics processing units (GPUs) have become a popular choice for transformer deployment and often rely on the batch processing technique to ensure high hardware performance. Nonetheless, the current practice for transformer inference
-
A Compilation Tool for Computation Offloading in ReRAM-based CIM Architectures ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-26 Hai Jin, Bo Lei, Haikun Liu, Xiaofei Liao, Zhuohui Duan, Chencheng Ye, Yu Zhang
Computing-in-Memory (CIM) architectures using Non-volatile Memories (NVMs) have emerged as a promising way to address the “memory wall” problem in traditional Von Neumann architectures. CIM accelerators can perform arithmetic or Boolean logic operations in NVMs by fully exploiting their high parallelism for bit-wise operations. These accelerators are often used in cooperation with general-purpose processors
-
Smart-DNN+: A Memory-efficient Neural Networks Compression Framework for the Model Inference ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-26 Donglei Wu, Weihao Yang, Xiangyu Zou, Wen Xia, Shiyi Li, Zhenbo Hu, Weizhe Zhang, Binxing Fang
Deep Neural Networks (DNNs) have achieved remarkable success in various real-world applications. However, running a Deep Neural Network (DNN) typically requires hundreds of megabytes of memory footprints, making it challenging to deploy on resource-constrained platforms such as mobile devices and IoT. Although mainstream DNNs compression techniques such as pruning, distillation, and quantization can
-
At the Locus of Performance: Quantifying the Effects of Copious 3D-Stacked Cache on HPC Workloads ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-25 Jens Domke, Emil Vatai, Balazs Gerofi, Yuetsu Kodama, Mohamed Wahib, Artur Podobas, Sparsh Mittal, Miquel Pericàs, Lingqi Zhang, Peng Chen, Aleksandr Drozd, Satoshi Matsuoka
Over the last three decades, innovations in the memory subsystem were primarily targeted at overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities in future HPC-focused processors, particularly by 3D-stacked SRAM. First, we propose a method
-
ULEEN: A Novel Architecture for Ultra Low-Energy Edge Neural Networks ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-25 Zachary Susskind, Aman Arora, Igor D. S. Miranda, Alan T. L. Bacellar, Luis A. Q. Villon, Rafael F. Katopodis, Leandro S. de Araújo, Diego L. C. Dutra, Priscila M. V. Lima, Felipe M. G. França, Mauricio Breternitz Jr., Lizy K. John
"Extreme edge" devices such as smart sensors are a uniquely challenging environment for the deployment of machine learning. The tiny energy budgets of these devices lie beyond what is feasible for conventional deep neural networks, particularly in high-throughput scenarios, requiring us to rethink how we approach edge inference. In this work, we propose ULEEN, a model and FPGA-based accelerator architecture
-
Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-25 Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei
In recent years, benefiting from the increase in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models using accelerators such as GPUs often requires large volumes of data to be iteratively transferred from NVMe SSDs to GPU memory. Much recent work has focused on data transfer during the pre-processing
-
Characterizing Multi-Chip GPU Data Sharing ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-20 Shiqing Zhang, Mahmood Naderan-Tahan, Magnus Jahre, Lieven Eeckhout
Multi-chip GPU systems are critical to scaling performance beyond a single GPU chip for a wide variety of important emerging applications. A key challenge for multi-chip GPUs, though, is how to overcome the bandwidth gap between inter-chip and intra-chip communication. Accesses to shared data, i.e., data accessed by multiple chips, pose a major performance challenge as they incur remote memory accesses
-
DxPU: Large Scale Disaggregated GPU Pools in the Datacenter ACM Trans. Archit. Code Optim. (IF 1.6) Pub Date : 2023-10-05 Bowen He, Xiao Zheng, Yuan Chen, Weinan Li, Yajin Zhou, Xin Long, Pengcheng Zhang, Xiaowei Lu, Linquan Jiang, Qiang Liu, Dennis Cai, Xiantao Zhang
The rapid adoption of AI and the convenience offered by cloud services have resulted in growing demand for GPUs in the cloud. Generally, GPUs are physically attached to host servers as PCIe devices. However, the fixed assembly combination of host servers and GPUs is extremely inefficient in resource utilization, upgrade, and maintenance. Due to these issues, the GPU disaggregation technique has been