Abstract
Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit locality in data accesses. L1D misses are costly for GPUs for two reasons. First, L1D misses consume considerable energy because they must access the L2 cache (L2) over an on-chip network, and the off-chip DRAM when they also miss in the L2. Second, L1D misses impose a performance overhead when the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from memory. Unfortunately, because the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even when the data is currently available in the L1D of another SM. Our goal is to service L1D read misses via other SMs, as much as possible, to cut down on costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data-sharing mechanism, called Cross-Core Data Sharing (CCDS). CCDS employs a predictor to estimate whether the required cache block exists in another SM's L1D. If the block is predicted to exist in another SM's L1D, CCDS fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that CCDS improves average energy and performance by 1.30× and 1.20×, respectively, compared to the baseline GPU. Compared to the state-of-the-art data-sharing mechanism, CCDS improves average energy and performance by 1.37× and 1.11×, respectively.
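To illustrate the kind of lookup the abstract describes, the sketch below models a simple per-block sharing predictor consulted on an L1D read miss. This is a minimal, hypothetical software model, not the paper's hardware design: the class name `SharingPredictor`, the table geometry, the tag hashing, and the single-owner-per-entry assumption are all illustrative choices rather than CCDS specifics.

```cpp
// Hypothetical sketch of a sharing predictor consulted on an L1D read miss.
// All structure sizes and policies below are assumptions for illustration.
#include <array>
#include <cstdint>
#include <iostream>
#include <optional>

constexpr int kPredictorEntries = 256;  // assumed table size

struct PredictorEntry {
    bool     valid = false;
    uint64_t tag   = 0;   // block-address tag
    uint8_t  owner = 0;   // SM predicted to hold the block in its L1D
};

class SharingPredictor {
public:
    // Record that `sm` filled `block_addr` into its L1D.
    void record_fill(uint64_t block_addr, uint8_t sm) {
        table_[index_of(block_addr)] = {true, tag_of(block_addr), sm};
    }

    // On an L1D miss, predict which remote SM (if any) may hold the block.
    std::optional<uint8_t> predict_owner(uint64_t block_addr, uint8_t requester) const {
        const PredictorEntry& e = table_[index_of(block_addr)];
        if (e.valid && e.tag == tag_of(block_addr) && e.owner != requester)
            return e.owner;
        return std::nullopt;  // fall back to the L2 / off-chip DRAM
    }

private:
    std::array<PredictorEntry, kPredictorEntries> table_{};

    static uint64_t tag_of(uint64_t block_addr)   { return block_addr >> 8; }
    static size_t   index_of(uint64_t block_addr) { return block_addr % kPredictorEntries; }
};

int main() {
    SharingPredictor predictor;
    predictor.record_fill(/*block_addr=*/0x1F40, /*sm=*/3);  // SM 3 fills a block

    // SM 7 later misses on the same block: the predictor suggests probing
    // SM 3's L1D before issuing a costly L2/DRAM request.
    if (auto owner = predictor.predict_owner(0x1F40, /*requester=*/7))
        std::cout << "probe SM " << int(*owner) << " for the block\n";
    else
        std::cout << "go to L2/DRAM\n";
}
```

Under these assumptions, a misprediction simply degenerates into the baseline behavior (an L2 or DRAM access), which is why the mechanism can be applied opportunistically.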