wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

Gao, Wan-Rong; Fang, Jian-Bin; Huang, Chun; Xu, Chuan-Fu; Wang, Zheng

doi:10.1007/s11390-021-1251-x

Regular Paper
Published: 30 November 2023

Volume 38, pages 1323–1338, (2023)

Journal of Computer Science and Technology

Wan-Rong Gao¹,
Jian-Bin Fang¹,
Chun Huang¹,
Chuan-Fu Xu¹ &
Zheng Wang²

Abstract

Cache performance is a critical design constraint for modern many-core systems. Because the cache often works as a “black box”, it is difficult for software developers to reason about cache behavior and match the running software to the underlying hardware. To better support code optimization, we need to understand and characterize the cache behavior. While cache performance characterization has been studied heavily on traditional x86 architectures, there is little work on understanding the cache implementations of emerging ARMv8-based many-cores. This paper presents a comprehensive study that evaluates the cache architecture design of three representative ARMv8 many-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite that measures the realized latency and bandwidth of caches at different levels of the memory hierarchy when performing core-to-core communication. Our evaluation provides inter-core latency and bandwidth for different cache levels and coherency states on the three processors, with the quantitative performance data presented in tables. By analyzing this data, we characterize the caches and coherency protocols of each processor. The paper also provides discussion and guidelines for optimizing memory access on ARMv8 many-cores.
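The abstract does not describe how wrBench takes its latency measurements. As a rough illustration only, latency micro-benchmarks commonly walk a "pointer-chasing" chain, in which each load depends on the result of the previous one and the access order is randomized so that hardware prefetchers cannot predict it. The Python sketch below builds such a chain with Sattolo's algorithm, which produces a permutation with a single cycle. All names here (`build_chase_chain`, `chase`, `time_per_access_ns`) are hypothetical and are not taken from wrBench. A real cache benchmark would be written in C or assembly with threads pinned to cores; in Python the interpreter overhead dominates, so the timing helper only shows the shape of the measurement.

```python
import random
import time


def build_chase_chain(n, seed=0):
    """Return a list nxt where nxt[i] is the index visited after i.

    Sattolo's algorithm yields a permutation made of one cycle of
    length n, so a walk touches every slot once before repeating.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, never i itself
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent accesses; return the final index."""
    idx = start
    for _ in range(steps):
        idx = nxt[idx]
    return idx


def time_per_access_ns(n, steps=100_000):
    """Average wall-clock time per dependent access, in nanoseconds.

    Illustrative only: this mostly measures Python overhead, not cache latency.
    """
    nxt = build_chase_chain(n)
    chase(nxt, n)  # warm-up pass over the whole chain
    t0 = time.perf_counter_ns()
    chase(nxt, steps)
    return (time.perf_counter_ns() - t0) / steps
```

In a native-code version of this idea, the working-set size `n` would be swept across the capacity of each cache level, and the chain would be written by one core and read by another to expose the cost of each coherency state.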

Author information

Authors and Affiliations

College of Computer Science, National University of Defense Technology, Changsha, 410073, China
Wan-Rong Gao, Jian-Bin Fang, Chun Huang & Chuan-Fu Xu
School of Computing, University of Leeds, Leeds, LS2 9JT, UK
Zheng Wang

Corresponding author

Correspondence to Jian-Bin Fang.

Supplementary Information

ESM 1

(PDF 704 kb)

About this article

Cite this article

Gao, WR., Fang, JB., Huang, C. et al. wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems. J. Comput. Sci. Technol. 38, 1323–1338 (2023). https://doi.org/10.1007/s11390-021-1251-x

Received: 31 December 2020
Accepted: 14 November 2021
Published: 30 November 2023
Issue Date: December 2023
DOI: https://doi.org/10.1007/s11390-021-1251-x
