wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Cache performance is a critical design constraint for modern many-core systems. Because the cache largely operates as a “black box”, it is difficult for software developers to reason about cache behavior and to match running software to the underlying hardware. Supporting code optimization therefore requires understanding and characterizing that behavior. While cache performance characterization has been studied extensively on traditional x86 architectures, little work examines the cache implementations of emerging ARMv8-based many-cores. This paper presents a comprehensive evaluation of the cache architecture design of three representative ARMv8 many-cores: Phytium 2000+, ThunderX2, and Kunpeng 920 (KP920). To this end, we develop wrBench, a micro-benchmark suite that measures the realized latency and bandwidth of caches at different levels of the memory hierarchy during core-to-core communication. Our evaluation reports inter-core latency and bandwidth for different cache levels and coherency states on the three processors, with the quantitative data presented in tables. By analyzing these data, we characterize the caches and coherency protocols of each processor. We also provide discussion and guidelines for optimizing memory access on ARMv8 many-cores.

Author information

Corresponding author

Correspondence to Jian-Bin Fang.

About this article

Cite this article

Gao, WR., Fang, JB., Huang, C. et al. wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems. J. Comput. Sci. Technol. 38, 1323–1338 (2023). https://doi.org/10.1007/s11390-021-1251-x

  • DOI: https://doi.org/10.1007/s11390-021-1251-x
