Abstract
The front-end bottleneck is a well-established problem in server workloads, owing to their deep software stacks and large instruction footprints. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between metadata storage cost and performance. Temporal Stream prefetchers deliver high performance but require a prohibitive amount of metadata to hold the temporal history. BTB-directed prefetchers, by contrast, incur low cost by reusing the existing in-core branch prediction structures, but fall short on performance because the BTB cannot capture the massive control flow working set of server applications. This work overcomes the fundamental limitation of BTB-directed prefetchers: capturing a large control flow working set within an affordable BTB storage budget. We re-envision the BTB organization to maximize its control flow coverage, based on the observation that an application’s instruction footprint can be mapped as a combination of its unconditional branch working set and, for each unconditional branch, a spatial encoding of the cache blocks around the branch target. Capturing this map of the application’s instruction footprint in the BTB enables highly effective BTB-directed prefetching that outperforms state-of-the-art prefetchers by up to 10% for an equivalent storage budget.
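To make the idea of a "spatial encoding of the cache blocks around the branch target" more concrete, the following Python sketch shows one way such a footprint could be represented as a bit vector and later expanded into prefetch addresses. It is only an illustration under stated assumptions: the 64-byte block size, the window of 2 blocks before and 5 blocks after the target, and the function names `encode_footprint` and `prefetch_candidates` are all hypothetical and are not taken from the abstract.

```python
# Illustrative sketch only; all sizes and names below are assumptions.
BLOCK_BYTES = 64     # assumed cache-block size
BLOCKS_BEFORE = 2    # assumed number of blocks tracked before the target block
BLOCKS_AFTER = 5     # assumed number of blocks tracked after the target block
WINDOW = BLOCKS_BEFORE + BLOCKS_AFTER + 1


def block_index(addr: int) -> int:
    """Map a byte address to its cache-block number."""
    return addr // BLOCK_BYTES


def encode_footprint(target: int, touched_addrs) -> int:
    """Build a bit vector of blocks touched near a branch target.

    Bit i represents block (target_block - BLOCKS_BEFORE + i).
    Addresses outside the window are ignored.
    """
    base = block_index(target) - BLOCKS_BEFORE
    bits = 0
    for addr in touched_addrs:
        offset = block_index(addr) - base
        if 0 <= offset < WINDOW:
            bits |= 1 << offset
    return bits


def prefetch_candidates(target: int, bits: int):
    """Expand a footprint bit vector back into block-aligned addresses."""
    base = block_index(target) - BLOCKS_BEFORE
    return [(base + i) * BLOCK_BYTES
            for i in range(WINDOW)
            if (bits >> i) & 1]


# Hypothetical BTB entry: unconditional branch PC -> (target, footprint).
target = 0x1000
footprint = encode_footprint(target, [0x1000, 0x1040, 0x10C0, 0x0FC0, 0x2000])
btb = {0x0400: (target, footprint)}

# On a hit for the branch at 0x0400, the stored footprint yields prefetches.
hit_target, hit_bits = btb[0x0400]
print([hex(a) for a in prefetch_candidates(hit_target, hit_bits)])
```

In this toy run, the access to `0x2000` falls outside the assumed window and is dropped, so only four nearby blocks are recorded. The point of the sketch is that a handful of bits per unconditional branch can stand in for several cache-block addresses, which is the kind of compact mapping the abstract describes.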
Index Terms
- Shooting Down the Server Front-End Bottleneck