research-article

Shooting Down the Server Front-End Bottleneck

Published: 04 January 2022

Abstract

The front-end bottleneck is a well-established problem in server workloads owing to their deep software stacks and large instruction footprints. Despite years of research into effective L1-I and BTB prefetching, state-of-the-art techniques force a trade-off between metadata storage cost and performance. Temporal Stream prefetchers deliver high performance but require a prohibitive amount of metadata to accommodate the temporal history. Meanwhile, BTB-directed prefetchers incur low cost by using the existing in-core branch prediction structures but fall short on performance due to the BTB’s inability to capture the massive control flow working set of server applications. This work overcomes the fundamental limitation of BTB-directed prefetchers: the difficulty of capturing a large control flow working set within an affordable BTB storage budget. We re-envision the BTB organization to maximize its control flow coverage by observing that an application’s instruction footprint can be mapped as a combination of its unconditional branch working set and, for each unconditional branch, a spatial encoding of the cache blocks around the branch target. Effectively capturing a map of the application’s instruction footprint in the BTB enables highly effective BTB-directed prefetching that outperforms state-of-the-art prefetchers by up to 10% for an equivalent storage budget.
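
The spatial encoding mentioned in the abstract can be pictured as a small bit vector attached to each unconditional-branch entry, with one bit per cache block near the branch target. The sketch below is only an illustration of that idea, not the organization evaluated in the article: the block size, the window of two blocks before and five blocks after the target, and the names `UncondEntry`, `record_access`, and `prefetch_candidates` are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List

BLOCK_BYTES = 64    # assumed cache block size
BLOCKS_BEFORE = 2   # assumed window: blocks tracked before the target block
BLOCKS_AFTER = 5    # assumed window: blocks tracked after the target block


def block_of(addr: int) -> int:
    """Return the cache block number that contains a byte address."""
    return addr // BLOCK_BYTES


@dataclass
class UncondEntry:
    """Hypothetical BTB entry for one unconditional branch."""
    target: int
    # Bit (offset + BLOCKS_BEFORE) is set when the block at that offset from
    # the target block was fetched; the target block itself starts out set.
    footprint: int = 1 << BLOCKS_BEFORE

    def record_access(self, addr: int) -> None:
        """Mark the block holding addr if it lies inside the tracked window."""
        offset = block_of(addr) - block_of(self.target)
        if -BLOCKS_BEFORE <= offset <= BLOCKS_AFTER:
            self.footprint |= 1 << (offset + BLOCKS_BEFORE)

    def prefetch_candidates(self) -> List[int]:
        """Block-aligned addresses to prefetch when this branch is predicted."""
        base = block_of(self.target)
        return [
            (base + off) * BLOCK_BYTES
            for off in range(-BLOCKS_BEFORE, BLOCKS_AFTER + 1)
            if (self.footprint >> (off + BLOCKS_BEFORE)) & 1
        ]


entry = UncondEntry(target=0x1000)
# The last address is far outside the window and is therefore not recorded.
for fetched in (0x0FC0, 0x1008, 0x10C4, 0x2000):
    entry.record_access(fetched)
print([hex(a) for a in entry.prefetch_candidates()])
```

With these assumed parameters, each entry spends eight bits to describe which nearby blocks were touched after the branch, rather than storing one entry per block or per conditional branch in that region.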



    • Published in

      ACM Transactions on Computer Systems, Volume 38, Issue 3-4
      November 2020
      92 pages
      ISSN: 0734-2071
      EISSN: 1557-7333
      DOI: 10.1145/3481705

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 4 January 2022
      • Accepted: 1 August 2021
      • Revised: 1 May 2021
      • Received: 1 October 2020

      Qualifiers

      • research-article
      • Refereed
