Abstract
Persistent memory allocation is a fundamental building block for developing high-performance and in-memory applications. Existing persistent memory allocators suffer from many performance issues. First, their poor heap metadata management can introduce repeated cache line flushes and small random accesses to persistent memory. Second, they use static slab segregation, which dramatically increases memory consumption when allocation request sizes change. Third, they are not aware of NUMA effects, leading to remote persistent memory accesses during allocation and deallocation. In this paper, we design a novel allocator, named PMAlloc, that solves all of these issues simultaneously. (1) PMAlloc eliminates cache line reflushes by mapping contiguous data blocks in slabs to interleaved metadata entries stored in different cache lines. (2) It writes small metadata units to a persistent bookkeeping log in a sequential pattern to remove random heap metadata accesses in persistent memory. (3) Instead of using static slab segregation, it supports slab morphing, which allows slabs to be transformed between size classes to significantly improve slab usage. (4) It uses a local-first allocation policy to avoid allocating remote memory blocks, and it supports a two-phase deallocation mechanism, consisting of recording and synchronization, to minimize the number of remote memory accesses during deallocation. PMAlloc is complementary to existing consistency models. Results on 6 benchmarks demonstrate that PMAlloc improves on the performance of state-of-the-art persistent memory allocators by up to 6.4x and 57x for small and large allocations, respectively. PMAlloc with NUMA optimizations brings a 2.9x speedup in multi-socket evaluation and is up to 36x faster than other persistent memory allocators. Using PMAlloc reduces memory usage by up to 57.8%. In addition, we integrate PMAlloc into a persistent FPTree. Compared to the state-of-the-art allocators, PMAlloc improves the performance of this application by up to 3.1x.
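The first mechanism, mapping contiguous blocks to interleaved metadata entries, can be illustrated with a small model. The sketch below is a minimal Python illustration, assuming a slab of 64 fixed-size blocks, 8-byte metadata entries, and 64-byte cache lines. The functions `naive_entry`, `interleaved_entry`, `cache_line_of`, and `count_reflushes`, and the row/column interleaving formula, are hypothetical and are not taken from PMAlloc's implementation. The model counts how often sequentially allocated blocks would flush the same metadata cache line that was flushed immediately before.

```python
from typing import Sequence

CACHE_LINE_BYTES = 64  # assumed cache-line size
ENTRY_BYTES = 8        # assumed (hypothetical) metadata entry size


def naive_entry(block: int) -> int:
    """Hypothetical baseline: block i uses entry i, so neighbours share a line."""
    return block


def interleaved_entry(block: int, num_blocks: int) -> int:
    """Hypothetical interleaving: spread consecutive blocks over different lines.

    Metadata is viewed as a grid with num_lines rows (one per cache line) and
    entries_per_line columns. Block i goes to row i % num_lines and column
    i // num_lines, so blocks i and i + 1 land in different rows (cache lines).
    """
    entries_per_line = CACHE_LINE_BYTES // ENTRY_BYTES
    num_lines = -(-num_blocks // entries_per_line)  # ceiling division
    return (block % num_lines) * entries_per_line + block // num_lines


def cache_line_of(entry: int) -> int:
    """Index of the metadata cache line that holds a given entry."""
    return entry * ENTRY_BYTES // CACHE_LINE_BYTES


def count_reflushes(lines: Sequence[int]) -> int:
    """Count flushes that target the same cache line as the flush just before."""
    return sum(1 for prev, cur in zip(lines, lines[1:]) if prev == cur)


num_blocks = 64  # e.g. a slab of 64 blocks
naive = [cache_line_of(naive_entry(b)) for b in range(num_blocks)]
inter = [cache_line_of(interleaved_entry(b, num_blocks)) for b in range(num_blocks)]
print(count_reflushes(naive), count_reflushes(inter))
```

Under these assumptions, the naive layout makes 56 of the 63 back-to-back flushes hit the line that was just flushed, while the interleaved layout makes none, even though the total number of flushes is unchanged. The mapping stays one-to-one because each block's (row, column) pair is unique.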
Index Terms
- PMAlloc: A Holistic Approach to Improving Persistent Memory Allocation