Abstract
Log-Structured Merge Key-Value stores (LSM KVs) are designed to offer good write performance by capturing client writes in memory and only later flushing them to storage. Flushed writes are subsequently compacted into a tree-like data structure on disk to improve read performance and to reduce storage space use. It has been widely documented that compactions severely hamper throughput, and various optimizations have successfully dealt with this problem. These techniques include, among others, rate-limiting flushes and compactions, selecting among candidate compactions for maximum effect, and limiting compactions to the highest level of the tree by so-called fragmented LSMs.
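The write path described above (buffer in memory, flush to sorted runs, compact runs) can be illustrated with a toy sketch; class and method names here are illustrative only and do not reflect any particular LSM KV implementation:

```python
import bisect

class TinyLSM:
    """Toy LSM write path: buffer writes in memory, flush them as sorted
    runs, and compact runs to bound read cost. Illustrative names only."""

    def __init__(self, memtable_limit=4):
        self.memtable = {}              # in-memory buffer for client writes
        self.runs = []                  # "on-disk" sorted runs, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Persist the memtable as a new sorted run (here: a sorted list).
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all runs into one, keeping the newest value for each key.
        merged = {}
        for run in reversed(self.runs):  # oldest first, so newer overwrites
            merged.update(dict(run))
        self.runs = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:         # newest data first
            return self.memtable[key]
        for run in self.runs:            # then newest run to oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

Without compaction, a read may probe every run; compaction trades write bandwidth now for fewer runs (and thus cheaper reads and less space) later, which is exactly the tension the optimizations above manage.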
In this article, we focus on latencies rather than throughput. We first document the fact that LSM KVs exhibit high tail latencies. The techniques that have been proposed for optimizing throughput do not address this issue and, in some cases, exacerbate it. The root cause of these high tail latencies is interference between client writes, flushes, and compactions. Another major cause of tail latency is the heterogeneous nature of the workloads in terms of operation mix and item sizes, whereby a few computationally heavier requests slow down the vast majority of smaller requests.
We introduce the notion of an Input/Output (I/O) bandwidth scheduler for an LSM-based KV store to reduce tail latency caused by interference of flushing and compactions and by workload heterogeneity. We explore three techniques as part of this I/O scheduler: (1) opportunistically allocating more bandwidth to internal operations during periods of low load, (2) prioritizing flushes and compactions at the lower levels of the tree, and (3) separating client requests by size and by data access path. SILK+ is a new open-source LSM KV that incorporates this notion of an I/O scheduler.
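The first two techniques can be sketched as a simple priority-based bandwidth allocator. This is a minimal illustration, not SILK+'s actual implementation; all names, units, and the fixed-floor policy are assumptions made for the sketch:

```python
import heapq
from dataclasses import dataclass, field

# Lower value = higher priority: flushes first, then low-level
# compactions, then compactions at higher levels of the tree.
FLUSH, LOW_COMPACT, HIGH_COMPACT = 0, 1, 2

@dataclass(order=True)
class IOTask:
    priority: int
    name: str = field(compare=False)
    mb_left: int = field(compare=False)   # remaining I/O for this task, in MB

class IOScheduler:
    """Sketch of an I/O bandwidth scheduler for internal LSM operations."""

    def __init__(self, total_bw_mbps):
        self.total_bw = total_bw_mbps
        self.tasks = []                    # min-heap ordered by priority

    def submit(self, task):
        heapq.heappush(self.tasks, task)

    def internal_budget(self, client_load_mbps, floor_mbps=10):
        # Opportunistic allocation: internal operations get whatever
        # bandwidth clients are not using, but never less than a small
        # floor so that flushes cannot starve under heavy client load.
        return max(self.total_bw - client_load_mbps, floor_mbps)

    def schedule_tick(self, client_load_mbps):
        # Grant this tick's internal budget to tasks in priority order:
        # flushes and low-level compactions run before high-level ones.
        budget = self.internal_budget(client_load_mbps)
        granted = []
        while budget > 0 and self.tasks:
            task = heapq.heappop(self.tasks)
            grant = min(budget, task.mb_left)
            task.mb_left -= grant
            budget -= grant
            granted.append((task.name, grant))
            if task.mb_left > 0:           # unfinished: requeue for later
                heapq.heappush(self.tasks, task)
        return granted
```

For example, with 200 MB/s of total bandwidth and clients consuming 150 MB/s, a pending memtable flush is served ahead of a queued high-level compaction, and the compaction only receives whatever budget remains. The third technique, separating client requests by size and access path, would sit in front of this scheduler and is not shown.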
Index Terms
- SILK+: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores Running Heterogeneous Workloads