Abstract
Predictable latency on flash storage is a long-pursued goal, yet unpredictability persists because of unavoidable disturbance from many well-known SSD internal activities. To combat this issue, the recent NVMe I/O Determinism (IOD) interface advocates host-level control over SSD internal management tasks. Although the interface is promising, challenges remain in how to exploit it for truly predictable performance.
We present IODA, an I/O deterministic flash array design built on small but powerful extensions to the IOD interface for easy deployment. IODA exploits data redundancy in the context of IOD to provide a strong latency predictability contract. In IODA, SSDs are expected to quickly fail an I/O on purpose so that the host can keep I/Os predictable through proactive data reconstruction. To handle concurrent internal operations, IODA introduces busy-remaining-time exposure and a predictable-latency-window formulation that together guarantee predictable data reconstruction. Overall, IODA adds only five new fields to the NVMe interface and a small modification to the flash firmware, while keeping most of the complexity in the host OS. Our evaluation shows that IODA improves the 95th–99.99th percentile latencies by up to 75×. IODA is also the closest to the ideal, no-disturbance case when compared with seven state-of-the-art approaches: preemption, suspension, GC coordination, partitioning, tiny-tail flash controller, prediction, and proactive approaches.
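The fail-fast and reconstruction idea can be illustrated with a minimal sketch. It assumes a single-parity (RAID-5-like) stripe and a toy in-memory device model; the names `SimSSD`, `busy_remaining_us`, `build_array`, and `array_read` are hypothetical and are not taken from IODA or the NVMe specification. A busy device rejects a fail-fast read, and the host either rebuilds the chunk from its peers by XOR or, when a peer is busy as well, uses the exposed busy remaining time to choose the shorter wait. The predictable-latency-window formulation is not modeled here.

```python
from dataclasses import dataclass
from functools import reduce
from typing import List, Optional, Tuple


@dataclass
class SimSSD:
    """Toy device (hypothetical): one chunk per stripe, possibly busy internally."""
    chunks: List[bytes]
    busy_remaining_us: int = 0  # assumed field: time until internal work finishes

    def read(self, stripe: int, fail_fast: bool) -> Optional[bytes]:
        # With fail_fast set, a busy device rejects the read instead of queueing it.
        if fail_fast and self.busy_remaining_us > 0:
            return None
        # Without fail_fast, the caller is assumed to have waited for the device.
        return self.chunks[stripe]


def xor_chunks(chunks: List[bytes]) -> bytes:
    """Byte-wise XOR of equally sized chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def build_array(data_chunks: List[bytes]) -> List[SimSSD]:
    """Place one stripe across len(data_chunks) data devices plus one parity device."""
    parity = xor_chunks(data_chunks)
    return [SimSSD([chunk]) for chunk in data_chunks + [parity]]


def array_read(devices: List[SimSSD], stripe: int, target: int) -> Tuple[bytes, str, int]:
    """Return (data, path, waited_us) for a read of `target`'s chunk in `stripe`."""
    data = devices[target].read(stripe, fail_fast=True)
    if data is not None:
        return data, "direct", 0

    # The target failed fast. Rebuilding needs every peer, so the cost of the
    # reconstruction path is the longest busy remaining time among the peers.
    peers = [dev for i, dev in enumerate(devices) if i != target]
    peer_wait = max(dev.busy_remaining_us for dev in peers)
    target_wait = devices[target].busy_remaining_us

    if peer_wait < target_wait:
        rebuilt = xor_chunks([dev.read(stripe, fail_fast=False) for dev in peers])
        return rebuilt, "reconstructed", peer_wait
    # Waiting for the target itself is no worse than waiting for the peers.
    return devices[target].read(stripe, fail_fast=False), "waited", target_wait
```

In this sketch, a read to a busy device whose peers are all idle is served by reconstruction with no waiting, which is the case the fail-fast contract targets. The busy remaining time only matters when more than one device in the stripe is busy at once.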