Abstract
Recent GPUs support Peer-to-Peer Direct Memory Access (p2p) with fast peripheral devices such as NVMe SSDs, which removes the CPU from the data path between them and improves efficiency. Unfortunately, using p2p to access files is challenging: the low-level, non-standard interfaces it relies on bypass the OS file I/O layers and may hurt system performance. Developers need intimate knowledge of these interfaces to handle data consistency and misaligned accesses manually.
We present SPIN, which integrates p2p into the standard OS file I/O stack, dynamically activating p2p where appropriate, transparently to the user. It combines p2p with page cache accesses and re-enables read-ahead for sequential reads, all while maintaining standard POSIX file system consistency, portability across GPUs and SSDs, and compatibility with virtual block devices such as software RAID.
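To make the idea of combining p2p with page cache accesses concrete, the following is a minimal sketch, not SPIN's actual implementation. It assumes a hypothetical set `cached_pages` of page indices resident in the page cache, an assumed 4 KiB page size, and a hypothetical helper `plan_read` that splits a read request into contiguous segments. Segments covering cached pages are served from the page cache; all others are marked for p2p. The abstract does not specify this per-page policy, so it is an illustrative assumption.

```python
from itertools import groupby

PAGE_SIZE = 4096  # assumed page size in bytes


def plan_read(offset, length, cached_pages):
    """Split a byte range into (path, start, end) segments.

    `cached_pages` is a hypothetical set of page indices that are
    resident in the page cache. Runs of cached pages are routed to the
    page cache; runs of uncached pages are routed to p2p.
    """
    if length <= 0:
        return []
    end = offset + length
    first_page = offset // PAGE_SIZE
    last_page = (end - 1) // PAGE_SIZE
    plan = []
    # Group consecutive pages that share the same residency status.
    for is_cached, run in groupby(
        range(first_page, last_page + 1), key=lambda p: p in cached_pages
    ):
        pages = list(run)
        # Clamp each segment to the requested byte range.
        seg_start = max(offset, pages[0] * PAGE_SIZE)
        seg_end = min(end, (pages[-1] + 1) * PAGE_SIZE)
        plan.append(("page_cache" if is_cached else "p2p", seg_start, seg_end))
    return plan


if __name__ == "__main__":
    # A 16 KiB read in which only page 1 is cached.
    for segment in plan_read(0, 4 * PAGE_SIZE, {1}):
        print(segment)
```

In this sketch a single request can be served by both paths, which is one way an application could keep issuing ordinary file reads while the I/O stack picks the transfer mechanism for each segment.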
We evaluate SPIN on NVIDIA and AMD GPUs using standard file I/O benchmarks, application traces, and end-to-end experiments. SPIN achieves significant speedups across a wide range of workloads, exceeding p2p throughput by up to an order of magnitude. It also boosts the performance of an aerial imagery rendering application by 2.6× by dynamically adapting to its input-dependent file access pattern, enables 3.3× higher throughput for a GPU-accelerated log server, and enables 29% faster execution for a highly optimized GPU-accelerated image collage application with only 30 changed lines of code.
Index Terms
- SPIN: Seamless Operating System Integration of Peer-to-Peer DMA Between SSDs and GPUs