Fastensor: Optimise the Tensor I/O Path from SSD to GPU for Deep Learning Training
ACM Transactions on Architecture and Code Optimization (IF 1.6). Pub Date: 2023-10-25, DOI: 10.1145/3630108
Jia Wei, Xingjun Zhang, Longxiang Wang, Zheng Wei

In recent years, benefiting from increases in model size and complexity, deep learning has achieved tremendous success in computer vision (CV) and natural language processing (NLP). Training deep learning models on accelerators such as GPUs typically requires large volumes of data to be transferred repeatedly from NVMe SSDs to GPU memory. Much recent work has focused on data transfer during the pre-processing phase, introducing techniques such as multiprocessing and GPU Direct Storage (GDS) to accelerate it. However, tensor data produced during training (such as checkpoints, logs, and intermediate feature maps), whose transfer is also time-consuming, is still typically moved with traditional serial, long-I/O-path methods.
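The contrast between the two paths can be made concrete with a small, illustrative sketch (not the paper's code). It compares a conventional CPU-mediated load, where data flows SSD → page cache → host buffer → GPU over PCIe, with a GDS read that DMAs directly from NVMe into GPU memory. kvikio, RAPIDS' Python binding for NVIDIA's cuFile API, stands in here for any GDS-capable transfer tool; a GDS-enabled NVMe SSD and the cupy/kvikio packages are assumed.

    # Illustrative sketch: traditional CPU-mediated load vs. a GDS read.
    import torch
    import cupy
    import kvikio

    path = "feature_map.bin"
    numel = 1024 * 1024  # illustrative tensor size
    torch.randn(numel).numpy().tofile(path)  # create a file to read back

    # Traditional path: SSD -> page cache -> host buffer -> PCIe copy to GPU.
    cpu_tensor = torch.from_file(path, size=numel, dtype=torch.float32)
    gpu_tensor = cpu_tensor.cuda()  # extra hop through host memory

    # GDS path: DMA straight from NVMe into GPU memory, bypassing the CPU.
    gpu_buf = cupy.empty(numel, dtype=cupy.float32)
    with kvikio.CuFile(path, "r") as f:
        f.read(gpu_buf)  # cuFile read directly into device memory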

In this paper, building on GDS technology, we present Fastensor, an efficient tool for tensor data transfer between NVMe SSDs and GPUs. To achieve higher tensor I/O throughput, we optimized the traditional data I/O process. We also propose a data- and runtime-context-aware tensor I/O algorithm: during model training, Fastensor selects the most suitable data transfer tool for the current tensor from a candidate set. The optimal tool is looked up in a dictionary generated by our adaptive exploration algorithm during the first few training iterations. Using Fastensor's unified interface, we measured the read/write bandwidth and energy consumption of the different transfer tools across tensor block sizes, and found that the execution efficiency of each tool depends on both the tensor block size and the runtime context.
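The selection mechanism can be pictured as a small exploration/exploitation loop. The sketch below is a minimal illustration of the idea as described above, not Fastensor's actual implementation: tensor sizes are bucketed (power-of-two bucketing is our assumption), each candidate tool is timed during the first few iterations, and the fastest tool per bucket is recorded in a dictionary and used thereafter. TensorIOSelector and its parameters are hypothetical names.

    # Minimal sketch of data- and context-aware transfer-tool selection.
    import math
    import time

    class TensorIOSelector:
        def __init__(self, tools, explore_iters=4):
            self.tools = tools                  # {"name": callable(tensor, path)}
            self.explore_iters = explore_iters  # iterations spent exploring
            self.timings = {}                   # (bucket, name) -> best seconds
            self.best = {}                      # bucket -> fastest tool name

        def _bucket(self, tensor):
            # Bucket tensors by power-of-two byte size (our assumption).
            nbytes = tensor.element_size() * tensor.nelement()
            return int(math.log2(max(nbytes, 1)))

        def save(self, tensor, path, iteration):
            bucket = self._bucket(tensor)
            if iteration < self.explore_iters:
                # Exploration: round-robin over candidates, record timings,
                # and keep the dictionary of fastest tools up to date.
                name = list(self.tools)[iteration % len(self.tools)]
                start = time.perf_counter()
                self.tools[name](tensor, path)
                elapsed = time.perf_counter() - start
                key = (bucket, name)
                self.timings[key] = min(elapsed, self.timings.get(key, float("inf")))
                self.best[bucket] = min(
                    (n for (b, n) in self.timings if b == bucket),
                    key=lambda n: self.timings[(bucket, n)],
                )
            else:
                # Exploitation: use the fastest recorded tool for this bucket,
                # falling back to any tool if the bucket was never explored.
                name = self.best.get(bucket, next(iter(self.tools)))
                self.tools[name](tensor, path)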

We then deployed Fastensor in the widely used PyTorch deep learning framework and showed that it performs well in two typical scenarios, model parameter saving and intermediate feature map transfer, under the same hardware configuration. When used for model parameter saving, Fastensor achieves a 5.37x read performance improvement over torch.save(). When used for intermediate feature map transfer, Fastensor increases the supported training batch size by 20x while improving total read and write speed by 2.96x compared to the torch I/O API.
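For the checkpoint-saving scenario, a hypothetical wiring of the TensorIOSelector sketched above into a PyTorch training loop might look as follows, with torch.save() and a kvikio-based GDS writer as the two candidate tools. This mirrors the scenario the abstract describes; the interface is our assumption, not Fastensor's actual API, and the two tools produce different on-disk formats (pickled vs. raw bytes).

    # Hypothetical usage of the selector with two candidate save tools.
    import torch
    import kvikio

    def save_torch(tensor, path):
        torch.save(tensor, path)  # serial, CPU-mediated write

    def save_gds(tensor, path):
        # Assumes a contiguous CUDA tensor; kvikio accepts any object
        # exposing __cuda_array_interface__ and writes device memory
        # directly to the SSD via cuFile.
        with kvikio.CuFile(path, "w") as f:
            f.write(tensor)

    model = torch.nn.Linear(1024, 1024).cuda()
    selector = TensorIOSelector({"torch": save_torch, "gds": save_gds})

    for it in range(10):
        # ... forward/backward/optimizer step elided ...
        weights = model.weight.detach().contiguous()
        selector.save(weights, f"ckpt_{it}.bin", iteration=it)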



Last updated: 2023-10-26