MOPAR: A Model Partitioning Framework for Deep Learning Inference Services on Serverless Platforms
arXiv - CS - Distributed, Parallel, and Cluster Computing. Pub Date: 2024-04-03, DOI: arxiv-2404.02445
Jiaang Duan, Shiyou Qian, Dingyu Yang, Hanwen Hu, Jian Cao, Guangtao Xue

With its elastic scaling and pay-as-you-go cost model, deploying deep learning inference services (DLISs) on serverless platforms is emerging as a prevalent trend. However, the varying resource requirements of different layers in DL models hinder resource utilization and increase costs when a DLIS is deployed as a single function on a serverless platform. To tackle this problem, we propose a model partitioning framework called MOPAR. This work builds on two resource usage patterns of DLISs: global differences and local similarity, which arise from the presence of resource-dominant (RD) operators and layer stacking. Considering these patterns, MOPAR adopts a hybrid approach that first divides the DL model vertically into multiple slices composed of similar layers to improve resource efficiency. Slices containing RD operators are further partitioned into multiple sub-slices, enabling parallel optimization to reduce inference latency. Moreover, MOPAR employs data compression and shared-memory techniques to offset the additional time introduced by communication between slices. We implement a prototype of MOPAR and evaluate its efficacy using 12 DL models across four categories on OpenFaaS and AWS Lambda. The experimental results show that MOPAR improves the resource efficiency of DLISs by 27.62% on average while reducing latency by about 5.52%. Furthermore, based on Lambda's pricing, the cost of running DLISs is reduced by a factor of about 2.58 using MOPAR.
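To make the vertical-slicing idea concrete, the following is a minimal Python sketch that groups consecutive layers with similar resource profiles into slices (local similarity) and flags slices containing RD operators as candidates for further sub-slicing. The Layer fields, the similarity threshold, and the function name are illustrative assumptions based only on the abstract, not MOPAR's actual implementation or API.

    # Illustrative sketch of similarity-based vertical slicing; all names,
    # numbers, and the threshold are hypothetical, not MOPAR's real code.
    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        mem_mb: float          # assumed per-layer memory footprint estimate
        is_rd: bool = False    # True if the layer holds a resource-dominant (RD) operator

    def partition_by_similarity(layers, threshold=0.25):
        """Group consecutive layers whose memory profiles are similar into one
        slice; start a new slice when the relative difference exceeds the
        threshold, mirroring the global-difference / local-similarity patterns."""
        slices, current = [], [layers[0]]
        for layer in layers[1:]:
            prev = current[-1]
            if abs(layer.mem_mb - prev.mem_mb) / max(prev.mem_mb, 1e-9) <= threshold:
                current.append(layer)
            else:
                slices.append(current)
                current = [layer]
        slices.append(current)
        return slices

    model = [
        Layer("conv1", 120), Layer("conv2", 130), Layer("conv3", 125),
        Layer("fc1", 800, is_rd=True),   # RD operator: candidate for sub-slicing
        Layer("fc2", 60), Layer("softmax", 55),
    ]

    for i, s in enumerate(partition_by_similarity(model)):
        rd = any(l.is_rd for l in s)
        print(f"slice {i}: {[l.name for l in s]}"
              + ("  <- contains RD op, split into sub-slices" if rd else ""))

On this toy model, the sketch groups the three similar convolution layers into one slice, isolates the large RD layer as its own slice for parallel sub-slicing, and groups the two small trailing layers, which is the slicing behavior the abstract describes at a high level.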

Updated: 2024-04-04