
VTensor: Using Virtual Tensors to Build a Layout-Oblivious AI Programming Framework

  • Regular Paper
  • Published in: Journal of Computer Science and Technology

Abstract

Tensors are a popular programming interface for developing artificial intelligence (AI) algorithms. Layout refers to the order in which tensor data are placed in memory; it affects performance through data locality, so each deep neural network library adopts its own layout convention. Since AI applications can use arbitrary layouts, and existing AI systems do not provide programming abstractions that shield developers from the layout conventions of libraries, operator developers must write a large amount of layout-related code, which reduces the efficiency of integrating new libraries or developing new operators. Furthermore, developers place layout conversion operations inside operators to cope with uncertainty about the input layout, thereby losing opportunities for layout optimization. Based on the idea of polymorphism, we propose a layout-agnostic virtual tensor programming interface, the VTensor framework, which enables developers to write new operators without concern for the underlying physical layout of tensors. In addition, the VTensor framework performs global layout inference at runtime to transparently resolve the required layouts of virtual tensors, and applies runtime layout-oriented optimizations to globally minimize the number of layout transformation operations. Experimental results demonstrate that with VTensor, developers can avoid writing layout-dependent code. Compared with TensorFlow, for the 16 operations used in 12 popular networks, VTensor reduces the lines of code (LOC) required to write a new operation by 47.82% on average and improves the overall performance by 18.65% on average.
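To make the layout-oblivious idea concrete, the following minimal Python sketch illustrates it under stated assumptions: it is not the published VTensor API, and the names VirtualTensor, to_layout, and relu_op are invented here, with NumPy standing in for the underlying kernels. An operator is written against a virtual tensor whose physical layout is resolved lazily, so a transpose happens only when the current layout differs from the layout the kernel requires.

# Hypothetical sketch: VirtualTensor, to_layout, and relu_op are illustrative
# names only, not the VTensor framework's actual interface.
import numpy as np

class VirtualTensor:
    """A tensor whose physical layout is resolved lazily; operators written
    against it never hard-code the layout of the buffer they receive."""

    def __init__(self, data, layout):
        self.data = data      # physical buffer (a NumPy array here)
        self.layout = layout  # current physical layout, e.g. "NCHW" or "NHWC"

    def to_layout(self, target):
        """Materialize the tensor in the requested layout, transposing only
        when the current layout differs (the transformation that a global,
        runtime layout-inference pass would try to minimize)."""
        if self.layout == target:
            return self
        perm = [self.layout.index(axis) for axis in target]
        return VirtualTensor(np.transpose(self.data, perm), target)

def relu_op(vt, kernel_layout="NCHW"):
    """An operator written against virtual tensors: it declares only the
    layout its underlying kernel needs, not the layout of its input."""
    x = vt.to_layout(kernel_layout)
    return VirtualTensor(np.maximum(x.data, 0.0), kernel_layout)

# Usage: the caller may hold data in any layout; no layout-dependent code
# is needed on either side of the operator boundary.
t = VirtualTensor(np.random.randn(1, 4, 4, 3), layout="NHWC")  # N=1, H=4, W=4, C=3
y = relu_op(t)                                                 # transposed once, to NCHW
print(y.layout, y.data.shape)                                  # NCHW (1, 3, 4, 4)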



Author information


Corresponding author

Correspondence to Hui-Min Cui.

Supplementary Information

ESM 1

(PDF 1293 kb)


About this article


Cite this article

Yu, F., Zhao, JC., Cui, HM. et al. VTensor: Using Virtual Tensors to Build a Layout-Oblivious AI Programming Framework. J. Comput. Sci. Technol. 38, 1074–1097 (2023). https://doi.org/10.1007/s11390-022-1457-6


