
Arithmetic Circuits, Structured Matrices and (not so) Deep Learning


Abstract

This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weight matrices in neural networks by structured ones (with the aim of reducing the size of the corresponding deep learning models). Most of this work has been experimental, and in this survey we formalize the research question and show how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers it. This survey is targeted at complexity theorists who might enjoy reading about how tools developed in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of structured matrices, which in turn seem well-suited for applications in deep learning. However, we hope that folks primarily interested in deep learning would also appreciate the connections to complexity theory.


Notes

  1. The notion of small here is to show circuits of size \(o\left ({n^{2}}\right )\), but the bounds are still \({\Omega }\left ({n^{2-\varepsilon }}\right )\) for any fixed ε > 0, while in our case we are more interested in linear maps that have near-linear-sized general arithmetic circuits.

  2. This has led to deep intellectual research on societal implications of machine learning even in the theory community [5].

  3. OK, we could not resist. This though is the last mention of societal issues in the survey.

  4. We will pretty much use \(\mathbb {F}=\mathbb {R}\) (real numbers) or \(\mathbb {F}=\mathbb {C}\) (complex numbers) in the survey. Even though most of the results in the survey can be made to work for finite fields, we will ignore this aspect of the results.

  5. More precisely, we have \(\textsf {ReLu}(x)=\max \limits (0,x)\) for any \(x\in \mathbb {R}\) and for any \(\boldsymbol {z}\in \mathbb {R}^{m}\), g(z) = (ReLu(z[0]),⋯ ,ReLu(z[m − 1])).
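      For concreteness, here is a small worked example of this definition (ours, not from the survey): for \(\boldsymbol {z}=(-1.5, 0, 2)\in \mathbb {R}^{3}\),

      \(g(\boldsymbol {z}) = \left ({\textsf {ReLu}(-1.5), \textsf {ReLu}(0), \textsf {ReLu}(2)}\right ) = (0, 0, 2).\)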

  6. Ideally, we would also like the first step to be efficient, but typically the learning of the network can be done in an offline step, so it can afford to be (relatively) less efficient.

  7. The reader might have noticed that we are ignoring the P vs. NP elephant in the room.

  8. Recall that the training problem happens ‘offline’, so we do not need the learning to take, say, O(n) time, but we would like the learning algorithm to run in at worst polynomial time.

  9. In fact the SVD will give the best rank-r approximation even if W is not rank r; for now let’s just consider the problem setting where we are looking for an exact representation.

  10. Here we have assumed that one can compute g(x) and \(g^{\prime }(x)\) with O(1) operations, and that \(T_{2}(m,n)\ge m\).

  11. At a very high level this involves ‘reversing’ the direction of the edges in the DAG corresponding to the circuit.
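      To make the edge-reversal picture concrete, here is a minimal sketch (our own illustration in Python, not code from the survey; all names are hypothetical) of reverse-mode differentiation on the toy circuit computing f(x0, x1, x2) = x0·x1 + x1·x2:

          def f_and_gradient(x0, x1, x2):
              # Forward pass: evaluate each gate in topological order.
              g1 = x0 * x1                  # multiplication gate
              g2 = x1 * x2                  # multiplication gate
              out = g1 + g2                 # addition gate (the output)

              # Backward pass: walk the DAG with its edges reversed, propagating
              # adjoints (partial derivatives of the output w.r.t. each gate).
              d_out = 1.0
              d_g1 = d_out                  # an addition gate passes the adjoint through
              d_g2 = d_out
              d_x0 = d_g1 * x1              # a multiplication gate scales the adjoint
              d_x2 = d_g2 * x1              # by the value of its other input
              d_x1 = d_g1 * x0 + d_g2 * x2  # x1 feeds two gates, so its adjoints add

              return out, (d_x0, d_x1, d_x2)

          # f(2, 3, 5) = 21 with gradient (3, 7, 3).
          print(f_and_gradient(2.0, 3.0, 5.0))

      The backward pass uses only a constant factor more operations than the forward pass, which is exactly the overhead guaranteed by the Baur-Strassen theorem.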

  12. However, earlier we were taking the gradient with respect to (essentially) A whereas here it is with respect to x.

  13. Here we consider A as given and x and y as inputs. This implies that we need to prove the Baur-Strassen theorem when we only take derivatives with respect to part of the inputs, but this follows trivially since one can just read off \(\nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\) from \(\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\).
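      To spell this out (a worked computation of our own, using the notation above): since \(\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}={\sum }_{i,j}\boldsymbol {y}[i]\mathbf {A}[i,j]\boldsymbol {x}[j]\), we have

      \(\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right ) = \left ({\mathbf {A}^{T}\boldsymbol {y},\ \mathbf {A}\boldsymbol {x}}\right ),\)

      and reading off the \(\boldsymbol {x}\)-coordinates gives \(\nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right ) = \mathbf {A}^{T}\boldsymbol {y}\).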

  14. In this survey we are dealing with the classical definition of derivatives. If one defines the Kronecker delta function as a limit of a distribution and considers derivatives in the sense of the theory of distributions, then the Efficient gradient property will be satisfied. Indeed, many practical implementations that use sparse matrices as the weight matrices W, when trying to learn W, use the distributional definition of the Kronecker delta function.

  15. Indeed, consider any sparse + low-rank representation as in (5). If R has rank r at least \(\varepsilon \sqrt {n}\), this immediately implies \(s^{\prime }\ge 2rn \ge {\Omega }\left ({n^{3/2}}\right )\). If on the other hand \(r\le \varepsilon \sqrt {n}\), then by Theorem 4.2 we have \(s^{\prime }\ge \frac {n^{2}}{4}\).

  16. This means that during the learning phase, we already know L and R and we only need to learn the residual.

  17. At a high level this should not be surprising given the results in Section 3, though Zhao et al. do not utilize the generic connection we established in Section 3.

  18. If WLOG L = Z and R = D, then the diagonal of LR is the diagonal of D and hence all non-zero by our assumption. If both L and R are diagonal matrices, i.e. L = \(D_{1}\) and R = \(D_{2}\), then LR = \(D_{1}D_{2}\) and all these entries are non-zero since we assumed L and R do not share any eigenvalues.

References

  1. Alman, J.: Kronecker products, low-depth circuits, and matrix rigidity. In: Khuller, S., Williams, V.V. (eds.) STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, pp 772–785. ACM (2021)

  2. Alman, J., Chen, L.: Efficient construction of rigid matrices using an NP oracle. In: Zuckerman, D. (ed.) 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, pp 1034–1055. IEEE Computer Society (2019)

  3. Alman, J., Williams, R.R.: Probabilistic rank and matrix rigidity. In: Hatami, H., McKenzie, P., King, V. (eds.) Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, pp 641–652. ACM (2017)

  4. Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (2009)


  5. Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning. fairmlbook.org. http://www.fairmlbook.org. Accessed 10 Dec 2023 (2019)

  6. Baur, W., Strassen, V.: The complexity of partial derivatives. Theor. Comput. Sci. 22(3), 317–330 (1983)


  7. Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, pp 610–623. ACM (2021)

  8. Beneš, V.E.: Mathematical Theory of Connecting Networks and Telephone Traffic. Elsevier Science (1965)

  9. Beneš, V.E.: Optimal rearrangeable multistage connecting networks. The Bell System Technical Journal 43(4), 1641–1656 (1964)


  10. Bhangale, A., Harsha, P., Paradise, O., Tal, A.: Rigid matrices from rectangular PCPs or: hard claims have complex proofs. In: 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, Durham, pp 858–869. IEEE (2020)

  11. Bürgisser, P., Clausen, M., Shokrollahi, M.A.: Algebraic Complexity Theory, vol. 315. Springer Science & Business Media, Berlin (2013)


  12. Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3) (June 2011)

  13. Choromanski, K., Rowland, M., Chen, W., Weller, A.: Unifying orthogonal Monte Carlo methods. In: International Conference on Machine Learning, pp 1203–1212 (2019)

  14. Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)


  15. Dao, T., Chen, B., Sohoni, N.S., Desai, A.D., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., Ré, C.: Monarch: Expressive structured matrices for efficient and accurate training. arXiv:2204.00595 (2022)

  16. Dao, T., Gu, A., Eichhorn, M., Rudra, A., Ré, C.: Learning fast algorithms for linear transforms using butterfly factorizations. In: The International Conference on Machine Learning (ICML) (2019)

  17. Dao, T., Sohoni, N.S., Gu, A., Eichhorn, M., Blonder, A., Leszczynski, M., Rudra, A., Ré, C.: Kaleidoscope: an efficient, learnable representation for all structured linear maps. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)

  18. De Sa, C., Gu, A., Puttagunta, R., Ré, C., Rudra, A.: A two-pronged progress in structured dense matrix vector multiplication. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp. 1060–1079 (2018)

  19. Dvir, Z., Liu, A.: Fourier and circulant matrices are not rigid. Theory of Computing 16(20), 1–48 (2020)


  20. Fiduccia, C.M.: On the algebraic complexity of matrix multiplication. PhD thesis, Brown University. http://cr.yp.to/bib/entries.html#1973/duccia-matrix. Accessed 10 Dec 2023 (1973)

  21. Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: International Conference on Learning Representations (ICLR) (2019)

  22. Golovnev, S.: A course on matrix rigidity. https://golovnev.org/rigidity/ Accessed 15 Aug 2021 (2020)

  23. Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Li, W., Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)


  24. Jing, L., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., Soljačić, M.: Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1733–1741. JMLR.org (2017)

  25. Kailath, T., Kung, S.-Y., Morf, M.: Displacement ranks of matrices and linear equations. J. Math. Anal. Appl. 68(2), 395–407 (1979)


  26. Kailath, T., Sayed, A.H.: Displacement structure: theory and applications. SIAM Rev. 37(3), 297–386 (1995)


  27. Kaltofen, E.: Computational differentiation and algebraic complexity theory. In: Bischof, C.H., Griewank, A., Khademi, P.M. (eds.) Workshop Report on First Theory Institute on Computational Differentiation, volume ANL/MCS-TM-183 of Tech. Rep. http://kaltofen.math.ncsu.edu/bibliography/93/Ka93_di.pdf. Accessed 10 Dec 2023, pp 28–30. Association for Computing Machinery, New York (1993)

  28. LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)


  29. Li, Y., Yang, H., Martin, E.R., Ho, K.L., Ying, L.: Butterfly factorization. Multiscale Modeling & Simulation 13(2), 714–732 (2015)


  30. Lokam, S.V.: Complexity lower bounds using linear algebra. Found. Trends Theor. Comput. Sci. 4(1-2), 1–155 (2009)


  31. Mathieu, M., LeCun, Y.: Fast approximation of rotations and Hessians matrices. arXiv:1404.7195 (2014)

  32. Munkhoeva, M., Kapushev, Y., Burnaev, E., Oseledets, I.: Quadrature-based features for kernel approximation. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp 9165–9174. Curran Associates Inc (2018)

  33. Pan, V.Y.: Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer, New York (2001)


  34. Stott Parker, D.: Random butterfly transformations with applications in computational linear algebra. Technical report, UCLA (1995)

  35. Paturi, R., Pudlák, P.: Circuit lower bounds and linear codes. J. Math. Sci. 134, 2425–2434 (2006)


  36. Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O.: Green AI. arXiv:1907.10597 (2019)

  37. Sindhwani, V., Sainath, T.N., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems, pp 3088–3096 (2015)

  38. Szegö, G.: Orthogonal Polynomials, vol. 23 of American Mathematical Society Colloquium Publications. American Mathematical Society (1967)

  39. Thomas, A.T., Gu, A., Dao, T., Rudra, A., Ré, C.: Learning compressed transforms with low displacement rank. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 9066–9078 (2018)

  40. Tsidulko, J.: Google showcases on-device artificial intelligence breakthroughs at I/O. CRN (2019)

  41. Udell, M., Townsend, A.: Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1(1), 144–160 (2019)


  42. Valiant, L.G.: Graph-theoretic arguments in low-level complexity. In: Gruska, J. (ed.) Mathematical Foundations of Computer Science 1977, pp 162–176. Springer, Berlin (1977)

  43. Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  44. Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., Yuan, B.: Theoretical properties for neural networks with weight matrices of low displacement rank. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp 4082–4090. PMLR (2017)


Acknowledgements

The material in Sections 2 and 3 is based on notes for AR’s Open lectures for PhD students in computer science at the University of Warsaw, titled (Dense Structured) Matrix Vector Multiplication, in May 2018; we would like to thank the University of Warsaw for its hospitality. The material in Section 5 is based on Dao et al. [17].

We would like to thank Tri Dao, Albert Gu and Chris Ré for many illuminating discussions during our collaborations around these topics.

We would like to thank an anonymous reviewer whose comments improved the presentation of the survey (and for pointing us to Theorem 4.2) and we thank Jessica Grogan for a careful read of an earlier draft of this survey.

AR is supported in part by NSF grant CCF-1763481.

Author information

Corresponding author

Correspondence to Atri Rudra.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Rudra, A. Arithmetic Circuits, Structured Matrices and (not so) Deep Learning. Theory Comput Syst 67, 592–626 (2023). https://doi.org/10.1007/s00224-022-10112-w
