Abstract
This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weight matrices in neural networks with structured ones (with the aim of reducing the size of the corresponding deep learning models). Most of this work has been experimental; in this survey, we formalize the research question and show how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers this question. This survey is targeted at complexity theorists who might enjoy reading about how tools developed in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of structured matrices, which in turn seem well-suited for applications in deep learning. However, we hope that folks primarily interested in deep learning would also appreciate the connections to complexity theory.
Notes
The notion of small here is circuits of size \(o(n^{2})\), but the known lower bounds are still \({\Omega }\left ({n^{2-\varepsilon }}\right )\) for any fixed ε > 0, while in our case we are more interested in linear maps that have near-linear sized general arithmetic circuits.
This has led to deep intellectual research on societal implications of machine learning even in the theory community [5].
OK, we could not resist. This, though, is the last mention of societal issues in the survey.
We will pretty much use \(\mathbb {F}=\mathbb {R}\) (real numbers) or \(\mathbb {F}=\mathbb {C}\) (complex numbers) in the survey. Even though most of the results in the survey can be made to work for finite fields, we will ignore this aspect of the results.
More precisely, we have \(\textsf {ReLu}(x)=\max \limits (0,x)\) for any \(x\in \mathbb {R}\) and for any \(\boldsymbol {z}\in \mathbb {R}^{m}\), g(z) = (ReLu(z[0]),⋯ ,ReLu(z[m − 1])).
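As a concrete (purely illustrative) rendering of this definition, the entrywise map g can be written in a couple of lines of NumPy; the names `relu` and `g` below are our own and not from any particular deep learning library:

```python
import numpy as np

def relu(x):
    # ReLu(x) = max(0, x), applied entrywise to a vector (or scalar)
    return np.maximum(0.0, x)

# g(z) = (ReLu(z[0]), ..., ReLu(z[m-1])) for z in R^m
z = np.array([-2.0, 0.0, 3.5])
g_of_z = relu(z)  # -> array([0. , 0. , 3.5])
```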
Ideally, we would also like the first step to be efficient but typically the learning of the network can be done in an offline step so it can be (relatively) more inefficient.
The reader might have noticed that we are ignoring the P vs. NP elephant in the room.
Recall that the training problem happens ‘offline’, so we do not need the learning to be, say, O(n) time, but we would like the learning algorithm to at worst be polynomial time.
In fact the SVD will give the best rank-r approximation even if W is not rank r; for now let’s just consider the problem setting where we are looking for an exact representation.
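For illustration, here is a small NumPy sketch (the names W, L, R below are hypothetical) of the exact case: when W genuinely has rank r, truncating the SVD to its top r singular triples recovers W exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 2

# Build a matrix W of rank (at most) r as a product of two factors
L = rng.standard_normal((n, r))
R = rng.standard_normal((r, n))
W = L @ R

# Truncated SVD: keep only the top r singular triples
U, s, Vt = np.linalg.svd(W)
W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Since W has rank r, the truncation loses nothing: W_r == W
# (up to floating point error)
```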
Here we have assumed that one can compute g(x) and \(g^{\prime }(x)\) with O(1) operations and assumed that T2(m,n) ≥ m.
At a very high level this involves ‘reversing’ the direction of the edges in the DAG corresponding to the circuit.
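As a toy illustration of this idea (a minimal scalar reverse-mode sketch of our own, not the Baur–Strassen construction itself): the forward pass records the DAG of operations, and the backward pass traverses it with edge directions reversed, accumulating partial derivatives along the way.

```python
# Minimal scalar reverse-mode autodiff: the forward pass builds the
# circuit DAG; the backward pass walks its edges in reverse.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (input node, local derivative)
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(out):
    # Topological order of the DAG via depth-first search
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)
    visit(out)
    # Traverse nodes in reverse order, pushing gradients backwards
    out.grad = 1.0
    for n in reversed(order):
        for p, d in n.parents:
            p.grad += d * n.grad

# f(x, y) = x*y + x, so df/dx = y + 1 and df/dy = x
x, y = Node(3.0), Node(4.0)
f = add(mul(x, y), x)
backward(f)  # x.grad == 5.0, y.grad == 3.0
```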
However, earlier we were taking the gradient with respect to (essentially) A, whereas here it is with respect to x.
Here we consider A as given and x and y as inputs. This implies that we need to prove the Baur-Strassen theorem when we only take derivatives with respect to part of the inputs, but this follows trivially since one can just read off \(\nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\) from \(\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\).
In this survey we are dealing with the classical definition of derivatives. If one defines the Kronecker delta function as the limit of a distribution and considers derivatives in the sense of the theory of distributions, then the Efficient gradient property will be satisfied. Indeed, many practical implementations that use sparse weight matrices W use the distributional definition of the Kronecker delta function when trying to learn W.
Indeed, consider any sparse + low-rank decomposition as in (5). If R has rank r at least \(\varepsilon \sqrt {n}\), this immediately implies \(s^{\prime }\ge 2rn \ge {\Omega }\left ({n^{3/2}}\right )\). If on the other hand \(r\le \varepsilon \sqrt {n}\), then by Theorem 4.2, we have \(s^{\prime }\ge \frac {n^{2}}{4}\).
This means that during the learning phase, we already know L and R and we only need to learn the residual.
If WLOG L = Z and R = D, then the diagonal of L − R is the diagonal of D and hence all non-zero by our assumption. If both L and R are diagonal matrices, i.e. L = D1 and R = D2, then L − R = D1 − D2 and all its diagonal entries are non-zero since we assumed L and R do not share any eigenvalues.
References
Alman, J.: Kronecker products, low-depth circuits, and matrix rigidity. In: Khuller, S., Williams, V.V. (eds.) STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, pp 772–785. ACM (2021)
Alman, J., Chen, L.: Efficient construction of rigid matrices using an NP oracle. In: Zuckerman, D. (ed.) 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, pp 1034–1055. IEEE Computer Society (2019)
Alman, J., Williams, R.R.: Probabilistic rank and matrix rigidity. In: Hatami, H., McKenzie, P., King, V. (eds.) Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, pp 641–652. ACM (2017)
Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (2009)
Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning. fairmlbook.org. http://www.fairmlbook.org. Accessed 10 Dec 2023 (2019)
Baur, W., Strassen, V.: The complexity of partial derivatives. Theor. Comput. Sci. 22(3), 317–330 (1983)
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, pp 610–623. ACM (2021)
Beneš, V.E.: Mathematical Theory of Connecting Networks and Telephone Traffic. Elsevier Science (1965)
Beneš, V.E.: Optimal rearrangeable multistage connecting networks. The Bell System Technical Journal 43(4), 1641–1656 (1964)
Bhangale, A., Harsha, P., Paradise, O., Tal, A.: Rigid matrices from rectangular PCPs or: hard claims have complex proofs. In: 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, pp 858–869. IEEE, Durham (2020)
Bürgisser, P., Clausen, M., Shokrollahi, M.A.: Algebraic Complexity Theory, vol. 315. Springer Science & Business Media, Berlin (2013)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3) (2011)
Choromanski, K., Rowland, M., Chen, W., Weller, A.: Unifying orthogonal Monte Carlo methods. In: International Conference on Machine Learning, pp 1203–1212 (2019)
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
Dao, T., Chen, B., Sohoni, N.S., Desai, A.D., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., Ré, C.: Monarch: expressive structured matrices for efficient and accurate training. arXiv:2204.00595 (2022)
Dao, T., Gu, A., Eichhorn, M., Rudra, A., Ré, C.: Learning fast algorithms for linear transforms using butterfly factorizations. In: The International Conference on Machine Learning (ICML) (2019)
Dao, T., Sohoni, N.S., Gu, A., Eichhorn, M., Blonder, A., Leszczynski, M., Rudra, A., Ré, C.: Kaleidoscope: an efficient, learnable representation for all structured linear maps. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)
De Sa, C., Gu, A., Puttagunta, R., Ré, C., Rudra, A.: A two-pronged progress in structured dense matrix vector multiplication. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp. 1060–1079 (2018)
Dvir, Z., Liu, A.: Fourier and circulant matrices are not rigid. Theory of Computing 16(20), 1–48 (2020)
Fiduccia, C.M.: On the algebraic complexity of matrix multiplication. PhD thesis, Brown University. http://cr.yp.to/bib/entries.html#1973/duccia-matrix. Accessed 10 Dec 2023 (1973)
Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: International Conference on Learning Representations (ICLR) (2019)
Golovnev, S.: A course on matrix rigidity. https://golovnev.org/rigidity/ Accessed 15 Aug 2021 (2020)
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Li, W., Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
Li, J., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., Soljačić, M.: Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1733–1741. JMLR.org (2017)
Kailath, T., Kung, S.-Y., Morf, M.: Displacement ranks of matrices and linear equations. J. Math. Anal. Appl. 68(2), 395–407 (1979)
Kailath, T., Sayed, A.H.: Displacement structure: theory and applications. SIAM Rev. 37(3), 297–386 (1995)
Kaltofen, E.: Computational differentiation and algebraic complexity theory. In: Bischof, C.H., Griewank, A., Khademi, P.M. (eds.) Workshop Report on First Theory Institute on Computational Differentiation, volume ANL/MCS-TM-183 of Tech. Rep. http://kaltofen.math.ncsu.edu/bibliography/93/Ka93_di.pdf. Accessed 10 Dec 2023, pp 28–30. Association for Computing Machinery, New York (1993)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
Li, Y., Yang, H., Martin, E.R., Ho, K.L., Ying, L.: Butterfly factorization. Multiscale Modeling & Simulation 13(2), 714–732 (2015)
Lokam, S.V.: Complexity lower bounds using linear algebra. Found. Trends Theor. Comput. Sci. 4(1-2), 1–155 (2009)
Mathieu, M., LeCun, Y.: Fast approximation of rotations and Hessians matrices. arXiv:1404.7195 (2014)
Munkhoeva, M., Kapushev, Y., Burnaev, E., Oseledets, I.: Quadrature-based features for kernel approximation. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp 9165–9174. Curran Associates Inc (2018)
Pan, V.Y.: Structured Matrices, Polynomials: Unified Superfast Algorithms. Springer, New York (2001)
Stott Parker, D.: Random butterfly transformations with applications in computational linear algebra. Technical report, UCLA (1995)
Paturi, R., Pudlák, P.: Circuit lower bounds and linear codes. J. Math. Sci. 134, 2425–2434 (2006)
Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O.: Green AI. arXiv:1907.10597 (2019)
Sindhwani, V., Sainath, T.N., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems, pp 3088–3096 (2015)
Szegö, G.: Orthogonal Polynomials. Number v. 23 in American Mathematical Society colloquium publications. American Mathematical Society (1967)
Thomas, A.T., Gu, A., Dao, T., Rudra, A., Ré, C.: Learning compressed transforms with low displacement rank. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 9066–9078 (2018)
Tsidulko, J.: Google showcases on-device artificial intelligence breakthroughs at I/O. CRN (2019)
Udell, M, Townsend, A: Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1(1), 144–160 (2019)
Valiant, L.G.: Graph-theoretic arguments in low-level complexity. In: Gruska, J. (ed.) Mathematical Foundations of Computer Science 1977, pp 162–176. Springer, Berlin (1977)
Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., Yuan, B.: Theoretical properties for neural networks with weight matrices of low displacement rank. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp 4082–4090. PMLR (2017)
Acknowledgements
The material in Sections 2 and 3 is based on notes for AR’s Open lectures for PhD students in computer science at the University of Warsaw, titled (Dense Structured) Matrix Vector Multiplication, in May 2018; we would like to thank the University of Warsaw for its hospitality. The material in Section 5 is based on Dao et al. [17].
We would like to thank Tri Dao, Albert Gu and Chris Ré for many illuminating discussions during our collaborations around these topics.
We would like to thank an anonymous reviewer whose comments improved the presentation of the survey (and for pointing us to Theorem 4.2) and we thank Jessica Grogan for a careful read of an earlier draft of this survey.
AR is supported in part by NSF grant CCF-1763481.
Cite this article
Rudra, A. Arithmetic Circuits, Structured Matrices and (not so) Deep Learning. Theory Comput Syst 67, 592–626 (2023). https://doi.org/10.1007/s00224-022-10112-w