Abstract
This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weight matrices in neural networks with structured ones (with the aim of reducing the size of the corresponding deep learning models). Most of this work has been experimental; in this survey, we formalize the research question and show how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers this question. This survey is targeted at complexity theorists who might enjoy reading about how tools developed in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of structured matrices, which in turn seem well-suited for applications in deep learning. However, we hope that folks primarily interested in deep learning would also appreciate the connections to complexity theory.
Notes
The notion of small here is circuits of size \(o(n^{2})\), but the known lower bounds are still \({\Omega }\left ({n^{2-\varepsilon }}\right )\) for any fixed ε > 0, while in our case we are more interested in linear maps that have near-linear sized general arithmetic circuits.
This has led to deep intellectual research on societal implications of machine learning even in the theory community [5].
OK, we could not resist. This, though, is the last mention of societal issues in the survey.
We will pretty much use \(\mathbb {F}=\mathbb {R}\) (real numbers) or \(\mathbb {F}=\mathbb {C}\) (complex numbers) in the survey. Even though most of the results in the survey can be made to work for finite fields, we will ignore this aspect of the results.
More precisely, we have \(\textsf {ReLu}(x)=\max \limits (0,x)\) for any \(x\in \mathbb {R}\) and for any \(\boldsymbol {z}\in \mathbb {R}^{m}\), g(z) = (ReLu(z[0]),⋯ ,ReLu(z[m − 1])).
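As a concrete (purely illustrative) rendering of this definition, the entrywise map g can be written in a couple of lines of NumPy; the names `relu` and `g` below are our own and not from any particular deep learning library:

```python
import numpy as np

def relu(x):
    # ReLu(x) = max(0, x), applied entrywise to a vector (or scalar)
    return np.maximum(0.0, x)

# g(z) = (ReLu(z[0]), ..., ReLu(z[m-1])) for z in R^m
z = np.array([-2.0, 0.0, 3.5])
g_of_z = relu(z)  # -> array([0. , 0. , 3.5])
```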
Ideally, we would also like the first step to be efficient but typically the learning of the network can be done in an offline step so it can be (relatively) more inefficient.
The reader might have noticed that we are ignoring the P vs. NP elephant in the room.
Recall that the training problem happens ‘offline’, so we do not need the learning to be, say, O(n) time, but we would like the learning algorithm to at worst be polynomial time.
In fact the SVD will give the best rank-r approximation even if W is not rank r; for now let’s just consider the problem setting where we are looking for an exact representation.
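For illustration, here is a small NumPy sketch (the names W, L, R below are hypothetical) of the exact case: when W genuinely has rank r, truncating the SVD to its top r singular triples recovers W exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 8, 2

# Build a matrix W of rank (at most) r as a product of two factors
L = rng.standard_normal((n, r))
R = rng.standard_normal((r, n))
W = L @ R

# Truncated SVD: keep only the top r singular triples
U, s, Vt = np.linalg.svd(W)
W_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]

# Since W has rank r, the truncation loses nothing: W_r == W
# (up to floating point error)
```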
Here we have assumed that one can compute g(x) and \(g^{\prime }(x)\) with O(1) operations and assumed that T2(m,n) ≥ m.
At a very high level this involves ‘reversing’ the direction of the edges in the DAG corresponding to the circuit.
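As a toy illustration of this idea (a minimal scalar reverse-mode sketch of our own, not the Baur–Strassen construction itself): the forward pass records the DAG of operations, and the backward pass traverses it with edge directions reversed, accumulating partial derivatives along the way.

```python
# Minimal scalar reverse-mode autodiff: the forward pass builds the
# circuit DAG; the backward pass walks its edges in reverse.
class Node:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # list of (input node, local derivative)
        self.grad = 0.0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def backward(out):
    # Topological order of the DAG via depth-first search
    order, seen = [], set()
    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)
    visit(out)
    # Traverse nodes in reverse order, pushing gradients backwards
    out.grad = 1.0
    for n in reversed(order):
        for p, d in n.parents:
            p.grad += d * n.grad

# f(x, y) = x*y + x, so df/dx = y + 1 and df/dy = x
x, y = Node(3.0), Node(4.0)
f = add(mul(x, y), x)
backward(f)  # x.grad == 5.0, y.grad == 3.0
```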
However, earlier we were taking the gradient with respect to (essentially) A, whereas here it is with respect to x.
Here we consider A as given and x and y as inputs. This implies that we need to prove the Baur-Strassen theorem when we only take derivatives with respect to part of the inputs, but this follows trivially since one can just read off \(\nabla _{{\boldsymbol {x}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\) from \(\nabla _{{\boldsymbol {x},\boldsymbol {y}}}\left ({\boldsymbol {y}^{T}\mathbf {A}\boldsymbol {x}}\right )\).
In this survey we are dealing with the classical definition of derivatives. If one defines the Kronecker delta function as the limit of a distribution and considers derivatives in the sense of the theory of distributions, then the Efficient gradient property will be satisfied. Indeed, many practical implementations that use sparse weight matrices W use the distributional definition of the Kronecker delta function when trying to learn W.
Indeed, consider any sparse + low-rank decomposition as in (5). If R has rank r at least \(\varepsilon \sqrt {n}\), this immediately implies \(s^{\prime }\ge 2rn \ge {\Omega }\left ({n^{3/2}}\right )\). If on the other hand \(r\le \varepsilon \sqrt {n}\), then by Theorem 4.2, we have \(s^{\prime }\ge \frac {n^{2}}{4}\).
This means that during the learning phase, we already know L and R and we only need to learn the residual.
If WLOG L = Z and R = D, then the diagonal of L − R is the diagonal of D and hence all non-zero by our assumption. If both L and R are diagonal matrices, i.e. L = D1 and R = D2, then L − R = D1 − D2 and all its diagonal entries are non-zero since we assumed L and R do not share any eigenvalues.
References
Alman, J.: Kronecker products, low-depth circuits, and matrix rigidity. In: Khuller, S., Williams, V.V. (eds.) STOC ’21: 53rd Annual ACM SIGACT Symposium on Theory of Computing, Virtual Event, Italy, pp 772–785. ACM (2021)
Alman, J., Chen, L.: Efficient construction of rigid matrices using an NP oracle. In: Zuckerman, D. (ed.) 60th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2019, Baltimore, Maryland, pp 1034–1055. IEEE Computer Society (2019)
Alman, J., Williams, R.R.: Probabilistic rank and matrix rigidity. In: Hatami, H., McKenzie, P., King, V. (eds.) Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2017, Montreal, pp 641–652. ACM (2017)
Anthony, M., Bartlett, P.L.: Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge (2009)
Barocas, S., Hardt, M., Narayanan, A.: Fairness and Machine Learning. fairmlbook.org. http://www.fairmlbook.org. Accessed 10 Dec 2023 (2019)
Baur, W., Strassen, V.: The complexity of partial derivatives. Theor. Comput. Sci. 22(3), 317–330 (1983)
Bender, E.M., Gebru, T., McMillan-Major, A., Shmitchell, S.: On the dangers of stochastic parrots: can language models be too big? In: Elish, M.C., Isaac, W., Zemel, R.S. (eds.) FAccT ’21: 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual Event / Toronto, Canada, pp 610–623. ACM (2021)
Beneš, V.E.: Mathematical Theory of Connecting Networks and Telephone Traffic. Elsevier Science (1965)
Beneš, V.E.: Optimal rearrangeable multistage connecting networks. The Bell System Technical Journal 43(4), 1641–1656 (1964)
Bhangale, A., Harsha, P., Paradise, O., Tal, A.: Rigid matrices from rectangular PCPs or: hard claims have complex proofs. In: 61st IEEE Annual Symposium on Foundations of Computer Science, FOCS 2020, pp 858–869. IEEE, Durham (2020)
Bürgisser, P., Clausen, M., Shokrollahi, M.A.: Algebraic Complexity Theory, vol. 315. Springer Science & Business Media, Berlin (2013)
Candès, E.J., Li, X., Ma, Y., Wright, J.: Robust principal component analysis? J. ACM 58(3) (2011)
Choromanski, K., Rowland, M., Chen, W., Weller, A.: Unifying orthogonal Monte Carlo methods. In: International Conference on Machine Learning, pp 1203–1212 (2019)
Cooley, J.W., Tukey, J.W.: An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(90), 297–301 (1965)
Dao, T., Chen, B., Sohoni, N.S., Desai, A.D., Poli, M., Grogan, J., Liu, A., Rao, A., Rudra, A., Ré, C.: Monarch: expressive structured matrices for efficient and accurate training. arXiv:2204.00595 (2022)
Dao, T., Gu, A., Eichhorn, M., Rudra, A., Ré, C.: Learning fast algorithms for linear transforms using butterfly factorizations. In: The International Conference on Machine Learning (ICML) (2019)
Dao, T., Sohoni, N.S., Gu, A., Eichhorn, M., Blonder, A., Leszczynski, M., Rudra, A., Ré, C.: Kaleidoscope: an efficient, learnable representation for all structured linear maps. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net (2020)
De Sa, C., Gu, A., Puttagunta, R., Ré, C., Rudra, A.: A two-pronged progress in structured dense matrix vector multiplication. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, pp. 1060–1079 (2018)
Dvir, Z., Liu, A.: Fourier and circulant matrices are not rigid. Theory of Computing 16(20), 1–48 (2020)
Fiduccia, C.M.: On the algebraic complexity of matrix multiplication. PhD thesis, Brown University. http://cr.yp.to/bib/entries.html#1973/duccia-matrix. Accessed 10 Dec 2023 (1973)
Frankle, J., Carbin, M.: The lottery ticket hypothesis: finding sparse, trainable neural networks. In: International Conference on Learning Representations (ICLR) (2019)
Golovnev, S.: A course on matrix rigidity. https://golovnev.org/rigidity/ Accessed 15 Aug 2021 (2020)
Gu, J., Wang, Z., Kuen, J., Ma, L., Shahroudy, A., Shuai, B., Liu, T., Wang, X., Li, W., Wang, G., Cai, J., Chen, T.: Recent advances in convolutional neural networks. Pattern Recogn. 77, 354–377 (2018)
Li, J., Shen, Y., Dubcek, T., Peurifoy, J., Skirlo, S., LeCun, Y., Tegmark, M., Soljačić, M.: Tunable efficient unitary neural networks (EUNN) and their application to RNNs. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1733–1741. JMLR.org (2017)
Kailath, T., Kung, S.-Y., Morf, M.: Displacement ranks of matrices and linear equations. J. Math. Anal. Appl. 68(2), 395–407 (1979)
Kailath, T., Sayed, A.H.: Displacement structure: theory and applications. SIAM Rev. 37(3), 297–386 (1995)
Kaltofen, E.: Computational differentiation and algebraic complexity theory. In: Bischof, C.H., Griewank, A., Khademi, P.M. (eds.) Workshop Report on First Theory Institute on Computational Differentiation, volume ANL/MCS-TM-183 of Tech. Rep. http://kaltofen.math.ncsu.edu/bibliography/93/Ka93_di.pdf. Accessed 10 Dec 2023, pp 28–30. Association for Computing Machinery, New York (1993)
LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521, 436–444 (2015)
Li, Y., Yang, H., Martin, E.R., Ho, K.L., Ying, L.: Butterfly factorization. Multiscale Modeling & Simulation 13(2), 714–732 (2015)
Lokam, S.V.: Complexity lower bounds using linear algebra. Found. Trends Theor. Comput. Sci. 4(1-2), 1–155 (2009)
Mathieu, M., LeCun, Y.: Fast approximation of rotations and Hessians matrices. arXiv:1404.7195 (2014)
Munkhoeva, M., Kapushev, Y., Burnaev, E., Oseledets, I.: Quadrature-based features for kernel approximation. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31, pp 9165–9174. Curran Associates Inc (2018)
Pan, V.Y.: Structured Matrices, Polynomials: Unified Superfast Algorithms. Springer, New York (2001)
Stott Parker, D.: Random butterfly transformations with applications in computational linear algebra. Technical report, UCLA (1995)
Paturi, R., Pudlák, P.: Circuit lower bounds and linear codes. J. Math. Sci. 134, 2425–2434 (2006)
Schwartz, R., Dodge, J., Smith, N.A., Etzioni, O.: Green AI. arXiv:1907.10597 (2019)
Sindhwani, V., Sainath, T.N., Kumar, S.: Structured transforms for small-footprint deep learning. In: Advances in Neural Information Processing Systems, pp 3088–3096 (2015)
Szegö, G.: Orthogonal Polynomials. Number v. 23 in American Mathematical Society colloquium publications. American Mathematical Society (1967)
Thomas, A.T., Gu, A., Dao, T., Rudra, A., Ré, C.: Learning compressed transforms with low displacement rank. In: Bengio, S., Wallach, H.M., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp 9066–9078 (2018)
Tsidulko, J.: Google showcases on-device artificial intelligence breakthroughs at I/O. CRN (2019)
Udell, M, Townsend, A: Why are big data matrices approximately low rank? SIAM Journal on Mathematics of Data Science 1(1), 144–160 (2019)
Valiant, L.G.: Graph-theoretic arguments in low-level complexity. In: Gruska, J. (ed.) Mathematical Foundations of Computer Science 1977, pp 162–176. Springer, Berlin (1977)
Yu, X., Liu, T., Wang, X., Tao, D.: On compressing deep models by low rank and sparse decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
Zhao, L., Liao, S., Wang, Y., Li, Z., Tang, J., Yuan, B.: Theoretical properties for neural networks with weight matrices of low displacement rank. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp 4082–4090. PMLR (2017)
Acknowledgements
The material in Sections 2 and 3 is based on notes for AR’s Open lectures for PhD students in computer science at the University of Warsaw, titled (Dense Structured) Matrix Vector Multiplication, in May 2018; we would like to thank the University of Warsaw for its hospitality. The material in Section 5 is based on Dao et al. [17].
We would like to thank Tri Dao, Albert Gu and Chris Ré for many illuminating discussions during our collaborations around these topics.
We would like to thank an anonymous reviewer whose comments improved the presentation of the survey (and for pointing us to Theorem 4.2) and we thank Jessica Grogan for a careful read of an earlier draft of this survey.
AR is supported in part by NSF grant CCF-1763481.
Cite this article
Rudra, A. Arithmetic Circuits, Structured Matrices and (not so) Deep Learning. Theory Comput Syst 67, 592–626 (2023). https://doi.org/10.1007/s00224-022-10112-w