Quantum-classical hybrid neural networks in the neural tangent kernel regime

Kouhei Nakaji; Hiroyuki Tezuka; Naoki Yamamoto

doi:10.1088/2058-9565/ad133e

1. Introduction

1.1. Background—quantum/classical neural networks and classical neural tangent kernel (NTK)

Quantum neural networks (qNNs) or quantum classical hybrid neural networks (qcNNs) are systems that, based on their rich expressibility in the functional space, have potential of offering a higher-performance solution in various problems over classical means [1–15]. However, there remain two essential issues to be resolved. First, the existing qNN and qcNN models have no theoretical guarantee in their training process to converge to the optimal or even a 'good' solution. The vanishing gradient (or the barren plateau) issue, stating that the gradient vector decays exponentially fast with respect to the number of qubits, is particularly serious [16]; several proposals to mitigate this issue have been proposed [17–26], but these are not general solutions. Secondly, despite the potential advantage of the quantum models in their expressibility, they are not guaranteed to offer a better solution over classical means, especially the classical neural networks (cNNs). Regarding this point, the recent study [27] has derived a condition for the quantum kernel method to presumably outperform a class of classical means and provided the idea using the projected quantum kernel to satisfy this advantageous condition. Note that the quantum kernel has been thoroughly investigated in several theoretical and experimental settings [28–35]. However, designing an effective quantum kernel (including the projected quantum kernel) is a highly nontrivial task; also, the kernel method generally requires the computational complexity of $O(N_D^2)$ with N_D the number of data, whereas the cNN needs only $O(N_D)$ as long as the computational cost of training does not scale with N_D . Therefore, it is desirable if we could have an easy-trainable qNN or qcNN to which the above-mentioned advantage of quantum kernel method are incorporated.

On the other hand, in the classical regime, the NTK [36] offers useful approaches to analyze several fundamental properties of cNNs and especially deep cNNs, including the convergence properties in the training process. The NTK is a time-varying nonlinear function that appears in the dynamical equation of the output function of cNN in the training process. Surprisingly, NTK becomes time-invariant in the so-called NTK regime where the number of nodes of CNN becomes infinitely large; further, it becomes positive-definite via random initialization of the parameters. As a result, particularly when the problem is the least square regression, the training process is described by a linear differential (or difference) equation, and the analysis of the training process boils down to that of the spectra of this time-invariant positive-definite matrix. The literature studies on NTK that are related to our work are as follows; the relation to Gaussian process [37], relation between the spectra of NTK and the convergence property of cNN [38], and the NTK in the case of classification problem [39–42].

1.2. Our contribution

In this paper, we study a class of qcNN that can be directly analyzed in the NTK regime. In this proposed qcNN scheme, the classical data is first encoded into the state of a quantum system and then re-transformed to a classical data by some appropriate random measurement, which can thus be regarded as a feature extraction process in the high-dimensional quantum Hilbert space. We then input the reconstructed classical data vector into a subsequent cNN. Finally, a cost function is evaluated using the output of cNN, and the parameters contained in the cNN part are updated to lower the cost. Note that, hence, the quantum part is fixed, implying that the vanishing gradient issue does not occur in our framework. The following is the list of results.

The output of qcNN becomes a Gaussian process in the infinite width limit of the cNN part while the width of the quantum part is fixed, where the unitary gate determining the quantum measurement and the weighting parameters of cNN are randomly chosen from unitary 2-designs and Gaussian distributions, respectively. The covariance matrix of this Gaussian process is given by a function of projected quantum kernels mentioned in the first paragraph. That is, our qcNN certainly exploits the quantum feature space.
In the infinite width limit of cNN, the training dynamics in the functional space is governed by a linear differential equation characterized by the corresponding NTK, meaning the exponentially-fast convergence to the global solution if NTK is positive-definite; a condition to guarantee the positive-definiteness is also obtained. At the convergent point, the output of qcNN is of the form of kernel function of NTK. Because NTK is a nonlinear function of the above-mentioned covariance matrix composed of the quantum projection kernels, and because the computational cost of training is low, our qcNN can be regarded as a method to generate an effective quantum kernel with less computational complexity than the standard kernel method.
Because the NTK has an explicit form of covariance matrix, theoretical analysis on the training process and the convergent value of cost function is possible. As a result, based on this theoretical analysis on the cost function, we derive sufficient condition for our qcNN model to lower the cost function than some other full-classical models. Note that, when the size of quantum system is large, classical computers will have a difficulty to simulate the feature extraction process of the qcNN model; this may be a factor that leads to such superiority.

In addition to the above theoretical investigations, we carry out thorough numerical simulations to evaluate the performance of the proposed qcNN model, as follows.

The numerically computed time-evolution of cost function along the training process well agrees with the analytic form of time-evolution of cost (obtained under the assumption that NTK is constant and positive definite), for both the regression and classification problems, when the width of cNN is bigger than 100. This shows the validity of using NTK to analytically investigate the performance of the proposed qcNN.
The convergence speed becomes bigger (i.e. nearly the ideal exponentially-fast convergence is observed) and the value of final cost becomes smaller, when we make the width of cNN bigger. Moreover, we find that enough reduction of the training cost leads to the decrease of generalization error. That is, our qcNN has several desirable properties predicted by the NTK theory, which are indeed satisfied in many classical models.
Both the regression and classification performance largely depend on the choice of quantum circuit ansatz for data-encoding, which is reasonable in the sense that the proposed method is essentially a kernel method. Yet we found an interesting case where the ansatz with bigger-expressibility (due to containing some entangling gates) decreases the value of final cost lower than that achieved via the ansatz without entangling gates. This implies that the quantumness may have a power to enhance the performance of the proposed qcNN model, depending on the dataset or selected ansatz.
The proposed qcNN model shows a clear advantage over full cNNs and qNNs for the problem of learning the quantum data-generating process. A particularly notable result is that, even with much less parameters (compared to the full cNNs) and smaller training cost (compared to the qNNs), the qcNN can execute the regression and the classification task with sufficient accuracy. Also, in terms of the generalization capability, the qcNN model shows much better performance than the others, mainly thanks to the inductive bias.

1.3. Related works

Before finishing this section, we address related works. Recently (after submitting a preprint version of this manuscript), the following studies on quantum NTK have been presented. Their NTK is defined for the cost function of the output state of a qNN. In [43], the authors studied the properties of the linear differential equation of the cost (which corresponds to equation (12) shown later), obtained under the assumption that the NTK does not change much in time. This idea was further investigated in the subsequent paper [44], showing in both theory and numerical simulation that the dynamics of cost exponentially decays when the number of parameters is large, i.e. when the system is within the over-parametrization regime, as suggested by the conventional classical NTK theory. This behavior was also supported by numerical simulations provided in [45]. Also, in [46], a relation between their NTK and the vanishing gradient issue was discussed; that is, to satisfy the assumption that the NTK does not change in time, the qNN has to contain $O(4^n)$ parameters with n the number of qubits, which actually has the same origin as the vanishing gradient issue. In [47] the authors gave a method for mitigating this demanding requirement; they study the training dynamics in a space with effective dimension $d_{\mathrm{eff}}$ instead of the entire Hilbert space with dimension 2ⁿ, which as a result allows $O(d_{\mathrm{eff}}^2)$ parameters to guarantee the exponential convergence. All these studies focus on fully-quantum systems, while in this paper we focus on a class of classical-quantum hybrid systems where the tunable parameters are contained only in the classical part and the NTK is defined with respect to those parameters. A critical consequence due to this difference is that our NTK becomes time-invariant (theorem 5) and the output function becomes Gaussian (theorems 3 and 4) in the over-parametrization regime, while these provable features were not reported in the above literature works. In particular, the time-invariancy is critical to guarantee the exponential convergence of the output function; as mentioned above, they rely on the assumption that the NTK does not change much in time. It may look like that our NTK is a fully classical object and as a result we are allowed to have such provable facts, but certainly it can extract features of the quantum part in the form of nonlinear function of the projected quantum kernel, as mentioned above.

1.4. Structure of the paper

The structure of this paper is as follows. Section 2 reviews the theory of NTK for cNNs. Section 3 begins with describing our proposed qcNN model, followed by showing some theorems. Also we discuss possible advantage of our qcNN over some other models. Section 4 is devoted to give a series of numerical simulations. Section 5 concludes the paper.

2. Preliminary: classical NTK theory

The NTK theory, which was originally proposed in [36], offers a method for analyzing the dynamics of an infinitely-wide cNN under the gradient-descent-based training process. In particular, the NTK theory can be used for explaining why deep cNNs with much more parameters than the number of data (i.e. over-parametrized cNNs) work quite well in various machine learning tasks in terms of training error. We review the NTK theory in sections from 2.1 to 2.4. Importantly, the NTK theory can also be used to conjecture when cNNs may fail. As a motivation for introducing our model, we discuss one of the failure conditions of cNN in terms of NTK, in section 2.5.

2.1. Problem settings of NTK theory

The NTK theory [36] focuses on supervised learning problems. That is, we are given N_D training data ${({\mathbf{x}}^a, y^a)}$ ( $a = 1,2,\cdots, N_D$ ), where ${\mathbf{x}}^a$ is an input vector and y^a is the corresponding output; here we assume for simplicity that y^a is a scalar, though the original NTK theory can handle the case of vector output. Suppose this dataset is generated from the following hidden (true) function $f_{\mathrm{goal}}$ as follows;

$\begin{align} y^a = f_{\mathrm{goal}}\left({\mathbf{x}}^a\right), \qquad \forall a. \end{align} \tag{ 1 }$

Then the goal is to train the model $f_{\boldsymbol{\theta}(t)}$ , which corresponds to the output of a cNN, so that $f_{\boldsymbol{\theta}(t)}$ becomes close to $f_{\mathrm{goal}}$ in some measure, where $\boldsymbol{\theta}(t)$ is the set of the trainable parameters at the iteration step t. An example of the measure that quantifies the distance between $f_{\boldsymbol{\theta}(t)}$ and $f_{\mathrm{goal}}$ is the mean squared error:

$\begin{align} \mathcal{L}_t^{C} = \frac{1}{2}\sum_{a = 1}^{N_D} \left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) - f_{\mathrm{goal}}\left({\mathbf{x}}^a\right) \right)^2 = \frac{1}{2}\sum_{a = 1}^{N_D} \left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) - y^a \right)^2, \end{align} \tag{ 2 }$

which is mainly used for regression problems. Another example of the measure is the binary cross entropy:

$\begin{align} \mathcal{L}_t^{C} = -\sum_{a = 1}^{N_D}\left(y^a \log \sigma_s \left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)\right) + \left(1 - y_a\right) \log \sigma_s \left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)\right)\right), \end{align} \tag{ 3 }$

which is mainly used for classification problems where σ_s is the sigmoid function and y^a is a binary label that takes either 0 or 1.

The function $f_{\boldsymbol{\theta}(t)}$ is constructed by a fully-connected network of L layers. Let $n_\ell$ be the number of nodes (width) of the $\ell$ th layer (hence $\ell = 0$ and $\ell = L$ correspond to the input and output layers, respectively). Then the input ${\mathbf{x}}^a$ is converted to the output $f_{\boldsymbol{\theta}(t)}({\mathbf{x}}^a)$ in the following manner:

$\begin{align} \boldsymbol{\alpha}^{\left(0\right)}\left({\mathbf{x}}^{a}\right) & = {\mathbf{x}}^a, \nonumber\\ \boldsymbol{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}^{a}\right) & = \sigma\left(\tilde{\boldsymbol{\alpha}}^{\left(\ell\right)}\left({\mathbf{x}}^{a}\right)\right), \nonumber\\ \tilde{\boldsymbol{\alpha}}^{\left(\ell + 1\right)}\left({\mathbf{x}}^{a}\right) & = \frac{1}{\sqrt{n_{\ell}}}W^{\left(\ell\right)}\boldsymbol{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}^{a}\right) + \xi b^{\left(\ell\right)}, \nonumber\\ f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) & = \boldsymbol{\alpha}^{\left(L\right)}\left({\mathbf{x}}^a\right), \end{align} \tag{ 4 }$

where $W^{(\ell)}\in \mathbf{R}^{n_{l} \times n_{l-1}}$ is the weighting matrix and $b^{(\ell)} \in \mathbf{R}^{n_{l}}$ is the bias vector in the $\ell$ th layer. Also σ is the activation function that is differentiable. Note that the vector of trainable parameters $\boldsymbol{\theta}(t)$ is now composed of all the elements of $\{W^{(\ell)}_{jk}\}$ and $b^{(\ell)}$ . The parameters are updated by using the gradient descent algorithm

$\begin{align} \frac{\partial \theta_j \left(t\right)}{\partial t} = - \eta \frac{\partial \mathcal{L}_t^{C}}{\partial \theta_j} = -\eta\sum_a \frac{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^a\right)}{\partial \theta_j} \frac{\partial \mathcal{L}_t^{C}}{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^a\right)}, \end{align} \tag{ 5 }$

where for simplicity we take the continuous-time regime in t. Also, η is the learning rate and θ_j is the jth parameter. All parameters, $\{W^{(\ell)}_{jk}\}$ and $b^{(\ell)}$ , are initialized by sampling from the mutually independent normal Gaussian distribution.

2.2. Definition of NTK

NTK appears in the dynamics of the output function $f_{\boldsymbol{\theta}(t)}$ , as follows. The time derivative of $f_{\boldsymbol{\theta}(t)}$ is given by

$\begin{align} \frac{\partial f_{\boldsymbol{\theta}(t)} ({\mathbf{x}}))}{\partial t} & = \sum_{j}\frac{\partial f_{\boldsymbol{\theta}(t)} ({\mathbf{x}})}{\partial \theta_j}\frac{\partial \theta_j}{\partial t} = -\eta \sum_{j,b}\frac{\partial f_{\boldsymbol{\theta}(t)} ({\mathbf{x}})}{\partial \theta_j}\frac{\partial f_{\boldsymbol{\theta}(t)} ({\mathbf{x}}^b)}{\partial \theta_j}\frac{\partial \mathcal{L}_t^{C}}{\partial f_{\boldsymbol{\theta}(t)} ({\mathbf{x}}^b)} = -\eta \sum_b K^{(L)}({\mathbf{x}}, {\mathbf{x}}^b, t) \frac{\partial \mathcal{L}_t^{C}}{\partial f_{\boldsymbol{\theta}(t)} ({\mathbf{x}}^b)}, \end{align} \tag{ 6 }$

where $K^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ is defined by

$\begin{align} K^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right) = \sum_{j}\frac{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}\right)}{\partial \theta_j}\frac{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^{{\prime}}\right)}{\partial \theta_j}. \end{align} \tag{ 7 }$

The function $K^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ is called the NTK. In the following, we will see that the trajectory of $f_{\boldsymbol{\theta}(t)}$ can be analytically calculated in terms of NTK in the infinite width limit $n_1,n_2,\cdots,n_{\ell-1} \rightarrow \infty$ .

2.3. Theorems

The key feature of NTK is that it converges to the time-invariant and positive-definite function $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ in the infinite width limit, as shown below. Before stating the theorems on these surprising properties, let us show the following lemma about the distribution of $f_{\boldsymbol{\theta}(0)}$ :

Lemma 1 (proposition 1 in [36]). With σ as a Lipschitz nonlinear function, in the infinite width limit $n_{\ell} \rightarrow \infty$ for $1\unicode{x2A7D}\ell\unicode{x2A7D} L-1$ , the output function at initialization, $f_{\boldsymbol{\theta}(0)}$ , obeys a centered Gaussian process whose covariance matrix $\boldsymbol{\Sigma}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is given recursively by

$\begin{align} \boldsymbol{\Sigma}^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \frac{1}{n_0}{\mathbf{x}}^T {\mathbf{x}}^{{\prime}} + \xi^2, \nonumber\\ \boldsymbol{\Sigma}^{\left(\ell + 1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \mathbf{E}_{h \sim \mathcal{N}\left(0, {\boldsymbol{\Sigma}}^{\left(\ell\right)}\right)}\left[\sigma\left(h\left({\mathbf{x}}\right)\right) \sigma\left(h\left({\mathbf{x}}^{{\prime}}\right)\right)\right]+\xi^2, \end{align} \tag{ 8 }$

where the expectation is calculated by averaging over the centered Gaussian process with the covariance ${\boldsymbol{\Sigma}}^{(\ell)}$ .

The proof can be found in appendix A.1 of [36]. Note that the expectation for an arbitrary function $z(h({\mathbf{x}}), h({\mathbf{x}}^{{\prime}}))$ can be computed as

$\begin{align} \mathbf{E}_{h \sim \mathcal{N}\left(0, \Sigma^{\left(\ell\right)}\right)}\left[z\left(h\left({\mathbf{x}}\right), h\left({\mathbf{x}}^{{\prime}}\right)\right)\right] = \frac{1}{2\pi\sqrt{|\tilde{{\boldsymbol{\Sigma}}}^{\left(\ell\right)}|}}\int dh\left({\mathbf{x}}\right)dh\left({\mathbf{x}}^{{\prime}}\right) \exp\left(-\frac{1}{2}{\mathbf{h}}^T\left(\tilde{{\boldsymbol{\Sigma}}}^{\left(\ell\right)}\right)^{-1}{\mathbf{h}}\right) z\left(h\left({\mathbf{x}}\right), h\left({\mathbf{x}}^{{\prime}}\right)\right), \end{align} \tag{ 9 }$

where $\tilde{{\boldsymbol{\Sigma}}}^{(\ell)}$ is the $2 \times 2$ matrix

$\begin{align} \tilde{{\boldsymbol{\Sigma}}}^{\left(\ell\right)} = \left( \begin{array}{cc} {\boldsymbol{\Sigma}}^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}\right) & {\boldsymbol{\Sigma}}^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{\prime}\right) \\ {\boldsymbol{\Sigma}}^{\left(\ell\right)}\left({\mathbf{x}}^{\prime}, {\mathbf{x}}\right) & {\boldsymbol{\Sigma}}^{\left(\ell\right)}\left({\mathbf{x}}^{\prime}, {\mathbf{x}}^{\prime}\right) \end{array} \right), \end{align} \tag{ 10 }$

the vector h is defined as ${\mathbf{h}} = \left(h({\mathbf{x}}), h({\mathbf{x}}^{{\prime}})\right)^T$ , and $|\tilde{{\boldsymbol{\Sigma}}}^{(\ell)}|$ is the determinant of the matrix $\tilde{{\boldsymbol{\Sigma}}}^{(\ell)}$ .

From lemma 1, the following theorem regarding NTK can be derived:

Theorem 1 (theorem 1 in [36]). With σ as a Lipschitz nonlinear function, in the infinite width limit $n_{\ell} \rightarrow \infty$ for $1\unicode{x2A7D}\ell\unicode{x2A7D} L-1$ , the neural tangent kernel $K^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ converges to the time-invariant function $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , which is given recursively by

$\begin{align} \boldsymbol{\Theta}^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \boldsymbol{\Sigma}^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{1}{n_0}{\mathbf{x}}^T{\mathbf{x}}^{\prime} + \xi^2, \nonumber\\ \boldsymbol{\Theta}^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \boldsymbol{\Theta}^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\dot{\boldsymbol{\Sigma}}^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)+\boldsymbol{\Sigma}^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right), \end{align} \tag{ 11 }$

where $\dot{\boldsymbol{\Sigma}}^{(\ell)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \mathbf{E}_{h \sim \mathcal{N}\left(0, \boldsymbol{\Sigma}^{(\ell)}\right)}\left[\dot{\sigma}(h({\mathbf{x}})) \dot{\sigma}\left(h\left({\mathbf{x}}^{{\prime}}\right)\right)\right]$ and $\dot{\sigma}$ is the derivative of σ.

Note that, by definition, the matrix $(\boldsymbol{\Theta}^{(L)}({\mathbf{x}}^a, {\mathbf{x}}^b))$ is symmetric and positive semi-definite. In particular, when $L \unicode{x2A7E} 2$ , the following theorem holds:

Theorem 2 (proposition 2 in [36]). With σ as a Lipschitz nonlinear function, the kernel $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite when $L\unicode{x2A7E} 2$ and the input vector x is normalized as ${\mathbf{x}}^T {\mathbf{x}} = 1$ .

The above theorems on NTK in the infinite width limit can be utilized to analyze the trajectory of $f_{\boldsymbol{\theta}(t)}$ as shown in the next subsection.

2.4. Consequence of theorems 1 and 2

From theorems 1 and 2, in the infinite width limit, the differential equation (6) can be exactly replaced by

$\begin{align} \frac{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}\right)}{\partial t} = -\eta \sum_b \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right) \frac{\partial \mathcal{L}_t^{C}}{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^b\right)}. \end{align} \tag{ 12 }$

The solution depends on the form of $\mathcal{L}_t^{C}$ ; of particular importance is the case when $\mathcal{L}_t^{C}$ is the mean squared loss. In our case (2), the functional derivative of the mean squared loss is given by

$\begin{align} \frac{\partial \mathcal{L}_t^{C}}{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^b\right)} = f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^b\right) - y^b, \end{align} \tag{ 13 }$

and then we obtain the ordinary linear differential equation by substituting (13) for (12). This equation can be solved analytically [48] at each data points as

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) = \sum_{j,b} V_{aj}^T\left(V_{jb}f_{\boldsymbol{\theta}\left(0\right)}\left({\mathbf{x}}^b\right) - V_{jb}y^b\right) e^{-\eta \lambda_j t} + y^a, \end{align} \tag{ 14 }$

where $V = (V_{jb})$ is the orthogonal matrix that diagonalizes $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ as

$\begin{align} \sum_{a = 1}^{N_D}\sum_{b = 1}^{N_D}V_{ja}\boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}^a, {\mathbf{x}}^b\right)V_{bk}^{T} = \lambda_j \delta_{jk}. \end{align} \tag{ 15 }$

The eigenvalues λ_j are non-negative, because $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive semi-definite.

When the conditions of theorem 2 are satisfied, then $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite and accordingly $\lambda_j \gt 0$ holds for all j. Thus in the limit $t\rightarrow \infty$ , the solution (14) states that $f_{\boldsymbol{\theta}(t)} ({\mathbf{x}}^a) = y_a$ holds for all a; namely, the value of the cost $\mathcal{L}_t^{C}$ reaches the global minimum $\mathcal{L}_t = 0$ . This fine convergence to the global minimum explains why the over-parameterized cNN can be successfully trained.

We can also derive some useful theoretical formula for general x. In the infinite width limit, from equations (12)–(14) we have

$\begin{align} \frac{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right)}{\partial t} & = -\eta\sum_b \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right)\left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^b\right)-y^b\right) \end{align} \tag{ 16 }$

$\begin{align} & = -\eta\sum_{b,c,j} \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right) V_{bj}^T\left(V_{jc}f_{\boldsymbol{\theta}\left(0\right)}\left({\mathbf{x}}^c\right)-V_{jc}y^c\right)e^{-\eta\lambda_j t}. \end{align} \tag{ 17 }$

This immediately gives

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right) &= -\sum_{b,c,j} \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right)V_{bj}^T D_j \left(V_{jc}f_{\boldsymbol{\theta}\left(0\right)}\left({\mathbf{x}}^c\right)-V_{jc}y^c\right), \end{align} \tag{ 18 }$

where

$\begin{align} D_j &= \left\{\begin{array}{clc} & \left(1-e^{-\eta\lambda_j t}\right)/\lambda_{j} & \left(\lambda_j > 0\right)\\ & \qquad\eta t &\left(\lambda_j = 0\right) \end{array} \right.. \end{align} \tag{ 19 }$

Now, if the initial parameters $\boldsymbol{\theta}(0)$ are randomly chosen from a centered Gaussian distribution, the average of $f_{\boldsymbol{\theta}(t)}({\mathbf{x}})$ over such initial parameters is given by

$\begin{align} \langle f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right) \rangle = \sum_{b,c,j}\boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right)V_{bj}^T D_j V_{jc}y^c. \end{align} \tag{ 20 }$

The formula (18) can be used for predicting the output for an unknown data, but it requires $O(N_D^3)$ computation to have V via diagonalizing NTK, which may be costly when the number of data is large. To the contrary, in the case of cNN, the computational cost for its training is $O(N_D N_P)$ , where N_P is the number of parameters in cNN. Thus, if N_D is so large that $O(N_D^3)$ classical computation is intractable, we can use the finite width cNN with $N_P \unicode{x2A7D} O(N_D)$ , rather than (18) as a prediction function. In such case, the NTK theory can be used as theoretical tool for analyzing the behavior of cNN.

Finally, let us consider the case where the cost is given by the binary cross entropy (3); the functional derivative in this case is given by

$\begin{align} \frac{\partial \mathcal{L}_t^{C}}{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)} & = -y^a \frac{\dot{\sigma}_s\left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)\right)}{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)} - \left(1 - y^a\right)\frac{-\dot{\sigma}_s\left(f\left({\mathbf{x}}^a\right)\right)}{1-\dot{\sigma}_s\left(f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)\right)} \nonumber\\ & = -y^a + \sigma\left(f\left({\mathbf{x}}^a\right)\right), \end{align} \tag{ 21 }$

where in the last line we use the derivative formula for the sigmoid function:

$\begin{align} \dot{\sigma}_s\left(q\right) = \left(1-\sigma_s\left(q\right)\right)\sigma_s\left(q\right). \end{align} \tag{ 22 }$

By substituting (21) into (12), we obtain

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) = -\eta \int_0^{t}dt^{{\prime}}\sum_b \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}^a, {\mathbf{x}}^b\right)\left(-y^b + \sigma\left(f_{\boldsymbol{\theta}\left(t^{\prime}\right)}\left({\mathbf{x}}^a\right)\right)\right), \end{align} \tag{ 23 }$

and similarly for general input x

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right) = -\eta \int_0^{t}dt^{{\prime}}\sum_b \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right)\left(-y^b + \sigma\left(f_{\boldsymbol{\theta}\left(t^{\prime}\right)}\left({\mathbf{x}}^a\right)\right)\right). \end{align} \tag{ 24 }$

These are not linear differential equations and thus cannot be solved analytically, unlike the mean squared error case; but we can numerically solve them by using standard ordinary differential equation tools [48].

2.5. When may cNN fail?

The NTK theory tells that, as long as the condition of theorem 2 holds, the cost function converges to the global minimum in the limit $t\rightarrow \infty$ . However in practice we must stop the training process of cNN at a finite time $t = \tau$ . Thus, the speed of convergence is also an important factor for analyzing the behavior of cNN. In this subsection we discuss when cNN may fail in terms of the convergence speed. We discuss the case when the cost is the mean squared loss.

Recall now that the speed of convergence depends on the eigenvalues $\{\lambda_j\}_{j = 1}^{N_D}$ . If the minimum of the eigenvalues, $\lambda_{\mathrm{min}}$ , is sufficiently larger than 0, the cost function quickly converges to the global minimum in the number of iteration $O(1/\lambda_{\mathrm{min}})$ . Otherwise, the speed of convergence is not determined only by the spectrum of the eigenvalues, but the other factors in (14) need to be taken into account; actually many of the reasonable settings correspond to this case [38], and thus we will consider this setting in the following.

First, the formula (14) can be rewritten as

$\begin{align} w_j\left(t\right) = \left(w_{j}\left(0\right)-g_{j}\right) e^{-\eta \lambda_j t}+g_{j}, \end{align} \tag{ 25 }$

where $w_j(t) = \sum_a V_{ja}f_{\boldsymbol{\theta}(t)}(x_a)$ and $g_j = \sum_a V_{ja}y_a$ . Let us assume that we stop the training at $t = \tau \lt O(1/\lambda_{\mathrm{min}})$ . With $S_{\eta \tau} = \{j|\lambda_j \lt 1/\eta\tau, 1\unicode{x2A7D} j \unicode{x2A7D} N_D\}$ , if we approximate the exponential function as

$\begin{align} e^{-\eta\lambda_j \tau} \simeq \left\{ \begin{array}{cc} 1 & {\mathrm{if}}\ j\in S_{\eta \tau} \\ 0 & {\mathrm{otherwise}} \end{array} \right. , \end{align} \tag{ 26 }$

then we obtain

$\begin{align} w_j\left(\tau\right) \simeq \left\{ \begin{array}{cl} w_j\left(0\right) & {\mathrm{if}}\ j\in S_{\eta \tau}\\ g_j & {\mathrm{otherwise}} \end{array} \right. . \end{align} \tag{ 27 }$

By using the same approximation, the cost function at the iteration step τ can be calculated as

$\begin{align} \mathcal{L}_\tau^C \equiv \frac{1}{N_D}\sum_{a = 1}^{N_D}\left(f_{\boldsymbol{\theta}\left(\tau\right)}\left({\mathbf{x}}^a\right) - y^a\right)^2& = \frac{1}{N_D}\sum_{a = 1}^{N_D}\left[\sum_{j = 1}^{N_D}V^{T}_{aj}\left( w_j\left(\tau\right) - g_j\right) \right]^2 \simeq \frac{1}{N_D}\sum_{a = 1}^{N_D}\left(\sum_{j\in S_{\eta\tau}}V^{T}_{aj}\left(w_j\left(0\right) - g_j\right)\right)^2 \nonumber\\ & = \frac{1}{N_D}\sum_{j\in S_{\eta\tau}} w_j\left(0\right)^2 + \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}g_j^2 - \frac{2}{N_D}\sum_{j\in S_{\eta\tau}}w_j\left(0\right)g_j. \end{align} \tag{ 28 }$

Since $w_j(0)$ is the sum of centered Gaussian distributed variables, $w_j(0)$ also obeys the centered Gaussian distribution with covariance:

$\begin{align} \langle w_j\left(0\right)w_k\left(0\right) \rangle = \sum_{a,b}V_{ja}V_{kb} \langle f_{\boldsymbol{\theta}\left(0\right)}\left(\mathbf{x}^a\right)f_{\boldsymbol{\theta}\left(0\right)}\left({\mathbf{x}}^b\right) \rangle = \sum_{a,b}V_{ja} \boldsymbol{\Sigma}^{\left(L\right)}\left({\mathbf{x}}^a, {\mathbf{x}}^b\right)V_{bk}^{T}. \end{align} \tag{ 29 }$

Thus, we have

$\begin{align} \langle \mathcal{L}_\tau^C\rangle \simeq \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}\sum_{b,c}V_{jb} \boldsymbol{\Sigma}^{\left(L\right)}\left({\mathbf{x}}^b, {\mathbf{x}}^c\right)V_{cj}^{T} + \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}g_j^2. \end{align} \tag{ 30 }$

Since the covariance matrix can be diagonalized with an orthogonal matrix Vʹ as

$\begin{align} V^{{\prime}}_{jb}\boldsymbol{\Sigma}^{\left(L\right)}\left({\mathbf{x}}^b, {\mathbf{x}}^c\right)V^{^{\prime} T}_{ck} = \lambda_j^{{\prime}} \delta_{jk}, \end{align} \tag{ 31 }$

the first term of equation (30) can be rewritten as

$\begin{align} \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}\sum_{b,c}V_{jb} \boldsymbol{\Sigma}^{\left(L\right)}\left({\mathbf{x}}^b, {\mathbf{x}}^c\right)V_{cj}^{T} = \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}\sum_{k = 1}^{N_D}\lambda_k^{{\prime}}\left({\mathbf{v}}^{{\prime}}_k \cdot {\mathbf{v}}_j\right)^2, \end{align} \tag{ 32 }$

where ${\mathbf{v}}_j = \{V_{ja}\}_{a = 1}^{N_D}$ and ${\mathbf{v}}_j^{{\prime}} = \{V^{{\prime}}_{ja}\}_{a = 1}^{N_D}$ . Also, the second term of (30) can be written as

$\begin{align} \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}g_j^2 = \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}\left({\mathbf{y}} \cdot {\mathbf{v}}_j\right)^2, \end{align} \tag{ 33 }$

where y is the label vector defined by ${\mathbf{y}} = \{y^a\}_{a = 1}^{N_D}$ . Thus, we have

$\begin{align} \langle \mathcal{L}_{\tau}^{C}\rangle \simeq \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}\sum_{k = 1}^{N_D}\lambda_k^{{\prime}}\left({\mathbf{v}}^{{\prime}}_k \cdot {\mathbf{v}}_j\right)^2 + \frac{1}{N_D}\sum_{j\in S_{\eta\tau}}\left({\mathbf{y}} \cdot {\mathbf{v}}_j\right)^2. \end{align} \tag{ 34 }$

The cost $\mathcal{L}_\tau^{C}$ becomes large, depending on the values of the first and the second terms, characterized as follows: (i) the first term becomes large if the eigenvectors of $\boldsymbol{\Sigma}^{(L)}({\mathbf{x}}^b, {\mathbf{x}}^c)$ with respect to large eigenvalues align with the eigenvectors of $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}^b, {\mathbf{x}}^c)$ with respect to small eigenvalues and (ii) the second term becomes large if the label vector aligns with the eigenvectors of $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}^b, {\mathbf{x}}^c)$ with respect to small eigenvalues. Of particular importance is the condition where the latter statement (ii) applies. Namely, the cNN cannot be well optimized in a reasonable time if we use a dataset whose label vector aligns with the eigenvectors of $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}^b, {\mathbf{x}}^c)$ with respect to small eigenvalues. If such a dataset is given to us, therefore, an alternative method that may outperform the cNN is highly demanded, which is the motivation of introducing our model.

Remark 1. If some noise is added to the label of the training data, we need not aim to decrease the cost function toward precisely zero. For example, when the noise vector is appended to the true label vector $\tilde{{\mathbf{y}}}$ in the form ${\mathbf{y}} = \tilde{{\mathbf{y}}}+\boldsymbol{\epsilon}$ , it may be favorable to stop the optimization process at time $t = \tau$ before $\sum_{j\in S_{\eta\tau}}(\boldsymbol{\epsilon}\cdot{\mathbf{v}})^2$ becomes small, for avoiding the overfitting to the noise; actually in the original NTK paper [36] the idea of avoiding the overfitting by using early stopping is mentioned. In this case, instead of $\sum_{j\in S_{\eta\tau}}({\mathbf{y}}\cdot{\mathbf{v}})^2$ , we should aim to decrease the value of $\sum_{j\in S_{\eta\tau}}(\tilde{{\mathbf{y}}}\cdot{\mathbf{v}})^2$ , to construct a prediction function that has a good generalization ability.

3. Proposed model

In this section, we introduce our qcNN model for supervised learning, which is theoretically analyzable using the NTK theory. Before describing the detail, we summarize the notable point of this qcNN. This qcNN is a concatenation of a quantum circuit followed by a cNN, as illustrated in figure 1. Likewise the classical case shown in section 2.4, we obtain the time-invariant NTK in the infinite width limit of the cNN part, which allows us to theoretically analyze the behavior of the entire system. Importantly, NTK in our model coincides with a certain quantum kernel computed in the quantum data-encoding part. This means that the output of our qcNN can represent functions of quantum states defined on the quantum feature space (Hilbert space); hence, if the quantum encoder is designed appropriately, our model may have advantage over purely classical systems. In the following, we discuss the detail of our model from section 3.1 to section 3.3, and discuss possible advantage in section 3.4.

**Figure 1.** The overview of the proposed qcNN model. The first quantum part is composed of the encoding unitary $U_{\mathrm{enc}}({\mathbf{x}}^a)$ for the data ${\mathbf{x}}^a$ followed by the random unitary *U_i* and measurement of an observable O for extracting a feature of the quantum state, $f^{\mathrm{Q}} ({\mathbf{x}}^a)_i$ . We run n₀ different quantum circuits to construct a feature vector ${\mathbf{f}}^Q({\mathbf{x}}^a) = (f^{\mathrm{Q}}({\mathbf{x}}^a)_1, f^{\mathrm{Q}}({\mathbf{x}}^a)_2,\cdots, f^{\mathrm{Q}}({\mathbf{x}}^a)_{n_0})$ , which is the input vector to the classical part composed of n₀-nodes multi-layered NN.
Download figure:
Standard image High-resolution image

**Figure 1.** The overview of the proposed qcNN model. The first quantum part is composed of the encoding unitary $U_{\mathrm{enc}}({\mathbf{x}}^a)$ for the data ${\mathbf{x}}^a$ followed by the random unitary *U_i* and measurement of an observable O for extracting a feature of the quantum state, $f^{\mathrm{Q}} ({\mathbf{x}}^a)_i$ . We run n₀ different quantum circuits to construct a feature vector ${\mathbf{f}}^Q({\mathbf{x}}^a) = (f^{\mathrm{Q}}({\mathbf{x}}^a)_1, f^{\mathrm{Q}}({\mathbf{x}}^a)_2,\cdots, f^{\mathrm{Q}}({\mathbf{x}}^a)_{n_0})$ , which is the input vector to the classical part composed of n₀-nodes multi-layered NN.
Download figure:
Standard image High-resolution image

3.1. qcNN model

We consider the same supervised learning problem discussed in section 2. That is, we are given N_D training data ${({\mathbf{x}}^a, y^a)}$ ( $a = 1,2,\cdots, N_D$ ) generated from the hidden function $f_{\mathrm{goal}}$ satisfying

$\begin{align} y^a = f_{\mathrm{goal}}\left({\mathbf{x}}^a\right), \qquad \forall a. \end{align} \tag{ 35 }$

Then the goal is to train the model function $f_{\boldsymbol{\theta}(t)}$ so that $f_{\boldsymbol{\theta}(t)}$ becomes closer to $f_{\mathrm{goal}}$ in some measure, by updating the vector of parameters $\boldsymbol{\theta}(t)$ as a function of time t. Our qcNN model $f_{\boldsymbol{\theta}(t)}$ is composed of the quantum part ${\mathbf{f}}^Q$ and the classical part $f^{\mathrm{C}}_{\boldsymbol{\theta}(t)}$ , which are concatenated as follows:

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)} = f^{\mathrm{C}}_{\boldsymbol{\theta}\left(t\right)} \circ {\mathbf{f}}^Q. \end{align} \tag{ 36 }$

Only the classical part has trainable parameters in our model as will be stated later, and thus the subscript $\boldsymbol{\theta}(t)$ is placed only on the classical part.

The quantum part first operates the n-qubits quantum circuit (unitary operator) $U_{\mathrm{enc}}$ that loads the classical input data ${\mathbf{x}}^a$ into the quantum state in the manner $|\psi({\mathbf{x}}^a)\rangle = U_{\mathrm{enc}}({\mathbf{x}}^a)|0\rangle^{\otimes n}$ . We then operate a random unitary operator U_i on the quantum state $|\psi({\mathbf{x}}^a)\rangle$ and finally measure an observable O to have the expectation value

$\begin{align} f^{\mathrm{Q}} \left({\mathbf{x}}^a\right)_i = \langle\psi\left(\left({\mathbf{x}}^a\right)\right)|U_i^{\dagger} O U_i |\psi\left(\left({\mathbf{x}}^a\right)\right)\rangle = \langle 0|^{\otimes n}U_{\mathrm{enc}} \left({\mathbf{x}}^a\right)^{\dagger} U_i^{\dagger} O U_i U_{\mathrm{enc}} \left({\mathbf{x}}^a\right)|0\rangle^{\otimes n}. \end{align} \tag{ 37 }$

We repeat this procedure for $i = 1,\ldots,n_0$ and collect these quantities to construct the n₀-dimensional vector ${\mathbf{f}}^Q({\mathbf{x}}^a) = (f^{\mathrm{Q}}({\mathbf{x}}^a)_1, f^{\mathrm{Q}}({\mathbf{x}}^a)_2,\cdots, f^{\mathrm{Q}}({\mathbf{x}}^a)_{n_0})$ , which is the output of the quantum part of our model. The randomizing process corresponds to extracting features of $|\psi({\mathbf{x}}^a)\rangle$ , likewise the machine learning method using the classical shadow tomography [49, 50]; but our method does not construct a tomographic density matrix (called the snapshot) but directly construct the feature vector ${\mathbf{f}}^Q({\mathbf{x}}^a)$ which will be further processed in the classical part. Note that, as shown later, we will make n₀ bigger sufficiently so that the NTK becomes time-invariant and thereby the entire dynamics is analytically solvable. Hence it may look like that the procedure for constructing the n₀-dimensional vector ${\mathbf{f}}^Q({\mathbf{x}}^a)$ is inefficient, but practically a modest number of n₀ is acceptable, as demonstrated in the numerical simulation in section 4.3.

In this paper, we take the following setting for each component. The classical input data ${\mathbf{x}}^a$ is loaded into the n-qubits quantum state through the encoder circuit $U_{\mathrm{enc}}$ . Ideally, we should design the encoder circuit $U_{\mathrm{enc}}$ so that it reflects the hidden structure (e.g. symmetry) of the training data, as suggested in [28, 51]; the numerical simulation in section 4.3 considers this case. As for the randomizing unitary operator U_i , it is of the tensor product form:

$\begin{align} U_i = U_i^1 \otimes U_i^2 \otimes \cdots U_i^{n_Q}, \end{align} \tag{ 38 }$

where m is an integer called the locality, and we assume that $n_Q = n/m$ is an integer. Each $U_i^{k}\ (k = 1,2,\cdots,n_Q)$ is independently sampled from unitary 2-designs and is fixed during the training. Note that a unitary 2-design is implementable on a circuit with the number of gates $O(m^2)$ [52]. Lastly, the observable O is the sum of n_Q local operators:

$\begin{align} O = \sum_{k = 1}^{n_Q}I_{\left(k-1\right)m}\otimes \mathcal{O} \otimes I_{\left(n_Q-k\right)m}, \end{align} \tag{ 39 }$

where I_u is the 2^u-dimensional identity operator and $\mathcal{O}$ is a 2^m-dimensional traceless operator.

Next we describe the classical part, $f^{\mathrm{C}}_{\boldsymbol{\theta}(t)}$ . This is a cNN that takes the vector ${\mathbf{f}}^Q({\mathbf{x}}^a)$ as the input and returns the output $f^{\mathrm{C}}_{\boldsymbol{\theta}(t)}({\mathbf{f}}_Q)$ ; therefore, $f_{\boldsymbol{\theta}(t)}({\mathbf{x}}^a) = f^{\mathrm{C}}_{\boldsymbol{\theta}(t)}({\mathbf{f}}^Q({\mathbf{x}}^a))$ . We implement $f^{\mathrm{C}}_{\boldsymbol{\theta}(t)}$ as an L-layers fully connected cNN, which is the same as that introduced in section 2:

$\begin{align} \boldsymbol{\alpha}^{\left(0\right)}\left({\mathbf{x}}^{a}\right) & = {\mathbf{f}}^Q\left({\mathbf{x}}^a\right), \nonumber\\ \boldsymbol{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}^{a}\right) & = \sigma\left(\tilde{\boldsymbol{\alpha}}^{\left(\ell\right)}\left({\mathbf{x}}^{a}\right)\right), \nonumber\\ \tilde{\boldsymbol{\alpha}}^{\left(\ell + 1\right)}\left({\mathbf{x}}^{a}\right)& = \frac{1}{\sqrt{n_{\ell}}}W^{\left(\ell\right)}\boldsymbol{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}^{a}\right) + \xi b^{\left(\ell\right)}, \nonumber\\ f^{\mathrm{C}}_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{f}}\left({\mathbf{x}}^a\right)\right) & = \boldsymbol{\alpha}^{\left(L\right)}\left({\mathbf{x}}^a\right), \end{align} \tag{ 40 }$

where $\ell = 0,1,\cdots,L-1$ . As in the case of cNN studied in section 2, $W^{(\ell)}$ is the $n_{\ell +1}\times n_{\ell}$ weighting matrix and $b^{(\ell)}$ is the $n_{\ell}$ -dimensional bias vector; each element of W and $b^{(\ell)}$ are initialized by sampling from the mutually independent normal Gaussian distributions.

The parameter $\boldsymbol{\theta}(t)$ is updated by the gradient descent algorithm

$\begin{align} \frac{\partial\theta_p\left(t\right)}{\partial t} = -\eta \frac{\partial \mathcal{L}^Q_{t}}{\partial \theta_p\left(t\right)}, \end{align} \tag{ 41 }$

where $\mathcal{L}^Q_{t}$ is the cost function that reflects a distance between $f_{\boldsymbol{\theta}(t)}$ and $f_{\mathrm{goal}}$ . Also η is the learning rate and $\theta_p(t)\ (p = 1,2,\cdots,P)$ is the pth element of $\boldsymbol{\theta}(t)$ that corresponds to the elements of $W^{(1)}, W^{(2)}, \cdots, W^{(L-1)}$ and $b^{(1)}, b^{(2)}, \cdots, b^{(L-1)}$ . The task of updating the parameters only appears in the classical part, which can thus be performed by applying some established machine learning solver given the N_D training data $\{({\mathbf{x}}^a, y^a) \}\ (a = 1,2,\cdots,N_D)$ , cNN $f^{\mathrm{C}}_{\boldsymbol{\theta}(t)}$ , and the cached output from the quantum part at initialization.

3.2. Quantum neural tangent kernel

As proven in section 2, when the parameters are updated via the gradient descent method (41), the output function $f_{\boldsymbol{\theta}(t)}$ changes in time according to

$\begin{align} \frac{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right)}{\partial t} = -\eta \sum_{a = 1}^{N_D}K_Q\left({\mathbf{x}}, {\mathbf{x}}^a, t\right) \frac{\partial \mathcal{L}^Q_{t}}{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right)}. \end{align} \tag{ 42 }$

Here $K_Q({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ is the quantum neural tangent kernel (QNTK), defined by

$\begin{align} K_Q\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right) = \sum_{p = 1}^{P}\frac{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right)}{\partial \theta_p\left(t\right)}\frac{\partial f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^{{\prime}}\right)}{\partial \theta_p\left(t\right)}. \end{align} \tag{ 43 }$

It is straightforward to show that $K_Q({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ is positive semi-definite. We will see the reason why we call $K_Q({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ as the quantum neural tangent kernel in the next subsection.

3.3. Theorems

We begin with the theorem stating the probability distribution of the output function $f_{\boldsymbol{\theta}(0)}$ in the case L = 1; this setting shows how a quantum kernel appears in our model, as follows.

Theorem 3. With σ as a Lipschitz function, for L = 1 and in the limit $n_0\xrightarrow{}\infty$ , the output function $f_{\boldsymbol{\theta}(0)}$ is a centered Gaussian process whose covariance matrix $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is given by

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x^{{\prime}}}}}\right) - \frac{1}{2^m}\right) + \xi^2. \end{align} \tag{ 44 }$

Here $\rho_x^k$ is the reduced density matrix defined by

$\begin{align} \rho^k_{{\mathbf{x}}} = {\mathrm{Tr}}_k\left(U_{\mathrm{enc}}\left({\mathbf{x}}\right)|0\rangle^{\otimes n}\langle 0|^{\otimes n}U_{\mathrm{enc}}\left({\mathbf{x}}\right)^{\dagger}\right), \end{align} \tag{ 45 }$

where ${\mathrm{Tr}}_k$ is the partial trace over the entire Hilbert space except from the $(km-m)$ th qubit to the $(km-1)$ th qubit.

The proof is found in appendix A. Note that the term $\sum_{k = 1}^{n_Q} {\mathrm{Tr}}(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x^{{\prime}}}}})$ coincides with one of the projected quantum kernels introduced in [27] with the following motivation. That is, when the number of qubits (hence the dimension of Hilbert space) becomes large, the Gram matrix composed of the inner product between pure states, ${\mathrm{Tr}}(\rho_{\mathbf{x}} \rho_{{\mathbf{x}}^{{\prime}}}) = |\langle \psi({\mathbf{x}})|\psi({\mathbf{x}}^{{\prime}})\rangle|^2$ , becomes close to the identity matrix under certain type of feature map [27, 35, 53], meaning that there is no quantum advantage in using this kernel. The projected quantum kernel may cast as a solution for this problem; that is, by projecting the density matrix in a high-dimensional Hilbert space to a low-dimensional one as in (45), the Gram matrix of kernels defined by the inner product of projected density matrices can take some quantum-intrinsic structure which largely differs from the identity matrix.

The covariance matrix $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ inherits the projected quantum kernel, which can be more clearly seen from the following corollary:

Corollary 1. The covariance matrix obtained in the setting of theorem 3 is of the form

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}{\mathrm{Tr}}\left(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x^{{\prime}}}}}\right), \end{align} \tag{ 46 }$

if ξ is set to be

$\begin{align} \xi = \sqrt{\frac{n_Q{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{\left(2^{2m}-1\right)2^m}}. \end{align} \tag{ 47 }$

Namely, $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is exactly the projected quantum kernel up to the constant factor, if we suitably choose the coefficient of the bias vector given in equation (40).

Based on result in the case of L = 1, we can derive the following theorems 4 and 5. First, the distribution of $f_{\boldsymbol{\theta}(0)}$ when L > 1 can be recursively computed as follows.

Theorem 4. With σ as a Lipschitz function, for L > 1 and in the limit $n_0, n_1, \cdots, n_{L-1}\xrightarrow{} \infty$ , $f_{\boldsymbol{\theta}(0)}$ is a centered Gaussian process whose covariance matrix $\boldsymbol{\Sigma}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is given recursively by

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1} \sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x^{{\prime}}}}}\right) - \frac{1}{2^m}\right) + \xi^2, \nonumber\\ \boldsymbol{\Sigma}_Q^{\left(\ell + 1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \mathbf{E}_{h \sim \mathcal{N}\left(0, \Sigma_Q^{\left(\ell\right)}\right)}\left[\sigma\left(h\left({\mathbf{x}}\right)\right) \sigma\left(h\left({\mathbf{x}}^{{\prime}}\right)\right)\right]+\xi^2, \end{align} \tag{ 48 }$

where the expectation value is calculated by averaging over the centered Gaussian process with covariance matrix $\Sigma_Q^{(\ell)}$ .

The proof is found in appendix B. Note that the only difference between the quantum case (48) and the classical case (8) is that the covariance matrix corresponding to the first layer in the entire network.

The infinite width limit of the QNTK can be also derived in a similar manner as theorem 1, as follows.

Theorem 5. With σ as a Lipschitz function, in the limit $n_0, n_1, \cdots, n_{L-1}\xrightarrow{} \infty$ , the QNTK $K_Q({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ converges to the time-invariant function $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , which is given recursively by

$\begin{align} \boldsymbol{\Theta}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho^k_{\mathbf{x}}\rho^k_{{\mathbf{x}}^{{\prime}}}\right) - \frac{1}{2^m}\right) + \xi^2, \nonumber\\ \boldsymbol{\Theta}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\dot{\boldsymbol{\Sigma}}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)+\boldsymbol{\Sigma}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right), \end{align} \tag{ 49 }$

where $\dot{\boldsymbol{\Sigma}}_Q^{(\ell)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \mathbf{E}_{h \sim \mathcal{N}\left(0, \boldsymbol{\Sigma}_Q^{(\ell)}\right)}\left[\dot{\sigma}(h({\mathbf{x}})) \dot{\sigma}\left(h\left({\mathbf{x}}^{{\prime}}\right)\right)\right]$ and $\dot{\sigma}$ is the derivative of σ.

The proof is in appendix C. Note that the above two theorems can be proven with almost the same manner as in [36].

When L = 1, the QNTK directly inherits the structure of the quantum kernel, and this is the reason why we call $K_Q({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ the quantum NTK. Also, such inherited structure in the first layer propagates to the subsequent layers when L > 1; the resulting kernel is then of the form of a nonlinear function of the projected quantum kernel. Considering the fact that designing an effective quantum kernel is in general quite nontrivial, it is useful for us to have a method to automatically generate a nonlinear kernel function appearing when L > 1. Note that, when the ReLU activation function is used, the analytic form of $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is recursively computable as shown in appendix D.

As in the classical case, theorem 5 is the key property that enables us to analytically study the training process of the qcNN. In particular, let us recall theorem 2 and the discussion below equation (15), showing the importance of positive semi-definiteness or definiteness of the kernel $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ . (The positive semi-definiteness is trivial since $K_Q(x, x^{{\prime}}, t)$ is positive semi-definite.) Actually, we now have an analogous result to theorem 2 as follows.

Theorem 6. For a non-constant Lipschitz function σ, QNTK $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite unless there exists $\{c_a\}_{a = 1}^{N_D}$ such that (i) $\sum_a c_a\rho_{{\mathbf{x}}^a}^k = {\mathbf{0}}$ $(\forall k)$ , $\sum_a c_a = 0$ , and $c_a \neq 0\ (\exists a)$ or (ii) ξ = 0, $\sum_a c_a\rho_{{\mathbf{x}}^a}^k = I_{m}/2^m$ $(\forall k)$ and $\sum_a c_a = 1$ .

We give the proof in appendix E. Note that condition (i) can be interpreted as the data embedded reduced density matrices being linearly dependent, which can be avoided by removing redundant data. It is difficult to give a proper interpretation on the condition (ii), but it is still avoidable by setting ξ larger than zero.

Based on the above theorems, we can theoretically analyze the learning process and moreover the resulting performance. In the infinite-width limit of cNN part, the dynamics of the output function $f_{\boldsymbol{\theta}(t)} ({\mathbf{x}})$ given by equation (42) takes the form

$\begin{align} \frac{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}\right)}{\partial t} = -\eta \sum_b \boldsymbol{\Theta}_Q^{\left(L\right)}\left({\mathbf{x}}, {\mathbf{x}}^b\right) \frac{\partial \mathcal{L}_t^{Q}}{\partial f_{\boldsymbol{\theta}\left(t\right)} \left({\mathbf{x}}^b\right)}. \end{align} \tag{ 50 }$

Because the only difference between this dynamical equation and that for the classical case, equation (12), is in the form of NTK, the discussion in section 2.4 can be directly applied. In particular, if the cost $\mathcal{L}_t^{Q}$ is the mean squared error (2), the solution of equation (50) is given by

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) = \sum_j V_{aj}^{QT}\left(V_{jb}^Qf_{\boldsymbol{\theta}\left(0\right)}\left({\mathbf{x}}^b\right) - V_{jb}^Q y^b\right) e^{-\eta \lambda_j t} + y^a, \end{align} \tag{ 51 }$

where V^Q is the orthogonal matrix that diagonalizes $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ as

$\begin{align} \sum_{a = 1}^{N_D}\sum_{b = 1}^{N_D}V_{ja}^Q\boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}^a, {\mathbf{x}}^b\right)V_{bk}^{QT} = \lambda_j^Q \delta_{jk}. \end{align} \tag{ 52 }$

$\{\lambda_j^Q\}$ are the eigenvalues of $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , which is generally positive semi-definite. If theorem 6 holds, then $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite or equivalently $\{\lambda_j^Q\}$ are all positive; then equation (51) shows $f_{\boldsymbol{\theta}(t)}({\mathbf{x}}^a) \to y^a$ as $t\to\infty$ and thus the learning process perfectly completes. Note that, if the cost is the binary cross-entropy (3), then we have

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}^a\right) = -\eta \int_0^{t}dt^{{\prime}}\sum_b \boldsymbol{\Theta}^{\left(L\right)}\left({\mathbf{x}}^a, {\mathbf{x}}^b\right)\left(-y^b + \sigma\left(f_{\boldsymbol{\theta}\left(t^{\prime}\right)}\left({\mathbf{x}}^a\right)\right)\right). \end{align} \tag{ 53 }$

3.4. Possible advantage of the proposed model

In this subsection, we discuss two scenarios where the proposed qcNN has possible advantage over other models.

3.4.1. Possible advantage over pure classical models

First, we discuss a possible advantage of our qcNN over classical models. For this purpose, recall that our QNTK contains features of quantum states in the form of a nonlinear function of the projected quantum kernel, as proven in theorem 5. Hence, under the assumption of the classical intractability for the projected quantum kernel [27], our QNTK may also be a classically intractable object. As a result, the output function (51) or (53) may potentially achieve the training error or the generalization error smaller than that any classical means cannot reach. Now, considering the fact that designing an effective quantum kernel is in general quite nontrivial, it is useful for us to have a NN-based method for synthesizing a nonlinear kernel function that really outperforms any classical means for a given task.

To elaborate on the above point, let us study the situation where a quantum advantage would appear in the training error. More specifically, we investigate the condition where

$\begin{align} \min_{\sigma\in F,L}\langle\mathcal{L}_\tau^C \rangle > \min_{\sigma\in F,L,U_{\mathrm {enc}}}\langle\mathcal{L}_\tau^Q \rangle, \end{align} \tag{ 54 }$

holds. Here we assume that the time τ is sufficiently large such that further training does not change the cost. Also, F is the set of differentiable Lipschitz functions, L is the number of layer of cNN, and the average is taken over the initial parameters. If (54) holds, we can say that our qcNN model is better than the pure classical model regarding the training error. To interpret the condition (54) analytically, let us further assume that the cost is the mean squared error. Then, the condition (54) is approximately rewritten by using equation (34) as

$\begin{align} \min_{\sigma\in F,L}\left\{\sum_{j\in S_{\eta\tau}^C}\sum_{k = 1}^{N_D}\lambda_k^{C^{\prime}}\left({\mathbf{v}}^{C^{\prime}}_k \cdot {\mathbf{v}}_j^C\right)^2 + \sum_{j\in S_{\eta\tau}^C}\left({\mathbf{y}} \cdot {\mathbf{v}}_j^C\right)^2\right\} > \min_{\sigma\in F,L,U_{\mathrm {enc}}} \left\{\sum_{j\in S_{\eta\tau}^Q}\sum_{k = 1}^{N_D}\lambda_k^{Q^{\prime}}\left({\mathbf{v}}^{Q^{\prime}}_k \cdot {\mathbf{v}}_j^Q\right)^2 + \sum_{j\in S_{\eta\tau}^Q}\left({\mathbf{y}} \cdot {\mathbf{v}}_j^Q\right)^2\right\}, \end{align} \tag{ 55 }$

where $(\{\lambda_k^{C}\}_{k = 1}^{N_D},\ \{{\mathbf{v}}^{C}_k\}_{k = 1}^{N_D})$ , $(\{\lambda_k^{Q}\}_{k = 1}^{N_D}, \{{\mathbf{v}}^{Q}_k\}_{k = 1}^{N_D})$ , $(\{\lambda_k^{C^{\prime}}\}_{k = 1}^{N_D},\ \{{\mathbf{v}}^{C^{\prime}}_k\}_{k = 1}^{N_D})$ , and $(\{\lambda_k^{Q^{\prime}}\}_{k = 1}^{N_D}, \{{\mathbf{v}}^{Q^{\prime}}_k\}_{k = 1}^{N_D})$ are pairs of the eigenvalues and eigenvectors of $\boldsymbol{\Sigma}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , $\boldsymbol{\Sigma}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , and $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , respectively. Also, $S_{\eta\tau}^C$ and $S_{\eta\tau}^Q$ are the sets of indices where $\lambda_j^C\lt1/\eta\tau$ and $\lambda_j^Q\lt1/\eta\tau$ , respectively; we call the eigenvectors corresponding to the indices in $S_{\eta\tau}^C$ or $S_{\eta\tau}^Q$ as the bottom eigenvectors. That is, now the condition (54) is converted to the condition (55), which is represented in terms of the eigenvectors of the covariance matrices and the NTKs. Of particular importance is the second terms in both sides. These terms depend only on how well the bottom eigenvectors of $\boldsymbol{\Theta}^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ or $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ align with the label vector y. Therefore, if the bottom eigenvectors of classically intractable QNTK do not align with y at all, while that of classical counterparts align with y, equation (55) is likely to be satisfied, meaning that we may have the advantage of using our qcNN model over classical models. This discussion also suggests the importance of the structure of dataset to have quantum advantage; see section 7 of supplemental materials of [27]. In our case, we may even manipulate y so that $\sum_{j\in S_{\eta\tau}^C}({\mathbf{y}} \cdot {\mathbf{v}}_j^C)^2 \gg \sum_{j\in S_{\eta\tau}^C}({\mathbf{y}} \cdot {\mathbf{v}}_j^Q)^2$ for all possible classical models and thereby obtain a dataset advantageous in the qcNN model. A comprehensive study is definitely important for clarifying practical datasets and corresponding encoders that achieve (54), which is left for future work.

3.4.2. Note on the quantum kernel method

The proposed qcNN model has a merit in the sense of computational complexity for the training process, compared to the quantum kernel method. As shown in [33], by using the representer theorem [54], the quantum kernel method in general is likely to give better solutions than the standard (i.e. the data encoding unitary is used just once) variational method for searching the solution, in terms of the training error. However, the quantum kernel method is poor in scalability, as in the case of the classical counterpart; that is, $O(N_D^2)$ computation is needed to calculate the quantum kernel. To the contrary, our qcNN is exactly the kernel method in the infinite width limit of the classical part, and the computational complexity to learn the approximator is $O(N_D T)$ with T the number of iterations. Therefore, as far as the number of iterations satisfies $T \ll N_D$ , our qcNN model casts as a scalable quantum kernel method.

3.4.3. Specific setting where our model outperforms pure quantum or classical models

Secondly, we discuss the possible advantage of the proposed qcNN model over some other models, for the training error in the following feature prediction problem of quantum states. That is, we are given the training set $\{\rho({\mathbf{x}}^a), y_a\}$ , where $\rho({\mathbf{x}}^a)$ is an unknown quantum state with ${\mathbf{x}}^a$ the characteristic input label such as temperature and y_a is the output mean value of an observable such as the total magnetization; the problem is, based on this training set, to construct a predictor of y for a new label x or equivalently $\rho({\mathbf{x}})$ . Let us now assume that the proposed model can directly access to $\rho({\mathbf{x}}^a)$ ; then clearly it gives a better approximator to the training dataset and thereby a better predictor compared to any classical model that can only use $\{{\mathbf{x}}^a, y_a\}$ . Also, as shown below theorem 5, our model can represent a nonlinear function of the projected quantum kernel and thus presumably approximates the training dataset better than any full-quantum model that can also access to $\rho({\mathbf{x}}^a)$ yet is limited to produce a linear function $y = {\mathrm{Tr}}[A U(\boldsymbol{\theta})\rho({\mathbf{x}})U^\dagger(\boldsymbol{\theta})]$ with an observable A. These advantage will be actually numerically demonstrated in section 4.3. Moreover, [50] proposed a model that makes a random measurement on $\rho({\mathbf{x}}^a)$ to generate a classical shadow for approximating $\rho({\mathbf{x}}^a)$ and then constructs a function of the shadows to predict y for a new input $\rho({\mathbf{x}})$ . Note that our model constructs an approximator directly using the randomized measurement without constructing the classical shadows and thus includes the class of systems proposed in [50]; hence the former can perform better than the latter. Importantly, [50] identifies the class of problems that can be efficiently solved by their model; hence, in principle, this class of problems can also be solved by our model. Lastly, [55] identifies a class of similar feature-prediction problems that can be solved via a specific quantum model with constant number of training data but via any classical model with an exponential number of training data. We will be trying to identify the setting that realizes this provable quantum advantage in our qcNN framework.

4. Numerical experiment

The aim of this section is to numerically answer the following three questions:

How fast is the convergence of QNTK, stated in the theorems in the previous section? In other words, how much is the gap between the training dynamics of an actual finite-width qcNN and that of the theoretical infinite-width qcNN?
How much does the locality m (i.e. the size of randomization in qcNN for extracting the features of encoded data) affect on the training of qcNN?
Is there any clear merit of using our proposed qcNN over fully-classical or fully-quantum machine learning models?

To examine these problems, we perform the following three types of numerical experiments. As for the first question, in section 4.1 we compare the performance of a finite-width qcNN with that of the infinite-width qcNN in specific regression and classification problems; in particular, various types of quantum data-encoders will be studied. We then examine the second question for a specific regression problem, in section 4.2. Finally, in section 4.3, we compare the performance of a finite-width qcNN with a fully-quantum NN (qNN) as well as a fully-classical NN (cNN), in special type of regression and classification problems such that the dataset is generated through a certain quantum process. Throughout our numerical experiments, we use Qulacs [56] to simulate the quantum circuit.

4.1. Finite-width qcNN vs infinite-width qcNN

In this subsection, we compare the performance of an actual finite-width qcNN with that of the theoretical infinite-width qcNN, in a regression task and a classification task with various types of quantum data-encoders.

4.1.1. Experimental settings

4.1.1.1. Choices of the quantum circuit

For the quantum data-encoding part, we employ 5 types of quantum circuit $U_{\mathrm{enc}}({\mathbf{x}})$ whose structural properties are listed in table 1 together with figure 2. In all 5 cases, the circuit is composed of n qubits, and Hadamard gates are first applied to each qubit, followed by RZ-gates that encode the data element $x_i\in[-1, 1]$ in the form ${\mathrm{RZ}}(x_i) = {\mathrm{exp}}(-2\pi i x_i)$ ; here, the data vector is ${\mathbf{x}} = [x_1, x_2, \cdots, x_n]$ , meaning that the dimension of the data vector is equal to the number of qubits. The subsequent quantum circuit is categorized to type-A or type-B as follows. As for the type-A encoders, we consider three types of circuits named Ansatz-A, Ansatz-A4, and Ansatz-A4ne (Ansatz-A4 is constructed via 4 times repetition of Ansatz-A); they contain additional data-encoders composed of RZ-gates with cross-term of data values, i.e. $x_i x_j\ (i,j\in [1,2,\cdots,n])$ . On the other hand, the type-B encoders, Ansatz-B and Ansatz-Bne, which also employ RZ gate for encoding the data-variables, do not have such cross-terms, implying that the type-A encoders have higher nonlinearity than the type-B encoders. Another notable difference between the circuits is the existence of CNOT gates; that is, Ansatz-A, Ansatz-A4, and Ansatz-B contain CNOT-gates, while Ansatz-Ane and Ansatz-Bne do not (ŉe' stands for ŉon-entangled'). In general, a large quantum circuit with many CNOT gates may be difficult to classically simulate, and thus Ansatz-A, Ansatz-A4, and Ansatz-B are expected to show better performance than the other two circuits for some specific tasks. The structures of the subsequent classical NN part will be shown in the following subsection.

**Figure 2.** Configuration of $U_{\mathrm{enc}}({\mathbf{x}})$ . First, Hadamard gates are applied to each qubit. Then, the normalized data values $x_i~(i = 1,\cdots,n)$ are encoded into the angle of RZ-gates. They are followed by the entangling gate composed of CNOT-gates in (a) and (c). Also, (a) and (b) have RZ-gates whose rotating angles are the product of two data values, which are called as 'Cross-term' in table 1. Note that a rotating angle of RZ(x) is $2\pi x$ in (a) and (b), and the dashed rectangle (shown as 'Depth = 1') is repeated 4 times both in Ansatz-A4 and Ansatz-A4ne.
Download figure:
Standard image High-resolution image

Table 1. Specific structural properties of $U_{\mathrm{enc}}({\mathbf{x}})$ .

Circuit type	Cross-term	CNOT	Depth
Ansatz-A	Yes	Yes	×1
Ansatz-A4	Yes	Yes	×4
Ansatz-A4ne	Yes	No	×4
Ansatz-B	No	Yes	×1
Ansatz-Bne	No	No	×1

4.1.1.2. Training method for the classical neural network

In our framework, the trainable parameters are contained only in the classical part (cNN), and they are updated via the standard optimization method. First, we compute the outputs of the quantum circuit, $f^{\,\,\mathrm{Q}} ({\mathbf{x}}^a)_i = \langle\psi ({\mathbf{x}}^a) |U_i^{\dagger} O U_i |\psi ({\mathbf{x}}^a) \rangle$ , $i \in [1,2, \ldots, n_0]$ , for all the training data $\{({\mathbf{x}}^a, y^a) \},~a \in [1,2,\ldots ,N_D]$ ; see figure 1. The outputs are generated through n₀ randomized unitaries $\{U_1,U_2,\ldots, U_{n_0}\}$ , where U_i is sampled from unitary 2-designs with the locality m = 1 [57]. We calculate the expectation of $U_i^\dagger O U_i$ directly using the state vector simulator instead of sampling (the effect of shot noise is analyzed in section 4.3), and these values are forwarded to the inputs to cNN (recall that n₀ corresponds to the width of the first layer of cNN). The training of cNN is performed by using some standard gradient descent methods, whose type and the hyper-parameters such as the learning rate are appropriately selected for each task, as will be described later. The parameters at t = 0 are randomly chosen from the normal distribution ${\cal N}(0, \sqrt{2 / N_{\mathrm{param}} })$ , where $N_{\mathrm{param}}$ is the number of parameters in each layer (here ${\cal N}(\mu, \sigma)$ is the normal distribution with mean µ and standard deviation σ).

4.1.2. Results

4.1.2.1. Result of the regression task

For the regression task, we consider the 1-dimensional hidden function $f_{\mathrm{goal}}(x) = \mathrm{sin}(\textit{x}) + \epsilon$ , where is the stochastic i.i.d. noise subjected to the normal distribution ${\cal N}(0, 0.05)$ . The 1-dimensional input data x is embedded into the 4-dimensional vector ${\mathbf{x}} = [x_1, x_2, x_3, x_4] = [x, x^2, x^3, x^4]$ for quantum circuits. The training dataset $\{x^a, f_{\mathrm{goal}}(x^a)\}, a = 1,\ldots, N_D$ is generated by sampling $x\in U(-1, 1)$ , where $U(u_1, u_2)$ is the uniform distribution in the range $[u_1, u_2]$ . Here the number of training data point is chosen as $N_D = 100$ . Also the number of qubit is set to n = 4. We use the mean squared error for the cost function and the stochastic gradient descent (SGD) with learning rate 10⁻⁴ for the optimizer. The cNN has a single hidden-layer (i.e. L = 1) with the number of nodes $n_0 = 10^3$ , which is equal to the number of inputs and outputs of cNN.

The time-evolution of the cost function during the learning process obtained by the numerical simulation with $n_0 = 10^3$ and its theoretical expression assuming $n_0\to\infty$ are shown in the left 'Simulation' and the right 'Theory' figures, respectively, in figure 3. The curves illustrated in the figures are the best results in total 100 trials of choosing $\{U_i\}$ as well as the initial parameters of cNN. Notably, the convergent values obtained in the simulation well agree with those of theoretical prediction. This means that the performance of the proposed qcNN model can be analytically investigated for various quantum circuit settings.

**Figure 3.** Cost function versus the iteration steps for the regression problem. The time-evolution of the cost function obtained by the numerical simulation with $n_0 = 10^3$ and its theoretical expression assuming $n_0\to\infty$ are shown in the left 'Simulation' and the right 'Theory' figures, respectively.
Download figure:
Standard image High-resolution image

Another important fact is that the type-B encoders show better performance than the type-A encoders. This might be because the type-A encoders have too high expressibility for fitting the simple hidden function, which can be systematically analyzed as demonstrated in [58, 59]. That is, the number of repetition of encoding circuit determines the distribution of Fourier coefficients of the model function; if the model function contains more frequency components, then it has a bigger expressibility for fitting the target function. From this perspective, it is reasonable that the type-B encoders (which have only single-layer encoding block) show better performance than the type-A encoders (which have 4-time-repeating encoding block), since the target hidden function is the single-frequency sin function in our setting. This observation is actually supported by another result showing that Ansatz-A4 shows the best performance for a somewhat complicated hidden function $f_{\mathrm{goal}}(x) = (x-0.2)^2 \mathrm{sin} (12\textit{x})$ . Summarizing, the encoder largely affects on the overall performance and thus should be designed with carefully tuning its expressibility.

4.1.2.2. Result of a classification task

For the classification task, we use an artificial dataset available at [60], which was used to demonstrate that the quantum support vector machine has some advantage over the classical counterpart [61]. Each input data vector x is of 2 dimensional, and thus the number of qubit in the quantum circuit is set as n = 2. The default number of inputs into cNN, or equivalently the width of cNN, is chosen as $n_0 = 10^3$ ; in addition, we will test the cases $n_0 = 10^2$ and $n_0 = 10^4$ for the case of Ansatz-A4ne. Also, we study two different cases of the number of layers of cNN, as L = 1 and L = 2. As for the activation function in cNN, we employ the sigmoid function $\sigma(q) = 1/(1 + e^{-q})$ for the output layer of both L = 1 and L = 2 cases, and ReLU $\sigma(q) = {\mathrm{max}}(0,q)$ for the input later of the L = 2 case; also the number of nodes is $n_0 = 10^3$ for the L = 1 case and $n_0 = n_1 = 10^3$ for the L = 2 case. The number of output label y is two, and correspondingly the model yields the output label according to the following rule; if $f^{C}_{\theta (t)} (f^{Q}({\mathbf{x}}^a))$ is bigger than 0.5, then the output label is '1'; otherwise, the output label is '0'. The number of training data is $n_D = 50$ for each class. As the optimizer for the learning process, Adam [62] with learning rate 10⁻³ is used, and the binary cross entropy (3) is employed as the cost function.

The time-evolution of the cost function during the learning process obtained by the numerical simulation and its theoretical prediction corresponding to the infinite-width cNN are shown in figure 4. The curves illustrated in the figures are the best results in total 100 trials of choosing $\{U_i\}$ as well as the initial parameters of cNN. Clearly, the time-evolution trajectories in Simulation and Theory figures for the same ansatz are similar, particularly in the case of L = 1. However, there is a notable difference in Ansatz-A4 and Ansatz-A4ne; in the Theory figures, the former reaches the final value lower than that achieved by the latter, while in the Simulation figures this ordering exchanges. Now recall that Ansatz-A4 is the ansatz containing CNOT gates, which induce classically intractable quantum state. In this sense, it is interesting that Ansatz-A4 outperforms Ansatz-A4ne, which is though observed only in the case (b) L = 1 Theory.

**Figure 4.** Cost function versus the iteration steps for the classification problem. Figures (a), (b) and figures (c), (d) depict the results in the case of L = 1 and L = 2, respectively. The same dataset is used for each ansatz.
Download figure:
Standard image High-resolution image

In addition, to see the effect of enlarging the width of cNN, we compare three cases where the quantum part is fixed to Ansatz-A4ne and the width of cNN varies as $n_0 = 10^2, 10^3, 10^4$ , in the case of (a) L = 1 Simulation. (Recall that the curve in the Theory figure corresponds to the limit of $n_0\to \infty$ .) The result is that the convergence speed becomes bigger and the value of final cost becomes smaller, as n₀ becomes larger, which is indeed consistent to the NTK theory.

In figures (c) and (d) for L = 2, the trajectory from the simulation closely mirrors that of the theory. In particular, the theoretical result successfully predicts which encoder is effective. We also observe that the convergence speed of the theoretical result when L = 2 is significantly slower than that for L = 1 due to small eigenvalues in the QNTK. Consequently, the training using a finite-width DNN does not converge within our 10 000-iteration experiment. This results in a large discrepancy between the final cost values in Simulation and Theory in the cases of type-B. In the long iteration limit, we anticipate that the final cost values of both the Simulation and Theory will almost align. Moreover, although the trajectories of type-A from Simulation reaches lower values in fewer iterations than in Theory, this does not necessarily imply that the convergence speed of the Simulation is faster. Even if the convergence speed is not faster than that in Theory, the cost values may still reach smaller values with small steps if the final cost values in the convergence in Simulation are smaller than those in Theory. To examine these properties in convergence, simulation with longer steps are required, which will be addressed in future research.

Finally, to see the generalization error, we input 100 test dataset for the trained qcNN models. Figure 5 shows the failure rate, which can be regarded as the generalization error, for some types of ansatz. Because the failure rate obtained when using the classical kernel method presented in [63] is $45\%$ , Ansatz-A4 and -A4ne achieve better performance. This indicates that qcNN with enough expressibility could have higher performance than that of classical method. As another important fact, the result is consistent to that of training error; that is, the ansatz achieving the lower training error shows the lower test error. This might be inconsistent to the following general feature in machine learning; that is, too much expressibility leads to the overfitting and eventually degrades the machine learning performance. However, our model is a function of the projected quantum kernel, which may have a good generalization capability as suggested in [27]. Hence our qcNN model achieving small training error would have a good generalization capability. Further work comparing the performance achieved by full-quantum and full-classical methods will be presented in section. 4.3.

4.2. Effect of the locality on the machine learning performance

Here we focus on the locality m, i.e. the size of the randomizing unitary gate. In our framework, this is regarded as a hyper-parameter, which determines the dimension of reduced Hilbert space, 2^m. Note that the system performance may degrade if m is too large as pointed in [27], and thus m should be carefully chosen. Also, m affects on the eigenvalue distribution of QNTK, which closely relates to the convergence speed of the learning dynamics. Considering the fact that a random circuit may extract essential quantum effect in addition to the above-mentioned practical aspect, in this subsection we study a specific system and a ML task to analyze how much the locality m affects on the convergence speed and the resultant performance.

The ML task is the classification problem for Heart Disease dataset [64]. This dataset has 12 features, meaning that we use a 12-qubits system to encode one feature into one qubit. The goal is to use the training dataset to construct a model system that predicts if a patient would have a heart disease. The number of training data is 100, half of which are the data of patients having a heart disease. We take the qcNN model with L = 1 and several values of m; in particular, we examine the cases $m = 1,2,3,4,6$ for the same dataset. The other setup including the cost function are the same as that used in the previous classification experiment discussed in section 4.1.

We use the theoretical expression of the training process given in equation (51), which is obtained for the infinite-width qcNN, rather than simulating the cost via performing actual training. The learning curves are shown in figure 6. As expected, the convergence speed and the value of final cost largely change dependent on m. To understand the mechanism of this result, firstly, let us recall that the training curve is characterized by the eigenvalues of QNTK $\Theta_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ where x is the data vector. More precisely, as explicitly shown in equation (51), the dynamical component of the index j with large eigenvalue λ_j converges rapidly, while the component with small eigenvalue does slowly. As a result, the distribution of eigenvalues of QNTK determines the entire convergence property of training dynamics. In particular, the ratio of small eigenvalues is a key to characterize the convergence speed. In our simulation, we observe that the magnitude of the eigenvalues totally gets smaller with larger m; this implies that the entire convergence speed would decrease, and actually figure 6 shows this trend. On the other hand, the variance of the eigenvalues distribution also gets smaller with larger m; as a result, the minimum eigenvalue when m = 2 is larger than that when m = 1, implying that the training dynamics with m = 2 may have totally better performance in searching the minimum of the cost, than that with m = 1. Therefore, there should be a trade-off in m. Actually, figure 6 clearly shows that, totally, the case of m = 2 or m = 3 leads to better performance in training. This further suggests us to have a conjecture that, in general, a larger value of m may not lead to better performance and there is an appropriate value of m; considering the fact that a large random quantum circuit is difficult to classically simulate, this observation implies a limitation of the genuine quantum part in the proposed qcNN model.

**Figure 6.** Theoretical prediction of learning curve of qcNN for the classification task with Heart Disease Data Set. Figures (a)–(e) correspond to different locality m. The quantum circuit is composed of 12 qubits. We use the classical NN with L = 1. The other setting, including the cost (the binary cross entropy), is the same as that studied in the classification task in section 4.1.
Download figure:
Standard image High-resolution image

We further note that the value of final cost, which determines the prediction capability for classifying an unseen data, largely changes depending on the type of ansatz. It is particularly notable that Ansatz-A4ne or Ansatz-Bne achieves the best score for all m. These are the ansatz that contain no CNOT-gate, and thus the corresponding quantum states are classically simulatable. That is, for the Heart Disease dataset, it seems that the genuine quantum property, including entanglement, is not effectively used for enhancing the classification performance of the qcNN model. This fact is consistent to the claim given in [27], stating that all quantum machine learning systems will not improve the performance for some specific dataset. In the next subsection, therefore, we will show another learning task with special type of dataset, such that the proposed qcNN containing CNOT gates has certain advantage.

4.3. Advantage of qcNN over full-classical and full-quantum models

Here we study a regression task and a classification task for the dataset generated by a quantum process, to demonstrate possible quantum advantage of the proposed qcNN model as discussed in section 3.4. Our experimental setting is based on the concept suggesting that a quantum machine learning model, which is appropriately constructed with carefully taking into account the dataset, may have a good learnability and generalization capability over classical means. In particular, there are some argument discussing quantum advantage for the dataset generated through a quantum process; see for example [28]. We show that our qcNN model has such desirable property and actually shows better performance, even with much less parameters and thus smaller training cost, compared to a fully classical and a fully quantum means for learning the quantum data-generating process.

4.3.1. Machine learning task and models

First, we explain the meaning of a data generated from a quantum process, which we simply call a 'quantum data' (the case described here is a concrete example of the setting addressed in section 3.4). Typically, a quantum data is the output state of a quantum system driven by a Hamiltonian $H({\mathbf{x}}^a)$ , i.e. $\rho(\mathbf{x}^a) = e^{-iH({\mathbf{x}}^a)}\rho_0 e^{iH({\mathbf{x}}^a)}$ , where the input ${\mathbf{x}}^a$ represents some characteristics of the state such as controllable temperature. The state $\rho(\mathbf{x}^a)$ further evolves through an unknown quantum process including a measurement process. Finally, the output y^a is obtained by measuring some observables; thus, y^a represents a feature of the process or $\rho(\mathbf{x}^a)$ itself. Given such training dataset $\{\mathbf{x}^a, y^a\}_{a = 1}^{N_D}$ , the task is to construct a function that approximates this input-output mapping with good generalization capability for an unseen input data. This problem is related to the general quantum phase recognition (QPR) problem [12, 65, 66] in condensed-matter physics [67], which is inherently classically hard but some quantum machine learning methods may solve efficiently [68–70].

Here we study a specialized version of the above-described problem such that the training dataset $\{\mathbf{x}^a, y^a\}_{a = 1}^{N_D}$ is provided as follows. The input dataset $\{\mathbf{x}^a\}_{a = 1}^{N_D}$ is simply generated from the n-dimensional uniform distribution on $[0, 2\pi]^n$ . Then $\rho (\mathbf{x}^a)$ is generated via an unknown quantum dynamical process $U_{\mathrm{enc}}({\mathbf{x}}) = e^{-iH({\mathbf{x}})}$ ; in the simulation, we assume that this process is given by the quantum circuit shown in figure 7(a) composed of single qubit RX-rotation gates followed by a random multi-qubit unitary operator $U_{\mathrm{random}}$ , the detail of which is shown in appendix G. The output y^a is determined depending on the task. For the regression task, it is given by $y^a = cg(\mathbf{x}^a)+\epsilon^a$ , where $g(\mathbf{x}) = \mathrm{Tr}\left[\rho(\mathbf{x})O\right]$ and is a Gaussian noise with $\mathrm{Var}\left[\epsilon\right] = 10^{-4}$ . This measurement process may contain some uncertainties, and thus we assume that the observable O is unknown for the algorithms. Also c is the normalized constant introduced to satisfy $\mathrm{Var}\left[g(\mathbf{x})\right] = 1$ . For the classification task, if $g(\mathbf{x}^a) \unicode{x2A7E} n/2$ then $y^a = 1$ , and otherwise $y^a = 0$ , where again $g(\mathbf{x}) = \mathrm{Tr}\left[\rho(\mathbf{x})O\right]$ . In the simulation, we take $O = \bigotimes_{i = 1}^{n} (\sigma_z^{(i)}+\mathbf{1}^{(i)})/2$ , where $\sigma_z^{(i)}$ is the Pauli z operator and $\mathbf{1}^{(i)}$ is the identity operator on the ith qubit. The number of training dataset is chosen as $N_D = 1000$ for the regression task and $N_D = 3000$ for the classification task. Moreover, we evaluate the generalization capability using $N_{\mathrm{test}} = 100$ test dataset, which is common for both tasks.

**Figure 7.** The models used in section 4.3: (a) quantum circuit for generating the quantum dataset, which is used for the simulation purpose. (b) Quantum–classical hybrid neural network (qcNN), (c) quantum neural network (qNN), and (d) pure classical neural network (cNN). We use $L_q = 10$ in the figure (c), i.e. a 10-layer qNN. In the figure, the box M depicts the measurement and $U_3(\alpha, \beta, \gamma)$ depicts the generic single-qubit rotation gate with 3 Euler angles.
Download figure:
Standard image High-resolution image

We employ three types of learning models, which are shown in figure 7. First, the figure (b) shows our qcNN model. The point is that this model contains the same encoder $U_{\mathrm{enc}}({\mathbf{x}})$ as that used for generating the training data; that is, the qcNN model has a direct access to the quantum data $\rho({\mathbf{x}}^a)$ . This process is then followed by the random unitary operator given by the product of single qubit gate (i.e. m = 1). The output of the quantum circuit is generated by measuring the observable $O = \bigotimes_{i = 1}^{n} (\sigma_z^{(i)}+\mathbf{1}^{(i)})/2 = \bigotimes_{i = 1}^{n} M^{(i)}$ . Note that this is the same observable as that used for generating the training data, which is assumed to be unknown to the algorithm. However, the random unitary process before the measurement makes this assumption still valid; actually we found that choosing a different observable other than O does not largely change the final performance. Then the expectation of measurement results is transferred to the input to the single-layer (i.e. L = 1) cNN. The activation function of the output node is chosen depending on the task; we employ the identity function for the regression task, while the sigmoid function $\sigma(q) = 1/(1 + e^{-q})$ for the classification task. Finally, the cNN generates $y_{\mathrm{pred}}$ as the raw output for the regression task or $y_{\mathrm{pred}}$ as the binarized output of the cNN for the classification task; that is, as for the latter, $y_{\mathrm{pred}} = 1$ if the output of cNN is bigger than the threshold 0.5 and $y_{\mathrm{pred}} = 0$ if the output is below 0.5. Note that [50] uses the random measurement to generate a classical shadow for approximating $\rho({\mathbf{x}}^a)$ and then constructs a classical machine learning model in terms of the shadows to predict y for unseen $\rho({\mathbf{x}})$ ; in contrast, our approach construct a machine learning model directly using the randomized measurement result without constructing the classical shadows and thus has a clear computational advantage.

The second model is the quantum neural network (qNN) depicted in figure 7(c). This model also has the quantum data $\rho(\mathbf{x}^a)$ directly as input, as in the qcNN model; that is, in the simulation, the data vector ${\mathbf{x}}^a$ is encoded to the same quantum circuit $U_{\mathrm{enc}}({\mathbf{x}})$ shown in figure 7(a). Then a parametrized quantum circuit depicted in the dotted box follows the encoder; this circuit is repeated L_q times, where each block contains different parameters. The output of qNN is computed as $y_{\mathrm{pred}} = w\mathrm{Tr}\left[\rho^{{\prime}}(\mathbf{x})O^{{\prime}}\right]$ , where $\rho^{{\prime}}(\mathbf{x})$ is the output state of the entire quantum circuit and Oʹ is chosen as $O^{{\prime}} = \sigma^{(1)}_z$ . Lastly, w is a scalar parameter; the parameter w is optimized together with the circuit parameters $\{\theta_{i,j}\}$ , to adjust the output range in the case of the regression task, while w is fixed to 1 in the case of the classification task. As in the first model, we use $y_{\mathrm{pred}}$ for the regression task and its binarized version for the classification task.

The third model is the 3-layers cNN depicted in figure 7(d), composed of the n-nodes input layer, n₀-nodes hidden layer, and a single-node output layer. The input and the hidden layers are fully connected; the output node is also fully connected to the hidden layer. We input the data vector x^a so that its ith component $x_i^a$ is the input to the ith node. The activation function is chosen depending on the layer and the task; we employ the sigmoid function $\sigma(q) = 1/(1 + e^{-q})$ in the hidden layer and the identity function in the output node for the regression task, while ReLU $\sigma(q) = {\mathrm{max}}\{0, q\}$ in the hidden layer and the sigmoid function in the output node for the classification task. Hence, this pure classical model knows neither the quantum data $\rho(\mathbf{x}^a)$ and the observable O (or the model does not have enough power for computing these possibly large matrices).

In all the above three models, we will test four cases in the number of qubit, $n = \{2,3,4,5\}$ . Also, the number of nodes in the cNN is chosen as $n_0 = 10^3$ for the qcNN model and the pure cNN model. The expectation value of the observable is calculated using the statevector. Adam [62] is used for optimizing the parameters.

4.3.2. Results

4.3.2.1. Result of the regression task

The resulting performance of the regression task, for both training and test process, is shown in figure 8. We plot the root mean squared errors (RMSE) between the predicted value $y_{\mathrm{pred}}$ and the true value, versus the number of qubit n. For each n, we performed five trials of experiments and computed the mean and the standard deviation of RMSE. Note that the three models have different number of parameters and the measured observables (the latter is applied only to the qcNN and qNN models); a detailed information is given in table 2. In particular, the number of parameters of qcNN is much less than that of cNN though they have the same width of nodes (n₀). Also, the number of measured observables required for optimizing qcNN is much less than that for qNN, because qcNN model does not need to optimize the quantum part by repeatedly measuring the output quantum state.

Table 2. The values of model parameters in section 4.3.

		qcNN	qNN	cNN
# of model parameter ^a	N_p	n₀	$3nL_q$	$(n+2)n_0$
# of different quantum circuit for training	$N_{\mathrm{qc}}$	n₀	$2N_p N_{\mathrm{ite}}$	—
# of model parameter ^a ^b	N_p	1k	60–150	4k–7k
Final cost value (Regression, n = 5)	$C^{\mathrm {reg}}_{f}$	$9.58 \times 10^{-9}$	$1.14 \times 10^{-1}$	$3.78 \times 10^{-2}$
Final cost value (Classification, n = 5)	$C^{\mathrm {class}}_{f}$	0.251	0.503	0.364

^aWithout scaling and bias parameter of the output node.^b $n_0 = 1000$ , $n = \{2,3,4,5\}$ , and $L_q = 10$ .

That is, in total, the qcNN is a compact machine learning model compared to the other two. Nonetheless, qcNN achieves the best performance for all cases in the number of qubits and in both training and test dataset, as shown in figure 8. This is mainly thanks to the 'inductive bias' [28], meaning that qcNN model has an inherent advantage of having the quantum data itself as input. This bias is also given to the qNN model, but this model fails to approximate the training and test data when the number of qubits increases; this is presumably because the model does not have a sufficient expressibility power for representing the target function in the Hilbert space. However, even if the qNN model would have such power, it still suffers from several difficulties in the learning process such as the barren plateau issue and the issue of increasing number of measurement. It is also notable that both qcNN and qNN show almost the same performance for the training and test dataset, meaning that they do not overfit at all to the target data, while the performance of cNN model becomes worse for the test data. This difference might be because the quantum models can access to the data quantum state; this is indeed an inductive bias which may be effectively used for having a good generalization capability as suggested in [28].

4.3.2.2. Result of the classification task

The resulting performance of the classification task is shown in figure 9 which plots the value of Accuracy depending on the number of qubits, n. We perform five trials of experiments and compute the mean and the standard deviation of Accuracy. In this task, we observed a similar performance trend as that for the regression task, where qcNN shows the best performance for all n presumably due to the same reason discussed above; that is, the inductive bias in qcNN model and the lack of expressibility of qNN. In addition to the two class classification task, we also executed the multiclass classification task with the same type of data and models, and observed a similar performance trend (found in appendix F). Just for reference, we also plot Accuracy of qcNN for the case of $N_{\mathrm{ite}} = 3000$ and $n_0 = 3000$ , indicated by 'qcNN(tuned)' in figure 9(recall that Accuracies of the other three are obtained when $N_{\mathrm{ite}} = 1000$ and $n_0 = 1000$ ). This means that the performance of qcNN model can be improved via modifying purely classical part.

4.3.2.3. Effect of shot noise

Finally, we study how the regression/classification performance of the qcNN model would change with respect to the number of shots (measurements); recall that the previous numerical simulations shown in figures 8 and 9 use the statevector simulator, meaning that the number of shots is infinite. The problem settings, including the type of dataset and the learning model, are the same as those studied in the previous subsections. The result is summarized in figure 10, showing (a) the RMSE for the regression task and (b) the Accuracy for the classification task. In both cases, we examined different number of qubits n, where the number of layers of cNN part is 1; but we also examined the 2-layers cNN only for the case of n = 5, where the width of the 1st and 2nd layers are $n_0 = n_1 = 10^3$ .

**Figure 10.** (a) The root mean squared errors (RMSE) versus the number of shots for the training dataset, for the regression task. (b) The accuracy versus the number of shots (measurements) for the training dataset, for the classification task.
Download figure:
Standard image High-resolution image

In the regression task, we observe a clear statistical trend between RMSE and the number of shots $N_{\mathrm{shot}}$ as $\epsilon \sim O(1/\sqrt{N_{\mathrm{shot}}})$ , except for the case of 2-layers cNN denoted as 'n = 5(L2)'. The figure suggests that 10⁵ shots seems to offer a sufficient performance comparable to the ideal statevector simulator. It is notable that the 2-layers cNN significantly reduces RMSE, especially when the number of shots is relatively small; this seems that the higher nonlinearlity of cNN compensates the shot noise. As for the classification task shown in figure 10(b), it is notable that the necessary number of shots is much smaller compared to the regression case. In particular, 10² shots achieves a comparable performance to the ideal statevector simulator. This means that the shot noise, or equivalently the noise contained in the input data to the cNN part, does not largely affect on the performance in the classification task.

4.3.2.4. Details of the experimental setting in section 4.3

The number of model parameters, N_p , and the number of different quantum circuit required for training, $N_{\mathrm{qc}}$ , are shown in table 2. The first and the second row is calculated based on the structure of the models, and the third row is the specific N_p in our numerical experiments. Note that N_p differs depending on the model. The value of N_p of qNN seems to be much smaller than others, but this was chosen so that the computational cost for training the parameters in the quantum part becomes practical (recall that the operation speed of the quantum device is slower than that of state-of-the-art classical computers). Table 3 shows the number of iterations for training, $N_{\mathrm{ite}}$ , together with $N_{\mathrm{qc}}$ that is calculated by substituting $N_{\mathrm{ite}}$ for $N_{\mathrm{qc}}$ in table 2.

Table 3. The number of iterations and different quantum circuits for the training process in the problem studied in section 4.3.

Task\Model			qcNN	qNN	cNN
Regression ^a	# of iteration	$N_{\mathrm{ite}}$	2k–20k	1k	2k–20k
Regression ^a	# of different quantum circuit for training	$N_{\mathrm{qc}}$	1k	120k–300k	—
Classification ^a	# of iteration	$N_{\mathrm{ite}}$	1k	1k	1k
Classification ^a	# of different quantum circuit for training	$N_{\mathrm{qc}}$	1k	120k–300k	—

^a $n_0 = 1000$ , $n = \{2,3,4,5\}$ , and $L_q = 10$ .

5. Conclusion

In this paper, we studied a qcNN composed of a quantum data-encoder followed by a cNN, such that the seminal NTK theory can be directly applied. Actually with appropriate random initialization in both parts and by taking the large width limit of nodes of the cNN, the QNTK defined for the entire system becomes time-invariant and accordingly the dynamics of training process can be explicitly analyzed. Moreover, we find that the output of the entire qcNN becomes a nonlinear function of the projected quantum kernel. That is, the proposed qcNN system functions as a nontrivial quantum kernel that can processes the regression and classification tasks with less computational complexity than that of the conventional quantum kernel method. Also, thanks to the analytic expression of the training process, we obtained a condition of the dataset such that qcNN may perform better than classical counterparts. In addition, for the problem of learning the quantum data-generating process, we gave a numerical demonstration showing that the qcNN shows a clear advantage over full cNNs and qNNs, the latter of which is somewhat nontrivial.

As deduced from the results in section 4 as well as those of the existing studies on the quantum kernel method, the performance heavily depends on the design of data-encoder and the structure of dataset. Hence, given a dataset, the encoder should be carefully designed so that the resulting performance would be quantum-enhanced. A straightforward approach is to replace the fixed data-encoding quantum part with a qNN and train it together with the subsequent data-processing cNN part. Actually, we find a general view that a deep learning uses a neural network composed of the data-encoding (or feature extraction) part and the subsequent data-processing part. Hence, such a qNN-cNN hybrid system might have a similar functionality as the deep learning, implying that it would lead to better prediction performance and further, hopefully, achieve some quantum advantages. However, training the qNN part may suffer from the vanishing gradient issue; hence relatively a small qNN might be a good choice. We leave this problem as a future work.

Acknowledgments

This work was supported by MEXT Quantum Leap Flagship Program Grant Numbers JPMXS0118067285 and JPMXS0120319794, and JSPS KAKENHI Grant Number 20H05966.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Appendix A: Proof of theorem 3

Theorem 3. With σ as a Lipschitz function, for L = 1 and in the limit $n_0\xrightarrow{}\infty$ , the output function $f_{\boldsymbol{\theta}(0)}$ is centered Gaussian process whose covariance matrix $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is given by

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x^{{\prime}}}}}\right) - \frac{1}{2^m}\right) + \xi^2. \end{align} \tag{ A1 }$

The reduced density matrix $\rho_x^k$ is defined by

$\begin{align} \rho^k_{{\mathbf{x}}} = {\mathrm{Tr}}_k\left(U_{\mathrm{enc}}\left({\mathbf{x}}\right)|0\rangle^{\otimes n}\langle 0|^{\otimes n}U_{\mathrm{enc}}\left({\mathbf{x}}\right)^{\dagger}\right) \end{align} \tag{ A2 }$

where ${\mathrm{Tr}}_k$ is the partial trace over the Hilbert space associated with all qubits except $(k-1)m \sim km-1$ th qubits.

Proof. From (40) with L = 1, the prediction function becomes

$\begin{align} f_{\boldsymbol{\theta}\left(t\right)}\left({\mathbf{x}}\right) = \frac{1}{\sqrt{n_0}}W^{\left(0\right)}{\mathbf{f}}^Q\left({\mathbf{x}}\right) + \xi b^{\left(0\right)}. \end{align} \tag{ A3 }$

The distribution of $f_{\boldsymbol{\theta}(0)}$ conditioned on the values of ${\mathbf{f}}^Q({\mathbf{x}})$ is centered Gaussian with covariance

$\begin{align} Cov^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{1}{n_0}{\mathbf{f}}^Q\left({\mathbf{x}}\right) \cdot {\mathbf{f}}^Q\left({\mathbf{x}}^{{\prime}}\right) + \xi^2 = \frac{1}{n_0}\sum_{i = 1}^{n_0}\langle \psi\left({\mathbf{x}}\right)|U_i^{\dagger}O U_i|\psi\left({\mathbf{x}}\right)\rangle \langle \psi\left({\mathbf{x}}^{{\prime}}\right)|U_i^{\dagger}O U_i|\psi\left({\mathbf{x}}^{{\prime}}\right)\rangle+ \xi^2, \end{align} \tag{ A4 }$

which can be easily shown by using

$\begin{align} \langle W_{ij}^{\left(0\right)} \rangle & = 0, \qquad \langle W_{ij}^{\left(0\right)}W_{k\ell}^{\left(0\right)} \rangle = \delta_{ik}\delta_{j\ell} \nonumber\\ \langle b^{\left(0\right)}_j \rangle & = 0, \qquad \langle b^{\left(0\right)}_j b^{\left(0\right)}_k \rangle = \delta_{jk}. \end{align} \tag{ A5 }$

In the limit $n_0 \rightarrow \infty$ , from the weak law of large numbers,

$\begin{align} Cov\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)^{\left(1\right)} \rightarrow Cov^{\left(1\right)}_\infty\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \int d\mu\left(U\right) \langle \psi\left({\mathbf{x}}\right)|U^{\dagger}OU|\psi\left({\mathbf{x}}\right)\rangle \langle \psi\left({\mathbf{x}}^{{\prime}}\right)|U^{\dagger}OU|\psi\left({\mathbf{x}}^{{\prime}}\right)\rangle + \xi^2 \nonumber\\ & = \int_{\mathrm{2-design}}dU_1\int_{\mathrm{2-design}}dU_2\cdots\int_{\mathrm{2-design}}dU_{n_Q}\nonumber\\ &\quad\sum_{k = 1}^{n_Q}\langle \psi\left({\mathbf{x}}\right)|I_{\left(k-1\right)m}\otimes U_k^{\dagger}\mathcal{O}U_k\otimes I_{\left(n_Q-k\right)m}|\psi\left({\mathbf{x}}\right)\rangle \nonumber\\ &\quad \times \sum_{r = 1}^{n_Q}\langle \psi\left({\mathbf{x}}^{{\prime}}\right)|I_{\left(r-1\right)m}\otimes U_r^{\dagger}\mathcal{O}U_r\otimes I_{\left(n_Q-r\right)m}|\psi\left({\mathbf{x}}^{{\prime}}\right)\rangle + \xi^2, \end{align} \tag{ A6 }$

where $\mu(U)$ is the distribution of the random unitary matrix and $\int_{\mathrm{2-design}}dU_k$ denotes the integral over unitary 2-designs. By setting $Q^k({\mathbf{x}})$ to

$\begin{align} Q^k\left({\mathbf{x}}\right) = \sum_{k = 1}^{n_Q}\langle \psi\left({\mathbf{x}}\right)|I_{\left(k-1\right)m}\otimes U_k^{\dagger}\mathcal{O}U_k\otimes I_{\left(n_Q-k\right)m}|\psi\left({\mathbf{x}}\right)\rangle, \end{align} \tag{ A7 }$

we obtain

$\begin{align} Cov^{\left(1\right)}_\infty\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \sum_{k\neq r}\int_{\mathrm{2-design}}dU^kQ^k\left({\mathbf{x}}\right)\int_{\mathrm{2-design}}dU^r Q^r\left({\mathbf{x}}^{{\prime}}\right) + \sum_{k = 1}^{n_Q}\int_{\mathrm{2-design}}dU^kQ^k\left({\mathbf{x}}\right)Q^k\left({\mathbf{x}}^{{\prime}}\right) + \xi^2. \end{align} \tag{ A8 }$

The summands of the first and the second terms in (A8) can be computed by using the element-wise integration formula for unitary 2-designs [71]:

$\begin{align} \int_{\mathrm{2-design}}dU dU U_{ab} U^{\ast}_{cd} & = \frac{\delta_{ab}\delta_{cd}}{N} \end{align} \tag{ A9 }$

$\begin{align} \int_{\mathrm{2-design}}dU U_{ab} U^{\ast}_{cd}U_{ef} U^{\ast}_{gh} &= \frac{1}{N^2-1}\left(\delta_{ac}\delta_{bd}\delta_{eg}\delta_{fh}+\delta_{ag}\delta_{bh}\delta_{ce}\delta_{df} \right) -\frac{1}{N\left(N^2-1\right)}\left(\delta_{ac}\delta_{bh}\delta_{eg}\delta_{fd} + \delta_{ah}\delta_{bd}\delta_{ec}\delta_{fh}\right), \end{align} \tag{ A10 }$

where N is the dimension of the unitary matrix.

For the summand of the first term in (A8), we use (A9) and obtain

$\begin{align} \int_{\mathrm{2-design}}dU_k \left[U_{k}^{\dagger}\mathcal{O}U_k\right]_{ab} = \int_{\mathrm{2-design}}dU_k \sum_{cd}\left[U_k^{\ast}\right]_{ca}\mathcal{O}_{cd}\left[U_k\right]_{db} = \sum_{cd}\delta_{ab}\delta_{cd}\mathcal{O}\left({\mathbf{x}}\right)_{cd} = \delta_{ab}{\mathrm{Tr}}\left(\mathcal{O}\right) = 0, \end{align} \tag{ A11 }$

where in the last equality we use that $\mathcal{O}$ is a traceless operator. Therefore the first term in (A8) is zero.

The summand of the second term in (A8) can be written as

$\begin{align} \int_{\mathrm{2-design}}dU_k Q^k\left({\mathbf{x}}\right)Q^k\left({\mathbf{x}}^{{\prime}}\right) & = \int_{\mathrm{2-design}}dU_k{\mathrm{Tr}}\left(U_k^{\dagger}\mathcal{O}U_k\rho^k_{{\mathbf{x}}}\right){\mathrm{Tr}}\left(U_k^{\dagger}\mathcal{O}U_k\rho^k_{{\mathbf{x}}^{{\prime}}}\right)\nonumber\\ & = \int_{\mathrm{2-design}}dU_k\sum_{a_1b_1}\sum_{a_2b_2}\left[U_k^{\dagger}\mathcal{O}U_k\right]_{a_1b_1}\left[\rho^k_{{\mathbf{x}}}\right]_{b_1a_1}\left[U_k^{\dagger}\mathcal{O}U_k\right]_{a_2b_2}\left[\rho^k_{{\mathbf{x}}}\right]_{b_2a_2}, \end{align} \tag{ A12 }$

where $\rho_{{\mathbf{x}}}^k$ is defined in (A2). By using (A8) the integration of the matrix element can be computed as

$\begin{align} &\int_{\mathrm{2-design}}dU_k\left[U_k^{\dagger}\mathcal{O}\left({\mathbf{x}}\right)U_k\right]_{a_1 b_1}\left[U_k^{\dagger}\mathcal{O}\left({\mathbf{x}}^{{\prime}}\right)U_k\right]_{a_2 b_2}\nonumber\\ &\quad = \int_{\mathrm{2-design}}dU_k\sum_{c_1,d_1}\sum_{c_2,d_2}\left[U_k^{\ast}\right]_{c_1 a_1}\mathcal{O}_{c_1d_1}\left[U_k\right]_{d_1 b_1} \left[U_k^{\ast}\right]_{c_2 a_2}\mathcal{O}_{c_2d_2}\left[U_k\right]_{d_2b_2} \nonumber\\ &\quad = \frac{1}{2^{2m}-1}\sum_{c_1,d_1}\sum_{c_2,d_2}\left[\left(\delta_{c_1d_1}\delta_{a_1b_1}\delta_{c_2d_2}\delta_{a_2b_2}+\delta_{c_1d_2}\delta_{a_1b_2}\delta_{d_1u_2}\delta_{b_1a_2}\right)\right. \nonumber\\ &\quad\left.-\frac{1}{2^m}\left(\delta_{c_1d_1}\delta_{a_1b_2}\delta_{c_2d_2}\delta_{a_2b_1}+\delta_{c_1d_2}\delta_{a_1b_1}\delta_{c_2d_1}\delta_{a_2b_2}\right)\right]\mathcal{O}_{c_1d_1}\mathcal{O}_{c_2d_2}\nonumber\\ &\quad = \frac{1}{2^{2m}-1}\left[\left({\mathrm{Tr}}\left(\mathcal{O}\right)\right)^2\delta_{a_1b_1}\delta_{a_2b_2} + {\mathrm{Tr}}\left(\mathcal{O}^2\right)\delta_{a_1b_2}\delta_{b_1a_2}-\frac{1}{2^m}\left\{\left({\mathrm{Tr}}\left(\mathcal{O}\right)\right)^2\delta_{a_1b_2}\delta_{a_2b_1}+{\mathrm{Tr}}\left(\mathcal{O}^2\right)\delta_{a_1b_1}\delta_{a_2b_2}\right\}\right]\nonumber\\ &\quad = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\left(\delta_{a_1b_2}\delta_{a_2b_1}-\frac{1}{2^m}\delta_{a_1b_1}\delta_{a_2b_2}\right), \end{align} \tag{ A13 }$

where in the last equality we use $\mathcal{O}$ is traceless. Substituting the result of (A13) to (A12), we obtain

$\begin{align} \int_{\mathrm{2-design}}dU_k Q^k\left({\mathbf{x}}\right)Q^k\left({\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\left[{\mathrm{Tr}}\left(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x}}^{{\prime}}}\right)-\frac{1}{2^m}\right]. \end{align} \tag{ A14 }$

Substituting zero to the first term in (A8) and (A14) to the summand of the second term, we can show that the covariance matrix is equal to $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ .

Since the covariance matrix $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ does not depend on the value of ${\mathbf{f}}_Q({\mathbf{x}})$ in the limit of $n_0\rightarrow \infty$ , the unconditioned distribution of $f_{\boldsymbol{\theta}(t)}$ is equal to the conditioned distribution of $f_{\boldsymbol{\theta}(t)}$ , namely the centered Gaussian process with the covariance $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ in this limit. □

Appendix B: Proof of theorem 4

Theorem 4. With σ as a Lipschitz function, for $L (\gt1)$ and in the limit $n_0,n_1, \cdots n_{L-1}\xrightarrow{} \infty$ , $f_{\boldsymbol{\theta}(0)}$ is centered Gaussian process whose covariance $\boldsymbol{\Sigma}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is given recursively by

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho^k_{{\mathbf{x}}}\rho^k_{{\mathbf{x^{{\prime}}}}}\right) - \frac{1}{2^m}\right) + \xi^2. \nonumber\\ \boldsymbol{\Sigma}_Q^{\left(\ell + 1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \mathbf{E}_{h \sim \mathcal{N}\left(0, \Sigma_Q^{\left(\ell\right)}\right)}\left[\sigma\left(h\left({\mathbf{x}}\right)\right) \sigma\left(h\left({\mathbf{x}}^{{\prime}}\right)\right)\right]+\xi^2 \end{align} \tag{ B1 }$

where the expectation value is calculated by averaging over centered Gaussian process with covariance $\Sigma_Q^{(L)}$ .

Proof. We prove that $\tilde{\alpha}^{(\ell)}({\mathbf{x}})_j$ for $j = 1,2,\cdots,n_\ell$ are i.i.d centered Gaussian process with the covariance given by the equation (B1) in the infinite width limit by induction, which becomes the proof for the theorem.

For L = 1 we can readily show that the distributions of $\tilde{\alpha}^{(1)}({\mathbf{x}})_j$ are i.i.d centered Gaussian. Then the value of the covariance can be derived in the same manner as the proof of theorem 3.

From the induction hypothesis, $\tilde{\alpha}^{(\ell)}({\mathbf{x}})_j$ for $j = 1,2,\cdots,n_\ell$ are i.i.d centered Gaussian process with the covariance given by the equation (B1) in the infinite width limit. The element-wise formula for the forward propagation from $\ell$ th layer to the next layer can be written as

$\begin{align} \tilde{\alpha}^{\left(\ell + 1\right)}\left({\mathbf{x}}\right)_j = W^{\left(\ell+1\right)}_{jk}\sigma\left(\tilde{{\alpha}}_k^{\left(\ell\right)}\left({\mathbf{x}}\right)\right) + b^{\left(\ell\right)}. \end{align} \tag{ B2 }$

By using

$\begin{align} \langle W_{jk}^{\left(\ell\right)}\rangle = 0, \langle W_{jk}^{\left(\ell\right)} W_{j^{{\prime}}k^{{\prime}}}^{\left(\ell\right)}\rangle = \delta_{jj^{{\prime}}}\delta_{kk^{{\prime}}}, \end{align} \tag{ B3 }$

it can be readily shown that the distributions of $\tilde{\alpha}^{(\ell + 1)}({\mathbf{x}})_j$ conditioned on the values of $\sigma(\tilde{{\alpha}}_k^{(\ell)}({\mathbf{x}}))_k$ are i.i.d. centered Gaussian process with covariance

$\begin{align} Cov^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{1}{n_{\ell}}\sum_k \sigma\left(\tilde{{\alpha}}_k^{\left(\ell\right)}\left({\mathbf{x}}\right)\right)\sigma\left(\tilde{{\alpha}}_k^{\left(\ell\right)}\left({\mathbf{x}}^{{\prime}}\right)\right) + \xi^2. \end{align} \tag{ B4 }$

Since the distributions of $\tilde{\alpha}^{(\ell)}({\mathbf{x}})_k$ for $k = 1,2,\cdots,n_\ell$ are i.i.d, so are the distributions of $\sigma(\tilde{\alpha}^{(\ell)}({\mathbf{x}})_k)$ . Therefore from the weak law of large number, in the limit $n_{\ell}\rightarrow \infty$ the sum is transformed to the expectation value as

$\begin{align} Cov^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\rightarrow \mathbf{E}_{h \sim \mathcal{N}\left(0, \Sigma_Q^{\left(\ell\right)}\right)}\left[\sigma\left(h\left(x\right)\right) \sigma\left(h\left(x^{{\prime}}\right)\right)\right]+\xi^2. \end{align} \tag{ B5 }$

Because the limit of the covariance does not depend on $\sigma(\tilde{\alpha}^{(\ell)}({\mathbf{x}})_k)$ , the unconditioned distribution of $\tilde{\alpha}^{(\ell + 1)}({\mathbf{x}})_j$ is equal to the conditioned distribution, which concludes the proof. □

Appendix C: Proof of theorem 5

Theorem 5. With σ as a Lipschitz function, in the limit $n_0, n_1, \cdots n_{L-1}\xrightarrow{} \infty$ , the quantum neural tangent kernel $K_Q^L({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)$ converges to the time independent function $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ , which is given recursively by

$\begin{align} \boldsymbol{\Theta}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho^k_{\mathbf{x}}\rho^k_{\mathbf{x}^{{\prime}}}\right) - \frac{1}{2^m}\right) + \xi^2, \nonumber\\ \boldsymbol{\Theta}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\dot{\boldsymbol{\Sigma}}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)+\boldsymbol{\Sigma}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \end{align} \tag{ C1 }$

where $\dot{\boldsymbol{\Sigma}}_Q^{(\ell)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \mathbf{E}_{h \sim \mathcal{N}\left(0, \boldsymbol{\Sigma}_Q^{(\ell)}\right)}\left[\dot{\sigma}(h({\mathbf{x}})) \dot{\sigma}\left(h\left({\mathbf{x}}^{{\prime}}\right)\right)\right]$ and $\dot{\sigma}$ is the derivative of σ.

Proof. We define the elementwise QNTK as

$\begin{align} K_{Qjk}^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right) = \sum_{p = 1}^P \frac{\partial\tilde{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}\right)_j}{\partial \theta_p\left(t\right)}\frac{\partial\tilde{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}\right)_k}{\partial \theta_p\left(t\right)} \end{align} \tag{ C2 }$

and prove

$\begin{align} K_{Qjk}^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right) \rightarrow \boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \delta_{jk} \end{align} \tag{ C3 }$

in the infinite width limit $n_0,n_1,\cdots,n_{\ell-1} \rightarrow \infty$ by induction. Then by setting $\ell = L$ and $n_\ell = 1$ we obtain the proof of the theorem.

For $\ell = 1$ ,

$\begin{align} \tilde{{\boldsymbol{\alpha}}}^{\left(1\right)} \left({\mathbf{x}}\right) = \frac{1}{\sqrt{n_0}}W^{\left(0\right)}{\mathbf{f}}^Q\left({{\mathbf{x}}}\right) + \xi b^{\left(0\right)}. \end{align} \tag{ C4 }$

Then the elementwise QNTK is computed as

$\begin{align} K_{Qjk}^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right) & = \frac{1}{n_0}\sum_{i^{\prime} j^{\prime}} \frac{\partial \tilde{\alpha}_{j}^{\left(1\right)} \left({\mathbf{x}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(0\right)}}\frac{\partial \tilde{\alpha}_{k}^{\left(1\right)} \left({\mathbf{x}}^{{\prime}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(0\right)}} + \sum_{i^{\prime}}\frac{\partial \tilde{\alpha}_{j}^{\left(1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(0\right)}_{i^{\prime}}} \frac{\partial \tilde{\alpha}_{k}^{\left(1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(0\right)}_{i^{\prime}}} \end{align} \tag{ C5 }$

$\begin{align} & = \frac{1}{n_0}\sum_{j^{{\prime}}}f^{\mathrm{Q}}\left({{\mathbf{x}}}\right)_{j^{{\prime}}}\cdot f^{\mathrm{Q}}\left({{\mathbf{x}}}\right)_{j^{{\prime}}} \delta_{jk} + \xi^2 \delta_{jk} \end{align} \tag{ C6 }$

$\begin{align} &\rightarrow \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \qquad \left(n_0\rightarrow \infty\right), \end{align} \tag{ C7 }$

where the last line is derived in the proof in theorem 3. Therefore $K_{Qjk}^{(\ell)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)\rightarrow \boldsymbol{\Theta}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}) = \boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is proved for $\ell = 1$ .

From the induction hypothesis, (C3) holds until $\ell$ th layer in the infinite width limit $n_0, n_1,\cdots, n_{\ell-1}\rightarrow \infty$ . Then by using

$\begin{align} \tilde{\boldsymbol{\alpha}}^{\left(\ell + 1\right)} \left({\mathbf{x}}\right) &= \frac{1}{\sqrt{n_\ell}}W^{\left(\ell\right)}_{jk}\boldsymbol{\alpha}^{\left(\ell\right)}\left({\mathbf{x}}\right) + \xi b^{\left(\ell\right)}. \end{align} \tag{ C8 }$

$\begin{align} K_{Qjk}^{\left(\ell + 1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right) & = \sum_{\ell^{{\prime}} = 0}^{\ell} \sum_{i^{\prime} j^{\prime}} \frac{\partial \tilde{\alpha}_{j}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(\ell^{{\prime}}\right)}}\frac{\partial \tilde{\alpha}_{k}^{\left(\ell+1\right)} \left({\mathbf{x}}^{{\prime}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(\ell^{{\prime}}\right)}} + \sum_{\ell^{{\prime}} = 0}^{\ell}\sum_{i^{\prime}}\frac{\partial \tilde{\alpha}_{j}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(\ell^{{\prime}}\right)}_{i^{\prime}}} \frac{\partial \tilde{\alpha}_{k}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(\ell^{{\prime}}\right)}_{i^{\prime}}} \nonumber\\ & = \kappa^{\left(0:\ell-1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{jk} + \kappa^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{jk}, \end{align} \tag{ C9 }$

where

$\begin{align} \kappa^{\left(0:\ell-1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{jk} & = \sum_{\ell^{{\prime}} = 0}^{\ell-1} \sum_{i^{\prime} j^{\prime}} \frac{\partial \tilde{\alpha}_{j}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(\ell^{{\prime}}\right)}}\frac{\partial \tilde{\alpha}_{k}^{\left(\ell+1\right)} \left({\mathbf{x}}^{{\prime}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(\ell^{{\prime}}\right)}} + \sum_{\ell^{{\prime}} = 0}^{\ell-1}\sum_{i^{\prime}}\frac{\partial \tilde{\alpha}_{j}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(\ell^{{\prime}}\right)}_{i^{\prime}}} \frac{\partial \tilde{\alpha}_{k}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(\ell^{{\prime}}\right)}_{i^{\prime}}} \end{align} \tag{ C10 }$

$\begin{align} \kappa^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{jk} &\equiv \sum_{i^{\prime} j^{\prime}} \frac{\partial \tilde{\alpha}_{j}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(\ell\right)}}\frac{\partial \tilde{\alpha}_{k}^{\left(\ell+1\right)} \left({\mathbf{x}}^{{\prime}}\right)}{\partial W_{i^{\prime} j^{\prime}}^{\left(\ell\right)}} + \sum_{i^{\prime}}\frac{\partial \tilde{\alpha}_{j}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(\ell\right)}_{i^{\prime}}} \frac{\partial \tilde{\alpha}_{k}^{\left(\ell+1\right)} \left({\mathbf{x}}\right)}{\partial b^{\left(\ell\right)}_{i^{\prime}}} \nonumber\\ & = \frac{1}{n_\ell}\sum_{j^{{\prime}}}\tilde{\alpha}\left({\mathbf{x}}\right)_{j^{{\prime}}}^{\left(\ell\right)}\tilde{\alpha}\left({\mathbf{x}}^{{\prime}}\right)^{\left(\ell\right)}_{j^{{\prime}}}\delta_{jk} + \xi^2 \delta_{jk} = Cov^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\delta_{jk}. \end{align} \tag{ C11 }$

From the proof of theorem 4, $\kappa^{(\ell)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)_{jk}\rightarrow \boldsymbol{\Sigma}_Q^{(\ell)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})\delta_{jk}$ in the limit $n_{\ell}\rightarrow \infty$ .

By using the chain rule

$\begin{align} \frac{\partial\tilde{\alpha}\left({\mathbf{x}}\right)_{j}^{\left(\ell+1\right)}}{\partial \theta_p} = \frac{1}{\sqrt{n_\ell}}\sum_{k = 1}^{n_\ell}W_{jk}^{\left(\ell\right)}\frac{\partial\tilde{\alpha}_{k}^{\left(\ell\right)}\left({\mathbf{x}}\right)}{\partial \theta_p}\dot{\sigma}\left(\tilde{\alpha}_{k}^{\left(\ell\right)}\left({\mathbf{x}}\right)\right), \end{align} \tag{ C12 }$

the other term $\kappa^{(0:\ell-1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t)_{jk}$ is rewritten as

$\begin{align} \kappa^{\left(0:\ell-1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{jk} = \frac{1}{n_{\ell}}\sum_{j^{\prime} k^{\prime}}W_{jj^{\prime}}^{\left(\ell\right)}W_{kk^{\prime}}^{\left(\ell\right)} K_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{j^{\prime} k^{\prime}} \dot{\sigma}\left(\alpha_{j^{\prime}}^{\left(\ell\right)}\left({\mathbf{x}}\right)\right) \dot{\sigma}\left(\alpha_{k^{\prime}}^{\left(\ell\right)}\left({\mathbf{x}}^{\prime}\right)\right), \end{align} \tag{ C13 }$

and from the induction hypothesis (C3),

$\begin{align} \kappa^{\left(0:\ell-1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t\right)_{jk} \rightarrow \frac{1}{n_{\ell}}\sum_{j^{\prime}}W_{jj^{\prime}}^{\left(\ell\right)}W_{kj^{\prime}}^{\left(\ell\right)}\boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \dot{\sigma}\left(\tilde{\alpha}_{j^{\prime}}^{\left(\ell\right)}\left({\mathbf{x}}\right)\right) \dot{\sigma}\left(\tilde{\alpha}_{j^{\prime}}^{\left(\ell\right)}\left({\mathbf{x}}^{\prime}\right)\right), \end{align} \tag{ C14 }$

in the limit $n_1, n_2, \cdots, n_{\ell-1}$ . In the limit $n_{\ell}\rightarrow \infty$ from the weak law of large number, the sum can be replaced by the expectation value as follows:

$\begin{align} \frac{1}{n_{\ell}}\sum_{j^{\prime}}W_{jj^{\prime}}^{\left(\ell\right)}W_{kj^{\prime}}^{\left(\ell\right)}\boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \dot{\sigma}\left(\tilde{\alpha}_{j^{\prime}}^{\left(\ell\right)}\left({\mathbf{x}}\right)\right) \dot{\sigma}\left(\tilde{\alpha}_{j^{\prime}}^{\left(\ell\right)}\left({\mathbf{x}}^{\prime}\right)\right) &\rightarrow \langle W_{jj^{\prime}}^{\left(\ell\right)}W_{kj^{\prime}}^{\left(\ell\right)}\rangle \boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\dot{\boldsymbol{\boldsymbol{\Sigma}}}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \nonumber\\ & = \delta_{jk} \boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)\dot{\boldsymbol{\boldsymbol{\Sigma}}}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right). \end{align} \tag{ C15 }$

Thus we have shown that $K_{Qjk}^{(\ell+1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}, t) \rightarrow \boldsymbol{\Theta}_Q^{(\ell+1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}) \delta_{jk}$ , which conclude the proof. □

Appendix D: QNTK with the ReLU activation

If we choose the ReLU activation, $\sigma(q) = \max(0, q)$ , we can compute the analytical expression of QNTK for L > 1 recursively. From the formulae proven in [72], the analytic expressions of $\boldsymbol{\Sigma}_Q^{(\ell+1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ and $\dot{\boldsymbol{\Sigma}}^{(\ell)}_Q({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ are

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \frac{1}{2\pi}\boldsymbol{\Sigma}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}\right) \boldsymbol{\Sigma}_Q^{\left(\ell\right)}\left({\mathbf{x}}^{{\prime}}, {\mathbf{x}}^{{\prime}}\right) \left(\sin\theta^{\ell}_{{\mathbf{x}}{\mathbf{x}}^{{\prime}}}+\left(\pi-\theta^{\ell}_{{\mathbf{x}}{\mathbf{x}}^{{\prime}}}\right)\cos\theta^{\ell}_{{\mathbf{x}}{\mathbf{x}}^{{\prime}}} \right) + \xi^2 \end{align} \tag{ D1 }$

$\begin{align} \dot{\boldsymbol{\Sigma}}^{\left(\ell\right)}_Q\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) & = \frac{1}{2\pi}\left(\pi - \theta^{\ell}_{{\mathbf{x}}{\mathbf{x}}^{{\prime}}}\right), \end{align} \tag{ D2 }$

where

$\begin{align} \theta^{\ell}_{{\mathbf{x}}{\mathbf{x}}^{{\prime}}} = \arccos\left(\frac{\boldsymbol{\Sigma}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)}{\sqrt{\boldsymbol{\Sigma}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}\right)\boldsymbol{\Sigma}_Q^{\left(\ell\right)}\left({\mathbf{x}}^{{\prime}}, {\mathbf{x}}^{{\prime}}\right)}}\right) \end{align} \tag{ D3 }$

when the activation is given by $\sigma(q) = \max(0, q)$ . From (D1), $\boldsymbol{\Sigma}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is recursively computable. By substituting (D1) and (D2) into the latter equation in (49), $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is also recursively computable.

Appendix E: Proof of theorem 6

Theorem 6. For a non-constant Lipschitz function σ, QNTK $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite unless there exists $\{c_a\}_{a = 1}^{N_D}$ such that (i) $\sum_a c_a\rho_{{\mathbf{x}}^a}^k = {\mathbf{0}}$ $(\forall k)$ , $\sum_a c_a = 0$ , and $c_a \neq 0\ (\exists a)$ or (ii) ξ = 0, $\sum_a c_a\rho_{{\mathbf{x}}^a}^k = I_{m}/2^m$ $(\forall k)$ and $\sum_a c_a = 1$ .

Proof. In the recurrence relation,

$\begin{align} \boldsymbol{\Theta}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \boldsymbol{\Theta}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) \dot{\boldsymbol{\Sigma}}_Q^{\left(\ell\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)+\boldsymbol{\Sigma}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right), \end{align} \tag{ E1 }$

the product of two positive semi-definite kernel, $\boldsymbol{\Theta}_Q^{(\ell)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}}) \dot{\boldsymbol{\Sigma}}_Q^{(\ell)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)$ is positive semi-definite. Therefore if the rest term of (E1) $\boldsymbol{\Sigma}_Q^{(\ell+1)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right)$ is positive definite, $\boldsymbol{\Theta}_Q^{(\ell+1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is also positive definite. The positive definiteness of $\boldsymbol{\Sigma}_Q^{(\ell+1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ can be shown by checking if

$\begin{align} \sum_{a,b} c_a c_b \boldsymbol{\Sigma}_Q^{\left(\ell+1\right)}\left({\mathbf{x}}^a, {\mathbf{x}}^b\right) = \mathbf{E}_{h \sim \mathcal{N}\left(0, \Sigma_Q^{\left(\ell\right)}\right)}\left[\left(\sum_a c_a\sigma\left(h\left({\mathbf{x}}^a\right)\right) \right)^2\right]+\xi^2\left(\sum_{a}c_a\right)^2 \end{align} \tag{ E2 }$

is non-zero for any ${\mathbf{c}}\neq{\mathbf{0}}\ \left({\mathbf{c}} = \{c_a\}_{a = 1}^{N_D}\right)$ , which holds when $\sum_a c_a \sigma(h({\mathbf{x^a}}))$ is not almost surely zero. If $\boldsymbol{\Sigma}_Q^{(\ell)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive-definite the Gaussian $h({\mathbf{x}})$ is non-degenerate, and therefore $\sum_a c_a \sigma(h({\mathbf{x^a}}))\gt0$ with finite probability since σ is not constant function meaning that $\boldsymbol{\Sigma}_Q^{(\ell+1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite. Thus the positive definiteness of $\boldsymbol{\Sigma}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})\ (L\unicode{x2A7E} 2)$ can be recursively proven if $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite.

Recall that

$\begin{align} \boldsymbol{\Sigma}_Q^{\left(1\right)}\left({\mathbf{x}}, {\mathbf{x}}^{{\prime}}\right) = \frac{{\mathrm{Tr}}\left(\mathcal{O}^2\right)}{2^{2m}-1}\sum_{k = 1}^{n_Q}\left({\mathrm{Tr}}\left(\rho_{\mathbf{x}}^{k}\rho_{{\mathbf{x}}^{{\prime}}}^{k}\right) - \frac{1}{2^m}\right) + \xi^2. \end{align} \tag{ E3 }$

Then $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite if

$\begin{align} \sum_{k = 1}^{n_Q}{\mathrm{Tr}}\left(\sum_{a}c_a\rho_{\mathbf{x^a}}^{k}\right)^2 + \left(\xi^2-\frac{n_Q}{2^m}\right)\left(\sum_a c_a\right)^2 > 0 \end{align} \tag{ E4 }$

for all ${\mathbf{c}}\neq {\mathbf{0}}$ .

For $\sum_a c_a = 0$ . The left hand side of (E4) becomes $\sum_k{\mathrm{Tr}}\left(\sum_{a}c_a\rho_{\mathbf{x^a}}^{k}\right)^2$ ; it becomes zero if and only if $\sum_a c_a \rho^k_{{\mathbf{x}}^a} = 0$ for all k because $c_a\rho^k_{{\mathbf{x}}^a}$ is Hermitian operators, which corresponds to the condition (i) in the theorem.

For $\sum_a c_a = \beta \neq 0$ , the left hand side is proportional to β², thus we can obtain the general condition that (E4) is satisfied even if we set β = 1. Let us define $\rho^k\equiv\sum_a c_a \rho_{\mathbf{x^a}}^{k}$ . Then ρ_k is Hermitian with ${\mathrm{Tr}}(\rho^k) = 1$ . Therefore, given the eigenvalues of ρ_k as $\{\gamma_i^k\}_{i = 1}^{2^m}$ ,

$\begin{align} {\mathrm{Tr}}\left(\rho^k\right)^2 = \sum_{i = 1}^{2^m} \left(\gamma_i^k\right)^2 \unicode{x2A7E} 2^m\times\sqrt[2^m]{\prod_{i = 1}^{2^m} \left(\gamma_i^k\right)^2}, \end{align} \tag{ E5 }$

where equality is attained when $\gamma_i^k = 1/2^{m}$ , meaning that ${\mathrm{Tr}}\left(\rho^k\right)^2\unicode{x2A7E} 1/2^m$ and the equality is satisfied when $\rho^k = I_{m}/2^m$ . Thus by using the equality condition, we see that

$\begin{align} \sum_{k = 1}^{n_Q}{\mathrm{Tr}}\left(\sum_{a}c_a\rho_{\mathbf{x^a}}^{k}\right)^2 + \left(\xi^2-\frac{n_Q}{2^m}\right)\left(\sum_a c_a\right)^2 = \xi^2, \end{align} \tag{ E6 }$

if and only if $\sum_a c_a \rho_{\mathbf{x^a}}^{k} = I_{m}/2^m$ . Therefore (E4) is satisfied unless $\xi^2 = 0$ and there exists c that satisfies $\sum_a c_a = 1$ , and $\sum_a c_a \rho_{\mathbf{x^a}}^{k} = I_{m}/2^m$ , which corresponds to the condition (ii). Since $\boldsymbol{\Sigma}_Q^{(1)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ is positive definite unless condition (i) or condition (ii) is satisfied, so is $\boldsymbol{\Theta}_Q^{(L)}({\mathbf{x}}, {\mathbf{x}}^{{\prime}})$ as we show above, which concludes the proof of the theorem. □

**Figure 11.** The accuracy versus the number of qubits, for (a) the training dataset and (b) the test dataset, for the multiclass classification task.
Download figure:
Standard image High-resolution image

**Figure 12.** The random unitary quantum circuit. $U_3(\alpha, \beta, \gamma)$ depicts the generic single-qubit rotation gate with 3 Euler angles, where $\alpha, \beta, \gamma$ are randomly chosen from the uniform distribution $U(0, 2\pi)$ .
Download figure:
Standard image High-resolution image

**Figure 12.** The random unitary quantum circuit. $U_3(\alpha, \beta, \gamma)$ depicts the generic single-qubit rotation gate with 3 Euler angles, where $\alpha, \beta, \gamma$ are randomly chosen from the uniform distribution $U(0, 2\pi)$ .
Download figure:
Standard image High-resolution image

Appendix F: Multiclass classification task in section 4.3

In this section, we demonstrate a multiclass classification task of quantum data. The basic problem setting is the same as that presented in section 4.3 and figure 1, and the output parts are modified for multiclass classification task. For qcNN and the pure classical neural network model, the output layer is a 4-node full-connected layer, and the activation function of the output layer is the softmax function $f_i(\mathbf{x}) = \frac{e^{x_i}}{\sum _{k = 1}^4 e^{x_k}}$ , where x corresponds to the 4-dimensional vector of the output layer. The predicted label i is given by the index that achieves the highest value of f_i . For qNN model, the ansatz is the same as the original one, and we assigned the predicted label based on the expectation values of the first two qubits, i.e. '0' for the value '00', '1' for '01', '2' for '10', and '3' for '11'. The obtained four expectation values are input into the softmax function, and the label with the highest value of function is the predicted label. We chose the cross entropy as a loss function, and we use Adam for optimizing the model parameters to minimize the loss function. The target data is prepared as follows. The quantum data generating process is the same as that presented in section 4.3, and, in this case, we assigned labels $y^a = 0,1,2,3$ to 750 samples each in ascending order of the value $g({\mathbf{x}}^a)$ since the total number of samples is $N_D = 3000$ , where $g({\mathbf{x}}) = {\mathrm{Tr}}[\rho({\mathbf{x}})O]$ . The test data is 100 randomly generated samples.

The experimental results are displayed in figure 11. We observed a similar performance trend as that for the two class classification task displayed in figure 9, where qcNN shows the best performance for all n presumably due to the same reason discussed in section 4.3.2.

Appendix G: Detail of the numerical experiment in section 4.3

In the numerical experiment, the quasi random unitary circuit shown in figure 12 is used for $U_{\mathrm{random}}$ in section 4.3.

Quantum-classical hybrid neural networks in the neural tangent kernel regime

Article metrics

Submit

Share this article

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

1.1. Background—quantum/classical neural networks and classical neural tangent kernel (NTK)

1.2. Our contribution

1.3. Related works

1.4. Structure of the paper

2. Preliminary: classical NTK theory

2.1. Problem settings of NTK theory

2.2. Definition of NTK

2.3. Theorems

2.4. Consequence of theorems 1 and 2

2.5. When may cNN fail?

3. Proposed model

3.1. qcNN model

3.2. Quantum neural tangent kernel

3.3. Theorems

3.4. Possible advantage of the proposed model

3.4.1. Possible advantage over pure classical models

3.4.2. Note on the quantum kernel method

3.4.3. Specific setting where our model outperforms pure quantum or classical models

4. Numerical experiment

4.1. Finite-width qcNN vs infinite-width qcNN

4.1.1. Experimental settings

4.1.1.1. Choices of the quantum circuit

4.1.1.2. Training method for the classical neural network

4.1.2. Results

4.1.2.1. Result of the regression task

4.1.2.2. Result of a classification task

4.2. Effect of the locality on the machine learning performance

4.3. Advantage of qcNN over full-classical and full-quantum models

4.3.1. Machine learning task and models

4.3.2. Results

4.3.2.1. Result of the regression task

4.3.2.2. Result of the classification task

4.3.2.3. Effect of shot noise

4.3.2.4. Details of the experimental setting in section 4.3

5. Conclusion

Acknowledgments

Data availability statement

Appendix A: Proof of theorem 3

Appendix B: Proof of theorem 4

Appendix C: Proof of theorem 5

Appendix D: QNTK with the ReLU activation

Appendix E: Proof of theorem 6

Appendix F: Multiclass classification task in section 4.3

Appendix G: Detail of the numerical experiment in section 4.3