Quantum kernel evaluation via Hong–Ou–Mandel interference

C Bowie; S Shrapnel; M J Kewming

doi:10.1088/2058-9565/acfba9

1. Introduction

In recent years, there has been growing interest in the applications of machine learning to the physical sciences [1]. One particular area of that has received considerable attention is quantum machine learning [1–20]. In quantum machine learning, evaluating a quantum kernel is analogous to computing a classical kernel in classical machine learning. Both methods aim to measure the similarity or distance between data points in high-dimensional feature spaces without explicitly transforming the data. In classical machine learning, kernel methods use kernel functions to calculate inner products between data points in the transformed space. In quantum machine learning, classical data is encoded into quantum states, and the quantum kernel computes inner products between these quantum states to quantify their similarity or distance in the quantum feature space. Encoding classical data into a quantum system is equivalent to embedding the data into a quantum feature space [6, 9, 12–14, 21, 22], and measurement of the quantum system is equivalent to evaluating the kernel. This connection between classical and quantum kernels leverages the computational advantages of quantum computing for efficient inner product calculations, providing a powerful tool for quantum machine learning algorithms and applications in data analysis and pattern recognition.

Despite much success in the theoretical investigation of quantum kernels, there have been only a few experimental demonstrations. A seminal approach using nuclear spins [23] encodes specific feature vectors into two unitary operators, which are consecutively applied to an initial state before the system is measured along its magnetic moment, providing kernel evaluation. Several applications using quantum optics have also been demonstrated. Two studies have encoded features into the dual-rail encoding of multiple photons, first demonstrated in [24] and subsequently in [25]. The entanglement between dual-rail encoded photons can be used to exploit the quantum advantage inherent in quantum kernels, namely the exponential speed-up in their evaluation [26]. A final example involves using the spectral modes of ultrafast radiofrequency pulses to classify and train labeled datasets [27]. The advantage of this methodology is that it makes use of the photon's spectral modes, decomposing these into an orthogonal eigenbasis for encoding higher-dimensional feature vectors. The latter results discussed above highlight the utility of photonics in quantum information. In fact, optically encoded quantum information presents itself in many machine learning examples [28–33]. The popularity of photonic quantum information stems from the photon's versatility given its numerous and highly controllable degrees of freedom [34]. Despite the numerous encoding proposals that hinge on these increased degrees of freedom, the currently proposed physical platforms are limited to qubit-based models. To this end, we propose a new method of evaluating quantum kernels using Hong–Ou–Mandel (HOM) interference. As we will show, this method not only makes use of the entanglement necessary for the enhancement in quantum kernels [24, 26], but also utilises the higher-dimensional spectrum of single photons, which were exploited in [27].

2. Kernel method machine learning

We will begin by providing a brief pedagogical overview of kernel methods—for completeness see [35]. To summarise kernel methods succinctly, one starts by encoding data into some higher-dimensional feature space where classification of the data can be easier to analyse. However what makes this algorithm so useful is that is does not need to explicitly perform evaluations of the data in the feature space, but rather can be carried out using the kernel function that is defined on the domain of the original input data; this is commonly known as the Kernel trick.

To understand this more precisely, let us describe a simple machine learning classification task. Suppose we have a data set Y—say of N images—as a list of input vectors and labels, $Y \equiv \{(\vec{x}_{1}, y_{1}), (\vec{x}_{2}, y_{2}), \ldots, (\vec{x}_{N},y_{N})\}$ . Here y_i is the label—for example, cats and dogs—and $\vec{x}_{i}$ is the input data $\mathcal{X}$ —for example the pixel colour values, or some other set of features. The goal of our classification task is ultimately to classify an unknown data set Yʹ, that is a set of data with unknown labels. One approach is to determine this complex relationship between inputs $\vec{x}_{i}$ and labels y_i, such that if it is presented with an unknown data set Yʹ, the algorithm will correctly classify this data. This is typically the approach of deep neural networks which employ a vast number of tunable parameters and non-linear functions to effectively replicate this relationship.

Kernel methods on the other hand embed the data $\mathcal{X}$ in a higher-dimensional feature space $\mathcal{F}$ via the feature map $\phi{:\mathcal{X}\rightarrow \mathcal{F}}$ . In the feature space, the data is then classified mathematically by a suitable choice of distance measure along a decision boundary—for example, the inner product which we will consider going forward. This concept is demonstrated visually in figure 1. In this sense, a kernel k maps two inputs $\vec{x}$ and $\vec{x}^{\prime}$ —both of which are from the input space $\mathcal{X}$ —to a distance measure $\mathbb{C}$ such that $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{C}$ [35]. Moreover, the feature map φ is related to the kernel mapping via the inner product of different feature vectors

$\begin{equation} k\left(\vec{x},\vec{x}^{\prime}\right) = \langle \phi\left(\vec{x}\right), \phi\left(\vec{x}^{\prime}\right) \rangle, \end{equation} \tag{ 1 }$

which has an associated Gram matrix K,

$\begin{equation} K_{m,n} = k\left(\vec{x}_{m},\vec{x}_{n}\right) \forall \vec{x} \in \mathcal{X}, \end{equation} \tag{ 2 }$

that is positive semi-definite. If the Gram matrix satisfies the condition

$\begin{equation} \sum_{m,n}^M c_m c_n*k\left(\vec{x}_m,\vec{x}_n\right) \unicode{x2A7E} 0, \end{equation} \tag{ 3 }$

for any $c_1 \ldots c_M \in \mathbb{C}$ , an associated feature mapping is guaranteed to exist and the function can be considered a kernel [35]. With all this specified one can then classify a data point $\vec{x}$ in the feature space $\mathcal{F}$ according to some decision boundary b that separates the classes as depicted in figure 1

$\begin{equation} f\left(x\right) = \langle b, \phi\left(\vec{x}\right)\rangle, \end{equation} \tag{ 4 }$

where f(x) is positive or negative indicating the binary classification.

**Figure 1.** A visual representation of kernel machine learning, between two classes; blue circles and orange squares. (Left) A depiction of the initial feature space of unprocessed data contained in the two dimension features space with features x₁ and x₂. (Right) After transforming to the feature space $\mathcal{F}$ via φ, the data can clearly be linearly separated into two classes (for illustrative purposes, we have limited this to a two dimensional transform, but this can be relaxed in general). In this depiction, the average of each class can be computed (coloured stars) and used as a discrepancy measured via MMD.
Download figure:
Standard image High-resolution image

**Figure 1.** A visual representation of kernel machine learning, between two classes; blue circles and orange squares. (Left) A depiction of the initial feature space of unprocessed data contained in the two dimension features space with features x₁ and x₂. (Right) After transforming to the feature space $\mathcal{F}$ via φ, the data can clearly be linearly separated into two classes (for illustrative purposes, we have limited this to a two dimensional transform, but this can be relaxed in general). In this depiction, the average of each class can be computed (coloured stars) and used as a discrepancy measured via MMD.
Download figure:
Standard image High-resolution image

Given that a kernel can be evaluated as the inner product between two feature vectors, there is a natural extension to quantum feature spaces [13]. Suppose that we again have some input data $\vec{x}$ , which is then encoded into a quantum state $\vert \Phi(\vec{x}) \rangle$ . The quantum kernel will correspond to the overlap of this state with another

$\begin{equation} k\left(\vec{x}, \vec{x}^{\prime}\right) = \langle \Phi\left(\vec{x}\right) \vert \Phi\left(\vec{x}^{\prime}\right) \rangle. \end{equation} \tag{ 5 }$

Furthermore, given that all the kernel outcomes can be mapped to probabilities, any inner product calculated using quantum states way will automatically satisfy the Gram matrix conditions outlined in equation (3). This has provided ample motivation for the development of kernel based quantum machine learning.

Evaluating the kernel requires that one can directly measure the overlap between two quantum states, which either requires quantum state tomography or other methods using quantum computers where one can directly parameterise the feature map φ.

3. Temporal encoding and HOM interference

Here we will outline a proposal for quantum kernel evaluation using the temporal encoding of single photon Fock-states and HOM interference. Creating higher dimensional quantum optical states can be achieved in many ways as detailed in the recent reviews on this topic [34, 36–39]. Experimentally however, not all methods of encoding information in optics are made equal. For example, a natural orthonormal basis would be multi-photon states, however these can be difficult to measure and require photon number resolving detectors [40–43].

A more common methodology which circumvents this approach is to encode into multiple photons in different paths, while simultaneously exploiting the two polarisation modes, thus creating numerous dual rail encoded qubits, which to date have been used to evaluate quantum kernels [24]. However, the physical scaling and technological challenges associated with numerous dual rail qubits can prevent higher dimensional feature spaces from being reached in practice.

We propose to sidestep these complications through the use of temporally/frequency encoded photons, a method which has received considerably less attention in the quantum kernel community [44, 45]. A continuous-mode single-photon state can be described as the coherent superposition of many spectral modes ω

$\begin{equation} \vert 1_{\Psi} \rangle = \int \mathrm{d}\omega \Psi\left(\omega\right) \hat{a}^{\dagger}\left(\omega\right)\vert 0 \rangle, \end{equation} \tag{ 6 }$

where $\Psi(\omega)$ is the spectral density function that weights each mode, and $\hat{a}^{\dagger}(\omega)$ are the creation operators associated with each ω. Furthermore, we will make the quantum optics approximation whereby we assume that the spectral spread is much smaller than the carrier $\omega\ll \omega_\mathrm{c}$ [46]. In this limit, the Fourier transform of the slowly varying spectral envelope corresponds to the temporal wave-packet, thus yielding the description of the single photon state defined in the time domain

$\begin{equation} \vert 1_{\Psi} \rangle = \int \mathrm{d}t \Psi\left(t\right) \hat{a}^{\dagger}\left(t\right)\vert 0 \rangle. \end{equation} \tag{ 7 }$

Here the temporal wave-packet satisfies the normalisation condition $\int \mathrm{d}t \vert \Psi(t)\vert^{2} { = } 1$ and the bosonic field operators satisfy the commutation relation

$\begin{equation} \left[\hat{a}\left(t\right), \hat{a}^{\dagger}\left(t^{\prime}\right)\right] = \delta\left(t-t^{\prime}\right). \end{equation} \tag{ 8 }$

Feature vectors can be encoded into the temporal modes of single photons provided a set of orthogonal temporal modes are chosen

$\begin{equation} \Psi\left(\vec{x}, t\right) = \sum_{n = 1}^N \alpha_n\left(\vec{x}\right) u_n\left(t\right) \end{equation} \tag{ 9 }$

where $\alpha_n(\vec{x})$ denotes a unit weight—encoding the information from the feature vector—of each mode and $\{u_n(t)\}$ a set of orthonormal temporal mode functions [45]. In our work, we will take the set $\{u_n(t)\}$ as the set of orthogonal Hermite Gaussian (HG) modes, noting that this choice is arbitrary could be replaced by any numerically orthogonal set of single-variable functions,

$\begin{equation} u_{n}\left(t\right) = \frac{\mathrm{e}^{-\frac{t^{\,{2}}}{2}}}{\pi^{1/4}\sqrt{2^{n}n!}}H_{n}\left(t\right) \,, \end{equation} \tag{ 10 }$

where

$\begin{equation} H_{n}\left(t\right) = \left(-1\right)^{n}\mathrm{e}^{-t^{2}}\frac{\mathrm{d}^{n}}{\mathrm{d}t^{n}}\mathrm{e}^{-t^{2}/2}\,, \end{equation} \tag{ 11 }$

is the $n{\mathrm{th}}$ Hermite polynomial and satisfies the orthogonality relation

$\begin{equation} \int \mathrm{d} t u_{n}^{*}\left(t\right)u_{m}\left(t\right) = \delta_{nm}. \end{equation} \tag{ 12 }$

Furthermore, we will define the unit weight vector α_n simply as

$\begin{equation} \alpha_n\left(\vec{x}\right) = w_{n} \phi_{n}\left(\vec{x}\right), \end{equation} \tag{ 13 }$

where $\phi_{n}(\vec{x})\in \mathbb{C}$ is the nth element of the feature vector of the input data $\vec{x}$ , and w_n is a free weight—commonly added in kernel methods to allow optimisation of the resulting kernel. Moreover the coefficients are normalised such that $\sum_{n}\vert \alpha_{n} \vert^{2} = 1$ . We are thus in a position to now define the proposed single photon encoding for an input vector $\vec{x}$ as

$\begin{equation} \vert \Phi\left(\vec{x}, t\right) \rangle = \sum_{n = 1} \int \mathrm{d}t \alpha_{n}\left(\vec{x}\right) u_{n}\left(t\right) \hat{a}^{\dagger}\left(t\right) \vert 0 \rangle. \end{equation} \tag{ 14 }$

In our proposed encoding scheme, the single photon corresponds to the information carrier, the HG temporal modes correspond to the orthogonal basis of the feature space, and the coefficients encode the data into this basis.

Now suppose we have two encoded feature vectors $\vert \Phi(\vec{x},t) \rangle$ and $\vert \Phi(\vec{x}^{\prime},t) \rangle$ and we would like to measure the quantum kernel equation (5). We can do this by interfering the two photons with each other on a 50:50 beam-splitter (BS). This interference is intrinsically quantum mechanical and leads to 'bunching' where both photons exit the same port and are detected together. This is known as HOM interference [47]. The probability of detecting the two photons simultaneously at either detector $P(2,0)$ or $P(0,2)$ is equal to the overlap of the two wave functions

$\begin{equation} P\left(2,0\right) + P\left(0,2\right) = \vert \langle \Phi\left(\vec{x}^{\prime},t^{\prime}\right) \vert \Phi\left(\vec{x},t\right) \rangle\vert^{2}\,, \end{equation} \tag{ 15 }$

which can be readily evaluated using the commutation relation equation (8) and the orthogonality of the HG modes equation (12) as

$\begin{equation} \left|\langle \Phi\left(\vec{x}^{\prime},t^{\prime}\right) \vert \Phi\left(\vec{x},t\right) \rangle\right|^{2} = \left|\sum_{n}\alpha_{n}^{*}\left(\vec{x}^{\prime}\right)\alpha_{n}\left(\vec{x}\right)\right|^{2}, \end{equation} \tag{ 16 }$

which provides a direct evaluation of the quantum kernel.

This quantum kernel can be observed directly by measuring the HOM dip, whereby two temporally synchronised detectors measure correlated photon detection events, otherwise known as coincidence counts (CCs) [47, 48]. This is visualised in figure 2. The normalised CC (divided by the total number of counts) is equal to the probability of measuring a photon event at both detectors simultaneously and is therefore equivalent to

$\begin{equation} \mathrm{CC}\left(\vec{x}^{\prime},\vec{x}\right) = 1-\left|\langle \Phi\left(\vec{x}^{\prime},t^{\prime}\right) \vert \Phi\left(\vec{x},t\right) \rangle\right|^{2}. \end{equation} \tag{ 17 }$

This provides a clear experimental evaluation of the quantum kernel where the HOM interference will classify the similarity between feature vectors $\vec{x}$ and $\vec{x}^{\prime}$ .

**Figure 2.** (Top) A depiction of HOM interference where two identical photons incident on a beam splitter simultaneously will interfere, causing them to bunch into pairs. This effect can be measured using the coincidence counts ( $\mathrm{CCs}$ ) of the two respective detectors. (Below) Shows the HOM dip as a function of the relative time-delay dt of the photons. As the time-delay between the two vanish, then the probability of measuring a $\mathrm{CC}$ also vanishes.
Download figure:
Standard image High-resolution image

**Figure 2.** (Top) A depiction of HOM interference where two identical photons incident on a beam splitter simultaneously will interfere, causing them to bunch into pairs. This effect can be measured using the coincidence counts ( $\mathrm{CCs}$ ) of the two respective detectors. (Below) Shows the HOM dip as a function of the relative time-delay dt of the photons. As the time-delay between the two vanish, then the probability of measuring a $\mathrm{CC}$ also vanishes.
Download figure:
Standard image High-resolution image

Finally, there is an immediate parallel between our physical proposal and a theoretical proposal subject to a quantum advantage, discovered in [26] and tested in [24]. Firstly, we are both working towards solving the same task, namely supervised cluster assignment. In the interest of brevity, we will use our abbreviated notation in equation (6), where $\vert \Phi(\vec{x}^{\prime}, t) \rangle{\rightarrow }\vert 1_{\Phi^{\prime}} \rangle$ describes the encoded features in orthogonal temporal modes. We therefore initially begin with two photons, encoded into different paths corresponding to the initial state:

$\begin{equation} \vert \psi \rangle = \vert 1_{\Phi^{\prime}} \rangle\vert 1_{\Phi} \rangle. \end{equation} \tag{ 18 }$

The two photons are then interfered on a single BS as described above, leading to the transformation:

$\begin{equation} \vert \psi \rangle \rightarrow \frac{1}{2}\left(\vert 2_{\vec{u},\vec{v}} \rangle\vert 0 \rangle - \vert 0 \rangle\vert 2_{\vec{u},\vec{v}} \rangle + \vert 1_{\vec{u}} \rangle\vert 1_{\vec{v}} \rangle - \vert 1_{\vec{v}} \rangle\vert 1_{\vec{u}} \rangle\right), \end{equation} \tag{ 19 }$

where the notation indicates that the two photons are in the same path mode $\vert 2_{\Phi^{\prime},\Phi} \rangle$ . As we have shown above, the probability of measuring two qubits (in either port) is given by equation (15), and measuring one qubit in both ports (a CC) is given by equation (17). This proposed scheme makes use of the same quantum resources required in [26], whereby a larger distance between $\Phi(\vec{x},t)$ and $\Phi(\vec{x}^{\prime},t)$ is represented in the entanglement of the state given by equation (19). Rather than relying on an empty ancilla as an indicator function, we make use of both photons simultaneously as information carriers. By utilising entanglement as a resource in the same manner as [26], we conclude that our proposed physical platform also offers the same available quantum advantage over their classical counterparts.

4. Application example: numerical experiment

We have so far described how one can use the CC of temporally encoded photons to evaluate a quantum kernel. Now we will demonstrate an example application for a binary classification task using generated data. The model for machine learning and classification in this case is a maximum mean discrepancy (MMD) model via kernel mean embedding (KME), with training of the model performed via stochastic gradient descent (SGD). The applications of this physical platform are not limited to any aspect of this model, it can be used for any machine learning algorithm that requires kernel evaluation.

4.1. Kernel implementation model: MMD

This subsection outlines the use of MMD for kernel machine learning and specifically the implementation for our proposed physical platform. Suppose we have two classifications, both of which are characterised by the probability distributions over the input data $P(\vec{x})$ and $Q(\vec{x})$ respectively. Then given a kernel function ${k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{C}}$ with associated mapping ${\phi:\mathcal{X}\rightarrow \mathcal{F}}$ , we can define group mean µ_P over the classification group P(X) as

$\begin{align} \mu_{P} & = \int_{\mathcal{X}_{P}} k\left(\cdot, \vec{x}\right) \mathrm{d} P\left(\vec{x}\right) = \int_{\mathcal{X}_{P}} \phi\left(\vec{x}\right)\mathrm{d} P\left(\vec{x}\right), \end{align} \tag{ 20 }$

where $\mathcal{X}_{P}$ indicates that we are only averaging over feature vectors of the P class. Our encoding scheme described above permits the immediate definition of the quantum feature mean [14]

$\begin{equation} \vert \mu_{P} \rangle = \frac{1}{N_{P}}\int_{\mathcal{X}_{P}} \vert \Phi\left(\vec{x}, t\right) \rangle \mathrm{d} P\left(\vec{x}\right), \end{equation} \tag{ 21 }$

where N_P is a normalisation constant. Therefore, we can create a quantum feature mean by simply averaging over all the state vectors corresponding to a quantum feature map.

We now define the MMD which corresponds to the absolute difference measure between any two distributions P and Q, and is defined as the absolute difference between KMEs,

$\begin{align} \mathrm{MMD} \left(P, Q\right) & = || \mu_{P} - \mu_{Q} ||^2, \end{align} \tag{ 22 }$

$\begin{align} & = \langle \mu_{P}, \mu_{P} \rangle + \langle \mu_{ Q}, \mu_{Q} \rangle - 2 \langle \mu_{P}, \mu_{Q} \rangle. \end{align} \tag{ 23 }$

What is more, if we now combine our quantum feature map using our single photon encoding equation (14) as well as our quantum feature mean equation (21), then we can evaluate the quantum MMD using our HOM interferometer where one photon encodes the mean µ_P and the other µ_Q, which yields

$\begin{equation} \mathrm{MMD} \left(P, Q\right) = 2 \mathrm{CC}\left(P,Q\right), \end{equation} \tag{ 24 }$

where $\mathrm{CC}(P,Q) = 1 - \vert \langle \mu_{P} \vert \mu_{Q} \rangle\vert^{2}$ given by equation (17). We can therefore measure the MMD using a single evaluation of a HOM interferometer and the kernel function associated with feature mapping φ. Moreover, using the free-weights w_n in equation (13), one can optimise the MMD, yielding an optimal class separation in feature space $\mathcal{F}$ . Crucially, the MMD model provides a cost function to be optimised in a given feature space allowing for an implicit separating hyperplane between classification groups. Once optimised, the mapping φ could be used to classify an unseen data point $\vec{x}^{\prime}$ , using a quantum HOM classifier by

$\begin{align} \mathrm{Class}\left(\vec{x}\right) & = \frac{1}{2}\left(\mathrm{MMD}\left(\vec{x}, Q\right) - \mathrm{MMD}\left(\vec{x},P\right)\right), \end{align} \tag{ 25 }$

$\begin{align} & = \left|\langle \Phi\left(\vec{x},t\right) \vert \mu_{Q} \rangle\right|^{2} -\left|\langle \Phi\left(\vec{x},t\right) \vert \mu_{P} \rangle\right|^{2}, \end{align} \tag{ 26 }$

where a positive evaluation of this class function will place $\vec{x}$ in class Q, and negative will place it in class P. If the feature map has been optimised such that the means are almost orthogonal ( $\langle \mu_{P} \vert \mu_{Q} \rangle \sim 0$ ), then classification can be accurately approximated with a single feature mean: if, for example, $\vec{x}$ belongs to P, then the overlap with µ_P will be high, and low with µ_Q, thus we would expect to see a HOM dip in the former, and not in the latter. To minimise the use of resources, we note that one does not need to compare $\vec{x}$ to both µ_P and µ_Q, but rather just to one—say µ_Q—provided that the dip has been calibrated to $\mathrm{MMD}(P,Q)$ . In this setting, one could set the decision boundary to be equal to

$\begin{equation} \mathrm{Class}\left(\vec{x}\right) \rightarrow \begin{cases} Q & \mathrm{if}~\left|\langle \Phi\left(\vec{x},t\right) \vert \mu_{Q} \rangle\right|^{2} \unicode{x2A7E} \frac{1}{2} - \frac{1}{2}\left|\langle \mu_{P} \vert \mu_{Q} \rangle\right|^{2}\\ P &\mathrm{if}~\left|\langle \Phi\left(\vec{x},t\right) \vert \mu_{Q} \rangle\right|^{2} < \frac{1}{2} - \frac{1}{2}\left|\langle \mu_{P} \vert \mu_{Q} \rangle\right|^{2} \end{cases}. \end{equation} \tag{ 27 }$

In a high efficiency experiment (low loss and noise, high quantum efficiency) with highly orthogonal encoding ( $\langle \mu_{P} \vert \mu_{Q} \rangle \sim 0$ ), this classification of an unknown $\vec{x}$ could be measured with very few single photon measurements as the expected CC would be highly correlated or anti-correlated with the single comparison point µ_Q. This ensures that very few photons are required to perform this classification, thus minimising the experimental overhead.

4.2. Optimization training: SGD

The MMD model for kernel implementation defines the criteria used to determine the ability of the model and the process of classification, but does not specify the optimisation algorithm used to train the model. For this example, we choose to implement an SGD optimisation algorithm, however one is free to choose whatever algorithm they see fit for this purpose.

The parameter to be optimised are the free weights, given by the vector $\vec{w}$ , which were introduced in equation (13). We use SGD with MMD employed as the cost function to be optimised. This ensures that the weights are optimised with respect to the means, and most likely to be maximally orthogonal in the feature Hilbert space. Moreover, we use SGD rather than normal gradient descent to ensure the optimisation does halt in any local minima. As such the weights are updated at each iteration i according to the difference equation

$\begin{equation} \vec{w}_{i+1} = \vec{w}_{i}+ L_{i} \nabla_{\vec{w}} \mathrm{MMD}\left(P,Q\right)\vert_{\vec{w} = \vec{w}_{i}} + \vec{\epsilon}_{i} , \end{equation} \tag{ 28 }$

where the subscript i indicates the iteration—not to be confused with the element n in equation (13)—and $\nabla_{\vec{w}} \mathrm{MMD}(P,Q)$ is the numerical derivative of the cost function evaluated at $\vec{w}_{i}$ —MMD evaluation as given by equation (22)—with respect to the weights. L_i is then the learning rate at the ith iteration, and $\vec{\epsilon}_{i}$ is a stochastic random variable we add at each time step.

4.3. Application and results

To simulate the expected results of this model, a numerical experiment is performed. We initially generate a two dimensional (two features F₁ and F₂) training and test data set using scikit-learns' 'make blobs' function with parameters such that it would be separable in a polynomial feature space, but not be linearly separable [49]. The use of two feature is for visual demonstration purposes, and the model is applicable to higher dimensional data. Here the data corresponds to two classes blue P and red Q. The training data set is depicted in figure 3. From the training set, the group mean from each classification is determined by initially mapping each data vector to the feature space given by the polynomial kernel of degree two

$\begin{equation} \phi:\left(F_{1}, F_{2}\right) \rightarrow \left(F_{1}^{2}, F_{2}^{2}, F_{1}F_{2}\right), \end{equation} \tag{ 29 }$

and then taking average values within each classification group. After this transformation, we can also add trainable weights w_n as we introduced in equation (13), and then normalise them as suitable coefficients for the quantum encoding equation (14). The weights are then optimised using the SGD algorithm outlined in equation (28). For the interested reader, a more detailed explanation of the training process and associated hyper-parameters can be found in appendix.

**Figure 3.** Scatter plot of the example data showing two classes (red) and (blue) which are clearly distinguishable, but have overlapping means and no hyperplane that separates them in two dimensions. Here F₁ and F₂ are two features and the mean data point for each classification group has noted by the symbol $(\star)$ . Mapping the means to feature space using the transformation in equation (29) is insufficient to induce the needed hyperplane, separation only occurs in the feature space after optimisation of the free weight parameters.
Download figure:
Standard image High-resolution image

The optimisation process maximises the MMD value of the two feature mean encoded photons µ_P and µ_Q, minimising the probability of a HOM interference measurement registering a CC. This process is demonstrated in figure 4(a) where we plot the HOM dip calculated before and after the weights have been optimised between the two means µ_P and µ_Q. At the centre of the dip, where the time-delay between the photons is zero dt = 0, the means are initially very similar, resulting in no discernible HOM dip— $\langle \mu_{P} \vert \mu_{Q} \rangle \sim 1$ —but after training are maximally orthogonal thus generate a large dip— $\langle \mu_{P} \vert \mu_{Q} \rangle \sim 0$ . Moreover, when the training is complete, we can use our encoded photons to classify unseen data accurately via equation (26) which is depicted in the Violin plots in figure 4(b) and also shown in the before and after training confusion matrices figures 4(c) and (d). The confusion matrix is a machine learning tool that demonstrates the percentage of correct/incorrect classifications. The vertical axis denotes the calculated classification as given by the machine HOM classifier, and the horizontal axis denotes the true classification. We clearly see that after training we are able to classify the data precisely using our HOM quantum classifier.

**Figure 4.** (a) A depiction of the HOM dip of the two means created by evaluating the two means µ_P and µ_Q before (solid black line) showing a dip indicating significant overlap, and after training (black dashed line) showing no dip indicating a large separation of two means in the feature Hilbert space. (b) Depicts two violin plots showing the classification of unseen data before and after using the quantum HOM classifier in equation (26). The widths of the violin plots are arbitrary but are proportional to the probability density of the classified distributions. After training the number of misclassifications by the HOM classifier is low. (c) and (d) Show the confusion matrix before and after training, where the diagonal (off diagonal) elements correspond to correct (incorrect) classifications.
Download figure:
Standard image High-resolution image

5. Discussion

In this article we proposed a practical experimental methodology for evaluating quantum kernels using HOM interference of temporal encoded photons. We showed that the kernel is directly related to the CCs measured via two photodetectors. We further showed that by using a quantum KME, one could also perform MMD for classification, which we demonstrated using a very simple example. Given that we have optimised the MMD between the two classes, then it is quite reasonable to assume that the number of experimental runs required to classify the data is minimised—since quantum classification via quantum kernel methods necessarily requires estimating the probability given by equation (5). Finally, this concept offers another experimentally feasible methodology for evaluating quantum kernels in the growing application of photonic quantum machine learning.

Acknowledgments

We would like to acknowledge the useful conversations with Gerard Milburn and Sahar Basiri-Esfahani. This work was funded by the Australian Research Council Centre of Excellence for Engineered Quantum Systems (Project No. CE170100009). M J K acknowledges the financial support from a Marie Skłodwoska-Curie Fellowship (Grant No. 101065974). All code used to simulate all presented results is available in a devoted github repository [50].

Data availability statement

The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/cassiejayne/Optical-Kernel-Simulation.

Appendix: Details of numerical example

Training was performed on 1000 data points with two features created using scikit's 'make blobs' function. Four blobs were created with different centering, and were grouped into two as demonstrated in figure 3.This process was repeated to create test data, using the same centres to represent the same distribution, and the same number of data points. All simulations were run on a conventional desktop computer.

Using the maximum mean discrepancy (MMD) method, most of the analysis was conducted on the mean data value of each data set in the transformed feature space. Each data point was transformed according to the relevant mapping φ, and then the data points for each classification group were grouped and averaged. At this stage, the averages were normalised for applicability to quantum encoding.

The cost function to be optimised is the MMD as defined in equation (22). In realistic application, the MMD would be evaluated using equation (24). For the purposes of simulation, the MMD is evaluated mathematically as in equation (22). All notes in this section apply to the model explored in section 5.

The training is performed using a stochastic gradient descent (SGD) algorithm. This is an iterative process that updates the free weights using the gradient of the MMD function with respect to the last iterative weight change. This process is described in equation (28). The training process is often subject to a number of hyper parameters. In our case, the hyper parameters employed are dependant on the current cost function, as the cost function is to be maximised, and bound by a maximum value of two. The training runs through 1000 iterations and then as the learning rate L and the noise ε vary depending on the measure of the cost function. These are scheduled according to the the following table:

$\mathrm{Cost} \lt 1.8$	L = 0.1, $\epsilon = \mathcal{N}(0, 0.5)$
$1.8 \lt \mathrm{Cost} \lt 1.9$	L = 0.01, $\epsilon = \mathcal{N}(0, 0.05)$
$1.9 \lt \mathrm{Cost}$	L = 0.001, $\epsilon = \mathcal{N}(0, 0.05)$

where $\mathcal{N}(0, \sigma)$ corresponds to a normal distribution with mean 0, and standard deviation σ. The average cost function with regard to epoch over 1000 training iterations is presented in figure 5.

**Figure 5.** A plot of the cost function (MMD value) with regard to epoch averaged over 1000 training trials. The initial cost function for each trial is calculated for a random vector of free weights.
Download figure:
Standard image High-resolution image

Quantum kernel evaluation via Hong–Ou–Mandel interference

Article metrics

Submit

Author e-mails

Author affiliations

Author notes

ORCID iDs

Dates

Peer review information

Abstract

1. Introduction

2. Kernel method machine learning

3. Temporal encoding and HOM interference