At the time of writing, quantum machine learning (QML) is just about the greatest combination of buzzwords you could hope to synthesize. A lot is written about QML, and the topic is often (confusingly) both overhyped and undersold at the same time. In this section we’ll try to give a flavor for how QPUs might transform machine learning, while also being careful to point out the caveats inherent in manipulating quantum data.
Useful QML applications require very large numbers of qubits. For this reason, our overview of QML applications is necessarily very high-level. Such a summary is also fitting given the rapidly changing nature of this nascent field. Although our discussion will be more schematic than pragmatic, it will heavily leverage our hands-on experience of primitives from earlier chapters.
We summarize three different QML applications: solving systems of linear equations, Quantum Principal Component Analysis, and Quantum Support Vector Machines. These have been selected due to both their relevance to machine learning and their simplicity to discuss. These are also applications whose conventional counterparts are hopefully familiar to anyone who has dabbled in machine learning. We only give a brief description of the conventional progenitors of each QML application as it’s introduced.
In discussing QML, we’ll frequently make use of the following pieces of machine-learning terminology:
Term used to describe measurable properties of data points available to a machine-learning model for making predictions. We often imagine the possible values of these features to define a feature space.
Refers to machine-learning models that must be trained on a collection of points in feature space for which correct classes or responses are already known. Only then can the adequately trained model be used to classify (or predict) responses for new points in the feature space.
Refers to machine-learning models that are able to learn patterns and structure in training data that does not include known responses.
Used to describe supervised predictive models that assign a given point in feature space to one of several discrete classes.
Used to describe supervised models predicting some continuously varying response variable.
One form of unsupervised data preprocessing that can benefit machine-learning models of all types. Dimensionality reduction aims to reduce the number of features needed to describe a problem.
In addition to this terminology, we’ll also make use of mathematical descriptions of machine-learning problems. As such, this chapter is slightly more mathematically involved than our previous discourse.
Our first QML application teaches us how a QPU can help solve systems of linear equations.
Although systems of linear equations are certainly fundamental to much of machine learning, they also underlie vast areas across all of applied mathematics. The HHL algorithm 1 (often referred to simply as HHL) we present for leveraging a QPU to efficiently solve these systems is consequently a fundamental and powerful tool, and we’ll see that it’s a key building block in other QML applications too. HHL has also been considered for applications ranging from modeling electrical effects to streamlining computer graphics calculations.
We begin our summary of the HHL algorithm by recapping the mathematics needed to describe conventional systems of linear equations. We then summarize the distinctly quantum operation of HHL, outlining its performance improvements and—just as importantly—its constraints. Finally, we give a more detailed description of how HHL works “inside the box.”
The most concise way of representing a system of linear equations is in terms of matrix multiplication. In fact, for the seasoned equation solver the terms matrices and linear equations are synonymous. For example, suppose we have a system of two linear equations, 3x1 + 4x2 = 3 and 2x1 + x2 = 3. We can equivalently, and much more concisely, represent these as the single matrix equation shown in Equation 13-1.
We can recover our two linear equations through the rules of matrix multiplication. More generally, a system of n linear equations for n variables can be written as a matrix equation containing an n × n matrix, A, and an n-dimensional vector :
Here we have also introduced a vector of the n variables we want to solve for.
In this matrix formulation, the task of solving the system of equations boils down to being able to invert the matrix A. If we can obtain the inverse A–1, then we can easily determine the n unknown variables via Equation 13-2.
Many conventional algorithms exist for finding the inverse of a matrix, and the most efficient rely on the matrix in question possessing certain helpful properties.
The following matrix parameters can affect the performance of both conventional and quantum algorithms:
The size of the system of linear equations. Equivalently, this is the dimensionality of A—if we want to solve a system of n linear equations to find n variables, then A will be an n × n matrix.
The condition number of the matrix A representing the system of linear equations. Given the system of equations , the condition number tells us how much an error in our specification of affects the error we can expect to find in our solution for . κ is calculated as the maximum ratio between the relative error in the input and the output . It turns out that κ can equivalently be found as2 the ratio of the absolute values of the maximum and minimum eigenvalues of A.
The sparsity of the matrix A. This is the number of nonzero entries in A.
The precision we require in our solution. In the case of HHL, we’ll shortly see that a state ∣〉 is output that amplitude-encodes the solution vector . Increasing means increasing the precision with which the values in are represented within these amplitudes.
Our assessment of how efficiently HHL can invert a matrix will be with respect to these parameters. By efficiency we mean the runtime of the algorithm (a measure of how many fundamental operations it must employ). For comparison, at the time of writing, the leading conventional algorithm for solving systems of linear equations is probably the conjugate gradient descent method. This has a runtime of .
HHL (named after its 2009 discoverers Harrow, Hassidim, and Lloyd) employs the primitives we’ve learned thus far to find (in a particular sense) the inverse of a matrix faster than is possible with conjugate gradient descent. We say in a particular sense because HHL solves a distinctly quantum version of the problem. HHL provides the solutions to a system of linear equations amplitude-encoded in a QPU register, and as such, they are inaccessibly quantum. Although HHL cannot solve systems of linear equations in a conventional sense, amplitude-encoded solutions can still be very useful, and are in fact critical building blocks in other QML applications.
Before decomposing HHL into primitives, we’ll give an executive summary of its inputs, outputs, and performance.
The inputs and outputs of HHL are as shown in Figure 13-1.
This schematic shows that HHL accepts two sets of input registers, and a matrix (via a quantum simulation):
This contains a number of scratch qubits used by various primitives within HHL, all prepared in the ∣0⟩ state. Because HHL deals with fixed-point (or floating-point) data and involves nontrivial arithmetic operations (such as taking the square root), we require a lot of scratch qubits. This makes HHL difficult to simulate for even the simplest cases.
We also need to provide HHL with the vector from Equation 13-2, amplitude-encoded in a QPU register (in the sense we discussed in Chapter 9). We will denote the state of a register amplitude encoding as ∣〉. Note that to prepare an amplitude encoding of we will need to use QRAM. Thus, HHL fundamentally relies on the existence of QRAM.
Naturally, HHL also needs access to the matrix encapsulating the system of linear equations we wish to solve. The bottom of Figure 13-1 illustrates how HHL requires us to represent A as a QPU operation, which we can achieve via the process of quantum simulation outlined in Chapter 9. This means that the matrix A must meet the requirements we noted for performing quantum simulation.
Figure 13-1 also shows that two registers are output from HHL.
The solution vector is amplitude-encoded within a single output QPU register (we denote this state as ∣〉). As we’ve already stressed, this implies that we cannot access the individual solutions, since they’re hidden in the amplitudes of a quantum superposition that we cannot hope to efficiently extract with READ
operations.
The scratch qubits are returned to their starting state of ∣0⟩, allowing us to continue using them elsewhere in our QPU.
Here are some examples of ways in which the quantum output from HHL can still be incredibly useful despite its inaccessible nature:
Rather than a specification of solutions for all n variables in , we may only wish to know some derived property, such as their sum, mean value, or perhaps even whether or not they contain a certain frequency component. In such cases we may be able to apply an appropriate quantum circuit to ∣〉, allowing us to READ
the derived value.
If we are satisfied with checking only whether or not the solution vector is equal to one particular suspected vector, then we can employ the swap test introduced in Chapter 3 between ∣〉 and another register encoding the suspected vector.
If, for example, we plan to use the HHL algorithm as a component in a larger algorithm, ∣〉 may be sufficient for our needs as is.
Since systems of linear equations are fundamental in many areas of machine learning, HHL is the starting point for many other QML applications, such as regression3 and data fitting.4
The HHL algorithm has a runtime5 of .
In comparison with the conventional method of conjugate gradient descent and its runtime of , HHL clearly offers an exponential improvement in the dependence on the size of the problem (n).
One could argue that this is an unfair comparison, since conventional conjugate gradient descent reveals the full set of solutions to us, unlike the quantum answer generated by HHL. We could instead compare HHL to the best conventional algorithms for determining derived statistics from solutions to systems of linear equations (sum, mean, etc.), which have n and κ dependencies of , but HHL still affords an exponential improvement in the dependence on n.
Although it’s tempting to focus solely on how algorithms scale with the problem size n, other parameters are equally important. Though offering an impressive exponential speedup in terms of n, HHL’s performance is worse than conventional competitors once we consider poorly conditioned or less sparse problems (where κ or s becomes important).6 HHL also suffers if we demand more precision and place importance on the parameter .
For these reasons, we have the following fine print:
The HHL algorithm is suited to solving systems of linear equations represented by sparse, well-conditioned matrices.
Additionally, since HHL leverages a quantum simulation primitive, we need to take note of any requirements particular to whichever quantum simulation technique we employ.
Having done our best to give a realistic impression of HHL’s usage, let’s break down how it works.
The intuition behind HHL relies on one particular method for finding the inverse of a matrix via its eigendecomposition. Any matrix has an associated set of eigenvectors and eigenvalues. Since our focus in this chapter is machine learning, we’ll assume some familiarity with this concept. If the idea is new to you, eigenvectors and eigenvalues are essentially the matrix equivalents of the eigenstates and eigenphases of QPU operations that we introduced while discussing phase estimation in Chapter 8.
In “Phase Estimation in Practice” we also noted that any QPU register state can be considered to be some superposition of the eigenstates of any QPU operation. Through its dependence on quantum simulation, HHL is restricted to solving systems of linear equations that are represented by Hermitian matrices.7 For such matrices a similar fact regarding its eigendecomposition is true. Any vector we might want to act a Hermitian matrix A on can be expressed in the basis of (i.e., written as a linear combination of) A’s eigenvectors.
For example, consider the Hermitian matrix A and vector shown in Equation 13-3.
The two eigenvectors of this particular matrix A are and , with associated eigenvalues and , as you can check by confirming that and . Since the example A considered here is Hermitian, can be written in the basis of its eigenvectors, and in fact . We could simply write this as with the understanding that the components are expressed in the basis of A’s eigenvectors.
We can also write A in its own eigenbasis.8 It turns out that when written this way a matrix is always diagonal, with its main diagonal consisting of its eigenvalues. For our preceding example, A is therefore written in its eigenbasis as shown in Equation 13-4.
Expressing A in its eigenbasis is very helpful for finding its inverse, because inverting a diagonal matrix is trivial. To do so you simply numerically invert the nonzero values along its diagonal. For example, we can find A–1 as shown in Equation 13-5.
We should note, of course, that this gives us A–1 expressed in the eigenbasis of A. We can either leave the inverse like this (if we wish to act it on vectors that are also expressed in A’s eigenbasis), or rewrite it in the original basis.
So in summary, if we can find the eigenvalues of some general matrix A, then we can determine as shown in Equation 13-6.
Where are the eigenvalues of A and we use to denote the components of expressed in A’s eigenbasis.
HHL manages to employ this matrix inversion method using the quantum parallelism of a QPU. The output register from HHL contains an amplitude encoding of precisely the vector shown in Equation 13-6; i.e., one where the amplitude of state ∣〉 is . The schematic in Figure 13-2 shows what’s inside the HHL box from Figure 13-1, outlining how HHL uses our familiar QPU primitives to produce the output state in Equation 13-6.
Although we don’t explicitly show them in Figure 13-2, HHL will also require other input registers specifying certain configuration parameters needed for the quantum simulation of A.
Let’s step through each of the primitive components used in Figure 13-2.
We already know that the phase estimation primitive can efficiently find the eigenstates and eigenphases of a QPU operation. You might suspect that this could help us in using the eigendecomposition approach to inverting a matrix—and you’d be right!
Figure 13-2 shows that we begin by using QRAM to produce a register with an amplitude encoding of and quantum simulation to produce a QPU operation representing A. We then feed both of these resources into the phase estimation primitive as shown in Figure 13-3.
Figure 13-3 helps us recall that two input registers are passed to the phase estimation primitive. The lower eigenstate
register takes an input specifying an eigenstate of the QPU operation for which we would like the associated eigenphase. The upper output
register (initialized in state ∣0⟩) produces a representation of the eigenphase, which in our case is an eigenvalue of A.
So phase estimation provides an eigenvalue of A. But how do we get all n eigenvalues?
Running phase estimation n separate times would reduce HHL’s runtime to O(n)—no better than its conventional counterparts. We could input a uniform superposition in the eigenstate
register and calculate the eigenvalues in parallel, producing a uniform superposition of them in the output
register. But suppose we take a slightly different approach and instead send the amplitude encoding of in the eigenstate
register? This results in the output
register being a superposition of A’s eigenvalues, but one that is entangled with the amplitude encoding of produced in the output
register.
This will be far more useful to us than just a uniform superposition of the eigenvalues, since now each ∣〉 within the entangled state of eigenstate
and output
registers has an amplitude of (thanks to the eigenstate
register). This isn’t quite what we want to produce the solution as shown in Equation 13-6, however. The output
register’s state represents the eigenvalues of A. What we really want is to invert these eigenvalues and—instead of them being held in another (entangled) register—move them into values multiplying the amplitudes of the eigenstate
register. The next steps in HHL achieve this inversion and transfer.
The second primitive in Figure 13-2 inverts each value stored in the (entangled) superposition of the output
and eigenstate
registers.
At the end of this step the output
register encodes a superposition of ∣〉 states, still entangled with the eigenstate
register. To invert numerical values encoded in a quantum state we actually utilize a number of the arithmetic primitives introduced in Chapter 5. There are many different ways we could build a QPU numerical inversion algorithm from these primitives. One possibility is to use the Newton method for approximating the inverse. Whatever approach we take, as we noted in Chapter 12, division is hard. This seemingly simple operation requires a significant overhead in qubit numbers. Not only will the constituent operations require scratch qubits, but dealing with inverses means that we have to encode the involved numerical values in either fixed- or floating-point representations (as well as dealing with overflow, etc.). In fact, the overheads required by this step are a contributing factor to why a full code sample of even the simplest HHL implementation currently falls outside the scope (and capabilities!) of what we can reasonably present here.9
Regardless, at the end of this step, the eigenstate
register will contain a superposition of values.
We now need to move the state-encoded inverted eigenvalues of A into the amplitudes of this state. Don’t forget that the amplitude-encoded state of is still entangled with all this, so getting those inverted eigenvalues into the state amplitudes would yield the final line in Equation 13-6, and therefore an amplitude encoding of the solution vector ∣〉.
The key to achieving this is applying a C-ROTY
(i.e., a conditional ROTY
operation; see Chapter 2). Specifically, we set the target of this conditional operation to a new single scratch qubit (labeled ROTY scratch
in Figure 13-2), initially in state ∣0⟩. It’s possible to show (although not without resorting to much more mathematics) that if we condition this ROTY
on the inverse cosine of the values stored in the output
register, then the amplitudes of all parts of the entangled output
and eigenstate
registers where ROTY scratch
is in the ∣1⟩ state acquire precisely the factors that we’re after.
Consequently, this transfer step of the algorithm consists of two parts:
Calculate in superposition on each state in the output
register. This can be achieved in terms of our basic arithmetic primitives,10 although again with a significant overhead in the number of additional qubits required.
Perform a C-ROTY
between the first register and the ROTY scratch
(prepared in ∣0⟩).
At the end of this we will have the state we want if ROTY scratch
is in state ∣1⟩. We can ensure this if we READ
the ROTY scratch
and get a 1 outcome. Unfortunately, this only occurs with a certain probability. To increase the likelihood that we get the needed 1 outcome we can employ another of our QPU primitives, performing amplitude amplification on the ROTY scratch
.
Amplitude amplification allows us to increase the probability of getting an outcome of 1 when we READ
the single-qubit ROTY
register. This then increases the chance that we end up with the eigenstate
register having the desired amplitudes.
If despite the amplitude amplification a READ
of ROTY scratch
still produces the undesired 0 outcome, we have to discard the states of the registers and rerun the whole HHL algorithm from the beginning.
Assuming success in our previous READ
, the eigenstate
register now contains an amplitude encoding of the solutions . But we’re not quite done. The eigenstate
register is still entangled not only with the output
register, but also with the many other scratch qubits we’ve introduced along the way. As mentioned in Chapter 5, having our desired state entangled with other registers is troublesome for a number of reasons. We therefore apply the uncomputation procedure (also outlined in Chapter 5) to disentangle the eigenstate
register.
Bingo! The eigenstate
register now contains a disentangled amplitude encoding of ready to be used in any of the ways suggested at the start of this section.
HHL is a complex and involved QPU application. Don’t feel intimidated if it takes more than one parse to get to grips with; it’s well worth the effort to see how a more substantial algorithm masters our QPU primitives.
Quantum Principal Component Analysis (QPCA) is a QPU implementation of the eponymous data-processing routine. QPCA not only offers a potentially more efficient approach to this widely used machine-learning task, but can also act as a building block in other QML applications. Like HHL, QPCA relies on QRAM hardware. Before outlining how a QPU allows us to soup up conventional Principal Component Analysis (PCA), we first review PCA itself.
PCA is an invaluable tool in data science, machine learning, and beyond. Often used as a preprocessing step, PCA can transform an input set of features into a new, uncorrelated set. The uncorrelated features produced by PCA can be ordered in terms of the amount of the data’s variance that they encode. By retaining only some of these new features, PCA is often employed as a dimensionality reduction technique. Keeping only the first few principal components allows us to reduce the number of features we need to deal with while retaining as much as possible of the interesting variation within the data.
The process of Principal Component Analysis also goes by many other names in different disciplines, such as the Karhunen-Loève transform or Hotelling transform. It is also equivalent to a Singular Value Decomposition.
A common geometrical way to understand the action of PCA is to envision m data points, each described by n features, as being a set of a m points in an n-dimensional feature space. In this setting PCA produces a list of n directions in feature space, ordered such that the first is the direction along which there is the most variance in the data, while the second contains the second most variance, and so on. These directions are the so-called principal components of our data. This is often illustrated in a simple two-dimensional feature space (i.e., n = 2), as shown in Figure 13-4.
One disadvantage in using the principal components generated by PCA as a new set of features is that they may not have any physical interpretation. However, if we’re ultimately interested in building models having the greatest predictive power, this may not be our greatest concern.
Although a geometric description of PCA is useful for building intuition, to actually compute principal components we need a mathematical prescription for finding them. First we calculate the covariance matrix of the dataset in question. If we arrange our data into an m × n matrix X (where each row corresponds to one of the m original data points and each column contains values for one of the n different features), then the covariance matrix σ is given by:
Conveniently, the principal components are given by finding the eigendecomposition of this covariance matrix. The eigenvectors correspond to the principal component directions, while each associated eigenvalue is proportional to the variance of the data along that principal component. If we want to use PCA for dimensionality reduction, we can then rearrange the eigenvectors in order of decreasing eigenvalue and pick only the top p as our new, reduced set of features.
When performing PCA in practice it is important to normalize data before calculating its covariance matrix, as the PCA process is sensitive to the scale of the data. One common normalization technique is finding the deviation of each feature from its mean value and scaling the result by the standard deviation of the data.
The most computationally expensive step in PCA is performing the eigendecomposition of the covariance matrix σ;. As with HHL, the need to determine eigenvalues immediately brings to mind the phase estimation primitive from Chapter 8. When carefully applied, phase estimation can help us run PCA on a QPU.
We might suspect that something like the following steps could help us find the eigendecomposition that we need for PCA:
Represent the covariance matrix of the data as a QPU operation.
Perform phase estimation on this QPU operation to determine its eigenvalues.
However, there are a few problems with this proposed approach:
In the first step we might assume that quantum simulation techniques would help us represent the covariance matrix as a QPU operation, as they did for the matrices involved with HHL. Sadly, covariance matrices rarely satisfy the sparsity requirements of quantum simulation techniques, so we’ll need a different way to find a QPU operation representation of σ.
In the second proposed step, how do we learn both the eigenvalues and eigenvectors that we need? Recall from Figure 13-3 that phase estimation has two input registers, one of which we must use to specify the eigenstate for which we want the associated eigenphase (and hence eigenvalue). But knowing any of the eigenvectors of σ is precisely part of the problem we want to solve with QPCA! We got around this seemingly circular problem when using phase estimation in HHL because we were able to use ∣〉 in the eigenstate
input register. Even though we didn’t know precisely what eigenstates ∣〉 superposed, phase estimation acted on them all in parallel—without us ever needing to learn them. Is there also some kind of clever eigenstate
input we can use for QPCA?
Remarkably, by introducing one critical trick we can solve both of the problems just described. The trick in question is a way to represent the covariance matrix σ in a QPU register (not a QPU operation!). How this trick fixes the preceding problems is quite circuitous (and mathematically involved), but we give a brief outline next.
The idea of representing a matrix in a register is new to us—so far we’ve gone to some length to represent matrices as QPU operations.
We’ve exclusively used circle notation to describe QPU registers, but have occasionally hinted that a full-blown mathematical description involves using (complex-valued) vectors. However, though we’ve carefully avoided introducing the notion, there’s an even more general mathematical description of a QPU register that uses a matrix known as a density operator.11 The details of density operators are far beyond the scope of this book (though Chapter 14 contains tips and references for starting to learn about them), but the important point for QPCA is that if we have QRAM access to our data, then a trick exists for initializing a QPU register so that its density operator description is precisely the covariance matrix of our data. While encoding matrices in a QPU register’s density operator description like this is not normally very useful, for QPCA it affords us the following fixes to the two problems we highlighted earlier.
Having our covariance matrix in a QPU register’s density operator allows us to perform a trick where by leveraging the SWAP
operation (see Chapter 3) we repeatedly perform a kind of partial “mini-SWAP” (partial in a quantum superposition sense) between the register encoding σ and a second register. Although we won’t go into any detail about how to modify SWAP
to perform this mini-SWAP
subroutine, it turns out that using it effectively results in a quantum simulation of σ being implemented on the second register.12 This is precisely the result we would normally achieve using more standard quantum simulation techniques,13 only this mini-SWAP
approach to generating a QPU operation representing σ works efficiently even if σ isn’t sparse, so long as it is of low rank. Despite this trick requiring us to repeatedly apply SWAP
operations (and consequently repeatedly reencode σ as a QPU register’s density operator), it still proves to be efficient.
The rank of a matrix is the number of its columns that are linearly independent. Since the columns of σ are the features of our data, saying that a covariance matrix is of “low rank” means that our data is actually well described by some smaller subspace of the full feature space. The number of features we would need to describe this subspace is the rank of σ.
It transpires that a density operator representation of σ is also precisely the right state for us to use in the phase estimation eigenstate
input register. If we do this, then the eigenstate
register output by the phase estimation primitive will encode an eigenvector of σ (i.e., one of the principal components) and the output
register will encode the associated eigenvalue (i.e., the amount of variance along that principal component). Precisely which principal component, eigenvalue/eigenvector pair we get is randomly determined, but the probability of getting a given principal component is, conveniently, determined by its variance.
With these proposed fixes, a full schematic for QPCA is as shown in Figure 13-5.
Figure 13-5 illustrates how the mini-SWAP
subroutine’s ability to perform quantum simulation with a density operator allows us to use it as the cont_U
input to phase estimation. Note that the density operator encoder must be run multiple times and its output passed to the mini-SWAP
subroutine, which provides the (conditional) QPU operation required by the phase estimation primitive (see Chapter 8). Also note that one run of the density operator encoder is input into the eigenstate
register of the phase estimation primitive.
We’ve used the term density operator encoder to denote the process that allows us to represent our covariance matrix as a register’s density operator. We won’t go into detail about how the encoder works here, but Figure 13-5 summarizes how access to such an ability allows us to turn the phase estimation primitive to the task of PCA.
After all this, we have an algorithm returning all the information we need about one (randomly chosen) principal component of our data—most likely to be the component having highest variance (exactly the ones we often care about when using PCA). But both the principal components and their variances are stored in QPU registers, and we must assert our usual important caveat that our solutions are output in a quantum form. Nevertheless, as was the case with HHL, it may still be possible to READ
useful derived properties. Furthermore, we could pass the QPU register states onward to other QPU applications, where their quantum nature is likely advantageous.
Conventional algorithms for PCA have runtimes of O(d), where d is the number of features we wish to perform PCA on.
In contrast, QPCA has a runtime of O(R log d), with R being the lowest-rank acceptable approximation to the covariance matrix σ (i.e., the smallest number of features that allow us to still represent our data to an acceptable approximation). In cases where R < d (the data is well described by the low-rank approximation of our principal components), QPCA gives an exponential improvement in runtime over its conventional counterpart.
The runtime of QPCA qualifies our earlier assertion that it only offers an improvement for low-rank covariance matrices. This requirement is not as restrictive as it may seem. We are (presumably) normally using PCA on data that we expect is amenable to such a low-rank approximation—otherwise we’d be remiss to consider representing it with a subset of its principal components.
Quantum support vector machines (QSVMs) demonstrate how QPUs can implement supervised machine-learning applications. Like a conventional supervised model, the QSVM we describe here must be trained on points in feature space having known classifications. However, QSVM comes with a number of unconventional constraints. First, a QSVM requires that training data be accessible in superposition using QRAM. Furthermore, the parameters that describe our learned model (used for future classification) are produced amplitude-encoded in a QPU register. As always, this means that we must take special care in how we plan to utilize a QSVM.
Support vector machines (SVMs) are a popular type of supervised classifier finding wide application. As with other linear classifiers, the idea of SVMs is to use training data to find hyperplanes in feature space that separate different output classes of the problem. Once an SVM has learned such hyperplanes, a new data point in feature space can be classified by checking on which sides of these hyperplanes it lies. For a simple example, suppose we only have two features (and therefore a two-dimensional feature space), and furthermore that there are only two possible output classes the data can assume. In that case the hyperplane we seek is a line, as shown in Figure 13-6, where the x- and y-axes represent values of the two features, and we use blue and red markers to represent training data from the two output classes.
How does an SVM learn a suitable hyperplane from training data? Geometrically we can think of the SVM training process as considering two parallel hyperplanes separating the training data classes, and trying to maximize the distance between this pair of hyperplanes. If we then choose a third hyperplane halfway between these as our classifier, we will have the classification hyperplane with the maximum margin between the training classes. The optimized hyperplane learned by the SVM is mathematically described14 by a normal vector and offset b. Having learned this description of the hyperplane, we predict the class of a new data point according to the following rule:
This mathematically determines on which side of the hyperplane the new point lies.
Although the preceding description of finding optimal hyperplanes is perhaps easier to visualize, for computational purposes it’s common to consider a so-called dual formulation of the SVM training process.15 The dual form is more useful for describing QSVMs, and requires us to solve a quadratic programming problem for a set of parameters . Specifically, finding an SVM’s optimal hyperplane is equivalent to finding the maximizing the expression in Equation 13-7.
Here is the ith training data point in feature space and yi is the associated known class. The set of inner products between the training data points take on a special role if we generalize SVMs (as we’ll discuss briefly in the next section) and are often collected into a matrix known as the kernel matrix.
Finding an satisfying this expression subject to the constraints that and gives us the information needed to recover the optimal hyperplane parameters and b. In fact, we can classify new data points directly in terms of according to Equation 13-8, where the intercept b can also be calculated from the training data.16
The key thing to note here for our forthcoming discussion of a quantum SVM is that classifying a new data point requires us to calculate its inner product with every training data point, .
We won’t delve any further into the detailed derivation or usage of these equations; instead we’ll see how the calculations they require can be performed much more efficiently if we have access to a QPU.
First, though, it’s worth noting that the kind of SVMs we have described so far are restricted to certain classification problems. To begin with, our discussion has assumed that the data to be modeled is linearly separable; i.e., that there definitely exists a hyperplane that could completely and unambiguously separate data from the two classes. Often, of course, this might not be the case, and while linearly separating the data could still provide a good fit, data from different classes may somewhat overlap in the feature space, as exemplified in Figure 13-7.
SVMs deal with this possibility through the introduction of so-called soft margins in the training process. Although we won’t expand on them here, it’s worth noting that the QPU speedups we outline shortly persist for soft-margin SVMs.
Even soft-margin SVMs are still restricted to a linear separation of the data, though. In some cases a simple hyperplane may not do a good job of separating classes. When this is the case, we can try embedding our feature space in an even higher-dimensional space. If we perform this embedding carefully, we may be able to find a hyperplane in the higher-dimensional space that does effectively segregate the data from different classes. In other words, we use the projection of an n + m-dimensional hyperplane into an n-dimensional feature space, since a linear n + m-dimensional hyperplane can have a nonlinear n-dimensional projection. Such a nonlinear SVM is shown in Figure 13-7. Though this may sound like a complex modification to our SVM training process, it turns out we can very easily implement this kind of nonlinear generalization. By replacing the kernel matrix of inner products appearing in Equation 13-7 with a carefully chosen alternative kernel matrix, we’re able to encapsulate higher-dimensional, nonlinear margins. This extension is often referred to as training an SVM with a nonlinear kernel.
There exist QPU algorithms improving the performance of both the training of SVM models and classification with a trained model. Here we will only outline using a QPU to efficiently train a QSVM model. The best classical algorithms for training conventional SVMs have runtimes of O(poly(m,n)), where m is the number of training data points and n is the number of features describing each point. In contrast, a QSVM can be trained with runtime O(log(mn)).
Obtaining a quantum advantage for training SVMs is contingent on us being content with a fully quantum model. What we mean by this is that a trained QSVM is a set of QPU registers containing amplitude encodings of the hyperplane parameters and b. Although these parameters are locked in superpositions, we’ll see that they can still be used to classify new points in feature space, so long as the data for these new points can be accessed in superposition via a QRAM.
Training an SVM on a set of m training data points, , requires us to find optimal values of the solving the quadratic programming problem stated in Equation 13-7. This seems like a tricky problem to speed up with a QPU. However, there is a reformulation of the SVM training problem known as a Least Squares Support Vector Machine (LS-SVM) that casts the building of an SVM classifier into a least-squares optimization problem.17 As a consequence, to find the required by the SVM dual formulation, we are now presented with a linear system of equations, the solution of which is given by the matrix equation shown in Equation 13-9 (where is a vector containing the training data classes).
This looks more like something amenable to a QPU speedup, via the HHL algorithm from “Solving Systems of Linear Equations”. The matrix F is constructed from the kernel matrix K of training data inner products as outlined in Equation 13-10.
Here, γ is a real number hyperparameter of the model (that in practice would be determined by cross-validation), is the identity matrix, and we use 1 to denote a vector of m ones. Once we’ve used F to obtain values of and b, we can classify new data points using the LS-SVM just as we did with the standard SVM model, by using the criterion in Equation 13-8.
Using the HHL algorithm to efficiently solve Equation 13-9 and return registers’ amplitude encoding ∣〉 and ∣〉 requires us to address a few key concerns:
In other words, is the matrix F of the type we can invert with HHL (i.e., Hermitian, sufficiently sparse, etc.)?
If we can calculate F–1 using HHL, then how do we ensure it correctly acts on the vector as Equation 13-9 requires?
Even if we address concerns 1 and 2, we still need a way to make use of the obtained quantum representations of b and to train data presented to us in the future.
Let’s address each of these concerns in turn to help convince ourselves that the training of a QSVM can indeed leverage the HHL algorithm.
It’s not too tricky to see that the matrix F is Hermitian, and so we can potentially represent it as a QPU operation using the quantum simulation techniques from Chapter 9. As we previously noted, being Hermitian, although necessary, is not sufficient for a matrix to be efficiently used in quantum simulation. However, it’s also possible to see that a matrix of the form F can be decomposed into a sum of matrices, each of which satisfies all the requirements of quantum simulation techniques. So it turns out that we can use quantum simulation to find a QPU operation representing F. Note, however, that the nontrivial elements of F consist of inner products between the training data points. To efficiently use quantum simulation for representing F as a QPU operation we will need to be able to access the training data points using QRAM.
If we’re using HHL to find F–1 then we can take care of this concern during the phase estimation stage of the algorithm. We input an amplitude encoding of the vector of training data classes, ∣〉, into the phase estimation’s eigenstate
register. Here we use ∣〉 to denote a QPU register state amplitude-encoding the vector . This, again, assumes that we have QRAM access to the training data classes. By the same logic we described when originally explaining HHL in “Solving Systems of Linear Equations”, ∣〉 can be thought of as a superposition of the eigenstates of F. As a consequence, when we follow through the HHL algorithm we will find that ∣〉 is contained in the final output register, which is equal to precisely the solutions that we desire: ∣〉. So we don’t have to do anything fancy—HHL can output F–1 acted on the required vector, just as it did when we originally used it for solving systems of linear equations more generally.
Suppose we are given a new data point , which we want to classify with our trained QSVM. Recall that a “trained QSVM” is really just access to the state ∣〉. We can perform this classification efficiently so long as we have QRAM access to the new data point . Classifying the new point requires computing the sign given by Equation 13-8. This, in turn, involves determining the inner products of with all the training data points , weighted by the LS-SVM dual hyperplane parameters i. We can calculate the requisite inner products in superposition and assess Equation 13-8 as follows. First we use our trained LS-SVM state ∣〉 in the address
register of a query to the QRAM holding the training data. This gives us a superposition of the training data with amplitudes containing the i. In another register we perform a query to the QRAM containing the new data point .
Having both of these states, we can perform a special swap-test subroutine. Although we won’t go into full detail, this subroutine combines the states resulting from our two QRAM queries into an entangled superposition and then performs a carefully constructed swap test (see Chapter 3). Recall that as well as telling us whether the states of two QPU registers are equal or not, the exact success probability p of the READ
involved in the swap test is dependent on the fidelity of the two states—a quantitative measure of precisely how close they are. The swap test we use here is carefully constructed so that the probability p of READ
ing a 1 reflects the sign we need in Equation 13-8. Specifically, p < 1/2 if the sign is +1 and p >= 1/2 if the sign is –1. By repeating the swap test and counting 0 and 1 outcomes, we can estimate the value of the probability p to a desired accuracy, and classify the data point accordingly.
Thus, we are able to train and use a quantum LS-SVM model according to the schematic shown in Figure 13-8.
Without delving into mathematical details, like those of the swap-test subroutine, Figure 13-8 gives only a very general overview of how LS-SVM training proceeds. Note that many details of the swap-test subroutine are not shown, although the key input states are highlighted. This gives some idea of the key roles the QPU primitives we’ve introduced throughout the book play.
Our QSVM summary also provides an important take-home message. A key step was recasting the SVM problem in a format that is amenable to techniques a QPU is well suited for (in particular, matrix inversion). This exemplifies a central ethos of this book—through awareness of what a QPU can do well, domain expertise may allow the discovery of new QPU applications simply by casting existing problems in QPU-compatible forms.
Quantum machine learning is still an extremely dynamic area of research. In this chapter we presented three canonical examples, but new developments in QML are constantly being made. At the same time, conventional approaches to machine learning are being inspired by QPU algorithms—in fact, within the time period of us writing this book, all three of the QML applications presented in this chapter have inspired the development of conventional algorithms with similar runtime improvements.18 This should not shake your confidence in QML’s potential; these results were unlikely to have been discovered without the inspiration of their QPU counterparts, showing that QML applications have far-reaching and unexpected consequences.19 There are also many other QML applications that we simply didn’t have space to properly mention. These include efficient QPU applications for linear regression,20 unsupervised learning,21, Boltzmann machines,22 semidefinite programming,23 and quantum recommender systems24 (quantum recommender systems have also inspired improvements in conventional algorithms25).
2 The expression for condition number in terms of eigenvalues only holds if A is a normal matrix. All matrices used in HHL are necessarily normal because of its reliance on quantum simulation and Hermitian matrices.
3 Kerenidis and Prakash, 2017.
5 Several improvements and extensions have been made to the original HHL algorithm, resulting in runtimes with different trade-offs between the various parameters. Here we focus on the conceptually simpler original algorithm.
6 A more recent result by Childs, et al., 2015, actually manages to improve the dependence of HHL on to .
7 However, as discussed in Chapter 9, we can always extend an n × n non-Hermitian matrix to a 2n × 2n Hermitian one.
8 By this we mean finding the elements A must have in order to correctly relate vectors expressed in its eigenbasis.
9 There’s something poetic and possibly profound about the fact that attempting to perform conventional notions of arithmetic on a QPU can cause such crippling overheads.
10 We actually need to include a constant value in this calculation, and calculate rather than just . Since we’re not implementing the HHL algorithm fully here we omit this for simplicity.
11 Density operators provide a more general description of a QPU register’s state than using a complex vector (or circle notation) because they allow for cases where the register might not only be in superposition, but subject to some statistical uncertainty as to precisely what that superposition is. These quantum states that contain statistical uncertainty are usually called mixed states.
12 For more technical detail on how this operation is built and how it enables such a quantum simulation, see Lloyd et al., 2013.
13 Note that this mini-SWAP
approach to representing a matrix as a QPU operation won’t always be better than quantum simulation approaches. It only works well here because covariance matrices happen to be simple to encode in a QPU register’s density operator thanks to them being in Gram form.
14 These parameters describe the hyperplane through the equation , where is a vector of feature values.
15 Optimization problems often have such a dual form that can be easier to deal with. It’s worth noting that sometimes these dual forms can contain subtle differences from the original (primal) optimization problem.
16 In fact, b can be determined from a single piece of training data lying on one of the two parallel hyperplanes defining the margin. The points lying on these defining hyperplanes are known as the support vectors.
17 There are some subtle differences between SVM and LS-SVM models, although the two have been shown to be equivalent under certain reasonable conditions. See, for example, Ye and Xiong, 2007.
18 HHL: Chia et al., 2018; QPCA and QSVM: Tang, 2018.
19 In fact, it’s not obvious how practical these quantum-inspired algorithms will be in actual usage. See for example Arrazola et al., 2019.