Chapter 13. Quantum Machine Learning

At the time of writing, quantum machine learning (QML) is just about the greatest combination of buzzwords you could hope to synthesize. A lot is written about QML, and the topic is often (confusingly) both overhyped and undersold at the same time. In this section we’ll try to give a flavor for how QPUs might transform machine learning, while also being careful to point out the caveats inherent in manipulating quantum data.

Useful QML applications require very large numbers of qubits. For this reason, our overview of QML applications is necessarily very high-level. Such a summary is also fitting given the rapidly changing nature of this nascent field. Although our discussion will be more schematic than pragmatic, it will heavily leverage our hands-on experience of primitives from earlier chapters.

We summarize three different QML applications: solving systems of linear equations, Quantum Principal Component Analysis, and Quantum Support Vector Machines. These have been selected due to both their relevance to machine learning and their simplicity to discuss. These are also applications whose conventional counterparts are hopefully familiar to anyone who has dabbled in machine learning. We only give a brief description of the conventional progenitors of each QML application as it’s introduced.

In discussing QML, we’ll frequently make use of the following pieces of machine-learning terminology:

Features

Term used to describe measurable properties of data points available to a machine-learning model for making predictions. We often imagine the possible values of these features to define a feature space.

Supervised

Refers to machine-learning models that must be trained on a collection of points in feature space for which correct classes or responses are already known. Only then can the adequately trained model be used to classify (or predict) responses for new points in the feature space.

Unsupervised

Refers to machine-learning models that are able to learn patterns and structure in training data that does not include known responses.

Classification

Used to describe supervised predictive models that assign a given point in feature space to one of several discrete classes.

Regression

Used to describe supervised models predicting some continuously varying response variable.

Dimensionality reduction

One form of unsupervised data preprocessing that can benefit machine-learning models of all types. Dimensionality reduction aims to reduce the number of features needed to describe a problem.

In addition to this terminology, we’ll also make use of mathematical descriptions of machine-learning problems. As such, this chapter is slightly more mathematically involved than our previous discourse.

Our first QML application teaches us how a QPU can help solve systems of linear equations.

Solving Systems of Linear Equations

Although systems of linear equations are certainly fundamental to much of machine learning, they also underlie vast areas across all of applied mathematics. The HHL algorithm1 (often referred to simply as HHL) we present for leveraging a QPU to efficiently solve these systems is consequently a fundamental and powerful tool, and we’ll see that it’s a key building block in other QML applications too. HHL has also been considered for applications ranging from modeling electrical effects to streamlining computer graphics calculations.

We begin our summary of the HHL algorithm by recapping the mathematics needed to describe conventional systems of linear equations. We then summarize the distinctly quantum operation of HHL, outlining its performance improvements and—just as importantly—its constraints. Finally, we give a more detailed description of how HHL works “inside the box.”

Describing and Solving Systems of Linear Equations

The most concise way of representing a system of linear equations is in terms of matrix multiplication. In fact, for the seasoned equation solver the terms matrices and linear equations are synonymous. For example, suppose we have a system of two linear equations, $3x_1 + 4x_2 = 3$ and $2x_1 + x_2 = 3$. We can equivalently, and much more concisely, represent these as the single matrix equation shown in Equation 13-1.

Equation 13-1. Using matrices to describe systems of linear equations
$\begin{bmatrix} 3 & 4 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$

We can recover our two linear equations through the rules of matrix multiplication. More generally, a system of n linear equations for n variables can be written as a matrix equation containing an n × n matrix, A, and an n-dimensional vector b:

$A\mathbf{x} = \mathbf{b}$

Here we have also introduced a vector $\mathbf{x} = [x_1, \ldots, x_n]$ of the n variables we want to solve for.

In this matrix formulation, the task of solving the system of equations boils down to being able to invert the matrix A. If we can obtain the inverse $A^{-1}$, then we can easily determine the n unknown variables via Equation 13-2.

Equation 13-2. Solving systems of linear equations by inverting matrices
$\mathbf{x} = A^{-1}\mathbf{b}$

Many conventional algorithms exist for finding the inverse of a matrix, and the most efficient rely on the matrix in question possessing certain helpful properties.
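Before we bring in a QPU, it’s worth fixing intuition with a quick classical check of the worked example. The following sketch (a minimal NumPy illustration of our own) solves Equation 13-1 by forming the inverse as in Equation 13-2:

    import numpy as np

    # The system from Equation 13-1: 3x1 + 4x2 = 3 and 2x1 + x2 = 3.
    A = np.array([[3.0, 4.0],
                  [2.0, 1.0]])
    b = np.array([3.0, 3.0])

    # Forming the explicit inverse mirrors Equation 13-2; np.linalg.solve(A, b)
    # would be the more numerically sensible choice in real code.
    x = np.linalg.inv(A) @ b
    print(x)                      # [ 1.8 -0.6]
    assert np.allclose(A @ x, b)  # we recover the original right-hand side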

The following matrix parameters can affect the performance of both conventional and quantum algorithms:

n

The size of the system of linear equations. Equivalently, this is the dimensionality of A—if we want to solve a system of n linear equations to find n variables, then A will be an n × n matrix.

κ

The condition number of the matrix A representing the system of linear equations. Given the system of equations $A\mathbf{x}=\mathbf{b}$, the condition number tells us how much an error in our specification of b affects the error we can expect to find in our solution $\mathbf{x}=A^{-1}\mathbf{b}$. κ is calculated as the maximum ratio between the relative error in the solution $\mathbf{x}=A^{-1}\mathbf{b}$ and the relative error in the input b. It turns out that κ can equivalently be found as2 the ratio $|\lambda_{\max}|/|\lambda_{\min}|$ of the absolute values of the maximum and minimum eigenvalues of A.

s

The sparsity of the matrix A. We take this to be the maximum number of nonzero entries appearing in any row of A.

ϵ

The precision we require in our solution. In the case of HHL, we’ll shortly see that a state ∣x〉 is output that amplitude-encodes the solution vector x. Demanding that the values in x be represented more precisely within these amplitudes corresponds to a smaller value of ϵ.

Our assessment of how efficiently HHL can invert a matrix will be with respect to these parameters. By efficiency we mean the runtime of the algorithm (a measure of how many fundamental operations it must employ). For comparison, at the time of writing, the leading conventional algorithm for solving systems of linear equations is probably the conjugate gradient method. This has a runtime of $O(ns\kappa\log(1/\epsilon))$.
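To make these parameters concrete, the short sketch below (plain NumPy, reusing the small matrix from Equation 13-1, and adopting the per-row definition of sparsity given above) computes n, κ, and s classically:

    import numpy as np

    A = np.array([[3.0, 4.0],
                  [2.0, 1.0]])

    n = A.shape[0]

    # np.linalg.cond returns the ratio of largest to smallest singular value,
    # which coincides with |lambda_max|/|lambda_min| when A is a normal matrix.
    kappa = np.linalg.cond(A)

    # Sparsity taken as the maximum number of nonzero entries in any row.
    s = int(np.count_nonzero(A, axis=1).max())

    print(f"n = {n}, kappa = {kappa:.2f}, s = {s}")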

Solving Linear Equations with a QPU

HHL (named after its 2009 discoverers Harrow, Hassidim, and Lloyd) employs the primitives we’ve learned thus far to find (in a particular sense) the inverse of a matrix faster than is possible with the conjugate gradient method. We say in a particular sense because HHL solves a distinctly quantum version of the problem. HHL provides the solutions to a system of linear equations amplitude-encoded in a QPU register, and as such, they are inaccessibly quantum. Although HHL cannot solve systems of linear equations in a conventional sense, amplitude-encoded solutions can still be very useful, and are in fact critical building blocks in other QML applications.

What HHL does

Before decomposing HHL into primitives, we’ll give an executive summary of its inputs, outputs, and performance.

The inputs and outputs of HHL are as shown in Figure 13-1.

Figure 13-1. A high-level view of the inputs and outputs utilized by the HHL algorithm to solve systems of linear equations.

Inputs

This schematic shows that HHL accepts two sets of input registers, and a matrix (via a quantum simulation):

Scratch register

This contains a number of scratch qubits used by various primitives within HHL, all prepared in the ∣0⟩ state. Because HHL deals with fixed-point (or floating-point) data and involves nontrivial arithmetic operations (such as taking the square root), we require a lot of scratch qubits. This makes HHL difficult to simulate for even the simplest cases.

Amplitude encoding of b

We also need to provide HHL with the vector b from Equation 13-2, amplitude-encoded in a QPU register (in the sense we discussed in Chapter 9). We will denote the state of a register amplitude encoding b as ∣b〉. Note that to prepare an amplitude encoding of b we will need to use QRAM. Thus, HHL fundamentally relies on the existence of QRAM.

QPU operation representing A

Naturally, HHL also needs access to the matrix encapsulating the system of linear equations we wish to solve. The bottom of Figure 13-1 illustrates how HHL requires us to represent A as a QPU operation, which we can achieve via the process of quantum simulation outlined in Chapter 9. This means that the matrix A must meet the requirements we noted for performing quantum simulation.

Outputs

Figure 13-1 also shows that two registers are output from HHL.

Solution register

The solution vector x is amplitude-encoded within a single output QPU register (we denote this state as ∣x〉). As we’ve already stressed, this implies that we cannot access the individual solutions, since they’re hidden in the amplitudes of a quantum superposition that we cannot hope to efficiently extract with READ operations.

Scratch register

The scratch qubits are returned to their starting state of ∣0⟩, allowing us to continue using them elsewhere in our QPU.

Here are some examples of ways in which the quantum output from HHL can still be incredibly useful despite its inaccessible nature:

  1. Rather than a specification of solutions for all n variables in x, we may only wish to know some derived property, such as their sum, mean value, or perhaps even whether or not they contain a certain frequency component. In such cases we may be able to apply an appropriate quantum circuit to ∣x〉, allowing us to READ the derived value.

  2. If we are satisfied with checking only whether or not the solution vector x is equal to one particular suspected vector, then we can employ the swap test introduced in Chapter 3 between ∣x〉 and another register encoding the suspected vector.

  3. If, for example, we plan to use the HHL algorithm as a component in a larger algorithm, ∣x〉 may be sufficient for our needs as is.

Since systems of linear equations are fundamental in many areas of machine learning, HHL is the starting point for many other QML applications, such as regression3 and data fitting.4

Speed and fine print

The HHL algorithm has a runtime5 of $O(\kappa^2 s^2 \epsilon^{-1}\log n)$.

In comparison with the conventional conjugate gradient method and its runtime of $O(ns\kappa\log(1/\epsilon))$, HHL clearly offers an exponential improvement in the dependence on the size of the problem (n).

One could argue that this is an unfair comparison, since the conventional conjugate gradient method reveals the full set of solutions to us, unlike the quantum answer generated by HHL. We could instead compare HHL to the best conventional algorithms for determining derived statistics from solutions to systems of linear equations (sum, mean, etc.), which have n and κ dependencies of $O(n\kappa)$, but HHL still affords an exponential improvement in the dependence on n.

Although it’s tempting to focus solely on how algorithms scale with the problem size n, other parameters are equally important. Though it offers an impressive exponential speedup in terms of n, HHL’s dependence on the condition number and sparsity is worse than that of its conventional competitors, so its advantage can be lost for poorly conditioned or less sparse problems (where κ or s becomes important).6 HHL also suffers if we demand more precision and place importance on the parameter ϵ.

For these reasons, we have the following fine print:

The HHL algorithm is suited to solving systems of linear equations represented by sparse, well-conditioned matrices.

Additionally, since HHL leverages a quantum simulation primitive, we need to take note of any requirements particular to whichever quantum simulation technique we employ.

Having done our best to give a realistic impression of HHL’s usage, let’s break down how it works.

Inside the box

The intuition behind HHL relies on one particular method for finding the inverse of a matrix via its eigendecomposition. Any matrix has an associated set of eigenvectors and eigenvalues. Since our focus in this chapter is machine learning, we’ll assume some familiarity with this concept. If the idea is new to you, eigenvectors and eigenvalues are essentially the matrix equivalents of the eigenstates and eigenphases of QPU operations that we introduced while discussing phase estimation in Chapter 8.

In “Phase Estimation in Practice” we also noted that any QPU register state can be considered to be some superposition of the eigenstates of any QPU operation. Through its dependence on quantum simulation, HHL is restricted to solving systems of linear equations that are represented by Hermitian matrices.7 For such matrices a similar fact regarding their eigendecomposition holds: any vector that a Hermitian matrix A might act on can be expressed in the basis of (i.e., written as a linear combination of) A’s eigenvectors.

For example, consider the Hermitian matrix A and vector z shown in Equation 13-3.

Equation 13-3. Example matrix and vector for introducing eigendecomposition
$A = \begin{bmatrix} 2 & 2 \\ 2 & 3 \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$

The two eigenvectors of this particular matrix A are $\mathbf{v}_1=[-0.788, 0.615]$ and $\mathbf{v}_2=[-0.615, -0.788]$, with associated eigenvalues $\lambda_1=0.438$ and $\lambda_2=4.56$, as you can check by confirming that $A\mathbf{v}_1=\lambda_1\mathbf{v}_1$ and $A\mathbf{v}_2=\lambda_2\mathbf{v}_2$. Since the example A considered here is Hermitian, z can be written in the basis of its eigenvectors, and in fact $\mathbf{z}=-0.788\mathbf{v}_1-0.615\mathbf{v}_2$. We could simply write this as $\mathbf{z}=[-0.788,-0.615]$ with the understanding that the components are expressed in the basis of A’s eigenvectors.

We can also write A in its own eigenbasis.8 It turns out that when written this way a matrix is always diagonal, with its main diagonal consisting of its eigenvalues. For our preceding example, A is therefore written in its eigenbasis as shown in Equation 13-4.

Equation 13-4. Writing a matrix in its eigenbasis
$A = \begin{bmatrix} 0.438 & 0 \\ 0 & 4.56 \end{bmatrix}$

Expressing A in its eigenbasis is very helpful for finding its inverse, because inverting a diagonal matrix is trivial. To do so you simply take the reciprocal of each of the nonzero values along its diagonal. For example, we can find $A^{-1}$ as shown in Equation 13-5.

Equation 13-5. Inverting a matrix written in its eigenbasis
$A^{-1} = \begin{bmatrix} \frac{1}{0.438} & 0 \\ 0 & \frac{1}{4.56} \end{bmatrix} = \begin{bmatrix} 2.281 & 0 \\ 0 & 0.219 \end{bmatrix}$

We should note, of course, that this gives us A–1 expressed in the eigenbasis of A. We can either leave the inverse like this (if we wish to act it on vectors that are also expressed in A’s eigenbasis), or rewrite it in the original basis.

So in summary, if we can find the eigenvalues of some general matrix A, then we can determine $\mathbf{x}=A^{-1}\mathbf{b}$ as shown in Equation 13-6.

Equation 13-6. General approach for determining the inverse of a matrix via its eigendecomposition
$\mathbf{x} = \begin{bmatrix} \frac{1}{\lambda_1} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \frac{1}{\lambda_n} \end{bmatrix} \begin{bmatrix} \tilde{b}_1 \\ \vdots \\ \tilde{b}_n \end{bmatrix} = \begin{bmatrix} \tilde{b}_1/\lambda_1 \\ \vdots \\ \tilde{b}_n/\lambda_n \end{bmatrix}$

where $\lambda_1, \ldots, \lambda_n$ are the eigenvalues of A and $\tilde{b}_1, \ldots, \tilde{b}_n$ denote the components of b expressed in A’s eigenbasis.
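Classically, this eigendecomposition route to the inverse is easy to prototype. The sketch below (our own NumPy illustration, reusing the example from Equation 13-3) computes exactly the $\tilde{b}_i/\lambda_i$ values of Equation 13-6; the whole point of HHL is that it produces these numbers as amplitudes of a register rather than as readable floats:

    import numpy as np

    # Hermitian example from Equation 13-3.
    A = np.array([[2.0, 2.0],
                  [2.0, 3.0]])
    b = np.array([1.0, 0.0])

    # np.linalg.eigh handles Hermitian matrices; the columns of eigvecs are eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(A)

    # Components of b expressed in A's eigenbasis (the b-tilde values).
    b_tilde = eigvecs.T @ b

    # The values HHL encodes in amplitudes: b_tilde_i / lambda_i.
    x_in_eigenbasis = b_tilde / eigvals

    # Rotating back to the original basis recovers the usual solution vector.
    x = eigvecs @ x_in_eigenbasis
    assert np.allclose(x, np.linalg.solve(A, b))
    print(eigvals, x_in_eigenbasis, x)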

Figure 13-2. Schematic outline of the primitives contained in the HHL algorithm

HHL manages to employ this matrix inversion method using the quantum parallelism of a QPU. The output register from HHL contains an amplitude encoding of precisely the vector shown in Equation 13-6; i.e., one where the amplitude of state ∣i〉 is $\tilde{b}_i/\lambda_i$. The schematic in Figure 13-2 shows what’s inside the HHL box from Figure 13-1, outlining how HHL uses our familiar QPU primitives to produce the output state in Equation 13-6.

Note

Although we don’t explicitly show them in Figure 13-2, HHL will also require other input registers specifying certain configuration parameters needed for the quantum simulation of A.

Let’s step through each of the primitive components used in Figure 13-2.

1. Quantum simulation, QRAM, and phase estimation

We already know that the phase estimation primitive can efficiently find the eigenstates and eigenphases of a QPU operation. You might suspect that this could help us in using the eigendecomposition approach to inverting a matrix—and you’d be right!

Figure 13-2 shows that we begin by using QRAM to produce a register with an amplitude encoding of b and quantum simulation to produce a QPU operation representing A. We then feed both of these resources into the phase estimation primitive as shown in Figure 13-3.

Figure 13-3. Recap of phase estimation primitive, where we have identified the eigenphase obtained in the output register as the eigenvalue of A

Figure 13-3 helps us recall that two input registers are passed to the phase estimation primitive. The lower eigenstate register takes an input specifying an eigenstate of the QPU operation for which we would like the associated eigenphase. The upper output register (initialized in state ∣0⟩) produces a representation of the eigenphase, which in our case is an eigenvalue of A.

So phase estimation provides an eigenvalue of A. But how do we get all n eigenvalues?

Running phase estimation n separate times would push HHL’s runtime up to O(n)—no better than its conventional counterparts. We could input a uniform superposition in the eigenstate register and calculate the eigenvalues in parallel, producing a uniform superposition of them in the output register. But suppose we take a slightly different approach and instead send the amplitude encoding of b in the eigenstate register? This results in the output register holding a superposition of A’s eigenvalues, but one that is entangled with the eigenstate register, which still holds the amplitude encoding of b.

This will be far more useful to us than just a uniform superposition of the eigenvalues, since now each ∣λi〉 within the entangled state of eigenstate and output registers has an amplitude of $\tilde{b}_i$ (thanks to the eigenstate register). This isn’t quite enough to produce the solution shown in Equation 13-6, however. The output register’s state represents the eigenvalues of A. What we really want is to invert these eigenvalues and—instead of them being held in another (entangled) register—move them into values multiplying the amplitudes of the eigenstate register. The next steps in HHL achieve this inversion and transfer.

2. Invert values

The second primitive in Figure 13-2 inverts each λi value stored in the (entangled) superposition of the output and eigenstate registers.

At the end of this step the output register encodes a superposition of ∣1/λi〉 states, still entangled with the eigenstate register. To invert numerical values encoded in a quantum state we actually utilize a number of the arithmetic primitives introduced in Chapter 5. There are many different ways we could build a QPU numerical inversion algorithm from these primitives. One possibility is to use Newton’s method to approximate the reciprocal, as sketched below. Whatever approach we take, as we noted in Chapter 12, division is hard. This seemingly simple operation requires a significant overhead in qubit numbers. Not only will the constituent operations require scratch qubits, but dealing with inverses means that we have to encode the involved numerical values in either fixed- or floating-point representations (as well as dealing with overflow, etc.). In fact, the overheads required by this step are a contributing factor to why a full code sample of even the simplest HHL implementation currently falls outside the scope (and capabilities!) of what we can reasonably present here.9
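To give a flavor of what such an inversion routine computes (though not of how it is made reversible or fixed-point), here is a classical sketch of Newton’s method for a reciprocal, which notably needs only multiplication and subtraction. The starting guess and iteration count are illustrative choices of our own; a QPU implementation would fix them in advance for the expected range of eigenvalues:

    import numpy as np

    def newton_reciprocal(lam, n_iters=20):
        """Approximate 1/lam using only multiplication and subtraction,
        via the iteration x_{k+1} = x_k * (2 - lam * x_k)."""
        # Crude initial estimate, guaranteed to lie in the interval (0, 2/lam).
        x = 1.0 / 2 ** np.ceil(np.log2(lam))
        for _ in range(n_iters):
            x = x * (2 - lam * x)
        return x

    print(newton_reciprocal(4.56), 1 / 4.56)    # the two values agree closely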

Regardless, at the end of this step, the output register will contain a superposition of ∣1/λi〉 values, still entangled with the eigenstate register.

3. Move inverted values into amplitudes

We now need to move the state-encoded inverted eigenvalues of A into the amplitudes of this state. Don’t forget that the amplitude-encoded state of b is still entangled with all this, so getting those inverted eigenvalues into the state amplitudes would yield the final line in Equation 13-6, and therefore an amplitude encoding of the solution vector ∣x〉.

The key to achieving this is applying a C-ROTY (i.e., a conditional ROTY operation; see Chapter 2). Specifically, we set the target of this conditional operation to a new single scratch qubit (labeled ROTY scratch in Figure 13-2), initially in state ∣0⟩. It’s possible to show (although not without resorting to much more mathematics) that if we condition this ROTY on the inverse cosine of the 1/λi values stored in the output register, then the amplitudes of all parts of the entangled output and eigenstate registers where ROTY scratch is in the ∣1⟩ state acquire precisely the 1/λi factors that we’re after.

Consequently, this transfer step of the algorithm consists of two parts:

  1. Calculate arccos(1/λi) in superposition on each state in the output register. This can be achieved in terms of our basic arithmetic primitives,10 although again with a significant overhead in the number of additional qubits required.

  2. Perform a C-ROTY between the output register and the ROTY scratch (prepared in ∣0⟩).

At the end of this we will have the state we want if ROTY scratch is in state ∣1⟩. We can ensure this if we READ the ROTY scratch and get a 1 outcome. Unfortunately, this only occurs with a certain probability. To increase the likelihood that we get the needed 1 outcome we can employ another of our QPU primitives, performing amplitude amplification on the ROTY scratch.
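The net effect of steps 2 and 3 is easier to see numerically than symbolically. In the toy sketch below (plain NumPy), we simply assume that the scratch qubit’s ∣1〉 branch picks up a factor of C/λi, with C a scaling constant chosen so every factor stays at most 1; the details of the conditional rotation are glossed over. It shows how post-selecting on a 1 outcome leaves amplitudes proportional to $\tilde{b}_i/\lambda_i$, and why that outcome only occurs with some probability:

    import numpy as np

    lam = np.array([0.438, 4.56])         # eigenvalues held in the output register
    b_tilde = np.array([-0.788, -0.615])  # amplitudes inherited from |b> (Equation 13-3 example)

    C = np.min(np.abs(lam))               # scaling constant keeping all factors <= 1

    # Amplitudes of the branch in which the ROTY scratch qubit will READ as 1.
    amp_scratch_1 = b_tilde * (C / lam)

    # Probability of actually READing a 1 (before any amplitude amplification),
    # and the renormalized post-selected state, proportional to b_tilde / lam.
    p_success = np.sum(np.abs(amp_scratch_1) ** 2)
    x_amplitudes = amp_scratch_1 / np.sqrt(p_success)

    print(p_success, x_amplitudes)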

4. Amplitude amplification

Amplitude amplification allows us to increase the probability of getting an outcome of 1 when we READ the single-qubit ROTY register. This then increases the chance that we end up with the eigenstate register having the desired $\tilde{b}_i/\lambda_i$ amplitudes.

If despite the amplitude amplification a READ of ROTY scratch still produces the undesired 0 outcome, we have to discard the states of the registers and rerun the whole HHL algorithm from the beginning.

5. Uncompute

Assuming success in our previous READ, the eigenstate register now contains an amplitude encoding of the solutions x. But we’re not quite done. The eigenstate register is still entangled not only with the output register, but also with the many other scratch qubits we’ve introduced along the way. As mentioned in Chapter 5, having our desired state entangled with other registers is troublesome for a number of reasons. We therefore apply the uncomputation procedure (also outlined in Chapter 5) to disentangle the eigenstate register.

Bingo! The eigenstate register now contains a disentangled amplitude encoding of x ready to be used in any of the ways suggested at the start of this section.

HHL is a complex and involved QPU application. Don’t feel intimidated if it takes more than one parse to get to grips with; it’s well worth the effort to see how a more substantial algorithm masters our QPU primitives.

Quantum Principal Component Analysis

Quantum Principal Component Analysis (QPCA) is a QPU implementation of the eponymous data-processing routine. QPCA not only offers a potentially more efficient approach to this widely used machine-learning task, but can also act as a building block in other QML applications. Like HHL, QPCA relies on QRAM hardware. Before outlining how a QPU allows us to soup up conventional Principal Component Analysis (PCA), we first review PCA itself.

Conventional Principal Component Analysis

PCA is an invaluable tool in data science, machine learning, and beyond. Often used as a preprocessing step, PCA can transform an input set of features into a new, uncorrelated set. The uncorrelated features produced by PCA can be ordered in terms of the amount of the data’s variance that they encode. By retaining only some of these new features, PCA is often employed as a dimensionality reduction technique. Keeping only the first few principal components allows us to reduce the number of features we need to deal with while retaining as much as possible of the interesting variation within the data.

Note

The process of Principal Component Analysis also goes by many other names in different disciplines, such as the Karhunen-Loève transform or Hotelling transform. It is also equivalent to a Singular Value Decomposition.

A common geometrical way to understand the action of PCA is to envision m data points, each described by n features, as a set of m points in an n-dimensional feature space. In this setting PCA produces a list of n directions in feature space, ordered such that the first is the direction along which there is the most variance in the data, while the second contains the second most variance, and so on. These directions are the so-called principal components of our data. This is often illustrated in a simple two-dimensional feature space (i.e., n = 2), as shown in Figure 13-4.

Figure 13-4. The two principal components of 1,000 data points in a two-dimensional feature space. The directions of the arrows show the principal component vectors, while their lengths represent the variance of the data in that direction.

One disadvantage in using the principal components generated by PCA as a new set of features is that they may not have any physical interpretation. However, if we’re ultimately interested in building models having the greatest predictive power, this may not be our greatest concern.

Although a geometric description of PCA is useful for building intuition, to actually compute principal components we need a mathematical prescription for finding them. First we calculate the covariance matrix of the dataset in question. If we arrange our data into an m × n matrix X (where each row corresponds to one of the m original data points and each column contains values for one of the n different features), then the covariance matrix σ is given by:

$\sigma = \frac{1}{m-1}X^T X$

Conveniently, the principal components are given by finding the eigendecomposition of this covariance matrix. The eigenvectors correspond to the principal component directions, while each associated eigenvalue is proportional to the variance of the data along that principal component. If we want to use PCA for dimensionality reduction, we can then rearrange the eigenvectors in order of decreasing eigenvalue and pick only the top p as our new, reduced set of features.

Warning

When performing PCA in practice it is important to normalize the data before calculating its covariance matrix, as the PCA process is sensitive to the scale of the data. One common normalization technique is to subtract each feature’s mean value and divide the result by that feature’s standard deviation.
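For reference, conventional PCA along these lines takes only a few lines of NumPy. The following is a minimal sketch on synthetic data (the toy dataset and the choice to keep a single component are ours), not a production implementation:

    import numpy as np

    rng = np.random.default_rng(0)
    # Toy dataset: m = 1000 points with n = 2 correlated features.
    X = rng.normal(size=(1000, 2)) @ np.array([[3.0, 0.0],
                                               [1.5, 0.5]])

    # Normalize: center each feature and scale by its standard deviation.
    X = (X - X.mean(axis=0)) / X.std(axis=0)

    # Covariance matrix and its eigendecomposition (eigh: sigma is symmetric).
    m = X.shape[0]
    sigma = X.T @ X / (m - 1)
    variances, components = np.linalg.eigh(sigma)

    # Order principal components by decreasing variance and keep the top p of them.
    order = np.argsort(variances)[::-1]
    variances, components = variances[order], components[:, order]
    p = 1
    X_reduced = X @ components[:, :p]   # dimensionality-reduced data
    print(variances)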

The most computationally expensive step in PCA is performing the eigendecomposition of the covariance matrix σ. As with HHL, the need to determine eigenvalues immediately brings to mind the phase estimation primitive from Chapter 8. When carefully applied, phase estimation can help us run PCA on a QPU.

PCA with a QPU

We might suspect that something like the following steps could help us find the eigendecomposition that we need for PCA:

  1. Represent the covariance matrix of the data as a QPU operation.

  2. Perform phase estimation on this QPU operation to determine its eigenvalues.

However, there are a few problems with this proposed approach:

Problem 1: Quantum simulation with σ

In the first step we might assume that quantum simulation techniques would help us represent the covariance matrix as a QPU operation, as they did for the matrices involved with HHL. Sadly, covariance matrices rarely satisfy the sparsity requirements of quantum simulation techniques, so we’ll need a different way to find a QPU operation representation of σ.

Problem 2: Input for phase estimation

In the second proposed step, how do we learn both the eigenvalues and eigenvectors that we need? Recall from Figure 13-3 that phase estimation has two input registers, one of which we must use to specify the eigenstate for which we want the associated eigenphase (and hence eigenvalue). But knowing any of the eigenvectors of σ is precisely part of the problem we want to solve with QPCA! We got around this seemingly circular problem when using phase estimation in HHL because we were able to use ∣b〉 in the eigenstate input register. Even though we didn’t know precisely what eigenstates ∣b〉 superposed, phase estimation acted on them all in parallel—without us ever needing to learn them. Is there also some kind of clever eigenstate input we can use for QPCA?

Remarkably, by introducing one critical trick we can solve both of the problems just described. The trick in question is a way to represent the covariance matrix σ in a QPU register (not a QPU operation!). How this trick fixes the preceding problems is quite circuitous (and mathematically involved), but we give a brief outline next.

Representing a covariance matrix in a QPU register

The idea of representing a matrix in a register is new to us—so far we’ve gone to some length to represent matrices as QPU operations.

We’ve exclusively used circle notation to describe QPU registers, but have occasionally hinted that a full-blown mathematical description involves using (complex-valued) vectors. However, though we’ve carefully avoided introducing the notion, there’s an even more general mathematical description of a QPU register that uses a matrix known as a density operator.11 The details of density operators are far beyond the scope of this book (though Chapter 14 contains tips and references for starting to learn about them), but the important point for QPCA is that if we have QRAM access to our data, then a trick exists for initializing a QPU register so that its density operator description is precisely the covariance matrix of our data. While encoding matrices in a QPU register’s density operator description like this is not normally very useful, for QPCA it affords us the following fixes to the two problems we highlighted earlier.

Tip

The trick QPCA uses to represent a covariance matrix as a density operator works because covariance matrices are always in Gram form, meaning that they can be written in the form $V^T V$ for some other matrix V. For other matrices it’s not such a useful trick.
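We won’t show the QRAM-based encoding circuit, but the claim that a (trace-normalized) covariance matrix is a legitimate density operator is easy to check classically. The sketch below (our own NumPy illustration) verifies the three defining properties and highlights that the eigenvectors of ρ are exactly the principal components QPCA is after:

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 4))       # m = 200 points, n = 4 features
    X = X - X.mean(axis=0)

    sigma = X.T @ X / (X.shape[0] - 1)  # Gram form: V^T V with V = X / sqrt(m - 1)
    rho = sigma / np.trace(sigma)       # candidate density operator

    eigvals = np.linalg.eigvalsh(rho)
    assert np.allclose(rho, rho.T)      # Hermitian
    assert np.all(eigvals >= -1e-12)    # positive semidefinite
    assert np.isclose(eigvals.sum(), 1) # unit trace

    # The eigenvectors of rho are the principal components, and its eigenvalues
    # are the (normalized) variances that determine QPCA's sampling probabilities.
    print(eigvals)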

Fixing problem 1

Having our covariance matrix in a QPU register’s density operator allows us to perform a trick whereby, leveraging the SWAP operation (see Chapter 3), we repeatedly perform a kind of partial “mini-SWAP” (partial in a quantum superposition sense) between the register encoding σ and a second register. Although we won’t go into any detail about how to modify SWAP to perform this mini-SWAP subroutine, it turns out that using it effectively results in a quantum simulation of σ being implemented on the second register.12 This is precisely the result we would normally achieve using more standard quantum simulation techniques,13 only this mini-SWAP approach to generating a QPU operation representing σ works efficiently even if σ isn’t sparse, so long as it is of low rank. Despite this trick requiring us to repeatedly apply SWAP operations (and consequently repeatedly reencode σ as a QPU register’s density operator), it still proves to be efficient.

Note

The rank of a matrix is the number of its columns that are linearly independent. Since the columns of σ are the features of our data, saying that a covariance matrix is of “low rank” means that our data is actually well described by some smaller subspace of the full feature space. The number of features we would need to describe this subspace is the rank of σ.

Fixing problem 2

It transpires that a density operator representation of σ is also precisely the right state for us to use in the phase estimation eigenstate input register. If we do this, then the eigenstate register output by the phase estimation primitive will encode an eigenvector of σ (i.e., one of the principal components) and the output register will encode the associated eigenvalue (i.e., the amount of variance along that principal component). Precisely which principal component, eigenvalue/eigenvector pair we get is randomly determined, but the probability of getting a given principal component is, conveniently, determined by its variance.

With these proposed fixes, a full schematic for QPCA is as shown in Figure 13-5.

Schematic for QPCA
Figure 13-5. Schematic for QPCA

Figure 13-5 illustrates how the mini-SWAP subroutine’s ability to perform quantum simulation with a density operator allows us to use it as the cont_U input to phase estimation. Note that the density operator encoder must be run multiple times and its output passed to the mini-SWAP subroutine, which provides the (conditional) QPU operation required by the phase estimation primitive (see Chapter 8). Also note that one run of the density operator encoder is input into the eigenstate register of the phase estimation primitive.

We’ve used the term density operator encoder to denote the process that allows us to represent our covariance matrix as a register’s density operator. We won’t go into detail about how the encoder works here, but Figure 13-5 summarizes how access to such an ability allows us to turn the phase estimation primitive to the task of PCA.

The output

After all this, we have an algorithm returning all the information we need about one (randomly chosen) principal component of our data—most likely to be the component having highest variance (exactly the ones we often care about when using PCA). But both the principal components and their variances are stored in QPU registers, and we must assert our usual important caveat that our solutions are output in a quantum form. Nevertheless, as was the case with HHL, it may still be possible to READ useful derived properties. Furthermore, we could pass the QPU register states onward to other QPU applications, where their quantum nature is likely advantageous.

Performance

Conventional algorithms for PCA have runtimes of O(d), where d is the number of features we wish to perform PCA on.

In contrast, QPCA has a runtime of O(R log d), where R is the rank of the lowest-rank approximation to the covariance matrix σ that we find acceptable (i.e., the smallest number of features that still allow us to represent our data to an acceptable approximation). In cases where R < d (the data is well described by the low-rank approximation of our principal components), QPCA gives an exponential improvement in runtime over its conventional counterpart.

The runtime of QPCA qualifies our earlier assertion that it only offers an improvement for low-rank covariance matrices. This requirement is not as restrictive as it may seem. We are (presumably) normally using PCA on data that we expect is amenable to such a low-rank approximation—otherwise we’d be remiss to consider representing it with a subset of its principal components.

Quantum Support Vector Machines

Quantum support vector machines (QSVMs) demonstrate how QPUs can implement supervised machine-learning applications. Like a conventional supervised model, the QSVM we describe here must be trained on points in feature space having known classifications. However, QSVM comes with a number of unconventional constraints. First, a QSVM requires that training data be accessible in superposition using QRAM. Furthermore, the parameters that describe our learned model (used for future classification) are produced amplitude-encoded in a QPU register. As always, this means that we must take special care in how we plan to utilize a QSVM.

Conventional Support Vector Machines

Support vector machines (SVMs) are a popular type of supervised classifier finding wide application. As with other linear classifiers, the idea of SVMs is to use training data to find hyperplanes in feature space that separate different output classes of the problem. Once an SVM has learned such hyperplanes, a new data point in feature space can be classified by checking on which sides of these hyperplanes it lies. For a simple example, suppose we only have two features (and therefore a two-dimensional feature space), and furthermore that there are only two possible output classes the data can assume. In that case the hyperplane we seek is a line, as shown in Figure 13-6, where the x- and y-axes represent values of the two features, and we use blue and red markers to represent training data from the two output classes.

Figure 13-6. Example of the classification hyperplane produced by an SVM for a two-class classification problem with two features

How does an SVM learn a suitable hyperplane from training data? Geometrically we can think of the SVM training process as considering two parallel hyperplanes separating the training data classes, and trying to maximize the distance between this pair of hyperplanes. If we then choose a third hyperplane halfway between these as our classifier, we will have the classification hyperplane with the maximum margin between the training classes. The optimized hyperplane learned by the SVM is mathematically described14 by a normal vector w and offset b. Having learned this description of the hyperplane, we predict the class of a new data point x according to the following rule:

$\text{class} = \operatorname{sign}(\mathbf{w}\cdot\mathbf{x} - b)$

This mathematically determines on which side of the hyperplane the new point lies.

Although the preceding description of finding optimal hyperplanes is perhaps easier to visualize, for computational purposes it’s common to consider a so-called dual formulation of the SVM training process.15 The dual form is more useful for describing QSVMs, and requires us to solve a quadratic programming problem for a set of parameters α=[α1,...,αm]. Specifically, finding an SVM’s optimal hyperplane is equivalent to finding the α maximizing the expression in Equation 13-7.

Equation 13-7. Dual description of the SVM optimization problem
$\sum_i \alpha_i y_i - \frac{1}{2}\sum_{i,j} \alpha_i (\mathbf{x}_i \cdot \mathbf{x}_j)\, \alpha_j$

Here $\mathbf{x}_i$ is the ith training data point in feature space and $y_i$ is the associated known class. The inner products $\mathbf{x}_i\cdot\mathbf{x}_j=K_{ij}$ between the training data points take on a special role if we generalize SVMs (as we’ll discuss briefly in the next section) and are often collected into a matrix known as the kernel matrix.

Finding an α satisfying this expression subject to the constraints that $\sum_i\alpha_i=0$ and $y_i\alpha_i\geq 0$ for all i gives us the information needed to recover the optimal hyperplane parameters w and b. In fact, we can classify new data points directly in terms of α according to Equation 13-8, where the intercept b can also be calculated from the training data.16

Equation 13-8. Classification rule for an SVM in the dual representation
$\operatorname{sign}\left(\sum_i \alpha_i\, \mathbf{x}_i\cdot\mathbf{x} - b\right)$

The key thing to note here for our forthcoming discussion of a quantum SVM is that classifying a new data point x requires us to calculate its inner product with every training data point, $\mathbf{x}\cdot\mathbf{x}_i$.
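Written out classically, the dual classification rule of Equation 13-8 is just a weighted sum of inner products. The sketch below (our own NumPy illustration, with made-up values standing in for a trained α and b) makes the cost explicit, since every prediction touches every training point; this is precisely the part the quantum version attacks with superposition:

    import numpy as np

    def svm_classify(x_new, X_train, alpha, b):
        """Classify a point with the dual-form rule of Equation 13-8."""
        inner_products = X_train @ x_new    # x_i . x for every training point i
        return np.sign(alpha @ inner_products - b)

    # Hypothetical trained parameters for a tiny three-point training set.
    X_train = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0]])
    alpha = np.array([0.5, 0.5, -1.0])      # signed alphas (they sum to zero)
    b = 0.0
    print(svm_classify(np.array([1.5, 1.5]), X_train, alpha, b))   # prints 1.0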

We won’t delve any further into the detailed derivation or usage of these equations; instead we’ll see how the calculations they require can be performed much more efficiently if we have access to a QPU.

SVM generalizations

First, though, it’s worth noting that the kind of SVMs we have described so far are restricted to certain classification problems. To begin with, our discussion has assumed that the data to be modeled is linearly separable; i.e., that there definitely exists a hyperplane that could completely and unambiguously separate data from the two classes. Often, of course, this might not be the case, and while linearly separating the data could still provide a good fit, data from different classes may somewhat overlap in the feature space, as exemplified in Figure 13-7.

Figure 13-7. Data that cannot be fit by a linear SVM model—the training data from different classes overlap in feature space, and furthermore the correct decision boundary is clearly nonlinear.

SVMs deal with this possibility through the introduction of so-called soft margins in the training process. Although we won’t expand on them here, it’s worth noting that the QPU speedups we outline shortly persist for soft-margin SVMs.

Even soft-margin SVMs are still restricted to a linear separation of the data, though. In some cases a simple hyperplane may not do a good job of separating classes. When this is the case, we can try embedding our feature space in an even higher-dimensional space. If we perform this embedding carefully, we may be able to find a hyperplane in the higher-dimensional space that does effectively segregate the data from different classes. In other words, we use the projection of an n + m-dimensional hyperplane into an n-dimensional feature space, since a linear n + m-dimensional hyperplane can have a nonlinear n-dimensional projection. Such a nonlinear SVM is shown in Figure 13-7. Though this may sound like a complex modification to our SVM training process, it turns out we can very easily implement this kind of nonlinear generalization. By replacing the kernel matrix of inner products appearing in Equation 13-7 with a carefully chosen alternative kernel matrix, we’re able to encapsulate higher-dimensional, nonlinear margins. This extension is often referred to as training an SVM with a nonlinear kernel.

SVM with a QPU

There exist QPU algorithms improving the performance of both the training of SVM models and classification with a trained model. Here we will only outline using a QPU to efficiently train a QSVM model. The best classical algorithms for training conventional SVMs have runtimes of O(poly(m,n)), where m is the number of training data points and n is the number of features describing each point. In contrast, a QSVM can be trained with runtime O(log(mn)).

Using a QPU to train a quantum SVM

Obtaining a quantum advantage for training SVMs is contingent on us being content with a fully quantum model. What we mean by this is that a trained QSVM is a set of QPU registers containing amplitude encodings of the hyperplane parameters w and b. Although these parameters are locked in superpositions, we’ll see that they can still be used to classify new points in feature space, so long as the data for these new points can be accessed in superposition via a QRAM.

Training an SVM on a set of m training data points, $\{(\mathbf{x}_i,y_i)\}_{i=1}^{m}$, requires us to find optimal values of the α solving the quadratic programming problem stated in Equation 13-7. This seems like a tricky problem to speed up with a QPU. However, there is a reformulation of the SVM training problem known as a Least Squares Support Vector Machine (LS-SVM) that casts the building of an SVM classifier into a least-squares optimization problem.17 As a consequence, to find the α required by the SVM dual formulation, we are now presented with a linear system of equations, the solution of which is given by the matrix equation shown in Equation 13-9 (where y is a vector containing the training data classes).

Equation 13-9. LS-SVM equation
$\begin{bmatrix} b \\ \vec{\alpha} \end{bmatrix} = F^{-1}\begin{bmatrix} 0 \\ \vec{y} \end{bmatrix}$

This looks more like something amenable to a QPU speedup, via the HHL algorithm from “Solving Systems of Linear Equations”. The matrix F is constructed from the kernel matrix K of training data inner products as outlined in Equation 13-10.

Equation 13-10. How the F matrix is built from the kernel matrix
$F = \begin{bmatrix} 0 & \vec{1}^{\,T} \\ \vec{1} & K + \frac{1}{\gamma}\mathbb{1} \end{bmatrix}$

Here, γ is a real-number hyperparameter of the model (that in practice would be determined by cross-validation), $\mathbb{1}$ is the m × m identity matrix, and $\vec{1}$ denotes a vector of m ones. Once we’ve used F to obtain values of α and b, we can classify new data points using the LS-SVM just as we did with the standard SVM model, by using the criterion in Equation 13-8.
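Classically, this training step is a single linear solve. The sketch below (a minimal NumPy illustration; the four training points and the value of γ are invented for the example) builds F from the kernel matrix exactly as in Equation 13-10 and solves Equation 13-9 directly. This explicit solve is the computation that HHL replaces with an amplitude-encoded solution:

    import numpy as np

    # Hypothetical training set: m = 4 points with known classes +/-1.
    X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    gamma = 10.0                      # hyperparameter, normally set by cross-validation

    m = X.shape[0]
    K = X @ X.T                       # kernel matrix of training-data inner products
    ones = np.ones(m)

    # Assemble F as in Equation 13-10.
    F = np.zeros((m + 1, m + 1))
    F[0, 1:] = ones
    F[1:, 0] = ones
    F[1:, 1:] = K + np.eye(m) / gamma

    # Solve Equation 13-9 for [b, alpha]; new points are then classified via Equation 13-8.
    solution = np.linalg.solve(F, np.concatenate(([0.0], y)))
    b, alpha = solution[0], solution[1:]
    print(b, alpha)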

Using the HHL algorithm to efficiently solve Equation 13-9 and return registers amplitude-encoding ∣b〉 and ∣α〉 requires us to address a few key concerns:

Concern 1: Is F suitable for HHL?

In other words, is the matrix F of the type we can invert with HHL (i.e., Hermitian, sufficiently sparse, etc.)?

Concern 2: How can we act F–1 on [0,y]?

If we can calculate F–1 using HHL, then how do we ensure it correctly acts on the vector [0,y] as Equation 13-9 requires?

Concern 3: How do we classify data?

Even if we address concerns 1 and 2, we still need a way to make use of the obtained quantum representations of b and α to classify data presented to us in the future.

Let’s address each of these concerns in turn to help convince ourselves that the training of a QSVM can indeed leverage the HHL algorithm.

Concern 1: Is F suitable for HHL?

It’s not too tricky to see that the matrix F is Hermitian, and so we can potentially represent it as a QPU operation using the quantum simulation techniques from Chapter 9. As we previously noted, being Hermitian, although necessary, is not sufficient for a matrix to be efficiently used in quantum simulation. However, it’s also possible to see that a matrix of the form F can be decomposed into a sum of matrices, each of which satisfies all the requirements of quantum simulation techniques. So it turns out that we can use quantum simulation to find a QPU operation representing F. Note, however, that the nontrivial elements of F consist of inner products between the training data points. To efficiently use quantum simulation for representing F as a QPU operation we will need to be able to access the training data points using QRAM.

Concern 2: How can we act F–1 on [0,y]?

If we’re using HHL to find F–1 then we can take care of this concern during the phase estimation stage of the algorithm. We input an amplitude encoding of the vector of training data classes, ∣y'〉, into the phase estimation’s eigenstate register. Here we use ∣y'〉 to denote a QPU register state amplitude-encoding the vector [0,y]. This, again, assumes that we have QRAM access to the training data classes. By the same logic we described when originally explaining HHL in “Solving Systems of Linear Equations”, ∣y'〉 can be thought of as a superposition of the eigenstates of F. As a consequence, when we follow through the HHL algorithm we will find that the final output register contains ∣F–1y'〉, which is precisely the solution we desire: ∣b,α〉. So we don’t have to do anything fancy—HHL outputs F–1 acted on the required vector, just as it did when we originally used it for solving systems of linear equations more generally.

Concern 3: How do we classify data?

Suppose we are given a new data point x, which we want to classify with our trained QSVM. Recall that a “trained QSVM” is really just access to the state ∣b,α〉. We can perform this classification efficiently so long as we have QRAM access to the new data point x. Classifying the new point requires computing the sign given by Equation 13-8. This, in turn, involves determining the inner products of x with all the training data points xi, weighted by the LS-SVM dual hyperplane parameters αi. We can calculate the requisite inner products in superposition and assess Equation 13-8 as follows. First we use our trained LS-SVM state ∣b,α〉 in the address register of a query to the QRAM holding the training data. This gives us a superposition of the training data with amplitudes containing the αi. In another register we perform a query to the QRAM containing the new data point x.

Having both of these states, we can perform a special swap-test subroutine. Although we won’t go into full detail, this subroutine combines the states resulting from our two QRAM queries into an entangled superposition and then performs a carefully constructed swap test (see Chapter 3). Recall that the swap test does more than tell us whether the states of two QPU registers are equal or not: the exact success probability p of the READ it involves depends on the fidelity of the two states—a quantitative measure of precisely how close they are. The swap test we use here is carefully constructed so that the probability p of READing a 1 reflects the sign we need in Equation 13-8. Specifically, p < 1/2 if the sign is +1 and p ≥ 1/2 if the sign is –1. By repeating the swap test and counting 0 and 1 outcomes, we can estimate the value of the probability p to a desired accuracy, and classify the data point x accordingly.
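That final READ-out step is ordinary statistics. As a toy illustration (our own sketch, in which the swap-test outcomes are simulated as biased coin flips rather than produced by a real QPU), estimating p from repeated shots and thresholding it at 1/2 recovers the class label:

    import numpy as np

    rng = np.random.default_rng(42)

    def classify_from_swap_tests(p_true, shots=1000):
        """Estimate the swap-test success probability from repeated READs and
        map it to a class: p < 1/2 -> +1, p >= 1/2 -> -1."""
        outcomes = rng.random(shots) < p_true   # simulated single-shot 0/1 READ results
        p_estimate = outcomes.mean()
        return +1 if p_estimate < 0.5 else -1

    print(classify_from_swap_tests(0.41))   # almost certainly prints 1
    print(classify_from_swap_tests(0.63))   # almost certainly prints -1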

Thus, we are able to train and use a quantum LS-SVM model according to the schematic shown in Figure 13-8.

Figure 13-8 gives only a very general overview of how LS-SVM training proceeds, without delving into mathematical details. Many details of the swap-test subroutine in particular are not shown, although its key input states are highlighted. This gives some idea of the key roles played by the QPU primitives we’ve introduced throughout the book.

Our QSVM summary also provides an important take-home message. A key step was recasting the SVM problem in a format that is amenable to techniques a QPU is well suited for (in particular, matrix inversion). This exemplifies a central ethos of this book—through awareness of what a QPU can do well, domain expertise may allow the discovery of new QPU applications simply by casting existing problems in QPU-compatible forms.

Figure 13-8. Schematic showing the stages in training and classifying with a QSVM

Other Machine Learning Applications

Quantum machine learning is still an extremely dynamic area of research. In this chapter we presented three canonical examples, but new developments in QML are constantly being made. At the same time, conventional approaches to machine learning are being inspired by QPU algorithms—in fact, while we were writing this book, all three of the QML applications presented in this chapter inspired the development of conventional algorithms with similar runtime improvements.18 This should not shake your confidence in QML’s potential; these results were unlikely to have been discovered without the inspiration of their QPU counterparts, showing that QML applications have far-reaching and unexpected consequences.19 There are also many other QML applications that we simply didn’t have space to properly mention. These include efficient QPU applications for linear regression,20 unsupervised learning,21 Boltzmann machines,22 semidefinite programming,23 and quantum recommender systems24 (quantum recommender systems have also inspired improvements in conventional algorithms25).

1 Harrow et al., 2009.

2 The expression for condition number in terms of eigenvalues only holds if A is a normal matrix. All matrices used in HHL are necessarily normal because of its reliance on quantum simulation and Hermitian matrices.

3 Kerenidis and Prakash, 2017.

4 Wiebe et al., 2012.

5 Several improvements and extensions have been made to the original HHL algorithm, resulting in runtimes with different trade-offs between the various parameters. Here we focus on the conceptually simpler original algorithm.

6 A more recent result by Childs et al., 2015, actually manages to improve the dependence of HHL on ϵ to poly(log(1/ϵ)).

7 However, as discussed in Chapter 9, we can always extend an n × n non-Hermitian matrix to a 2n × 2n Hermitian one.

8 By this we mean finding the elements A must have in order to correctly relate vectors expressed in its eigenbasis.

9 There’s something poetic and possibly profound about the fact that attempting to perform conventional notions of arithmetic on a QPU can cause such crippling overheads.

10 We actually need to include a constant value in this calculation, and calculate arccos(C/λi) rather than just arccos(1/λi). Since we’re not implementing the HHL algorithm fully here we omit this for simplicity.

11 Density operators provide a more general description of a QPU register’s state than using a complex vector (or circle notation) because they allow for cases where the register might not only be in superposition, but subject to some statistical uncertainty as to precisely what that superposition is. These quantum states that contain statistical uncertainty are usually called mixed states.

12 For more technical detail on how this operation is built and how it enables such a quantum simulation, see Lloyd et al., 2013.

13 Note that this mini-SWAP approach to representing a matrix as a QPU operation won’t always be better than quantum simulation approaches. It only works well here because covariance matrices happen to be simple to encode in a QPU register’s density operator thanks to them being in Gram form.

14 These parameters describe the hyperplane through the equation w·x-b=0, where x is a vector of feature values.

15 Optimization problems often have such a dual form that can be easier to deal with. It’s worth noting that sometimes these dual forms can contain subtle differences from the original (primal) optimization problem.

16 In fact, b can be determined from a single piece of training data lying on one of the two parallel hyperplanes defining the margin. The points lying on these defining hyperplanes are known as the support vectors.

17 There are some subtle differences between SVM and LS-SVM models, although the two have been shown to be equivalent under certain reasonable conditions. See, for example, Ye and Xiong, 2007.

18 HHL: Chia et al., 2018; QPCA and QSVM: Tang, 2018.

19 In fact, it’s not obvious how practical these quantum-inspired algorithms will be in actual usage. See for example Arrazola et al., 2019.

20 Chakraborty et al., 2018.

21 Lloyd et al., 2013.

22 Wiebe et al., 2014.

23 Brandão et al., 2017.

24 Kerenidis and Prakash, 2016.

25 Tang, 2018.
