# (PART) Quantum Machine Learning {-}
# Quantum perceptron
<div style="text-align: right"> Contributors: Armando Bellante, Samantha Buck </div>
<br>
```{r, echo=FALSE, fig.width=10, fig.align = 'center', fig.cap="This section is a work in progress"}
knitr::include_graphics("images/wip.png")
```
The following chapter investigates the quantum version of the classical perceptron algorithm and is based on the work of [@kapoor2016quantum].
The chapter is organized as follows: we first introduce the fundamentals of the classical version of the perceptron, then we present the online quantum perceptron algorithm and the version space quantum perceptron algorithm. <!--, and finally the state-of-the-art results. -->
## Classical perceptron
The perceptron is a machine learning algorithm that can be thought of as the most basic building block of more complex artificial neural networks (ANNs), or alternatively as a very simple neural network in its own right.
The perceptron is a linear classifier used for binary predictions: its goal is to classify incoming data into one of two given categories. As in any supervised learning task, the classifier is trained on a dataset of data point-label pairs, and its goal is to generalize to previously unseen data points.
Unlike more complex learners, the textbook perceptron can only deal with *linearly separable* datasets. A dataset is said to be linearly separable if there exists at least one hyperplane that can successfully separate the elements of the dataset into distinct groups.
<!-- ![Figure 1: Linearly Separable Data versus Nonlinearly Separable Data.](images/LinearvsNonlinear.png) -->
Two sets,
[$\mathrm{A}$]{style="color: orange"} and
[$\mathrm{B}$]{style="color: blue"}, in an $n$-dimensional space are linearly separable only if there exists an $(n-1)$-dimensional hyperplane that separates the two sets. In a 2-dimensional space the hyperplane is a line, in a 3-dimensional space it is a plane, and so on. (Note that in 2D, for instance, it cannot be a curve! The linearity of this classifier means that the sets can only be separated by hyperplanes. One could overcome this limitation by introducing kernels in the perceptron formulation, or by projecting the data, via "feature engineering", into a space where they become linearly separable.)
Let us introduce the concept of linear separability using proper mathematical formalism.
```{definition, name="Linear Separability"}
Two sets
[$\mathrm{A}$]{style="color: orange"} and
[$\mathrm{B}$]{style="color: blue"} of points in an $n$-dimensional space are called absolutely linearly separable if $n+1$ real numbers $w_{1}, \ldots, w_{n}, b$ exist such that every point $(x_{1}, x_{2}, \ldots, x_{n})$ in [$\mathrm{A}$]{style="color: orange"} satisfies:
\begin{equation}
\sum_{i=1}^{n} w_{i} x_{i}>b
\end{equation}
and every point $(x_{1}, x_{2}, \ldots, x_{n})$ in [$\mathrm{B}$]{style="color: blue"} satisfies:
\begin{equation}
\sum_{i=1}^{n} w_{i} x_{i}<b
\end{equation}
```
The general equation of the separation hyperplane would then be $\sum_{i=1}^n w_ix_i + b = 0$.
The numbers $w_i$ and $b$ are often referred to as the weights and the bias, respectively.
Without the bias term, the hyperplane defined by $w$ would always have to pass through the origin.
It is common practice to absorb $b$ into the vector $w$ by adding one additional constant dimension to the data points and to the weight vector.
In doing so, ${x}$ becomes
$[\begin{array}{c}{x}\\ 1\end{array}]$, and
${w}$ becomes
$[\begin{array}{l}{w} \\ b\end{array}]$.\
We can then verify that their inner product will yield,
\begin{equation}
\left[\begin{array}{c}
{x} \\
1
\end{array}\right]^{\top}\left[\begin{array}{l}
{w} \\
b
\end{array}\right]={w}^{\top} {x}+b
\end{equation}
From now on, we will write $x$ and $w$ assuming the bias is included in this way.
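A quick numeric sanity check of this bias-absorption trick (the numbers below are arbitrary and purely illustrative):
```{r}
# Verify that [x; 1]^T [w; b] equals w^T x + b on made-up numbers.
x <- c(0.5, -1)                      # a data point
w <- c(2, 3)                         # weights
b <- 1                               # bias
all.equal(sum(c(x, 1) * c(w, b)),    # augmented inner product
          sum(w * x) + b)            # w^T x + b
```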
We will soon focus on how the perceptron learns the weights $w_{i}$ and $b$.
However, assume for a while that the perceptron knows these numbers (i.e., it has been trained) and only needs to use them to classify the incoming data points in one of the two classes.
<!-- ![Figure 2: Schematic of the classical perceptron.](images/perceptron.png) -->
Given a data point $x$, its coordinates $x_{i}$ are weighted in a sum $\sum w_{i} x_{i}$ and this result is passed to one "neuron" that has an activation function $g$.
The output of this activation function $y=g(\sum w_{i} x_{i})$ is a binary value that represents one of the two possible classes.
In general, the activation function can take many forms (e.g., the sigmoid function, the hyperbolic tangent, the step function) and return different pairs of values (e.g., $\{0,1\}$, $\{-1,1\}$).
For the remainder of this chapter, we will assume the sign function is used.
In other words, given a data point $x$ and the weights $w$, the perceptron classifies the point in the following way:
\begin{equation}
y\left(x\right)=\operatorname{sign}\left({w}^{\top} {x}\right)
\end{equation}
where, as a matter of convention, $y(x)=-1$ means that the sample belongs to one class and $y(x)=+1$ means it belongs to the other.
The hyperplane identified by the weights $w$ is also called the decision boundary: with the sign activation function, the position of a data point with respect to this hyperplane determines its output class.
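As a minimal illustration, the following R sketch implements this decision rule, assuming the bias has been absorbed into $w$ as described above (the numbers are made up):
```{r}
# Sketch of the perceptron decision rule y(x) = sign(w^T x); the weight vector
# already contains the bias as its last entry.
predict_perceptron <- function(w, x) {
  sign(sum(w * c(x, 1)))                 # append the constant 1, then take sign(w^T x)
}

w_example <- c(2, -1, 0.5)               # illustrative weights [w1, w2, b]
predict_perceptron(w_example, c(1, 3))   # returns -1: the point falls in the second class
```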
<!-- ![Figure 3: Schematic of a decision boundary in 2D.](images/boundary.png) -->
### Training the perceptron
In this book, we focus only on the *online training* algorithm, as it is the most convenient for our quantum algorithms.
We invite the interested reader to learn more about *batch training* elsewhere.
The perceptron, like we *should* do, learns from its own errors, but it also needs somewhere to start from.
Its training process happens in the following way.
We initialize the perceptron with some random weights, feed it many training data point-label pairs, and let it output its guess for each point.
If the guess is correct, nothing happens. If the guess is wrong, the weights are updated.
Using the sign function, and considering a couple data point-label ($x^{(i)}$, $y^{(i)}$) from the training set, the perceptron classifies the point correctly if
$y^{(i)}\left({w}^{\top} {x}^{(i)}\right)>0$ (this is because the class $y^{(i)}$ has been encoded using values in $\{-1,+1\}$).
If a classification error is detected, the weights get modified by an amount that is proportional to the input $x^{(i)}$.
This process continues for a certain number of iterations, known as "epochs."
In the online learning regime, which is the one we investigate in this chapter, the update to the weight and bias vectors occurs for each and every misclassified data point iteratively throughout every epoch.
During an epoch, each training data point is considered once.
The training ends when there are no misclassified points in one epoch.
We remind the reader that the training goal is to determine the weights that produce a linear decision boundary that correctly classifies the training points.
The following pseudocode highlights the structure of the perceptron algorithm and identifies at which step in the processing schedule the error calculation and subsequent parametric updates occur.
<img src="images/perceptronalgo.png" alt="drawing" width="600">
The update is computed by adding the amount $\Delta w=\eta \cdot error \cdot x^{(i)}$ to the weights $w$, where $\eta$ is a learning rate and $error = y^{(i)} - \hat{y}^{(i)}$, with $\hat{y}^{(i)} = \operatorname{sign}({w}^{\top}x^{(i)})$ the perceptron's prediction.
Notice that if $error$ and $x^{(i)}_j$ have the same sign, the increment of $w_j$ is positive (that weight increases), and otherwise it decreases.
To sum up, the perceptron iteratively updates the linear boundary function (i.e., the hyperplane equation) that acts as the classifying condition between the two classes of data, until no errors are detected and the algorithm converges. The training happens by minimizing the error on the provided training dataset.
Ultimately, the identification of the weights allows us to distinguish data points belonging to one side of the boundary or the other.
Of course, if the data were not linearly separable, the training algorithm would run forever.
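As an illustration, here is a minimal R sketch of the online training loop on a toy dataset (the data and learning rate are our own choices; the update is written in the equivalent $w \leftarrow w + \eta\, y^{(i)} x^{(i)}$ form, which for the sign activation differs from $\eta \cdot error \cdot x^{(i)}$ only by a constant factor absorbed into $\eta$):
```{r}
# Minimal online perceptron training sketch (toy data, illustrative only).
train_perceptron <- function(X, y, eta = 1, max_epochs = 100) {
  X <- cbind(X, 1)                       # absorb the bias: append a constant feature
  w <- rep(0, ncol(X))                   # starting weights (zeros, for simplicity)
  for (epoch in seq_len(max_epochs)) {
    mistakes <- 0
    for (i in seq_len(nrow(X))) {
      if (y[i] * sum(w * X[i, ]) <= 0) { # misclassified (or on the boundary)
        w <- w + eta * y[i] * X[i, ]     # online update
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break             # one clean epoch: training has converged
  }
  w
}

set.seed(1)
X <- rbind(matrix(rnorm(20, mean =  2), ncol = 2),   # class +1
           matrix(rnorm(20, mean = -2), ncol = 2))   # class -1
y <- c(rep(1, 10), rep(-1, 10))
train_perceptron(X, y)
```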
The online training requires the existence of a margin between the two classes.
```{definition, name="Margin of a dataset"}
Let $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(n)}, y^{(n)})\}$
be a dataset of $n$ data point-label pairs, with $x^{(i)} \in \R^m$ and $y^{(i)} \in \{-1,+1\}$ for all $i \in [n]$.
We define the margin of this dataset as:
\begin{equation}
\gamma = \max_{q \in \R^{m}} \min_{i \in [n]} \frac{y^{(i)}q^{\top}x^{(i)}}{\|q\|\|x^{(i)}\|}.
\end{equation}
```
It naturally follows that a dataset is linearly separable if and only if its margin is strictly greater than $0$.
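For a fixed candidate direction $q$, the inner minimum of this definition is easy to compute. The following R sketch does exactly that (the tiny dataset is made up, and the outer maximization over $q$ is omitted):
```{r}
# Margin attained by one candidate direction q on a toy dataset
# (only the inner minimum of the definition; the max over q is not performed).
margin_of <- function(q, X, y) {
  min(y * (X %*% q) / (sqrt(sum(q^2)) * sqrt(rowSums(X^2))))
}

X <- rbind(c(1, 2), c(2, 1), c(-1, -2), c(-2, -1))
y <- c(1, 1, -1, -1)
margin_of(c(1, 1), X, y)   # positive: this q separates the two classes
```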
<!--
**Note on Scaling**
An extraordinary feature of the perceptron model is that there exists upper bounds for the number of updates that need to be made during this training procedure. Specifically, if the training data is composed of unit vectors, $\phi_{i} \in \mathbb{R}^{D}$, that are separated by a margin of $\gamma$, then there exists a perceptron training algorithm that make at most $O\left(\frac{1}{r^{2}}\right)$ mistakes, *independent* of the dimension of the training vectors. In the worst case, the algorithm will need to look at all points in the training set at least once, consequently the computation complexity will be $O(N)$. The main goal of realizing the quantum version of the perceptron algorithm is to investigate if quantum procedures (i.e., Grover's search and amplitude amplification) can provide improvements both in terms of computational complexity (i.e., better than $O(N)$) and statistical efficiency (i.e., improve upon $O\left(\frac{1}{\gamma^{2}}\right)$.) Given a training data set, the traditional representation is to depict the data as points in the feature space and use hyperplanes to depict the classifiers. However,as will be showcased in the following two sections, there also exists a dual representation where the hyperplanes are depicted as points and the data points are represented as hyperplanes that induce constraints on the feasible set of classifiers. Therefore, instead of only applying quantum constructs to the feature space, a version space interpretation of perceptrons will also be considered, which ends up leading to the improved scaling with $\gamma$ (version space is formally defined as the set of all possible hyperplanes that perfectly separate the data$:VS:=\left\{w \mid y_{i} \cdot w^{T}{\phi}_{i}>0 \text { for all } i\right\}$).
Now that we have developed a high level understanding of the function of the classical perceptron and to what forms of data it is applicable, we can move forward and investigate the algorithmic architectures used to construct the quantum versions of the perceptron.
-->
<!--However, before going further into this investigation, it will be worthwhile to briefly highlight the key differences in the main techniques used for accessing data within the supervised learning regime, as this context will become necessary for a more complete appreciation of the concepts in the proceeding sections:
**Online versus batch learning**
**Online:**
Online learning is a method of machine learning in which data becomes available to the algorithm in a sequential order and is used to update the dependent variable of the system.
*Key Features:*
- Learning based on each data point as it is observed.
- The updates to weights $w$ happen after each data point *within* the epoch.
**Batch:**
Batch learning is a technique which generates an output by learning on the entire training data set at once.
*Key Features:*
- Learning takes place over groups, or "batches", of data.
- Updates to the weights, $w$, only happen after each epoch.
<!--
**Grover's Search Algorithm**
Grovers algorithm @grover1996fast, also often referred to as the "quantum search algorithm", refers to a quantum algorithm for unstructured search that successfully determines the unique input to a black box function that produces a particular output value. Specifically, the algorithm consists of two oracle steps: a phase inversion oracle step, and an amplitude amplification oracle step. Grover's search algorithm is ubiquitous in the realm of established quantum algorithms, and is used as the overarching quantum resource in both of the formulations of the quantum perceptron algorithm that will be presented in the following sections. For a more robust investigation of Grover's search, the reader is encouraged to review section 4.2 \@ref(sec:4.2) of this book, \@ref(subsec:findmin), and also section II B in [@kapoor2016quantum], where specifics on how Grover's is leveraged in context of the quantum perceptron can be found.
-->
## Online quantum perceptron
**General Idea:** Uses Grover’s search to more efficiently search for misclassified data points in the data set. This allows the perceptron algorithm to more quickly converge on a correct classifying hyperplane.
<!--
If we're given a set of $N$ separable training examples $\left\{\phi_{1}, \ldots, \phi_{N}\right\} \in \mathbb{R}^{D}$ with corresponding labels $\left\{y_{1}, \ldots, y_{N}\right\}, y_{i} \in\{+1,-1\}$, the goal of perceptron is to recover a hyperplane $w$ that perfectly classifies the training set. To put this formally, we want $w$ such that $y_{i} \cdot w^{T} \phi_{i}>0$ for all $i$. The rule that we'll consider for online perceptron training is, upon misclassifying a vector $(\phi, y), w \leftarrow w+y \phi$.
<!--
Earlier in our exploration of the classical perceptron algorithm, the difference between two forms of accessing data in the context of machine learning algorithm were introduced - the online and the batch style learning techniques. The key difference between these two main umbrella categories (i.e. online and batch) lies in *where* the update to the weight vectors $w$, and bias vectors $b$, occurr within the algorithms' subroutines. In the online learning regime, the update to the weight and bias vectors occurs for each and every data point within the epoch, occurring iteratively throughout every epoch, while in the batch learning regime only *one* weight and bias vector update gets implemented per epoch, which occurs after all of the data points within the epoch have been iterated over.
The main purpose of highlighting the difference between online and batch learning again in this section of the investigation is to make clear to the reader that this particular implementation of the quantum perceptron algorithm falls under the \"online learning\" umbrella category, since in this formulation of the algorithm there are multiple weight and bias vector updates that are done per epoch (i.e. one for each data point), which is the key common feature shared by online learning style algorithms.
-->
Where the quantum version of the online perceptron differs from that of its classical analogue is in *how* the data points are accessed for use within each individual epoch and in the number of times the perceptron needs to be used to classify the data points.
In the classical version of the online perceptron, the training points are fed into the classification algorithm successively one by one, and a weight update is performed every time a point is misclassified.
Imagine we have $n$ data points and only the last points of the epoch get misclassified: this requires $O(n)$ evaluations of the classification function before the weights can be updated.
The quantum version of the online perceptron deviates from this "one at a time" way of accessing the data points within an epoch and lowers the number of times we need to call the classification function.
The idea is to access the data points in superposition in a quantum state and apply the classification function to this state: by linearity, the classification function acts on all the data points at once while we search for a misclassified one.
Assume we have a dataset of vectors $\{x^{(1)}, \ldots, x^{(n)}\}$ and labels $\{y^{(1)}, \ldots, y^{(n)}\}$.
Without loss of generality, we assume that the training set consists of *unit* vectors and one-bit labels.
Furthermore, we assume that the vectors $\{x^{(1)}, \ldots, x^{(n)}\}$ can be classically represented using $B$ bits.
Then, with $|x^{(1)}\rangle, \ldots, |x^{(n)}\rangle$ we denote the $(B+1)$-bit representations of the data vectors, followed by the label bit.
Note that each state $|x^{(j)}\rangle$ corresponds to a computational basis state of the $2^{B+1}$-dimensional Hilbert space of a $(B+1)$-qubit register.
Next, in order to construct our desired online quantum perceptron, we will need to have a mechanism with which to access the training data.
We assume that the data is accessible via the following oracle $U$.
\begin{align}
U|j\rangle|{0}\rangle=|j\rangle|x^{(j)}\rangle\\
U^{\dagger}|j\rangle|{x^{(j)}}\rangle=|j\rangle|0\rangle
\end{align}
As we discussed in previous chapters, because $U$ and $U^{\dagger}$ are linear operators, we have that $U \frac{1}{\sqrt{n}}\sum_{j=1}^{n} |j\rangle|0\rangle=\frac{1}{\sqrt{n}}\sum_{j=1}^n |j\rangle|x^{(j)}\rangle$. A quantum computer can therefore access each training vector simultaneously using a single operation, while only requiring enough memory to store one of the $|x^{(j)}\rangle$.
Next, we need a mechanism to test whether a given weight configuration $w$ correctly classifies a data point.
First of all, we need to build a quantum circuit that implements the Boolean function $f: \left(w, x^{(j)}, y^{(j)}\right) \mapsto \{0,1\}$, where $f\left(w, x^{(j)}, y^{(j)}\right)$ is $1$ if and only if the perceptron with weights $w$ misclassifies the training sample $(x^{(j)}, y^{(j)})$. This circuit needs to adapt to the current weight configuration $w$, which can either be "hardcoded" in the circuit (in which case the circuit needs to be updated every time the vector changes) or be given as an input through extra qubits.
Given access to such a circuit, we can define an operator $\mathcal{F_w}$:
\begin{equation}
\mathcal{F_w} |x^{(j)}\rangle=(-1)^{f(w, x^{(j)}, y^{(j)})} |x^{(j)}\rangle.
\end{equation}
$\mathcal{F}_{w}$ is easily implemented on a quantum computer using a multiply controlled phase gate and a quantum implementation of the perceptron classification algorithm $f_{w}$.
<!--
We can use this operator on
\begin{equation}
F_{w}|j\rangle|{0}\rangle=(-1)^{f_{w}(x^{(j)}, y^{(j)})}|j\rangle |0\rangle
\end{equation}
We can apply build $F_{w}$ as
\begin{equation}
F_{w}=U^{\dagger}\left(\mathbb{I} \otimes \mathcal{F}_{w}\right) U,
\end{equation}
where is $\mathcal{F}_{w}$ be a unitary operation such that:
\begin{equation}
\mathcal{F}_{w} |x^{(j)}\rangle=(-1)^{f_{w}(x^{(j)}, y^{(j)})} |x^{(j)}\rangle.
\end{equation}
$\mathcal{F}_{w}$ is easily implemented on a quantum computer using a multiply controlled phase gate and a quantum implementation of the perceptron classification algorithm $f_{w}$.
-->
Now, we can use the unitary $\mathcal{F}_{w}$ as an oracle for Grover's algorithm (see Section \@ref(section:grover)).
This allows for a quadratic reduction in the number of times that the training vectors need to be accessed by the classification function.
The following pseudocode sums up the online quantum perceptron algorithm.
<img src="images/online_quantum_perceptron.png" alt="drawing" width="600">
This algorithm yields the following theorem, which formalizes the quadratic speedup.
(ref:kapoor2016quantum) [@kapoor2016quantum]
```{theorem, thm-perceptron, name="Online quantum perceptron (ref:kapoor2016quantum)"}
Given a training set that consists of unit vectors $\{x^{(1)}, \ldots, x^{(n)}\}$ and labels $\{y^{(1)}, \ldots, y^{(n)}\}$, with margin $\gamma$, the number of applications of $f$ needed to learn the weights $w$ such that $P\left(\exists j: f_{w}\left(x^{(j)}, y^{(j)}\right)=1\right) \leq \epsilon$ using a quantum computer is $n_{\text {quant}}$ where
\begin{equation}
\Omega(\sqrt{n}) \ni n_{\text {quant}} \in O\left(\frac{\sqrt{n}}{\gamma^{2}} \log \left(\frac{1}{\epsilon \gamma^{2}}\right)\right)
\end{equation}
whereas the number of queries to \(f\) needed in the classical setting, \(n_{\text {class }}\), where the training vectors are found by sampling uniformly from the training data is bounded by
\begin{equation}
\Omega(n) \ni n_{\text {class}} \in O\left(\frac{n}{\gamma^{2}} \log \left(\frac{1}{\epsilon \gamma^{2}}\right)\right) .
\end{equation}
```
Note that if the training data is supplied as a stream (as in the standard online model), then the upper bound for the classical model changes to $n_{\text {class}} \in O\left(n / \gamma^{2}\right)$ because all $n$ training vectors can be deterministically checked to see if they are correctly classified by the perceptron. A quantum advantage is therefore obtained if $n \gg \log ^{2}\left(1 / \epsilon \gamma^{2}\right)$ (see [@kapoor2016quantum] for a more detailed explanation).
Theorem \@ref(thm:thm-perceptron) can easily be proved using Grover's theorem (with exponential search, since we do not know the exact number of answers to the search problem, i.e., multiple data points can be misclassified) and the two following lemmas.
```{lemma}
Given only the ability to sample uniformly from the training vectors, the number of queries to $f_{w}$ needed to find a training vector that the current perceptron model fails to classify correctly, or conclude that no such example exists, with probability $1-\epsilon \gamma^{2}$ is at most $O\left(n \log \left(1 / \epsilon \gamma^{2}\right)\right)$.
```
<br>
```{lemma}
Assuming that the training vectors $\{x^{(1)}, \ldots, x^{(n)}\}$ are unit vectors and that they are drawn from two classes separated by a margin of $\gamma$ in feature space, the online quantum perceptron algorithm will either update the perceptron weights, or conclude that the current model provides a separating hyperplane between the two classes, using a number of queries to \(f_{w}\) that is bounded above by $O\left(\sqrt{n} \log \left(1 / \epsilon \gamma^{2}\right)\right)$ with probability of failure at most $\epsilon \gamma^{2}$.
```
<br>
The interested reader is encouraged to consult the appendix of [@kapoor2016quantum] for the proofs of these lemmas.
Furthermore, Novikoff's theorem states that $1/\gamma^2$ is an upper bound on the number of times the procedures described in these lemmas need to be applied, which is the reason for the first line of the algorithm and for the $1/\gamma^2$ term in the runtime.
## Version space quantum perceptron
**General idea:** Discretize the space in which weight vectors live and use Grover's search algorithm to find one vector that correctly classifies all the data points. Instead of searching among the data points, we are searching among weight vectors.
The key idea of the version space training algorithm is to search, among the possible weight vectors, for one that correctly separates the data points. Instead of iterating over the data points to update a weight vector, we iterate over a discrete set of weight vectors in search of a good one. The algorithm is called version space quantum perceptron because the space in which the weight vectors live is referred to as the version space.
```{theorem, version-space}
Given a training set with margin $\gamma$, a weight vector $w$ sampled from a spherical Gaussian distribution $\mathcal{N}(0,\mathbb{1})$ perfectly separates the data with probability $\Theta(\gamma)$.
```
<br>
The main idea is to sample $k$ hyperplanes $w^{(1)}, \ldots, w^{(k)}$ from a spherical Gaussian distribution $\mathcal{N}(0, \mathbb{1})$, with $k$ large enough to guarantee that at least one of them is a good weight vector.
Theorem \@ref(thm:version-space) tells us that the expected number of samples $k$ scales as $O(1/\gamma)$, and classically we would need to test all the $n$ data points for at least $O(1/\gamma)$ different vectors, with a cost $\Omega(n/\gamma)$.
By exploiting the power of Grover's search we can have a quadratic speedup on $k$ and achieve $O(n/\sqrt{\gamma})$.
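Classically, the sampling-and-testing step looks as in the following R sketch (the toy dataset and the value of $k$ are our own illustrative choices); the quantum algorithm replaces the linear scan over the $k$ candidates with a Grover search, which is where the improvement in $\gamma$ comes from:
```{r}
# Classical sketch of the version-space idea: sample k Gaussian weight vectors
# and look for one that classifies every training point correctly.
separates_all <- function(w, X, y) all(y * (X %*% w) > 0)

set.seed(7)
X <- rbind(matrix(rnorm(20, mean =  2), ncol = 2),   # class +1
           matrix(rnorm(20, mean = -2), ncol = 2))   # class -1
y <- c(rep(1, 10), rep(-1, 10))

k <- 50                                              # number of sampled hyperplanes, roughly O(1/gamma)
W <- matrix(rnorm(2 * k), nrow = k)                  # rows are the candidate weight vectors w^(j)
good <- which(apply(W, 1, separates_all, X = X, y = y))
head(good)                                           # indices of perfectly separating candidates
```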
Like in the previous case, to use Grover, we need one unitary $U$ that creates the weight vectors among which we search and one oracle $\mathcal{F}$ that modifies the phase of the correct weight vectors.
The unitary $U$ must perform the following operation:
\begin{equation}
U|j\rangle |0\rangle=|j\rangle |w^{(j)}\rangle
\end{equation}
for $j \in \{1, \ldots, k\}$, giving access to the $k$ sampled weight vectors.
Clearly $U$ is a linear operator and can be used with indices in superposition, giving access to all the weight vectors simultaneously: $U\frac{1}{\sqrt{k}} \sum_{j=1}^{k} |j\rangle |0\rangle =\frac{1}{\sqrt{k}} \sum_{j=1}^{k} |j\rangle |w^{(j)}\rangle$.
On the other hand, the oracle $\mathcal{F}$ should perform the following operation:
\begin{equation}
\mathcal{F}|w^{(j)}\rangle=(-1)^{1+\left(f\left(w^{(j)},x^{(1)}, y^{(1)}\right) \vee \cdots \vee f\left(w^{(j)},x^{(n)}, y^{(n)}\right)\right)}|w^{(j)}\rangle.
\end{equation}
Here $f\left(w^{(j)}, x^{(i)}, y^{(i)}\right)$ works as in the online quantum perceptron algorithm: it returns $1$ if the data sample $(x^{(i)}, y^{(i)})$ is not correctly predicted by the weight vector $w^{(j)}$.
In other words, the exponent $1+\left(f\left(w^{(j)},x^{(1)}, y^{(1)}\right) \vee \cdots \vee f\left(w^{(j)},x^{(n)}, y^{(n)}\right)\right)$ is odd (so the phase is flipped) if the vector $w^{(j)}$ correctly classifies all the data points, and even if there exists a training sample that is not correctly classified.
Similarly to the previous case, this circuit can be built by "hardcoding" the knowledge of the data samples into each of the $n$ functions, or by creating one flexible function that takes the weight vector and the data points as input (using additional qubits for the data points). This operation can be implemented with $O(n)$ queries to $f$.
Once we have the oracles $U$ and $\mathcal{F}$, we can sum up the algorithm in pseudocode.
<img src="images/version_space_quantum_perceptron.png" alt="drawing" width="600">
The following theorem formalizes the runtime of the algorithm.
```{theorem, quantum-version-space, name="Version space quantum perceptron (ref:kapoor2016quantum)"}
Given a training set that consists of unit vectors $x^{(1)}, \ldots, x^{(n)}$ and labels $y^{(1)}, \ldots, y^{(n)}$, separated by a margin of $\gamma$, the number of queries to $f$ needed to infer a perceptron model $w$, with probability at least $1-\epsilon$, using a quantum computer is $O\left(\frac{n}{\sqrt{\gamma}} \log \left(\frac{1}{\epsilon}\right)\right)$.
```
The proof of this theorem is straightforward from the statement of Grover's algorithm and some statistics to bound the failure probability.
<!--
## Improved version space quantum perceptron
There is an algorithm with runtime...
## Generalization bounds
-->