Generating the training data

The first step is to generate some training data. For this, we will use NumPy's random number generator. As discussed in the previous section, we will fix the seed of the random number generator, so that re-running the script will always generate the same values:

In [3]: np.random.seed(42)

Alright, now let's get to it. What should our training data look like exactly?

In the previous example, each data point is a house on the town map. Every data point has two features (that is, the x and y coordinates of its location on the town map) and a class label (that is, a blue square if a Blues fan lives there and a red triangle if a Reds fan lives there).

The features of a single data point can, therefore, be represented by a two-element vector holding its x and y coordinates on the town map. Similarly, its label is 0 if it is a blue square or 1 if it is a red triangle. The process consists of generating the data, plotting it, and later predicting the label of a new data point. Let's see how we can carry out these steps:

  1. We can generate a single data point by picking a random location on the map and a random label (either 0 or 1). Let's say the town map spans a range of 0 ≤ x < 100 and 0 ≤ y < 100. Then, we can generate a random data point as follows:
In [4]: single_data_point = np.random.randint(0, 100, 2)
... single_data_point
Out[4]: array([51, 92])

As shown in the preceding output, this will pick two random integers between 0 and 99 (np.random.randint excludes the upper bound, 100). We will interpret the first integer as the data point's x coordinate on the map and the second integer as the point's y coordinate.

  2. Similarly, let's pick a label for the data point:
In [5]: single_label = np.random.randint(0, 2)
... single_label
Out[5]: 0

It turns out that this data point would have class 0, which we interpret as a blue square.

  3. Let's wrap this process in a function that takes as input the number of data points to generate (that is, num_samples) and the number of features every data point has (that is, num_features):
In [6]: def generate_data(num_samples, num_features=2):
... """Randomly generates a number of data points"""

Since, in our case, the number of features is 2, it is okay to use this number as a default argument value. This way, if we don't explicitly specify num_features when calling the function, a value of 2 is automatically assigned to it. I'm sure you already knew that.

The data matrix we want to create should have num_samples rows and num_features columns, and every element in the matrix should be an integer drawn randomly from the range [0, 100):

...     data_size = (num_samples, num_features)
...     train_data = np.random.randint(0, 100, size=data_size)

Similarly, we want to create a vector that contains a random integer label, either 0 or 1, for every sample:

...     labels_size = (num_samples, 1)
...     labels = np.random.randint(0, 2, size=labels_size)

Don't forget to have the function return the generated data:

...     return train_data.astype(np.float32), labels
OpenCV can be a bit finicky when it comes to data types, so make sure to always convert your data points into np.float32!
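For reference, here is the complete function assembled from the preceding fragments (the same code re-listed in one piece, with brief comments added):

import numpy as np

def generate_data(num_samples, num_features=2):
    """Randomly generates a number of data points."""
    # Feature matrix: num_samples rows and num_features columns,
    # with integer coordinates drawn from the range [0, 100).
    data_size = (num_samples, num_features)
    train_data = np.random.randint(0, 100, size=data_size)
    # Label vector: one random 0 or 1 per sample, shaped (num_samples, 1).
    labels_size = (num_samples, 1)
    labels = np.random.randint(0, 2, size=labels_size)
    # OpenCV expects the feature data as np.float32.
    return train_data.astype(np.float32), labels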
  4. Let's put the function to the test and generate an arbitrary number of data points, say eleven, whose coordinates are chosen randomly:
In [7]: train_data, labels = generate_data(11)
... train_data
Out[7]: array([[ 71., 60.],
               [ 20., 82.],
               [ 86., 74.],
               [ 74., 87.],
               [ 99., 23.],
               [ 2., 21.],
               [ 52., 1.],
               [ 87., 29.],
               [ 37., 1.],
               [ 63., 59.],
               [ 20., 32.]], dtype=float32)
  5. As we can see from the preceding output, the train_data variable is an 11 x 2 array, where each row corresponds to a single data point. We can also inspect the first data point with its corresponding label by indexing into the array:
In [8]: train_data[0], labels[0]
Out[8]: (array([ 71., 60.], dtype=float32), array([1]))
  6. This tells us that the first data point is a red triangle (because it has class 1) and lives at location (x, y) = (71, 60) on the town map. If we want, we can plot this data point on the town map using Matplotlib:
In [9]: plt.plot(train_data[0, 0], train_data[0, 1], color='r', marker='^', markersize=10)
... plt.xlabel('x coordinate')
... plt.ylabel('y coordinate')
Out[9]: [<matplotlib.lines.Line2D at 0x137814226a0>]

We get the following result:

  7. But what if we want to visualize the whole training set at once? Let's write a function for that. The function should take as input a list of all of the data points that are blue squares (all_blue) and a list of the data points that are red triangles (all_red):
In [10]: def plot_data(all_blue, all_red):
  8. Our function should then plot all of the blue data points as blue squares (using color 'b' and marker 's'), which we can achieve with the scatter function from Matplotlib. For this to work, we have to pass the blue data points as an N x 2 array, where N is the number of samples. Then, all_blue[:, 0] contains all of the x coordinates of the data points, and all_blue[:, 1] contains all of the y coordinates:
...     plt.figure(figsize=(10, 6))
...     plt.scatter(all_blue[:, 0], all_blue[:, 1], c='b',
...                 marker='s', s=180)
  9. Analogously, the same can be done for all of the red data points:
...     plt.scatter(all_red[:, 0], all_red[:, 1], c='r',
...                 marker='^', s=180)
  10. Finally, we annotate the plot with axis labels (the complete function is re-listed right after this step):
...     plt.xlabel('x coordinate (feature 1)')
...     plt.ylabel('y coordinate (feature 2)')
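Assembled in one piece, the plot_data function from the preceding steps reads as follows (the same code re-listed for reference, with brief comments added; plt refers to matplotlib.pyplot, as used earlier):

import matplotlib.pyplot as plt

def plot_data(all_blue, all_red):
    """Plots blue squares and red triangles on the town map."""
    plt.figure(figsize=(10, 6))
    # Blue data points: column 0 holds the x coordinates,
    # column 1 holds the y coordinates.
    plt.scatter(all_blue[:, 0], all_blue[:, 1], c='b',
                marker='s', s=180)
    # Red data points, drawn as triangles.
    plt.scatter(all_red[:, 0], all_red[:, 1], c='r',
                marker='^', s=180)
    plt.xlabel('x coordinate (feature 1)')
    plt.ylabel('y coordinate (feature 2)')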
  11. Let's try it on our dataset! First, we have to split all of the data points into red and blue sets. We can quickly select all of the elements of the labels array created earlier that are equal to 0, using the following command (where ravel flattens the labels array, as the quick shape check after this step illustrates):
In [11]: labels.ravel() == 0
Out[11]: array([False, False, False, True, False, True, True, True, True, True, False])
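To see why ravel is needed here, it helps to compare the shapes involved. The following quick check (an illustrative sketch that reuses the labels array generated above) shows that the (11, 1) label array is flattened into a plain vector of 11 values before the comparison with 0:

labels.shape           # (11, 1): a column vector, one label per row
labels.ravel().shape   # (11,): the same labels as a flat vector
labels.ravel() == 0    # flat Boolean array with one entry per sample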
  12. All of the blue data points are then the rows of the train_data array created earlier whose corresponding label is 0:
In [12]: blue = train_data[labels.ravel() == 0]
  13. The same can be done for all of the red data points:
In [13]: red = train_data[labels.ravel() == 1]
  14. Finally, let's plot all of the data points:
In [14]: plot_data(blue, red)

This will create the following diagram:

Now it's time to train the classifier.
