Creating a training set for YOLO object detection

To create the training set for YOLO, a grid of the same size as the output feature map predicted by the YOLO network is placed on each training input image. For each cell in the grid, we create a target vector Y of length B*5+C (that is, the same length as the per-cell vector of the output feature map described in the preceding section).

Let's take an example training image and see how we create the target vector for each grid cell on the image:

In the preceding illustration, we choose the cell with the shortest distance to the object's center (in the image, the back car's center is closest to the green cell). Looking at the training image, we notice that an object of a class of interest is present in only one cell, cell number 8. The remaining cells, 1-7 and 9, don’t contain any object of interest. The target vector for each cell will have 16 entries and look like the following:

The first entry for each anchor box is the confidence score for the presence of an object of a class of interest, which is 0 for both anchor boxes in cells that don’t contain any object. The rest of the values in those cells are don't cares. Cell number 8 contains an object, and the object's bounding box has a high IoU.

The final target volume for an input training image of size NxM, which has the same shape as the ConvNet output, will be 3x3x16 (in this toy example).
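As a minimal sketch of the toy example's dimensions, the NumPy snippet below allocates an empty target volume. The values S = 3 and B = 2 follow the text (a 3x3 grid and two anchor boxes); C = 6 is an assumption, chosen only because it makes B*5+C equal 16, and `empty_target` is a hypothetical helper name.

```python
import numpy as np

S = 3  # the grid is S x S cells (toy example from the text)
B = 2  # boxes per cell ("both anchor boxes" in the text)
C = 6  # number of classes; ASSUMED so that B*5 + C == 16


def empty_target(s=S, b=B, c=C):
    """Return an all-zero target volume of shape (s, s, b*5 + c).

    Zero confidence everywhere means "no object in any cell" until
    labels are written into the volume.
    """
    return np.zeros((s, s, b * 5 + c), dtype=np.float32)


target = empty_target()
print(target.shape)  # (3, 3, 16)
```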

The label information for each image in your dataset will just include the objects' center coordinates and their bounding boxes. It’s your responsibility, when implementing the code, to convert these labels so that they match the output vector of your network. This includes tasks such as the following:

Transform each center point from image space into grid space
Transform the bounding box dimensions from image space into grid space
Find which cell is closest to the object's center in image space
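The tasks above can be sketched as follows. This is one possible encoding under stated assumptions, not a definitive implementation: the function name `encode_object`, the per-box layout [confidence, x, y, w, h] followed by the class entries, the choice to store the center as an offset within its cell, the `box_slot` parameter, and the use of 1.0 as the confidence placeholder are all introduced here for illustration. On a regular grid, the cell that contains the object's center is also the cell whose center is nearest to it.

```python
import numpy as np


def encode_object(target, cx, cy, w, h, class_id, img_w, img_h,
                  b=2, box_slot=0):
    """Write one labelled object (pixel units) into the target volume in place.

    ASSUMED cell layout: b blocks of [confidence, x, y, w, h], then one
    entry per class. Returns the (row, col) of the responsible cell.
    """
    s = target.shape[0]

    # Task 1: center point from image space to grid space (units of cells).
    gx = cx * s / img_w
    gy = cy * s / img_h

    # Task 3: the cell containing the center (clamped to the last cell).
    col = min(int(gx), s - 1)
    row = min(int(gy), s - 1)

    # Task 2: box dimensions from image space to grid space.
    gw = w * s / img_w
    gh = h * s / img_h

    cell = target[row, col]  # a view, so writes modify `target`
    start = box_slot * 5
    cell[start:start + 5] = [1.0, gx - col, gy - row, gw, gh]
    cell[b * 5 + class_id] = 1.0  # one-hot class entry
    return row, col


target = np.zeros((3, 3, 16), dtype=np.float32)
row, col = encode_object(target, cx=150, cy=250, w=90, h=60,
                         class_id=2, img_w=300, img_h=300)
print(row, col)  # 2 1
```

With cells numbered row by row starting from 1 (an assumed numbering), row 2 and column 1 correspond to cell number 8. Choosing which box slot receives the object, for example by comparing it against anchor shapes, is left out of this sketch.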

If we multiply each cell's class probabilities by the confidence of each bounding box, we get class-specific detection scores that can then be filtered by another algorithm (Non-Maxima Suppression).
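The multiplication described above can be illustrated for a single cell. This sketch assumes the same hypothetical layout of B blocks of [confidence, x, y, w, h] followed by C class probabilities; `class_scores` is an illustrative name, and the filtering step (Non-Maxima Suppression) is not shown.

```python
import numpy as np


def class_scores(cell, b=2):
    """Return a (b, c) array of class-specific scores for one cell vector.

    ASSUMED layout: b blocks of [confidence, x, y, w, h], then c class
    probabilities shared by all boxes of the cell.
    """
    conf = cell[0:b * 5:5]       # confidence of each box, shape (b,)
    class_probs = cell[b * 5:]   # class probabilities, shape (c,)
    # Broadcasting multiplies every box confidence by every class probability.
    return conf[:, None] * class_probs[None, :]


cell = np.zeros(16)
cell[0] = 0.8    # confidence of box 0
cell[5] = 0.1    # confidence of box 1
cell[12] = 0.9   # probability of class 2
scores = class_scores(cell)
print(scores.shape)  # (2, 6)
```

Here box 0 scores about 0.72 for class 2 and box 1 about 0.09, so a score threshold followed by Non-Maxima Suppression would likely keep only the first box.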

Let's define confidence as something that reflects the presence or absence of an object of any class in the cell. If there is no object in the cell, the confidence should be zero; if there is an object, the confidence should be the IoU between the predicted box and the ground-truth box:

confidence = Pr(Object) * IoU(pred, truth)
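The confidence definition above can be sketched in plain Python. The corner format (x_min, y_min, x_max, y_max) and the function names `iou` and `confidence_target` are assumptions made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as
    (x_min, y_min, x_max, y_max). The corner format is an ASSUMPTION."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Width and height of the overlap are clamped at zero for disjoint boxes.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def confidence_target(has_object, pred_box, true_box):
    """Zero when the cell has no object, otherwise the IoU between the
    predicted box and the ground-truth box."""
    return iou(pred_box, true_box) if has_object else 0.0


print(confidence_target(True, (0, 0, 2, 2), (0, 0, 2, 2)))   # 1.0
print(confidence_target(False, (0, 0, 2, 2), (0, 0, 2, 2)))  # 0.0
```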

We also need to define a conditional class probability, that is, the probability of each class given the presence of an object, Pr(Class | Object). We want this because we don't want the loss function to penalize a wrong class prediction if there is no object in the cell. The network predicts only one set of class probabilities per cell, regardless of the number of boxes, B.
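To make the effect on the loss concrete, here is a minimal sketch of a class term that ignores cells without an object. It assumes a squared-error class term, the hypothetical layout of B blocks of [confidence, x, y, w, h] followed by the class entries, and a target confidence of 1 in the first box slot of every cell that contains an object; `class_loss` is an illustrative name, not the source's loss function.

```python
import numpy as np


def class_loss(target, pred, b=2):
    """Sum of squared class errors, counted only in cells with an object.

    ASSUMPTIONS: volumes of shape (s, s, b*5 + c); target[..., 0] is 1 in
    cells that contain an object and 0 elsewhere.
    """
    obj_mask = target[..., 0]                      # shape (s, s)
    diff = target[..., b * 5:] - pred[..., b * 5:]
    per_cell = np.sum(diff ** 2, axis=-1)          # shape (s, s)
    # Cells without an object are multiplied by zero and add nothing.
    return float(np.sum(obj_mask * per_cell))


target = np.zeros((3, 3, 16))
target[2, 1, 0] = 1.0    # object present in this cell
target[2, 1, 12] = 1.0   # its class is class 2
pred = np.full((3, 3, 16), 0.5)
print(class_loss(target, pred))  # 1.5
```

Only the cell with the object contributes (0.25 for the true class plus 5 x 0.25 for the others); the eight empty cells are not penalized even though their predicted class probabilities are wrong.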
