Computer vision beyond classification

As we have seen, there are many techniques that we can use to make our image classifier work better. These are techniques that you'll find used throughout this book, and not only for computer vision applications.

In this final section of the chapter, we will discuss some approaches that go beyond classifying images. These tasks often require more creative use of neural networks than what we've discussed throughout this chapter.

To get the most out of this section, you don't need to worry too much about the details of the techniques presented; instead, focus on how researchers used neural networks creatively. We're taking this approach because you will often find that the tasks you are looking to solve require similar creativity.

Facial recognition

Facial recognition has many applications for retail institutions. For instance, if you're in the front office, you might want to automatically recognize your customer at an ATM, or you might want to offer face-based security features such as those the iPhone offers. In the back office, meanwhile, you need to comply with Know Your Customer (KYC) regulations, which require you to identify the customer you are working with.

On the surface, facial recognition looks like a classification task. You give an image of a face to the machine, and it will predict which person it is. The trouble is that you might have millions of customers, but only one or two pictures per customer.

On top of that, you'll likely be continuously getting new customers. You can't change your model every time you get a new customer, and a simple classification approach will fail if it has to choose between millions of classes with only one example for each class.

The creative insight here is that instead of classifying the customer's face, you can see whether two images show the same face. You can see a visual representation of this idea in the following diagram:

Schematic of a Siamese network

To this end, you'll first have to run both images through a Siamese network. A Siamese network is a class of neural network architecture that contains two or more subnetworks with identical structure and shared weights. In Keras, you can achieve such a setup by defining the layers first and then using them in both branches. The two branches then feed into a single classification layer, which determines whether the two images show the same face.
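
To make this concrete, the following is a minimal sketch of a Siamese setup in Keras. The input shape, layer sizes, and 128-dimensional embedding are illustrative assumptions, not values from the text:

from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Lambda
from tensorflow.keras.models import Model, Sequential
import tensorflow.keras.backend as K

# Define the shared subnetwork once; reusing it for both inputs
# means both branches share the same weights.
shared = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(64, 64, 1)),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(128, activation='relu')  # the 128-dimensional face embedding
])

left_input = Input(shape=(64, 64, 1))
right_input = Input(shape=(64, 64, 1))

left_embedding = shared(left_input)    # same weights...
right_embedding = shared(right_input)  # ...applied to both images

# The absolute difference of the two embeddings feeds a single
# classification layer: 1 = same face, 0 = different faces.
distance = Lambda(lambda t: K.abs(t[0] - t[1]))([left_embedding, right_embedding])
output = Dense(1, activation='sigmoid')(distance)

model = Model([left_input, right_input], output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])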

To avoid running all of the customer images in our database through the entire Siamese network every time we want to recognize a face, it's common to save the output of the shared subnetwork for each image. This output is called the face embedding. When we want to recognize a customer, we compare the embedding of the customer's face image with the embeddings stored in our database. We can do this with a single classification layer.
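
As a rough illustration, the following sketch compares a query embedding against the stored embeddings using a simple Euclidean distance threshold rather than a trained classification layer; the database dictionary and the threshold value are hypothetical placeholders:

import numpy as np

def recognize(query_embedding, database, threshold=0.6):
    """Return the customer whose stored embedding is closest to the query,
    or None if no stored face is close enough."""
    best_name, best_dist = None, np.inf
    for name, stored_embedding in database.items():
        dist = np.linalg.norm(query_embedding - stored_embedding)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < threshold else None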

Storing face embeddings is very beneficial because it saves a significant amount of computation and also allows us to cluster faces. Faces will cluster together according to traits such as sex, age, and race. By only comparing an image to the images in the same cluster, we can save even more computational power and, as a result, get even faster recognition.

There are two ways to train Siamese networks. We can train them together with the classifier by creating pairs of matching and non-matching images and then using binary cross-entropy classification loss to train the entire model. However, another, and in many respects better, option is to train the model to generate face embeddings directly. This approach is described in Schroff, Kalenichenko, and Philbin's 2015 paper, FaceNet: A Unified Embedding for Face Recognition and Clustering, which you can read here: https://arxiv.org/abs/1503.03832.

The idea is to create triplets of images: one anchor image, one positive image showing the same face as the anchor image, and one negative image showing a different face than the anchor image. A triplet loss is used to make the distance between the anchor's embedding and the positive's embedding smaller, and the distance between the anchor and the negative larger.

The loss function looks like this:

L = \sum_{i=1}^{N} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+

Here, x_i^a is an anchor image, and f(x_i^a) is the output of the Siamese network for it, the anchor image's embedding; x_i^p and x_i^n are the corresponding positive and negative images. The triplet loss is the squared Euclidean distance between the anchor and the positive minus the squared Euclidean distance between the anchor and the negative. A small constant, \alpha, is a margin enforced between positive and negative pairs. To reach zero loss, the difference between the two distances needs to be at least \alpha.
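
A minimal sketch of this loss as a Keras-compatible function is shown below. It assumes the model outputs the anchor, positive, and negative embeddings concatenated along the last axis; the margin of 0.2 is the value used in the FaceNet paper:

import tensorflow as tf

def triplet_loss(margin=0.2):
    def loss(y_true, embeddings):
        # embeddings is assumed to hold [anchor, positive, negative]
        # concatenated along the last axis; y_true is unused.
        anchor, positive, negative = tf.split(embeddings, 3, axis=-1)
        pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
        neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
        return tf.reduce_mean(tf.maximum(pos_dist - neg_dist + margin, 0.0))
    return loss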

The key takeaway is that you can use a neural network to predict whether two items are semantically the same, and thereby get around classification problems with a huge number of classes. You can train the Siamese model through a binary classification task, but also by treating the outputs as embeddings and using a triplet loss. This insight extends to more than faces. If you wanted to compare time series to classify events, then you could use the exact same approach.

Bounding box prediction

The likelihood is that at some point, you'll be interested in locating objects within images. For instance, say you are an insurance company that needs to inspect the roofs it insures. Getting people to climb on roofs to check them is expensive, so an alternative is to use satellite imagery. Having acquired the images, you now need to find the roofs in them, as we can see in the following screenshot. You can then crop out the roofs and send the roof images to your experts, who will check them:

California homes with bounding boxes around their roofs

What you need are bounding box predictions. A bounding box predictor outputs the coordinates of several bounding boxes together with predictions for what object is shown in the box.

There are two approaches to obtaining such bounding boxes.

A Region-based Convolutional Neural Network (R-CNN) reuses a classification model. It slides the classification model over the image, producing classifications for many different parts of the image, which together form a feature map. Using this feature map, a region proposal network performs a regression task to come up with bounding boxes, and a classification network produces a classification for each bounding box.
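
The following is a minimal sketch of just the sliding-window idea, not a full R-CNN; the classifier.predict call, window size, and stride are hypothetical placeholders:

import numpy as np

def sliding_window_predictions(image, classifier, window=64, stride=32):
    """Apply a classifier to overlapping crops, one prediction per position."""
    predictions = []
    for y in range(0, image.shape[0] - window + 1, stride):
        for x in range(0, image.shape[1] - window + 1, stride):
            crop = image[y:y + window, x:x + window]
            probs = classifier.predict(crop[np.newaxis])  # batch of one crop
            predictions.append(((x, y), probs))
    return predictions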

The approach has been refined, culminating in Ren and others' 2016 paper, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, which is available at https://arxiv.org/abs/1506.01497, but the basic concept of sliding a classifier over an image has remained the same.

You Only Look Once (YOLO), on the other hand, uses a single model consisting of only convolutional layers. It divides an image into a grid and predicts an object class for each grid cell. It then predicts several possible bounding boxes containing objects for each grid cell.

For each bounding box, it regresses the coordinates, width, and height, as well as a confidence score that the box actually contains an object. It then eliminates all bounding boxes that have too low a confidence score or too large an overlap with another, more confident bounding box.
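
This filtering step is commonly implemented as non-maximum suppression. The following is a minimal sketch of it; boxes are assumed to be (x1, y1, x2, y2, confidence) tuples, and the threshold values are illustrative:

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, conf) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def non_max_suppression(boxes, conf_threshold=0.5, iou_threshold=0.45):
    boxes = [b for b in boxes if b[4] >= conf_threshold]     # drop low-confidence boxes
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)  # most confident first
    kept = []
    for box in boxes:
        # keep a box only if it does not overlap too much with a kept, more confident box
        if all(iou(box, k) < iou_threshold for k in kept):
            kept.append(box)
    return kept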

Note

For a more detailed description, read Redmon and Farhadi's 2016 paper, YOLO9000: Better, Faster, Stronger, available at https://arxiv.org/abs/1612.08242, as well as their 2018 paper, YOLOv3: An Incremental Improvement, available at https://arxiv.org/abs/1804.02767.

Both are well-written, tongue-in-cheek papers that explain the YOLO concept in more detail.

The main advantage of YOLO over an R-CNN is that it's much faster: not having to slide a large classification model over the image is much more efficient. An R-CNN's main advantage, in turn, is that it is somewhat more accurate than a YOLO model. If your task requires real-time analysis, you should use YOLO; if you do not need real-time speed but just want the best accuracy, then an R-CNN is the way to go.

Bounding box detection is often used as one of many processing steps. In the insurance case, the bounding box detector would crop out all roofs. The roof images can then be judged by a human expert, or by a separate deep learning model that classifies damaged roofs. Of course, you could train an object locator to distinguish between damaged and intact roofs directly, but in practice, this is usually not a good idea.

If you're interested in reading more about this, Chapter 4, Understanding Time Series, has a great discussion on modularity.
