Chapter 3. Utilizing Computer Vision

When Snapchat first introduced a filter featuring a breakdancing hotdog, the stock price of the company surged. However, investors were less interested in the hotdog's handstand; what actually fascinated them was that Snapchat had successfully built a powerful form of computer vision technology.

The Snapchat app was now not only able to take pictures, but it could also find the surfaces within those pictures that a hotdog could breakdance on. The app would then anchor the hotdog to that surface, so that it kept dancing in the same spot even as the user moved their phone.

While the dancing hotdog may be one of the sillier applications of computer vision, it successfully showed the world the potential of the technology. In a world full of cameras, from the billions of smartphones, security cameras, and satellites in use every day, to Internet of Things (IoT) devices, being able to interpret images yields great benefits for both consumers and producers.

Computer vision allows us to both perceive and interpret the real world at scale. Consider this: no analyst could ever look through millions of satellite images to mark mining sites and track their activity over time; it's simply not possible. For computers, however, this is not just possible; it's a reality here and now.

In fact, several firms are already using this in the real world: retailers count the cars in their parking lots in order to estimate what sales of goods will be in a given period.

Another important application of computer vision can be seen in finance, specifically in the area of insurance. For instance, insurers might use drones to fly over roofs in order to spot issues before they become an expensive problem. This could extend to them using computer vision to inspect factories and equipment they insure.

Looking at another case in the finance sector, banks needing to comply with Know-Your-Customer (KYC) rules are automating back-office processes and identity verification. In financial trading, computer vision can be applied to candlestick charts in order to find new patterns for technical analysis. We could dedicate a whole book to the practical applications of computer vision.

In this chapter, we will be covering the building blocks of computer vision models. This will include a focus on the following topics:

  • Convolutional layers.
  • Padding.
  • Pooling.
  • Regularization to prevent overfitting.
  • Momentum-based optimization.
  • Batch normalization.
  • Advanced architectures for computer vision beyond classification.
  • A note on libraries.

Before we start, let's have a look at all the different libraries we will be using in this chapter:

  • Keras: A high-level neural network library and an interface to TensorFlow.
  • TensorFlow: A dataflow programming and machine learning library that we use for GPU-accelerated computation.
  • Scikit-learn: A popular machine learning library with implementations of many classic algorithms as well as evaluation tools.
  • OpenCV: An image processing library that can be used for rule-based augmentation.
  • NumPy: A library for handling matrices in Python.
  • Seaborn: A plotting library.
  • tqdm: A tool to monitor the progress of Python programs.

It's worth taking a minute to note that all of these libraries, except for OpenCV, can be installed via pip; for example, pip install keras.

OpenCV, however, requires a slightly more complex installation procedure. This is beyond the scope of this book, but the process is well documented in the OpenCV documentation, which you can view at the following URL: https://docs.opencv.org/trunk/df/d65/tutorial_table_of_content_introduction.html.

Alternatively, it's worth noting that both Kaggle and Google Colab come with OpenCV preinstalled. To run the examples in this chapter, make sure you have OpenCV installed and can import it with import cv2.
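A quick sanity check, assuming you have installed everything from the list above, is to confirm that all of the chapter's libraries import cleanly:

    import cv2
    import keras
    import numpy as np
    import seaborn as sns
    import sklearn
    import tensorflow as tf
    from tqdm import tqdm

    # If these imports succeed, the environment is ready for this chapter
    print('OpenCV version:', cv2.__version__)
    print('TensorFlow version:', tf.__version__)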

Convolutional Neural Networks

Convolutional Neural Networks, ConvNets, or CNNs for short, are the driving engine behind computer vision. ConvNets allow us to work with larger images while still keeping the network at a reasonable size.

The name Convolutional Neural Network comes from the mathematical operation that differentiates them from regular neural networks. Convolution is the mathematical term for sliding one matrix over another and, at each position, summing the element-wise products. We'll explore in the next section, Filters on MNIST, why this is important for ConvNets, but also why this is not the best name in the world for them, and why ConvNets should, in reality, be called filter nets.

You may be asking, "but why filter nets?" The answer is simply because what makes them work is the fact that they use filters.

In the next section, we will be working with the MNIST dataset, which is a collection of handwritten digits that has become a standard "Hello, World!" application for computer vision.
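Keras ships with a loader for MNIST, so a minimal sketch for getting the data looks like this:

    from keras.datasets import mnist

    # Load 60,000 training and 10,000 test images of handwritten digits
    (X_train, y_train), (X_test, y_test) = mnist.load_data()

    print(X_train.shape)  # (60000, 28, 28): 28x28 grayscale images
    print(y_train[:10])   # the corresponding digit labels, 0 through 9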

Filters on MNIST

What does a computer actually see when it sees an image? Well, the values of the pixels are stored as numbers in the computer. So, when the computer sees a black-and-white image of a seven, it actually sees something similar to the following:

[Figure: The number 7 from the MNIST dataset]

The preceding is an example from the MNIST dataset. The handwritten number in the image has been highlighted to make the figure seven visible for humans, but for the computer, the image is really just a collection of numbers. This means we can perform all kinds of mathematical operations on the image.
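You can see this for yourself by printing a patch of raw pixel values from one of the images loaded earlier; each entry is a grayscale intensity between 0 and 255:

    # Inspect a 5x5 patch of raw pixel values from the first training image;
    # 0 is black and 255 is white
    print(X_train[0][5:10, 5:10])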

When detecting numbers, there are a few lower-level features that make a number. For example, in this handwritten figure 7, there's a combination of one vertical straight line, one horizontal line on the top, and one horizontal line through the middle. In contrast, a 9 is made up of four rounded lines that form a circle at the top and a straight, vertical line.

We're now able to present the central idea behind ConvNets. We can use a small filter that detects a certain kind of low-level feature, such as a vertical line, and then slide that filter over the entire image to detect all of the vertical lines in it.

The following screenshot shows a vertical line filter. To detect vertical lines in our image, we need to slide this 3x3 matrix filter over the image.

[Figure: A vertical line filter]
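In code, such a filter is nothing more than a small matrix. The exact values may differ from the figure, but a typical vertical line filter can be sketched in NumPy like this:

    import numpy as np

    # A 3x3 vertical line filter: positive weights in the left column and
    # negative weights in the right column, so it responds to vertical edges
    vertical_filter = np.array([[1, 0, -1],
                                [1, 0, -1],
                                [1, 0, -1]])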

Starting in the top-left corner of the MNIST image, we slice out the top-left 3x3 grid of pixels, which in this case is all zeros.

We then perform an element-wise multiplication of all the elements in the filter with all the elements in the slice of the image. The nine products are then summed up, and a bias is added. This value forms the output of the filter and gets passed on as a new pixel to the next layer:

[Figure: The filter operation]
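A sketch of this single step in NumPy, using the image and vertical_filter from above and assuming a bias of zero:

    # Scale pixel values to [0, 1] and compute one output pixel by hand
    image = X_train[0].astype('float32') / 255
    bias = 0.0

    patch = image[0:3, 0:3]  # the top-left 3x3 slice, all zeros here
    output_pixel = np.sum(patch * vertical_filter) + bias
    print(output_pixel)      # 0.0, since the patch contains only zeros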

As a result, the output of our vertical line filter will look like this:

[Figure: The output of a vertical line filter]

Take a minute to notice that the vertical lines are visible while the horizontal lines are gone. Only a few artifacts remain. Also, notice how the filter captures the vertical line from one side.

Since it responds to high pixel values on the left and low pixel values on the right, only the right side of the output shows strong positive values. Meanwhile, the left side of the line actually shows negative values. This is not a big problem in practice as there are usually different filters for different kinds of lines and directions.
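To reproduce this output yourself, a naive sketch of the full sliding operation, with a stride of one and no padding, could look as follows:

    def convolve(image, kernel, bias=0.0):
        # Slide the kernel over the image with stride 1 and no padding
        k = kernel.shape[0]
        out_h = image.shape[0] - k + 1
        out_w = image.shape[1] - k + 1
        output = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i:i + k, j:j + k]
                output[i, j] = np.sum(patch * kernel) + bias
        return output

    vertical_output = convolve(image, vertical_filter)
    print(vertical_output.shape)  # (26, 26): a 28x28 image shrinks without padding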

Adding a second filter

Our vertical filter is working, but we've already noticed that we also need to filter our image for horizontal lines in order to detect a seven.

Our horizontal line filter might look like this:

[Figure: A horizontal line filter]

Using that example, we can now slide this filter over our image in the exact same way we did with the vertical filter, resulting in the following output:

[Figure: The output of the horizontal line filter]
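In code, one way to sketch this is to define the horizontal filter as the transpose of the vertical one and reuse the convolve helper from before:

    # The horizontal line filter is the vertical filter transposed
    horizontal_filter = vertical_filter.T

    # Slide it over the image exactly as before
    horizontal_output = convolve(image, horizontal_filter)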

See how this filter removes the vertical lines and pretty much only leaves the horizontal lines? The question now is, what do we pass on to the next layer? Well, we stack the outputs of both filters on top of each other, creating a three-dimensional cube:

[Figure: The MNIST convolution]
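A sketch of this stacking step in NumPy:

    # Stack both filter outputs along a new channel axis, forming a
    # (26, 26, 2) volume that gets passed on to the next layer
    feature_maps = np.stack([vertical_output, horizontal_output], axis=-1)
    print(feature_maps.shape)  # (26, 26, 2)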

By adding multiple convolutional layers, our ConvNet is able to extract ever more complex and semantic features.
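In practice, we don't hand-craft these filters; each convolutional layer learns them from data. As a preview, a minimal sketch of stacking convolutional layers in Keras, with hyperparameters chosen purely for illustration, might look like this:

    from keras.models import Sequential
    from keras.layers import Conv2D

    # Each Conv2D layer learns its own set of filters; stacking layers lets
    # the network build ever more complex features on top of simple ones
    model = Sequential([
        Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        Conv2D(64, (3, 3), activation='relu'),
    ])
    model.summary()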
