Working with big image datasets

Images tend to be big files. In fact, it's likely that you will not be able to fit your entire image dataset into your machine's RAM.

Therefore, we need to load the images from disk "just in time" rather than loading them all in advance. In this section, we will be setting up an image data generator that loads images on the fly.
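To make the "just in time" idea concrete, here is a minimal sketch of a plain Python generator that reads files from disk only when a batch is requested. The function name iter_batches and its parameters are our own illustration and not part of Keras; the Keras generator introduced in this section does considerably more, such as decoding and resizing the images.

```python
from pathlib import Path
from typing import Iterator, List


def iter_batches(paths: List[Path], batch_size: int) -> Iterator[List[bytes]]:
    """Yield lists of raw file contents, one batch at a time.

    Hypothetical helper for illustration only; not a Keras API.
    """
    for start in range(0, len(paths), batch_size):
        batch_paths = paths[start:start + batch_size]
        # Files are read here, at iteration time, so only one batch
        # has to be held in memory at once.
        yield [path.read_bytes() for path in batch_paths]
```

Because the function is a generator, nothing is read from disk until we start iterating over it, and each batch can be discarded before the next one is loaded.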

In this case, we'll be using a dataset of plant seedlings. It was published by Thomas Giselsson and others in 2017, in their paper A Public Image Database for Benchmark of Plant Seedling Classification Algorithms.

This dataset is available from the following link: https://arxiv.org/abs/1711.05458.

You may be wondering why we're looking at plants; after all, plant classification is not a common problem in the finance sector. The simple answer is that this dataset lends itself to demonstrating many common computer vision techniques and is available under an open domain license; it's therefore a great training dataset for us to use.

Readers who wish to test their knowledge on a more relevant dataset should take a look at the State Farm Distracted Driver dataset as well as the Planet: Understanding the Amazon from Space dataset.

Note

The code and data for this section and the section on stacking pretrained models can be found and run here: https://www.kaggle.com/jannesklaas/stacking-vgg.

Keras comes with an image data generator that can load files from disk out of the box. To use it, we first import the class:

from keras.preprocessing.image import ImageDataGenerator

To obtain a generator that reads from files, we first have to configure an ImageDataGenerator instance. In Keras, ImageDataGenerator offers a range of image augmentation tools, but in our example, we will only be making use of the rescaling function.

Rescaling multiplies all values in an image by a constant. For most common image formats, the color values range from 0 to 255, so we want to rescale by 1/255, which maps them to the range 0 to 1. We can achieve this by running the following:

imgen = ImageDataGenerator(rescale=1/255)
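To see what rescaling by 1/255 does to the pixel values, here is a small sketch using NumPy on a made-up 2x2 image. The array is hypothetical data for illustration; it only mimics the multiplication described above and does not call Keras.

```python
import numpy as np

# A tiny 2x2 single-channel "image" with 8-bit values (hypothetical data).
image = np.array([[0, 64], [128, 255]], dtype=np.uint8)

# Rescaling multiplies every pixel value by the same constant.
rescaled = image.astype(np.float32) * (1 / 255)

# The smallest and largest values now lie at the ends of the 0-to-1 range.
print(rescaled.min(), rescaled.max())
```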

This, however, is not yet the generator that loads the images for us. The ImageDataGenerator class offers a range of generators that can be created by calling methods on it.

To obtain a generator that loads files, we have to call flow_from_directory.

We then have to specify the directory Keras should use, the batch size we would like, in this case 32, as well as the target size the images should be resized to, in this case 150x150 pixels. To do this, we can simply run the following code:

train_generator = imgen.flow_from_directory('train', batch_size=32, target_size=(150, 150))
validation_generator = imgen.flow_from_directory('validation', batch_size=32, target_size=(150, 150))
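It can help to work out what these settings imply before training. The following sketch is plain arithmetic and does not call Keras. The image count and the number of classes are hypothetical placeholders, and the shapes assume three-channel RGB images and one-hot encoded labels, which we believe are the defaults of flow_from_directory.

```python
import math

batch_size = 32            # as in the generator call above
target_size = (150, 150)   # height and width after resizing
num_images = 1000          # hypothetical number of training images
num_classes = 12           # hypothetical number of class subfolders

# Batches needed to see every image once; the last batch may be smaller.
steps_per_epoch = math.ceil(num_images / batch_size)
last_batch_size = num_images - (steps_per_epoch - 1) * batch_size

# Expected array shapes for one full batch under the assumptions above.
image_batch_shape = (batch_size, *target_size, 3)
label_batch_shape = (batch_size, num_classes)

print(steps_per_epoch, last_batch_size)
print(image_batch_shape, label_batch_shape)
```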

How did Keras find the images and how does it know which classes the images belong to? The Keras generator expects the following folder structure:

  • Root:
    • Class 0
      • img
      • img
    • Class 1
      • img
      • img
    • Class 2
      • img

Our dataset is already set up that way, and it's usually not hard to sort images to match the generator's expectations.
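If your own images are not yet organized this way, a few lines of standard-library Python are usually enough. The sketch below assumes a hypothetical dictionary that maps file names to class names; the function name and its parameters are ours, not part of Keras. It also returns a class-to-index mapping built from the sorted folder names, which is our understanding of how Keras assigns class indices.

```python
import shutil
from pathlib import Path
from typing import Dict


def sort_into_class_folders(source_dir: Path, root: Path,
                            labels: Dict[str, str]) -> Dict[str, int]:
    """Copy each labelled image into root/<class name>/.

    Hypothetical helper for illustration only; not a Keras API.
    `labels` maps a file name in `source_dir` to its class name.
    """
    root.mkdir(parents=True, exist_ok=True)
    for filename, class_name in labels.items():
        class_dir = root / class_name
        class_dir.mkdir(exist_ok=True)
        shutil.copy2(source_dir / filename, class_dir / filename)

    # Assumption: class indices follow the sorted subfolder names.
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    return {name: index for index, name in enumerate(class_names)}
```

Copying rather than moving keeps the original files intact, which is a cautious default while you check that the labels are correct.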
