Data analysis and preprocessing

We need to analyze the dataset and do some basic preprocessing. Let's start by defining a few helper functions that load a specific batch out of the five training batches we have and print some statistics about that batch and its samples:

# Imports used by the helper functions in this section
import pickle

import numpy as np
import matplotlib.pyplot as plt

# Defining a helper function for loading a batch of images
def load_batch(cifar10_dataset_dir_path, batch_num):

    with open(cifar10_dataset_dir_path + '/data_batch_' + str(batch_num), mode='rb') as file:
        batch = pickle.load(file, encoding='latin1')

    # Reshape the flat rows of pixels into (num_images, height, width, channels)
    input_features = batch['data'].reshape((len(batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    target_labels = batch['labels']

    return input_features, target_labels
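
As a quick illustration (this check is not part of the book's code), loading the first batch should give us 10,000 images of shape 32 x 32 x 3:

input_features, target_labels = load_batch(cifar10_batches_dir_path, 1)
print(input_features.shape)   # expected: (10000, 32, 32, 3)
print(len(target_labels))     # expected: 10000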

Then, we define a function that displays the stats of a specific batch and visualizes a specific sample from it:

#Defining a function to show the stats for a specific batch and sample
def batch_image_stats(cifar10_dataset_dir_path, batch_num, sample_num):

    batch_nums = list(range(1, 6))

    #checking if the batch_num is a valid batch number
    if batch_num not in batch_nums:
        print('Batch Num is out of Range. You can choose from these Batch nums: {}'.format(batch_nums))
        return None

    input_features, target_labels = load_batch(cifar10_dataset_dir_path, batch_num)

    #checking if the sample_num is a valid sample number
    if not (0 <= sample_num < len(input_features)):
        print('{} samples in batch {}. {} is not a valid sample number.'.format(len(input_features), batch_num, sample_num))
        return None

    print('Statistics of batch number {}:'.format(batch_num))
    print('Number of samples in this batch: {}'.format(len(input_features)))
    print('Per class counts of each Label: {}'.format(dict(zip(*np.unique(target_labels, return_counts=True)))))

    image = input_features[sample_num]
    label = target_labels[sample_num]
    cifar10_class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

    print('Sample Image Number {}:'.format(sample_num))
    print('Sample image - Minimum pixel value: {} Maximum pixel value: {}'.format(image.min(), image.max()))
    print('Sample image - Shape: {}'.format(image.shape))
    print('Sample Label - Label Id: {} Name: {}'.format(label, cifar10_class_names[label]))
    plt.axis('off')
    plt.imshow(image)

Now, we can use this function to play around with our dataset and visualize specific images:

# Explore a specific batch and sample from the dataset
batch_num = 3
sample_num = 6
batch_image_stats(cifar10_batches_dir_path, batch_num, sample_num)

The output is as follows:


Statistics of batch number 3:
Number of samples in this batch: 10000
Per class counts of each Label: {0: 994, 1: 1042, 2: 965, 3: 997, 4: 990, 5: 1029, 6: 978, 7: 1015, 8: 961, 9: 1029}

Sample Image Number 6:
Sample image - Minimum pixel value: 30 Maximum pixel value: 242
Sample image - Shape: (32, 32, 3)
Sample Label - Label Id: 8 Name: ship
Figure 11.2: Sample image 6 from batch 3

Before going ahead and feeding our dataset to the model, we need to normalize its pixel values to the range of zero to one.

This kind of input scaling is closely related to batch normalization, which applies a similar normalization to the activations inside the network during training (a minimal sketch of that computation follows this list). Batch normalization has been shown to have several benefits:

  • Faster training: Each training step will be slower because of the extra calculations during the forward pass and the additional parameters to learn during the backward pass. However, the network should converge much more quickly, so training should be faster overall.
  • Higher learning rates: The gradient descent algorithm generally requires small learning rates for the network to converge to the loss function's minima. As neural networks get deeper, their gradient values get smaller and smaller during backpropagation, so they usually require even more iterations. Batch normalization allows us to use much higher learning rates, which further increases the speed at which networks train.
  • Easier weight initialization: Weight initialization can be difficult, especially for deep neural networks. Batch normalization allows us to be much less careful about choosing our initial weights.
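
To make the idea concrete, here is a minimal NumPy sketch of the batch-normalization computation for one mini-batch of activations. The function and parameter names (batch_norm_sketch, gamma, beta, epsilon) are illustrative and not part of the book's code:

import numpy as np

def batch_norm_sketch(activations, gamma=1.0, beta=0.0, epsilon=1e-5):
    # activations: a mini-batch of shape (batch_size, num_features)
    batch_mean = activations.mean(axis=0)
    batch_variance = activations.var(axis=0)

    # Normalize each feature to zero mean and unit variance over the mini-batch
    normalized = (activations - batch_mean) / np.sqrt(batch_variance + epsilon)

    # Scale and shift with the learnable parameters gamma and beta
    return gamma * normalized + beta

# Example: a random mini-batch of 4 samples with 3 features each
mini_batch = np.random.rand(4, 3) * 255.0
print(batch_norm_sketch(mini_batch).mean(axis=0))  # approximately zero for each feature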

So, let's proceed by defining a function that will be responsible for normalizing a list of input images so that all the pixel values of these images are between zero and one:

#Normalize CIFAR-10 images to be in the range of [0,1]
def normalize_images(images):

    # Initialize an all-zeros ndarray with the same shape as the input
    normalized_images = np.zeros_like(images.astype(float))

    # The first axis indexes the images; the remaining axes are the
    # height, width, and depth of each image
    num_images = images.shape[0]

    # Compute the minimum and maximum pixel values over all input images
    # and use them for the normalization
    maximum_value, minimum_value = images.max(), images.min()

    # Normalize all the pixel values of the images to be from 0 to 1
    for img in range(num_images):
        normalized_images[img, ...] = (images[img, ...] - float(minimum_value)) / float(maximum_value - minimum_value)

    return normalized_images
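
As a quick sanity check (this snippet is illustrative and not part of the book's pipeline), the normalized values should fall inside [0, 1]:

sample_features, _ = load_batch(cifar10_batches_dir_path, 1)
normalized = normalize_images(sample_features)
print('Min: {} Max: {}'.format(normalized.min(), normalized.max()))  # expected: Min: 0.0 Max: 1.0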

Next up, we need to implement another helper function to encode the labels of the input images. In this function, we will use sklearn's OneHotEncoder, where each image label is represented by a vector of zeros with a one at the index of the class that the image belongs to.

The size of this output vector depends on the number of classes in the dataset, which is 10 in the case of CIFAR-10:

# Encoding the target labels. Each label will be represented by a vector of zeros
# with a one at the index of the class it belongs to. The length of this vector
# equals the number of classes in the dataset, which is 10 for CIFAR-10

from sklearn.preprocessing import OneHotEncoder

def one_hot_encode(labels):

    num_classes = 10

    # Use sklearn's OneHotEncoder, fixing the categories to the 10 CIFAR-10 class ids
    encoder = OneHotEncoder(categories=[list(range(num_classes))])

    # Reshape the input labels into a 2D column vector, as expected by the encoder
    labels_resized_to_2d = np.array(labels).reshape(-1, 1)
    one_hot_encoded_targets = encoder.fit_transform(labels_resized_to_2d)

    return one_hot_encoded_targets.toarray()
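
For example (an illustrative check, not part of the book's code), encoding the labels 6 (frog) and 0 (airplane) produces the following two vectors:

print(one_hot_encode([6, 0]))
# [[0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
#  [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]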

Now, it's time to call the preceding helper functions to do the preprocessing and persist the dataset so that we can use it later:

def preprocess_persist_data(cifar10_batches_dir_path, normalize_images, one_hot_encode):

    num_batches = 5
    valid_input_features = []
    valid_target_labels = []

    for batch_ind in range(1, num_batches + 1):

        # Loading the batch
        input_features, target_labels = load_batch(cifar10_batches_dir_path, batch_ind)
        num_validation_images = int(len(input_features) * 0.1)

        # Preprocess the training portion of the current batch and persist it for future use
        train_input_features = normalize_images(input_features[:-num_validation_images])
        train_target_labels = one_hot_encode(target_labels[:-num_validation_images])

        # Persisting the preprocessed training batch
        pickle.dump((train_input_features, train_target_labels),
                    open('preprocess_train_batch_' + str(batch_ind) + '.p', 'wb'))

        # Keep the last 10% of each training batch for validating our model
        valid_input_features.extend(input_features[-num_validation_images:])
        valid_target_labels.extend(target_labels[-num_validation_images:])

    # Preprocessing and persisting the validation subset
    input_features = normalize_images(np.array(valid_input_features))
    target_labels = one_hot_encode(np.array(valid_target_labels))

    pickle.dump((input_features, target_labels), open('preprocess_valid.p', 'wb'))

    # Now it's time to preprocess and persist the test batch
    with open(cifar10_batches_dir_path + '/test_batch', mode='rb') as file:
        test_batch = pickle.load(file, encoding='latin1')

    test_input_features = test_batch['data'].reshape((len(test_batch['data']), 3, 32, 32)).transpose(0, 2, 3, 1)
    test_input_labels = test_batch['labels']

    # Normalizing and encoding the test batch
    input_features = normalize_images(np.array(test_input_features))
    target_labels = one_hot_encode(np.array(test_input_labels))

    pickle.dump((input_features, target_labels), open('preprocess_test.p', 'wb'))

# Calling the helper function above to preprocess and persist the training, validation, and testing sets
preprocess_persist_data(cifar10_batches_dir_path, normalize_images, one_hot_encode)

So, we have the preprocessed data saved to disk.

We also need to load the validation set for running the trained model on it at different epochs of the training process:

# Load the Preprocessed Validation data
valid_input_features, valid_input_labels = pickle.load(open('preprocess_valid.p', mode='rb'))
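
With 10% of each of the five training batches held out, the validation set should contain 5,000 images. The following check is illustrative and not part of the book's code:

print(valid_input_features.shape)  # expected: (5000, 32, 32, 3)
print(valid_input_labels.shape)    # expected: (5000, 10)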