Reasons for sub-optimal performance of visual CNN models

The performance of a CNN can be improved to a certain extent by adopting proper tuning and setup mechanisms such as data pre-processing, batch normalization, and optimal pre-initialization of weights; choosing the correct activation function; using techniques such as regularization to avoid overfitting; using an optimal optimization function; and training with plenty of (quality) data.

Beyond these training and architecture-related decisions, there are image-related nuances that can impact the performance of visual models. Even after controlling the aforementioned training and architectural factors, a conventional CNN-based image classifier does not work well under some of the following conditions related to the underlying images:

  • Very big images
  • Highly cluttered images with many objects to classify
  • Very noisy images

Let's try to understand the reasons behind the sub-optimal performance under these conditions, and then work out what may fix the problem.

In conventional CNN-based models, even after downsizing across layers, the computational complexity is quite high. In fact, the complexity is of the order of O(L × W × PPI²), where L and W are the length and width of the image in inches, and PPI is the pixels per inch (pixel density). Since the total number of pixels in the image is P = L × W × PPI², this translates into a linear complexity with respect to P, or O(P). This directly explains the first challenge: for higher L, W, or PPI, we need much more computational power and time to train the network.
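This scaling is easy to verify with a quick sketch (the function and example sizes below are illustrative, not from the text): P grows linearly with each physical dimension but quadratically with pixel density, because PPI enters both the horizontal and vertical pixel counts.

```python
def pixel_count(length_in, width_in, ppi):
    """Total number of pixels P = (L * PPI) * (W * PPI) for an image
    of the given physical size (inches) and pixel density (PPI)."""
    return int(length_in * ppi) * int(width_in * ppi)

# Doubling one physical dimension doubles P, while doubling the
# pixel density quadruples P (PPI enters squared):
p_base = pixel_count(4, 6, 300)   # 4x6-inch photo at 300 PPI
p_long = pixel_count(8, 6, 300)   # doubled length -> 2x pixels
p_dense = pixel_count(4, 6, 600)  # doubled PPI   -> 4x pixels
```

Because per-pixel work in the early convolutional layers is roughly constant, these pixel counts track the training cost directly.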

Operations such as max-pooling and average-pooling help drastically reduce the computational load compared to performing all the computations across all the layers on the full-resolution image.
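As a minimal sketch of why pooling helps (the NumPy implementation below is illustrative, not from the text), a non-overlapping 2×2 max-pool keeps only the strongest activation in each window, quartering the number of pixels that downstream layers must process:

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Non-overlapping 2x2 max-pooling: keep the strongest activation
    in each 2x2 window, quartering the spatial pixel count."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % 2, :w - w % 2]   # drop odd edges
    windows = trimmed.reshape(h // 2, 2, w // 2, 2)
    return windows.max(axis=(1, 3))

fmap = np.array([[1, 3, 2, 0],
                 [4, 2, 1, 1],
                 [0, 1, 5, 6],
                 [2, 2, 7, 8]], dtype=float)
pooled = max_pool_2x2(fmap)  # shape (2, 2): [[4, 2], [2, 8]]
```

Each pooling layer shrinks the feature map by 4x, which is why stacking a few of them makes even large inputs tractable.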

If we visualize the patterns formed in each layer of our CNN, we can understand the intuition behind how the CNN works and why it needs to be deep. Each subsequent layer learns higher-level conceptual features, which progressively help the network understand the objects in the image. So, in the case of MNIST, the first layer may only identify boundaries, the second the diagonals and straight-line shapes formed by those boundaries, and so on:

Illustrative conceptual features formed in different (initial) layers of CNN for MNIST
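The boundary-detecting behavior of an early layer can be mimicked by hand with a single convolution (a sketch, not from the text; the kernel below is a hand-picked vertical-edge filter similar to what first layers often learn, not an actual trained weight):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 2D 'valid' convolution (strictly cross-correlation,
    as implemented in most deep-learning libraries)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Hand-picked vertical-edge kernel: responds where intensity
# drops from left to right.
vertical_edge = np.array([[1., 0., -1.],
                          [1., 0., -1.],
                          [1., 0., -1.]])

# Toy "digit stroke": bright left half, dark right half.
image = np.zeros((6, 6))
image[:, :3] = 1.0

response = conv2d_valid(image, vertical_edge)
# The response is zero in flat regions and peaks only at the
# columns straddling the bright/dark boundary.
```

A trained CNN learns many such filters; deeper layers then combine these edge responses into diagonals, shapes, and eventually whole-object features.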

MNIST is a simple dataset, whereas real-life images are quite complex; distinguishing them requires higher-level conceptual features, and hence more complex and much deeper networks. Moreover, in MNIST we are trying to distinguish between similar types of objects (all handwritten digits), whereas in real life the objects can differ widely, so the number of distinct features required to model all such objects will be very high.

This brings us to our second challenge. A cluttered image with too many objects requires a very complex network to model them all. Also, since there are many objects to identify, the image resolution needs to be high enough to correctly extract and map the features for each object, which in turn means that the image size and number of pixels must be large for effective classification. Combining the first two challenges, this multiplies the overall complexity.

The number of layers, and hence the complexity, of popular CNN architectures used in ImageNet challenges has been increasing over the years. Some examples are VGG16 from Oxford (2014) with 16 layers, GoogLeNet (2014) with 22 layers, and ResNet (2015) with 152 layers.

Not all images are of perfect SLR quality. Often, because of low light, image processing, low resolution, lack of stabilization, and so on, a lot of noise may be introduced into the image. This is just one form of noise, and one that is easy to understand. From the perspective of a CNN, another form of noise can be image translation, rotation, or transformation:

Image without noise

Same image with added noise

In the preceding images, try reading the newspaper title Business in the image without and with noise, or try to identify the mobile phone in both images. It is difficult in the noisy image, right? Our CNN faces a similar detection/classification challenge with noisy images.
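The kind of sensor noise described above is commonly modeled as additive Gaussian noise; a minimal sketch (illustrative, not from the text) shows how a clean image is corrupted before it ever reaches the classifier:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def add_gaussian_noise(image, sigma=0.2):
    """Corrupt a [0, 1]-scaled image with additive Gaussian noise,
    clipping back into the valid intensity range."""
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

clean = np.full((8, 8), 0.5)       # a flat mid-gray patch
noisy = add_gaussian_noise(clean)  # same shape, perturbed pixel values
```

Training on such perturbed copies (noise augmentation) is one common mitigation, but as the text notes, augmentation alone does not fully close the gap on real-world noisy inputs.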

Even with exhaustive training, perfect hyperparameter tuning, and techniques such as dropout, these real-life challenges continue to diminish the image-recognition accuracy of CNNs. Now that we've understood the causes and intuition behind the lack of accuracy and performance in our CNNs, let's explore some ways and architectures to alleviate these challenges using visual attention.
