When?

Research has shown that features extracted from convolutional networks trained on ImageNet outperform conventional feature extraction methods such as SURF, Deformable Part Descriptors (DPDs), Histogram of Oriented Gradients (HOG), and bag of words (BoW). This means that convolutional features can be used wherever conventional visual representations work, with the only drawback being that deeper architectures may take longer to extract the features.

When a deep convolutional neural network is trained on ImageNet, visualizing the convolution filters in the first layers (refer to the following illustration) shows that they learn low-level features similar to edge detection filters, while the convolution filters at the last layers learn high-level features that capture class-specific information. Hence, if we extract features for ImageNet images after the first pooling layer and embed them into a 2D space (using, for example, t-SNE), the visualization shows little structure in the data, whereas if we do the same at the fully connected layers, we notice that images with the same semantic information are organized into clusters. This implies that the network generalizes well at the higher layers, and that it should be possible to transfer this knowledge to unseen classes.
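
To make this concrete, here is a minimal sketch of extracting features at an early and a late layer of a pre-trained network and embedding them with t-SNE. It assumes tf.keras and scikit-learn are available; the VGG16 backbone, its layer names, and the random stand-in batch are illustrative choices, not something prescribed by the text:

    import numpy as np
    import tensorflow as tf
    from sklearn.manifold import TSNE

    # Network pre-trained on ImageNet, with its classification head intact.
    base = tf.keras.applications.VGG16(weights="imagenet")

    # Two feature extractors: one after the first pooling layer, one at a
    # fully connected layer near the output.
    early = tf.keras.Model(base.input, base.get_layer("block1_pool").output)
    late = tf.keras.Model(base.input, base.get_layer("fc2").output)

    images = np.random.rand(64, 224, 224, 3).astype("float32")  # stand-in batch

    # Flatten the early feature maps, then embed both feature sets into 2D.
    f_early = early.predict(images).reshape(len(images), -1)
    f_late = late.predict(images)
    emb_early = TSNE(n_components=2).fit_transform(f_early)
    emb_late = TSNE(n_components=2).fit_transform(f_late)
    # Plotting the embeddings (colored by class) shows scattered points for
    # the early layer and semantic clusters for the fully connected layer.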

According to experiments conducted on datasets with a small degree of similarity to ImageNet, features based on convolutional neural network weights trained on ImageNet perform better than conventional feature extraction methods on the following tasks:

  • Object recognition: A CNN feature extractor can successfully perform classification tasks on other datasets whose classes were never seen during training.
  • Domain adaptation: This is when the training and testing data come from different distributions, while the labels and the number of classes are the same. Different domains can correspond to images captured with different devices or in different settings and environmental conditions. A linear classifier with CNN features successfully clusters images with the same semantic information across domains, while SURF features overfit to domain-specific characteristics.
  • Fine-grained classification: This is when we want to classify subcategories within the same high-level class, for example, distinguishing between bird species. CNN features, combined with logistic regression, perform better than the baseline approaches even though the network was not trained on fine-grained data.
  • Scene recognition: Here, we need to classify the scene of the entire image. A CNN feature extractor trained on object classification datasets, with a simple linear classifier on top, outperforms complex learning algorithms applied to traditional feature extractors on scene recognition data (a minimal sketch of this feature-extractor-plus-linear-classifier recipe follows this list).
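
The recurring recipe in all of these experiments is a frozen, pre-trained CNN used as a feature extractor, with a simple linear classifier trained on top. Here is a minimal sketch of that setup, assuming tf.keras and scikit-learn; the ResNet50 backbone and the placeholder arrays are illustrative choices:

    import numpy as np
    import tensorflow as tf
    from sklearn.linear_model import LogisticRegression

    # Pre-trained network with its classification head removed; global
    # average pooling yields one fixed-length feature vector per image.
    extractor = tf.keras.applications.ResNet50(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(224, 224, 3))

    # Placeholder data: replace with your own images and labels.
    x_train = np.random.rand(100, 224, 224, 3).astype("float32")
    y_train = np.random.randint(0, 5, size=100)
    x_test = np.random.rand(20, 224, 224, 3).astype("float32")

    # Extract fixed CNN features (the network itself is never trained)...
    train_feats = extractor.predict(x_train)
    test_feats = extractor.predict(x_test)

    # ...and train only a simple linear classifier on top of them.
    clf = LogisticRegression(max_iter=1000).fit(train_feats, y_train)
    predictions = clf.predict(test_feats)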

Some of the tasks mentioned here are not directly related to image classification, which was the primary goal of training on ImageNet, so one would expect the CNN features to fail to generalize to unseen scenarios. However, those features, combined with a simple linear classifier, outperform hand-crafted features. This means that the learned weights of a CNN are reusable.

So when should we use transfer learning? When we have a task where the available dataset is small due to the nature of the problem (such as classifying ants versus bees). In this case, we can train our model on a larger dataset that contains similar semantic information and then retrain only the last layer (the linear classifier) with the small dataset. If we have just enough data available, and there is a larger dataset similar to ours, pre-training on that similar dataset may result in a more robust model. Instead of initializing the weights randomly, as we normally do, we initialize them with the weights trained on the other dataset. This helps the network converge faster and generalize better. In this scenario, it makes sense to fine-tune only a few layers at the top of the model.
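
As a concrete illustration of retraining only the last layer, here is a minimal sketch assuming tf.keras; the MobileNetV2 backbone, the two-class ants/bees setup, and the placeholder arrays are illustrative choices:

    import numpy as np
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(224, 224, 3))
    base.trainable = False  # freeze all pre-trained layers

    # A new linear classifier (the last layer) for the small two-class task.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])

    # Placeholder small dataset: replace with the real ants/bees images.
    x_small = np.random.rand(40, 224, 224, 3).astype("float32")
    y_small = np.random.randint(0, 2, size=40)
    model.fit(x_small, y_small, epochs=5)  # only the Dense layer is updated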

A rule of thumb is that the more data you have available, the more layers you can train, starting from the top of the network. Initialize your model with weights from a model pre-trained on, for example, ImageNet.
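
Following that rule of thumb, with more data we can unfreeze more layers from the top before fine-tuning. A minimal sketch, again assuming tf.keras; the cut-off of 20 layers is an arbitrary illustrative choice:

    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        weights="imagenet", include_top=False, pooling="avg",
        input_shape=(224, 224, 3))

    # Unfreeze only the last 20 layers; with more data, move this cut-off
    # further down the network to train more layers.
    base.trainable = True
    for layer in base.layers[:-20]:
        layer.trainable = False

    model = tf.keras.Sequential([
        base,
        tf.keras.layers.Dense(2, activation="softmax"),
    ])
    # A small learning rate helps avoid destroying the pre-trained weights.
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])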