ImageNet

The ImageNet dataset was created in 2010 as a collaborative effort of Alex Berg (Columbia University), Jia Deng (Princeton University), and Fei-Fei Li (Stanford University) to run as a taster competition on large-scale visual recognition, in conjunction with the PASCAL Visual Object Classes Challenge 2010. The dataset is a collection of images representing the contents of WordNet, a lexical database for the English language that groups English words into sets of synonyms, called synsets, arranged in a hierarchical structure. The following screenshot shows the WordNet structure of nouns. The number in brackets is the number of synsets in the subtree.

The evolution of image classification algorithms, which had almost solved the classification challenge on the existing datasets, led to the need for a new dataset that would allow image classification at a large scale. This is closer to a real-world scenario, where we would like the machine to describe the content of an arbitrary image, simulating human capability. Compared to its predecessors, whose class counts were in the hundreds, ImageNet offers over 10 million high-quality images covering more than 10,000 classes. Many of these classes are related to each other, which makes the classification task more challenging, for example, distinguishing among many breeds of dogs. Since the dataset is enormous, it was impractical to annotate each image with all the classes present in it, so by convention, each image was labeled with only one class.

Since 2010, the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has focused on image classification, single-object localization, and detection. The data for the object classification challenge consists of 1.2 million training images (from 1,000 categories/synsets), 50,000 validation images, and 100,000 test images.

In the classification challenge, the main metric used to evaluate an algorithm is the top-5 error rate. The algorithm is allowed to give five predicted classes and is not penalized as long as at least one of the predicted classes matches the ground truth label.

Formally, let $x$ be the image and $C$ be its ground truth label. The algorithm produces the predicted labels $c_1, c_2, \dots, c_5$, where at least one must be equal to $C$ to consider it a successful prediction. The error of a single prediction is as follows:

$$e = \min_{j} d(c_j, C)$$

where $d(c_j, C) = 0$ if $c_j = C$, and $d(c_j, C) = 1$ otherwise. The final error of an algorithm is then the proportion of test images on which a mistake was made, as shown:

$$\mathrm{error} = \frac{1}{N} \sum_{i=1}^{N} \min_{j} d(c_{ij}, C_i)$$

where $N$ is the number of test images.
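The metric described above can be sketched in a few lines of Python. This is a minimal illustration, not the official ILSVRC evaluation code; the function and variable names here are hypothetical:

```python
def top5_error(predictions, labels):
    """Proportion of images whose ground-truth label is absent
    from the five predicted classes.

    predictions: list of lists, each holding the 5 predicted class
                 labels for one image.
    labels:      list of ground-truth labels, one per image.
    """
    mistakes = sum(
        1 for preds, truth in zip(predictions, labels)
        if truth not in preds[:5]  # d = 1 only if no prediction matches
    )
    return mistakes / len(labels)

# Tiny worked example with made-up class names:
preds = [
    ["husky", "malamute", "wolf", "samoyed", "akita"],  # truth present
    ["tabby", "siamese", "persian", "lynx", "tiger"],   # truth absent
    ["sedan", "coupe", "truck", "van", "bus"],          # truth present
]
truths = ["wolf", "cheetah", "van"]
print(top5_error(preds, truths))  # one mistake out of three images
```

Note that only membership in the top five matters; the ordering of the five predictions has no effect on this metric.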

ImageNet was one of the big reasons why deep learning has taken off in recent years. Before deep learning became popular, the top-5 error rate in ILSVRC hovered around 28% and was barely improving. However, in 2012, the winner of the challenge, SuperVision, smashed the top-5 classification error down to 16.4%. The team's model, now known as AlexNet, was a deep convolutional neural network. This huge victory woke people up to the power of CNNs, and it became the stepping stone for many modern CNN architectures.

In the following years, CNN models continued to dominate and the top-5 error rate kept falling. The 2014 winner, GoogLeNet, reduced the error rate to 6.7%, and this was roughly halved again in 2015, to 3.57%, by ResNet. Since then, there have been smaller improvements, with the 2017 winner, team WMW's squeeze-and-excitation networks, producing a 2.25% error rate.
