Chapter 5. Image Recognition

Vision is arguably the most important human sense. We rely on our vision to recognize our food, to run away from danger, to recognize our friends and family, and to find our way in familiar surroundings. We rely on our vision, in fact, to read this book and to recognize each and every letter and symbol printed in it. However, image recognition has long been, and in many ways still is, one of the most difficult problems in computer science. It is very hard to teach a computer programmatically how to recognize different objects, because it is difficult to explain to a machine which features make up a given object. In deep learning, however, as we have seen, the neural network learns these features by itself, that is, it learns which features make up each object, and it is therefore well suited for a task such as image recognition.

In this chapter we will cover the following topics:

  • Similarities between artificial and biological models
  • Intuition and justification for CNNs
  • Convolutional layers
  • Pooling layers
  • Dropout
  • Convolutional layers in deep learning

Similarities between artificial and biological models

Human vision is a complex and heavily structured process. The visual system understands reality hierarchically, through the retina, the thalamus, the visual cortex, and the inferior temporal cortex. The input to the retina is a two-dimensional array of color intensities that is sent, through the optic nerve, to the thalamus. The thalamus receives sensory information from all of our senses with the exception of the olfactory system, and it forwards the visual information collected from the retina to the primary visual cortex, the striate cortex (called V1), which extracts basic information such as lines and movement directions. The information then moves to the V2 region, which is responsible for color interpretation and color constancy under different lighting conditions, and then to the V3 and V4 regions, which improve color and form perception. Finally, the information reaches the inferior temporal cortex (IT) for object and face recognition (in reality, the IT region is further subdivided into three sub-regions: the posterior IT, central IT, and anterior IT). The brain, then, processes visual information hierarchically, at different levels: it seemingly works by creating simple abstract representations of reality that can then be recombined together (see for reference: J. DiCarlo, D. Zoccolan, and N. Rust, How does the brain solve visual object recognition?, https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3306444).

The deep learning neural networks we have seen so far work similarly, by creating abstract representations, as we have seen in RBMs, for example. But there is another important piece to the puzzle of understanding sensory information: the information we extract from a sensory input is determined mostly by the parts of that input that are most closely related to each other. Visually, we can assume that pixels that are close together are most closely related, and that their collective information is more relevant than what we can derive from pixels that are far apart. In understanding speech, as another example, we have discussed the importance of studying tri-phones, that is, the fact that the understanding of a sound depends on the sounds preceding and following it. Similarly, to recognize letters or digits we need to understand the dependencies between nearby pixels, since those dependencies determine the shape of the element and let us tell the difference between, say, a 0 and a 1. Pixels that are far from those making up a 0 hold, in general, little or no relevance for our understanding of the digit "0". Convolutional networks are built exactly to address this issue: how to make information pertaining to neurons that are closer more relevant than information coming from neurons that are farther apart. In visual problems, this translates into making neurons process information coming from nearby pixels, and ignoring information related to pixels that are far apart.
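This locality principle can be sketched in a few lines of code. The function below is a minimal, illustrative implementation of a 2D convolution using NumPy (not the book's own code, and in deep learning practice this is technically a cross-correlation): each output value is computed only from the small patch of pixels directly under the kernel, so distant pixels have no influence on it.

```python
import numpy as np

def conv2d(image, kernel):
    """Minimal 'valid' 2D convolution sketch: each output value depends
    only on the local patch of pixels under the kernel (the receptive
    field), never on pixels far away."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]  # local receptive field
            out[i, j] = np.sum(patch * kernel)
    return out

# A tiny 4x4 image with a vertical edge between its left and right halves,
# and a simple vertical-edge-detecting kernel:
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=float)
print(conv2d(image, kernel))  # every 3x3 patch straddles the edge: [[3. 3.] [3. 3.]]
```

Each of the four output values responds strongly because every 3x3 patch in this image straddles the vertical edge; pixels outside a given patch play no role in that output, which is precisely the locality that convolutional layers exploit.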
