Applying the RAM on a noisy MNIST sample

To understand how the RAM works in greater detail, let's create an MNIST sample that incorporates some of the problems highlighted in the earlier section:

Larger image of noisy and distorted MNIST

The preceding image is a larger collage built from an actual, slightly noisy MNIST sample (the digit 2), surrounded by many other distortions and snippets of other partial samples. The actual digit 2 is also not centered. This example exhibits all the previously stated problems, yet it is simple enough to illustrate how the RAM works.
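A collage of this kind can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name make_cluttered_sample, the 60x60 canvas, the snippet size, and the noise level are all hypothetical choices, and a plain rectangle stands in for a real 28x28 MNIST digit so that the snippet runs without downloading the dataset. For the same reason, the distracting snippets are cropped from the stand-in digit itself rather than from other samples.

```python
import numpy as np

def make_cluttered_sample(digit, canvas_size=60, n_clutter=4, clutter_size=8,
                          noise_std=0.1, seed=0):
    """Place `digit` off-center on a larger canvas, then add snippets and noise."""
    rng = np.random.default_rng(seed)
    h, w = digit.shape
    canvas = np.zeros((canvas_size, canvas_size), dtype=np.float32)

    # Put the real digit at a random (generally off-center) position.
    top = int(rng.integers(0, canvas_size - h + 1))
    left = int(rng.integers(0, canvas_size - w + 1))
    canvas[top:top + h, left:left + w] = digit

    # Paste small crops elsewhere as distracting partial samples.
    for _ in range(n_clutter):
        sy = rng.integers(0, h - clutter_size + 1)
        sx = rng.integers(0, w - clutter_size + 1)
        snippet = digit[sy:sy + clutter_size, sx:sx + clutter_size]
        dy = rng.integers(0, canvas_size - clutter_size + 1)
        dx = rng.integers(0, canvas_size - clutter_size + 1)
        region = canvas[dy:dy + clutter_size, dx:dx + clutter_size]
        canvas[dy:dy + clutter_size, dx:dx + clutter_size] = np.maximum(region, snippet)

    # Add Gaussian pixel noise and keep intensities in [0, 1].
    canvas += rng.normal(0.0, noise_std, canvas.shape).astype(np.float32)
    return np.clip(canvas, 0.0, 1.0), (top, left)

# Stand-in for a 28x28 MNIST digit (assumption); use a real sample in practice.
digit = np.zeros((28, 28), dtype=np.float32)
digit[6:22, 10:18] = 1.0
image, digit_pos = make_cluttered_sample(digit)
```

The function also returns the top-left corner of the pasted digit, which is handy when checking whether later fixations actually land on it.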

The RAM uses the concept of a Glimpse Sensor. At time t-1, the RL agent fixes its gaze at a particular coordinate l_{t-1} of the image x_t and uses the Glimpse Sensor to extract retina-like, multiple-resolution patches of the image with l_{t-1} as the center. These representations, extracted at time t-1, are collectively called ρ(x_t, l_{t-1}):

The concept of the Glimpse Sensor

These images show the representations of our image across two fixations using the Glimpse Sensor.
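To make the idea of retina-like patches concrete, here is a minimal sketch of a Glimpse Sensor in NumPy. It is not the implementation behind the figures: the function name glimpse_sensor, the 8x8 patch size, the doubling of the field of view at each scale, and the use of average pooling for downsampling are assumptions made for illustration. The fixation is given in pixel coordinates (row, col).

```python
import numpy as np

def glimpse_sensor(image, loc, patch_size=8, n_scales=3):
    """Return n_scales patches centered at loc, each patch_size x patch_size.

    Scale k covers a square of side patch_size * 2**k around the fixation
    and is reduced to patch_size x patch_size by average pooling, so wider
    patches have lower resolution.
    """
    row, col = loc
    max_size = patch_size * 2 ** (n_scales - 1)
    pad = max_size // 2
    # Zero-pad so that fixations near the border still yield full patches.
    padded = np.pad(image, pad, mode="constant")

    patches = []
    for k in range(n_scales):
        size = patch_size * 2 ** k
        half = size // 2
        r0 = row + pad - half
        c0 = col + pad - half
        crop = padded[r0:r0 + size, c0:c0 + size]
        factor = 2 ** k
        # Average over factor x factor blocks to downsample the crop.
        pooled = crop.reshape(patch_size, factor, patch_size, factor).mean(axis=(1, 3))
        patches.append(pooled)
    return np.stack(patches)  # shape: (n_scales, patch_size, patch_size)

# Example: three 8x8 patches around the center of a 60x60 image.
demo_image = np.arange(60 * 60, dtype=np.float64).reshape(60, 60)
rho = glimpse_sensor(demo_image, (30, 30))
```

The first patch is a full-resolution crop around the fixation, while the later ones cover 16x16 and 32x32 regions squeezed into the same 8x8 grid.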

The representations obtained from the Glimpse Sensor are passed through the Glimpse Network, which flattens them in two stages. In the first stage, the representation from the Glimpse Sensor and the glimpse location l_{t-1} are flattened separately; in the second, they are combined into a single flattened layer to generate the output representation g_t for time t:

The concept of the Glimpse Network
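The two-stage structure of the Glimpse Network can be sketched as follows. This assumes that the two inputs processed separately are the flattened sensor output and the fixation location, and that the second stage adds two linear projections before a rectifier. The layer sizes (128 hidden units, 256 outputs), the random initialization, and the names init_glimpse_network and glimpse_network are hypothetical.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_glimpse_network(patch_dim, loc_dim=2, hidden=128, out_dim=256, seed=0):
    """Randomly initialized weights; a trained model would learn these."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        "W_rho": rng.normal(0, s, (patch_dim, hidden)), "b_rho": np.zeros(hidden),
        "W_loc": rng.normal(0, s, (loc_dim, hidden)), "b_loc": np.zeros(hidden),
        "W_hg": rng.normal(0, s, (hidden, out_dim)),
        "W_hl": rng.normal(0, s, (hidden, out_dim)),
        "b_g": np.zeros(out_dim),
    }

def glimpse_network(params, rho, loc):
    """Map sensor patches rho and location loc to the glimpse feature g_t."""
    # Stage 1: sensor output and location go through separate layers.
    h_rho = relu(rho.ravel() @ params["W_rho"] + params["b_rho"])
    h_loc = relu(np.asarray(loc, dtype=float) @ params["W_loc"] + params["b_loc"])
    # Stage 2: both hidden vectors are combined into a single layer.
    return relu(h_rho @ params["W_hg"] + h_loc @ params["W_hl"] + params["b_g"])

# Example with 3 scales of 8x8 patches and a location in [-1, 1] coordinates.
gn_params = init_glimpse_network(patch_dim=3 * 8 * 8)
example_rho = np.random.default_rng(1).random((3, 8, 8))
g_t = glimpse_network(gn_params, example_rho, loc=(0.25, -0.5))
```

Feeding the location in alongside the patches lets the output encode both what was seen and where it was seen.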

These output representations are then passed through the RNN model architecture. The fixation for the next step in the iteration is determined by the RL agent to maximize the reward from this architecture:

 
Model architecture (RNN)
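One step of such a recurrent core might look like the sketch below. It is an assumption-laden illustration, not the book's model: a plain rectified recurrent update is used, the next fixation is sampled from a Gaussian around a learned mean in normalized [-1, 1] coordinates, and a linear head produces class scores. Training is not shown; in a policy-gradient setup such as REINFORCE, the reward (for example, 1 for a correct final classification) would be used to adjust the location weights.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def init_core(glimpse_dim=256, hidden=256, n_classes=10, seed=0):
    """Randomly initialized weights for the recurrent core and its two heads."""
    rng = np.random.default_rng(seed)
    s = 0.05
    return {
        "W_hh": rng.normal(0, s, (hidden, hidden)),
        "W_gh": rng.normal(0, s, (glimpse_dim, hidden)),
        "b_h": np.zeros(hidden),
        "W_loc": rng.normal(0, s, (hidden, 2)), "b_loc": np.zeros(2),
        "W_cls": rng.normal(0, s, (hidden, n_classes)), "b_cls": np.zeros(n_classes),
    }

def core_step(params, h_prev, g_t, rng, loc_std=0.1):
    """Update the hidden state, sample the next fixation, and score classes."""
    h_t = relu(h_prev @ params["W_hh"] + g_t @ params["W_gh"] + params["b_h"])
    # Location head: mean of a Gaussian policy over the next fixation.
    loc_mean = np.tanh(h_t @ params["W_loc"] + params["b_loc"])
    loc_next = np.clip(rng.normal(loc_mean, loc_std), -1.0, 1.0)
    # Classification head: unnormalized class scores.
    logits = h_t @ params["W_cls"] + params["b_cls"]
    return h_t, loc_next, logits

# Example: unroll the core for a few steps on random glimpse features.
core_params = init_core()
step_rng = np.random.default_rng(0)
h = np.zeros(256)
for _ in range(4):
    fake_g = step_rng.random(256)  # stand-in for a Glimpse Network output
    h, next_loc, class_logits = core_step(core_params, h, fake_g, step_rng)
```

In a full model, next_loc would be converted to pixel coordinates and handed to the Glimpse Sensor for the following step.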

As can be intuitively understood, the Glimpse Sensor captures important information across fixations, which helps identify important concepts. For example, the Fixation shown in our second sample image is represented at three resolutions, marked in red, green, and blue in order of decreasing resolution. As can be seen, even when used directly, these representations vary in how well they allow us to detect the right digit in this noisy collage.
