
3   Silicon Retinas


This chapter introduces biological and silicon retinas, focusing on recently developed silicon retina vision sensors with an asynchronous address-event output. The first part of the chapter introduces biological retinas and four examples of Address-Event Representation (AER) retinas. The second part of the chapter discusses the details of some of the pixel designs and the specifications of these sensors. The discussion focuses on future goals for improvements.

3.1   Introduction

Chapter 2 discussed the event-based communication architectures for neuromorphic systems, and this chapter focuses on state-of-the-art electronic models of biological retinas that use this communication fabric to transmit their output data. Electronically emulating the retina has been a major target of neuromorphic electronic engineers since the field’s earliest times. When Fukushima et al. (1970) demonstrated their discrete model of the retina, it generated excitement although the practical industrial impact was minuscule. Nonetheless it probably inspired an entire generation of young Japanese engineers to work on cameras and vision sensors, which was evident in Japan’s domination of the electronic imager market, a domination that lasted for several decades. Likewise, the silicon retina of Mahowald and Mead (1991), featured on the cover of Scientific American, inspired a generation of American neuromorphic electronic engineers despite raw performance capabilities that barely enabled the recognizable imaging of a cat. It has taken more than 20 years of effort to go from these early lab prototypes to devices that can now be commercially purchased and used in applications. This chapter is about the current state of electronic retinas, with a focus on event-based silicon retinas. It starts with a brief synopsis of biological retinal function and then reviews some existing silicon retina vision sensors. The chapter concludes with an extensive discussion of silicon retina performance specifications and their measurements.

3.2   Biological Retinas

As illustrated in Figure 3.1, the retina is a complex structure with three primary layers: the photoreceptor layer, the outer plexiform layer, and the inner plexiform layer (Kuffler 1953; Masland 2001; Rodieck 1988; Werblin and Dowling 1969). The photoreceptor layer consists of two classes of cells: cones and rods, which transform the incoming light into an electrical signal that affects neurotransmitter release in the photoreceptor output synapses. The photoreceptor cells in turn drive horizontal cells and bipolar cells in the outer plexiform layer. The two major classes of bipolar cells, the ON bipolar cells and OFF bipolar cells, separately code for bright spatiotemporal contrast and dark spatiotemporal contrast changes. They do this by comparing the photoreceptor signals to spatiotemporal averages computed by the laterally connected layer of horizontal cells, which form a resistive mesh. The horizontal cells are connected to each other by conductive pores called gap junctions. Together with the input current produced at the photoreceptor synapses, this network computes spatiotemporally low-passed copies of the photoreceptor outputs. The horizontal cells and bipolar cell terminals come together in complex synapses at the rod and cone pedicles of photoreceptors. There the bipolar cells are effectively driven by differences between the photoreceptor and horizontal cell outputs, and in turn feed back onto the photoreceptors. In the even more complex inner plexiform layer, the ON and OFF bipolar cells synapse onto many types of amacrine cells and many types of ON and OFF ganglion cells. The horizontal and the amacrine cells mediate the signal transmission process between the photoreceptors and the bipolar cells, and the bipolar cells and the ganglion cells, respectively.


Figure 3.1 Cross section of the retina. Adapted from Rodieck (1988). Reprinted with permission of Sinauer Associates

The bipolar and ganglion cells can be further divided into two different groups consisting of cells with more sustained responses and cells with more transient responses. These cells carry information along at least two parallel pathways in the retina, the magnocellular pathway, where cells are sensitive to temporal changes in the scene, and the parvocellular pathway where cells are sensitive to forms in the scene. This partition into sustained and transient pathways is too simplistic; in reality, there are many parallel pathways computing many views (probably at least 50 in the mammalian retina) of the visual input. However, for the purpose of this chapter, which is to discuss developments of silicon models of retinal processing, a simplified view of biological vision is sufficient.

Biological retinas have many desirable characteristics that are lacking in conventional silicon imagers, of which we mention two characteristics that have been incorporated in silicon retinas. First, eyes operate over a wide dynamic range (DR) of light intensity, allowing vision in lighting conditions over nine decades, from starlight to bright sunlight. Second, the cells in the outer plexiform and inner plexiform layers code for spatiotemporal contrast, thus removing redundant information and allowing the cells to encode the signals within their limited DR (Barlow 1961).

While conventional imagers are well suited to produce sequences of pictures, they do not have these biological properties of local gain control and redundancy reduction. Conventional imagers are constrained by economics and the infrastructure of conventional imaging technology. Economics has dictated that small pixels are cheaper than large ones, and that camera outputs should consist of a stroboscopic sequence of static pictures. The responses of the pixels are read out serially, leading to the generation of frames on which almost all of machine vision has been historically based. By contrast, biological vision does not depend on frames, so the availability of activity-driven, event-based silicon retinas can help to instigate research in new algorithms that are not dependent on frame information. These algorithms have advantages for fast, low-power vision systems.

3.3   Silicon Retinas with Serial Analog Output

It is interesting to see the historical evolution of silicon retinas in the diagram shown in Figure 3.19 (page 66). All the early retina designs had a serial output like conventional imagers; retina output values were repetitively sampled – or ‘scanned’ – to produce at the output a ‘picture’ of the activity. The first silicon VLSI retina implemented a model of the photoreceptor cells, horizontal cells, and bipolar cells (Mahowald and Mead 1991). (Historically, an earlier electronic retina model by Fukushima et al. (1970) used discrete elements but in this chapter we focus on VLSI retinas.) Each silicon photoreceptor mimics a cone cell and contains both a continuous-time photosensor and adaptive circuitry that adjusts its response to cope with changing light levels (Delbruck and Mead 1994). A network of MOS variable resistors (the H-Res circuit discussed in Mead 1989) mimics the horizontal cell layer, furnishing feedback based on the average amount of light striking nearby photoreceptors; the bipolar cell circuitry amplifies the difference between the signal from the photoreceptor and the local average and rectifies this amplified signal into ON and OFF outputs. The response of the resulting retinal circuit approximates the behavior of the human retina (Mahowald 1994; Mead 1989). The ‘Parvo–Magno’ retina implementation discussed in Section 3.5.3 included both sustained and transient bipolar and ganglion cells and also amacrine cells. This design still comes closest to capturing the various cell classes of biological retinas. Besides these two retinas, many designers have produced silicon retina designs that usually implement a specific retinal function and that explore various circuit approaches to reduce mismatch in the chip responses so that the retinas can be used in applications. For example, the scanned silicon retina system of Kameda and Yagi (2003) uses a frame-based active pixel sensor (APS) as the phototransduction stage because of the better matching in the outputs of the pixel responses.

3.4   Asynchronous Event-Based Pixel Output Versus Synchronous Frames

Because of the frame-based nature of the output from cameras in most electronic systems, almost all computer vision algorithms are based on image frames, that is, on static pictures or sequences of pictures. This is natural given that frame-based devices have been dominant from the earliest days of electronic imaging. Although frame-based sensors have the advantages of small pixels and compatibility with standard machine vision, they also have clear drawbacks: the pixels are sampled repetitively even if their values are unchanged; short-latency vision problems require high frame rate and produce massive output data; DR is limited by the identical pixel gain, the finite pixel capacity for integrated photocharge, and the identical integration time. The high computational cost of machine vision with conventional imagers is largely the practical reason for the recent development of AER silicon retinas. These sensors are discussed next.

3.5   AER Retinas

The asynchronous nature of AER output is inspired by the way in which neurons communicate over long range. Rather than sampling pixel values, AER retina pixels asynchronously output address events when they detect a significant signal. The definition of ‘significant’ depends on the sensor, but in general, these events should be much sparser for typical visual input than repetitive samples of analog value. In other words, good retina pixels autonomously determine their own region of interest (ROI), a job that is left to the system level for conventional image sensors with ROI readout control.

The design and use of these AER vision sensors is still relatively novel. Even though the first integrated AER retina was built in 1992, there was not much progress in usable AER vision sensors until recently. The main obstacles to advancement were the unfamiliarity with asynchronous logic and poor uniformity of pixel response characteristics. Industry is unfamiliar with frame-free vision sensors and understandably wary of large pixels with small fill factors. In recent years, the technology has progressed much faster mostly due to the increasing number of people working in this area. Recent AER retinas typically implement only one or two of the retina cell types so as to keep the fill factor still reasonably large and to have a low pixel array response variance. Table 3.1 compares the AER silicon retinas discussed here, along with a conventional CMOS camera for machine vision and the human eye. From even a cursory inspection we can see that cameras lag far behind the human eye, and that silicon retinas have far fewer pixels and consume much more power per pixel than a standard CMOS camera. Of course this low power consumption of a conventional camera leaves out the processing costs. If the silicon retina reduces redundancy or allows the vision processor to run at lower speed or bandwidth, then system-level power consumption can be greatly reduced by using it. In any case, we can still attempt to place existing sensors in classes in order to guide our thinking. AER retinas can be broadly divided into the following classes as listed in the row labeled ‘Class’ in the table:

  • Spatial contrast (SC) sensors reduce spatial redundancy based on intensity ratios, in contrast to spatial difference sensors, which use intensity differences. SC sensors are more useful for analyzing static scene content for the purpose of feature extraction and object classification.
  • Temporal contrast (TC) sensors reduce temporal redundancy based on relative intensity change, in contrast to temporal difference sensors, which use absolute intensity change. TC sensors are more useful for dynamic scenes with nonuniform scene illumination, for applications such as object tracking and navigation.

The exposure readout and pixel reset mechanisms can also be divided broadly into two classes:

  • Frame event (FE) sensors which use synchronous exposure of all pixels and scheduled event readout.
  • Asynchronous event (AE) sensors whose pixels continuously generate events based on a local decision about immediate relevance.

The next sections discuss examples of five different silicon retinas: the dynamic vision sensor (DVS) (Lichtsteiner et al. 2008), the asynchronous time-based image sensor (ATIS) (Posch et al. 2011), the Parvo–Magno spatial and TC retina (Zaghloul and Boahen 2004a, 2004b), the biomorphic Octopus image sensor (Culurciello et al. 2003), and the SC and orientation sensor (VISe) (Ruedi et al. 2003). Each of these vision sensors was designed with particular aims.

3.5.1   Dynamic Vision Sensor

The DVS responds asynchronously to relative temporal changes in intensity (Lichtsteiner et al. 2008). The sensor outputs an asynchronous stream of pixel address-events (AEs) that encode scene reflectance changes, thus reducing data redundancy while preserving precise timing information. These properties are achieved by modeling three key properties of biological vision: sparse, event-based output, representation of relative luminance change (thus directly encoding scene reflectance change), and rectification of positive and negative signals into separate output channels. The DVS improves on prior frame-based temporal difference detection imagers (e.g., Mallik et al. 2005) by asynchronously responding to TC rather than absolute illumination, and on prior event-based imagers which do not reduce redundancy at all (Culurciello et al. 2003), reduce only spatial redundancy (Ruedi et al. 2003), have large fixed pattern noise (FPN), slow response, and limited DR (Zaghloul and Boahen 2004a), or have low contrast sensitivity (Lichtsteiner et al. 2004).

Table 3.1 Comparison of the AER silicon retinas discussed in this chapter, a conventional CMOS machine vision camera, and the human eye

It would seem that by discarding all the DC information the DVS would be hard to use. However, for solving many dynamic vision problems this static information is not helpful. The DVS has been applied for various applications: high-speed ball tracking in a goalie robot (Delbrück and Lichtsteiner 2007), building an impressive pencil balancing robot (Conradt et al. 2009a, 2009b), highway traffic analysis (Litzenberger et al. 2006), stereo-based head tracking for people counting (Schraml et al. 2010), for tracking particle motion in fluid dynamics (Drazen et al. 2011), tracking microscopic particles (Ni et al. 2012), stereo vision based on temporal correlation (Rogister et al. 2012), and stereo-based gesture recognition (Lee et al. 2012). In other unpublished work (implementations of these unpublished projects are open-sourced in jAER (2007) as the classes Info, BeeCounter, SlotCarRacer, FancyDriver, and LabyrinthGame), the sensor has also been used for analyzing sleep-wake activity of mice for a week, counting bees entering and leaving a hive, building a self-driving slot car racer, driving a high-speed electric monster truck along a chalked line, and controlling a ball on a Labyrinth game table.

DVS Pixel

The DVS pixel design uses a combination of continuous and discrete time operation, where the timing is self-generated. The use of a self-timed switched capacitor architecture leads to well-matched pixel response properties and fast, wide DR operation. To achieve these characteristics, the DVS pixel uses a fast logarithmic photoreceptor circuit, a differencing circuit that amplifies changes with high precision, and simple comparators. Figure 3.2a shows how these three components are connected. The photoreceptor circuit automatically controls individual pixel gain (by its logarithmic response) while at the same time responding quickly to changes in illumination. The drawback of this photoreceptor circuit is that transistor threshold variation causes substantial DC mismatch between pixels, necessitating calibration when this output is used directly (Kavadias et al. 2000; Loose et al. 2001). The DC mismatch is removed by balancing the output of the differencing circuit to a reset level after the generation of an event. The gain of the change amplification is determined by the well-matched capacitor ratio C1/C2. The effect of random comparator mismatch is reduced by the precise gain of the differencing circuit.

Due to the logarithmic conversion in the photoreceptor and the removal of DC by the differencing circuit, the pixel is sensitive to ‘temporal contrast’ TCON, which is defined as

$$\mathrm{TCON} = \frac{d\,\ln I(t)}{dt} = \frac{1}{I(t)}\,\frac{dI(t)}{dt} \qquad (3.1)$$

where I is the photocurrent. (The units of I do not affect d(logI).) The log here is the natural logarithm. Figure 3.2b illustrates the principle of operation of the pixel. The operation of some of the pixel components will be discussed in the latter part of this chapter.
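
To make the event-generation principle concrete, the following Python sketch (illustrative only; the threshold values, the test signal, and the function name are assumptions, not part of the published design) emits ON and OFF events whenever the log intensity has changed by more than the contrast threshold since the last emitted event:

```python
import numpy as np

def dvs_events(intensity, theta_on=0.15, theta_off=0.15):
    """Idealized DVS pixel model: emit ON/OFF events when the log intensity
    changes by more than the contrast threshold since the last event.
    `intensity` is a 1D array of photocurrent samples (arbitrary units)."""
    log_i = np.log(intensity)
    ref = log_i[0]                 # memorized level at the last reset
    events = []                    # (sample index, +1 for ON, -1 for OFF)
    for k, x in enumerate(log_i):
        while x - ref >= theta_on:     # brightness increased enough
            ref += theta_on
            events.append((k, +1))
        while ref - x >= theta_off:    # brightness decreased enough
            ref -= theta_off
            events.append((k, -1))
    return events

# Example: a sinusoidally modulated intensity produces alternating bursts of
# ON and OFF events whose rate follows |d(log I)/dt| / theta.
t = np.linspace(0, 1, 10000)
I = 1.0 + 0.5 * np.sin(2 * np.pi * 5 * t)
print(len(dvs_events(I)), "events")
```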


Figure 3.2 DVS pixel. (a) Simplified pixel schematic. (b) Principle of operation. In (a), the inverters are symbols for single-ended inverting amplifiers. © 2008 IEEE. Reprinted, with permission, from Lichtsteiner et al. (2008)

The DVS was originally developed as part of the CAVIAR multichip AER vision system (Serrano-Gotarredona et al. 2009) described in Chapter 16, but it can be directly interfaced to other AER components that use the same word-parallel protocol or can be adapted to other AER protocols with glue logic. The DVS streams time-stamped address events to a host PC over a high-speed USB 2.0 interface. The camera architecture is discussed in more detail in Chapter 13. On the host side, there is a lot of complexity in acquiring, rendering, and processing the nonuniformly distributed, asynchronous retina events in real time on a hardware single-threaded platform like a PC or microcontroller. The infrastructure implemented through the open-source jAER project, consisting of several hundred Java classes, allows users to capture retina events, monitor them in real time, control the on-chip bias generators, and process them for various applications (jAER 2007).

DVS Example Data

Perhaps the most important section of any report for users (and paper reviewers) is the one displaying example data from real-world scenarios. As an example, the images in Figure 3.3 show DVS example data from Lichtsteiner et al. (2008). The dynamic properties are illustrated by using grayscale or 3D to show the time axis. The ‘Rotating Dot’ panel shows the events generated by a black dot drawn on a white disk rotating at 200 revolutions per second under indoor fluorescent office illumination of 300 lux. The events are rendered both in space-time over 10 ms and as a briefer snapshot image spanning 300 μs. The spinning dot forms a helix in space time. The ‘Faces’ image was collected indoors at night with illumination from a 15-W fluorescent desk lamp. The ‘Driving Scene’ was collected outdoors under daylight from a position on a car dashboard when just starting to cross an intersection. The ‘Juggling Event Time’ image shows the event times as grayscale. The paths of the balls are shaded in gray showing the direction of movement. For example, the uppermost ball is rising because the top part of the path is darker, indicating later events in the time slice. The ‘Eye’ image shows events from a moving eye under indoor illumination. The ‘Highway Overpass’ images show events produced by cars on a highway viewed from an overpass in late afternoon lighting, on the left displayed as ON and OFF events and on the right as relative time during the snapshot.


Figure 3.3 DVS data taken under natural lighting conditions with either object or camera motion. These are rendered as 2D histograms of collected events, either as contrast (gray scale represents reconstructed gray scale change), gray scale time (event time is shown as gray scale, black = young, gray = old, white = no event), or 3D space time. © 2008 IEEE. Reprinted, with permission, from Lichtsteiner et al. (2008)

3.5.2   Asynchronous Time-Based Image Sensor

The ATIS (Posch et al. 2011) combines a DVS pixel and an intensity measurement unit (IMU) in each pixel. DVS events are used to trigger time-based intensity measurements. This impressive sensor, with a resolution of 304 × 240 30 μm pixels in a 180 nm technology (see Table 3.1), combines in each pixel the notions of TC detection (Lichtsteiner et al. 2008) with pulse-width modulated (PWM) intensity encoding (Qi et al. 2004). It uses a new time-based correlated double sampling (CDS) circuit (Matolin et al. 2009) to output pixel gray level values only from pixels that change. This intensity readout scheme allows high DR exceeding 130 dB, which is especially valuable in surveillance scenarios. A good example is Belbachir et al. (2012).

The ATIS pixel asynchronously requests access to an output channel only when it has a new illumination value to communicate. The asynchronous operation avoids the time quantization of frame-based acquisition and scanning readout. The gray level is encoded by the time between two additional IMU events. This IMU measurement is started by a DVS event in the pixel. Example data from the ATIS is shown in Figure 3.4. The transistor circuits are discussed in the second part of this chapter in Section 3.6.2, with a focus on its effective time-based CDS.

3.5.3   Asynchronous Parvo–Magno Retina Model

The Parvo–Magno retina by Zaghloul and Boahen (2004a) is a very different type of silicon retina, focused on modeling both the outer and inner retina, including sustained (Parvo) and transient (Magno) types of cells, based on histological and physiological findings. It is an improvement over the first retina by Mahowald and Mead, which modeled only the outer retina circuitry, that is, the cones, horizontal cells, and bipolar cells (Mahowald and Mead 1991). The use of the Parvo–Magno retina in an orientation feature extraction system is described in Section 13.3.2.


Figure 3.4 ATIS example data. (a) Capture of image of student walking to school is triggered externally by a rolling (row-wise) reset of all pixels. Subsequently, only intensity values of pixels that change (e.g., the moving student) are updated. (b) The sparse DVS events are generated in the ATIS pixels by the moving person. Reproduced with permission of Garrick Orchard

The Parvo–Magno design captures key adaptive features of biological retinas including light and contrast adaptation, and adaptive spatial and temporal filtering. By using small transistor log-domain circuits that are tightly coupled spatially by diffuser networks and single-transistor synapses, they were able to achieve 5760 phototransistors at a density of 722/mm2 and 3600 ganglion cells at a density of 461/mm2 tiled in 2 × 48 × 30 and 2 × 24 × 15 mosaics of sustained and transient ON and OFF ganglion cells in a 3.5 × 3.3 mm2 area, using a 0.25 μm process (see Table 3.1).

The overall Parvo–Magno model is shown in Figure 3.5. The cone outer segments (CO) supply photocurrent to cone terminals (CT), which excite horizontal cells (HC). Horizontal cells reciprocate with shunting inhibition. Both cones and horizontal cells are electrically coupled to their neighbors by gap junctions. Horizontal cells modulate cone to horizontal cell excitation and cone gap junctions. ON and OFF bipolar cells (BC) relay cone signals to ganglion cells (outputs) and excite narrow- and wide-field amacrine cells (NA, WA). They also excite amacrine cells that inhibit complementary bipolars and amacrines. Narrow-field amacrine cells inhibit bipolar terminals and wide-field amacrine cells; their inhibition onto wide-field amacrine cells is shunting. They also inhibit transient ganglion cells (OnT, OffT), but not sustained ganglion cells (OnS, OffS). Wide-field amacrine cells modulate narrow-field amacrine cell pre-synaptic inhibition and spread their signals laterally through gap junctions.

The outer retina’s synaptic interactions realize spatiotemporal bandpass filtering and local gain control. The model of the inner retina realizes contrast gain control (the control of sensitivity to TC) through modulatory effects of wide-field amacrine cell activity. As TC increases, their modulatory activity increases, the net effect of which is to make the transient ganglion cells respond more quickly and more transiently. This silicon retina outputs spike trains that capture the behavior of ON- and OFF-center wide-field transient and narrow-field sustained ganglion cells, which provide 90% of the primate retina’s optic nerve fibers. These are the four major types that project, via thalamus, to the primary visual cortex.


Figure 3.5 Parvo–Magno retina model. (a) Synaptic organization. (b) Single-transistor synapses. Adapted from Zaghloul and Boahen (2006). © IOP Publishing. Reproduced by permission of IOP Publishing. All rights reserved


Figure 3.6 Parvo–Magno chip responses. (a) A raster plot of the spikes (top) and histogram (bottom, bin width = 20 ms) recorded from a single column of the chip array. The stimulus was a 3 Hz, 50%-contrast drifting sinusoidal grating whose luminance varied horizontally across the screen and was constant in the vertical direction. The amplitude of the fundamental Fourier component of these histograms is plotted in all frequency responses presented here. (b) The distribution of firing rates for the four types of GC outputs, demonstrating the amount of response variability. Log of the firing rate f is plotted versus the probability density p(f). Histograms are computed from all active cells of a given type in the array (40% of the sustained cells and 60% of the transient cells did not respond). (c) Top image shows that edges in a static scene are enhanced by sustained type GC outputs of the chip. Bottom image shows that an optimal mathematical reconstruction of the image from sustained GC activity demonstrates fidelity of retinal encoding despite the extreme variability. © 2004 IEEE. Reprinted, with permission, from Zaghloul and Boahen (2004b)

Figure 3.6 shows sample data from the Parvo–Magno retina. Clearly this device has complex and interesting properties, but the extremely large mismatch between pixel response characteristics makes it quite hard to use.

3.5.4   Event-Based Intensity-Coding Imagers (Octopus and TTFS)

Rather than modeling the redundancy reduction and local computation done in the retina, some imagers simply use the AER representation to directly communicate pixel intensity values. These sensors encode the pixel intensity either in the spike frequency (Azadmehr et al. 2005; Culurciello et al. 2003) or in the time to first spike, referenced to a reset signal (Qi et al. 2004; Shoushun and Bermak 2007).

The so-called ‘Octopus retina’ by Culurciello et al. (2003), for example, communicates the pixel intensity through the frequency or inter-spike interval (ISI) of the AER event outputs. (The Octopus retina is named after the animal because the octopus eye, and those of other cephalopods, is inverted compared to mammalian eyes – the photoreceptors are on the inside of the eye rather than the outside – and some of the retinal processing is apparently done in the optic lobe. However, the octopus eye still has a complex architecture and is capable, for instance, of discriminating polarized light.) This imager is also used as a front end of the IFAT system described in Section 13.3.1.

This biologically inspired readout method offers pixel-parallel readout. In contrast, a serially scanned array allocates an equal portion of the bandwidth to all pixels independent of activity and continuously dissipates power because the scanner is always active. In the Octopus sensor, brighter pixels are favored because they reach their integration threshold faster than darker pixels; their AER events are therefore communicated at a higher frequency, so brighter pixels request the output bus more often than darker ones and are updated more frequently. Another consequence of this asynchronous image readout is image lag behind moving objects. A moving object consisting of bright and dark parts will result in later intensity readings for the dark parts than for the bright parts, distorting the object’s shape and creating motion artifacts similar to those seen from the ATIS (Section 3.6.2).

Octopus retina example images are shown in Figure 3.7. The Octopus scheme allows wide DR, which is why the shaded face and light bulb scenes are visible when the images are rendered using a logarithmic encoding. But the rather large FPN makes the sensor hard to sell for conventional imaging. This FPN is due mostly to threshold variation in the individual pulse frequency modulation (PFM) circuits.

The Octopus imager is composed of 80 × 60 pixels of 32 × 30 μm2 fabricated in a 0.6 μm CMOS process. It has a DR of 120 dB (under uniform bright illumination and when no lower bound is placed on the update rate per pixel). The DR is 48.9 dB at 30 pixel updates per second. Power consumption is 3.4 mW under uniform indoor light with a mean event rate of 200 kHz, which updates each pixel 41.6 times per second. The imager is capable of updating each pixel 8300 times per second under bright illumination.

The Octopus imager has the advantage of a small pixel size compared to the other AER retinas, but has the disadvantage that the bus bandwidth is allocated according to the local scene luminance. Because there is no reset mechanism and the event interval directly encodes intensity, a dark pixel can take a long time to emit an event, and a single highlight in the scene can saturate the off-chip communication spike address bus. Another drawback of this approach is the complexity of the digital frame grabber required to count the spikes produced by the array. The buffer must either count events over some time interval, or hold the latest spike time and use this to compute the intensity value from the ISI to the current spike time. This however leads to a noisier image. The Octopus retina would probably be most useful for tracking small and bright light sources.
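As a rough sketch of what such a receiver must do, the following hypothetical Python fragment reconstructs a per-pixel intensity estimate from event timestamps using the latest ISI; the function, the event format, and the scale constant k are illustrative assumptions, not part of the published system:

```python
import numpy as np

def isi_intensity(event_ts, event_addr, n_pixels, k=1.0):
    """Estimate pixel intensity from the latest inter-spike interval (ISI).
    event_ts: event timestamps in seconds (sorted); event_addr: pixel index of
    each event. Intensity is taken proportional to 1/ISI; pixels that have
    fired fewer than two events remain zero (unknown)."""
    last_ts = np.full(n_pixels, np.nan)
    intensity = np.zeros(n_pixels)
    for ts, addr in zip(event_ts, event_addr):
        if not np.isnan(last_ts[addr]):
            isi = ts - last_ts[addr]
            if isi > 0:
                intensity[addr] = k / isi   # brighter pixels -> shorter ISI
        last_ts[addr] = ts
    return intensity
```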


Figure 3.7 Example images from a fabricated Octopus retina of 80 × 60 pixels. (a) Linear intensity (top) and log (bottom) scales. The images have been reconstructed by counting large numbers of spike events. (b) Comparing instantaneous ISI with average spike count (here the average count over image is 200 spikes). © 2003 IEEE. Reprinted, with permission, from Culurciello et al. (2003)

The alternate way of coding intensity information is the time to first spike (TTFS) coding implementation from the group of Harris and others (Luo and Harris 2004; Luo et al. 2006; Qi et al. 2004; Shoushun and Bermak 2007). In this encoding method, the system does not require the storage of a large number of spikes, since every pixel generates only one spike per frame. A similar coding method was also suggested as a rank-order scheme used by neurons in the visual system to code information (Thorpe et al. 1996). The global threshold for generating a spike in each pixel can be reduced over the frame time so that dark pixels will still emit a spike in a reasonable amount of time. A disadvantage of the TTFS sensors is that uniform parts of the scene all try to emit their events at the same time, overwhelming the AER bus. This problem can be mitigated, for example, by serially resetting rows of pixels, but then the problem remains that a particular image could still cause simultaneous emission of events.
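
A minimal sketch of TTFS decoding, assuming the simple fixed-threshold case (not the ramped-threshold variant) and an arbitrary threshold charge, could look like this:

```python
import numpy as np

def ttfs_intensity(first_spike_t, q_threshold=1.0):
    """Decode a time-to-first-spike (TTFS) frame. With a fixed integration
    threshold charge q_threshold, a pixel whose single spike arrives at time t
    after the frame reset had a photocurrent of roughly q_threshold / t.
    first_spike_t: per-pixel spike times in seconds (np.inf if no spike)."""
    t = np.asarray(first_spike_t, dtype=float)
    intensity = np.zeros_like(t)
    valid = np.isfinite(t) & (t > 0)
    intensity[valid] = q_threshold / t[valid]   # early spike -> bright pixel
    return intensity

# Example: pixels spiking at 1 ms, 10 ms, and never
print(ttfs_intensity([0.001, 0.010, np.inf]))   # -> [1000., 100., 0.]
```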

The TTFS notion is used in the ATIS (Section 3.5.2) and in a related form in the VISe, discussed next. In the ATIS, the DVS events in individual pixels trigger a reading of intensity like the TTFS sensors have, but because the reading is local and triggered by intensity change, it overcomes some of the main drawbacks of the TTFS sensors.

3.5.5   Spatial Contrast and Orientation Vision Sensor (VISe)

The beautifully designed VISe (VISe stands for VIsion Sensor) (Ruedi et al. 2003) has a complementary functionality to the DVS (Section 3.5.1). The VISe outputs events that code for spatial rather than TC. The VISe pixels compute the contrast magnitude and angular orientation. It differs from the previous sensors in that it outputs these events using a temporal ordering according to the contrast magnitude. This ordering can greatly reduce the amount of data delivered. The VISe design results in impressive performance, particularly in terms of low mismatch and high sensitivity.

A VISe cycle starts with a global frame exposure, for example, 10 ms. Next the VISe transmits address events in order of high-to-low SC. Pixels that emit events with the earliest times correspond to pixels that see the highest local SC. If limited processing time is available, readout can be aborted early without losing information about high-contrast features. Each contrast event is followed by another event that encodes gradient orientation by the timing of the event relative to global steering functions, which are described later. The VISe has a low contrast mismatch of 2%, a contrast direction precision of 3 bits, and a large 120 dB illumination DR. The main limitation of this architecture is that it does not reduce temporal redundancy (i.e., it does not compute temporal derivatives), and its temporal resolution is limited to the frame rate.

The traditional APS approach to acquire an image with a CMOS imager consists of integrating on a capacitor, in each pixel of an array, the photocurrent delivered by a photodiode for a fixed exposure time (Fossum 1995). Without saturation, the resulting voltages are then proportional to the photocurrents. In the VISe pixel, like the Octopus or ATIS retina, instead of integrating photocurrents for a fixed duration, photocurrents are integrated until a given reference voltage is reached. This principle as used in the VISe is illustrated in Figure 3.8, where a central pixel and its four neighbors are considered (left, right, top, and bottom). In each pixel, the photocurrent Ic is integrated on a capacitor C. The resulting voltage Vc is continuously compared to a reference voltage Vref. Once Vc reaches Vref, the central pixel samples the voltages VL, VR, VT, and VB of the four neighboring pixels and stores these voltages on capacitors.


Figure 3.8 Contrast generation in the VISe. © 2003 IEEE. Reprinted, with permission, from Ruedi et al. (2003)

The local integration time TS is given by TS = C Vref / IC. If IX is the photocurrent in neighbor X, then the sampled voltage will be

$$V_X = \frac{I_X}{C}\,T_S = V_{\mathrm{ref}}\,\frac{I_X}{I_C} \qquad (3.2)$$

VX depends only on the ratio of the photocurrents generated in pixels X and C. With the assumption that the photocurrent of the central pixel corresponds to the average of its four neighbors – which is mostly fulfilled with the spatial low-pass characteristics of the optics – one obtains

$$I_C \simeq \frac{I_L + I_R + I_T + I_B}{4} \qquad (3.3)$$

Combining Eqs. (3.2) and (3.3) and solving for VR − VL gives

$$V_R - V_L = V_{\mathrm{ref}}\,\frac{I_R - I_L}{I_C} = 4\,V_{\mathrm{ref}}\,\frac{I_R - I_L}{I_L + I_R + I_T + I_B} \qquad (3.4)$$

The same form of equation can be derived for VT − VB. Equation (3.4) is equivalent to the conventional definition of the Michelson contrast, which is the difference divided by the average.

The difference voltages VX = VR − VL and VY = VT − VB can be considered the X and Y components of a local contrast vector. If the illumination level is decreased by a factor of three, as illustrated in Figure 3.8, the time elapsed between the start of the integration and the sampling of the four neighbors is three times longer so that the sampled voltages remain the same.
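
A small Python sketch of this computation, following Eqs. (3.2)–(3.4) with made-up photocurrent values, illustrates the illumination invariance (the function and parameter names are assumptions, not from the VISe publication):

```python
import numpy as np

def vise_contrast(I_L, I_R, I_T, I_B, I_C=None, V_ref=1.0):
    """Sketch of the VISe contrast computation (Eqs. 3.2-3.4). Neighbor
    voltages are sampled when the center pixel's integration reaches V_ref,
    so V_X = V_ref * I_X / I_C; the X and Y contrast components are the
    right-left and top-bottom differences."""
    if I_C is None:                       # assumption of Eq. (3.3)
        I_C = (I_L + I_R + I_T + I_B) / 4.0
    V = {x: V_ref * I / I_C for x, I in
         dict(L=I_L, R=I_R, T=I_T, B=I_B).items()}
    V_X = V['R'] - V['L']                 # horizontal, Michelson-like contrast
    V_Y = V['T'] - V['B']                 # vertical contrast
    magnitude = np.hypot(V_X, V_Y)
    orientation = np.arctan2(V_Y, V_X)    # radians
    return magnitude, orientation

# Scaling all photocurrents by the same factor (an illumination change)
# leaves the result unchanged, showing the illumination invariance.
print(vise_contrast(1.0, 1.2, 1.1, 0.9))
print(vise_contrast(3.0, 3.6, 3.3, 2.7))
```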

The computation of the contrast direction and magnitude of the contrast vector defined in Eqs. (3.3) and (3.4), and the readout of the combined contrast and orientation signal, are together illustrated in Figure 3.9. Each pixel continuously computes the scalar projection F(t) of the local contrast vector on a rotating unit vector û(t). When û(t) points in the same direction as the local contrast vector, F(t) goes through a maximum. Thus, F(t) is a sine wave whose amplitude and phase represent the magnitude and direction of the contrast vector. The projection is computed locally, in each pixel, by multiplying the local contrast components by globally distributed steering signals. The steering signals are sinusoidal voltages arranged in quadrature to represent the rotation of û(t).

To illustrate these operations and the data output, a frame acquisition and the whole sequence of operations for two pixels a and b are shown in Figure 3.9. During a first phase, photocurrents are integrated. The steering functions are then turned on. During the first steering period, Fa(t) and Fb(t) go through a maximum, which is detected and memorized by a simple maximum detector (peaks a and b). During the subsequent periods, a monotonically decreasing threshold function is distributed to all pixels in parallel. In each pixel, this threshold function is continuously compared to the stored maximum. To output the contrast magnitude, an event encoding the address (X, Y) of the pixel is emitted on the AER bus (Mortara et al. 1995) when the threshold function reaches the maximum of the individual pixel. The contrast present at each pixel is thus time encoded, with high contrasts preceding lower ones. This scheme limits data transmission to pixels with a contrast higher than a given value. To dispatch the contrast direction, a second event is emitted at the first zero crossing with positive slope that follows. This way, the contrast direction is ordered in time according to the contrast magnitude. Without gating of the contrast direction by the contrast magnitude, each pixel would fire a pulse at the first occurrence of the zero crossing. Notice that the zero crossing is shifted by −90° with respect to the maximum of F(t).
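
The temporal ordering of the contrast-magnitude events can be sketched as follows; the linear threshold ramp and the parameter values are assumptions made purely for illustration:

```python
import numpy as np

def vise_event_order(magnitudes, ramp_start=1.0, ramp_rate=1.0):
    """Sketch of the VISe contrast-magnitude readout ordering. A global
    threshold ramps down from ramp_start; each pixel emits its address event
    when the ramp reaches its stored contrast maximum, so higher contrasts
    are transmitted earlier. Returns emission times and the pixel indices
    sorted into readout order."""
    m = np.asarray(magnitudes, float)
    t_emit = (ramp_start - m) / ramp_rate      # high contrast -> early event
    order = np.argsort(t_emit)                 # readout order on the AER bus
    return t_emit, order

# Readout aborted after a cutoff time still delivers the strongest edges.
t_emit, order = vise_event_order([0.05, 0.8, 0.3, 0.6])
print(order)            # -> [1 3 2 0]: highest-contrast pixel first
```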


Figure 3.9 VISe computation and data output principle (two pixels a and b are represented). © 2003 IEEE. Reprinted, with permission, from Ruedi et al. (2003)

Figure 3.10 illustrates a number of images of the contrast representation output by the sensor. Even high-DR scenes, such as a person next to an illuminated light bulb, an indoor/outdoor scene, or a scene containing the sun, are still represented clearly, and even low-contrast edges are cleanly output.

Although the VISe is a lovely and imaginative architecture, it has been replaced in the group’s developments by a different approach based around a digital logarithmic image sensor tightly integrated with a custom DSP (Ruedi et al. 2009). The reason cited for abandoning the VISe approach is the desire to scale better to smaller digital process technologies and the desire to integrate tightly to a custom digital coprocessor on the same die.


Figure 3.10 VISe SC output. Each image shows the detected contrast magnitude as gray level. The magnitude of contrast is encoded by the time of the emission of the event relative to start of readout. © 2003 IEEE. Reprinted, with permission, from Ruedi et al. (2003)

3.6   Silicon Retina Pixels

The first part of this chapter presented an overview of the DVS, the ATIS, the Parvo–Magno retina, the Octopus, and the VISe vision sensors. The second part of this chapter discusses the pixel details of some of the retina designs.

3.6.1   DVS Pixel

Figure 3.2 shows an abstract view of the DVS pixel. This section discusses the actual implementation of the pixel, shown in Figure 3.11. The pixel consists of a photoreceptor, a buffer, a differencing amplifier, and comparators, plus the AER handshaking circuits.

The photoreceptor circuit uses a well-known transimpedance configuration (see, e.g., Delbruck and Mead 1994) that converts the photocurrent logarithmically into a voltage. The photodiode PD photocurrent is sourced by a saturated nFET Mfb. The gate of Mfb is connected to the output of an inverting amplifier (Mpr, Mcas, Mn) whose input is connected to the photodiode. Because this configuration holds the photodiode clamped at a virtual ground, the bandwidth of the photoreceptor is extended by the factor of the loop gain in comparison to a passive logarithmic photoreceptor circuit. This extended bandwidth is beneficial for high-speed applications, especially in low lighting conditions. The global sum of all photocurrents is available at the ΣI node, where it is sometimes used for adaptive control of bias values.


Figure 3.11 Complete DVS pixel circuit. (a) Analog part. (b) AER handshaking circuits. (c) Pixel layout. © 2008 IEEE. Reprinted, with permission, from Lichtsteiner et al. (2008)

The photoreceptor output Vp is buffered with a source follower to Vsf to isolate the sensitive photoreceptor from the rapid transients in the differencing circuit. (It is important in the layout to avoid any capacitive coupling to Vd because this can lead to false events and excessive noise.) The source follower drives the capacitive input of the differencing circuit.

Next the differencing amplifier amplifies changes in the buffer output. This inverting amplifier with capacitive feedback is balanced with a reset switch that shorts its input and output together, resulting in a reset voltage level that is about a diode drop from Vdd. Thus the change of Vdiff from its reset level represents the amplified change of log intensity.

It is straightforward to see that the relation between the TC TCON defined in Eq. (3.1) and Vdiff is given by

$$\Delta V_{\mathrm{diff}} = -A\,\kappa_{\mathrm{sf}}\,\frac{U_T}{\kappa_{\mathrm{fb}}} \int \mathrm{TCON}\;dt = -A\,\kappa_{\mathrm{sf}}\,\frac{U_T}{\kappa_{\mathrm{fb}}}\,\Delta \ln I \qquad (3.7)$$

where A = C1/C2 is the differencing circuit gain, UT is the thermal voltage, and κx is the subthreshold slope factor of transistor MX. This equation might make it appear that the amplifier somehow integrates, and yes, this is true, it does integrate TCON, but TCON is already the derivative of log intensity. Therefore, it boils down to simple amplification of the change in log intensity since the last reset was released.

The comparators (MONn, MONp, MOFFn, MOFFp) compare the output of the inverting amplifier against global thresholds that are offset from the reset voltage to detect increasing and decreasing changes. If the input of a comparator overcomes its threshold, an ON or OFF event is generated.

Replacing ΔVdiff in Eq. (3.7) by comparator input thresholds don and doff and solving for Δ ln(I) yield the threshold positive and negative TCs θon and θoff that trigger ON or OFF events:

$$\theta_{\mathrm{on}} = -\frac{\kappa_{\mathrm{fb}}\,(d_{\mathrm{on}} - \mathrm{diff})}{A\,\kappa_{\mathrm{sf}}\,U_T}, \qquad \theta_{\mathrm{off}} = -\frac{\kappa_{\mathrm{fb}}\,(d_{\mathrm{off}} - \mathrm{diff})}{A\,\kappa_{\mathrm{sf}}\,U_T}$$

where don − diff is the ON threshold and doff − diff is the OFF threshold, with diff denoting the reset level of Vdiff.

The threshold TC θ has dimensions of natural log of intensity and is hereafter called contrast threshold. Each event represents a change of log intensity of θ. For smoothly varying TCs, the rate R(t) of generated ON and OFF events can be approximated with

$$R(t) \approx \frac{1}{\theta}\left|\frac{d\,\ln I(t)}{dt}\right| = \frac{|\mathrm{TCON}(t)|}{\theta}$$
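
As a purely illustrative numeric example (all parameter values below are assumed and are not taken from any specific fabricated design), the relations above give roughly the following contrast threshold and event count:

```python
import math

# Illustrative numbers only; none are taken from a specific fabricated design.
A = 20.0          # differencing circuit gain C1/C2 (assumed)
U_T = 0.025       # thermal voltage at room temperature, volts
kappa_sf = 0.7    # subthreshold slope factor of the source follower (assumed)
kappa_fb = 0.7    # subthreshold slope factor of the feedback transistor (assumed)
dV_thr = 0.07     # comparator threshold measured from the reset level, volts (assumed)

theta = kappa_fb * dV_thr / (A * kappa_sf * U_T)
print(f"contrast threshold ~ {theta:.2f} "
      f"(about {100 * theta:.0f}% intensity change per event)")

# A 2:1 contrast edge (log contrast ln 2 ~ 0.69) then yields roughly
print(f"{math.log(2) / theta:.1f} events per pixel per edge crossing")
```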

ON and OFF events are communicated to the periphery by circuits that implement the 4-phase address-event handshaking (see Chapter 2) with the peripheral AE circuits discussed in detail in Chapter 12. The row and column ON and OFF request signals (RR, CRON, CROFF) are generated individually, while the acknowledge signals (RA, CA) are shared. They can be shared because the pixel makes either an ON or OFF event, never both simultaneously. The row signals RR and RA are shared by pixels along rows and the signals CRON, CROFF, and CA are shared along columns. The signals RR, CRON, and CROFF are pulled high by statically biased pFET row and column pull-ups. When either the ON or OFF comparator changes state from its reset condition, the communication cycle starts. The communication cycle ends by turning on the reset transistor Mr which removes the pixel request.

Mr resets the pixel circuit by balancing the differencing circuit. This transistor also has the important function of enabling an adjustable refractory period Trefr (implemented by the starved NAND gate consisting of Mrof, MRA, and MCA), during which the pixel cannot generate another event. This refractory period limits the maximum firing rate of individual pixels to prevent small groups of pixels from taking the entire bus capacity.

When Mr is turned off, positive charge is injected from its channel onto the differencing amplifier input. This is compensated by slightly adjusting the thresholds don and doff around diff.

The smallest threshold that can be set across an array of pixels is determined by mismatch: as the threshold is decreased, a pixel with a low threshold will eventually fail to stop generating events even when the differencing amplifier is reset. Depending on the design of the AER communication circuits, this state will either hang the bus or cause a storm of events from this pixel that consumes bus bandwidth. Mismatch characterization is further discussed in Section 3.7.1.

The contrast sensitivity of the original DVS can be enhanced by inserting an extra voltage preamplification stage at node Vph in Figure 3.12a. This modified design from Leñero Bardallo et al. (2011b) uses a few extra transistors to implement a two-stage voltage amplification before the capacitive differentiator/amplifier. The photoreceptor circuit is also modified to the one in Figure 3.12b. The amplifying transistors are biased in the limit of strong to weak inversion. The extra amplifiers introduce a voltage gain of about 25, which allows reducing the capacitive gain (and area) by about a factor of 5 while still improving the overall gain. As a result, the pixel area is reduced by about 30% and contrast sensitivity is improved; overall mismatch is only slightly degraded, by about a factor of 2, but power consumption is significantly increased, by a factor between 5 and 10. This technique is especially valuable when MIM capacitors are not available and poly capacitors occupy valuable transistor area. It has the possible drawback that an AGC circuit must now be implemented to center the DC operating point of the photoreceptor around the proper light intensity; changes of this global gain can then generate events across the array as though the illumination were changing. A more recent improvement to the preamplifier reduces the power consumption significantly (Serrano-Gotarredona and Linares-Barranco 2013).

3.6.2   ATIS Pixel

The ATIS pixel detects TC and uses this to locally trigger intensity measurements (Figure 3.13a). The change detector initiates the exposure of a new gray-level measurement. The asynchronous operation also avoids the time quantization of frame-based acquisition and scanning readout.


Figure 3.12 DVS pixel with higher sensitivity. (a) Pixel circuit, showing photoreceptor with feedback to gate of pFET that has fixed gate voltage VG and preamplifier G. (b) Details of G preamplifier. © 2011 IEEE. Reprinted, with permission, from Leñero Bardallo et al. (2011b)

The temporal change detector consists of the same circuits as the DVS pixel. For the IMU, a time-based PWM imaging technique was developed. In time-based or pulse modulation imaging, the incident light intensity is encoded in the timing of pulses or pulse edges. This approach automatically optimizes the integration time separately for each pixel instead of imposing a fixed integration time for the entire array, allowing for high DR and improved signal-to-noise ratio (SNR). DR is no longer limited by the power supply rails as in conventional CMOS APS, providing some immunity to the supply voltage scaling of modern CMOS technologies. Instead, DR is now limited only by the dark current and the allowable integration time. As long as the photocurrent exceeds the dark current and the user can wait long enough for the integration to cover the voltage range, the intensity can still be captured. The photodiode is reset and subsequently discharged by the photocurrent. An in-pixel comparator triggers a digital pulse signal when a first reference voltage VH is reached, and then another pulse when the second reference voltage VL is crossed (Figure 3.13b). The resulting integration time tint is inversely proportional to the photocurrent. Independent AER communication circuits on the four sides of the sensor read out the DVS and the two IMU events.
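
A minimal sketch of the IMU decoding, assuming illustrative values for the two reference voltages and the photodiode capacitance (none of which are taken from the ATIS publication), is:

```python
def atis_intensity(t_vh, t_vl, v_h=2.0, v_l=1.9, c_pd=100e-15):
    """Sketch of ATIS time-based intensity decoding. The photodiode is reset
    and then discharged by the photocurrent; IMU events mark the crossings of
    V_H and V_L, so the integration time t_int = t_vl - t_vh is inversely
    proportional to the photocurrent:
        I_photo = C * (V_H - V_L) / t_int
    v_h, v_l, and the photodiode capacitance c_pd are assumed values."""
    t_int = t_vl - t_vh
    return c_pd * (v_h - v_l) / t_int

# Example: a 10 ms integration over a 100 mV swing on 100 fF -> 1 pA
print(atis_intensity(t_vh=0.000, t_vl=0.010))
```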


Figure 3.13 ATIS pixel (a) and examples of pixel signals (b)

Time-domain true correlated double sampling (TCDS) (Matolin et al. 2009) based on two adjustable integration thresholds (VH and VL) and intra-pixel state logic (‘state machine’) completes the pixel circuit (Figure 3.13a). The key feature of TCDS is the use of a single comparator whose reference is switched between the high and low thresholds. This way, comparator mismatch and kTC noise are both canceled out, improving both noise and FPN. VH − VL can be set as small as 100 mV with acceptable image quality, allowing operation down to low intensities. However, IMU capture of fast-moving objects requires high light intensity so that the object is not too blurred during the pixel exposure; a significant drawback is that small or thin moving objects can disappear from the IMU output because the IMU exposure is restarted by the trailing edge of the object, resulting in an exposure of the background after the object has already passed. This effect is a specific instance of the motion artifacts that result from exposures that are not globally synchronous, and the need for so-called ‘global shutter’ imagers drives a large market for such sensors in machine vision applications.

The asynchronous change detector and the time-based photo-measurement approach complement each other nicely, mainly for two reasons: first, because both reach a DR > 120 dB – the former is able to detect small relative change over the full range, while the latter is able to resolve the associated gray levels independently of the initial light intensity; second, because both circuits are event based, namely the events of detecting brightness changes and the events of pixel integration voltages reaching thresholds.

3.6.3   VISe Pixel

The VISe sensor introduced in Section 3.5.5 measures SC and orientation. Here we will describe only the VISe front-end circuit to illustrate how it integrates over a fixed voltage range and then generates a sampling pulse. This sampling pulse is then used in the full VISe pixel to sample neighboring pixel photodetector voltages onto capacitors. These capacitors are the gates of the multiplier circuit used for contrast orientation measurement, which will not be described here. Figure 3.14 shows the schematic of the VISe photocurrent integration and sampling block. When signal RST is high, capacitor C is reset to voltage Vblack. Transistor MC together with OTA1 maintains a constant voltage across the photodiode so that the photocurrent is not integrated on the parasitic capacitance of the photodiode. The transconductance amplifier has an asymmetrical input differential pair to generate a voltage Vph of around 25 mV. The voltage follower made of MP1 and MP2 is a simple p-type source follower in a separate well to get a gain close to one and to minimize the parasitic input capacitance. Voltage VC is distributed to the four neighbors and compared to a reference voltage Vref. At the beginning of the photocurrent integration, signal Vout is high. When VC reaches Vref, signal Vout, which is fed to the four-quadrant multiplier block, goes low.


Figure 3.14 VISe pixel front end. © 2003 IEEE. Reprinted, with permission, from Ruedi et al. (2003)

3.6.4   Octopus Pixel

The Octopus retina pixel introduced in Section 3.5.4 reports intensity by measuring the time taken for the input photocurrent to be integrated on a capacitor to a fixed voltage threshold. When the threshold is crossed, an AER spike is generated by the pixel. The magnitude of the photocurrent is represented by the inter-event interval or frequency of spikes from a pixel. Although this representation does not reduce redundancy or compress DR, and thus does not have much relation to biological retinal function, the circuit is interesting and forms the basis for some popular low-power spike generators used in silicon neurons, as discussed in Chapter 7.

The pixel in the retina uses an event generator circuit that has been optimized so that an event is generated quickly and with little power consumption. A typical digital inverter using minimum-size transistors in a 0.5 μm process with a 3.3 V supply consumes only about 0.1 pJ of capacitive charging energy per transition when the input signal is a perfect square wave. However, in ambient lighting the slew rate of a linear photosensor (or of a silicon neuron membrane potential) can be around 1 V/ms, six orders of magnitude slower than typical digital signals. In that case, a standard inverter operates in a region where both nFETs and pFETs conduct simultaneously for a long time, and an event generator like Mead’s axon hillock then consumes about 4 nJ. Therefore, the energy consumption of the inverter used as the event generator is about four to five orders of magnitude greater than that of a minimum-size inverter in a digital circuit. To solve this problem, the Octopus imager uses positive feedback to shorten the transition time in the circuit shown in Figure 3.15. Unlike the positive feedback circuit used in Mead’s axon hillock circuit (Chapter 7), which uses capacitively coupled voltage feedback, the Octopus pixel uses current-mirror-based positive feedback of current.

After reset, Vin is at a high voltage. As Vin is pulled down by photocurrent, it turns on M2, which eventually turns on M5, which accelerates the drop of Vin. Eventually the pixel enters positive feedback that results in a down-going spike on Vout. The circuit is cleverly arranged so that the extra integration capacitor C is disconnected by M7 during the spike to reduce the load and power consumption. Once the input of the inverter Vin reaches ground, the inverter current goes to zero and so does the feedback, because M3 turns off M4. Thus in the initial and final states there is no power-supply current. Measurements show that at an average spike rate of 40 Hz, each event requires about 500 pJ of energy, a factor of eight less than the axon hillock circuit. The improvement becomes larger as the light intensity decreases. An array of VGA size (640 × 480 pixels) at an average firing rate of 40 Hz would burn about 6 mW in the pixels and would produce events at a total rate of about 13 Meps, which despite all these tricks to reduce power still is a substantial amount of power consumption and data to process.
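
The arithmetic in this estimate can be checked directly, using the figures quoted above:

```python
pixels = 640 * 480            # VGA-sized array
rate_hz = 40                  # average events per pixel per second (from the text)
energy_per_event = 500e-12    # measured energy per event in joules (from the text)

total_eps = pixels * rate_hz                # total event rate
power_w = total_eps * energy_per_event      # power dissipated in the pixels
print(f"{total_eps / 1e6:.1f} Meps, {power_w * 1e3:.1f} mW")
# -> 12.3 Meps, 6.1 mW, consistent with the ~13 Meps and ~6 mW quoted above
```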


Figure 3.15 Octopus pixel schematic. © 2003 IEEE. Reprinted, with permission, from Culurciello et al. (2003)

3.7   New Specifications for Silicon Retinas

Because of the nature of the event-based readout of AER silicon retinas, new specifications are needed to aid the end user and to enable quantifying improvements in sensor performance. As examples of these specifications, we use the DVS and discuss characteristics such as response uniformity, DR, pixel bandwidth, latency, and latency jitter, with more detail than reported in Lichtsteiner et al. (2008). A good recent publication on measurement of TC noise and uniformity characteristics by using flashing LED stimuli is Posch and Matolin (2011).

3.7.1   DVS Response Uniformity

For standard CMOS image sensors (CIS), FPN characterizes the uniformity of response across pixels of the array. For event-based vision sensors, the equivalent measure is the pixel-to-pixel variation in the event threshold, that is, the threshold at which the comparator within the pixel generates an event. Some pixels will generate more events and others will generate fewer events or even no events. This variation depends on the settings of the comparator thresholds and is due to pixel-to-pixel mismatch. For the DVS, which we use here as an example, the threshold θ is defined in terms of the minimum log intensity change needed to generate an ON or OFF event. Because d(log I) = dI/I, the threshold θ is the same as the minimum TC needed to generate an event. The contrast threshold mismatch is defined as σθ, the standard deviation of θ.

The dominant source of mismatch is expected to be the relative mismatch between the differencing circuit reset level and the comparator thresholds: device mismatch for transistors is on the order of 30%, while capacitor mismatch is only on the order of 1%, the front-end photoreceptor steady-state mismatch is eliminated by differencing, and gain mismatch (κ mismatch) in the photoreceptor is expected to be on the order of 1%.

To measure the variation in event threshold, Lichtsteiner et al. (2008) used a black bar with linear gradient edges (which reduce the effects of the pixel refractory period), moved at a constant projected speed of about 1 pixel/10 ms through the visual field of the sensor. A pan-tilt unit smoothly rotated the sensor, and a lens with long focal length minimized geometric distortion. Figure 3.16 shows the resulting histogram of events per pixel per stimulus edge for six different threshold settings. The threshold mismatch is measured from the width of the distributions combined with the known stimulus contrast of 15:1. Assuming an event threshold θ = Δ ln(I) (with identical ON and OFF thresholds) and a threshold variation σθ, an edge of log contrast C = ln(Ibright/Idark) will make N ± σN events:

$$ N \pm \sigma_N \;=\; \frac{C}{\theta \mp \sigma_\theta} \;\approx\; \frac{C}{\theta}\left(1 \pm \frac{\sigma_\theta}{\theta}\right) \qquad (3.11) $$

images

Figure 3.16 Distributions of the number of DVS events recorded per pass of a dark bar, for 40 repetitions of the 15:1 contrast bar sweeping over the pixel array. For example, for the highest threshold setting there is an average of 4.5 ON and 4.5 OFF events per ON or OFF edge per pixel. © 2008 IEEE. Reprinted, with permission, from Lichtsteiner et al. (2008)

From Eq. (3.11) come the computed expressions for θ and σθ:

$$ \theta = \frac{C}{N}, \qquad \sigma_\theta = \frac{C\,\sigma_N}{N^{2}} \qquad (3.12) $$

where C is measured from the stimulus using a spot photometer under uniform lighting; N and σN are measured from the histograms.
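
As a concrete illustration of Eq. (3.12), the sketch below extracts θ and σθ from per-pixel event counts and a photometered stimulus contrast; the event-count data are invented for illustration and are not from the published measurement.

```python
import numpy as np

# Hypothetical per-pixel ON event counts for one pass of the edge (one entry
# per pixel), standing in for the histograms of Figure 3.16.
events_per_pixel = np.array([4, 5, 4, 5, 6, 4, 5, 4, 3, 5], dtype=float)

contrast_ratio = 15.0        # bright/dark ratio measured with a spot photometer
C = np.log(contrast_ratio)   # log contrast C = ln(I_bright / I_dark)

N = events_per_pixel.mean()        # mean events per edge per pixel
sigma_N = events_per_pixel.std()   # spread of event counts across pixels

theta = C / N                      # event threshold in ln units, Eq. (3.12)
sigma_theta = C * sigma_N / N**2   # threshold mismatch, Eq. (3.12)

print(f"theta = {theta:.2f} (ln units), sigma_theta = {sigma_theta:.2f}")
```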

Posch and Matolin (2011) summarize many DVS pixel characteristics and report a useful method for measuring DVS threshold matching using a flashing LED. This method characterizes the response by analyzing the S-shaped (logistic) probability of DVS event generation in response to step changes of LED intensity. The LED stimulus has the advantage of being easily controlled, but the disadvantage of synchronously stimulating all the illuminated pixels, which limits its practical application to a few hundred pixels, and even fewer if response timing jitter is being measured. Making a small flashing LED spot on the array is not so simple, however, especially if the surrounding pixels must be uniformly illuminated by steady light to reduce their dark-current noise-generated activity, which can be significant, especially when the DVS thresholds are set low. Therefore, this method is probably best used with sensors that can limit the pixel activity to a controlled ROI.
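
A minimal sketch of this kind of analysis, assuming the per-pixel event probability has already been measured for a series of LED step contrasts (the numbers below are made up for illustration), is to fit a logistic function and read off its midpoint as the threshold and its slope as a measure of the transition width.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(c, theta, s):
    """Probability that a pixel emits an event for a step of log contrast c."""
    return 1.0 / (1.0 + np.exp(-(c - theta) / s))

# Hypothetical measurement: fraction of LED flashes that produced an event,
# for a range of step log contrasts, for one pixel.
step_contrast = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30])
event_prob    = np.array([0.02, 0.10, 0.45, 0.80, 0.97, 1.00])

(theta_fit, s_fit), _ = curve_fit(logistic, step_contrast, event_prob, p0=[0.15, 0.05])
print(f"threshold ~ {theta_fit:.3f} ln units, transition width ~ {s_fit:.3f}")
```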

It may also be possible to use a display to generate the stimulus, but caution should be taken to ensure that the screen does not flicker, because most screens control backlight intensity by duty-cycle modulation. Turning the display brightness up to maximum often disables the PWM backlight brightness control, significantly reducing display flicker. The display latency and pixel synchronicity should also be carefully investigated before quantitative conclusions are drawn from using a computer display to stimulate pixels.

3.7.2   DVS Background Activity

Event-based sensors may produce some level of background activity that is not related to the scene. As an example, in the DVS pixel the junction leakage of the reset switch also produces background events, in this design only ON events. These background events become significant when scene activity is low and contribute a source of noise. The DVS of Lichtsteiner et al. (2008) has a background rate of about 0.05 Hz per pixel at room temperature. Since charge input to the floating input node Vf of the differencing amplifier in Figure 3.2 appears at the output Vdiff as a voltage across C2, the rate of background activity events Rleak is related to the leakage current Ileak, the event threshold voltage θ, and the feedback capacitor C2 by Eq. (3.13):

$$ R_{\mathrm{leak}} = \frac{I_{\mathrm{leak}}}{\theta\, C_2} \qquad (3.13) $$
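
To get a feel for the magnitudes involved, Eq. (3.13) can be inverted to estimate the implied leakage current; the threshold voltage and capacitance used below are purely illustrative assumptions, not values from the published design.

```python
# Illustrative numbers only; theta_v and C2 are assumptions, not published values.
R_leak = 0.05      # Hz, measured background event rate per pixel
theta_v = 0.02     # V, assumed event threshold referred to the differencing amp output
C2 = 50e-15        # F, assumed feedback capacitance

I_leak = R_leak * theta_v * C2   # leakage current implied by Eq. (3.13)
print(f"implied leakage current ~ {I_leak * 1e18:.0f} aA")   # tens of attoamperes
```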

In most applications this background activity is easily filtered out by the simple digital filter described in Section 15.4.1, which discards events that are not correlated in space and time with recent past events.
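
A minimal sketch of such a correlation-based background-activity filter (our own illustrative implementation, not the specific filter of Section 15.4.1) keeps an event only if the same pixel or a neighboring pixel produced an event within a recent time window.

```python
import numpy as np

def filter_background(events, width, height, dt_us=10_000):
    """Keep only events supported by recent activity at the same or a neighboring pixel.

    events: iterable of (timestamp_us, x, y, polarity) tuples in time order.
    dt_us:  correlation time window in microseconds.
    """
    last_ts = np.full((height, width), -np.inf)  # last event time seen at each pixel
    kept = []
    for t, x, y, p in events:
        # Check the 3x3 neighborhood around (x, y) for an event within dt_us.
        x0, x1 = max(0, x - 1), min(width, x + 2)
        y0, y1 = max(0, y - 1), min(height, y + 2)
        if (t - last_ts[y0:y1, x0:x1]).min() <= dt_us:
            kept.append((t, x, y, p))
        last_ts[y, x] = t
    return kept
```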

3.7.3   DVS Dynamic Range

Another important metric for all vision sensors is DR, defined as the ratio of maximum to minimum illumination over which the sensor generates useful output. For the DVS, the DR is defined as the ratio of maximum to minimum scene illumination at which events can be generated by high-contrast stimuli. This range is illustrated in Figures 3.17a–3.17c. In the DVS, this large range arises from the logarithmic compression in the front-end photoreceptor circuit and the local event-based quantization. Photodiode dark current of 4 fA at room temperature limits the lower end of the range. (In the DVS, the dark current can be measured from the global ΣI current divided by the number of pixels.) Events are generated down to less than 0.1 lux scene illumination using a fast f/1.2 lens (Figure 3.17c). At this illumination level, the signal (photocurrent induced by photons from the scene) is only a small fraction of the noise (background dark current). Operation at this low SNR is possible because the low threshold mismatch allows setting a threshold that is low enough to detect the reduced photocurrent TC. The sensor also operates up to bright sunlight scene illumination of 100 klux; the total DR thus amounts to about 6 decades, or 120 dB. The vision sensor is usable at typical scene contrasts under nighttime street lighting of a few lux. The DR is approximately halved for every 8°C increase in temperature. Another relevant metric is the intra-scene DR, defined as the range of illumination within a single scene over which the sensor is usable. This range can be lower than the total DR if the sensor uses global gain control to center the photodetector operating point.
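
The quoted dynamic range follows directly from these illumination limits; the quick check below is a sketch using only the numbers quoted above.

```python
import math

min_lux = 0.1        # lowest scene illumination at which events are still generated
max_lux = 100_000    # bright-sunlight scene illumination

decades = math.log10(max_lux / min_lux)   # 6 decades
dr_db = 20 * decades                      # 120 dB, using the 20*log10 convention

print(f"{decades:.0f} decades = {dr_db:.0f} dB")
```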

images

Figure 3.17 Illustration of DVS DR capabilities. (a) Histogrammed output from the vision sensor viewing an Edmund density step chart with a high contrast shadow cast on it. (b) The same scene as photographed by a Nikon 995 digital camera to expose the two halves of the scene. (c) Moving black text on white background under 3/4 moon (<0.1 lux) illumination (180 ms, 8000 events). © 2008 IEEE. Reprinted, with permission, from Lichtsteiner et al. (2008)

3.7.4   DVS Latency and Jitter

The latency and timing precision of event-based sensor output are important metrics because they define the speed of response and the analog quantity (inter-event time) being represented. The overall event latency comprises the latency of the pixel itself and the latency of the peripheral AER circuits. The latency jitter depends on the jitter of event generation within the pixel and the variance of the event transmission time.

images

Figure 3.18 DVS latency and latency jitter (error bars) versus illumination, in response to a 30% contrast periodic 10 Hz step stimulus applied to a single pixel. (a) Measurement of repeated single OFF event responses to the step. (b) Results with two bias settings as a function of chip illuminance. © 2008 IEEE. Reprinted, with permission, from Lichtsteiner et al. (2008)

As an example from the DVS, a basic prediction in Lichtsteiner et al. (2008) is that the latency should increase in proportion to the reciprocal of the illumination. The latency is typically measured by applying a low-contrast periodic step stimulus to a small spot covering a few pixels while varying the DC illuminance (Figure 3.18). The thresholds are set to produce about one event of each polarity per stimulus cycle. The overall latency is plotted versus stimulus chip illuminance for two sets of measurements: one at the nominal biases used for many applications, the other at higher current levels for the front-end photoreceptor biases. The plots show the measured latency and the 1-σ response jitter. The dashed lines show a reciprocal (first-order) and a reciprocal-square-root (second-order) relationship between latency and illumination. As can be seen, latencies are on the order of milliseconds and follow either an inverse-linear or an inverse-square-root characteristic with intensity, depending on the photoreceptor biasing. An often-quoted metric (of unproven relevance) is the minimum latency, which is obtained under very high illumination and bias-current conditions.
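
One simple way to test which regime applies to a given measurement set (a sketch; the latency data below are invented for illustration) is to fit a power law in log space and inspect the exponent.

```python
import numpy as np

# Hypothetical measurements: chip illuminance (lux) and median latency (ms).
lux     = np.array([1, 3, 10, 30, 100, 300, 1000], dtype=float)
latency = np.array([40, 14, 4.5, 1.6, 0.55, 0.2, 0.07])

# Fit latency = a * lux**p in log space; p near -1 indicates first-order (1/I)
# behavior, p near -0.5 indicates second-order (1/sqrt(I)) behavior.
p, log_a = np.polyfit(np.log(lux), np.log(latency), 1)
print(f"fitted exponent p ~ {p:.2f}")
```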

3.8   Discussion

This chapter focused on discussion of recent silicon retina designs and their specifications. But there are many areas of possible improvement and innovation in the design of silicon retinas.

The event-based silicon retinas in this chapter offer either spatial or temporal processing, and none of them offers both in a form usable for real-world applications. Although the ATIS discussed here offers an intensity output, this output has its own problems, in particular large motion artifacts. Perhaps a new generation of so-called DAVIS silicon retinas, which combine DVS and APS technologies in the same pixels (Berner et al. 2013), can offer output that combines the strengths of conventional machine vision based on tiny APS pixels with the strengths of low-latency, sparse-output neuromorphic event-based vision. But here the challenge will be to improve the performance of the DAVIS APS output to keep up with conventional APS technology, which continues to evolve in a competitive commercial market.

Another example is that no one has so far built a high-performance color silicon retina, although color is a basic feature of biological vision in all diurnal animals. A few recent attempts in this direction have used buried double (Berner and Delbruck 2011; Berner et al. 2008; Fasnacht and Delbruck 2007) or triple (Leñero Bardallo et al. 2011a) junctions to separate the colors, as pioneered by Dick Merrill and others at Foveon Inc. (Gilder 2005). But the usability and performance of these neuromorphic prototypes fall far short of those afforded by commodity CMOS color cameras. This is likely due mostly to the poor color-separation properties of standard CMOS implants; Foveon used a modified process with special implants to create buried junctions at optimal depths. Mostly, however, the development of color silicon retinas has been held back by the expense of the process technologies that provide integrated color filters.

Another example is that implementations of vision sensors with space-varying resolution are in their infancy (Azadmehr et al. 2005) and have not been convincingly demonstrated as useful, although all animals vary the spatial acuity and characteristics of their visual system with eccentricity from the fovea. This could be a symptom of fundamental technological differences between electronics and the ionic basis of biological neural computation. Because a focus of attention can be steered electronically over a sensor with a large number of inexpensive pixels much more cheaply than a mechanical eye can be built, foveated cameras may find only a limited commercial niche.

The rather poor quantum efficiencies and fill factors of silicon retinas are also a consequence of using standard CMOS technologies. One solution is the use of integrated microlenses, which concentrate light onto the photodiode. However, standard microlenses offered in CIS processes are optimized for pixels smaller than 5 μm, so for 20 μm retina pixels they do not offer any benefit. Another possible improvement could come from back-side illumination (BSI). Normally a vision sensor is illuminated from the top of the wafer. However, for tiny-pixel CMOS imagers, front-side illumination (FSI) is a big problem, because the photodiode sits at the bottom of a tunnel through all the overlying metal and insulator layers, making it difficult to capture light, particularly at the edges of the sensor when a wide-angle lens is used. This problem led to the development of BSI, where the wafer is thinned down to less than 20 μm and illuminated from the back rather than the front. Then all the silicon area can receive light, and if the sensor is properly designed, most of the photocharge is collected by the photodiode. Intense industrial development of BSI image sensors may shortly make this technology more accessible for prototyping. But new problems can then arise, such as unwanted 'parasitic photocurrents' in junctions other than the photodiode. These currents can disturb pixel operation, particularly when the pixel stores charge on a capacitor, as in a global-shutter CMOS imager or the DVS pixel.

There are other areas of improvement toward mimicking the features of biological retinas. Mammalian retinas have tens of parallel output channels with different and complex nonlinear signal-processing characteristics. We hope that biologists will understand more about the functional roles of these parallel information channels, which will motivate their implementation either in the focal plane or in post-processing vision sensor output.

Even considering the input photoreceptors, there are opportunities: a cone photoreceptor isolated in a dish is an example of a distributed amplifier with interesting dynamical properties (Sarpeshkar 2010; Shapley and Enroth-Cugell 1984). Because the gain is generated by a chain of biochemical amplifiers, it can be varied by many orders of magnitude with only a small effect on the overall bandwidth or latency. At present, with logarithmic photoreceptors built from a single stage of amplification, the gain-bandwidth product is constant: if the light becomes 10 times dimmer, the photoreceptor becomes 10 times slower. On the one hand this is good because it helps control shot noise; on the other hand, it means that logarithmic silicon retina photoreceptor dynamics can change over many orders of magnitude depending on the illumination. One possible future solution is to build sensors that emulate the rod-cone systems of biology, where some photoreceptors are optimized for low light levels and others for high levels. The subsequent processing circuits can then be shared between these detectors, as has recently been shown to occur in the mouse retina (Farrow et al. 2013).

The retina designs discussed here are related to each other historically (Figure 3.19) and they illustrate the trade-offs that designers face in creating usable designs for the end user and the difficulty of incorporating multiple retina functionalities within the focal plane. The design style in the Parvo–Magno retina – which uses many current mirrors – led to large mismatch. The circuits can be calibrated to reduce the amount of offset in the responses but the calibration circuits themselves occupy a large percentage of the area in the pixel (see, e.g., Costas-Santos et al. 2007; Linares-Barranco et al. 2003).

images

Figure 3.19 Tree diagram of silicon retina designs

In general, the easiest way to improve precision is to ensure that the signal is amplified more than the mismatch. This method is used in the DVS, the ATIS, and the VISe. In the DVS, the signal (the change of log intensity) is amplified by the differencing amplifier before it is compared to the thresholds; that way, the comparator mismatch is effectively reduced by a factor equal to the amplifier gain. In the ATIS and VISe, the signal (intensity or SC) is amplified to a large fraction of the power-supply voltage before it is compared to the threshold. Moreover, in the ATIS, the same comparator is used to detect two different threshold levels that are global to the array, so the comparator offset cancels in the comparison.

The retina implementations described in this chapter show the variety of approaches taken toward building high-quality AER vision sensors that can now be used to solve practical machine vision problems. There has been a lot of recent progress using the event-based, data-driven, redundancy-reducing style of computation that underlies the power of biological vision. Chapter 13 discusses examples of multichip AER systems that combine AER retinas with other AER chips. Chapter 15 discusses how AER sensor outputs are processed algorithmically, by programs running on digital computers, to decrease computational cost and system latency by taking advantage of the sparsity of the events and their precise timing. The use of timing is even more critical in auditory processing, as will be introduced in Chapter 4.

References

Azadmehr M, Abrahamsen JP, and Hafliger P. 2005. A foveated AER imager chip. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 3, pp. 2751–2754.

Barbaro M, Burgi P, Mortara R, Nussbaum P, and Heitger F. 2002. A 100 × 100 pixel silicon retina for gradient extraction with steering filter capabilities and temporal output coding. IEEE J. Solid-State Circuits 37(2), 160–172.

Barlow HB. 1961. Possible principles underlying the transformation of sensory messages. In: Sensory Communication (ed. Rosenblith WA). MIT Press, Cambridge, MA. pp. 217–234.

Belbachir AN, Litzenberger M, Schraml S, Hofstatter M, Bauer D, Schon P, Humenberger M, Sulzbachner C, Lunden T, and Merne M. 2012. CARE: a dynamic stereo vision sensor system for fall detection. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 731–734.

Berner R and Delbruck T. 2011. Event-based pixel sensitive to changes of color and brightness. IEEE Trans. Circuits Syst. I: Regular Papers 58(7), 1581–1590.

Berner R, Lichtsteiner P, and Delbruck T. 2008. Self-timed vertacolor dichromatic vision sensor for low power pattern detection. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1032–1035.

Berner R, Brandli C, Yang MH, Liu SC, and Delbrück T. 2013. A 240×180 10 mW 12 μs latency sparse-output vision sensor for mobile applications. Proc. Symp. VLSI Circuits, pp. C186–C187.

Boahen KA. 1996. A retinomorphic vision system. IEEE Micro 16(5), 30–39.

Boahen KA and Andreou AG. 1992. A contrast sensitive silicon retina with reciprocal synapses. In: Advances in Neural Information Processing Systems 4 (NIPS) (eds. Moody, JE, Hanson, SJ, and Lippmann, RP). Morgan-Kaufmann, San Mateo, CA. pp. 764–772.

Conradt J, Berner R, Cook M, and Delbruck T. 2009a. An embedded AER dynamic vision sensor for low-latency pole balancing. Proc. 12th IEEE Int. Conf. Computer Vision Workshops (ICCV), 780–785.

Conradt J, Cook M, Berner R, Lichtsteiner P, Douglas RJ, and Delbruck T. 2009b. A pencil balancing robot using a pair of AER dynamic vision sensors. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 781–784.

Costas-Santos J, Serrano-Gotarredona T, Serrano-Gotarredona R, and Linares-Barranco B. 2007. A spatial contrast retina with on-chip calibration for neuromorphic spike-based AER vision systems. IEEE Trans. Circuits Syst. I 54(7), 1444–1458.

Culurciello E, Etienne-Cummings R, and Boahen K. 2003. A biomorphic digital image sensor. IEEE J. Solid-State Circuits 38(2), 281–294.

Delbrück T and Lichtsteiner P. 2007. Fast sensory motor control based on event-based hybrid neuromorphic-procedural system. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 845–848.

Delbrück T and Mead CA. 1994. Adaptive photoreceptor circuit with wide dynamic range. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 4, pp. 339–342.

Delbrück T, Linares-Barranco B, Culurciello E, and Posch C. 2010. Activity-driven, event-based vision sensors. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 2426–2429.

Drazen D, Lichtsteiner P, Hafliger P, Delbrück T, and Jensen A. 2011. Toward real-time particle tracking using an event-based dynamic vision sensor. Exp. Fluids 51(5), 1465–1469.

Farrow K, Teixeira M, Szikra T, Viney TJ, Balint K, Yonehara K, and Roska B. 2013. Ambient illumination toggles a neuronal circuit switch in the retina and visual perception at cone threshold. Neuron 78(2), 325–338.

Fasnacht DB and Delbruck T. 2007. Dichromatic spectral measurement circuit in vanilla CMOS. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 3091–3094.

Fossum ER. 1995. CMOS image sensors: electronic camera on a chip. Intl. Electron Devices Meeting. IEEE, Washington, DC. pp. 17–25.

Fukushima K, Yamaguchi Y, Yasuda M, and Nagata S. 1970. An electronic model of the retina. Proc. IEEE 58(12), 1950–1951.

Gilder G. 2005. The Silicon Eye: How a Silicon Valley Company Aims to Make all Current Computers, Cameras, and Cell Phones Obsolete. WW Norton & Company, New York.

jAER. 2007. jAER Open Source Project, http://jaerproject.org (accessed July 28, 2014).

Kameda S and Yagi T. 2003. An analog VLSI chip emulating sustained and transient response channels of the vertebrate retina. IEEE Trans. Neural Netw. 15(5), 1405–1412.

Kavadias S, Dierickx B, Scheffer D, Alaerts A, Uwaerts D, and Bogaerts J. 2000. A logarithmic response CMOS image sensor with on-chip calibration. IEEE J. Solid-State Circuits 35(8), 1146–1152.

Kramer J. 2002. An integrated optical transient sensor. IEEE Trans. Circuits Syst. II 49(9), 612–628.

Kuffler SW. 1953. Discharge patterns and functional organization of mammalian retina. J. Neurophys. 16(1), 37–68.

Lee J, Delbruck T, Park PKJ, Pfeiffer M, Shin CW, Ryu H, and Kang BC. 2012. Live demonstration: gesture-based remote control using stereo pair of dynamic vision sensors. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 741–745.

Leñero Bardallo JA, Bryn DH, and Hafliger P. 2011a. Bio-inspired asynchronous pixel event tri-color vision sensor. Proc. IEEE Biomed. Circuits Syst. Conf. (BIOCAS), pp. 253–256.

Leñero Bardallo JA, Serrano-Gotarredona T, and Linares-Barranco B. 2011b. A 3.6 μs latency asynchronous frame-free event-driven dynamic-vision-sensor. IEEE J. Solid-State Circuits 46(6), 1443–1455.

Lichtsteiner P, Delbrück T, and Kramer J. 2004. Improved on/off temporally differentiating address-event imager. Proc. 11th IEEE Int. Conf. Electr. Circuits Syst. (ICECS), pp. 211–214.

Lichtsteiner P, Posch C, and Delbrück T. 2008. A 128 × 128 120 dB 15 μs latency asynchronous temporal contrast vision sensor. IEEE J. Solid-State Circuits 43(2), 566–576.

Linares-Barranco B, Serrano-Gotarredona T, and Serrano-Gotarredona R. 2003. Compact low-power calibration mini-DACs for neural massive arrays with programmable weights. IEEE Trans. Neural Netw. 14(5), 1207–1216.

Litzenberger M, Kohn B, Belbachir AN, Donath N, Gritsch G, Garn H, Posch C, and Schraml S. 2006. Estimation of vehicle speed based on asynchronous data from a silicon retina optical sensor. Proc. 2006 IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 653–658.

Liu SC. 1999. Silicon retina with adaptive filtering properties. Analog Integr. Circuits Signal Process. 18(2/3), 243–254.

Loose M, Meier K, and Schemmel J. 2001. A self-calibrating single-chip CMOS camera with logarithmic response. IEEE J. Solid-State Circuits 36(4), 586–596.

Luo Q and Harris J. 2004. A time-based CMOS image sensor. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 4, pp. 840–843.

Luo Q, Harris J, and Chen ZJ. 2006. A time-to-first spike CMOS image sensor with coarse temporal sampling. Analog Integr. Circuits Signal Process. 47(3), 303–313.

Mahowald M. 1991. Silicon retina with adaptive photoreceptors. SPIE/SPSE Symp. Electronic Sci. Technol.: From Neurons to Chips 1473, 52–58.

Mahowald M. 1994. An Analog VLSI System for Stereoscopic Vision. Kluwer Academic, Boston.

Mahowald M and Mead C. 1991. The silicon retina. Scientific American 264(5), 76–82.

Mallik U, Vogelstein RJ, Culurciello E, Etienne-Cummings R, and Cauwenberghs G. 2005. A real-time spike-domain sensory information processing system. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 3, pp. 1919–1923.

Masland R. 2001. The fundamental plan of the retina. Nat. Neurosci. 4(9), 877–886.

Matolin D, Posch C, and Wohlgenannt R. 2009. True correlated double sampling and comparator design for time-based image sensors. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1269–1272.

Mead CA. 1989. Analog VLSI and Neural Systems. Addison-Wesley, Reading, MA.

Mortara A, Vittoz EA, and Venier P. 1995. A communication scheme for analog VLSI perceptive systems. IEEE J. Solid-State Circuits 30(6), 660–669.

Ni Z, Pacoret C, Benosman R, Ieng S, and Regnier S. 2012. Asynchronous event-based high speed vision for microparticle tracking. J. Microscopy 245(3), 236–244.

Posch C and Matolin D. 2011. Sensitivity and uniformity of a 0.18 μm CMOS temporal contrast pixel array. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1572–1575.

Posch C, Matolin D, and Wohlgenannt R. 2011. A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS. IEEE J. Solid-State Circuits 46(1), 259–275.

Qi X, Guo X, and Harris J. 2004. A time-to-first-spike CMOS imager. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 4, pp. 824–827.

Rodieck RW. 1988. The primate retina. In: Comparative Primate Biology (eds Steklis HD and Erwin J). vol. 4. Alan R. Liss, New York. pp. 203–278.

Rogister P, Benosman R, Leng S, Lichtsteiner P, and Delbrück T. 2012. Asynchronous event-based binocular stereo matching. IEEE Trans. Neural Netw. Learning Syst. 23(2), 347–353.

Ruedi PF, Heim P, Kaess F, Grenet E, Heitger F, Burgi PY, Gyger S, and Nussbaum P. 2003. A 128 × 128 pixel, 120-dB dynamic-range vision-sensor chip for image contrast and orientation extraction. IEEE J. Solid-State Circuits 38(12), 2325–2333.

Ruedi PF, Heim P, Gyger S, Kaess F, Arm C, Caseiro R, Nagel JL, and Todeschini S. 2009. An SoC combining a 132 dB QVGA pixel array and a 32 b DSP/MCU processor for vision applications. IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, 1, pp. 46–47.

Sarpeshkar R. 2010. Ultra Low Power Bioelectronics. Cambridge University Press, Cambridge, UK.

Schraml S, Belbachir AN, Milosevic N, and Schön P. 2010. Dynamic stereo vision system for real-time tracking. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1409–1412.

Serrano-Gotarredona R, Oster M, Lichtsteiner P, Linares-Barranco A, Paz-Vicente R, Gomez-Rodriguez F, Camunas-Mesa L, Berner R, Rivas M, Delbrück T, Liu SC, Douglas R, Häfliger P, Jimenez-Moreno G, Civit A, Serrano-Gotarredona T, Acosta-Jimenez A, and Linares-Barranco B. 2009. CAVIAR: a 45 K-neuron, 5 M-synapse, 12 G-connects/sec AER hardware sensory-processing-learning-actuating system for high speed visual object recognition and tracking. IEEE Trans. Neural Netw. 20(9), 1417–1438.

Serrano-Gotarredona T and Linares-Barranco B. 2013. A 128 × 128 1.5% contrast sensitivity 0.9% FPN 3 μs latency 4 mW asynchronous frame-free dynamic vision sensor using transimpedance preamplifiers. IEEE J. Solid-State Circuits 48(3), 827–838.

Shapley R and Enroth-Cugell C. 1984. Visual adaptation and retinal gain controls. Prog. Retin. Res. 3, 263–346.

Shimonomura K and Yagi T. 2005. A 100 × 100 pixels orientation-selective multi-chip vision system. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 3, pp. 1915–1918.

Shoushun C and Bermak A. 2007. Arbitrated time-to-first spike CMOS image sensor with on-chip histogram equalization. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 15(3), 346–357.

Thorpe S, Fize D, and Marlot C. 1996. Speed of processing in the human visual system. Nature 381(6582), 520–522.

Werblin FS and Dowling JE. 1969. Organization of the retina of the mudpuppy Necturus maculosus: II. Intracellular recording. J. Neurophys. 32(3), 339–355.

Yang W. 1994. A wide dynamic range low power photosensor array. IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, pp. 230–231.

Zaghloul KA and Boahen KA. 2004a. Optic nerve signals in a neuromorphic chip I: Outer and inner retina models. IEEE Trans. Biomed. Eng. 51(4), 657–666.

Zaghloul KA and Boahen KA. 2004b. Optic nerve signals in a neuromorphic chip II: Testing and results. IEEE Trans. Biomed. Eng. 51(4), 667–675.

Zaghloul KA and Boahen KA. 2006. A silicon retina that reproduces signals in the optic nerve. J. Neural Eng. 3(4), 257–267.

__________

The figure shows a cross section of mammalian retina, with input photoreceptors at top and output ganglion cell axons exiting at bottom left. The highlighted parts show the prototypical path from photoreceptors to ON and OFF bipolar cells (also involving the lateral horizontal cell) to ON and OFF ganglion cells. Rodieck (1988). Reproduced with permission of Sinauer Associates.
