
15

Algorithmic Processing of Event Streams

@Description("Subsamples x and y addresses")
public class SubSampler extends EventFilter2D {
     /** Process the packet.
     * @param in the input packet
     * @return out the output packet
     */
    synchronized public EventPacket filterPacket(EventPacket in) {
        OutputEventIterator oi=out.outputIterator(); // get the iterator to return output events
        for(BasicEvent e:in){ // for each input event
            BasicEvent o=(BasicEvent)oi.nextOutput(); // get an unused output event
            o.copyFrom(e); // copy the input event to the output event
            o.x = o.x>>bits; // right shift the x and y addresses
            o.y = o.y>>bits;
        }
        return out; // return the output packet
    }
}

This chapter describes event-driven algorithmic processing of AE data streams on digital computers. Algorithms fall into categories such as noise-reduction filters, event labelers, and trackers. Data structures and software architectures are also discussed, as well as requirements for software and hardware infrastructure.¹

15.1   Introduction

As AER sensors such as the retinas of Chapter 3 and the cochleas of Chapter 4 and their supporting hardware infrastructure have evolved, it has become clear that the most rapid way to take advantage of this new hardware is by developing algorithms for processing the data from the devices on conventional computers. This approach allows rapid development and performance that scales at least as fast as Moore’s law. Such an option is really the only tenable one for integration into practical systems as well, because conventional digital processors will continue to serve as a foundation for building integrated systems for the foreseeable future.

This approach is somewhat in contrast with past and many current developments in neuromorphic engineering, which have focused on complete neuromorphic systems like the ones described in Chapter 13; such systems eschew connections with programmable computers except to the extent of providing supporting infrastructure. But although there has been much development of such pure neuromorphic systems, they have proven difficult to configure and use. By treating event-based hardware more like standard computer peripherals, applications can be rapidly developed for practical purposes.

Developing these applications requires a style of digital signal processing that starts with the AER events as input. These time-stamped digital addresses are then processed to achieve a desired aim. For instance, a robotic car that uses silicon retina output can be developed to drive safely along a road by following the noisy and incomplete lane markings. Or a binaural AER cochlea can be used as an input sensor for a battery-powered wireless sensor that detects and classifies significant events such as the sound and direction of gunshots. These are just two potential examples, neither of which has yet been built, but they serve to illustrate that there are practical (and quite feasible) problems that can drive the commercial development of neuromorphic technology without demanding a complete remake of electronic computing.

In this ‘event-driven’ style of computation, each event’s location and time-stamp are used in the order of arrival. Algorithms can take advantage of the capabilities of synchronous digital processors for high-speed iteration, branching logic operations, and moderate pipelining.

Over the last 10 years, there has been significant development using this approach, and this chapter presents instances of these algorithms and applications. The discussion starts with a description of the required software infrastructure, and then presents examples of existing algorithms. These examples are followed by some remarks on data structures and software architecture. Finally, the discussion comes back to the relationship between existing algorithms and conventional signal processing theory based on Nyquist sampling and the need for new theoretical developments in this area to support the data-driven style of signal processing that underlies the event-based approach.

Some examples of existing algorithms based on silicon retina output include commercial development of vehicle traffic and people counting using DVS silicon retinas (Litzenberger et al. 2006a; Schraml et al. 2010), fast visual robots such as a robotic goalie (Delbrück and Lichtsteiner 2007) and a pencil balancer (Conradt et al. 2009b), hydrodynamic and microscopic particle tracking (Drazen et al. 2011; Ni et al. 2012), low-level motion feature extraction (Benosman et al. 2012), low-level stereopsis (Rogister et al. 2012), and stereo-based gesture recognition (Lee et al. 2012). Examples of event-based audio processing include auditory localization (Finger and Liu 2011) and speaker identification (Li et al. 2012).

Based on these developments, it has become clear that methods for processing events have evolved into the following classes:

  • Filters that clean up the input to reduce noise or redundancy.
  • Low-level feature detectors, which we will call labelers, that attach additional labels to the events; these labels are intermediate interpretations of the events’ meanings. For example, a silicon retina event can acquire an interpretation such as contour orientation or direction of motion. Based on these extended types, global metrics such as image velocity are easily computed by integrating these labels.
  • Trackers that detect, track, and potentially classify objects.
  • Cross-correlators that cross-correlate event streams from different sources.

In order to filter and process event streams, some memory structures are needed. Many algorithms, particularly filters and labelers, use one or several topographic memory maps of event times. These maps store the last event time-stamp for each address.

Because events in a computer are just data objects, unlike the binary spikes of the nervous system, digital events can carry arbitrary payloads of additional information, which are called annotations here. Events start out with precise timing and a source address in the AER device. As they are processed by a pipeline of algorithms, extraneous events are discarded, and as events are labeled they can gain additional meaning. Event annotations can be attached by treating the event as a software object. This object is not the same as a cell type in cortex. In cortical processing one usually considers that increased dimensionality is implemented by expanding the number of cell types. But because events as considered here can carry arbitrary payloads, algorithmic events acquire richer interpretations through larger annotations. Multiple interpretations can then be transported by multiple events instead of by activity on multiple hardware units. For instance, a representation of orientation that is halfway between two principal directions can still be represented as near-simultaneous events, each one signifying a different and nearby orientation. In addition, an annotated event can carry scalar and vector values. For example, the ‘motion events’ produced by the algorithm described in Section 15.4.3 add information about the speed and vector direction of the computed optical flow.
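In software, such annotations are naturally carried by deriving extended event classes from a basic event type. The following is a hypothetical sketch of this idea; the class and field names are illustrative and are not the actual jAER types.

// Hypothetical annotated event types: each subclass adds interpretation to the raw event.
class BasicEventSketch { int x, y, timestamp; }                                  // raw address and time
class OrientationEventSketch extends BasicEventSketch { byte orientation; }      // label: edge orientation (0..3)
class MotionEventSketch extends OrientationEventSketch { float speed, vx, vy; }  // label: normal-flow speed and direction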

15.2   Requirements for Software Infrastructure

One can imagine a future software infrastructure like the one shown in Figure 15.1. This software must handle multiple input streams from several neuromorphic sensors such as retinas and cochleas, and must process data from each stream in a pipeline. It must also combine streams to perform sensor fusion and finally produce output, either in the form of motor commands or analytical results. Ideally, the processing load is distributed efficiently over threads of execution and processing cores.

The organization of events in memory data structures is important for efficient processing and flexible software development. As an example, the architecture used in jAER (2007), illustrated in Figure 15.2, shows how events are bundled in buffers called packets. A packet is a reusable memory object that contains a list of event objects. These event objects are memory references to structures that contain the event with its annotations. A particular event-based filter or processor maintains its own reused output packet that holds its results. Packets are reused because object creation typically costs hundreds of times more than accessing existing objects. The packets are dynamically grown as necessary, although this expensive process occurs only a few times during program initialization. This way, heap memory use is low because the reused packets are rarely allocated and need not be garbage-collected, which is particularly important for real-time applications.

Event packets are analogous to frames, but differ in important ways. A frame of data represents a fixed point or range in time, and the contents of the frame consist of samples of the analog input. By contrast, a packet can represent a variable amount of real time depending on the number of events and their timing, and the events in a packet tend to carry more nearly equal amounts of useful information than the samples in a frame, as long as the sensor reduces redundancy in its input.
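A stripped-down sketch of this reuse pattern, building on the hypothetical BasicEventSketch above, is shown next; the actual jAER EventPacket additionally handles typed events and iterators.

// Reused, growable event container: allocation is rare, so little garbage is created.
class EventPacketSketch {
    BasicEventSketch[] events = new BasicEventSketch[1024]; // grown only when needed
    int size = 0;

    BasicEventSketch nextOutput() {                          // hand out the next reusable event slot
        if (size == events.length)                           // expensive growth happens only a few times
            events = java.util.Arrays.copyOf(events, 2 * events.length);
        if (events[size] == null) events[size] = new BasicEventSketch();
        return events[size++];
    }

    void clear() { size = 0; }                               // reuse the same event objects for the next packet
}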


Figure 15.1   Possible software architecture for handling multiple AER input streams (here a stereo pair of silicon retinas and a binaural silicon cochlea) and processing the data to finally produce a motor output


Figure 15.2   Structure of an event packet, holding events that are derived from a base class to contain more extended annotation. Packets are processed by a pipeline of processors, each operating on the output of a previous stage

15.2.1   Processing Latency

Processing latency is important for real-time applications especially when results are fed back to a controller. Examples of these systems are robots and factory production lines. Latency is also important to minimize for human–machine interaction. In an embedded implementation where the processor processes each event directly from the AE device as it is received, latency is simply algorithmic. But more typically, there will be some type of buffer holding a packet of events between the AE device and the processor. Using a software architecture that processes events in packets, the latency is the time between the last events in successive packets, plus the processing time. For example, suppose a light turns on and the system must react to it. The system must fill a hardware buffer (or time out and send it only partially filled), transmit the events, and then process the packet. If the light turns on just after a previous packet has been transmitted, then the total latency is the time to fill the buffer (or some minimum time), plus the transmission time, plus the processing time, plus finally the time required to activate the actuator.

Hardware interfaces between AER sensors and computers can be built to ensure that these packets get delivered to the host with a minimum frequency, for example, 1 kHz. Then the maximum packet latency is 1 ms. But the latency can be much smaller if the event rate is higher. For example, the USB chip used with the DVS silicon retina from Chapter 3 has hardware FIFO buffers of 128 events. If the event rate is 1 MHz, then 128 events fill the FIFO in 128 μs. Then the USB interface running at 480 Mbps requires about 10 μs to transmit the data to the computer and the total latency to get the data into the computer is under 200 μs. Compared with a 100 Hz camera, this represents a factor of 50 times smaller latency.

15.3   Embedded Implementations

When software algorithms are embedded in processors adjacent to AE devices, there are somewhat different considerations than when the software is implemented in a more general framework on a computer connected to the device by some kind of remote interface such as USB. In an embedded implementation, the processor, which is generally either a microcontroller (Conradt et al. 2009a) or a DSP (Litzenberger et al. 2006b), is directly connected to the AE device. In Litzenberger et al. (2006b) and Hofstatter et al. (2011), a hardware FIFO is used to buffer AEs before reaching the processor, allowing more efficient processing of buffers of events. A DSP was also used in the VISe system reported by Grenet et al. (2005). In embedded implementations, it is generally best to inline all processing stages for events as much as possible. This avoids function calling overhead and it is often possible to avoid the need to time-stamp events, because each event is processed in real time as it is received.

On cheaper and lower power processors, only fixed point computation is available, and in many chips only multiply and not divide is offered. For instance, Conradt et al. (2009a) reported an implementation of a pencil balancing robot where each DVS sensor output was directly processed by a fixed-point, 32-bit microcontroller burning about 100 mW. Here, the computation of pencil angle and position was performed entirely in fixed-point arithmetic. A single divide operation was required for every actual update of the balancer hand at a rate of 500 Hz and this divide was split up into pieces, so that it would not result in large gaps in processing events. The update of the LCD panel showing the DVS retina output was cleverly scheduled, so that only two pixels of the LCD screen were updated on the reception of each event—one at the location of the event and the other scanning over the LCD pixel array and decaying the value there. That way, a 2D histogram of event activity could be efficiently maintained without requiring large periods of time during which event processing would be interrupted.

Another highly developed example of an embedded implementation is reported in Schraml et al. (2010). This system uses a pair of DVS retinas and computes stereo correspondence on an FPGA in order to detect ‘falling people’ events, which are important for elderly care. The approach taken is like the one reported in Benosman et al. (2012) where conventional machine vision methods are applied to short-time buffers of accumulated events.

15.4   Examples of Algorithms

This section presents examples of some existing methods for processing events, all from jAER (2007), and mostly for the DVS discussed in Chapter 3. It starts with noise filtering and then continues with low-level visual feature extraction, followed by visual object tracking, followed by examples of audio processing using the AER-EAR silicon cochlea discussed in Chapter 4.

15.4.1   Noise Reduction Filters

It may be beneficial to preprocess data in some way by either transforming or discarding events. As trivial instances, it might be necessary to reduce the size of the address space (e.g., from 128 × 128 to 64 × 64) or transform an image by rotating it. These two operations consist of simply right shifting the x and y addresses (subsampling) or multiplying each (x, y) address by a rotation matrix.
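As a sketch of these per-event address transforms, using the hypothetical BasicEventSketch from Section 15.1; the rotated addresses must be rounded back to integer pixel positions.

// Subsampling by right shift, and rotation of each (x, y) address about a center (cx, cy).
class AddressTransformSketch {
    static void subsample(BasicEventSketch e, int bits) {
        e.x = e.x >> bits;                                  // e.g., 128 x 128 -> 64 x 64 for bits = 1
        e.y = e.y >> bits;
    }

    static void rotate(BasicEventSketch e, double theta, int cx, int cy) {
        double dx = e.x - cx, dy = e.y - cy;
        e.x = (int) Math.round(cx + dx * Math.cos(theta) - dy * Math.sin(theta));
        e.y = (int) Math.round(cy + dx * Math.sin(theta) + dy * Math.cos(theta));
    }
}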

Preprocessing can also be very beneficial for removing ‘noise’ addresses. As an example of noise filtering, we will describe an algorithm (open-sourced as BackgroundActivityFilter in the jAER project) that removes uncorrelated background activity from the DVS sensor described in Chapter 3. These background events can arise from thermal noise or junction leakage currents acting on switches connected to floating nodes. The filter only passes events that are supported by recent nearby (in space) events. Background activity is uncorrelated and is largely filtered away, while events that are generated by the world, say by a moving object seen by a silicon retina, mostly pass through, even if they are only a single pixel in size. The filter uses two maps of event time-stamps to store its state: because the sensor has 128 × 128 pixels, each with ON and OFF output events, two 128 × 128 arrays of integer time-stamp values are used. The filter has a single parameter dT that specifies the support time for which an event will be passed, that is, the maximum time that can pass between this event and a nearby past event for this event to pass the filter. The steps of the algorithm for each event in a packet are as follows:

  1. Store the event’s time-stamp in all neighboring addresses in the time-stamp memory—for instance, the 8 pixel addresses surrounding the event’s address, overwriting the previous values.
  2. Check if the event’s time-stamp is within dT of the previous value written to the time-stamp map at this event’s address. If a previous event has occurred recently, pass the event to the output, otherwise discard it.

Because branching operations are potentially expensive in modern CPU architectures, two optimizations are used here. First, this implementation avoids the time-stamp difference check on all the neighbors by simply storing an event’s time-stamp in all neighbors. Then only a single conditional branch is necessary, following the iteration that writes the time-stamps to the map. Second, the time-stamp maps are allocated so that they are larger than the input address space by at least the neighborhood distance. Then, during the iteration that writes the time-stamp to the event’s neighboring pixels, there is no need to check array bounds.
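A minimal sketch of this filter follows, written for a single polarity and with hypothetical names; for the DVS, one such map would be kept per polarity, and the actual jAER BackgroundActivityFilter differs in its details.

// Background activity filter sketch: pass an event only if a nearby event occurred within dT microseconds.
class BackgroundFilterSketch {
    static final int SIZE = 128, PAD = 1;                 // sensor size; padding avoids bounds checks
    int dT = 30_000;                                      // support time in microseconds (illustrative value)
    int[][] lastTs = new int[SIZE + 2 * PAD][SIZE + 2 * PAD];

    /** Returns true if the event should be passed to the output. */
    boolean filterEvent(int x, int y, int ts) {
        int px = x + PAD, py = y + PAD;
        boolean pass = (ts - lastTs[px][py]) <= dT;       // step 2's test, done before the map is overwritten
        for (int dx = -1; dx <= 1; dx++)                  // step 1: store ts over the whole 3 x 3 neighborhood
            for (int dy = -1; dy <= 1; dy++)              //         (branch-free; includes the center)
                lastTs[px + dx][py + dy] = ts;
        return pass;                                      // single conditional result after the iteration
    }
}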

Typical results of the background activity filter operating on a DVS retina output are shown in Figure 15.3. The input data are from a walking fruit fly observed from above. The background activity filter removes almost all of the uncorrelated background activity while leaving the spatiotemporally correlated events from the fly. In the jAER implementation running on a Core-i7 870 3 GHz PC, each event is processed in about 100 ns.

Other examples of useful filters are a ‘refractory filter’ that limits event rates for individual addresses, an ‘x-y-type filter’ that only passes events from a particular region of the address space, and a ‘depressing synapse filter’ that passes events with decreasing probability as the average event rate at an address increases. This filter tends to equalize activity across addresses and can limit input from redundant sources, say flickering lights from a silicon retina or spectral bands from a silicon cochlea.


Figure 15.3   Background noise filter example. The input is from a DVS retina viewing a scene with a single walking fruit fly over a 9 s period. The data are rendered as a 2D histogram of collected events, like an image of the collected event addresses. (a) Without background filtering, the uncorrelated background activity rate is about 3 kHz and is visible as the gray speckle. The path of the fly is visible as darker pixels. (b) Using the filter, the background rate is reduced to about 50 Hz, a factor of 60 times smaller, while the activity from the fly is unaffected. The gray scale is the same for both plots

15.4.2   Time-Stamp Maps and Subsampling by Bit-Shifting Addresses

In the filtering example just presented, the incoming event times are written to a time-stamp map. This map is a 2D array of the most recent event time-stamp at each address and is like a picture of the event times. A moving edge creates a landscape in the time-stamp map that looks like a gradual slope up to a ridge, followed by a sharp cliff that falls down to ancient events representing retina output from some prior edge, which is probably no longer relevant. Filtering operations can inspect the most recent event times in the map around the source address.

When filtering events out or labeling them (as discussed in Section 15.4.3), one very useful operation is subsampling by bit-shifting. This operation increases the area of the time-stamp map that can be inspected by an event filter without increasing the cost. Right-shifting the x and y addresses by n bits fills each element of a time-stamp map with the most recent time-stamp in blocks of 2ⁿ × 2ⁿ input addresses. While filtering a packet, the same number of events need to be processed, but now operations on the time-stamp map that iterate over neighborhoods of the incoming event address effectively cover an area of the input address space that is 2²ⁿ times larger. The effect is the same as if the receptive field area were increased by this amount, but at no increase in the cost of iterating over larger neighborhoods.
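A sketch of such subsampled map access, with hypothetical names and n denoting the number of bits shifted:

// Each cell of the subsampled map covers a 2^n x 2^n block of input addresses.
class SubsampledMapSketch {
    static void store(int[][] tsMap, int x, int y, int ts, int n) {
        tsMap[x >> n][y >> n] = ts;                       // most recent time-stamp anywhere in the block
    }
    // A fixed 3 x 3 neighborhood on this map now spans (3 * 2^n) x (3 * 2^n) input pixels,
    // so the effective receptive field area grows by 2^(2n) at no extra iteration cost.
}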

15.4.3   Event Labelers as Low-Level Feature Detectors

Once noise and redundancy in the event stream have been reduced by filtering out events, the next step could be to detect features, for instance, the edge orientation or the direction and speed of motion of edges. (These algorithms are SimpleOrientationFilter and DirectionSelectiveFilter in the jAER project.) These ‘event labeling’ algorithms produce output streams of new types of events, which are labeled with the additional annotation detected during processing.

An example of this type of feature detector is a ‘motion labeler’ that measures local normal optical flow. The results of this algorithm are illustrated in Figure 15.4. The steps of motion measurement consist first of determining the edge orientation and then determining from what direction this edge has moved and how fast. By this method only the so-called ‘normal flow’ can be determined, because motion parallel to the edge cannot be determined.

The method of orientation labeling was inspired by the famous Hubel and Wiesel arrangement of center-surround thalamic receptive fields that are lined up to produce an orientation-selective cortical simple-cell receptive field (Hubel and Wiesel 1962). One can think of the simple cell as a coincidence detector for thalamic input. A moving edge will tend to produce events that are correlated more closely in time with nearby events from the same edge. The orientation labeler detects these coincident events and determines the edge orientation by the following steps (see also Figure 15.4):

  1. Store the event time in the time-stamp map at the event address (x,y). An example of input events is shown in Figure 15.4a. There is one time-stamp map for each type of input event, for example, two maps for the DVS sensor ON and OFF event polarities. A separate map is kept for each retina polarity so that an ON event can be correlated only with previous ON events and an OFF event only with past OFF events.


    Figure 15.4   Orientation feature extraction followed by local normal flow events. (a) The input is 3000 events covering 20 ms of DVS activity from the cross pattern, which is moving to the right. (b) The orientation events are shown by line segments along the detected orientation. The length of the segments indicates the receptive field size. (c) The motion events are shown by vector arrows. They indicate the normal flow velocity vector of the orientation events. The inset shows a close-up of some motion vectors

  2. For each orientation, measure the correlation time in the area of the receptive field. Correlation is computed as the sum of absolute time-stamp differences (SATD) between the input event and the stored time-stamps in the corresponding (ON or OFF) time-stamp map. Typically, the receptive field size is 1 × 5 pixels centered on the event address, in which case four time-stamp differences must be computed. The array offsets into the memory of past event times are precomputed for each orientation receptive field.
  3. Output an event labeled with the orientation of the best correlation result from step 2, but only if it passes an SATD threshold test that rejects events with large SATD, representing poor correlation. The results of this extraction of edge orientation events are shown in Figure 15.4b. (A sketch of this coincidence test follows the list.)
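A minimal sketch of the coincidence test in steps 2 and 3 is given below. The names are hypothetical; the offsets approximate 0°, 45°, 90°, and 135° receptive fields of size 1 × 5 and would normally be precomputed, and (x, y) is assumed to lie inside a map padded by at least two pixels.

// Orientation labeler sketch: pick the orientation with the smallest sum of absolute
// time-stamp differences (SATD); reject the event if even the best SATD is too large.
class OrientationLabelerSketch {
    static final int[][] DX = { {-2, -1, 1, 2}, {-2, -1, 1, 2}, {0, 0, 0, 0},   {-2, -1, 1, 2} };
    static final int[][] DY = { {0, 0, 0, 0},   {-2, -1, 1, 2}, {-2, -1, 1, 2}, {2, 1, -1, -2} };

    /** Returns the best orientation 0..3, or -1 if correlation is too poor. */
    static int label(int[][] tsMap, int x, int y, int ts, int satdThreshold) {
        int best = -1, bestSatd = Integer.MAX_VALUE;
        for (int ori = 0; ori < 4; ori++) {
            int satd = 0;
            for (int k = 0; k < 4; k++)                           // four neighbors of the 1 x 5 field
                satd += Math.abs(ts - tsMap[x + DX[ori][k]][y + DY[ori][k]]);
            if (satd < bestSatd) { bestSatd = satd; best = ori; }
        }
        return bestSatd <= satdThreshold ? best : -1;             // -1: poor correlation, no output event
    }
}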

The next step of the motion algorithm is to use these ‘orientation events’ to determine normal optical flow (again see Figure 15.4). The steps of this algorithm are as follows:

  1. Just as for the orientation labeler, first the input events (the orientation events) overwrite previous event times in time-stamp maps. One map is used to store each input event type; for example, for four orientation types and two input event polarities, eight maps would be used.
  2. For each orientation event, a search over the corresponding time-stamp map is performed in a direction orthogonal to the edge to determine the likely direction of motion of the edge. The SATD is computed for each of two possible directions of motion. The direction with smaller SATD is the direction from which the edge moved. During computation of SATD, only events with a time-stamp difference smaller than a parameter value are included in the SATD. This way, old events that do not belong to the current edge are not counted.
  3. Output a ‘motion direction event’ labeled with one of the eight possible directions of motion and with a speed value that is computed from the SATD. The speed is computed by measuring the time of flight of the current orientation event to each prior orientation event lying in the direction opposite to the direction of motion, within the range of the receptive field size, and then averaging these times (a condensed sketch follows this list). Typically the range is five pixels, but as for the orientation labeler, this size is adjustable. During this computation of the average time of flight, outliers are rejected by only counting times that are within a time limit. An example of the motion labeler output is shown in Figure 15.4c. The small and thin vectors show the local normal optical flow, while the large vectors show the lowpass-filtered average translation, expansion, and rotation values over the recent past.
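One plausible reading of the speed computation in step 3, as a condensed sketch with hypothetical names: pastTs holds the previous orientation-event times at 1..5 pixels opposite to the detected motion direction, taken from the corresponding time-stamp map.

// Average the per-pixel time of flight, rejecting outliers older than maxDtUs, then invert to get speed.
class SpeedSketch {
    static float speedPixelsPerUs(int ts, int[] pastTs, int maxDtUs) {
        float sumDtPerPixel = 0;
        int n = 0;
        for (int i = 0; i < pastTs.length; i++) {
            int dt = ts - pastTs[i];                      // time of flight over (i + 1) pixels
            if (dt > 0 && dt <= maxDtUs) { sumDtPerPixel += (float) dt / (i + 1); n++; }
        }
        return n > 0 ? n / sumDtPerPixel : 0;             // speed = 1 / (average time per pixel); 0 = no support
    }
}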

The motion labeler algorithm described above was developed in 2005 as part of the jAER project. Benosman et al. (2012) reported a different approach that is interesting because, rather than searching over maps of past events, it uses a gradient-based approach: a straightforward translation of the popular and effective Lucas–Kanade optical flow algorithm (Lucas and Kanade 1981) operating on accumulated-event image patches of size n × n pixels (where n = 5 was used) computed over a short time window of a reported 50 μs of DVS activity. This way, the algorithm can determine with a linear regression both scalar orientation and speed to result in a local normal flow vector. By using this approach, spatial and temporal gradients are computed at a high effective sample rate while computational cost is reduced by a factor of 25 compared to the frame-based Lucas–Kanade method. This method results in more accurate optical flow than the motion labeler, but it costs about 20 times as much because a set of n² simultaneous linear equations must be solved for each motion vector. Processing each DVS event on a fast year-2012 PC requires about 7 μs of CPU time, compared with 350 ns for the motion labeler.

15.4.4   Visual Trackers

The task of object tracking is well-suited to activity-driven event-based systems, because moving objects generate spatiotemporal energy that generates events, and these events can then be used for tracking the objects.

Cluster Tracker

As an example of a tracking algorithm we will discuss a relatively simple tracker called the cluster tracker (RectangularClusterTracker in the jAER project). The basic cluster tracker tracks the motion of multiple moving compact objects (Delbrück and Lichtsteiner 2007; Litzenberger et al. 2006b), for instance, particles in a 2D fluid, balls on a table, or cars on a highway. It does this by using a model of an object as a spatially connected and compact source of events. As the objects move, they generate events. These events are used to move the clusters. Clusters are spawned when events are detected at a place where there are no clusters, and clusters are pruned away after they receive insufficient support.

The cluster has a size that is fixed or variable depending on the application, and that can also be a function of location in the image. In some scenarios such as looking down from a highway overpass, the class of objects is rather small, consisting of vehicles, and these can all be clumped into a single restricted size range. This size in the image plane is a function of height in the image because the images of vehicles near the horizon are small and those of ones passing under the camera are maximum size. Additionally, the vehicles near the horizon all appear about the same size because they are viewed head-on. In other scenarios, all the objects are nearly the same size. Such is the case when looking at marker particles in a fluid experiment or falling raindrops.

There are several advantages of the cluster tracker compared with conventional frame-based trackers. First, there is no correspondence problem because there are no frames, and events update clusters asynchronously. Second, only pixels that generate events need to be processed and the cost of this processing is dominated by the search for the nearest existing cluster, which is typically a cheap operation because there are few clusters. And lastly, the only memory required is for cluster locations and other statistics, typically about a hundred bytes per cluster.

The steps for the cluster tracker are outlined as follows. It consists firstly of an iteration over each event in the packet, and secondly of global updates during the packet iteration, at a fixed interval of time.

A cluster is described by a number of statistics, including position, velocity, radius, aspect ratio, angle, and event rate. For example, the event rate statistic describes the average rate of events received by the cluster over the past τrate milliseconds, that is, it is a lowpass-filtered instantaneous event rate.

First, for each event in the packet:

  1. Find the nearest existing cluster by iterating over all clusters and computing the minimum distance between the cluster center and the event location.
  2. If the event is within the cluster radius of the center of the cluster, add the event to the cluster by pushing the cluster a bit toward the event and updating the last event time of the cluster (a compressed sketch of this update follows the list). Before the cluster is updated, it is first translated using its estimated velocity and the time of the event. This way, the cluster has an ‘inertia,’ and the events serve to update the velocity of the cluster rather than its position. Subsequently, the distance the cluster is moved by the event is determined by a ‘mixing factor’ parameter that sets how much the location of the event affects the cluster. If the mixing factor is large, then clusters update more quickly but their movement is noisier. A smaller mixing factor causes smoother movement, but rapid movements of the object can cause tracking to be lost. If the cluster is allowed to vary its size, aspect ratio, or angle, then these parameters are updated as well. For instance, if size variation is allowed, then events far from the cluster center make the cluster grow, while events near the center make it shrink. The cluster event rate statistic and the velocity estimate are also updated using lowpass filters.
  3. If the event is not inside any cluster, seed a new cluster if there are spare unused clusters to allocate. A cluster is not marked as visible until it receives a minimum rate of events. The maximum number of allowed clusters is set by the user to reflect their understanding of the application.
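A compressed sketch of this per-event update is shown below, with hypothetical class and field names; the size, aspect-ratio, rate, and velocity updates are omitted for brevity.

// Cluster tracker sketch: find the nearest cluster and mix the event location into it.
class ClusterSketch {
    float x, y, radius;
    float eventRate;                                      // lowpass-filtered instantaneous event rate
    int lastEventTs, birthTs;
}

class ClusterTrackerSketch {
    java.util.List<ClusterSketch> clusters = new java.util.ArrayList<>();
    float mixingFactor = 0.01f;                           // larger: faster but noisier tracking

    ClusterSketch nearest(float ex, float ey) {
        ClusterSketch best = null;
        float bestDist = Float.MAX_VALUE;
        for (ClusterSketch c : clusters) {
            float d = Math.abs(c.x - ex) + Math.abs(c.y - ey);  // cheap distance; there are few clusters
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }

    void addEventToCluster(ClusterSketch c, float ex, float ey, int ts) {
        c.x += mixingFactor * (ex - c.x);                 // push the cluster a bit toward the event
        c.y += mixingFactor * (ey - c.y);
        c.lastEventTs = ts;                               // velocity and event-rate lowpass updates omitted
    }
}

An event would be passed to addEventToCluster only if it lies within the nearest cluster’s radius; otherwise a new cluster would be seeded as in step 3.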

Second, at periodic update intervals (e.g., 1 ms, which is a configurable option), the following cluster tracker update steps are performed. The decision on when to do an update is determined during iteration over the event packet. Each event’s time-stamp is checked to see if it is larger than the next update time. If it is larger, then the update steps as outlined next are performed, and after the update, the next update time is incremented by the update interval.

The steps for the periodic cluster tracker update are as follows:

  1. Iterate over all clusters, pruning out those clusters that have not received sufficient support. A cluster is pruned if its event rate has dropped below a threshold value.
  2. Iterate over all clusters to merge clusters that overlap. This merging operation is necessary because new clusters can be formed when an object increases in size or changes aspect ratio. The merge iteration continues until there are no more clusters to merge. Merged clusters usually (depending on tracker options) take on the statistics of the oldest cluster, because this one presumably has the longest history of tracking the object. (A sketch of these update passes follows the list.)
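A sketch of these periodic pruning and merging passes, written as a method that could be added to the hypothetical ClusterTrackerSketch above; merged clusters here simply keep the older member’s statistics.

// Periodic update: prune unsupported clusters, then merge overlapping ones until none overlap.
void periodicUpdate(float minEventRate) {
    clusters.removeIf(c -> c.eventRate < minEventRate);   // prune clusters with too little support
    boolean merged = true;
    while (merged) {                                      // repeat until no overlapping pair remains
        merged = false;
        search:
        for (int i = 0; i < clusters.size(); i++) {
            for (int j = i + 1; j < clusters.size(); j++) {
                ClusterSketch a = clusters.get(i), b = clusters.get(j);
                if (Math.hypot(a.x - b.x, a.y - b.y) < a.radius + b.radius) { // clusters overlap
                    clusters.remove(a.birthTs <= b.birthTs ? b : a);          // keep the older cluster
                    merged = true;
                    break search;                         // rescan from the start after each merge
                }
            }
        }
    }
}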

An example of an application of this kind of tracker is the robotic goalie in Delbrück and Lichtsteiner (2007), which was later extended to include self-calibration of its arm. Figure 15.5 shows the setup and a screenshot taken during operation of the goalie. The goal in developing this robot was to demonstrate fast, cheap tracking and quick reaction time. As shown in Figure 15.5a, a person would try to shoot balls into the goal and the goalie would block the most threatening ball, defined as the one that would cross the goal line first. A fast shot would cover the 1 m distance to the goal in under 100 ms. To achieve this, the robot needed to track the balls to determine their positions and velocities, so that it could move the arm under open-loop control to the correct position to block the ball. During idle periods the robot also moved its arm to random positions while tracking it, in order to calibrate the mapping from servo position to visual space. That way, the robot could set the arm to a desired visual location in the image. Figure 15.5b is a screenshot captured during operation. Three balls were being tracked, and one of them, the attacking ball, was circled to indicate it was the one being blocked. Each ball also has an attached velocity vector estimate, and there are a number of other potential ball clusters that did not receive sufficient support to be classified as balls. The scene was statically segmented into a ball tracking region and an arm tracking region. In the arm region, the arm is tracked by a larger tracker cluster, and in the ball region the tracker was configured so that the cluster size matched the ball size based on perspective. During that snapshot of 2.9 ms of activity, there were 128 DVS events at a rate of 44 keps. The event rate was fairly linear with ball speed, so slow-moving balls generated fewer events and fast-moving ones more. Because the tracker was event-driven, fast-moving balls were tracked with the same performance as slow-moving ones. The larger the goalie hand, the easier it is to block the balls. Using a goalie hand size of 1.5 times the ball diameter, the goalie was measured to block 95% of the balls. The goalie ran on a 2006-era laptop processor at a processor load of under 4%. The reaction time—defined as the time interval from starting to track a ball to a change in the servo control signal—was measured to be less than 3 ms.


Figure 15.5   The robotic goalie. (a) The setup shows a DVS silicon retina and the 1-axis goalie arm.
(b) A screenshot shows tracking of multiple balls and the goalie arm

More Tracking Applications of the DVS

Other applications of tracking have customized tracking to deal with specific scenarios. For example, Lee et al. (2012) reported a stereopsis-based hand tracker using DVS retinas for gesture recognition. Here the average stereo disparity was determined by cross-correlation of 1D histograms of activity in two DVS sensors. Next, events from the DVS sensors were transformed to bring the peak activities into registration. (This way, the dominant object, which was the moving hand, was focused on a single region, while background movements of the person were suppressed.) Next, a tracker that cooperatively detected connected regions of activity found the moving hand area. This tracker was based on laterally coupled I&F neurons. Neurons were connected to the DVS input and to their neighbors with excitatory connections, so that a connected region of activity was preferred over disconnected regions. The median location of activity in the largest connected region of active I&F neurons was used as the hand location.

In a second example, the pencil balancer robot reported by Conradt et al. (2009b) used an interesting tracker based on a continuous Hough transform. Each retina event was considered to draw a ridge of activity in the Hough transform space (Ballard 1981) of pencil angle and base position. An efficient algebraic transform allowed update of the continuous pencil estimate in Hough space without requiring the usual awkward quantization and peak-finding of the conventional Hough transform.

In a third example, the hydrodynamic particle tracker reported in Drazen et al. (2011) used a method based on the cluster tracker described above to track the fluid flow markers, but extra processing steps were applied to deal with particles with crossing trajectories.

And in the fourth example, Ni et al. (2012) applied a discretized circular Hough transform for microparticle tracking, in a manner quite similar to the CAVIAR hardware convolution in Serrano-Gotarredona et al. (2009). This tracker was not sufficiently accurate on its own, so it was followed by computation of the centroid of recent event activity around the tracker location to increase accuracy.

Finally, Bolopion et al. (2012) reported using a DVS together with a CMOS camera to track a microgripper and the particle being manipulated to provide high-speed haptic feedback for gripping. It used an iterative closest point algorithm based on Besl and McKay (1992) to minimize Euclidean distance between events in the last 10 ms and the gripper shape model by rotation and translation of the gripper model.

Action and Shape Recognition with the DVS

Some work has been performed on classification of events such as detecting people falling down, which is important for elderly care. Fu et al. (2008) reported a simple fall detector based on classifying DVS time-space event histograms from a single DVS looking head-on or sideways at a person. Later, Belbachir et al. (2012) demonstrated an impressive fall detector that used a stereo pair of DVS sensors and that allowed an overhead view of the person falling.

This same group developed a full-custom embedded DSP based on a MIPS core with native AE time-stamping interfaces (Hofstatter et al. 2009) that was used to do high-speed classification of flat shapes using single and dual-line DVS sensor outputs (Belbachir et al. 2007; Hofstatter et al. 2011).

Applications of Event-Based Vision Sensors in Intelligent Transportation Systems

The VISe vision sensor described in Section 3.5.5 was applied to lane detection and lane departure warning in a highly developed manner, as reported in Grenet (2007). This report, although not available online, is worth reading for its completeness and the systematic design of the system, including the implementation of several different tracking states. The system switched between bootstrapping, search, and tracking states and used a Kalman filter during the tracking phase. It also detected the difference between continuous and dashed lane markings.

A major application of the DVS has been in intelligent transportation systems, in particular in monitoring traffic on highways. Specialized algorithms have been developed that use an embedded commercial DSP that processes DVS output (Litzenberger et al. 2006b). Using this system Litzenberger et al. (2006a) reported vehicle speed measurement and Litzenberger et al. (2007) reported car counting. In the latest work, this group also was able to classify cars versus trucks on highways (Gritsch et al. 2009) during nighttime, based on headlight separation.

15.4.5   Event-Based Audio Processing

The event-based approach can be particularly beneficial in applications requiring precise timing. One example comes from binaural auditory processing, where cross-correlation is used to determine the interaural time difference (ITD) and ultimately the azimuthal direction of sound sources. Processing of cochlea spikes from the AEREAR2 cochlea for localization is described in Section 4.3.2 of Chapter 4. An event-based method reported in Finger and Liu (2011) for estimating ITD (ITDFilter in the jAER project) shows how rolling buffers of events from two sources can be cross-correlated. In this method, ITDs are computed between events from one cochlea channel of one ear and previous events from the corresponding channel of the other ear. Only ITDs up to some limit, for example ±1 ms, are considered. Each of these ITDs is weighted (as described below) and stored in a decaying histogram of ITD values. The locations of the peaks in the histogram signify the ITDs of the sound sources.

It is harder to localize sound in reverberant spaces, because the sound can take multiple paths to reach a listener. However, sound onsets can be used to disambiguate the source direction because the onset of a sound can be used to determine the direct path. A special feature used in the ITD algorithm was to multiply the value added to the histogram by the temporal duration of silence (no events) prior to the current event. This way, sound onsets are weighted more strongly. Using this feature improved the performance dramatically.
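A minimal sketch of this histogram update for one channel pair and one ear is given below, with hypothetical names; a symmetric method would handle events from the other ear, and the histogram would be decayed periodically so that its peak tracks the current source ITD.

// ITD sketch: cross-correlate against a rolling buffer of the other ear's recent events,
// weighting each contribution by the silence that preceded the current event (onset weighting).
class ItdSketch {
    static final int MAX_ITD_US = 1000;                    // consider ITDs up to +/- 1 ms
    static final int BIN_US = 20;                          // histogram bin width in microseconds
    float[] histogram = new float[2 * MAX_ITD_US / BIN_US + 1];
    java.util.ArrayDeque<Integer> otherEarTs = new java.util.ArrayDeque<>();
    int lastTsThisEar = 0;

    void onEventThisEar(int ts) {
        float onsetWeight = ts - lastTsThisEar;            // long preceding silence -> strong onset weight
        lastTsThisEar = ts;
        for (int otherTs : otherEarTs) {
            int itd = ts - otherTs;
            if (Math.abs(itd) <= MAX_ITD_US)
                histogram[(itd + MAX_ITD_US) / BIN_US] += onsetWeight;
        }
        while (!otherEarTs.isEmpty() && ts - otherEarTs.peekFirst() > MAX_ITD_US)
            otherEarTs.pollFirst();                        // drop events too old to produce a valid ITD
    }
}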

Finger and Liu (2011) reported that the system could localize the direction of a sound source in a reverberant room to 10° of precision using a microphone separation of about 15 cm, representing a precision of about 60 μs in estimating the ITD. Using the event-based approach to cross-correlation reduced the computational cost by a factor of about 40 compared with conventional cross-correlation of regularly sampled sound at a sample rate that achieved the same performance in the room (Liu et al. 2013). The event-based approach results in a latency of sound source localization for human speech of under 200 ms.

15.5   Discussion

An event-driven approach to signal processing based on neuromorphic AER sensors and actuators has the potential to realize small, fast, low-power embedded sensory-motor processing systems that are beyond the reach of traditional approaches under the constraints of power, memory, and processor cost. However, in order to bring this field to the same level of competence as conventional Nyquist-based signal processing, there is clearly a need for theoretical developments. For instance, it is not clear how very basic elements like analog filters can be designed using an event-based approach. In conventional signal processing, the assumption of linear time-invariant systems, regular samples, and the use of the z-transform allow the systematic construction of signal processing pipelines, as described in every standard signal processing textbook. No such methodology exists for event-based processing. Although some authors write a kind of formal description of an event-based processing algorithm as a set of equations, so far these equations are only descriptions of an iterator with branching, and they cannot be manipulated to derive different forms or combinations in the way that regular algebras allow. And to bring the event-based approach together with rapid advances in machine learning, there is clearly a need to bring learning into the event-based approach. These and other developments will make this area interesting over the next few years. One intriguing possibility is that the most powerful algorithms will turn out to be the ones that run best on the kinds of large-scale neuromorphic hardware systems described in Chapter 16.

References

Ballard DH. 1981. Generalizing the Hough transform to detect arbitrary shapes. Pattern Recogn. 13(2), 111–122.

Belbachir AN, Litzenberger M, Posch C, and Schon P. 2007. Real-time vision using a smart sensor system. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1968–1973.

Belbachir AN, Litzenberger M, Schraml S, Hofstatter M, Bauer D, Schon P, Humenberger M, Sulzbachner C, Lunden T, and Merne M. 2012. CARE: a dynamic stereo vision sensor system for fall detection. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 731–734.

Benosman R, Ieng SH, Clercq C, Bartolozzi C, and Srinivasan M. 2012. Asynchronous frameless event-based optical flow. Neural Netw. 27, 32–37.

Besl PJ and McKay HD. 1992. A method for registration of 3-D shapes. IEEE Trans. Pattern Anal. Mach. Intell. 14(2), 239–256.

Bolopion A, Ni Z, Agnus J, Benosman R, and Régnier S. 2012. Stable haptic feedback based on a dynamic vision sensor for microrobotics. Proc. 2012 IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), pp. 3203–3208.

Conradt J, Berner R, Cook M, and Delbruck T. 2009a. An embedded AER dynamic vision sensor for low-latency pole balancing. Proc. 12th IEEE Int. Conf. Computer Vision Workshops (ICCV), pp. 780–785.

Conradt J, Cook M, Berner R, Lichtsteiner P, Douglas RJ, and Delbruck T. 2009b. A pencil balancing robot using a pair of AER dynamic vision sensors. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 781–784.

Delbrück T. 2008. Frame-free dynamic digital vision. Proc. Int. Symp. Secure-Life Electronics, Advanced Electronics for Quality Life and Society, University of Tokyo, March 6–7. pp. 21–26.

Delbrück T and Lichtsteiner P. 2007. Fast sensory motor control based on event-based hybrid neuromorphic-procedural system. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 845–848.

Drazen D, Lichtsteiner P, Hafliger P, Delbrück T, and Jensen A. 2011. Toward real-time particle tracking using an event-based dynamic vision sensor. Exp. Fluids 51(5), 1465–1469.

Finger H and Liu SC. 2011. Estimating the location of a sound source with a spike-timing localization algorithm. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 2461–2464.

Fu Z, Delbruck T, Lichtsteiner P, and Culurciello E. 2008. An address-event fall detector for assisted living applications. IEEE Trans. Biomed. Circuits Syst. 2(2), 88–96.

Grenet E. 2007. Embedded high dynamic range vision system for real-time driving assistance. Technische Akademie Heilbronn e. V., 2. Fachforum Kraftfahrzeugtechnik, pp. 1–16.

Grenet E, Gyger S, Heim P, Heitger F, Kaess F, Nussbaum P, and Ruedi PF. 2005. High dynamic range vision sensor for automotive applications. European Workshop on Photonics in the Automobile, pp. 246–253.

Gritsch G, Donath N, Kohn B, and Litzenberger M. 2009. Night-time vehicle classification with an embedded vision system. Proc. 12th IEEE Int. Conf. Intell. Transp. Syst. (ITSC), pp. 1–6.

Hofstatter M, Schon P, and Posch C. 2009. An integrated 20-bit 33/5M events/s AER sensor interface with 10 ns time-stamping and hardware-accelerated event pre-processing. Proc. IEEE Biomed. Circuits Syst. Conf. (BIOCAS), pp. 257–260.

Hofstatter M, Litzenberger M, Matolin D, and Posch C. 2011. Hardware-accelerated address-event processing for high-speed visual object recognition. Proc. 18th IEEE Int. Conf. Electr. Circuits Syst. (ICECS), pp. 89–92.

Hubel DH and Wiesel TN. 1962. Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. J. Physiol. 160(1), 106–154.

jAER. 2007. jAER Open Source Project, http://jaerproject.org (accessed August 6, 2014).

Lee J, Delbruck T, Park PKJ, Pfeiffer M, Shin CW, Ryu H, and Kang BC. 2012. Live demonstration: gesture-based remote control using stereo pair of dynamic vision sensors. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 741–745.

Li CH, Delbrück T, and Liu SC. 2012. Real-time speaker identification using the AEREAR2 event-based silicon cochlea. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1159–1162.

Litzenberger M, Kohn B, Belbachir AN, Donath N, Gritsch G, Garn H, Posch C, and Schraml S. 2006a. Estimation of vehicle speed based on asynchronous data from a silicon retina optical sensor. Proc. 2006 IEEE Intell. Transp. Syst. Conf. (ITSC), pp. 653–658.

Litzenberger M, Posch C, Bauer D, Belbachir AN, Schon P, Kohn B, and Garn H. 2006b. Embedded vision system for real-time object tracking using an asynchronous transient vision sensor. Proc. 2006 IEEE 12th Digital Signal Processing Workshop, and the 4th Signal Processing Education Workshop, pp. 173–178.

Litzenberger M, Kohn B, Gritsch G, Donath N, Posch C, Belbachir NA, and Garn H. 2007. Vehicle counting with an embedded traffic data system using an optical transient sensor. Proc. 2007 IEEE Intell.Transp. Syst. Conf. (ITSC), pp. 36–40.

Liu SC, van Schaik A, Minch B, and Delbrück T. 2013. Asynchronous binaural spatial audition sensor with 2 × 64 × 4 channel output. IEEE Trans. Biomed. Circuits Syst., pp. 1–12.

Lucas BD and Kanade T. 1981. An iterative image registration technique with an application to stereo vision. Proc. 7th Int. Joint Conf. Artificial Intell. (IJCAI), pp. 674–679.

Ni Z, Pacoret C, Benosman R, Ieng S, and Regnier S. 2012. Asynchronous event-based high speed vision for microparticle tracking. J. Microscopy 245(3), 236–244.

Rogister P, Benosman R, Leng S, Lichtsteiner P, and Delbrück T. 2012. Asynchronous event-based binocular stereo matching. IEEE Trans. Neural Netw. Learning Syst. 23(2), 347–353.

Schraml S, Belbachir AN, Milosevic N, and Schön P. 2010. Dynamic stereo vision system for real-time tracking. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1409–1412.

Serrano-Gotarredona R, Oster M, Lichtsteiner P, Linares-Barranco A, Paz-Vicente R, Gomez-Rodriguez F, Camunas-Mesa L, Berner R, Rivas M, Delbrück T, Liu SC, Douglas R, Häfliger P, Jimenez-Moreno G, Civit A, Serrano-Gotarredona T, Acosta-Jimenez A, and Linares-Barranco B. 2009. CAVIAR: a 45K-neuron, 5M-synapse, 12G-connects/sec AER hardware sensory-processing-learning-actuating system for high speed visual object recognition and tracking. IEEE Trans. Neural Netw. 20(9), 1417–1438.

__________

¹ Parts of this material stem from Delbrück (2008).
