
13   Hardware Infrastructure

To make use of custom address event (AE) based neuromorphic chips built using the circuits described in the previous chapters, they need to be embedded into a larger hardware infrastructure. This chapter describes some of the considerations which must be borne in mind when designing, building, and operating the printed circuit boards which form this infrastructure for relatively small-scale, chiefly experimental systems. Examples are given taken from several projects. The necessary accompanying software is described in Chapter 14. The present chapter also includes a section reviewing the use of field programmable gate arrays (FPGAs) in neuromorphic systems. Hardware infrastructure for larger systems containing more than a handful of neuromorphic chips is discussed separately in Chapter 16.

13.1   Introduction

As we have already seen in Chapter 2, in order to construct larger AER systems, a certain amount of hardware infrastructure ‘glue’ is necessary between the active computational elements.

In fact even for the simplest systems, at least some means of capturing AE data is required. General purpose commercially available data acquisition boards are usually ill-suited to perform the asynchronous handshaking expected by an AER sender, and do not record the times at which individual events are received. For the purpose of capturing data from a simple experiment with perhaps one AER sender device and one AER receiver device, or for debugging purposes on parts of larger systems, a logic analyzer can easily be used to record the AE stream on a parallel bus. Logic analyzers, however, are expensive, have limited memory depth, and usually operate in a batched mode. This means that while they are acquiring data, those data are not available for further processing. Only when an acquisition is complete, or between acquisitions, can the data be read into a computer. Thus logic analyzers are not ideally suited to the continuous processing and analysis of AE streams, and specialized AER receiver devices, often known as monitors, will be required to interface to conventional digital hardware.

If an AER system does not contain a sensory input device such as a retina chip or cochlea chip, it will typically require an external source of AEs for stimulation. Even in systems which do contain sensory input devices, it is often useful or necessary to be able to provide AE stream input either as a replacement for or in addition to input from those sensory devices, for instance for debugging, or in order to provide repeatable input for the purposes of an experiment. For this stimulation of AE systems, good off-the-shelf solutions are difficult to find. Most general purpose commercially available digital output boards are again ill-suited, because they are neither designed to perform the asynchronous handshaking expected by an AER receiver nor necessarily designed to preserve the timing between output words. Thus, specialized AER sender devices are required to allow conventional digital hardware to produce AE output. These devices are often known as sequencers because they generate a sequence of AEs.

Section 2.4 referred to idealized address-event receiver blocks within which source addresses (the addresses of source neurons) must be converted to multiple destination addresses (the addresses of individual target synapses), thus performing a fan-out function. Even if no fan-out is required, it is usual that some address translation will be required between sending and receiving devices, since unless sender and receiver have been conceived of together as a pair, it is unlikely that their address spaces will match one to one. In practice, chip-level devices that conform to this idealized model of performing address translation and/or fan-out internally have only relatively recently been constructed (Lin et al. 2006), and are still not the norm. Without these, there is a need for separate AER devices that act as receivers on one bus and senders on another, performing any desired address translation and fan-out in between. Such devices are often called mappers. If a mapper has multiple possible output buses, it may also be called a router. Being able to perform programmable address translation between a sender and receiver also provides the very important feature of reconfigurability. Not all source-to-destination connections must be decided at the time of manufacture of the devices. General purpose devices can be constructed and the precise connectivity between their elements determined and modified after they have been built into systems; this also facilitates learning, in which new connections are made and others eliminated at run-time.

Further operations on AE streams which can be supported by interface logic include splitting, in which events on one incoming bus are copied to more than one outgoing bus, and the inverse, merging, in which events are taken from multiple incoming buses and copied to one outgoing bus. These functions can also be implemented by mappers having more than one input or output bus, respectively. (Merging can alternatively be seen as an essential internal part of any AER device which must receive AEs from more than one input bus—there must be some form of arbitration between the incoming buses, unless events are to be dropped in the case of events arriving on multiple buses at the same time.)

Thus we have three major classes of interface logic functionality used between AER modules: monitors, sequencers, and mappers (these terms were introduced in Dante et al. 2005). Combinations of these are of course also possible and are frequently encountered.

13.1.1   Monitoring AER Events

An AER monitoring tool should be able to capture long sequences of AEs, interfering as little as possible with the system under test. As events arrive asynchronously and at a wide range of possible rates, depending on system activity, they need to be buffered, typically in dedicated First-In First-Out (FIFO) memory, before being further processed or recorded in long-term memory. The FIFO decouples the reception of the incoming AEs from the operations of the host reading those events over some bus, for example, PCI or USB, the use of which may be shared with other peripherals. In the ideal case, the FIFO will be read often enough that it never fills up, or overruns, such that events are lost. A mechanism is usually provided to alert the user if and when such an overrun does occur. Key to avoiding overruns is getting the data out fast. Early monitors, for example, the Rome PCI-AER board described in Dante et al. (2005), used polled and/or interrupt driven I/O to transfer data from the board, but could not do bus mastering or direct memory access (DMA, a feature that allows peripherals to transfer data to and from memory without involving the processor). This was the main factor limiting its performance to approximately 1 Meps. In some cases this may be acceptable, but many AE chips can produce much higher spike rates. To increase performance, techniques such as DMA and bus mastering are required such that the host CPU is not involved in merely copying AE words and time-stamps out of the monitor FIFO. Alternatively, where high bandwidth monitoring is required, a stream of events and high resolution time-stamps can be captured into RAM local to the monitoring device for later, off-line processing, as was done in the CAVIAR USB-AER board (Gomez-Rodriguez et al. 2006).

Monitor functionality is usually implemented in an FPGA.

Data-Width, Channel Addressing, and Framing Errors

Monitors must obviously be constructed to capture the full width of the AE words in use, but it makes little sense to make a monitor that captures, for example, exactly 21-bit AE words because a particular sender happens to produce 21-bit words. Monitored data will sooner or later be stored in a conventional computer RAM system having a data width of 2ⁿ 8-bit bytes. Thus in our 21-bit example, the data is going to be stored in at least 32-bit words, so it makes sense to construct a more general monitor which can capture up to 32-bit wide AE words. However, any bits which are not driven by the sender must be set to some particular value, typically zero, so that in our 21-bit example the whole 32-bit word is guaranteed always to be the same for the same 21-bit value.

Some monitors allow the simultaneous monitoring of multiple devices on multiple channels and allow these devices to be distinguished by placing a different value for each channel in otherwise unused high-order address-bits. For example, the Rome PCI-AER board (Dante et al. 2005) could be configured to monitor one 16-bit sender, two 15-bit senders, or four 14-bit senders.

Whenever AE words are to be transmitted over an interface which is less wide than the words in question, one must guard against so-called framing errors, whereby the distinction between the inter- and intra-word boundaries may be lost, leading to mistaking, for instance, the high-order part of one word for the low-order part of the next. This is particularly true across interfaces where the data flow may be interrupted and resumed by the reader. An example of this problem and its resolution occurs in the AEX board (Fasnacht et al. 2008) in which 32-bit AE words are transmitted over serial buses using 16-bit Serializer–Deserializer (SerDes) chips. In order to detect the 32-bit word boundary, it is defined that the two 16-bit words must be sent back-to-back with no IDLE characters in between. Once an IDLE character is seen, the receiver knows where the 32-bit word boundary is. This allows 32-bit words also to be sent back-to-back, once the receiver has seen a single IDLE character, thus the full bandwidth available can still be used for AE data.
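
The resynchronization rule can be sketched in software. The following Python fragment is a minimal illustration of the idea, not the AEX firmware itself; the way the IDLE character is represented is an assumption made for the example.

```python
IDLE = None  # hypothetical stand-in for an IDLE character from the SerDes

def deframe(symbols):
    """Recover 32-bit AE words from a stream of 16-bit symbols.

    Data halves are sent back-to-back; an IDLE symbol can only occur on a
    32-bit word boundary, so the first IDLE seen resynchronizes the framer.
    """
    words = []
    synced = False      # have we seen an IDLE (i.e., a known boundary) yet?
    high = None         # buffered high half of the current 32-bit word
    for sym in symbols:
        if sym is IDLE:
            synced = True   # we are now exactly at a word boundary
            high = None
            continue
        if not synced:
            continue        # discard symbols until the boundary is known
        if high is None:
            high = sym      # first 16 bits of a 32-bit word
        else:
            words.append((high << 16) | sym)
            high = None
    return words
```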

Inter-Spike Intervals and Time-Stamps

Since neurons code information not only in their identity but also in the frequency or timing of their spikes, pixels that transmit their addresses over an AER bus also code information in the frequency or timing of appearance of those addresses on the bus. The inter-spike intervals (ISIs) are therefore critical to this communication mechanism, and a monitor needs to preserve the information conveyed in the ISIs of the AE stream, which would otherwise be lost as soon as the incoming addresses are buffered. This is done by recording in the buffer some form of time-stamp, that is, the value of a timer, with the address of each event, as soon as it arrives. Time-stamps can be relative to the previous event arrival time, in which case they directly represent an ISI, or they can be so-called absolute, in which case they are actually relative to some earlier timer reset time, typically the time when the monitor hardware was powered up, or the beginning of an experiment. Note that in the case of relative time-stamps, the ‘ISI’ to which they refer is an interval between two spikes which are not necessarily, indeed usually not, from the same neuron, but simply between two spikes from any two neurons which report their spikes onto the same bus. This may lead to difficulties later in processing the events, since if any events and their associated relative time-stamps are disregarded, the spike timing of the remaining events can no longer be faithfully reconstructed. Therefore the use of so-called absolute time-stamps, where the time recorded alongside the event address is unambiguous, is more usual. A source of ambiguity remains, however, in that we may not know exactly when the time-stamp timer was reset in relation to other clocks in the system, for example, a host PC’s system clock, the timer of another monitor, wall-clock time, or the time of a stimulus onset in an experiment. Achieving correspondence between clocks is not trivial even if we can command a reset of the time-stamp timer: given the vagaries of operating system schedulers and PC hardware, we cannot be certain when a PC-initiated timer reset actually takes place. Ideally a time-stamp timer should be not only resettable, but also readable. This does not solve the problem completely, since we will still not know the exact time at which the timer is read, but at least then multiple readings of the timer can be taken and correlated against the PC system clock without having to reset the timer.
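
A small Python sketch may make the distinction concrete; the event and ISI representations used here are illustrative, not those of any particular board.

```python
def relative_to_absolute(events, t0=0):
    """events: list of (isi, address); returns list of (timestamp, address).

    Each ISI is the interval since the *previous event on the bus*, whichever
    neuron it came from, so every earlier event is needed to place a later one.
    """
    t, out = t0, []
    for isi, addr in events:
        t += isi
        out.append((t, addr))
    return out

# With absolute time-stamps, filtering is harmless; with relative ones it is not:
abs_events = relative_to_absolute([(3, 7), (5, 2), (1, 7)])
only_addr7 = [e for e in abs_events if e[1] == 7]   # timing of address 7 is preserved
```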

Synchronization between Monitors

When multiple AER streams are to be recorded simultaneously by multiple monitors, it can be very difficult if not impossible to ascertain after the fact what the precise timing offsets between the recorded streams might have been. In order that multiple monitors run with synchronized timers, there may be provision made in the monitor hardware to establish direct synchronization links between two or more monitors. One monitor is then designated the master and the remainder are slaves. The time-stamp timers of the slaves can then be clocked by the master. When the master timer is reset, all the slave timers are also reset and thus have the same absolute time-stamp as the master.

This approach to the synchronization of several monitors was implemented in the USBAERmini2 (Berner et al. 2007; jAER Hardware Reference n.d.; Serrano-Gotarredona et al. 2005); see Section 13.2.3.

Time-base

The time-base used, that is, the period at which the time-stamp timer ‘ticks,’ also requires consideration. A time-base of 1 ms may be considered adequate given the time-scale of a biological action potential, but this depends on the application, and 1 μs or less might be preferred. Some monitor hardware offers the user a choice of time-base, for example, the Rome PCI-AER board (Dante 2004; Dante et al. 2005).

Wrap-around

Obviously, the faster the clock, the more bits are needed to store the time-stamps before they wrap around from the maximum storable value back to zero. For example, with a time-base of 1 μs and 16-bit time-stamps, wrap-around from 65 535 to zero would happen every 65.536 ms, leading to the identical time-stamp value representing multiple points in time with a period of 65.536 ms. There are several possible approaches for dealing with this wrap-around. In some systems and circumstances, for instance when data are simply being captured over short time periods for immediate display, it may be possible to simply do nothing. Alternatively, increasing the number of bits used for the time-stamp clearly reduces the frequency of wrap-around. A 32-bit time-stamp used with a time-base of 1 μs wraps around only roughly every hour and 11.5 minutes, but comes at the expense of having to store and transmit twice as many bits. Running with a slower time-base is of course also possible, but results in a loss of timing resolution for the individual events.

Another approach that has been used is generally to store 16-bit time-stamps, but to reserve a bit pattern to indicate that, exceptionally, a 32-bit time-stamp follows.

The USBAERmini2 described by Berner et al. (2007) takes the approach of retaining 16-bit time-stamps, but places a special time-stamp-wrapped event in the stream of data read by the host whenever the time-stamp time wraps around. The host can then reconstruct 32-bit time-stamps from the 16-bit time-stamps present in the data together with the number of time-stamp-wrapped events that it has seen.
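
A host-side sketch of this reconstruction is shown below (Python, illustrative; the actual encoding of the wrap event in the USBAERmini2 data stream may differ).

```python
WRAP_EVENT = object()   # stand-in for the special time-stamp-wrapped event

def unwrap(stream):
    """stream yields either WRAP_EVENT or (timestamp16, address) tuples.

    Returns (timestamp32, address) tuples, where timestamp32 combines the
    number of wrap events seen so far with the 16-bit hardware time-stamp.
    """
    wraps = 0
    out = []
    for item in stream:
        if item is WRAP_EVENT:
            wraps += 1                      # timer passed 0xFFFF -> 0x0000
        else:
            ts16, addr = item
            out.append(((wraps << 16) | ts16, addr))
    return out
```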

Heartbeats

The device described by Merolla et al. (2005) does not use time-stamps but instead uses a special ‘heartbeat’ address sent at regular intervals. This scheme uses less bandwidth but requires interpolation to reconstruct time-stamps.

Early Packets

One of the main challenges with monitoring is coping with high data rates, but low data rates can also be problematic. Consider for instance a monitor connected to a silicon retina of the kind described in Chapter 3 which only produces events when something in the observed scene changes. If there is very little change in the scene, then very few events are emitted and the monitor FIFO fills up only very slowly. An application performing visualization of the data or using it for real-time control would normally have no visibility of the data accumulating in the FIFO until the amount of data exceeds some threshold that causes it to be read by the host. To avoid this problem, a monitor can implement an ‘early packet’ feature, as was done by Berner et al. (2007), which ensures a certain maximum interval (e.g., 10 ms) between the times at which the data are available to the host. An additional early packet timer is used to force the FIFO to be read even if it is not full whenever the early packet interval has elapsed with no intervening reads.
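
The decision logic amounts to something like the following sketch, with illustrative threshold and interval values; the cited hardware implements it with a dedicated timer in the device.

```python
def should_commit(fifo_fill, last_commit_us, now_us,
                  threshold=512, early_packet_interval_us=10_000):
    """Commit the FIFO contents to the host either when enough data has
    accumulated or when the early-packet interval has elapsed with no
    intervening read, whichever comes first (values are illustrative)."""
    return (fifo_fill >= threshold
            or (now_us - last_commit_us) >= early_packet_interval_us)
```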

Sniffers vs. Interposed Monitors

Suppose we have an AER transmitter chip connected to an AER receiver chip, and we want to monitor the communication between the two. In principle, there are two possibilities as shown in Figure 13.1: connecting an AER sniffer element to the bus (Figure 13.1a) or interposing a new AER element between the transmitter and the receiver (Figure 13.1b).

A sniffer element will capture the addresses of the events on an AER bus without being involved in the handshaking taking place on the bus, except insofar as for each request that appears on the AER bus, it stores the corresponding address, together with a time-stamp, in memory. (This is how a logic analyzer may be used to monitor an AER bus.) The problems with this approach are that the speed of the bus protocol lines (e.g., 15 ns per event) could be faster than the maximum speed supported by the sniffer, causing events to be lost, or that the throughput of the AER bus might be so high that the contents of the buffers on the interface cannot be transferred to the computer’s main memory in time. This also implies that events are lost.


Figure 13.1  A monitor implemented as a sniffer (a) as opposed to an interposed normal AER sender/receiver element (b)

The other possibility is to interpose a new AER element between the two chips. In this case the transmitter sends the event to the AER element and the AER element sends the same event to the receiver chip. The problem now is that the new AER element will always introduce extra delay, and may also block the transmitter if it is not able to keep up with its throughput, thereby causing ISIs not to be preserved. The behavior, however, will be the same as if we had connected the transmitter to a slower receiver.

AER to Frame Conversion

Although much interesting processing can be carried out directly on address event streams (see Chapter 15), it is still sometimes desirable to convert AER data into a frame-based representation, that is, one in which ‘pixels’ in regularly generated frames reflect the frequency of occurrence of particular AEs, analogous to a 2D histogram. Some monitors implement AER to frame conversion in hardware (Paz-Vicente et al. 2006). This is a relatively simple task as it basically only requires counting the events received for each pixel address over the frame period. Under certain circumstances, with high activity, this technique may reduce the bandwidth required to represent the data. Of course, experimenters will still sometimes want to know the precise timing of each event, thus both alternatives should be preserved.
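
As a sketch of the counting involved, the following Python fragment builds one 2D histogram frame per frame period; the event format and image dimensions are assumptions made for the example, and real monitors do this in hardware.

```python
import numpy as np

def events_to_frames(events, frame_period, width=128, height=128):
    """events: iterable of (timestamp_us, x, y), sorted by time.

    Yields one frame per frame period; each frame counts how often each
    pixel address occurred during that period (a 2D histogram of AEs).
    """
    frame = np.zeros((height, width), dtype=np.uint16)
    frame_end = None
    for t, x, y in events:
        if frame_end is None:
            frame_end = t + frame_period
        while t >= frame_end:               # emit (possibly empty) frames
            yield frame
            frame = np.zeros((height, width), dtype=np.uint16)
            frame_end += frame_period
        frame[y, x] += 1
    yield frame                             # last, possibly partial, frame
```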

13.1.2   Sequencing AER Events

By sequencing AER events, we mean ‘playing’ a stream of AEs out onto an AER bus. A sequencer therefore allows events from a host PC to be sent out to the attached AER receiver. These events may for example represent a pre-computed, buffered stimulus pattern, or they might come from a previously monitored and stored AE stream, but they might also be the result of a real-time computation. In order for the individual AEs to be ‘played’ out onto an AER bus at the correct moments, the AE data delivered to the sequencer needs to include some form of timing information, typically in the form of words defining the ISIs between the address words. These addresses and ISIs are then buffered in a FIFO, with the sequencer simply waiting the indicated length of time when it reads an ISI word, and outputting the AE when it reads an AE word. (Note that, as described for monitoring in Section 13.1.1, these ‘ISIs’ are not true ISIs in the biological sense because the interval to which they refer is an interval between the transmission of any two addresses which represent spikes on the same bus, and not usually between two identical addresses representing the spikes of the same neuron.)

While monitors usually work with absolute time-stamps, sequencers often use relative ‘ISI’ values (as described in Section 13.1.1 under the heading of Inter-Spike Intervals and Time-Stamps) to determine the timing of the address-events they emit. In this case, the output of a monitor is not suitable to be the direct input to a sequencer. Given that relative ISIs are used, there is no concern with the time-stamp wrapping round, but it may not be possible to represent a required delay within the number of bits provided to represent the ISI. Therefore a sequencer operating with relative ISIs should be prepared to accept and honor delays specified by a consecutive sequence of multiple ISI delay words.
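
A minimal encoder sketch is shown below, assuming a hypothetical format in which an ISI word can hold at most 16 bits of delay; longer gaps are split across consecutive ISI words as described above.

```python
MAX_ISI = 0xFFFF   # hypothetical maximum delay one ISI word can carry

def encode_for_sequencer(abs_events):
    """abs_events: list of (absolute_timestamp, address), sorted by time.

    Produces an interleaved list of ('isi', delay) and ('addr', address)
    words; delays too long for one ISI word are split across several.
    """
    words, t_prev = [], 0
    for t, addr in abs_events:
        delay = t - t_prev
        while delay > MAX_ISI:              # honor long gaps with several ISI words
            words.append(('isi', MAX_ISI))
            delay -= MAX_ISI
        if delay > 0:
            words.append(('isi', delay))
        words.append(('addr', addr))
        t_prev = t
    return words
```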

Choice of time-base is also of relevance for sequencing, and the same arguments apply as for monitoring (cf. Section 13.1.1). Sequencer functionality, like monitor functionality, is also usually implemented using an FPGA. And as with monitors, it makes sense to construct sequencers that are able to output AE words which have a width of 2ⁿ 8-bit bytes.

Discounted Delay

If the acknowledge from the receiver is delayed, the delay can be subtracted from the time to wait before transmitting the next event. If the result of the subtraction is negative, there is no wait before transmitting the next event. With this treatment, described in Paz-Vicente et al. (2006) as a ‘discounted’ delay, the delay between events is no longer strictly relative to the earlier one, and a limited delay in receiving an ACK for one event will not propagate to cause a delay of all successive events.
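
In sketch form, the discounting amounts to the following; this is an illustration of the idea, not the cited implementation.

```python
def next_wait(nominal_isi, ack_delay):
    """Time to wait before the next event, after an acknowledge that arrived
    ack_delay late. Negative results are clamped to zero, so a single slow
    acknowledge does not propagate a delay onto all subsequent events."""
    return max(0, nominal_isi - ack_delay)
```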

First In, First Out

Analogous to monitoring, but operating in the opposite direction, a FIFO is necessary to decouple the buffer-wise operations of the host writing the events into the sequencer from the asynchronous, carefully timed transmission of these events onto the bus. In the ideal case, the FIFO will be written to often enough that it never empties, or underruns. If it does so, the interval between the last spike to be emitted before the underrun and the next spike will be much longer than intended, with potentially bad consequences for the neuromorphic computation being performed by the receiving system. A mechanism is usually provided to alert the user if and when such an underrun does occur. To avoid underruns, data must always be ready to write to the FIFO whenever it signals, for example with a half-empty interrupt, that more data should be supplied.

Frame to AER Conversion

Sometimes it is desirable to produce AER output from a sequencer based on a frame-based image representation; the conversion from frames to AER can be delegated to the sequencer hardware. Particularly if the same frame is to be ‘presented’ to the receiver attached to the sequencer for some time, this may off-load the host CPU and the bus to the sequencer, since the frame need only be handed off to the sequencer once.

Implementing AER to frame conversion is a relatively simple task. Producing AER from a frame representation is not quite so trivial and several conversion methods have been proposed (Linares-Barranco et al. 2006). The number of events for a specific pixel should depend on its associated gray level and those events should be evenly distributed in time. The normalized mean time from the theoretical pixel timing to the pixel timing resulting from the various methods is an important comparison criterion. In Linares-Barranco et al. (2006) it is shown that, in most circumstances, the behavior of the various methods is similar and, thus, hardware implementation complexity is an important selection criterion. From the hardware implementation point of view, random, exhaustive, and uniform methods are especially attractive. Frame to AER conversion was implemented in, for example, the CAVIAR USB-AER board (Paz et al. 2005).
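
As an illustration of the underlying idea, the following naive sketch spreads each pixel's events evenly over the frame period; it is not one of the specific methods compared in Linares-Barranco et al. (2006), and the gray-level scaling is an assumption made for the example.

```python
def frame_to_events(frame, frame_period_us, max_events_per_pixel=255):
    """frame: 2D array (list of rows) of gray levels in [0, 255].

    Each pixel emits a number of events proportional to its gray level,
    spread as evenly as possible over the frame period; the per-pixel event
    streams are then interleaved in time order.
    """
    events = []
    for y, row in enumerate(frame):
        for x, gray in enumerate(row):
            n = int(gray * max_events_per_pixel / 255)
            for k in range(n):
                t = int((k + 0.5) * frame_period_us / n)  # evenly spaced in time
                events.append((t, x, y))
    events.sort(key=lambda e: e[0])          # interleave pixels by time
    return events
```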

13.1.3   Mapping AER Events

Mapping performs address translation between AE source and destination address spaces, or the fan-out of single input events from AE sources analogous to axons to multiple destination addresses analogous to synapses. To perform the former function, a mapper operates in a one-to-one mode. The latter is a one-to-many mode. In either case the architecture of a mapper remains the same. There is usually a FIFO, an FPGA, and a block of RAM which holds a look-up table. Incoming AEs are first buffered in the FIFO such that they are not lost while the mapper is processing preceding events. When the FPGA reads an AE from the FIFO, it uses the AE as an address in the look-up table and reads from that address. If operating in a one-to-one mode, the data read on this first look-up operation determines the new outgoing AE. In one-to-many mode, there are two possibilities, fixed-length destination address lists and variable length destination address lists, see Figure 13.2.

Fixed vs. Variable Length Destination Lists

In the case of fixed length destination lists, for each source address there is a certain fixed amount of mapper memory available for the associated destination addresses, Fmax words. Of course it is still possible to arrange to map source addresses to fewer than Fmax words, but Fmax is the maximum fan-out of the mapper. If Fmax is small, this may severely limit the flexibility of the mapper in comparison with mappers using variable length address lists and may also waste considerable memory if the fan-out required for most source addresses is less than Fmax. However, fixed length destination list mappers are easier to construct and to program than variable length destination address list mappers: the address in the mapper memory of the block of destination addresses corresponding to a given source address can be trivially calculated from the source address, for example by a simple left shift if Fmax is chosen to be a power of two. Examples of fixed length destination list mappers can be found in Paz et al. (2005) and Fasnacht et al. (2008).

images

Figure 13.2  Fixed-length destination list mapping (a) versus variable length destination list mapping (b). In both cases the source address si is treated as a pointer into a look-up table (LUT). In (a) the destination addresses di,0, di,1, di,2 are found directly as a result of the first look-up. Each list of destination addresses is stored in a fixed length block, the length of which determines the maximum fan-out Fmax (here Fmax = 8). In (b), an extra level of indirection (pointer lookup) is needed through the pointer pi to find the actual destination addresses. There need not be gaps between the blocks of destination addresses, and the blocks may be arbitrarily long and stored in any order as required for memory management purposes. In both (a) and (b), S is a sentinel value used to mark the end of a block (not required in the case of (a) in blocks in which all Fmax entries are used)

Variable length destination list mappers have no per-source address limit on fan-out other than that imposed by the total memory available. They are therefore more flexible than mappers using fixed length destination lists and can use the available memory more efficiently, since destination lists of various lengths can be packed contiguously in memory. However, this makes them a little more complex to construct and program, particularly if dynamic rewriting of the destination lists is required. A two-stage lookup is now required, as a simple calculation of the address of the destination list is no longer possible. The first lookup of the source address now yields not a destination address directly, but rather a pointer back into the mapper memory to where the destination list can be found. Following these pointers requires more clock cycles than are required in fixed length destination list mappers. Furthermore, if it is required that the destination lists can be dynamically changed, for example, individual destination addresses added (analogous to the formation of new synapses) to facilitate some form of learning, while the mapper is in operation and thus without rewriting the whole mapper memory, some relatively sophisticated memory management routines are required to move the address lists around in the memory to make room for new entries. Variable length destination list mapping can be found in the SCX project (Deiss et al. 1999) and the Rome PCI-AER board (Dante et al. 2005).

Sentinel Values or List Lengths

With variable length destination lists (and with fixed maximum length destination lists when the number of destinations n is less than the maximum) a so-called sentinel value S is required to terminate the list, which then takes the form d1, d2, … , dn, S. The value S must be chosen so that it is not a member of the set of valid destination addresses; equivalently formulated, the value chosen for S can never be used as a destination address. Commonly, a binary all-ones or all-zeros word is chosen, meaning that, for example, in a 16-bit system the address 65 535 or 0, respectively, cannot be used. Alternatively, the length of the list can be stored explicitly at its head, thus n, d1, d2, … , dn, in which case no sentinel value needs to be reserved.
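
The two lookup schemes of Figure 13.2 can be sketched as follows; the sentinel value, the maximum fan-out, and the memory layout are illustrative choices, and real mappers implement this in an FPGA against SRAM rather than in Python.

```python
SENTINEL = 0xFFFF            # hypothetical: an all-ones word marks the end of a list
F_MAX = 8                    # maximum fan-out of the fixed-length scheme

def lookup_fixed(lut, src):
    """Fixed-length blocks: the block address is a simple shift of src."""
    base = src << 3          # src * F_MAX, F_MAX being a power of two
    dests = []
    for i in range(F_MAX):
        d = lut[base + i]
        if d == SENTINEL:    # partially used block ends early
            break
        dests.append(d)
    return dests

def lookup_variable(lut, src):
    """Variable-length lists: first read a pointer, then follow it."""
    ptr = lut[src]           # first lookup yields a pointer, not an address
    dests = []
    while lut[ptr] != SENTINEL:
        dests.append(lut[ptr])
        ptr += 1
    return dests
```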

Writing to the Mapping Table

Obviously there must be provision to write to the memory forming a mapper lookup table to set up the mapping table in the first place, and possibly to rewrite entries dynamically while the mapper is running. If the memory is attached to an FPGA that performs the mapping, this FPGA must also support writing to the memory from a host, for example, over a USB connection. An alternative would be the use of dual-ported RAM (DPRAM) where the host writes on one port and the mapping FPGA reads on the other, but DPRAM is relatively scarce and expensive and usually less easy to interface to than the more commonly used static RAM (SRAM). SRAMs have very fast access times and high bandwidth, and are typically used for cache memory.

If the mapping table needs to be written to on the fly while the mapper is running, care must be taken to ensure that the device (e.g., the FPGA) doing the mapping only ever sees consistent mapping tables, particularly in the case where a sentinel value is used to mark the end of the list of destination addresses, since a missed sentinel will typically cause the device to go on emitting whatever the LUT memory contains until it happens to encounter another sentinel value.

Algorithmic Mapping

Up to now we have discussed only LUT-based mappers. It is also possible to perform so-called algorithmic mapping in which the destination addresses emitted are generated by applying some algorithm to the source address. This algorithm can be as simple as adding an offset to the source address. More interestingly, one might form projective fields (Lehky and Sejnowski 1988) by specifying for instance that the destination addresses corresponding to source address s are given by f(s), f(s) + 1, … , f(s) + w − 1, where f(s) is calculated by the mapper for each s as it arrives and w is the width of the projective field. A kind of hybrid between table look-up and algorithmic mapping to produce projective fields was proposed (Whatley 1997), though not finally implemented, for the SCX project (Deiss et al. 1999), in which the mapping was performed by a DSP rather than an FPGA.
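
A sketch of such an algorithmic projective-field mapping follows; the particular choice of f and the placement of the field are illustrative.

```python
def projective_field(src, f, w):
    """Destination addresses f(s), f(s)+1, ..., f(s)+w-1, computed on the fly
    for each incoming source address, with no look-up table at all."""
    base = f(src)
    return [base + k for k in range(w)]

# e.g. a simple offset mapping with a projective field three synapses wide:
dests = projective_field(42, lambda s: s + 1000, 3)   # [1042, 1043, 1044]
```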

Provision of Mapping Memory

Mapping, in contrast to monitoring and sequencing, can be carried out by a stand-alone device, that is in the absence of a host, once the mapping tables have been set up. To facilitate true stand-alone operation, the CAVIAR USB-AER board permitted its initial configuration, including mapping tables, to be loaded from nonvolatile storage in the form of an MMC/SD card as used in digital cameras (Paz-Vicente et al. 2006).

At the other extreme, in order to store mapping tables up to 2 GB in size in affordable, fast memory, Fasnacht and Indiveri (2011) took the approach of using a PC to provide the memory. Affordable FPGA systems do not have large amounts of RAM with high bandwidth and low access latency. On the other hand, in PC hardware, gigabytes of fast DRAM are readily available at prices which are orders of magnitude lower than memory with equivalent specifications on FPGA development platforms. Therefore, PC hardware and FPGAs are combined using the PCI interface. The actual mapper device is an FPGA on a PCI board plugged into a PC motherboard with 4 GB of RAM, of which 2 GB are reserved to store the mapping table, see Figure 13.6.

Repetition Counts, Probabilistic Mapping, and Delays

The look-up table in an LUT-based mapper need not only hold destination addresses. It can also encode repetition counts if it is desired to emit some addresses multiple times for the same source address, transmission probabilities (Fasnacht et al. 2008; Paz-Vicente et al. 2008) to mimic synaptic release probabilities, and transmission delays (Linares-Barranco et al. 2009a) to mimic axonal conduction delays. Any or all of these properties can be incorporated by extending the words stored in the LUT, so that only some of the bits represent the destination address with further bitfields being added for the property or properties in question.

Transmission probabilities are implemented by drawing a number from a pseudo-random number generator (PRNG) for each destination address to be processed. If the number drawn is greater than the probability value for the current destination address, then the address is discarded. Otherwise, the destination address is output in the usual way.
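
In sketch form (the layout of the per-destination probability field and its scaling to the range 0 to 1 are assumptions made for the example):

```python
import random

def probabilistic_fanout(dest_list):
    """dest_list: list of (address, probability) pairs taken from the LUT.

    Each destination is emitted only if a freshly drawn pseudo-random number
    does not exceed its stored transmission probability; otherwise it is
    discarded, mimicking a synaptic release probability.
    """
    out = []
    for addr, p in dest_list:
        if random.random() <= p:     # drop the event with probability 1 - p
            out.append(addr)
    return out
```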

Transmission delays are implemented similarly to the way a sequencer handles inter-spike intervals (see Section 13.1.2). However, if different events require different delays, some events which result from a mapper look-up operation will need to be transmitted before events from previous mapper look-up operations. Thus the mapper output can no longer be handled as a simple FIFO queue, but must be kept sorted in order of desired transmission time.
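
A minimal sketch of keeping delayed output events sorted by desired transmission time, using Python's heapq to stand in for whatever priority structure the hardware actually uses:

```python
import heapq

class DelayedOutput:
    """Keeps mapped events ordered by their desired transmission time."""

    def __init__(self):
        self._queue = []                     # heap of (send_time, address)

    def schedule(self, now, address, delay):
        """Queue one mapped event to be sent 'delay' time units from now."""
        heapq.heappush(self._queue, (now + delay, address))

    def due(self, now):
        """Pop and return all events whose transmission time has arrived."""
        out = []
        while self._queue and self._queue[0][0] <= now:
            out.append(heapq.heappop(self._queue)[1])
        return out
```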

13.2   Hardware Infrastructure Boards for Small Systems

13.2.1   Silicon Cortex

The first attempt to build a board that could handle multiple senders and receivers was the ‘Silicon Cortex’ (SCX) board (Deiss et al. 1999). This board implemented an AE communication infrastructure that could be used for interchip communication in simple neuromorphic systems. The SCX framework was designed to be a flexible prototyping system, providing re-programmable connectivity among on the order of 10⁴ computational nodes spread across multiple chips on a single board, or more across multiple boards. The SCX was implemented as a VME-bus card, designed and fabricated by Applied Neurodynamics of Encinitas, California, USA. Each SCX board could support up to six chips or other AE sources, and multiple boards could be linked together in arbitrary topologies to form larger systems.

The SCX was devised to test and refine solutions to several fundamental problems encountered in building systems of analog chips that use the AE representation: coordinating the activity of multiple sender/receiver chips on a common bus; providing a method of building a distributed network of local buses sufficient to build an indefinitely large system; providing a software-programmable facility for translating AEs that enables the user to configure arbitrary connections between neurons; providing extensive digital interfacing opportunities; and providing ‘life-support’ for custom analog chips by maintaining volatile analog parameters or programming analog nonvolatile storage.

This board was an ambitious project at the time. There were two sockets to accommodate custom neuron chips and a daughterboard connector. Daughterboards could contain up to four elements that need to talk on the local address-event bus (LAEB), for example, four additional custom neuron chips, or an interface to peripheral sensory devices, such as retinas.

Communication among all of the chips in the system took place on three address-event buses (AEBs). The LAEB, used for intra-board communication, employed a variant of the usual AER protocol as described in Section 2.4.2. Details of the LAEB protocol are described in Deiss (1994). The domain AEBs (DAEBs) used for inter-board communication used a fast synchronous broadcast protocol.

Communication between the chips on the SCX board (and any daughterboard) took place via the LAEB. The occurrence of an event on any chip was arbitrated within that chip, and led to a request from the chip to the bus arbiter. The bus arbiter determined which chip would have control of the LAEB at each cycle, and that chip broadcast an AE onto the bus. These events could be read by all chips attached to the bus. In particular, the bus was monitored by a DSP chip, which could route the AEs to a number of different destinations. For example, the DSP chip could use a look-up table to translate a source AE into many destination AEs.

Because of the bulkiness of the SCX solution and restrictions on the directly supported chip-pinouts, designers of multichip systems resorted to simpler independent boards for their needs.

13.2.2   Centralized Communication

The Rome PCI-AER board designed by Vittorio Dante of the Istituto Superiore di Sanità in Rome (Dante et al. 2005) attempts to provide a more flexible partitioning of functionality than the SCX board by separating the neuromorphic chips from the hardware infrastructure in a centralized hardware system. It consists of a 33 MHz, 32-bit, 5 V PCI bus add-in card and a header board which can be conveniently located on the bench-top and provides connectors for up to four AER senders and four AER receivers. Senders must use the so-called ‘SCX’ multisender AER protocol (Deiss 1994) described in Section 2.4.2, in which request and acknowledge signals are active low and the bus may only be driven while the acknowledge signal is active. Receivers may use either this ‘SCX’ protocol, or they may choose to use a point-to-point protocol (AER 1993) in which request and acknowledge are active high and the bus is driven while request is active.

The PCI-AER board performs the three functions executed by monitor, sequencer, and mapper blocks. The monitor can capture and time-stamp events coming from the attached AER senders via an arbiter and makes those events available to the PC for storage or further on-line processing. When an incoming AE is read by the monitor, a time-stamp is stored along with the address in a FIFO. This FIFO decouples the management of the incoming AEs from read operations on the PCI bus, the bandwidth of which must be shared with other peripherals in the PC such as the network card. Interrupts to the host PC can be generated when the FIFO becomes half-full and/or full, and in the ideal case, the driver will read time-stamped AEs from the monitor FIFO whenever the host CPU receives a FIFO half-full interrupt, at a rate sufficient that the FIFO never fills or overruns, given the rate of incoming AEs. If the CPU fails to empty the FIFO at a sufficient rate, the FIFO will fill up and a FIFO full interrupt will be generated. At this point, incoming events will be lost until such time as the CPU can once again read from the FIFO.

The sequencer allows events originated by the host PC to be sent out to the attached AER receivers. These events may, for example, represent a pre-computed, buffered stimulus pattern, but they might also be the result of a real-time computation. Like the monitor, the sequencer is decoupled from the PCI bus using an 8 KWord FIFO. The host writes a sequence of words representing addresses and time delays to the sequencer FIFO. The sequencer then reads these words one at a time from the FIFO and either emits an AE or waits the indicated number of microseconds. FIFO half-empty interrupts can be generated to signal the CPU to supply further data to the sequencer. If the CPU fails to supply data to the sequencer at a rate sufficient to prevent the sequencer FIFO becoming empty, this may indicate a failure of the system to generate the desired sequence of events with the desired timing. In this case a sequencer FIFO empty interrupt is raised to signal the underrun. The address events generated by the sequencer pass through the mapper and can therefore be transmitted on any of the four output channels.

The mapper implements programmable inter-chip synaptic connectivity. It maps incoming AEs from attached AER senders and/or the sequencer to one or more outgoing addresses for transmission to the attached AER receivers. It can operate in pass-through, one-to-one, or one-to-many modes. It uses 2 MWords of on-board SRAM. The mapper too has a FIFO which decouples the asynchronous reception of the incoming AEs from the generation of outgoing AEs. Should this FIFO become full, it is possible that events will be lost, and this eventuality can be signaled to the CPU by means of an interrupt. The mapper, once it has been configured and the look-up table has been filled with the required mappings, operates entirely independently of the PCI bus and host CPU, since all of the necessary operations, including table look-up, are performed by one of the FPGAs. A detailed description of the hardware (the hardware User Manual) is available online (Dante 2004).

Other laboratories have chosen a solution targeted to their setup. One such system which also used a centralized communication block is the OR + IFAT system shown in Figure 13.8b. This system is described further in Section 13.3.1. The main module is the FPGA, which controls two AER buses: one internal bus for events sent to and from the IFAT multineuron chip and one external bus for events sent to and from the octopus retina or a computer (CPU). Similar to many other AER systems of this size, the FPGA implements the mapping function needed to create the receptive fields of the network desired of the system.

The FPGA was also used to control the parameters of the neurons and synapses on chip, in particular, it was used to implement a different type of synapse for every virtual connection between neurons and to extend the range of synaptic weights through two dynamically controlled external parameters: the probability of sending an event and the number of post-synaptic events sent for every pre-synaptic event (Vogelstein et al. 2007a). The FPGA also implements a one-to-many mapping function for divergent connectivity (Vogelstein et al. 2007b). It uses a flat address space, that is, every targeted and sending unit has a unique address. This mapping solution is used on many early two-chip systems and has the limitation that every spike event produced by the chips has to go through a single mapper on the FPGA.

13.2.3   Composable Architecture Solution

Another communication infrastructure solution is to use multiple standalone boards between components of a system. The boards can be programmed for various functionalities and composed into systems somewhat analogously to commands in the Unix shell pipe mechanism. This solution was used in the CAVIAR system (see Section 13.3.3). In this way, boards can be added as needed between two chips that need to communicate spikes between each other. These boards remove the need for a PC to host the coordination of the flow of spikes from multiple AER chips and the communication between chips can be done in a distributed manner.

Various boards with different specifications were constructed as part of the CAVIAR project. Each of these boards supports one or more of the basic functionalities like mapping and monitoring. Because of the necessity to implement a hierarchical multilayered system modeled after the visual cortex, simple boards were also developed to merge spike streams from multiple chips onto one chip and to split streams from a single chip to multiple chips. In this scheme, local mappers can be implemented on the standalone boards, so that local networks can be established. The address space in this routing scheme is local and the input AEs are broadcast only to the target modules defined by the mapper on the board connecting the modules.

The five boards developed in CAVIAR had specifications which varied in the amount of functionality, the I/O port of choice, the port bandwidth, and the compactness of the board. The first board was a PCI-AER interfacing PCB capable of sequencing timed AER events out from a computer or, vice versa, capturing and time-stamping events from an AER bus and storing them in computer memory. The second board was a USB-AER board. The third board was the Switch-AER PCB based on a simple CPLD; it could split one AER bus into two, three, or four buses or, vice versa, merge two, three, or four buses into a single bus. The fourth board, the mini-USB-AER board, had lower performance but used a more compact bus-powered USB interface and was particularly useful for portable demonstrations. The last board, the USBAERmini2 board, is a bus-powered USB monitor-sequencer board.

The CAVIAR PCI-AER (Paz-Vicente et al. 2006) and the USBAERmini2 (Berner et al. 2007) boards can perform AE sequencing and monitoring functions but have no hardware mapping capabilities. Although host-based software-driven mapping is feasible, when it is necessary to build AER systems that can operate without requiring a computer, a specific device for this purpose is needed.

CAVIAR PCI-AER

The CAVIAR PCI-AER tool (Paz-Vicente et al. 2006) is a PCI interface board that allows not only reading an AER stream into a computer memory and displaying it on screen in real time, but also the opposite: from images available in the computer’s memory, it can generate a synthetic AER stream much as a dedicated VLSI AER transmitter chip would (Boahen 1998; Mahowald 1992; Sivilotti 1991).

PCI Interface Design Considerations

Before the development of the CAVIAR PCI-AER interface, the only available PCI-AER interface board was that described in Dante et al. (2005), see Section 13.2.2. This board encompasses all the requirements mentioned for AER sequencing, mapping, and monitoring. However, its performance is limited to approximately 1 Meps. In realistic experiments, software overheads may reduce this value even further. In many cases these values are acceptable, but many AE chips can produce (or accept) much higher spike rates.

When the development of the CAVIAR PCI-AER board was started, using 64-bit, 66 MHz PCI seemed to be an interesting alternative as computers with this type of bus were popular in the server market. When implementation decisions had to be made, however, the situation had already altered significantly. Machines with extended PCI buses had almost disappeared and, on the other hand, serial LVDS-based PCI Express (PCIe) was emerging clearly as the future standard, but almost no commercial implementations were on the market. Therefore, the most feasible solution at the time was to stay with the common PCI implementation (32-bit, 33 MHz).

The previously available Rome PCI-AER board uses polled and/or interrupt driven I/O to transfer data to and from the board, but does not do DMA or bus mastering. This is the main factor limiting its performance. The design and implementation of the CAVIAR PCI-AER board was developed to include bus mastering, and hardware-based frame to AER conversion (Paz-Vicente et al. 2006). The CAVIAR PCI-AER board can perform sequencing in parallel with monitoring.

The physical implementation is in VHDL for an FPGA. Most of the functionality demanded by the users could be supported by the larger devices in the relatively inexpensive Spartan-II family. The connection to the PCI bus is via a VHDL bridge (Paz 2003) that manages the protocol of the PCI bus, decodes the accesses to the PCI base addresses, and supports bus mastering and interrupts. A Windows driver, an API that allows the bus-mastering capability of the board to be exploited, and a MATLAB interface were developed.

CAVIAR USB-AER

The CAVIAR USB-AER board permitted stand-alone operation and has several functionalities including hardware mapping. This stand-alone operating mode requires that the user can load the FPGA configuration and the mapper RAM from some form of nonvolatile storage medium that can be easily modified by users. The MMC/SD cards used in digital cameras provide an attractive possibility. However, users can also load the board directly from a computer via USB.

Many AER researchers would like to demonstrate their systems using instruments that can be easily interfaced to a laptop computer. This requirement can also be supported via USB. Thus the board can be used also as a sequencer or a monitor. The CAVIAR USB-AER board used USB 1.1 and was therefore subject to the bandwidth limitations of full speed USB (12 Mbit/s). Nevertheless, thanks to its 2 MB of on-board SRAM, the board could be used as a data logger to store up to 512 K events (including time-stamps) in memory for later, off-line processing. Similarly it could also be used to replay a sequence of computationally generated events to an AER system at up to 10 Meps.

The CAVIAR USB-AER board has also been modified to implement probabilistic mappings (Paz-Vicente et al. 2008). The same board was also used to implement configurable delays (Linares-Barranco et al. 2009a) and frame to AER conversion.

Interface Features and Architecture

The board can serve as different functional devices according to the VHDL code that is loaded into the FPGA either from an MMC/SD card or from the USB bus. These functionalities are: sequencing – both frame to AER conversion or sequencing pre-stored events with time-stamps from memory; monitoring – both conventional monitoring of a sequence of events with timestamps and AER to frame conversion; and mapping – one-to-one, one-to-many, with optional output event repetition, probability, and programmable delay. The precise functionality implemented at any one time depends only on the VHDL code which is synthesized and loaded into the FPGA as firmware. A simple architecture is the common basis for all the mapper-related functionalities. This architecture is shown in Figure 13.3.

USBAERmini2

The USBAERmini2 (Figure 13.4) is a high-speed USB 2.0 AER interface that allows simultaneous monitoring and sequencing of precisely timed AER data. This low-cost (<$100), two-chip, bus-powered interface can achieve sustained AER event rates of 5 Meps. Several boards can be synchronized, allowing simultaneous synchronized capture from multiple devices. It has three AER ports, one for sequencing, one for monitoring, and one for passing through the monitored events.


Figure 13.3  USB-AER mapper block diagram of base architecture. Adapted from Linares-Barranco et al. (2009b). With kind permission from Springer Science and Business Media.

None of a number of earlier generations of AER-computer interfaces (Dante et al. 2005; Deiss et al. 1999; Gomez-Rodriguez et al. 2006; Merolla et al. 2005; Paz-Vicente et al. 2006) had the convenience of high-speed, bus-powered USB combined with the synchronous monitoring of AER traffic. For the CAVIAR project (CAVIAR 2002; Serrano-Gotarredona et al. 2005), a heterogeneous mixture of AER chips was assembled into a visual system, and a device was required that could record from all parts of the system simultaneously, was simple to operate and use, was easy and cheap to build, and could be reused in other contexts. It was also desired to have a convenient device that could sequence recorded or synthesized events into AER chips to allow their characterization.


Figure 13.4  USBAERmini2 board. © 2009 IEEE. Reprinted, with permission, from Serrano-Gotarredona et al. (2009)

The functionality of the USBAERmini2 board includes sequencing and monitoring capability. The number of jumpers is minimized by the auto-detection of connected devices. A user connects a sending AER device to the monitor port and plugs the board into a USB port on a computer. They are then ready to capture time-stamped AER traffic from the AER device.

The board uses two main components: a Cypress FX2LP USB 2.0 transceiver and a Xilinx Coolrunner 2 CPLD. The board was designed to be cheap to manufacture. Ten boards were hand-assembled by two people in three days. The components cost less than $100 per board, including the 2-layer PCB.

The Cypress FX2LP is a USB 2.0 transceiver with an enhanced 8051 microcontroller and a flexible interface to its 4 KB of FIFO buffers, which are committed automatically to/from the USB by the Cypress serial interface engine (SIE). As in Merolla et al. (2005), the FX2LP is used in its slave FIFO mode, which means that the device handles all of the low level USB protocol in hardware. USB bulk transfers are used. Two FIFOs (one for each direction) are configured to be quad buffered and hold 128 events each. To the CPLD, the FX2LP appears as a FIFO source or sink of AER data.

The microcontroller part of the FX2LP controls the communication with the host computer. This involves the setup of the USB communication and the interpretation of commands that are received from the host to the control endpoint (such as start capturing, stop capturing, or zero time-stamps).

The CPLD handshakes with the AER sender and receiver, and records a 16-bit time-stamp whenever an event is received. Addresses and time-stamps are written to or read from the FIFOs in the FX2LP.

The CPLD uses a 16-bit counter for the generation of time-stamps. These time-stamps are unwrapped to 32 bits on the host computer. The time-stamp tick period can be controlled from the host computer. Two time-stamp modes can be used, either a tick of 1 μs or a tick of one clock cycle, which is 33.3 ns. When the time-stamp counter overflows, a special time-stamp-wrapped event is sent to the host to tell it that the time-stamp wrap counter has to be incremented. Explicit time-stamps are sent for each event in order to preserve precise event timing information.

Synchronization

Electrical synchronization enables time-stamp-synchronized capture from multiple AER devices. A USBAERmini2 in ‘slave’ mode can be synchronized to a timing ‘master’, which generally is another USBAERmini2. The slave device’s time-stamp counter can then be clocked by the master. As soon as the synchronization state machine detects a time-stamp master at the synchronization input (i.e., the sync input goes high), it resets the time-stamp, signals to the host by sending a USB control transfer message, and changes to slave mode. The host resets its time-stamp wrap counter so that all slaves have the same absolute time-stamp as the master.

Performance

The peak limit on monitoring and sequencing performance is set by the number of clock cycles used by the state machines for a handshake cycle. The monitor state machine requires five clock cycles and the sequencer requires eight clock cycles. The CPLD clock frequency is 30 MHz, so the peak event rate for monitoring is 6 Meps and for sequencing 3.75 Meps.

The measured performance depends on the number and size of host memory buffers. For high frame-rate visualization at low event rate the host buffer size can be set to the USB transaction size of 512 bytes; for highest performance, the buffers need to be at least 4 KB. With a buffer size of 8 KB and four buffers, an event rate of 5 Meps was achieved when capturing events from one board, which is 83% of the limit defined by handshake timing. USB 2.0 high speed mode has a data rate of 480 Mbit/s. Because each event consists of four bytes, 5 Meps is equivalent to 160 Mbit/s. Monitoring with two or three interfaces simultaneously on one computer using the same USB host controller achieves a total rate of 6 Meps, that is 192 Mbit/s, about half the basic data bandwidth of USB 2.0.

The USBAERmini2 board was successfully used to monitor and log 40 K neurons spread over four different AER chips (Serrano-Gotarredona et al. 2005) and was in regular use in three different laboratories. The complete design of this board along with software (excluding device driver) has been open-sourced (jAER Hardware Reference n.d.).

SimpleMonitorUSBXPress

An even simpler, monitoring-only USB interface called the SimpleMonitorUSBXPress (Figure 13.5) was also developed during the CAVIAR project by Delbrück (2007) aiming at applications where simplicity, small size, low cost, and above all portability are important. Although it can only capture events at rates of around 50–100 keps, it is probably the world’s smallest and most portable USB AER interface!

images

Figure 13.5  Two versions of the SimpleMonitorUSBXPress PCB. Taken from Delbrück (2007)

13.2.4   Daisy-Chain Architecture

The next architecture uses a daisy-chain scheme in which each device communicates with only one other chip, and events are communicated through point-to-point AER links instead of using a central router which connects a number of transmitters and receivers. This communication infrastructure was used on the ORISYS system, which consisted of a retina and several Gabor chips programmed to detect different orientations (Choi et al. 2005; see also Section 13.3.2).

This daisy-chain architecture avoids the need for complex circuits that control bus access and perform routing. Instead, the ORISYS system uses two basic routing functions, the split and the merge, which are included on each Gabor chip and which determine the spike routing by the way the chips are linked together. Using this approach, the routing circuit complexity scales automatically with the number of chips in the system. In addition, unique chip addresses are generated and updated automatically by the split and merge circuits, so that spikes from different chips can be distinguished in one AER address stream. Unlike the previous systems, which used various boards to implement the splitter and merger circuits, ORISYS implements these circuits directly on-chip and includes an on-chip RAM for internally mapping the addresses.

The split circuit makes two copies of the AER events appearing at its input. One copy is sent into the neuron array through a decoder that reads the originating address of each spike and sends a spike to the neuron with the same row and column address, irrespective of the chip address. The other copy has its chip address incremented by one and is sent off chip via a transmitter. By daisy chaining the Gabor chips through the connection of the split output of one with the split input of the next, the designers can distribute the same silicon retina spike output to all chips.

The merge circuits combine the activity of all chips in the system, by daisy chaining the merge output of one chip with the merge input of the next chip. The merge output of each chip encodes all of the spike activity in the chips up to that point in the chain. Because it increments the chip address of input events, the merge circuit makes it possible to distinguish spikes originating in different chips.
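A toy software model of the split and merge functions makes the chip-address bookkeeping explicit. The (chip, row, column) event format and the convention that locally generated spikes leave with chip address 0 are assumptions made for this sketch; the real circuits operate directly on the AER address bits.

```python
# Toy model of the ORISYS daisy-chain split and merge routing functions.
def split(event):
    chip, row, col = event
    to_array   = (row, col)               # decoded into the local neuron array,
                                          # ignoring the chip address
    downstream = (chip + 1, row, col)     # forwarded copy, chip address + 1
    return to_array, downstream

def merge(local_spikes, upstream_events):
    # Local spikes enter the chain with chip address 0 (an assumed convention);
    # events arriving from upstream have their chip address incremented, so
    # spikes from different chips remain distinguishable in one AER stream.
    out = [(0, r, c) for (r, c) in local_spikes]
    out += [(chip + 1, r, c) for (chip, r, c) in upstream_events]
    return out
```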

13.2.5   Interfacing Boards using Serial AER

Newer interfacing boards replace parallel AER links with faster serial links (based on, e.g., standardized SATA cables and connectors). Example PCBs which implement a serial communication protocol as described in Section 2.4.5 include the AEX board and the MeshGrid AER FPGA board. The SATA-like serial interfaces can, for example, be driven by the 2.5 Gbps Rocket I/O interfaces available within the Spartan-6 FPGAs used on the MeshGrid FPGA board.

The AEX Board4

The AEX board (Fasnacht et al. 2008) consists of three interface sections: a parallel AER section for connecting neuromorphic chips, a USB 2.0 interface for monitoring and sequencing on a PC, and a Serial AER (SAER) section for interfacing to other AEX boards or other boards with an SAER interface. An FPGA is used to route data between these interfaces.

The serial AER approach employed here differs from previously proposed solutions (e.g., in Berge and Häfliger 2007) in several respects. Instead of using a high-end FPGA natively supporting serial IO standards, the design uses a low-cost Xilinx Spartan FPGA plus a dedicated SerDes chip, an approach which had previously been explored by Miró-Amarante et al. (2006). Such a Serializer–Deserializer receives data locally on a parallel bus and sends it over a serial output at a multiple of the parallel interface speed, and vice versa for the serial receive path. Using such a SerDes chip allows higher event rates to be obtained at significantly lower cost: the FPGA and SerDes used cost about a third as much as the cheapest Xilinx Virtex-II Pro series FPGA necessary to implement a system such as that described in Berge and Häfliger (2007). Using the AEX board, event rates about three to four times faster than reported in Berge and Häfliger (2007) were achieved. In their approach the receiver simply drops events if it is not ready to receive them. On the AEX board, a flow control scheme is implemented that ensures that all events reach their destination: if the receiver is currently unable to accept an event because it does not have the necessary receive buffer space available, it can tell the sender to stop until space is available. Finally, the FPGA package type chosen allows for in-house assembly and repair, whereas the ball grid array (BGA) package used by Berge and Häfliger (2007) would make this very difficult.

Flow Control

The flow control signal has to fulfill the following requirements: it has to be transmitted over a differential pair; for AC coupling it has to be DC free; and it has to represent two states, receiver busy or receiver ready. A square wave was chosen as the flow control signal because it is DC free and can easily be generated by clocked digital logic. The receiver FPGA signals that it is ready to receive by generating a square wave at half the SerDes clock frequency. If the receiver is running out of FIFO space, it signals the sender to stop by generating a square wave at an eighth of the SerDes clock frequency. These signals can easily be decoded by the sender FPGA by counting the number of clock cycles for which the flow control signal keeps the same value. If this count is one to three the sender keeps sending; if it reaches four or more the sender has to stop.
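The sender-side decoding can be sketched in software as a simple run-length detector. This is only a behavioral illustration of the rule described above; the actual logic is implemented in the FPGA fabric.

```python
# Behavioral sketch of the AEX flow-control decoder: the sender counts how
# many consecutive SerDes clock cycles the back-channel signal holds its value.
def make_flow_control_decoder():
    state = {"last": None, "run": 0, "stop": False}

    def sample(level):                        # call once per SerDes clock cycle
        if level == state["last"]:
            state["run"] += 1
            if state["run"] >= 4:             # slow square wave: receiver busy
                state["stop"] = True
        else:
            if 1 <= state["run"] <= 3:        # fast square wave: receiver ready
                state["stop"] = False
            state["last"], state["run"] = level, 1
        return state["stop"]                  # True -> sender must pause

    return sample
```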

In order to know at what receiver FIFO fill-level to signal a stop condition to the sender, the sum of the forward and backward channel latencies must be known. The Texas Instruments SerDes TLK3101 used has a total link latency of 145 bit times (Texas Instruments 2008), giving 7.25 clock cycles, plus the line delay of the cable. The flow control back channel has a latency equal to the line delay plus 2 cycles for the synchronizer registers, plus 4 to 5 cycles to detect the stop state. This adds up to 14.25 cycles plus two line delays. At a 2 m maximum cable length this is 2 × 2m/0.5c = 26.6 ns which is 3.325 cycles (at 125 MHz). (This conservatively assumes the signal propagation speed in SATA cables to be half the speed of light.) Thus the total delay should be less than 18 cycles. The latest time to dispatch the flow control stop signal is thus when 18 words of the 16-bit receiver FIFO remain free.
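The cycle budget quoted above can be checked with a short calculation. The figure of 20 bits per parallel-side clock cycle is an assumption implied by the quoted 145 bit times corresponding to 7.25 cycles at 125 MHz.

```python
# Worked AEX flow-control latency budget (reproduces the numbers in the text).
F_CLK = 125e6                          # SerDes parallel-side clock (Hz)
T_CLK = 1.0 / F_CLK                    # 8 ns per cycle

serdes_latency = 145 / 20              # forward link latency: 7.25 cycles
sync_regs      = 2                     # synchronizer registers on the back channel
stop_detect    = 5                     # worst-case cycles to detect the stop state

cable_len_m = 2.0
v_prop = 0.5 * 3e8                     # conservatively, half the speed of light
line_delay = (cable_len_m / v_prop) / T_CLK      # one-way delay, ~1.67 cycles

total = serdes_latency + sync_regs + stop_detect + 2 * line_delay
print(round(total, 2))                 # ~17.58 -> signal stop with 18 FIFO words free
```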

PCI-Based High-Fanout AER Mapper5

AER mappers commonly use SRAM to store the mapping information and FPGAs to handle the I/O interfaces and to control the mapping process, typically having up to 4 MB of SRAM to store the mapping tables, and using parallel AER interfaces. Fasnacht and Indiveri (2011) presented a new AER mapper that can store mapping tables up to 2 GB in size, and uses the same high-speed serial AER interfaces as the AEX board described above and in Fasnacht et al. (2008).

Mapper System Overview

The design choices were dominated by the requirements to interface the mapper to the preexisting AEX boards (Fasnacht et al. 2008) that make use of the Serial AER interface and to affordably store and quickly access Look-up Table (LUT) type mapping tables of hundreds of megabytes in size.

As already noted in Section 13.1.3, affordable FPGA systems do not have large amounts of RAM. On the other hand, in PC hardware, gigabytes of very fast DRAM are readily available at prices which are orders of magnitude lower than memory with equivalent specifications on FPGA development platforms. For this reason, in this design PC hardware and FPGAs are combined using the PCI interface (see Figure 13.6).

The PC motherboard uses the Intel P35/ICH9R chip-set. Unlike in more recent Intel and AMD CPUs, in this architecture the memory controller is not in the CPU, but in the North Bridge. When a PCI device needs to access the main memory, the datapath therefore runs from the PCI bus via the North Bridge memory controller to the DRAM, as illustrated in Figure 13.6.

images

Figure 13.6  PCI-Based High-Fanout Mapper. The basis is a modified PCI FPGA development board containing a Xilinx Spartan-3 FPGA. The FPGA connects to both a custom daughterboard that implements the Serial AER interface and to the PCI bus of a PC Motherboard. The Mapping Table is held in the Main Memory (physical memory map at bottom right) on the motherboard and accessed via the PCI North Bridge. © 2011 IEEE. Reprinted, with permission, from Fasnacht and Indiveri (2011).

(For details of the PCI datapath, see Shanley and Anderson 1999.) This arrangement means that the CPU is not involved in the actual mapping operation. The 2 GB of mapping table space is split uniformly into 2^16 pieces, giving 32 KB of RAM per possible 16-bit input AE. With two bytes per AE this allows up to 16,384 words to be stored per input AE. Because a reserved sentinel address value is used to mark the end of a sequence of output AEs, one input AE can generate up to 16,383 output AEs; this is the maximum fanout of the mapper. This is an example of a fixed-length destination address list mapper, albeit one in which the destination lists can be very long. Only one contiguous access to the mapping table RAM is needed in order to get all the output AEs for one input AE. Calculating the memory address for a given input AE is simple; the numeric value of the input AE is multiplied by 32 KB, and the mapping table offset is added. For the FPGA this is almost free as it is simply a bit shift followed by an addition, which can be done in a single-cycle operation.
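The address calculation and table walk can be expressed compactly in software. The sketch below treats the mapping table as a flat array of 16-bit words; the sentinel value and function names are illustrative, and in the real system the lookup is performed by the FPGA reading PC main memory over PCI.

```python
# Behavioral sketch of the high-fanout mapping-table lookup.
BLOCK_BYTES = 32 * 1024          # 32 KB of table space per possible 16-bit input AE
WORD_BYTES  = 2                  # one 16-bit output AE per table word
SENTINEL    = 0xFFFF             # assumed end-of-destination-list marker

def table_word_index(input_ae):
    # Multiply by 32 KB == shift left by 15 (bytes), then convert to a word index.
    return (input_ae << 15) // WORD_BYTES

def map_event(input_ae, table):
    """table: a flat sequence of 16-bit words; returns the output AEs (fanout <= 16383)."""
    base = table_word_index(input_ae)
    out = []
    for i in range(BLOCK_BYTES // WORD_BYTES):
        word = table[base + i]
        if word == SENTINEL:
            break
        out.append(word)
    return out
```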

The mapper uses a custom PCI implementation in order to get as much control over the PCI communication as possible. The implementation ignores Arbitration Latency and the master Latency Timer. This means that other PCI bus-masters can be prevented from gaining access to the PCI bus for longer than usual. In the case of the PCI mapper this is desired behavior, as ongoing mapping activity should not be suspended to grant bus access to other PCI devices.

The PCI-FPGA board is extended with a daughterboard carrying the components necessary for the AER input and output interfaces. These AER interfaces use the same SerDes chips and SATA connectors and cables as the AEX board presented in Fasnacht et al. (2008) and in Section 13.2.5 above. This interface allows AEs to be transferred at very high speeds between boards in a multichip, multiboard setup. Neuromorphic sensors and processing chips are connected to the mapper through this interface and the AEX board. The Serial AER interface achieves event rates of 156 MHz with 16-bit AEs or 78 MHz with 32-bit AEs.

Typical Usage in Multichip Setups

A practical multichip setup such as that in Neftci et al. (2010) using this mapper typically consists of a number of neuromorphic chips, each of which is directly connected to an AEX board, and an AER mapper. These components are connected using their Serial AER interfaces in a ring topology. Each AEX board has a certain address space assigned. If incoming AEs fall into that space, they are sent to the local chip; otherwise they are forwarded (kept in the ring). This allows the mapper to send AEs to all chips, and allows all chips to send AEs to the mapper. The full network connectivity is controlled at one single point, that is, the mapper. Several examples of multichip systems have been built with this mapper, using up to five multineuron AER chips, and in none of them was the mapper considered to be the limiting factor.

Performance

The measured latency of 774 ns for a fanout of 64 is the same as for a fanout of 1. The mapper takes 983.7 ns to map and send out 64 events and 15.27 μs for 1024 events. The mapper thus emits two destination AEs per PCI bus cycle (30 ns), as is to be expected given that the AEs are 16-bit and the PCI bus transfers 32 bits per cycle. The bandwidth bottleneck is therefore the PCI bus itself.

13.2.6 Reconfigurable Mesh-Grid Architecture

In CAVIAR, as described in Section 13.2.3, a bandwidth-efficient interboard (or chip) interconnection scheme was used, as all interconnections were local to the sender and receiver boards (or chips). This maximizes communication efficiency, but at the cost of cumbersome reconfigurability. Whenever a change in interconnections is required, bus connectors need to be unplugged and plugged in elsewhere mechanically by hand.

One way to simplify multichip and multiboard systems’ reconfigurability is to physically assemble them in fixed, hardwired, 2D-grid arrangements, while implementing a logical reconfigurable interconnection layer on top of the physical layer. This idea is used in the SpiNNaker system (described in Section 16.2.1) where the logical interconnection layer defines the interconnectivity among all neurons in the system. However, SpiNNaker’s approach differs from the one developed by Zamarreño-Ramos et al. (2013), where the logical layer only describes the interconnectivity among AER modules (chips or PCBs).

Figure 13.7(a) shows a 2D grid of modules, where each module can be any neuron/synapse AER block within a chip, an FPGA, or a PCB. Example AER modules include sensor chips, convolution modules, WTA modules, multineuron learning chip modules, and so on. Each module is accompanied by a programmable router. The address events have two parts. The lower, less-significant part carries the ‘local’ information of the event, which is only meaningful to the sender and receiver AER modules, as it was in the CAVIAR approach; the upper, more-significant part of the address defines the (x, y) coordinate of the source or destination module within the grid, depending on whether source or destination coding is used. The routers look only at the upper part of a traveling event and send it on to neighboring modules until it reaches its destination module. The logical network is defined by the data programmed into the routing tables of the individual routers. With this approach, any network topology can be mapped onto the 2D grid of modules. This technique was tested on Virtex-6 FPGAs (Zamarreño-Ramos et al. 2013). Up to 48 AER 64 × 64 pixel convolution modules with routers could be fitted within one Virtex-6. Inter-module event hop time was below 200 ns, while inter-FPGA event hop time was about 500 ns. A new multipurpose MeshGrid AER FPGA board was developed as shown in Figure 13.7(b). It contains a Spartan-6 FPGA, four SATA connectors for serial AER 2D-grid expansion, and a number of general-purpose pins.
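The two-part address format and the router decision can be illustrated with a few lines of Python. The bit widths, port names, and the use of destination coding here are illustrative assumptions; the real routers are configured through their programmable routing tables.

```python
# Toy model of mesh-grid address events and table-based routing.
LOCAL_BITS, COORD_BITS = 16, 4

def pack(dest_x, dest_y, local):
    module = (dest_x << COORD_BITS) | dest_y      # upper, more-significant part
    return (module << LOCAL_BITS) | local         # lower part: local information only

def route(event, my_module, routing_table):
    module = event >> LOCAL_BITS                  # routers inspect only this part
    if module == my_module:
        return "local"                            # deliver to the attached AER module
    return routing_table[module]                  # programmed output port, e.g. 'N', 'E', 'S', 'W'
```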

images

Figure 13.7  (a) A 2 × 2 grid arrangement of AER modules within an arbitrarily large reconfigurable router-based system. © 2013 IEEE. Reprinted, with permission, from Zamarreño-Ramos et al. (2013). (b) Spartan-6, four SATA serial-AER interconnect, multipurpose node board for 2D grid arrangements

This board can also be equipped with custom-made ASIC chips that use serial I/O interfaces (Zamarreño-Ramos et al. 2011, 2012). Serial I/O interfaces are readily available in many commercial FPGAs. These ASIC chips use an LVDS bit-serial intercommunication scheme where the sender and receiver are turned off during inter-event pauses, and turned back on very quickly when a new event needs to be transmitted. The reported maximum bit-serial intercommunication speed was 0.71 Gbps in a relatively old 0.35 μm CMOS technology, while using Manchester encoding, which reduces the effective speed to half the physical speed. There are two key types of circuit blocks: (a) a serializer–deserializer pair to convert back and forth between 32-bit parallel AER events and their serial counterparts, using handshaking wires to signal when to turn the communication circuits off and back on; and (b) a driver/receiver pair that drives and reads a differential pair of 100 Ω impedance microstrips. All blocks (serializer, deserializer, driver pad, and receiver pad) have turn-off and turn-on times of less than one bit transmission time, so no penalty is introduced by the capability of turning the circuits off and back on. The blocks can communicate 32-bit events at peak rates of up to 62.5 Meps (Iakymchuk et al. 2014).

13.3 Medium-Scale Multichip Systems

This section describes three medium-scale systems (OR+IFAT, ORISYS, and CAVIAR) which illustrate the various roles played by the multineuron chips in these early multichip AER systems. These systems have the following modules in common: a front-end retina sensor and subsequent multineuron chips which process the retina events. The individual sensors in the systems, the Octopus Retina, the Parvo-Magno retina, and the Dynamic Vision Sensor, were discussed in Chapter 3. Each of the three systems also demonstrates a different solution drawn from the hardware infrastructure approaches discussed in Section 13.2. The present section also describes some of the applications demonstrated on these systems.

13.3.1 Octopus Retina + IFAT

The combined Octopus Retina (OR) plus integrate-and-fire transceiver (IFAT) system is one example of the two-chip AER vision systems developed by various groups during the late 1990s (Arias-Estrada et al. 1997; Higgins and Shams 2002; Indiveri et al. 1999; Ozalevli and Higgins 2005; Venier et al. 1997). Such systems were usually configured to demonstrate cortical-like responses, such as orientation selectivity and motion selectivity, using small numbers of neurons. These multichip systems were also focused on supporting modular, multilayered, hierarchical networks of brain or bio-inspired architectures such as convolutional networks and HMAX (Fukushima 1989; LeCun et al. 1998; Neubauer 1998; Riesenhuber and Poggio 2000). Systems such as the OR+IFAT system described here have also been used to demonstrate higher-level functions such as attention, feature coding, salience detection, foveation, and simple object recognition. These functions capitalize on the AE domain to maintain arbitrary and reconfigurable connectivity between units in the multichip architecture.

System Description

The system, consisting of an 80 × 60 Octopus Retina (Section 3.5.4) and the IFAT board, is shown in Figure 13.8 (Vogelstein et al. 2007a). The IFAT board itself holds two multineuron chips, an FPGA, 128 MB of digital memory in a 4 MB × 32-bit array, and an 8-bit digital-to-analog converter (DAC) required to operate the multineuron chips. The combined multineuron system is composed of 4800 random-access integrate-and-fire (I&F) neurons implemented on two custom aVLSI chips, each of which contains 2400 cells (Vogelstein et al. 2004, 2007a). All 4800 neurons are identical; each implements a conductance-like model of a general-purpose synapse using a switched-capacitor architecture (described in Chapter 8).

images

Figure 13.8  (a) Picture of the combined Octopus Retina (OR) + IFAT system. (b) Architecture of the system. The FPGA on the IFAT board receives spikes from either of the two Octopus Retinas (ORs) or the computer (CPU), depending on which is selected by the multiplexer (MUX). Outgoing addresses from the IFAT are sent to the computer for visualization or storage. Adapted from Vogelstein et al. (2007a). Reproduced with permission of MIT Press

images

Figure 13.9  Hierarchical model of visual information processing based on work by Riesenhuber and Poggio (2000). Spatial features are extracted from retinal images using a small set of oriented spatial filters, whose outputs are combined to form estimates of local salience. The region with maximum salience is selected by a winner-take-all network and used to foveate the image by spatial acuity modulation. A large set of simple cells with many different preferred orientations is then used to process this bandwidth-limited signal. The simple cells’ outputs are combined with a ‘max’ function to form spatially invariant complex cells, and the resulting data are combined in various ways to form feature cells, composite cells, and, finally, ‘view-tuned cells’ that selectively respond to a particular view of an object. Adapted from Vogelstein et al. (2007a). Reproduced with permission of MIT Press

Example Result – Spatial Feature Extraction

A target application of this system and other multichip vision AER systems is the ability of these systems to implement a hierarchical network of visual processing such as the HMAX network shown in Figure 13.9. To construct such a hierarchical system would require many more boards than the single IFAT system. The capability of this board to implement a spatial feature extraction module is described next.

Figure 13.10 illustrates how the IFAT system is used to perform spatial feature extraction by emulating eight different simple cell types. Simple cells are the first stage of processing in the visual cortex (Kandel et al. 2000). They act as oriented spatial filters that detect local changes in contrast, and their receptive fields (RFs) and preferred orientations are both functions of the input they receive from the retina.

Note that because the OR output is proportional to light intensity, these simple cells respond to intensity gradients, not to contrast gradients. In this example, each cortical cell integrates inputs from four pixels in the OR, two of which make excitatory synapses and two of which make inhibitory synapses. The excitatory and inhibitory synaptic weights are balanced so that there is no net response to uniform light. These receptive fields are generated by using the mapping function on the FPGA board.
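Constructing such receptive fields amounts to filling the mapper's look-up table so that each retina address fans out to the cortical cells whose RFs it falls inside. The sketch below builds one such table for a single population of cells with 4 × 1 RFs (two excitatory and two inhibitory inputs); the array sizes follow the text, but the weight values and the exact RF geometry are illustrative assumptions.

```python
# Illustrative mapping table for one population of identically oriented simple cells.
RETINA_W, RETINA_H = 80, 60
W = 1.0                                     # synaptic weight magnitude (arbitrary units)

def build_simple_cell_mapping():
    table = {}                              # retina (x, y) -> [(cortical cell, weight)]
    for y in range(RETINA_H):
        for x in range(RETINA_W):
            cell = y * RETINA_W + x         # one simple cell per retinal position
            # 4 x 1 receptive field: two excitatory and two inhibitory inputs,
            # balanced so that uniform illumination gives no net response.
            rf = [(x - 2, +W), (x - 1, +W), (x + 1, -W), (x + 2, -W)]
            for xi, weight in rf:
                if 0 <= xi < RETINA_W:
                    table.setdefault((xi, y), []).append((cell, weight))
    return table
```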

Because the silicon cortex and retina chips each contain only 4800 neurons, there is necessarily a trade-off between the spacing of similarly oriented simple cells throughout the visual field and the number of differently oriented simple cells with overlapping receptive fields. For the images in Figure 13.10, this trade-off was resolved in favor of increased resolution; each frame was captured from a different configuration of the system in which all 4800 simple cells had identical preferred orientations. By allowing lower resolution RFs, the system can be configured to simultaneously process two or four different orientations.

images

Figure 13.10  (B1)–(I1) Orientation-selective kernel compositions in the IFAT simple cell network. Each simple cell has a 4 × 1 receptive field and receives two excitatory (+) and two inhibitory (−) inputs from the silicon retina. (A2) Original image captured by silicon Octopus Retina. (B2)–(I2) Frames captured from real-time video sequences of retinal images processed by simple cell networks implemented on the IFAT system. Adapted from Vogelstein et al. (2007a). Reproduced with permission of MIT Press

13.3.2 Multichip Orientation System

The second system is the multichip orientation system, ORISYS, developed in the laboratories of Shi and Boahen. It uses the 60 × 96 Parvo-Magno silicon retina by Zaghloul and Boahen (2004). The retina (also described in Chapter 3) generates spike outputs that mimic the responses of ON-sustained and OFF-sustained retinal ganglion cells on a 30 × 48 array of retinal positions. The events between the custom chips of the ORISYS system are communicated using the word-serial protocol (Boahen 2004a, 2004b) discussed in Section 2.4.4, that is, the y-address is transmitted first, followed by the x-addresses of all active events on that row. The Gabor chip implements neurons with spatial receptive fields similar to a Gabor function (Choi et al. 2004). These receptive fields are not created through an AER mapper. The receptive field profile has the same form as the Gabor function commonly used to model orientation-selective cortical neurons (Daugman 1980; Jones and Palmer 1987), except that the modulating function is not Gaussian.

System Architecture

Figure 13.11 shows the block diagram of a feedforward architecture where individual Gabor chips are preprogrammed for four different orientations. The output of the retina is broadcast to several Gabor chips.

Each Gabor chip can process ON and OFF spikes from a 32 × 64 array of retinal positions (Choi et al. 2004). Every retinal position is handled by four neurons on the Gabor chip. Each of the four neurons computes a weighted sum of the spike rates from the ON and OFF ganglion cells in a small neighborhood of that retinal position, half-wave rectifies it, and encodes the result as an output spike rate. The weighting function used in the sum is equivalent to the neuron’s receptive field (RF) profile. The four neurons are denoted EVEN-ON, EVEN-OFF, ODD-ON, and ODD-OFF, and differ according to their RF symmetry (EVEN/ODD) and polarity (ON/OFF). The ON and OFF neurons encode the positive and negative half-wave rectified sums.
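The per-position computation can be summarized in a few lines of Python. Using the ON minus OFF ganglion-cell rate as the input drive, and the dictionary-based kernels, are simplifying assumptions made only for this illustration.

```python
# Sketch of the EVEN/ODD, ON/OFF outputs at one retinal position of a Gabor chip.
def gabor_outputs(drive, even_kernel, odd_kernel):
    """drive: ON minus OFF ganglion-cell rate at each neighborhood position;
    kernels: the EVEN and ODD receptive-field weight profiles."""
    outputs = {}
    for name, kernel in (("EVEN", even_kernel), ("ODD", odd_kernel)):
        s = sum(kernel[p] * drive.get(p, 0.0) for p in kernel)
        outputs[name + "-ON"]  = max(+s, 0.0)    # positive half-wave rectified sum
        outputs[name + "-OFF"] = max(-s, 0.0)    # negative half-wave rectified sum
    return outputs
```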

images

Figure 13.11  Feedforward implementation of ORISYS. Each box with the double line border represents a chip containing a retinotopic array of neurons. Gabor chips are represented by the larger boxes with the dark bar indicating the tuned orientation of the neurons on the chip. Boxes with single line borders indicate circuits that manipulate AER encoded spike trains. The ‘flip image’ circuit remaps spike addresses to flip the input and output images of the fourth chip horizontally, resulting in neurons tuned to 135°. The ‘Chip Select’ block passes only spikes originating from a desired chip. (a) Photograph of setup. (b) Block diagram. © 2005 IEEE. Reprinted, with permission, from Choi et al. (2005)

The neurons in this system capture many of the important characteristics of orientation-tuned cortical neurons. The model of linear filtering followed by a nonlinearity has been shown to account for the responses of a large proportion of V1 neurons in the visual cortex (Albrecht and Geisler 1991; Heeger 1992a, 1992b).

Example Result – Orientation Selectivity

Demonstrating orientation selectivity is a frequent application of AER systems which comprise both a retina and a multineuron module. Two approaches are frequently used in the configuration of an orientation-selective system. The first approach follows the ice-cube model (Hubel 1988; Hubel and Wiesel 1972), where each pixel extracts a different orientation (Cauwenberghs and Waskiewicz 1999; Liu et al. 2001). In the second approach, the neurons of each chip extract the same orientation, so multiple orientations require multiple chips. These two approaches trade off orientation resolution against spatial resolution. The ORISYS system follows the second approach (Choi et al. 2004, 2005). It overcomes the limitation of the IFAT system, in which only one orientation can be implemented on the multineuron chip at any one time and which is hence not suitable for extracting multiple orientations from a scene in real time. The pixels on each of the multineuron chips directly implement a Gabor convolution function, thus removing the need to create such an orientation-selective receptive field for the neurons in the AE domain, and thereby reducing the AER bandwidth requirement.

The performance of this system in implementing orientation-selective RFs is illustrated in Figure 13.12 which shows the outputs of the neurons in response to a dark ring on a bright background using a feedforward system configuration. The four Gabor-type chips were tuned to similar spatial frequencies and bandwidths, but different orientations: 0°, 45°, 90°, and 135°. The 135° orientation was achieved by tuning the chip to 45° and then flipping the input and output images horizontally.

In total, this system contains 32 768 orientation-tuned neurons tuned to four orientations, two spatial phases, two polarities (ON/OFF), and 32 × 64 retinal positions. Neurons from different chips respond to parts of the ring, depending on where their tuned orientations match that of the ring. Transistor mismatch adds variation in the neural responses across position, due to changes in the gain, tuning, and background firing rates of the neurons (Choi et al. 2004; Tsang and Shi 2004).

images

Figure 13.12  Responses of the feedforward ORISYS system to a black ring. The image on the left shows the output of the Parvo-Magno retina (Section 3.5.3). Each image position represents one retinal position in the upper-left 30 × 30 corner of the array. The image intensity encodes the difference between the spike rates of the ON and OFF neurons at each position. The eight remaining images show the outputs of the EVEN (top row) and ODD (bottom row) neurons on the Gabor chips. For the EVEN and ODD responses, black corresponds to differential spike rates ≤ −40 Hz and −80 Hz, respectively. White corresponds to differential spike rates ≥ +40 Hz for both EVEN and ODD responses. © 2005 IEEE. Reprinted, with permission, from Choi et al. (2005)

The total spike rate in the array is limited by the time it takes to transmit one address event, Tcyc = 357 ns. Subsequent events encoded in the same burst require only Tbst = 106 ns. For low loads, where each burst contains only one x-address, the link capacity is 1∕Tcyc = 2.8 million spikes per second. The link capacity in burst mode is 1∕Tbst = 9.4 million spikes per second.
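These capacities follow directly from the two cycle times; the helper below, which assumes that all events of a burst lie on one row, makes the relation explicit.

```python
# Link capacity of the word-serial AER link used in ORISYS.
T_CYC = 357e-9        # time to transmit the address event that opens a burst
T_BST = 106e-9        # time for each subsequent event in the same burst

print(1 / T_CYC)      # ~2.8 Meps when every event opens its own burst
print(1 / T_BST)      # ~9.4 Meps inside long bursts

def burst_time(n_events):
    """Time to send a burst of n_events on one row (illustrative assumption)."""
    return T_CYC + (n_events - 1) * T_BST
```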

13.3.3 CAVIAR

The third system, called CAVIAR (Convolution AER VIsion Architecture for Real-time), was the largest multichip, multilayer, real-time, frame-free AER vision system (Figure 13.13) built at the time, with a combined total of 45K neurons and 5M synapses (Serrano-Gotarredona et al. 2009). The system has four custom mixed-signal AER chips and five custom digital AER interface components, and it performs up to 12 G synaptic operations per second. It is capable of achieving millisecond object recognition and tracking latencies, illustrating the computational efficiency of AER systems. The CAVIAR system is more complex than the previous systems in that there are three layers of processing, including the convolution stage, and in that the system output is interfaced to a robotic motor output controlling a laser pointer.

System Architecture

The CAVIAR vision system (Figure 13.13) is composed of the following custom chips: the DVS temporal contrast retina chip, a set of programmable kernel convolution chips, a 2D winner-take-all (WTA) object chip, and a delay line chip and learning chip. It also includes the set of AER remapping, splitting, and merging FPGA-based modules, and computer–AER interfacing FPGA modules for generating and/or capturing AER spikes discussed in Section 13.2.3. The hardware infrastructure was developed to support a set of AER modules (chips and interfaces) that are connected in series and in parallel to embody an abstract hierarchical multilayered architecture.

images

Figure 13.13  CAVIAR system overview. A bio-inspired system architecture performing feedforward sensing plus processing with the following conceptual structure: (a) a sensing layer, (b) a set of low-level processing layers usually implemented through projection fields (convolutions) for feature extraction and combination, and (c) a set of high-level processing layers that operate on ‘abstractions’ and progressively compress information through, for example, dimension reduction, competition, and learning. An example output from each VLSI component is shown in response to the rotating stimulus. The basic functionalities of AER communication between chips are performed using custom AER boards (see Section 13.2.3). © 2009 IEEE. Reprinted, with permission, from Serrano-Gotarredona et al. (2009)

In an example object tracking application, moving objects in the field of view of the retina cause spikes. Each spike from the retina causes a splat of each convolution chip’s kernel onto its own integrator array. When the integrator array pixels exceed positive or negative thresholds they in turn emit spikes (Serrano-Gotarredona et al. 2006). The resulting convolution spike outputs are noise-filtered by a winner-take-all (WTA) object chip (Oster et al. 2008). The WTA output spikes, whose addresses represent the location of the ‘best’ circular object, are fed into a configurable delay line chip that maps spikes in time into spikes over space. This spatial pattern of temporally delayed spikes is then learned by a competitive Hebbian learning chip (Häfliger 2007). The delay line chip takes temporal streams of spike inputs and projects them into a spatial dimension, thus allowing the competitive Hebbian learning chip to classify temporal patterns.

The WTA spikes can be used to control a mechanical or electronic tracking system that stabilizes the programmed object in the center of the field of view.

We discuss in more detail two of the multineuron chips (the convolution chip and the WTA object chip) in the CAVIAR system to illustrate the types of function that can be demonstrated on a multineuron chip.

AER Programmable Kernel 2D Convolution Chip

The neurons on the convolution chip perform spike-based convolution in a way that differs from other chips that implement on-chip convolution circuits, such as hard-wired elliptic kernels (Venier et al. 1997), Gabor-shaped kernels (Choi et al. 2004), or x/y-separable kernel filtering (Serrano-Gotarredona et al. 1999a).

The convolution kernel is stored in an on-chip RAM, which brings two advantages: first, the shape and size of the convolution kernel are restricted only by the size of the on-chip RAM; second, the external AER bus between the retina and the convolution chip is not used to generate the receptive field, so the bandwidth of the bus is not occupied by the mapping activity. The chip is an AER transceiver with an array of event integrators. For each incoming event, integrators within a projection field around the addressed pixel compute a weighted event integration. The weight of this integration is defined by the convolution kernel (Serrano-Gotarredona et al. 2006). This event-driven computation puts the kernel onto the integrators.

Figure 13.14 shows the block diagram of the convolution chip. The main parts of the chip are: (1) an array of 32 × 32 pixels. Each pixel contains a binary-weighted signed current source and an integrate-and-fire signed integrator (Serrano-Gotarredona et al. 2006). The current source is controlled by the kernel weight read from the RAM and stored in a dynamic register. (2) A 32 × 32 kernel RAM. Each kernel weight value is stored with signed 4-bit resolution. (3) A digital control block that handles the sequence of operations for each input event. (4) A monostable. For each incoming event, it generates a pulse of fixed duration that enables the integration simultaneously in all the pixels. (5) x-Neighborhood block. This block performs a displacement of the kernel in the x direction. (6) Arbitration and decoding circuitry that generates the output address events.

images

Figure 13.14  Architecture of CAVIAR convolution chip. © 2009 IEEE. Reprinted, with permission, from Serrano-Gotarredona et al. (2009)

The chip operation sequence is as follows: (1) Each time an input AE is received, the digital control block stores the (x, y) address and acknowledges reception of the event. (2) The control block computes the x-displacement that has to be applied to the kernel and the limits in the y addresses where the kernel has to be copied. (3) The control block copies the kernel from the kernel RAM row by row to the corresponding rows in the pixel array. (4) Once the kernel copy is finished, the control block activates the generation of a monostable pulse. In this way, in each pixel a current weighted by the corresponding kernel weight is integrated during a fixed time interval. Afterwards, kernel weights in the pixels are erased. (5) When the integrator voltage in a pixel reaches a threshold, that pixel asynchronously sends an event. The pixel integrates both positive and negative events and resets its voltage upon reception of an acknowledge from the periphery.
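A behavioral model of this event-driven ‘kernel splat’ is straightforward to write down. The array size follows the text, while the thresholds, the signed-event convention, and the kernel centering are illustrative assumptions; the chip itself copies the kernel row by row and integrates currents during the monostable pulse.

```python
# Behavioral sketch of the event-driven convolution performed by the chip.
import numpy as np

SIZE = 32
state  = np.zeros((SIZE, SIZE))          # integrator values of the pixel array
kernel = np.zeros((SIZE, SIZE))          # programmed kernel RAM (signed weights)
V_TH = 1.0                               # positive/negative firing threshold

def on_input_event(x, y, sign=+1):
    """Add the kernel, centered on (x, y), to the integrators; collect output events."""
    out = []
    for ky in range(SIZE):
        for kx in range(SIZE):
            px, py = x + kx - SIZE // 2, y + ky - SIZE // 2
            if 0 <= px < SIZE and 0 <= py < SIZE:
                state[py, px] += sign * kernel[ky, kx]
                if state[py, px] >= V_TH:
                    out.append((px, py, +1)); state[py, px] = 0.0
                elif state[py, px] <= -V_TH:
                    out.append((px, py, -1)); state[py, px] = 0.0
    return out
```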

AER events can be fed in at a peak rate of up to 50 Meps, and the maximum output rate is 25 Meps. The sustained input event throughput depends on the kernel size and the internal clock frequency: the event cycle time is (4 + 2nk)Tclock, where nk is the number of programmed kernel lines (rows) and Tclock is the internal clock period. The maximum sustained input event throughput therefore varies between 33 Meps for a one-line kernel and 3 Meps for a full 32-line kernel.
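The internal clock period is not quoted explicitly; a value of roughly 5 ns reproduces the figures given above, as the short calculation below shows.

```python
# Sustained input throughput implied by the event cycle time (4 + 2*nk)*Tclock.
T_CLOCK = 5e-9                               # assumed internal clock period (~200 MHz)

def sustained_input_rate(nk):
    return 1.0 / ((4 + 2 * nk) * T_CLOCK)    # events per second

print(sustained_input_rate(1)  / 1e6)        # ~33 Meps for a one-line kernel
print(sustained_input_rate(32) / 1e6)        # ~2.9 Meps for a full 32-line kernel
```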

AER 2D Winner-Take-All Chip

The AER ‘Object’ chip consists of a network of 16 × 16 VLSI integrate-and-fire neurons with excitatory and inhibitory synapses. It implements a winner-take-all operation, a candidate canonical cortical microcircuit function (Oster et al. 2008). By configuring the network connectivity for winner-take-all computation, the chip reduces the dimensionality of the input space by preserving the strongest input and suppressing all other inputs. The ‘Object’ chip receives the outputs of four convolution chips (Figure 13.15) and computes the winner (strongest input) in two dimensions. It first determines the strongest input in each feature map, and then it determines the best feature map. The computation to determine the strongest input in each feature map is carried out using a two-dimensional winner-take-all circuit, shown in one of the four central blocks in Figure 13.15. The network is configured so that it implements a hard winner-take-all, that is, only one neuron is active at a time. The activity of the winner is proportional to the winner’s input activity (Oster and Liu 2004). Each excitatory input spike charges the membrane of the post-synaptic neuron until one neuron in the array – the winner – reaches threshold and is reset. All other neurons are then inhibited via a global inhibitory neuron which is driven by all the excitatory neurons. Self-excitation provides hysteresis for the winning neuron by facilitating the selection of this neuron as the next winner.

images

Figure 13.15  ‘Object’ chip with two levels of competition on four WTA networks. The digital-to-analog converter (DAC) sets the synaptic weights. The scanner blocks are used to read off the membrane potentials of the neurons. The AER encoder and decoder blocks are used to transmit the spikes off-chip and to receive spikes on-chip. Open circles are excitatory neurons, filled circles are inhibitory neurons. Curved lines between two neurons indicate a connection (real or virtual). Triangular terminations of these lines indicate excitatory synapses and circular terminations indicate inhibitory synapses. Light gray lines and circles indicate the elements which carry out the second level of competition. © 2008 IEEE. Reprinted, with permission, from Oster et al. (2008)

Because of the moving stimulus, the network has to determine the winner using an estimate of the instantaneous input firing rates. The number of spikes that the neuron must integrate before eliciting an output spike can be adjusted by varying the weights of the input synapses.

To determine the winning feature map, the authors use the activity of the global inhibitory neuron (which reflects the activity of the strongest input within a feature map) of each feature map in a second layer of competition (see Figure 13.15). By adding a second global inhibitory neuron to each feature map and by driving this neuron from the outputs of the first global inhibitory neurons of all feature maps, only the strongest feature map will survive. The output spikes of the ‘Object’ chip encode both the spatial location of the stimulus and the identity of the winning feature.
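A minimal software sketch of one feature map's hard WTA captures the spike-driven mechanism described above. All weights and the threshold are illustrative, and the second level of competition between feature maps is omitted.

```python
# Minimal sketch of a spike-driven hard winner-take-all for one 16 x 16 feature map.
import numpy as np

N = 16 * 16
v = np.zeros(N)                      # membrane potentials
W_EXC, W_INH, W_SELF, V_TH = 0.1, 0.12, 0.05, 1.0
last_winner = None

def on_input_spike(i):
    """Deliver an excitatory input spike to neuron i; return the winner or None."""
    global last_winner
    v[i] += W_EXC
    if i == last_winner:
        v[i] += W_SELF               # self-excitation: hysteresis for the winner
    if v[i] >= V_TH:
        v[i] = 0.0                   # the winner spikes and is reset
        v[:] = np.maximum(v - W_INH, 0.0)   # global inhibition of all other neurons
        last_winner = i
        return i
    return None
```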

Example Result – Tracking

CAVIAR’s capabilities were tested on a demonstration system (Figure 13.16) that could simultaneously track two objects of different size.

A mechanical rotor (1) holds a rotating white piece of paper with two circles of different radius and some distracting geometric figures. The vision system follows the two circles only, and discriminates between the two. A pair of servomotor driven mirrors (2) changes the point of view of the AER retina (3), which sends outputs to a monitor PCB (4), and a mapper PCB (5) before reaching the convolution PCB with four convolution chips (6). The latter PCB’s output is sent through another monitor PCB (7) and mapper PCB (8) to the 2D WTA ‘Object’ chip (9). This output is received by a monitor PCB (10) which sends a copy to a microcontroller (11) that controls the mirror motors to center the detected circle. Another copy of the WTA output is sent to the learning system which consists of a mapper (12), a delay line chip (13), another mapper (14), and a learning classifier chip (15), and which learns to classify trajectories into different classes.

images

Figure 13.16  Experimental setup of multistage AER CAVIAR system for tracking a circle (white boxes include custom-designed chips, light gray boxes are interfacing PCBs described in Section 13.2.3, dark gray boxes are remaining modules). © 2009 IEEE. Reprinted, with permission, from Serrano-Gotarredona et al. (2009)

images

Figure 13.17  A 20-ms snapshot of outputs of the various custom chips of the CAVIAR tracking system. The retina central point of view is changed dynamically to follow the small circle, which is then always centered in the field of view. © 2009 IEEE. Reprinted, with permission, from Serrano-Gotarredona et al. (2009)

Figure 13.17(a) shows a histogram reconstructed from the DVS retina output captured by monitor PCB (4) in Figure 13.16. White dots represent positive sign events (dark-to-light transitions) and black dots represent negative sign events (light-to-dark transitions), allowing identification of the direction of motion of the geometric figures, which is clockwise in this case. Figure 13.17(b) shows a histogram image reconstructed from the 64 × 64 convolution PCB output captured by monitor PCB (7) in Figure 13.16. In this case, the kernel was programmed to detect the small circle. Positive sign events (white) show where the small circle is centered, while the negative events (dark) show where it is not. The convolution output includes some noise, which is filtered out by the WTA operation. The convolution output pixels are transformed from size 64 × 64 to 32 × 32 (by grouping 2 × 2 pixels into one) by the mapper (8) in Figure 13.16. Figure 13.17(c) shows the output of the WTA computing stage, where all noise has been filtered out. The white pixels show the center of mass of the small circle and the dark pixels show the activities of the local and global inhibitory units of each quadrant.

13.4 FPGAs

There is a very active community of neuromorphic engineers working with FPGA devices. These devices allow much faster system design, debugging, and testing workflows than full-custom VLSI chip designs (both digital and analog). FPGA devices have continued to improve every year in features, performance, and resources since their invention by Ross H. Freeman (Freeman 1989), who, together with Bernard Vonderschmitt, co-founded the Xilinx company. The FPGA was conceived of as an evolution of the Complex Programmable Logic Device (CPLD). A CPLD is a set of logic blocks distributed in macrocells that can be connected programmatically using an interconnection matrix based mainly on multiplexers. The logic blocks available on a CPLD are very similar to those available on a Programmable Logic Device (PLD), a matrix of OR and AND gates that can be connected to obtain combinational digital circuits. Macrocells on CPLDs also include registers and polarity bits. The main characteristics of the evolution from CPLD to FPGA are the greater flexibility of the interconnectivity between logic cells, the inclusion of SRAM memory bits, and the inclusion of other embedded resources, such as multipliers, adders, clock multipliers, and so on. In the beginning, FPGAs could not work at very high clock speeds (only at speeds of a few megahertz to tens of megahertz), but today one can buy a Virtex UltraScale FPGA that takes advantage of 3D Stacked Silicon Interconnect (SSI) technology, with up to 4.4 million logic cells, around 100 Mbits of RAM, 2800 digital signal processing blocks, up to 1500 IO pins, 100 Gbit Ethernet ports, and 16 Gbps and/or 32 Gbps low-voltage differential signaling (LVDS) serial interfaces for PCIe, SATA, or other standards. Neuromorphic engineers initially used FPGA devices for supplementary tasks, but today use them to develop very large systems on very powerful platforms.

Several FPGA-based solutions have supported neuromorphic system communications. For example, the PCI-AER interface (see Section 13.2.2) developed by Dante (2004), with a performance of up to 1 Meps (Dante et al. 2005), used an FPGA for AER handshaking, time-stamp management, and event mapping. Under the CAVIAR project (2002–2006) a set of AER tools based on FPGAs was developed and distributed in the neuromorphic community by the Robotics and Technology of Computers Lab of the University of Seville (Gomez-Rodriguez et al. 2006; Serrano-Gotarredona et al. 2009); see Section 13.2.3 and Section 13.3.3. The Spartan II was the FPGA most used for sequencing, monitoring, and mapping events in real time, using an on-board SRAM for stand-alone demonstrations or using USB when communicating and debugging from a personal computer or laptop. Fasnacht et al. (2008) developed the AEX board (described in Section 13.2.5) for interconnecting neuromorphic devices and debugging them from a host computer. This board uses a Spartan3 FPGA as a communication core and several commercial chips for interfacing to high-speed serial communications (Ethernet and USB 2.0). Cauwenberghs’ laboratory (at the Institute of Neural Computation, University of California, San Diego) developed a Hierarchical AER (HiAER, see Section 16.2.2) communication routing architecture for neuromorphic systems, using a Spartan6 FPGA to implement the communication at every node of the hierarchy. As part of this work, Park et al. (2012) presented a two-level communication hierarchy for connecting four IFAT neuromorphic chips. Under the FACETS project a neuromorphic wafer system was developed, in which a set of synapse-and-neuron blocks (HICANN, see Section 16.2.4) are connected through an intra-wafer high-density routing grid. This grid can be connected to the outside world (another wafer or a host PC) using a packet-based protocol implemented on a Virtex5 FPGA and specialized digital network chips (DNCs). This communication infrastructure can manage up to 2.8 Geps of traffic in the system (Hartmann et al. 2010).

In 2007, work incorporating neuromorphic processing using FPGAs started to appear in the literature; for example, Cassidy et al. (2007) developed an array of Leaky Integrate-and-Fire (LIF) neurons on an FPGA and tested it by developing auditory Spatiotemporal Receptive Fields (STRFs), a neural parameter optimizing algorithm, and an implementation of the Spike Timing-Dependent Plasticity (STDP) learning rule.

Hasler’s laboratory at the Georgia Institute of Technology developed FPAAs, that is, Field Programmable Analog Arrays, which can be used for implementing analog neuromorphic systems (Hall et al. 2005), and Petre et al. (2008) developed an automated way to program them using Simulink.

Later, several works implementing neuromorphic vision processing on FPGA-based systems began to be reported. For example, frame-based convolutional networks (Boser et al. 1992) were implemented on FPGAs in the NeuFlow system developed by Farabet et al. (2011), the architecture of which provides many processing elements that share a smart cache that accelerates the processing in a powerful pipeline. Orchard et al. (2013) developed an implementation of a biologically inspired spatiotemporal energy model for motion estimation on a Xilinx Virtex6 FPGA using frame-based visual information. Sabarad et al. (2012) developed a systolic array-based architecture which includes a run-time reconfigurable convolution engine that can perform multiple, variable-sized convolutions in parallel, synthesized on a Virtex6 platform. Al Maashri et al. (2011) developed a hardware architecture for accelerating the cortically inspired visual object classification algorithm HMAX (Riesenhuber and Poggio 2000) on a four-Virtex5 FPGA platform. Okuno and Yagi (2012) developed a real-time, frame-based visual processing system based on an adaptive image sensor with logarithmic compression that can be adjusted by control logic, plus an embedded PowerPC processor running on a VirtexIIpro FPGA connected to a silicon retina (Shimonomura et al. 2011). Spike-based convolutional processors for FPGAs (Linares-Barranco et al. 2009b), equivalent to those developed on VLSI chips (Serrano-Gotarredona et al. 1999b), were later combined in a mesh Network on Chip (NoC) of up to 64 convolution units on a Spartan-6 platform (Zamarreño-Ramos et al. 2013) that can be scaled by connecting several such platforms via 2.5 Gbps serial links (Iakymchuk et al. 2014).

Bio-inspired, spike-based neural network models have been synthesized successfully on FPGAs to improve accuracy and allow for large-scale neuromorphic algorithm implementations. For example, Gomar and Ahmadi (2014) implemented and tested high-accuracy biological neural networks based on the LIF, Adaptive Exponential Integrate-and-Fire (AdEx), and Izhikevich neuron models on a VirtexIIpro FPGA.

Even spike-based processing blocks for neuro-inspired motor control (Perez-Peña et al. 2013) have been developed for Spartan3 and Spartan6 FPGAs in such a way that a set of building blocks can be connected to develop custom spike-based processing systems.

13.5 Discussion

As we have seen in the examples presented in Section 13.2, the same basic monitoring, sequencing, and mapping functionalities have been re-implemented many times over the years with increasing technological sophistication to achieve ever better performance and/or usability, with occasional extra features also being implemented (see Figure 13.18).

The SCX can be seen as a very ambitious, forward-looking prototype. It incorporated several advanced features, such as its domain buses, which would theoretically have allowed for the construction of large (at least O(10^5) neuron) multiboard systems. However, the chips of the day had orders of magnitude fewer neurons, and no such large-scale system was ever constructed using the SCX. The SCX design was such that its multineuron chips were plugged directly into the SCX board and were thus very tightly coupled to one particular chip package type, pin-out, and chip parameter update mechanism. Therefore, despite the generality of the SCX’s overall architecture, the SCX framework could not keep up with the progress in chip designs. As a VME bus based design, the SCX was also very bulky, and it would never have been very convenient to provide one on the desk of each researcher who might want to use such a system.

images

Figure 13.18  Hardware infrastructure time-line

The Rome PCI-AER board design (Section 13.2.2) kept the infrastructure components separate from the neuromorphic chips – the chips are expected to be mounted on their own carrier boards and connect to standard ribbon cable headers on a desktop header board which is in turn connected to the PCI bus based main board. With this decoupling from the chip design, the Rome PCI-AER board achieved a useful life of almost a decade, despite the evolution of chip designs over this time. Also, being PCI-based means that it is much easier and cheaper to provide any and all interested researchers with a device than would ever have been the case with the VME based SCX. In the end, over 20 boards were produced and distributed in several countries. The Rome PCI-AER board was well supported with extensive user documentation, including for its Linux driver and accompanying library. However, because it was an attempt at a ‘one tool fits all’ approach, it has a large, multidimensional configuration space, which makes it rather difficult to use correctly, despite the documentation. Its largest failing, though, was its monitoring and sequencing performance. Because the board cannot perform DMA and bus-mastering data transfers, the host CPU has to transfer monitor and sequencer data across the PCI bus, and only sustained event rates of around 1 Meps could be supported.

The CAVIAR project (Sections 13.2.3 and 13.3.3) introduced a flexible palette of specialized interface boards, including boards which can operate standalone for cases where it is undesirable to have a desktop PC involved in the system. Where a connection to a PC is required, the CAVIAR project turned in part to the USB bus. Indeed it is much more convenient to be able to move a USB-based device from system to system – unlike PCI, the host computer does not need to be opened – and it becomes possible to use a laptop as a host. CAVIAR’s PCI-AER board learned the lesson from the Rome PCI-AER board and was DMA and bus-mastering capable to gain higher performance, at the expense of greater development effort. The provision of multiple types of boards (one PCI-based and several USB-based variants) also meant that a much larger development effort was necessary than for the single Rome PCI-AER board. And yet, unlike for the Rome PCI-AER board (where the lowest level of user interaction was expected to be at the level of the software library API, or in extremis at the level of the driver itself), for some use-cases the end-user of some of the CAVIAR boards was expected to achieve the promised flexibility by modifying the VHDL code defining the behavior of the FPGAs on the boards. In this way, the CAVIAR USB-AER board was modified to implement probabilistic mapping and mappings with configurable delays, features which other systems had not offered.

The USBAERmini2 introduced the important ‘early packet’ feature and the ability to synchronize the time-stamps across multiple boards, and was well supported and integrated into the jAER project. Because of this and its low cost and simplicity, it has also been continuously used for about a decade in other projects.

More recently, the Mesh-Grid architecture (Section 13.2.6), and the AEX board (Section 13.2.5) and its associated High-Fanout Mapper, have moved toward establishing high-speed serial links over industry-standard SATA cables as a standard for inter-chip communication, though communication of AEs to a host PC remains USB based. The High-Fanout Mapper uses the PCI bus in a novel way: the host PC is used more or less only to provide power and memory to a PCI device, which is implemented using a commercially available FPGA board with custom daughterboards providing SAER capability. This probably points the way to future developments for small-scale AE hardware infrastructure systems. Such systems are likely to be increasingly constructed using off-the-shelf modules (e.g., FPGA evaluation kits) as far as possible, together with the minimum of custom parts needed to actually interface to the neuromorphic chips. This approach helps to keep down development efforts and costs.

At the same time, particular systems such as retinas and cochleas have spawned their own intense AER infrastructure development efforts, which focus on tight integration, miniaturization, and specialization; for example, the latest silicon retina developments integrate the AER interface, multicamera synchronization, ADC, DAC, and inertial measurement units onto the same small PCB (Delbruck et al. 2014; jAER Hardware Reference n.d.).

Larger scale projects, however, face a different set of challenges, and these are discussed in Chapter 16.

References

AER. 1993. The address-event representation communcation protocol [sic], AER 0.02.

Al Maashri A, DeBole M, Yu CL, Narayanan V, and Chakrabarti C. 2011. A hardware architecture for accelerating neuromorphic vision algorithms. IEEE Workshop on Signal Processing Systems (SiPS), pp. 355–360.

Albrecht DG and Geisler WS. 1991. Motion selectivity and the contrast response function of simple cells in the visual cortex. Visual Neurosci. 7(6), 531–546.

Arias-Estrada M, Poussart D, and Tremblay M. 1997. Motion vision sensor architecture with asynchronous self-signaling pixels. Proc. 7th Intl. Work. Comp. Arch. for Machine Perception (CAMP), pp. 75–83.

Berge HKO and Häfliger P. 2007. High-speed serial AER on FPGA. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 857–860.

Berner R, Delbrück T, Civit-Balcells A, and Linares-Barranco A. 2007. A 5 Meps $100 USB2.0 address-event monitor-sequencer interface. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 2451–2454.

Boahen KA. 1998. Communicating neuronal ensembles between neuromorphic chips. In: Neuromorphic Systems Engineering (ed. Lande TS) The International Series in Engineering and Computer Science, vol. 447. Springer. pp. 229–259.

Boahen KA. 2004a. A burst-mode word-serial address-event link—I: transmitter design. IEEE Trans. Circuits Syst. I, Reg. Papers 51(7), 1269–1280.

Boahen KA. 2004b. A burst-mode word-serial address-event link—II: receiver design. IEEE Trans. Circuits Syst. I, Reg. Papers 51(7), 1281–1291.

Boser BE, Sackinger E, Bromley J, LeCun Y, and Jackel LD. 1992. Hardware requirements for neural network pattern classifiers: a case study and implementation. IEEE Micro 12(1), 32–40.

Cassidy A, Denham S, Kanold P, and Andreou A. 2007. FPGA based silicon spiking neural array. Proc. IEEE Biomed. Circuits Syst. Conf. (BIOCAS), pp. 75–78.

Cauwenberghs G and Waskiewicz J. 1999. A focal-plane analog VLSI cellular implementation of the boundary contour system. IEEE Trans. Circuits Syst. I 46(2), 327–334.

CAVIAR. 2002. CAVIAR project, http://www.imse-cnm.csic.es/caviar/ (accessed August 5, 2014).

Choi TYW, Shi BE, and Boahen K. 2004. An ON-OFF orientation selective address event representation image transceiver chip. IEEE Trans. Circuits Syst. I 51(2), 342–352.

Choi TYW, Merolla PA, Arthur JV, Boahen KA, and Shi BE. 2005. Neuromorphic implementation of orientation hypercolumns. IEEE Trans. Circuits Syst. I 52(6), 1049–1060.

Dante V. 2004. PCI-AER Adapter Board User Manual, 1.1 edn. Istituto Superiore di Sanità, Rome, Italy. http://www.ini.uzh.ch/∼amw/pciaer/user_manual.pdf (accessed August 5, 2014).

Dante V, Del Giudice P, and Whatley AM. 2005. Hardware and software for interfacing to address-event based neuromorphic systems. The Neuromorphic Engineer 2(1), 5–6.

Daugman JG. 1980. Two-dimensional spectral analysis of cortical receptive field profiles. Vision Res. 20(10), 847–856.

Deiss SR. 1994. Address-Event Asynchronous Local Broadcast Protocol. 062894 2e edn. Applied Neurodynamics (ANdt). http://appliedneuro.com/ (accessed August 5, 2014).

Deiss SR, Douglas RJ, and Whatley AM. 1999. A pulse-coded communications infrastructure for neuromorphic systems [Chapter 6]. In: Pulsed Neural Networks (eds. Maass W and Bishop CM). MIT Press, Cambridge, MA. pp. 157–178.

Delbrück T. 2007. SimpleMonitorUSBXPress resources. http://www.ini.uzh.ch/∼tobi/caviar/SimpleMonitorUSBXPress/index.php (accessed August 5, 2014).

Delbruck T, Villanueva V, and Longinotti L. 2014. Integration of dynamic vision sensor with inertial measurement unit for electronically stabilized event-based vision. Proc. 2014 Intl. Symp. Circuits Syst. (ISCAS 2014).

Farabet C, Martini B, Corda B, Akselrod P, Culurciello E, and LeCun Y. 2011. NeuFlow: a runtime reconfigurable dataflow processor for vision. IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 109–116.

Fasnacht DB and Indiveri G. 2011. A PCI based high-fanout AER mapper with 2 GB RAM look-up table, 0.8 μs latency and 66 MHz output event-rate. Proc. IEEE 45th Annual Conference on Information Sciences and Systems (CISS), pp. 1–6.

Fasnacht DB, Whatley AM, and Indiveri G. 2008. A serial communication infrastructure for multi-chip address event systems. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 648–651.

Freeman RH. 1989. Configurable electrical circuit having configurable logic elements and configurable interconnects. U.S. Patent No. US4870302 A.

Fukushima K. 1989. Analysis of the process of visual pattern recognition by the neocognitron. Neural Networks 2(6), 413–420.

Gomar S and Ahmadi A. 2014. Digital multiplierless implementation of biological adaptive-exponential neuron model. IEEE Trans. Circuits Syst. I: Regular Papers 61(4), 1206–1219.

Gomez-Rodriguez F, Paz R, Linares-Barranco A, Rivas M, Miro L, Vicente S, Jimenez G, and Civit A. 2006. AER tools for communications and debugging. Proc. IEEE Intl. Symp. Circuits Syst. (ISCAS), pp. 3253–3256.

Häfliger P. 2001. Asynchronous event redirecting in bio-inspired communication. Proc. 8th IEEE Int. Conf. Electr. Circuits Syst. (ICECS) 1, 87–90.

Häfliger P. 2007. Adaptive WTA with an analog VLSI neuromorphic learning chip. IEEE Trans. Neural Netw. 18 (2), 551–572.

Hall TS, Twigg CM, Gray JD, Hasler P, and Anderson DV. 2005. Large-scale field-programmable analog arrays for analog signal processing. IEEE Trans. Circuits Syst. I: Regular Papers 52 (11), 2298–2307.

Hartmann S, Schiefer S, Scholze S, Partzsch J, Mayr C, Henker S, and Schüffny R. 2010. Highly integrated packet-based AER communication infrastructure with 3 Gevent/s throughput. Proc. 17th IEEE Int. Conf. Electr. Circuits Syst. (ICECS), pp. 950–953.

Heeger DJ. 1992a. Half-squaring in responses of cat striate cells. Visual Neurosci. 9 (5), 427–443.

Heeger DJ. 1992b. Normalization of cell responses in cat striate cortex. Visual Neurosci. 9 (2), 181–197.

Higgins CM and Shams SA. 2002. A biologically inspired modular VLSI system for visual measurement of self-motion. IEEE Sensors J. 2 (6), 508–528.

Hubel DH. 1988. Eye, Brain, and Vision. WH Freeman, New York.

Hubel DH and Wiesel TN. 1972. Laminar and columnar distribution of geniculo-cortical fibers in the macaque monkey. J. Comp. Neurol. 146 (4), 421–450.

Iakymchuk T, Rosado A, Serrano-Gotarredona T, Linares-Barranco B, Jiménez-Fernández A, Linares-Barranco A, and Jiménez-Moreno G. 2014. An AER handshake-less modular infrastructure PCB with x8 2.5 Gbps LVDS serial links. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1556–1559.

Indiveri G, Whatley AM, and Kramer J. 1999. A reconfigurable neuromorphic VLSI multi-chip system applied to visual motion computation. Proceedings of 7th International Conference on Microelectronics for Neural, Fuzzy, and Bio-Inspired Systems (MicroNeuro), pp. 37–44.

jAER Hardware Reference. n.d. jAER Hardware Reference Designs. http://jaerproject.net/Hardware/ (accessed August 5, 2014).

Jones JP and Palmer LA. 1987. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. J. Neurophys. 58 (6), 1233–1258.

Kandel ER, Schwartz JH, and Jessell TM. 2000. Principles of Neural Science. 4th edn. McGraw-Hill.

Kolle Riis H and Häfliger P. 2005. An asynchronous 4-to-4 AER mapper. Lecture Notes in Computer Science, vol. 3512. Springer. pp. 494–501.

LeCun Y, Bottou L, Bengio Y, and Haffner P. 1998. Gradient-based learning applied to document recognition. Proc. IEEE 86 (11), 2278–2324.

Lehky SR and Sejnowski TJ. 1988. Network model of shape-from-shading: neural function arises from both receptive and projective fields. Nature 333, 452–454.

Lin J, Merolla P, Arthur J, and Boahen K. 2006. Programmable connections in neuromorphic grids. Proc. 49th IEEE Int. Midwest Symp. Circuits Syst., pp. 80–84.

Linares-Barranco A, Jimenez-Moreno G, Linares-Barranco B, and Civit-Ballcels A. 2006. On algorithmic rate-coded AER generation. IEEE Trans. Neural Netw. 17(3), 771–788.

Linares-Barranco A, Gomez-Rodriguez F, Jimenez G, Delbrück T, Berner R, and Liu S. 2009a. Implementation of a time-warping AER mapper. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 2886–2889.

Linares-Barranco A, Paz R, Gómez-Rodríguez F, Jiménez A, Rivas M, Jiménez G, and Civit A. 2009b. FPGA implementations comparison of neuro-cortical inspired convolution processors for spiking systems. In: Bio-Inspired Systems: Computational and Ambient Intelligence. Lecture Notes in Computer Science, vol. 5517. Springer. pp. 97–105.

Liu SC, Kramer J, Indiveri G, Delbrück T, Burg T, and Douglas R. 2001. Orientation-selective aVLSI spiking neurons. Neural Netw. 14 (6/7), 629–643.

Mahowald M. 1992. VLSI Analogs of Neural Visual Processing: A Synthesis of Form and Function. PhD thesis. California Institute of Technology, Pasadena, CA.

Merolla P, Arthur J, and Wittig J. 2005. The USB revolution. The Neuromorphic Engineer 2 (2), 10–11.

Miró-Amarante L, Jiménez A, Linares-Barranco A, Gómez-Rodriguez F, Paz R, Jiménez G, Civit A, and Serrano-Gotarredona R. 2006. A LVDS serial AER link. Proc. IEEE Int. Conf. Electr. Circuits Syst. (ICECS), pp. 938–941.

Neftci E, Chicca E, Cook M, Indiveri G, and Douglas R. 2010. State-dependent sensory processing in networks of VLSI spiking neurons. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 2789–2792.

Neubauer C. 1998. Evaluation of convolution neural networks for visual recognition. IEEE Trans. Neural Netw. 9 (4), 685–696.

Okuno H and Yagi T. 2012. Image sensor system with bio-inspired efficient coding and adaptation. IEEE Trans. Biomed. Circuits Syst. 6 (4), 375–384.

Orchard G, Thakor NV, and Etienne-Cummings R. 2013. Real-time motion estimation using spatiotemporal filtering in FPGA. Proc. IEEE Biomed. Circuits Syst. Conf. (BIOCAS), pp. 306–309.

Oster M and Liu SC. 2004. A winner-take-all spiking network with spiking inputs. Proc. 11th IEEE Int. Conf. Electr. Circuits Syst. (ICECS), pp. 203–206.

Oster M, Wang YX, Douglas R, and Liu SC. 2008. Quantification of a spike-based winner-take-all VLSI network. IEEE Trans. Circuits Syst. I: Regular Papers 55 (10), 3160–3169.

Ozalevli E and Higgins CM. 2005. Reconfigurable biologically-inspired visual motion systems using modular neuromorphic VLSI chips. IEEE Trans. Circuits Syst. I: Regular Papers 52 (1), 79–92.

Park J, Yu T, Maier C, Joshi S, and Cauwenberghs G. 2012. Live demonstration: hierarchical address-event routing architecture for reconfigurable large scale neuromorphic systems. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 707–711.

Paz R. 2003. Análisis del bus PCI. Desarrollo de puentes basados en FPGA para placas PCI [Analysis of the PCI bus: development of FPGA-based bridges for PCI boards]. Research report for the award of research proficiency. Seville.

Paz R, Gomez-Rodriguez F, Rodriguez MA, Linares-Barranco A, Jimenez G, and Civit A. 2005. Test infrastructure for address-event-representation communications. Lecture Notes in Computer Science, vol. 3512. Springer. pp. 518–526.

Paz-Vicente R, Linares-Barranco A, Cascado D, Rodriguez MA, Jimenez G, Civit A, and Sevillano JL. 2006. PCI-AER interface for neuro-inspired spiking systems. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 3161–3164.

Paz-Vicente R, Jimenez-Fernandez A, Linares-Barranco A, Moreno G, Gomez-Rodriguez F, Miro-Amarante L, and Civit-Ballcels A. 2008. Image convolution using a probabilistic mapper on USB-AER board. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 1056–1059.

Perez-Peña F, Morgado-Estevez A, Linares-Barranco A, Jimenez-Fernandez A, Gomez-Rodriguez F, Jimenez-Moreno G, and Lopez-Coronado J. 2013. Neuro-inspired spike-based motion: from dynamic vision sensor to robot motor open-loop control through Spike-VITE. Sensors 13 (11), 15805–15832.

Petre C, Schlottmann C, and Hasler P. 2008. Automated conversion of Simulink designs to analog hardware on an FPAA. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS), pp. 500–503.

Riesenhuber M and Poggio T. 2000. Models of object recognition. Nat. Neurosci. 3, 1199–1204.

Sabarad J, Kestur S, Park MS, Dantara D, Narayanan V, Chen Y, and Khosla D. 2012. A reconfigurable accelerator for neuromorphic object recognition. 17th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 813–818.

Serrano-Gotarredona R, Oster M, Lichtsteiner P, Linares-Barranco A, Paz-Vicente R, Gómez-Rodríguez F, Riis HK, Delbrück T, Liu SC, Zahnd S, Whatley AM, Douglas R, Häfliger P, Jimenez-Moreno G, Civit A, Serrano-Gotarredona T, Acosta-Jiménez A, and Linares-Barranco B. 2005. AER building blocks for multi-layers multi-chips neuromorphic vision systems. Advances in Neural Information Processing Systems 18 (NIPS), pp. 1217–1224.

Serrano-Gotarredona R, Oster M, Lichtsteiner P, Linares-Barranco A, Paz-Vicente R, Gomez-Rodriguez F, Camunas-Mesa L, Berner R, Rivas M, Delbrück T, Liu SC, Douglas R, Häfliger P, Jimenez-Moreno G, Civit A, Serrano-Gotarredona T, Acosta-Jimenez A, and Linares-Barranco B. 2009. CAVIAR: a 45K-neuron, 5M-synapse, 12G-connects/sec AER hardware sensory-processing-learning-actuating system for high speed visual object recognition and tracking. IEEE Trans. Neural Netw. 20(9), 1417–1438.

Serrano-Gotarredona R, Serrano-Gotarredona T, Acosta-Jiménez A, and Linares-Barranco B. 2006. A neuromorphic cortical-layer microchip for spike-based event processing vision systems. IEEE Trans. Circuits Syst. I: Regular Papers 53(12), 2548–2566.

Serrano-Gotarredona T, Andreou A, and Linares-Barranco B. 1999a. AER image filtering architecture for vision-processing systems. IEEE Trans. Circuits Syst. I: Fundam. Theory Appl. 46(9), 1064–1071.

Serrano-Gotarredona T, Andreou AG, and Linares-Barranco B. 1999b. Programmable 2D image filter for AER vision processing. Proc. IEEE Int. Symp. Circuits Syst. (ISCAS) 4, pp. 159–162.

Shanley T and Anderson D. 1999. PCI System Architecture. PC System Architecture Series, 4th edn. Mindshare, Inc./ Addison-Wesley, Boston, MA.

Shimonomura K, Kameda S, Iwata A, and Yagi T. 2011. Wide-dynamic range APS-based silicon retina with brightness constancy. IEEE Trans. Neural Netw. 22(9), 1482–1493.

Sivilotti M. 1991. Wiring Considerations in Analog VLSI Systems with Application to Field-Programmable Networks. PhD thesis. California Institute of Technology, Pasadena, CA.

Texas Instruments. 2008. TLK3101 Datasheet: 2.5 to 3.125 Gbps Transceiver (Rev. B), http://www.ti.com/product/tlk3101#technicaldocuments (accessed August 5, 2014).

Tsang EKC and Shi BE. 2004. A preference for phase-based disparity in a neuromorphic implementation of the binocular energy model. Neural Comput. 16 (8), 1597–1600.

Venier P, Mortara A, Arreguit X, and Vittoz EA. 1997. An integrated cortical layer for orientation enhancement. IEEE J. Solid-State Circuits 32 (2), 177–186.

Vogelstein RJ, Mallik U, Culurciello E, Etienne-Cummings R, and Cauwenberghs G. 2004. Spatial acuity modulation of an address-event imager. Proc. 11th IEEE Int. Conf. Electr. Circuits Syst. (ICECS), pp. 207–210.

Vogelstein RJ, Mallik U, Culurciello E, Cauwenberghs G, and Etienne-Cummings R. 2007a. A multichip neuromorphic system for spike-based visual information processing. Neural Comput. 19 (9), 2281–2300.

Vogelstein RJ, Mallik U, Vogelstein JT, and Cauwenberghs G. 2007b. Dynamically reconfigurable silicon array of spiking neurons with conductance-based synapses. IEEE Trans. Neural Netw. 18 (1), 253–265.

Whatley AM. 1997. Silicon Cortex Software Design Rev. 6, http://www.ini.uzh.ch/∼amw/scx/scx1swod.pdf (accessed August 5, 2014).

Zaghloul KA and Boahen KA. 2004. Optic nerve signals in a neuromorphic chip I: outer and inner retina models. IEEE Trans. Biomed. Eng. 51(4), 657–666.

Zamarreño-Ramos C, Serrano-Gotarredona T, and Linares-Barranco B. 2011. An instant-startup jitter-tolerant Manchester-encoding serializer/deserializer scheme for event-driven bit-serial LVDS interchip AER links. IEEE Trans. Circuits Syst. I: Regular Papers 58(11), 2647–2660.

Zamarreño-Ramos C, Serrano-Gotarredona T, and Linares-Barranco B. 2012. A 0.35 μm sub-ns wake-up time ON-OFF switchable LVDS driver-receiver chip I/O pad pair for rate-dependent power saving in AER bit-serial links. IEEE Trans. Biomed. Circuits Syst. 6(5), 486–497.

Zamarreño-Ramos C, Linares-Barranco A, Serrano-Gotarredona T, and Linares-Barranco B. 2013. Multicasting mesh AER: a scalable assembly approach for reconfigurable neuromorphic structured AER systems. Application to ConvNets. IEEE Trans. Biomed. Circuits Syst. 7(1), 82–102.

__________

1 Parts of the text in this section are © 2006 IEEE. Reprinted, with permission, from Paz-Vicente et al. (2006).

2 The text in this section is © 2006 IEEE. Reprinted, with permission, from Paz-Vicente et al. (2006).

3 The text in this section is © 2007 IEEE. Reprinted, with permission, from Berner et al. (2007).

4 The text in this section is © 2008 IEEE. Reprinted, with permission, from Fasnacht et al. (2008).

5 The text in this section is © 2011 IEEE. Reprinted, with permission, from Fasnacht and Indiveri (2011).
