Chapter 2

Accelerating Techniques for Particle Filter Implementations on FPGA

B.G. Sileshi1 [email protected]; C. Ferrer1,2 [email protected]; J. Oliver1 [email protected]    1 Departament de Microelectrònica i Sistemes Electrònics, Universitat Autònoma de Barcelona, Bellaterra (Barcelona), Spain
2 Institut de Microelectrònica de Barcelona (CNM-CSIC), Bellaterra (Barcelona), Spain

Abstract

Particle filters (PFs) are Bayesian-based estimation algorithms with attractive theoretical properties for addressing a wide range of complex applications that are nonlinear and non-Gaussian. However, they are associated with a huge computational demand that has limited their application in most real-time systems. To address this drawback of PFs, this chapter presents PF acceleration techniques based on a hardware/software codesign approach for the grid-based fast simultaneous localization and mapping (SLAM) application. After an initial identification of the computationally intensive steps of the algorithm, techniques are proposed to accelerate the computational bottleneck steps. Based on the proposed PF acceleration techniques, hardware blocks are designed to reduce the computation time and are interfaced to a central MicroBlaze soft-core processor platform. The proposed hardware/software implementation resulted in an improvement in the overall execution time of the algorithm.

Keywords

Particle filters (PFs)

hardware/software codesign

field-programmable gate array (FPGA)

simultaneous localization and mapping (SLAM)

computational complexity

1 Introduction

Estimating the state of a dynamic system from a set of observations that arrive sequentially in time and are corrupted by noise has many applications. For systems where the amount of nonlinearity is limited, the parametric Kalman filters (extended or unscented versions) can be applied. However, for systems with nonlinear or non-Gaussian dynamics, the Kalman filters have been found to perform poorly (Meinhold and Singpurwalla, 1989; Ristic et al., 2004). On the other hand, a nonparametric sequential Bayesian estimation technique called the particle filter (PF) provides accurate estimates under such circumstances (Gordon et al., 1993; Arulampalam et al., 2002) and has been used in applications such as navigation and positioning (Gustafsson, 2010; Nordlund and Gustafsson, 2001; Atia et al., 2010, 2012; Georgy et al., 2012), tracking (Happe et al., 2011; Medeiros et al., 2008), and robotics (Grisetti et al., 2007). There has also been extensive research in recent years on the application of PFs to estimate the states and parameters of biological systems. For example, several studies (Nakamura et al., 2009; Yang et al., 2007; Gokdemir and Vikalo, 2009; Liu and Niranjan, 2012) demonstrate the applicability of PFs to the state estimation of biological models by considering biological systems as dynamical, nonlinear, and generally partially observed processes.

PFs have the flexibility of efficiently representing a wide range of probability densities using sets of points called particles. Such representational power of PFs, however, comes at the cost of a great computational complexity that has so far limited their application in different types of real-life problems. It is in this sense that hardware- or hardware/software-based implementations of PFs are essential to accelerate the computational time and achieve the objective of real-time computation. With regard to this, the research literature tackles the real-time constraint of the PFs through hardware implementation platforms such as field-programmable gate arrays (FPGAs). Interesting features of FPGAs, such as their ability to introduce massive parallelization and their low power consumption, which is critical for many applications such as navigation, make them the primary choice for the implementation of PFs in real-life applications.

The hardware implementation of PFs on FPGA platforms has been proposed by several researchers. For example, Chau et al. (2012) studied the implementation of PFs on an FPGA for a mobile robot localization problem, where the size of the particle set is adapted to reduce the run-time computational complexity of the algorithm. In Fross et al. (2010), an FPGA hardware implementation of a PF for location estimation is presented and compared with a software solution running on an ARM7-based microcontroller. A recent study (Chau et al., 2013) presents a heterogeneous reconfigurable system consisting of an FPGA and a central processing unit (CPU) for a simultaneous mobile robot localization and people-tracking application; the number of particles is adapted dynamically, and the run-time reconfigurability of the FPGA is used to reduce power and energy consumption. According to their results, a speed-up of up to 7.39 times is achieved compared to a software implementation on an Intel Xeon X5650 CPU. Li et al. (2011) presented a hardware/software codesign approach based on a system-on-a-chip technique that uses a NIOS II processor to calculate the weight of each particle and a hardware implementation to update the particles. They claim that their method significantly improves efficiency and, owing to the software implementation of the importance weight step, provides flexibility in the design step for various applications. Ye and Zhang (2009) presented a hardware architecture for an FPGA implementation of the sampling importance resampling (SIR) PF; the architecture is simulated with ModelSim, and its real-time performance is evaluated using a universal asynchronous receiver/transmitter (UART) for the input sensor data. Saha et al. (2010) presented a system-on-chip architecture involving both hardware and software components for a class of PFs, in which a parameterized framework enables the reuse of the architecture with minimal redesign for a wide range of particle filtering applications. In El-Halym et al. (2012), three different hardware architectures are proposed, and the use of a piecewise linear function instead of the classical exponential function is suggested in order to decrease the complexity of the hardware architecture in the weighting step.

With the objective to reduce the high computational requirements of the PF, this chapter presents PF acceleration techniques based on a hardware/software codesign approach with the following specific contributions:

 As the underlying computations in particle filtering vary from one application to the next, a generic approach is presented for the analysis, design, and speed-up of the PF computational steps. The same approach can be used in the development of other application-specific PF implementations.

 The embedded hardware/software implementation of the PF focuses on a grid-based fast simultaneous localization and mapping (SLAM) algorithm. In order to increase the throughput of the algorithm, two main mechanisms are introduced in the algorithm acceleration:

 A fast hardware Gaussian random number generator design and its application to the PF in the Grid-based Fast SLAM algorithm are presented.

 The design and implementation of the custom COordinate Rotation DIgital Computer (CORDIC) hardware module for the evaluation of the complex mathematical operations involved in several steps of the PF algorithm is presented.

This chapter is an extended version of the work presented by Sileshi et al. (2014), and it is organized as follows: Section 2 first introduces the theory behind the PF algorithm, followed by a discussion of its application to the SLAM algorithm known as Grid-based Fast SLAM. Section 3 discusses the identification of the computational bottlenecks of the PF algorithm and the partitioning of its steps between hardware and software implementations. Section 4 explains the PF acceleration techniques, and Section 5 is dedicated to the hardware architecture design of the proposed acceleration techniques. Section 6 discusses the proposed hardware/software architecture of the PF, and Section 7 presents the implementation results. Finally, Section 8 concludes the chapter.

2 PF and SLAM algorithms

2.1 Particle filtering

The problem of nonlinear filtering is defined by a state-space representation of a dynamic system, given by a discrete-time stochastic model of the state $x_t \in \mathbb{R}^{d_x}$ and measurement $z_t \in \mathbb{R}^{d_z}$ vectors with dimensions $d_x$ and $d_z$, respectively:

$$x_t = f_t(x_{t-1}, v_{t-1}) \tag{2.1}$$

$$z_t = h_t(x_t, w_t) \tag{2.2}$$

where $f_t$ and $h_t$ are possibly nonlinear functions describing the state evolution and the measurement process, respectively; $v_{t-1}$ and $w_t$ are the process and measurement noise sequences, respectively; and $t$ is the time index. The process and measurement noise sequences are assumed to be white, mutually independent, and with known probability density functions.

In the Bayesian estimation framework, the estimate of the state vector $x_t$ at time $t$ is obtained from a noisy measurement $z_t$, where the initial target state is assumed to have a known probability density function (pdf) $p(x_0)$ that is independent of the noise sequences. The estimate of the unknown state $x_t$ is computed from the sequence of all available measurements $Z_{1:t} = \{z_i,\; i = 1, \ldots, t\}$ up to time $t$. The posterior distribution $p(x_t \mid Z_t)$ is calculated recursively with the prediction [Eq. (2.3)] and update [Eq. (2.4)] equations (Arulampalam et al., 2002):

$$p(x_t \mid Z_{t-1}) = \int p(x_t \mid x_{t-1})\, p(x_{t-1} \mid Z_{t-1})\, dx_{t-1} \tag{2.3}$$

$$p(x_t \mid Z_t) = \frac{p(z_t \mid x_t)\, p(x_t \mid Z_{t-1})}{p(z_t \mid Z_{t-1})} \tag{2.4}$$

where the probabilistic model of the state evolution $p(x_t \mid x_{t-1})$ is defined by the system equation [Eq. (2.1)], $p(z_t \mid x_t)$ is defined by the measurement equation [Eq. (2.2)], and $p(z_t \mid Z_{t-1})$ is a normalizing constant given by

$$p(z_t \mid Z_{t-1}) = \int p(z_t \mid x_t)\, p(x_t \mid Z_{t-1})\, dx_t. \tag{2.5}$$

When the linearity and Gaussianity assumptions are satisfied, the optimal analytic solution to the integrals given by Eqs. (2.3) and (2.4) is obtained with the Kalman filtering method. However, an analytical solution to these integrals is rarely available in nonlinear/non-Gaussian estimation problems, and a Sequential Monte Carlo (SMC) method is used to approximate them instead.

The PF is an SMC-based approach that approximates a pdf by a discrete set of particles with their associated weights $\{x_t^i, w_t^i\}_{i=1}^{N}$, where $i$ and $N$ denote the index of the particles and the total number of particles, respectively; $x_t^i \in \mathbb{R}^{d_x}$ and $w_t^i$ denote the state of the particles and their weights, respectively; and $d_x$ is the dimension of the state. The posterior pdf $p(x_t \mid Z_t)$ of the state vector is approximated by (Ristic et al., 2004)

$$p(x_t \mid Z_t) \approx \sum_{i=1}^{N} w_t^i\, \delta(x_t - x_t^i), \tag{2.6}$$

where $\delta(\cdot)$ is the Dirac delta function and $w_t^i$ is the importance weight, given by

$$w_t^i \propto w_{t-1}^i\, \frac{p(z_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, z_t)}. \tag{2.7}$$

In particle filtering, the particle set $S_t = \{x_t^i, w_t^i\}_{i=1}^{N}$ at time $t$ is recursively propagated from the set $S_{t-1}$ at the previous time $t-1$ in such a way that it provides an approximation of the posterior distribution at each iteration. The initial particle set $S_0$ is obtained from samples drawn from the prior density $p(x_0)$. In this approach, a major problem called the particle degeneracy phenomenon arises: after a certain number of iterations, a few normalized importance weights tend toward one, while the remaining weights become negligible. This is undesirable, as only a few particles then contribute to the approximation of the posterior density and much computation is wasted on particles with negligible weight. A common measure of the degeneracy of the algorithm is the effective sample size $N_{\mathrm{eff}}$, given by (Arulampalam et al., 2002)

$$N_{\mathrm{eff}} = \frac{1}{\sum_{i=1}^{N} \left(w_t^i\right)^2}. \tag{2.8}$$

The degeneracy problem can be reduced by using a very large number of samples $N$ but, since that approach is impractical, other techniques are commonly used instead. These include the use of an optimal importance density and resampling (Arulampalam et al., 2002). The prior density is commonly taken as the choice of importance density. In resampling, particles are resampled $N$ times with replacement from the discrete approximation of the posterior density. In this process, particles with low importance weight are eliminated and particles with high importance weight are replicated. This step is performed whenever $N_{\mathrm{eff}}$ falls below a predefined threshold $N_T$. Although the resampling step reduces the degeneracy problem, it limits the parallelization of the algorithm and causes a loss of diversity among the particles. The loss of diversity may lead to all $N$ particles occupying the same point in the state space, giving a poor representation of the posterior density.

The basic steps of the PF discussed so far, which correspond to an SIR PF, are given as follows:

Input: $\{x_{t-1}^i, w_{t-1}^i\}_{i=1}^{N}$, $z_t$

Output: $\{x_t^i, w_t^i\}_{i=1}^{N}$

FOR $i = 1 : N$

Draw $x_t^i \sim q(x_t \mid x_{t-1}^i, z_t)$

Calculate the nonnormalized importance weight $\tilde{w}_t^i$:

$\tilde{w}_t^i = w_{t-1}^i\, \dfrac{p(z_t \mid x_t^i)\, p(x_t^i \mid x_{t-1}^i)}{q(x_t^i \mid x_{t-1}^i, z_t)}$

END FOR

Calculate the sum of the weights: $W = \sum_{i=1}^{N} \tilde{w}_t^i$

FOR $i = 1 : N$

Normalize: $w_t^i = W^{-1} \tilde{w}_t^i$

END FOR

Calculate $N_{\mathrm{eff}}$ using Eq. (2.8)

IF $N_{\mathrm{eff}} \le N_T$

RESAMPLE

END IF

SIR is one of the most commonly used variants of the basic particle filtering scheme known as the sequential importance sampling (SIS) PF, where the prior density $p(x_t^i \mid x_{t-1}^i)$ is used as the importance density $q(x_t^i \mid x_{t-1}^i, z_t)$ and the weights of the particles are given by $w_t^i \propto p(z_t \mid x_t^i)$.
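To make the listing above concrete, the following minimal Python sketch implements one SIR iteration. The propose and likelihood callables stand in for the application-specific motion and measurement models and are illustrative assumptions, as is the simple multinomial resampling used here in place of the IMH scheme adopted later in this chapter.

import math
import random

def sir_step(particles, weights, z, propose, likelihood, n_threshold):
    # One SIR iteration: sample from the proposal (here the prior/motion
    # model), weight by the likelihood, normalize, and resample when the
    # effective sample size of Eq. (2.8) drops below the threshold.
    n = len(particles)
    particles = [propose(p) for p in particles]                 # sampling step
    w = [wp * likelihood(z, p) for wp, p in zip(weights, particles)]
    total = sum(w)                                              # W
    w = [wi / total for wi in w]                                # normalize
    n_eff = 1.0 / sum(wi * wi for wi in w)                      # Eq. (2.8)
    if n_eff <= n_threshold:
        # stand-in multinomial resampling; the chapter later adopts IMH
        idx = random.choices(range(n), weights=w, k=n)
        particles = [particles[i] for i in idx]
        w = [1.0 / n] * n
    return particles, w

# Toy 1-D usage: random-walk motion model and Gaussian likelihood
propose = lambda x: x + random.gauss(0.0, 0.5)
likelihood = lambda z, x: math.exp(-0.5 * (z - x) ** 2)
ps = [random.gauss(0.0, 1.0) for _ in range(100)]
ps, ws = sir_step(ps, [0.01] * 100, 1.0, propose, likelihood, n_threshold=50)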

2.2 Application of PF to SLAM

SLAM involves the problem of localizing a robot and building a map of a given environment simultaneously (Chen et al., 2007). It is usually described by a joint posterior probability distribution [Eq. (2.9)] of the map $m$ and the robot state $x_t$ at time $t$, given the observations $z_{0:t}$ and control inputs $u_{0:t}$ up to and including time $t$, together with the initial state of the robot $x_0$:

$$p(x_t, m \mid z_{0:t}, u_{0:t}, x_0). \tag{2.9}$$

To compute the conditional probability given by Eq. (2.9) for all times $t$, SLAM uses a recursive solution based on Bayes' theorem: starting from an estimate of the distribution $p(x_{t-1}, m \mid z_{0:t-1}, u_{0:t-1})$ at time $t-1$, the joint posterior is updated after a control $u_t$ and an observation $z_t$. The SLAM algorithm is therefore implemented with the two standard steps of a recursive prediction and measurement update process. The recursion is a function of the robot motion model $p(x_t \mid x_{t-1}, u_t)$ and an observation model $p(z_t \mid x_t, m)$.

The prediction is obtained from the following equation:

$$p(x_t, m \mid z_{0:t-1}, u_{0:t}, x_0) = \int p(x_t \mid x_{t-1}, u_t)\, p(x_{t-1}, m \mid z_{0:t-1}, u_{0:t-1}, x_0)\, dx_{t-1} \tag{2.10}$$

and the measurement update is calculated using

$$p(x_t, m \mid z_{0:t}, u_{0:t}, x_0) = \frac{p(z_t \mid x_t, m)\, p(x_t, m \mid z_{0:t-1}, u_{0:t}, x_0)}{p(z_t \mid z_{0:t-1}, u_{0:t})}. \tag{2.11}$$

The solution to a probabilistic SLAM problem involves an efficient and consistent computation of the prediction and update equations. Based on different estimation frameworks for performing these computations, there are different variants of the SLAM algorithm (Thrun, 2002; Chen et al., 2007; Aulinas et al., 2008). The Grid-based Fast SLAM algorithm is a variant that applies particle filtering methods to solve the SLAM problem (Thrun et al., 2005). Grid-based Fast SLAM incrementally processes the observations and odometry readings as they are available and updates the set of samples representing the posterior about the map and trajectory of the robot. The steps of the Fast SLAM based on PF are summarized as follows:

Initialization: At time $t = 0$, sample $\{x_0^i\}_{i=1}^{N}$ from the initial density function $p(x_0)$.

Sampling: Based on the state of the system at time $t-1$, represented by $\{x_{t-1}^i, w_{t-1}^i\}_{i=1}^{N}$, and the observation $z_t$, propose a set of new particles $\{x_t^i\}_{i=1}^{N}$ from the proposal function $q(x_t \mid x_{t-1}, z_t)$. The motion model of the robot is often used as the proposal.

Importance weight: An individual weight $w_t^i$ is assigned to each proposed particle $x_t^i$. The weight of each particle is proportional to the likelihood $p(z_t \mid x_t, m)$ of the most recent observation given the map $m$ associated with the particle and the corresponding pose $x_t$.

Resampling: This step involves sampling a new set of particles from the set $\{x_t^i, w_t^i\}_{i=1}^{N}$ according to the weights of the particles; i.e., particles with larger weights are replicated, while lighter particles are discarded. The resulting particle set is then used in the next time step to predict the posterior probability. Resampling helps to improve the quality of the estimation by reducing the variance of the particles. Different resampling algorithms (Douc and Cappe, 2005) can be used in this step, as they are not application specific. The resampling method adopted in this study is the Independent Metropolis-Hastings Algorithm (IMHA), which has a lower computational complexity than most conventional resampling schemes, such as systematic resampling (Sileshi et al., 2013); a sketch of the scheme follows this list. These characteristics make it the preferred choice for this work.

Map estimation: For each particle, the corresponding map estimate is updated based on the observation and the pose represented by that particle.
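As announced in the resampling step above, the following minimal Python sketch illustrates the generic independent Metropolis-Hastings resampling idea (see Sileshi et al., 2013, for the variant evaluated in this work; this is an assumption-level illustration, not the chapter's hardware implementation). Its appeal for hardware is that it works directly on unnormalized weights, with no cumulative-sum pass.

import random

def imh_resample(weights, burn_in=0):
    # Independent Metropolis-Hastings resampling over particle indices.
    # The chain's stationary distribution is proportional to the weights.
    n = len(weights)
    k = random.randrange(n)                  # current state of the chain
    survivors = []
    for step in range(n + burn_in):
        m = random.randrange(n)              # independent uniform proposal
        # accept index m with probability min(1, w_m / w_k)
        if random.random() * weights[k] <= weights[m]:
            k = m
        if step >= burn_in:
            survivors.append(k)              # index of a replicated particle
    return survivors

# heavier particles are replicated more often on average:
print(imh_resample([0.7, 0.1, 0.1, 0.1], burn_in=8))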

3 Computational bottleneck identification and hardware/software partitioning

To tackle the speed-up of the overall computation in PFs, a preliminary study identifying the critical bottlenecks of the algorithm is crucial, so that hardware modules can be designed to accelerate the computational bottleneck steps. The study is based on the Grid-based Fast SLAM algorithm discussed in Section 2. For an embedded implementation of the Grid-based Fast SLAM algorithm on the FPGA platform, an accurate analysis of the time required by each step of the PF was performed in order to identify the computational bottlenecks of each step.

Figure 2.1 summarizes the critical computational bottlenecks obtained from this preliminary study. In the sampling step, the computation of the sine and cosine functions and the Gaussian random number generator (GRNG) account for 45.91% and 53.62% of its execution time, respectively. In the importance weight step, the computation of the sine and cosine functions and of the exponential function contribute 75.34% and 7.65% of the execution time, respectively. For the resampling step, the generation of uniform random numbers accounts for most of the execution time (60.71%). For each step of the PF computations shown in Figure 2.1, “others” corresponds to the remaining computations involved in that step.

Figure 2.1 Computational bottlenecks identifications in the sampling (s), importance weight (I), and resampling (R) steps of the PF in the Grid-based Fast SLAM application.

The profiling information of the PF given in Figure 2.1 can be used for the hardware/software partitioning; i.e., deciding which parts of the algorithm should be implemented in hardware and which ones can be kept in software (running on the FPGA’s embedded processor). It is clear from Figure 2.1 that the sine and cosine functions and the random number generation are the critical bottlenecks of the Grid-based Fast SLAM algorithm, thus requiring acceleration through a hardware implementation.

The techniques used to speed up the computational bottleneck steps of the PF in the Grid-based Fast SLAM application are based on the CORDIC algorithm (Volder, 1959) for the evaluation of the sine, cosine, and exponential functions, and on the Ziggurat (Marsaglia and Tsang, 2000) and Tausworthe (L'Ecuyer, 1996) algorithms for the generation of Gaussian and uniform random numbers, respectively. The details of these techniques are provided in Section 4, and the respective hardware designs are given in Section 5.

4 PF acceleration techniques

4.1 CORDIC acceleration technique

CORDIC is an iterative algorithm capable of evaluating many basic arithmetic operations and mathematical functions (Aulinas et al., 2008). It involves the rotation of a vector $v = (x, y)$ by an angle $z$ in a circular, linear, or hyperbolic coordinate system, depending on the function to be evaluated. Using only shift and add operations, it performs the rotation iteratively through a series of specific incremental rotation angles. It can operate in one of three configurations (circular, linear, or hyperbolic) and two different modes (rotation or vectoring). In rotation mode, it performs a general rotation by a given angle $z$; in vectoring mode, it computes the unknown angle $z$ of a vector by performing a finite number of microrotations.

The unified CORDIC iteration equations for evaluation of a wide range of functions are given by Eq. (2.12) (Walther, 1971; Lakshmi and Dhar, 2010):

$$\begin{aligned} x_{i+1} &= x_i - \mu\, d_i\, y_i\, 2^{-i} \\ y_{i+1} &= y_i + d_i\, x_i\, 2^{-i} \\ z_{i+1} &= z_i - d_i\, e_i, \end{aligned} \tag{2.12}$$

where $(x, y)$ defines a vector with angle $z$; $\mu$ selects a circular ($\mu = 1$), linear ($\mu = 0$), or hyperbolic ($\mu = -1$) coordinate system; $d_i$ represents the clockwise or counterclockwise direction of rotation, determined from the sign of $z_i$ in rotation mode and from the sign of $y_i$ in vectoring mode; and $e_i$ defines a table of constant values in the circular ($e_i = \tan^{-1} 2^{-i}$), linear ($e_i = 2^{-i}$), and hyperbolic ($e_i = \tanh^{-1} 2^{-i}$) configurations. With the unified CORDIC iteration equations, a wide range of functions can be evaluated. Here, however, the discussion is focused on the functions involved in the PF steps for the Grid-based Fast SLAM application and in the Ziggurat Gaussian random number generator algorithm: the sine, cosine, exponential, and natural logarithm functions. The configurations of the variables $x$, $y$, and $z$, and the values to which the unified CORDIC iteration equations converge after the iterations in rotation and vectoring modes, are given in Table 2.1, where, in the circular-rotation configuration, the functions $f_1$ and $f_2$ correspond to the sine and cosine functions, respectively; in the hyperbolic-rotation configuration, $f_1$ and $f_2$ correspond to the sinh and cosh functions; and in the hyperbolic-vectoring configuration, the function $f_3$ corresponds to $\tanh^{-1}$.

Table 2.1

Configuration of CORDIC for the Evaluation of Different Functions

Rotation mode (after convergence): $x \to K\,[x f_2(z) - \mu\, y f_1(z)]$, $y \to K\,[y f_2(z) + x f_1(z)]$, $z \to 0$

 Circular: Sine/Cosine — set $x = 1/K$, $y = 0$, and $z$ as input; $K = 1.646$

 Hyperbolic: Exponential — set $x = 1/K$, $y = 0$, and $z$ as input; $K = 0.828159$

Vectoring mode (after convergence): $x \to K\sqrt{x^2 + \mu y^2}$, $y \to 0$, $z \to z + f_3(y/x)$

 Circular: —

 Hyperbolic: Logarithmic — set $x = 1$, $z = 0$, and $y$ as input

The evaluation of the sine and cosine functions is obtained directly by applying the properties given in Table 2.1; however, for the evaluation of exponential and natural logarithm functions, indirect properties are used. The evaluation of the exponential function is obtained indirectly by applying the following property:

$$\exp(z) = \sinh(z) + \cosh(z). \tag{2.13}$$

Similarly, in the case of the natural logarithm function, the property given by the following equation is used for its indirect evaluation:

$$\ln(w) = 2 \tanh^{-1}\!\left(\frac{y}{x}\right), \tag{2.14}$$

where $x = w + 1$ and $y = w - 1$.

Because the CORDIC algorithm works only for a limited range of input arguments when evaluating the elementary functions, the range of the inputs must be extended for each mode of operation by applying appropriate prescaling identities. This is achieved by dividing the original input argument to the CORDIC algorithm by a constant to obtain a quotient $Q$ and remainder $D$ (Boudabous et al., 2004). For the sine and cosine functions the constant is $\pi/2$, and for the exponential and logarithmic functions it is $\log_e 2$. The prescaling identities for all the required functions are given in Table 2.2.

Table 2.2

Prescaling Identities for Function Evaluations

$\sin\!\left(Q\frac{\pi}{2} + D\right) = \begin{cases} \sin D & \text{if } Q \bmod 4 = 0 \\ \cos D & \text{if } Q \bmod 4 = 1 \\ -\sin D & \text{if } Q \bmod 4 = 2 \\ -\cos D & \text{if } Q \bmod 4 = 3 \end{cases}$  (domain $|D| < \frac{\pi}{2}$)

$\cos\!\left(Q\frac{\pi}{2} + D\right) = \begin{cases} \cos D & \text{if } Q \bmod 4 = 0 \\ -\sin D & \text{if } Q \bmod 4 = 1 \\ -\cos D & \text{if } Q \bmod 4 = 2 \\ \sin D & \text{if } Q \bmod 4 = 3 \end{cases}$  (domain $|D| < \frac{\pi}{2}$)

$\exp(Q \log_e 2 + D) = 2^Q (\cosh D + \sinh D)$  (domain $|D| < \log_e 2$)

$\log_e(M \cdot 2^E) = \log_e M + E \log_e 2$  (domain $0.5 \le M < 1.0$)

Source: Walther (1971); Lakshmi and Dhar (2010); Boudabous et al. (2004).
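As an illustration of how the CORDIC module evaluates these functions, the following Python sketch mirrors the iteration of Eq. (2.12) and the sine prescaling of Table 2.2. It is a floating-point behavioral model only (the actual design is a 24-bit fixed-point serial VHDL core); the iteration count and the repeated hyperbolic iterations (4, 13, 40, ...) follow standard CORDIC practice rather than the chapter's exact implementation.

import math

N_ITER = 24   # matches the 24-bit serial core described in Section 5

# circular-mode gain K ~= 1.6468 and angle table e_i = atan(2^-i)
K_CIRC = 1.0
for _i in range(N_ITER):
    K_CIRC *= math.sqrt(1.0 + 2.0 ** (-2 * _i))
ATAN_TAB = [math.atan(2.0 ** (-i)) for i in range(N_ITER)]

def cordic_sincos(z):
    # Circular rotation mode of Eq. (2.12): start at x = 1/K, y = 0 and
    # drive z to zero; converges for |z| <= pi/2.
    x, y = 1.0 / K_CIRC, 0.0
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x - d * y * 2.0 ** (-i), y + d * x * 2.0 ** (-i)
        z -= d * ATAN_TAB[i]
    return y, x          # (sin z, cos z)

def sin_any(theta):
    # range reduction per Table 2.2: theta = Q*(pi/2) + D with |D| < pi/2
    q = int(math.floor(theta / (math.pi / 2.0) + 0.5))
    d = theta - q * (math.pi / 2.0)
    s, c = cordic_sincos(d)
    return (s, c, -s, -c)[q % 4]

# hyperbolic mode: iterations start at i = 1, and i = 4, 13, 40, ... repeat
_seq, _i, _rep = [], 1, 4
while len(_seq) < N_ITER:
    _seq.append(_i)
    if _i == _rep:
        _seq.append(_i)          # repeated iteration for convergence
        _rep = 3 * _rep + 1
    _i += 1
K_HYP = 1.0
for _j in _seq:
    K_HYP *= math.sqrt(1.0 - 2.0 ** (-2 * _j))

def cordic_exp(z):
    # Hyperbolic rotation mode; returns cosh z + sinh z = exp z per
    # Eq. (2.13). Valid for |z| < ln 2, i.e., after exp prescaling.
    x, y = 1.0 / K_HYP, 0.0
    for j in _seq:
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x + d * y * 2.0 ** (-j), y + d * x * 2.0 ** (-j)
        z -= d * math.atanh(2.0 ** (-j))
    return x + y

Note that the gains computed here (K_CIRC ~ 1.6468 and K_HYP ~ 0.8281) match the constants 1.646 and 0.828159 quoted in Table 2.1.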

4.2 Ziggurat acceleration technique

The sampling step of the PF for the Grid-based Fast SLAM application requires the generation of normal random numbers. For fast generation of such random numbers, this study uses the Ziggurat algorithm, an efficient method for generating normally distributed random variates from an arbitrary decreasing probability density function $y = f(x)$ by applying the acceptance-rejection method. The method conditionally accepts candidate values that belong to the target distribution and rejects the rest.

The Ziggurat technique partitions the standard normal density function $f(x) = \exp(-0.5 x^2)$, $x > 0$, into $n$ horizontal rectangular blocks $R_i$, where $i = 0, 1, 2, \ldots, n-1$, extending horizontally from $x = 0$ to $x_i$ and vertically from $f(x_i)$ to $f(x_{i-1})$. The bottom block consists of a rectangular area joined with the remainder of the density starting from a point $r$ (Figure 2.2). All the blocks have equal area $v$, given by

$$v = x_i \left[ f(x_{i-1}) - f(x_i) \right] = r f(r) + \int_r^{\infty} f(x)\, dx. \tag{2.15}$$

Figure 2.2 Partitioning of a Gaussian distribution with the Ziggurat method.

The Ziggurat algorithm, based on Doornik (2005), for generating normally distributed random variates by partitioning the normal density distribution and applying the acceptance-rejection method, is given as follows:

 Input: $\{x_i\}_{i=0}^{n-1}$

 Output: normal random number $x$.

1. Draw an index $i$ of a rectangle ($0 \le i \le n-1$) at random with probability $1/n$.

2. Draw a random number $x$ from rectangle $i$ as $x = U_0 x_i$, where $U_0$ is a uniform random number drawn from a uniform distribution $U(0, 1)$.

3. Rectangular: IF $i \ge 1$ and $x < x_{i-1}$, accept $x$.

4. Tail: IF $i = 0$, accept $x$ from the tail.

5. Wedge: ELSE IF $i > 0$ and $U_1 [f(x_{i-1}) - f(x_i)] < f(x) - f(x_i)$, accept $x$, where $U_1$ is a uniform random number.

6. RETURN to step 1

where, in step 4, the normal random numbers from the tail of the distribution are generated as follows:

 DO

 Generate i.i.d. uniform (0, 1) variates $u_0$ and $u_1$

 $x \leftarrow -\ln(u_0)/r$, $y \leftarrow -\ln(u_1)$

 WHILE $(y + y) < x^2$

 RETURN $x > 0\ ?\ (r + x) : -(r + x)$

The Ziggurat algorithm requires a table of the $x_i$ points and their corresponding function values $f_i$. The number of rectangular blocks $n$ is normally chosen as a power of 2 (64, 128, 256, …), and for $n = 128$, a value of $r = 3.442619855899$ is used to determine the $x_i$ points required in the hardware realization of the algorithm (Marsaglia and Tsang, 2000). The uniform random numbers $U_0$ and $U_1$ required by the Ziggurat algorithm are generated with a Tausworthe uniform random number generator (URNG). For evaluating the exponential and natural logarithmic functions in the wedge (step 5) and tail (step 4) regions, respectively, a CORDIC algorithm was used.
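The following Python sketch models the Ziggurat generator above at a behavioral level. It follows the table-construction conventions of Marsaglia and Tsang (2000), including their handling of the base strip at index 0 (which the listing above abbreviates as "accept x from the tail"); the block-area constant v is taken from that paper, and math.exp stands in for the CORDIC evaluations. It is a software model, not the hardware design.

import math
import random

N = 128
R = 3.442619855899           # tail cut-off r for n = 128
V = 9.91256303526217e-3      # common block area v (Marsaglia and Tsang, 2000)

def f(x):                    # unnormalized normal density exp(-x^2 / 2)
    return math.exp(-0.5 * x * x)

def f_inv(y):                # inverse of f for x >= 0
    return math.sqrt(-2.0 * math.log(y))

# Table of x_i coordinates built downward via Eq. (2.15):
# f(x_{i-1}) = f(x_i) + v / x_i.
x = [0.0] * N
x[N - 1] = R
for i in range(N - 1, 1, -1):
    x[i - 1] = f_inv(V / x[i] + f(x[i]))
# x[0] stays 0 so that f(x[0]) = 1 is the density's peak; the base strip
# (index 0) has width v / f(r) and carries the tail region.
BASE_W = V / f(R)

def tail_sample():
    # Marsaglia's tail method (the DO/WHILE loop in the text)
    while True:
        u0, u1 = 1.0 - random.random(), 1.0 - random.random()
        xt = -math.log(u0) / R
        y = -math.log(u1)
        if y + y >= xt * xt:
            return R + xt

def ziggurat():
    while True:
        i = random.randrange(N)              # step 1: random block index
        u0 = random.random()
        if i == 0:                            # base strip: rectangle or tail
            xc = u0 * BASE_W
            if xc >= R:
                xc = tail_sample()
            break
        xc = u0 * x[i]                        # step 2: candidate point
        if xc < x[i - 1]:                     # step 3: rectangular region
            break
        u1 = random.random()                  # step 5: wedge test
        if u1 * (f(x[i - 1]) - f(x[i])) < f(xc) - f(x[i]):
            break
    return xc if random.getrandbits(1) else -xc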

5 Hardware implementation

This section describes the hardware designs for the PF acceleration techniques presented in Section 4 for the Grid-based Fast SLAM application. The hardware design for the CORDIC acceleration technique is based on a CORDIC hardware module (shown in Figure 2.3) composed of three hardware computational submodules: the CORDIC core, prescaling, and postscaling. The implementation of these submodules is based on the unified CORDIC iteration equations given in Section 4 for the CORDIC-core submodule, and on the identity equations given in Table 2.2 for the prescaling and postscaling submodules. The identity equations are used by the prescaling and postscaling submodules to extend the range of the input arguments when evaluating the different functions. Evaluating all the functions with a single CORDIC module avoids instantiating multiple copies of the module and consequently saves hardware resources. This is achieved by providing a run-time configuration to the CORDIC module through its Config port, which sets the module to a specific function evaluation.

Figure 2.3 Architecture of the CORDIC module.

A serial CORDIC algorithm with 24-bit accuracy is implemented in the CORDIC core shown in Figure 2.3. At the input/output interfaces of the CORDIC module shown in Figure 2.3, one of the two possible inputs, U0 and U1, is evaluated for a specific function at a time. The configuration for a specific function evaluation is provided by the configuration bits at the Config input port; depending on these bits, a specific function is evaluated and the result is provided at the output interface f(). The functions f() that can be obtained at the output are the sin(), cos(), exp(), and ln() functions.

The hardware architecture for the generation of both uniform and Gaussian random numbers is given in Figure 2.4. The three hardware submodules involved in Figure 2.4 are the Ziggurat NRNG submodule, for the generation of normally distributed random numbers; the Tausworthe submodule, for the generation of the two uniform random numbers (U0 and U1) used in the Ziggurat NRNG submodule and in the resampling step of the PF; and the CORDIC submodule, for evaluating the complex functions required by the Ziggurat NRNG submodule. The CORDIC submodule implements the architecture presented in Figure 2.3. In the hardware architecture of the Ziggurat NRNG submodule, independently working hardware blocks generate normal random variates from the rectangular, tail, and wedge regions. The computation of the exponential and natural logarithmic functions, needed when normal variates are generated from the wedge and tail regions of the distribution, respectively, is performed with the CORDIC algorithm.

Figure 2.4 Hardware architecture for the Gaussian and uniform random number generators.

To further speed up the normal random number generation process in the Ziggurat NRNG submodule, an effective mechanism for simultaneous access to the coefficients $x_i$ and $f_i$ is required. This is achieved by dividing the random index $i$ into even and odd values, and storing the respective $x_i$ and $f_i$ in separate memories (Figure 2.5). When the generated index $i$ is odd, $x_i$ and $x_{i-1}$ are read simultaneously from the odd and even memories at the same memory index position. When the generated index is even, $x_i$ is read from the even memory and $x_{i-1}$ is read in parallel from the adjacent position of the odd memory. As a result, parallel access to the coefficients is achieved.

Figure 2.5 Illustration of the storage of the $x_i$ and $f_i$ coefficients in memory (MEM) and the parallel access of $x_i$, $x_{i-1}$, $f_i$, and $f_{i-1}$ for $n = 128$.
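A behavioral Python sketch of this even/odd banking scheme is shown below. The function names and the direction of the adjacent-address access are illustrative assumptions, since the exact address mapping depends on the memory layout in Figure 2.5; the point is that $x_i$ and $x_{i-1}$ never live in the same bank, so one read per bank yields both values in the same cycle.

def split_banks(table):
    # Split a coefficient table into even/odd banks as in Figure 2.5.
    return table[0::2], table[1::2]          # (x_0, x_2, ...), (x_1, x_3, ...)

def read_pair(even, odd, i):
    # Return (x_i, x_{i-1}) using one read per bank; valid for i >= 1.
    if i % 2:                                # odd i: same address in both banks
        return odd[i // 2], even[i // 2]
    return even[i // 2], odd[i // 2 - 1]     # even i: adjacent odd-bank address

# quick self-check against a flat table
xs = list(range(16))
assert all(read_pair(*split_banks(xs), i) == (xs[i], xs[i - 1])
           for i in range(1, 16))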

6 Hardware/software architecture

The global architecture of the proposed system for the PF in the Grid-based Fast SLAM algorithm is given in Figure 2.6. It comprises an embedded MicroBlaze processor responsible for the execution of the software functions; the PF hardware accelerator (PF HW accelerator); and timer and UART cores that support the analysis and verification of the system. The PF HW accelerator contains the CORDIC and random number generator cores explained in Section 5, and they are connected to the MicroBlaze soft-core processor through a dedicated one-to-one communication bus (Fast Simplex Link) for fast streaming of data.

Figure 2.6 Global architecture of the hardware/software co-design system for PF.

To test the Grid-based Fast SLAM algorithm, real odometry and laser data collected from a mobile robot platform are interfaced through random access memory (RAM) in order to synchronize the data for processing. The odometry data are used in the sampling step to generate new particle instances, and the laser data are used in the importance weight step for the evaluation of the particle weights. The particles and their associated weights are buffered in Block RAM for fast access. Individual CORDIC hardware modules are assigned to evaluate the functions in the sampling and importance weight steps. Uniform and Gaussian random numbers are provided to the resampling and sampling steps of the PF, respectively, by the random number generator module.

7 Results and discussion

The architecture shown in Figure 2.6 is implemented on a Xilinx Kintex-7 KC705 FPGA device running at 100 MHz. The hardware modules are written in the VHSIC Hardware Description Language (VHDL). For the implementation of the Ziggurat NRNG module, a value of $n = 128$ is used, and the $x_i$ and $f_i$ coefficients are represented as fixed-point numbers in Q8:24 format (i.e., 8 bits for the integer part and 24 bits for the fractional part). The same fixed-point format is used to represent the data in the CORDIC module design. The variables in the software part of the algorithm use a 32-bit floating-point representation, with the floating-point unit (FPU) of the MicroBlaze processor enabled.
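For reference, the following small Python helpers sketch the Q8:24 encoding used for the coefficients; the function names are illustrative.

FRAC_BITS = 24                    # Q8:24 -> 8 integer bits, 24 fractional bits

def to_q8_24(value):
    # Encode a float into the signed Q8:24 integer representation.
    return int(round(value * (1 << FRAC_BITS)))

def from_q8_24(raw):
    # Decode a Q8:24 integer back to a float.
    return raw / float(1 << FRAC_BITS)

# the Ziggurat constant r survives the round trip to within 2^-24
r = 3.442619855899
assert abs(from_q8_24(to_q8_24(r)) - r) <= 2.0 ** -24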

Figure 2.7 summarizes the execution time, given as the number of clock cycles, for each step of the PF in the Grid-based Fast SLAM algorithm. The results show that the hardware acceleration reduces the execution time of the PF in all three steps of the algorithm. In particular, a significant speed-up is achieved in the sampling step, which can be attributed to the fast generation of Gaussian random numbers by the random number generator hardware module. Compared to the other steps, the importance weight step shows a relatively higher cycle count, as shown in Figure 2.7. This is because the evaluation of each particle's weight first requires transforming every laser scan measurement point (range and bearing angle) from the robot frame of reference to the global frame of reference, and typically hundreds of laser scan points have to be evaluated per sensor reading; this requires the evaluation of the sine and cosine functions for every scan point. Next comes a search for the closest point between every laser scan end point and the occupied points in the map, which is a computationally intensive step. Although this step is currently implemented in software on the embedded MicroBlaze processor, it is intended to be accelerated in hardware as well. Furthermore, the computation of an exponential function is required in the calculation of each particle's weight for every laser scan point. These are the main reasons for the relatively large cycle counts obtained in the importance weight step.
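As a concrete illustration of the frame transformation described above, the following Python fragment shows the per-scan-point use of sine and cosine that dominates this step; the pose and scan tuple layouts are illustrative assumptions.

import math

def scan_to_global(pose, scan):
    # Transform (range, bearing) laser returns from the robot frame into
    # global (x, y) points: one sin and one cos per scan point.
    px, py, theta = pose
    return [(px + r * math.cos(theta + b),
             py + r * math.sin(theta + b))
            for r, b in scan]

# e.g., a robot at (1, 2) heading 90 degrees, one return 2 m straight ahead
pts = scan_to_global((1.0, 2.0, math.pi / 2), [(2.0, 0.0)])   # ~ [(1.0, 4.0)]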

Figure 2.7 Comparison of execution times for the sampling (S), importance weight (I), and resampling (R) steps of the embedded software (SW) implementation versus the hardware/software (HW/SW) embedded implementation, and the corresponding speedup.

8 Conclusions

This chapter presented techniques aimed at minimizing the huge computational demand of PFs applied to a Grid-based Fast SLAM application. After an initial identification of the computational bottlenecks in each step of the algorithm, techniques were proposed to tackle those bottlenecks. Based on the proposed techniques, and following a hardware/software codesign approach, hardware acceleration blocks were designed and implemented to reduce the computation time. The presented approach leads to speedups of 140×, 14.87×, and 19.36× in the sampling, importance weight, and resampling steps, respectively (Figure 2.7). Owing to the flexibility of the hardware/software codesign, the approach presented here can be readily adapted to accelerate the computation time of other real-time particle filtering applications.

References

Arulampalam MS, Maskell S, Gordon N, Clapp T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 2002;50(2):174–188.

Atia MM, Georgy J, Korenberg MJ, Noureldin A. Real-time implementation of mixture particle filter for 3D RISS/GPS integrated navigation solution. Electron. Lett. 2010;46(15):1083–1084.

Atia MM, Korenberg MJ, Noureldin A. Particle-Filter-Based WiFi-Aided Reduced Inertial Sensors Navigation System for Indoor and GPS-Denied Environments. International Journal of Navigation and Observation. 2012;2012:doi:10.1155/2012/753206 Article ID 753206, 12 pages.

Aulinas J, Petillot Y, Salvi J, Lladò X. The slam problem: a survey. In: Proceedings of the 2008 conference on Artificial Intelligence Research and Development; 2008:363–371.

Boudabous A, Ghozzi F, Kharrat MW, Masmoudi N. Implementation of hyperbolic functions using CORDIC algorithm. In: Microelectronics, 2004. ICM 2004 Proceedings. The 16th International Conference; 6–8 Dec. 2004:738–741.

Chau TCP, Niu X, Eele A, Luk W, Cheung PYK, Maciejowski JM. Heterogeneous reconfigurable system for adaptive particle filters in real-time applications. In: Proc. Int. Conf. on Reconfigurable Computing: Architectures, Tools and Applications (ARC); Springer; Mar. 2013:1–12.

Chau TCP, Luk W, Cheung PYK, Eele A, Maciejowski J. Adaptive Sequential Monte Carlo approach for real-time applications. In: Field Programmable Logic and Applications (FPL), 2012 22nd International Conference; 29–31 Aug. 2012:527–530.

Chen Z, Samarabandu J, Rodrigo R. Recent advances in simultaneous localization and map-building using computer vision. Adv. Robot. 2007;21(3–4):233–265.

Doornik JA. An Improved Ziggurat Method to Generate Normal Random Samples. University of Oxford; 2005.

Douc R, Cappe O. Comparison of resampling schemes for particle filtering. In: Image and Signal Processing and Analysis, 2005. ISPA 2005. Proceedings of the 4th International Symposium; 15–17 Sept. 2005:64–69.

L'Ecuyer P. Maximally equidistributed combined Tausworthe generators. Math. Comput. 1996;65(213):203–213.

El-Halym HAA, Mahmoud I, Habib S. Proposed hardware architectures of particle filter for object tracking. EURASIP J. Adv. Signal Process. 2012;2012(17).

Fross D, Langer J, Fross A, Rössler M, Heinkel U. Hardware implementation of a Particle Filter for location estimation. In: Indoor Positioning and Indoor Navigation (IPIN), 2010 International Conference; 15–17 Sept. 2010:1–6.

Georgy J, Noureldin A, Goodall C. Vehicle navigator using a mixture particle filter for inertial sensors/odometer/map data/GPS integration. IEEE Trans. Consum. Electron. 2012;58(2):544–552.

Gokdemir M, Vikalo H. A particle filtering algorithm for parameter estimation in real-time biosensor arrays. In: Genomic Signal Processing and Statistics, 2009. GENSIPS 2009. IEEE International Workshop on. 17–21 May 2009:1–4.

Gordon NJ, Salmond DJ, Smith AFM. Novel approach to nonlinear/non-Gaussian Bayesian state estimation. IEE Proc. F. 1993;140(2):107–113.

Grisetti G, Stachniss C, Burgard W. Improved techniques for grid mapping with Rao-Blackwellized particle filters. IEEE Trans. Robot. 2007;23(1):34–46.

Gustafsson F. Particle filter theory and practice with positioning applications. IEEE Aero. Electron. Syst. Mag. 2010;25(7):53–82.

Happe M, et al. A self-adaptive heterogeneous multi-core architecture for embedded real-time video object tracking. J. Real Time Image Process. 2011.

Lakshmi B, Dhar AS. CORDIC architectures: a survey. VLSI Des. 2010;2010: Article ID 794891, 19 pages.

Li S-A, Hsu C-C, Lin W-L, Wang J-P. Hardware/software co-design of particle filter and its application in object tracking. In: System Science and Engineering (ICSSE), 2011 International Conference; 8–10 June 2011:87–91.

Liu X, Niranjan M. State and parameter estimation of the heat shock response system using Kalman and particle filters, Bioinformatics. 2012;28:1501–1507.

Marsaglia G, Tsang WW. The ziggurat method for generating random variables. J. Stat. Soft. 2000;5:1–7.

Medeiros H, Park J, Kak A. A parallel color-based particle filter for object tracking. In: Computer Vision and Pattern Recognition Workshops, 2008. CVPRW ‘08. IEEE Computer Society Conference on; 23–28 June 2008:1–8.

Meinhold RJ, Singpurwalla ND. Robustification of Kalman filter models. J. Am. Stat. Assoc. 1989;84:479–486.

Nakamura K, Yoshida R, Nagasaki M, Miyano S, Higuchi T. Parameter estimation of in silico biological pathways with particle filtering towards a petascale computing. Pac. Symp. Biocomput. 2009;14:227–238.

Nordlund P-J, Gustafsson F. Sequential Monte Carlo filtering techniques applied to integrated navigation systems. In: Proceedings of the 2001 American Control Conference; 2001; vol. 6:4375–4380.

Ristic B, Arulampalam S, Gordon NJ. Beyond the Kalman Filter: Particle Filters for Tracking Applications. Norwood, MA: Artech House Publishers; 2004.

Saha S, Bambha NK, Bhattacharyya SS. Design and implementation of embedded computer vision system based on particle filters. Comput. Vis. Image Understand. 2010;114(11):1203–1214.

Sileshi BG, Ferrer C, Oliver J. Hardware/software co-design of particle filter in grid based Fast-SLAM algorithm. In: International Conference on Embedded Systems and Applications (ESA), 2014 Conference; 24–27 Jul. 2014.

Sileshi BG, Ferrer C, Oliver J. Particle filters and resampling techniques: Importance in computational complexity analysis. In: Design and Architectures for Signal and Image Processing (DASIP), 2013 Conference; 8–10 Oct. 2013:319–325.

Thrun S. Robotic mapping: a survey. In: Lakemeyer G, Nebel B, eds. Exploring Artificial Intelligence in the New Millenium. San Mateo, CA: Morgan Kaufmann; 2002:1–35.

Thrun S, Burgard W, Fox D. Probabilistic Robotics (Intelligent Robotics and Autonomous Agents series). Intelligent Robotics and Autonomous Agents. The MIT Press; August 2005.

Volder JE. The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. Sept 1959;330–334.

Walther JS. A unified algorithm for elementary functions. In: Proceedings of the Spring Joint Computer Conference; 1971:379–385.

Yang J, Kadirkamanathan V, Billings SA. In vivo intracellular metabolite dynamics estimation by sequential Monte Carlo filter. In: IEEE Symposium on Computational Intelligence, Bioinformatics and Computational Biology; 2007:387–394.

Ye B, Zhang Y. Improved FPGA implementation of particle filter for radar tracking applications. In: Synthetic Aperture Radar, 2009. APSAR 2009. 2nd Asian-Pacific Conference; 26–30 Oct. 2009:943–946.
