B.G. Sileshi1 [email protected]; C. Ferrer1,2 [email protected]; J. Oliver1 [email protected] 1 Departament de Microelectrònica i Sistemes Electrònics, Universitat Autònoma de Barcelona, Bellaterra (Barcelona), Spain
2 Institut de Microelectrònica de Barcelona (CNM-CSIC), Bellaterra (Barcelona), Spain
Particle filters (PFs) are Bayesian estimation algorithms with attractive theoretical properties for addressing a wide range of complex applications that are nonlinear and non-Gaussian. However, they carry a huge computational demand that has limited their application in most real-time systems. To address this drawback, this chapter presents PF acceleration techniques based on a hardware/software codesign approach for the grid-based fast simultaneous localization and mapping (SLAM) application. After an initial identification of the computationally intensive steps of the algorithm, techniques are proposed to accelerate the bottleneck steps. Based on the proposed acceleration techniques, hardware blocks are designed to speed up the computation and are interfaced with a central MicroBlaze soft-core processor. The proposed hardware/software implementation results in an improvement in the overall execution time of the algorithm.
Estimating the state of a dynamic system from a set of observations that arrive sequentially in time and are corrupted by noise arises in many applications. For systems where the amount of nonlinearity is limited, the parametric Kalman filters (in their extended or unscented versions) can be applied. However, for systems with strongly nonlinear or non-Gaussian dynamics, the Kalman filters were found to perform poorly (Meinhold and Singpurwalla, 1989; Ristic et al., 2004). On the other hand, a nonparametric sequential Bayesian estimation technique called the particle filter (PF) gives accurate estimates under such circumstances (Gordon et al., 1993; Arulampalam et al., 2002) and has been used in applications such as navigation and positioning (Gustafsson, 2010; Nordlund and Gustafsson, 2001; Atia et al., 2010, 2012; Georgy et al., 2012), tracking (Happe et al., 2011; Medeiros et al., 2008), and robotics (Grisetti et al., 2007). There has also been extensive research in recent years on the application of PFs to estimate the states and parameters of biological systems. For example, several studies (Nakamura et al., 2009; Yang et al., 2007; Gokdemir and Vikalo, 2009; Liu and Niranjan, 2012) demonstrate the applicability of PFs to the state estimation of biological models by considering biological systems as dynamical, nonlinear, and generally partially observed processes.
PFs have the flexibility of efficiently representing a wide range of probability densities using sets of points called particles. Such representational power of PFs, however, comes at the cost of a great computational complexity that has so far limited their application in different types of real-life problems. It is in this sense that hardware- or hardware/software-based implementations of PFs are essential to accelerate the computational time and achieve the objective of real-time computation. With regard to this, the research literature tackles the real-time constraint of the PFs through hardware implementation platforms such as field-programmable gate arrays (FPGAs). Interesting features of FPGAs, such as their ability to introduce massive parallelization and their low power consumption, which is critical for many applications such as navigation, make them the primary choice for the implementation of PFs in real-life applications.
The hardware implementation of PFs on FPGA platforms has been proposed by several researchers. For example, Chau et al. (2012) studied the implementation of PFs on an FPGA for a mobile robot localization problem, where the size of the particle set is adapted to reduce the run-time computational complexity of the algorithm. Fross et al. (2010) present an FPGA hardware implementation of a PF for location estimation and compare it with a software solution running on an ARM7-based microcontroller. A more recent study (Chau et al., 2013) presents a heterogeneous reconfigurable system consisting of an FPGA and a central processing unit (CPU) for a simultaneous mobile robot localization and people-tracking application. In that study, the number of particles is adapted dynamically, and the run-time reconfigurability of the FPGA is exploited to reduce power and energy consumption; according to their results, a speedup of up to 7.39 times is achieved compared to a software implementation on an Intel Xeon X5650 CPU. Li et al. (2011) presented a hardware/software codesign approach based on a system-on-a-chip technique that uses a NIOS II processor to calculate the weight of each particle and a hardware implementation to update the particles. They claim that their proposed method significantly improves efficiency and, due to the software implementation of the importance weight step, provides flexibility in the design step for various applications. Ye and Zhang (2009) presented a hardware architecture for an FPGA implementation of the sampling importance resampling (SIR) PF; the architecture is simulated with ModelSim, and its real-time performance is evaluated using a universal asynchronous receiver/transmitter (UART) for the input sensor data. Saha et al. (2010) presented a system-on-chip architecture involving both hardware and software components for a class of PFs.
Their parameterized framework enables the reuse of the architecture with minimal redesign for a wide range of particle filtering applications. El-Halym et al. (2012) propose three different hardware architectures and suggest the use of a piecewise linear function instead of the classical exponential function in order to decrease the complexity of the hardware architecture in the weighting step.
With the objective of reducing the high computational requirements of the PF, this chapter presents PF acceleration techniques based on a hardware/software codesign approach, with the following specific contributions:
• As the underlying computations in particle filtering vary from one application to the next, a generic approach is presented for the analysis, design, and speed-up of the PF computational steps. The same approach can be used in the development of other application-specific PF implementations.
• The embedded hardware/software implementation of the PF focuses on a grid-based fast simultaneous localization and mapping (SLAM) algorithm. In order to increase the throughput of the algorithm, two main mechanisms are introduced in the algorithm acceleration:
• A fast hardware Gaussian random number generator design and its application to the PF for the Grid-based Fast SLAM algorithm are presented.
• The design and implementation of a custom COordinate Rotation DIgital Computer (CORDIC) hardware module for the evaluation of the complex mathematical operations involved in several steps of the PF algorithm are presented.
This chapter is an extended version of the work presented by Sileshi et al. (2014), and it is organized as follows: Section 2 first introduces the theory behind the PF algorithm, followed by a discussion of its application to the SLAM algorithm called Grid-based Fast SLAM. Section 3 discusses the identification of the computational bottlenecks of the PF algorithm and the partitioning of the different steps between hardware and software implementations. Section 4 explains the PF acceleration techniques, and Section 5 is dedicated to the hardware architecture design of the proposed acceleration techniques. Section 6 discusses the proposed hardware/software architecture of the PF, and Section 7 presents the implementation results. Finally, Section 8 concludes the chapter.
The problem of nonlinear filtering is defined by a state space representation of a dynamic system given by a discrete-time stochastic model of the state and measurement vectors, with dimensions d_x and d_z, respectively:

x_t = f_t(x_{t−1}, w_{t−1})  (2.1)

z_t = h_t(x_t, v_t)  (2.2)

where f_t and h_t are possibly nonlinear functions defining the state evolution and the measurement z_t, respectively; w_t and v_t are the process and measurement noise sequences, respectively; and t is the time index. The process and measurement noise sequences are assumed to be white, with known probability density functions, and mutually independent.
In the Bayesian estimation framework, the estimate of the state vector x_t at time t is obtained from the noisy measurements z_t, where the initial state is assumed to have a known probability density function (pdf) p(x_0) that is also independent of the noise sequences. The estimation of the unknown state x_t is based on the sequence of all available measurements up to time t. The posterior distribution is calculated recursively with the following prediction [Eq. (2.3)] and update [Eq. (2.4)] equations (Arulampalam et al., 2002):

p(x_t | z_{1:t−1}) = ∫ p(x_t | x_{t−1}) p(x_{t−1} | z_{1:t−1}) dx_{t−1}  (2.3)

p(x_t | z_{1:t}) = p(z_t | x_t) p(x_t | z_{1:t−1}) / p(z_t | z_{1:t−1})  (2.4)
where the probabilistic model of the state evolution p(x_t | x_{t−1}) is defined by the system equation [Eq. (2.1)], the likelihood p(z_t | x_t) is defined by the measurement equation [Eq. (2.2)], and p(z_t | z_{1:t−1}) is a normalizing constant given by

p(z_t | z_{1:t−1}) = ∫ p(z_t | x_t) p(x_t | z_{1:t−1}) dx_t  (2.5)
When the assumptions of linearity and Gaussianity are satisfied, the optimal analytic solution to the integrals given by Eqs. (2.3) and (2.4) is obtained with the Kalman filtering method. However, it is often difficult to find an analytical solution to these integrals in nonlinear/non-Gaussian estimation problems, and the Sequential Monte Carlo (SMC) method is used to approximate them.
The PF is an SMC-based approach that approximates a pdf by a discrete set of particles and their associated weights {x_t^i, w_t^i}_{i=1}^N, where i and N denote the index of the particles and the total number of particles, respectively; x_t^i ∈ R^{d_x} and w_t^i denote the state of particle i and its weight, respectively; and d_x is the dimension of the state. The posterior pdf of the state vector is approximated by (Ristic et al., 2004)

p(x_t | z_{1:t}) ≈ Σ_{i=1}^{N} w_t^i δ(x_t − x_t^i)  (2.6)
where δ(·) is a Dirac delta function and w_t^i is the importance weight, given by

w_t^i ∝ w_{t−1}^i p(z_t | x_t^i) p(x_t^i | x_{t−1}^i) / q(x_t^i | x_{t−1}^i, z_t)  (2.7)

where q(·) is the importance density from which the particles are drawn.
In particle filtering, the particle set at time t is recursively propagated from the set at the previous time step in such a way that it provides an approximation of the posterior distribution at each iteration. The initial particle set S_0 is obtained from samples drawn from the prior density p(x_0). In this approach, a major problem called the particle degeneracy phenomenon arises: after a certain number of iterations, a small number of normalized importance weights tend to a value close to one, while the remaining weights become negligible. This is undesirable, as only a few particles contribute to the approximation of the posterior density and much computation is wasted on particles with negligible weight. A common measure of the degeneracy of the algorithm is the effective sample size N_eff, given by (Arulampalam et al., 2002)

N_eff = 1 / Σ_{i=1}^{N} (w_t^i)^2  (2.8)
The degeneracy problem can be reduced by using a very large number of samples N but, since that approach is impractical, other techniques are commonly used. These include a good choice of the importance density and the use of resampling (Arulampalam et al., 2002). The prior density is commonly chosen as the importance density. In resampling, particles are resampled N times with replacement from the discrete approximation of the posterior density; particles with low importance weights are eliminated and particles with high importance weights are replicated. This step is performed whenever N_eff falls below a predefined threshold N_T. Although the resampling step reduces the degeneracy problem, it introduces two limitations: it restricts the parallelization of the algorithm, and it causes a loss of diversity among the particles. The loss of diversity may lead all N particles to occupy the same point in the state space, giving a poor representation of the posterior density.
The basic steps of the PF discussed so far, which correspond to the SIR PF, are given as follows:
Input: S_{t−1} = {x_{t−1}^i, w_{t−1}^i}_{i=1}^N and the measurement z_t
Output: S_t = {x_t^i, w_t^i}_{i=1}^N
FOR i = 1, …, N
Draw x_t^i ~ p(x_t | x_{t−1}^i)
Calculate the nonnormalized importance weights: w̃_t^i = w_{t−1}^i p(z_t | x_t^i)
END FOR
Calculate sum of the weights: W = Σ_{i=1}^{N} w̃_t^i
FOR i = 1, …, N
Normalize: w_t^i = w̃_t^i / W
END FOR
Calculate N_eff using Eq. (2.8)
IF N_eff < N_T
RESAMPLE
END IF
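The SIR steps listed above can be sketched in software as follows (a minimal one-dimensional Python illustration with Gaussian noise models; the state functions f and h, the noise parameters, and the systematic resampler are illustrative stand-ins, not the chapter's hardware design):

```python
import numpy as np

def sir_step(particles, weights, f, h, z, q_std, r_std, n_threshold, rng):
    """One SIR iteration: sample, weight, normalize, and resample if needed."""
    n = len(particles)
    # Sampling: propagate each particle through the (possibly nonlinear)
    # state model with additive Gaussian process noise.
    particles = f(particles) + rng.normal(0.0, q_std, size=n)
    # Importance weighting: likelihood of measurement z under a Gaussian
    # measurement-noise model with standard deviation r_std.
    innov = z - h(particles)
    w = weights * np.exp(-0.5 * (innov / r_std) ** 2)
    w = w / np.sum(w)                       # normalize so the weights sum to 1
    # Degeneracy check via the effective sample size N_eff = 1 / sum(w^2).
    n_eff = 1.0 / np.sum(w ** 2)
    if n_eff < n_threshold:
        # Systematic resampling: replicate heavy particles, drop light ones.
        positions = (rng.random() + np.arange(n)) / n
        idx = np.minimum(np.searchsorted(np.cumsum(w), positions), n - 1)
        particles = particles[idx]
        w = np.full(n, 1.0 / n)             # weights reset to uniform
    return particles, w
```

After resampling, the weights are uniform, which is why the nonnormalized weight in the loop above reduces to the likelihood alone for the SIR variant discussed next.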
SIR is one of the most commonly used variants of the basic particle filtering method known as the sequential importance sampling (SIS) PF, where the prior density p(x_t | x_{t−1}) is used as the importance density and the weights of the particles are given by w_t^i ∝ w_{t−1}^i p(z_t | x_t^i).
SLAM involves the problem of localizing a robot and building a map of a given environment simultaneously (Chen et al., 2007). It is usually described with a joint posterior probability density distribution [Eq. (2.9)] of the map (m) and the robot states (x_t) at time t, given the observations (z_{0:t}) and control inputs (u_{0:t}) up to and including time t, together with the initial state of the robot (x_0):

p(x_t, m | z_{0:t}, u_{0:t}, x_0)  (2.9)
To compute the conditional probability given by Eq. (2.9) for all times t, SLAM uses a recursive solution based on the Bayes theorem, starting with an estimate for the distribution at time t − 1 and incorporating a control u_t and observation z_t. Therefore, the SLAM algorithm is implemented with the two standard steps of the recursive prediction and measurement update process. The recursion is a function of the robot motion model p(x_t | x_{t−1}, u_t) and an observation model p(z_t | x_t, m).
The prediction is obtained from the following equation:

p(x_t, m | z_{0:t−1}, u_{0:t}, x_0) = ∫ p(x_t | x_{t−1}, u_t) p(x_{t−1}, m | z_{0:t−1}, u_{0:t−1}, x_0) dx_{t−1}  (2.10)
and the measurement update is calculated using

p(x_t, m | z_{0:t}, u_{0:t}, x_0) = p(z_t | x_t, m) p(x_t, m | z_{0:t−1}, u_{0:t}, x_0) / p(z_t | z_{0:t−1}, u_{0:t})  (2.11)
The solution to a probabilistic SLAM problem involves an efficient and consistent computation of the prediction and update equations. Based on the different estimation frameworks used to perform these computations, there are different variants of the SLAM algorithm (Thrun, 2002; Chen et al., 2007; Aulinas et al., 2008). The Grid-based Fast SLAM algorithm is a variant that applies particle filtering methods to solve the SLAM problem (Thrun et al., 2005). It incrementally processes the observations and odometry readings as they become available and updates the set of samples representing the posterior over the map and the trajectory of the robot. The steps of Fast SLAM based on the PF are summarized as follows:
Initialization: At time t = 0, sample the initial particle set from the initial density function p(x_0).
Sampling: Based on the state of the system at time t − 1, represented by the particle set S_{t−1}, and the observation z_t, propose a set of new particles x_t^i from the proposal function q(x_t | x_{t−1}, z_t). The motion model of the robot is often used as the proposal.
Importance weight: An individual weight w_t^i is assigned to each proposed particle x_t^i. The weight of each particle is proportional to the likelihood of the most recent observation, given the map m associated with the particle and the corresponding pose x_t.
Resampling: This step involves drawing a new set of particles from the current set according to the weights of the particles; i.e., particles with larger weights are replicated, while particles with smaller weights are discarded. The resulting particle set is then used in the next time step to predict the posterior probability. Resampling helps to improve the quality of the estimation by reducing the variance of the particles. Different resampling algorithms (Douc and Cappe, 2005) can be used in this step, as they are not application specific. The resampling method adopted in this study is the Independent Metropolis-Hastings (IMH) resampling algorithm, which has a lower computational complexity than conventional resampling schemes such as systematic resampling (Sileshi et al., 2013) and is therefore preferred in this work.
Map estimation: For each particle, the corresponding map estimate is updated based on the observation and the pose represented by that particle.
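As a software illustration of the accept/reject rule used by the IMH resampler, consider the following Python sketch (a minimal chain-over-indices rendering of the method; the function name and loop structure are ours, and the hardware version draws its uniform numbers from the Tausworthe generator described later):

```python
import random

def imh_resample(particles, weights, rng):
    """Independent Metropolis-Hastings resampling sketch: each output
    sample is the state of a Markov chain over particle indices.  A new
    index is proposed uniformly at random and accepted with probability
    min(1, w_proposed / w_current), so heavy particles are replicated."""
    n = len(particles)
    out = []
    k = 0                                  # current chain state (an index)
    for _ in range(n):
        c = rng.randrange(n)               # uniform proposal over indices
        # Accept if U * w[k] <= w[c], i.e., with probability w[c] / w[k].
        if rng.random() * weights[k] <= weights[c]:
            k = c
        out.append(particles[k])
    return out
```

Unlike systematic resampling, this rule needs no cumulative sum of the weights, only pairwise weight comparisons, which is one reason it maps well to hardware.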
In tackling the speed-up of the overall computation in PFs, a preliminary identification of the critical bottlenecks of the algorithm is crucial for designing hardware modules that accelerate the computationally intensive steps. The study is based on the Grid-based Fast SLAM algorithm discussed in section 2. For an embedded implementation of the algorithm on the FPGA platform, an accurate analysis of the timing required by each step of the PF is obtained in order to identify its computational bottlenecks.
Figure 2.1 summarizes the critical computational bottlenecks obtained from this preliminary study. In the sampling step, the computation of the sine and cosine functions and the Gaussian random number generator (GRNG) account for 45.91% and 53.62% of its execution time, respectively. In the importance weight step, the computation of the sine and cosine functions and of the exponential function contributes 75.34% and 7.65% of the execution time, respectively. For the resampling step, the generation of uniform random numbers accounts for most of the execution time (60.71%). For each step of the PF computations shown in Figure 2.1, “others” corresponds to the remaining computations involved in that step.
The profiling information of the PF given in Figure 2.1 can be used in the hardware/software partitioning; i.e., deciding which parts of the algorithm should be implemented in hardware and which can be kept in software (running on the FPGA’s embedded processor). It is clear from Figure 2.1 that the sine and cosine functions and the random number generation are the critical bottlenecks of the Grid-based Fast SLAM algorithm, thus requiring acceleration with a hardware implementation.
The techniques used to speed up the computational bottleneck steps of the PF in the Grid-based Fast SLAM application are based on the CORDIC algorithm (Volder, 1959) for the evaluation of the sine, cosine, and exponential functions, and on the Ziggurat (Marsaglia and Tsang, 2000) and Tausworthe (L'Ecuyer, 1996) algorithms for the generation of Gaussian and uniform random numbers, respectively. The details of these techniques are provided in section 4, and the respective hardware designs are given in section 5.
CORDIC is an iterative algorithm capable of evaluating many basic arithmetic operations and mathematical functions (Aulinas et al., 2008). It rotates a vector by an angle z in a circular, linear, or hyperbolic coordinate system, depending on the function to be evaluated. Using only shift and add operations, it performs the rotation iteratively through a series of specific incremental rotation angles. It can operate in one of three configurations (circular, linear, or hyperbolic) and two modes (rotation or vectoring). In rotation mode, it performs a general rotation by a given angle z; in vectoring mode, it computes the unknown angle z of a vector by performing a finite number of microrotations.
The unified CORDIC iteration equations for the evaluation of a wide range of functions are given by Eq. (2.12) (Walther, 1971; Lakshmi and Dhar, 2010):

x_{i+1} = x_i − μ·d_i·y_i·2^{−i}
y_{i+1} = y_i + d_i·x_i·2^{−i}
z_{i+1} = z_i − d_i·e_i  (2.12)
where (x, y) defines a vector with angle z; μ selects a circular (μ = 1), linear (μ = 0), or hyperbolic (μ = −1) coordinate system; d_i ∈ {−1, +1} represents a clockwise or counterclockwise direction of rotation, with d_i = sign(z_i) in rotation mode and d_i = −sign(y_i) in vectoring mode; and e_i defines a table of constant values, with e_i = tan^{−1}(2^{−i}) in circular, e_i = 2^{−i} in linear, and e_i = tanh^{−1}(2^{−i}) in hyperbolic configurations. With the unified CORDIC iteration equations, a wide range of functions can be evaluated. However, the discussion here is focused on the functions involved in the PF steps for the Grid-based Fast SLAM application and in the Ziggurat Gaussian random number generator algorithm: the sine, cosine, exponential, and natural logarithm functions. The configurations of the variables x, y, and z, and the set of functions to which the unified CORDIC iteration equations converge after n iterations in rotation and vectoring modes, are given in Table 2.1. In the circular-rotation configuration, the functions f1 and f2 correspond to the sine (sin z) and cosine (cos z) functions, respectively; in the hyperbolic-rotation configuration, f1 and f2 correspond to the hyperbolic sine (sinh z) and cosine (cosh z) functions. In the vectoring-hyperbolic configuration, the function f3 corresponds to tanh^{−1}(y/x).
Table 2.1
Configuration of CORDIC for Evaluation of Different Functions
Mode | Circular (μ = 1) | Hyperbolic (μ = −1) |
Rotation | Sine/Cosine: f1 = sin(z), f2 = cos(z); x, y, and z as input | Exponential: f1 = sinh(z), f2 = cosh(z); x, y, and z as input |
Vectoring | - | Logarithmic: f3 = tanh^{−1}(y/x); x and y as input |
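As an illustration of the circular rotation mode in Table 2.1, the CORDIC iterations can be modeled in software as follows (a floating-point Python sketch for clarity; the hardware core uses 24-bit fixed-point shift-and-add, and the gain constant K and angle table here are the standard CORDIC values, not figures taken from this chapter):

```python
import math

def cordic_sin_cos(z, n_iter=24):
    """Circular rotation-mode CORDIC: rotate the vector (1/K, 0) by the
    angle z through shift-and-add micro-rotations, so that after n_iter
    iterations x -> cos(z) and y -> sin(z).  Valid for |z| within the
    CORDIC convergence range (about +/-1.74 rad); the prescaling
    identities of Table 2.2 extend this to arbitrary arguments."""
    # Table of elementary rotation angles e_i = atan(2^-i) (circular system).
    angles = [math.atan(2.0 ** -i) for i in range(n_iter)]
    # CORDIC gain K = prod(sqrt(1 + 2^-2i)); starting from (1/K, 0)
    # absorbs the gain up front, so no post-scaling multiply is needed.
    k = 1.0
    for i in range(n_iter):
        k *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y = 1.0 / k, 0.0
    for i in range(n_iter):
        d = 1.0 if z >= 0.0 else -1.0      # rotation direction d_i = sign(z_i)
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]                 # drive the residual angle to zero
    return y, x                            # (sin, cos) of the input angle
```

Each iteration only shifts and adds, which is exactly why the hardware core needs no multiplier.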
The evaluation of the sine and cosine functions is obtained directly by applying the properties given in Table 2.1; however, for the evaluation of the exponential and natural logarithm functions, indirect properties are used. The evaluation of the exponential function is obtained by applying the following property:

e^z = sinh(z) + cosh(z)
Similarly, in the case of the natural logarithm function, the property given by the following equation is used for its indirect evaluation:

ln(v) = 2·tanh^{−1}(y/x)

where x = v + 1 and y = v − 1.
As the CORDIC algorithm converges only for a limited range of input arguments, the range of the inputs must be extended for each mode of operation by applying proper prescaling identities. This is achieved by dividing the original input argument to the CORDIC algorithm by a constant to obtain a quotient Q and a remainder D (Boudabous et al., 2004). In the case of the sine and cosine functions, the constant corresponds to π/2, and to log_e 2 for the exponential and logarithmic functions. The prescaling identities for all the required functions are given in Table 2.2.
Table 2.2
Prescaling Identities for Function Evaluations
Identity | Domain |
sin(z) = ±sin(D) or ±cos(D), according to Q mod 4, where z = Q(π/2) + D | 0 ≤ D < π/2 |
exp(v) = 2^Q · exp(D), where v = Q·log_e 2 + D | 0 ≤ D < log_e 2 |
ln(v) = Q·log_e 2 + ln(M), where v = M·2^Q | 1 ≤ M < 2 |
Source: Walther (1971); Lakshmi and Dhar (2010); Boudabous et al. (2004).
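The prescaling identity for the exponential function can be combined with the hyperbolic rotation mode in the same style (again a hedged floating-point Python sketch: the repeated iterations at i = 4 and i = 13 are the standard requirement for hyperbolic CORDIC convergence, and exp(D) is recovered as sinh(D) + cosh(D) per the identity above):

```python
import math

def cordic_exp(v, n_iter=24):
    """exp(v) via prescaling (v = Q*ln2 + D, 0 <= D < ln2) followed by
    hyperbolic rotation-mode CORDIC, which yields cosh(D) and sinh(D)."""
    ln2 = math.log(2.0)
    q = math.floor(v / ln2)                # quotient Q
    d_rem = v - q * ln2                    # remainder D in [0, ln2)
    # Hyperbolic iteration schedule: start at i = 1 and repeat i = 4, 13, 40
    # to guarantee convergence of the angle recurrence.
    seq, i = [], 1
    while len(seq) < n_iter:
        seq.append(i)
        if i in (4, 13, 40):
            seq.append(i)
        i += 1
    seq = seq[:n_iter]
    # Hyperbolic gain K_h = prod(sqrt(1 - 2^-2i)); start from (1/K_h, 0).
    k = 1.0
    for i in seq:
        k *= math.sqrt(1.0 - 2.0 ** (-2 * i))
    x, y, z = 1.0 / k, 0.0, d_rem
    for i in seq:
        d = 1.0 if z >= 0.0 else -1.0
        x, y = x + d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atanh(2.0 ** -i)     # e_i = atanh(2^-i), hyperbolic
    # x -> cosh(D), y -> sinh(D); postscale by 2^Q per Table 2.2.
    return (2.0 ** q) * (x + y)
```

The postscaling multiply by 2^Q is just a shift in hardware, which is what makes this decomposition attractive.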
The sampling step of the PF for the Grid-based Fast SLAM application requires the generation of normally distributed random numbers. For the fast generation of such numbers, this study uses the Ziggurat algorithm, an efficient method for generating random variates from an arbitrary decreasing probability density function by applying the acceptance-rejection method: candidate values that fall under the target distribution are accepted, and the others are rejected.
The Ziggurat technique partitions the standard normal density function f(x) = e^{−x²/2} into n horizontal rectangular blocks R_i, where 0 ≤ i ≤ n − 1, each extending horizontally from 0 to x_i and vertically from f(x_i) to f(x_{i+1}). The bottom block consists of a rectangular area joined with the remainder of the density starting from a point r (Figure 2.2). All the blocks have equal area v, given by

v = r·f(r) + ∫_r^∞ f(x) dx
The description of the Ziggurat algorithm based on Doornik (2005) for generating normally distributed random variates by partitioning of the normal density distribution, and with the application of the acceptance-rejection method, is given by
Input: tables of the points x_i and the function values f_i = f(x_i), the tail point r, and the number of blocks n.
Output: normal random number x.
Draw an index i of a rectangle (0 ≤ i < n) at random with probability 1/n.
Draw a random number x from the rectangle i as x = U_0·x_i, where U_0 is a uniform random number drawn from a uniform distribution U(0, 1).
Rectangular:
IF x < x_{i+1}, accept x.
Tail:
IF i = 0, accept x from the tail.
Wedge:
ELSE IF f_i + U_1·(f_{i+1} − f_i) < f(x), accept x, where U_1 is a uniform random number.
RETURN to step 1
where, in line 4, the method to generate the normal random numbers from the tail of the distribution is obtained as follows:
DO
Generate i.i.d. uniform (0, 1) variates u0 and u1
x = −ln(u_0)/r, y = −ln(u_1)
WHILE (2y < x²)
RETURN x = r + x
The Ziggurat algorithm requires a table of x_i points and their corresponding function values f_i. The number of rectangular blocks n is normally chosen as a power of 2 (64, 128, 256, …), and for a given n, a corresponding value of r is used to determine the x_i points required in the hardware realization of the algorithm (Marsaglia and Tsang, 2000). The uniform random numbers U_0 and U_1 required in the Ziggurat algorithm are generated with a Tausworthe uniform random number generator (URNG). For evaluating the exponential and natural logarithm functions in the wedge (line 5) and tail (line 4) regions, respectively, a CORDIC algorithm is used.
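A software model of the table construction and the sampling loop above might look as follows (a Python sketch: the n = 128 and r constants are the values published by Marsaglia and Tsang (2000), which may differ from the parameters chosen for the hardware, and the stdlib random generator stands in for the Tausworthe URNG):

```python
import math
import random

def ziggurat_tables(n=128, r=3.442619855899):
    """Build the x_i / f(x_i) tables for an n-layer normal Ziggurat using
    the equal-area recurrence f(x_{i+1}) = f(x_i) + v / x_i."""
    f = lambda t: math.exp(-0.5 * t * t)
    # Base-strip area v: rectangle [0, r] x [0, f(r)] plus the tail mass.
    tail = math.sqrt(math.pi / 2.0) * math.erfc(r / math.sqrt(2.0))
    v = r * f(r) + tail
    x = [0.0] * (n + 1)
    fx = [0.0] * (n + 1)
    x[0] = v / f(r)                        # pseudo-width of the base strip
    x[1] = r
    fx[0] = fx[1] = f(r)
    for i in range(1, n - 1):
        fx[i + 1] = fx[i] + v / x[i]       # equal-area recurrence
        x[i + 1] = math.sqrt(-2.0 * math.log(fx[i + 1]))
    x[n], fx[n] = 0.0, 1.0                 # topmost layer touches the peak
    return x, fx

def ziggurat_normal(rng, x, fx, n=128):
    """Draw one standard normal variate via steps 1-6 of the algorithm."""
    r = x[1]
    while True:
        i = rng.randrange(n)               # step 1: random layer index
        xx = rng.random() * x[i]           # step 2: x = U0 * x_i
        if xx < x[i + 1]:                  # step 3: rectangular accept
            break
        if i == 0:                         # step 4: sample from the tail
            while True:
                t = -math.log(rng.random()) / r
                y = -math.log(rng.random())
                if 2.0 * y > t * t:
                    xx = r + t
                    break
            break
        # step 5: wedge test against the true density f(x) = exp(-x^2/2)
        if fx[i] + rng.random() * (fx[i + 1] - fx[i]) < math.exp(-0.5 * xx * xx):
            break
    return xx if rng.random() < 0.5 else -xx   # random sign
```

In the vast majority of draws the rectangular test in step 3 succeeds immediately, so the expensive exponential and logarithm evaluations (delegated to CORDIC in the hardware) are rarely exercised.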
This section describes the hardware designs for the PF acceleration techniques presented in section 4 for the Grid-based Fast SLAM application. The hardware design for the CORDIC acceleration technique is based on a CORDIC hardware module (shown in Figure 2.3) composed of three computational submodules: the CORDIC core, prescaling, and postscaling. The CORDIC-core submodule implements the unified CORDIC iteration equations given in section 4, while the prescaling and postscaling submodules implement the identity equations given in Table 2.2, which extend the range of the input arguments for the different function evaluations. A single CORDIC module evaluates all of the functions, avoiding multiple instances of the module and the corresponding additional resource requirements. This is achieved by configuring the module at run time, through its Config port, for a specific function evaluation.
A serial CORDIC algorithm with 24-bit accuracy is implemented in the CORDIC core shown in Figure 2.3. Of the two possible inputs U0 and U1 in the input/output interfaces of the CORDIC module, one is evaluated for a specific function at a time. The configuration for a specific function evaluation is provided by the configuration bits at the Config input port; depending on these bits, a specific function is evaluated and the result is provided at the output interface f(). The functions f() that can be obtained at the output are the sin(), cos(), exp(), and ln() functions.
The hardware architecture for the generation of both uniform and Gaussian random numbers is given in Figure 2.4. It involves three hardware submodules: the Ziggurat NRNG submodule, for the generation of normally distributed random numbers; the Tausworthe submodule, for the generation of the two uniform random numbers (U0 and U1) used by the Ziggurat NRNG submodule and by the resampling step of the PF; and the CORDIC submodule, for evaluating the complex functions required by the Ziggurat NRNG submodule. The CORDIC submodule implements the architecture presented in Figure 2.3. The Ziggurat NRNG submodule contains independently working hardware blocks for the generation of normal random variates from the rectangular, tail, and wedge regions. The exponential and natural logarithm functions needed when normal variates are generated from the wedge and tail regions, respectively, are computed with the CORDIC algorithm.
To further speed up the normal random number generation in the Ziggurat NRNG submodule, an effective mechanism for simultaneous access to the coefficients x_i and f_i is required. This is achieved by dividing the random index i into even and odd values and storing the respective x_i and f_i in separate memories (Figure 2.5). When the generated index i is odd, x_i and x_{i+1} are read simultaneously from the odd and even memories at the same memory positions. However, if the generated index is even, then x_i is read from the even memory and x_{i+1} is read in parallel from the odd memory at the next memory position. As a result, parallel access to the coefficients is achieved.
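The even/odd memory-bank idea can be sketched in software as follows (a 0-based-index Python illustration of the address mapping only; the bank names and helper functions are ours, not the hardware interface):

```python
def split_banks(x):
    """Split a coefficient table into 'even' and 'odd' banks so that two
    consecutive entries x[i] and x[i+1] always live in different banks
    and can therefore be fetched in the same cycle."""
    return x[0::2], x[1::2]        # even-indexed bank, odd-indexed bank

def fetch_pair(even, odd, i):
    """Return (x[i], x[i+1]) with exactly one read from each bank."""
    if i % 2 == 0:
        # x[i] sits in the even bank; x[i+1] in the odd bank, same address.
        return even[i // 2], odd[i // 2]
    # x[i] sits in the odd bank; x[i+1] in the even bank, next address.
    return odd[i // 2], even[i // 2 + 1]
```

Because consecutive indices always fall in different banks, a dual-memory layout like this retrieves both coefficients in one access regardless of the parity of the random index.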
The global architecture of the proposed system for the PF in the Grid-based Fast SLAM algorithm is given in Figure 2.6. It comprises an embedded MicroBlaze processor responsible for the execution of the software functions; the PF hardware accelerator (PF HW accelerator); and timer and UART cores that help in the analysis and verification of the system. The PF HW accelerator contains the CORDIC and random number generator cores explained in section 5, connected to the MicroBlaze soft-core processor through a dedicated one-to-one communication bus (Fast Simplex Link) for fast streaming of data.
To test the Grid-based Fast SLAM algorithm, real-time odometry and laser data collected from a mobile robot platform are interfaced through random access memory (RAM) in order to synchronize the data for processing. The odometry data are used in the sampling step to generate new particle instances, and the laser data are used in the importance weight step for the evaluation of the particle weights. The particles and their associated weights are buffered in Block RAM for fast access. Individual CORDIC hardware modules are assigned to evaluate the functions in the sampling and importance weight steps. Uniform and Gaussian random numbers are provided to the resampling and sampling steps of the PF, respectively, by the random number generator module.
The architecture shown in Figure 2.6 is implemented on a Xilinx Kintex-7 KC705 FPGA device running at 100 MHz. The hardware modules are written in VHSIC Hardware Description Language (VHDL). In the implementation of the Ziggurat NRNG module, the x_i and f_i coefficients are represented as fixed-point numbers in Q8:24 format (i.e., 8 bits for the integer part and 24 bits for the fractional part). The same fixed-point format is used for the data in the CORDIC module design. The different variables in the software part of the algorithm use a 32-bit floating-point representation, with the floating-point unit (FPU) of the MicroBlaze processor enabled.
Figure 2.7 summarizes the execution time, given by the number of clock cycles, for each step of the PF in the Grid-based Fast SLAM algorithm. The results show that the hardware acceleration reduces the execution time of the PF in all three steps of the algorithm. In particular, a significant speed-up is achieved in the sampling step, which can be attributed to the fast generation of Gaussian random numbers by the random number generator hardware module. Compared to the other steps, the importance weight step requires a relatively higher number of clock cycles, as shown in Figure 2.7. This is because, to evaluate the weight of each particle, every laser scan measurement (range and bearing angle) must first be transformed from the robot frame of reference to the global frame of reference, and a scan normally contains hundreds of points from the sensor; this transformation requires the evaluation of the sine and cosine functions for every scan point. This step is followed by a search for the closest point between every laser scan end point and the occupied points in the map, which is itself computationally intensive. At this stage, this search is implemented in software on the embedded MicroBlaze processor; speeding it up in hardware is intended as future work. Furthermore, the calculation of the weight of each particle requires the computation of an exponential function for every laser scan point. These are the main reasons for the relatively large number of clock cycles in the importance weight step.
This chapter presented techniques aimed at reducing the high computational demand of PFs, applied to a Grid-based Fast SLAM application. After an initial identification of the computational bottlenecks in each step of the algorithm, techniques were proposed to tackle them. Based on these techniques, and with a hardware/software codesign approach, hardware acceleration blocks were designed and implemented to reduce the computational time. The presented approach leads to speedups of 140×, 14.87×, and 19.36× in the sampling, importance weight, and resampling steps, respectively (Figure 2.7). Owing to the flexibility of the hardware/software codesign, other real-time particle filtering applications can readily adopt the approach presented here to accelerate their computation.