Lukas von Stumberg⁎,†; Vladyslav Usenko⁎; Daniel Cremers⁎,† ⁎Department of Informatics, Technical University of Munich, Munich, Germany
†Artisense GmbH, Garching, Germany
Simultaneous location and mapping (SLAM) is an integral part of scene understanding. There have been two sensors that in combination have shown to be particularly useful for this task: cameras and inertial measurement units (IMUs). In the last years a lot of very powerful and precise purely visual SLAM methods have been proposed, yet all of them exhibit a lack of robustness especially to fast motion and textureless areas. This is where IMUs are a great addition. They can provide short-term motion constraints and in contrast to cameras do not contain outliers.
In order to complement vision in the most optimal way, IMU data needs to be injected into the vision algorithm on a very deep layer resulting in a tight integration. In this chapter we will review the fundamentals of visual–inertial sensor fusion and explain a current state-of-the-art method called visual–inertial direct sparse odometry (VI-DSO).
VI-DSO jointly estimates camera poses and sparse scene geometry by minimizing photometric and IMU measurement errors in a combined energy functional. IMU information is accumulated between several frames using measurement preintegration, and it is inserted into the optimization as an additional constraint. In VI-DSO, scale and gravity direction are explicitly included into the model and jointly optimized together with other variables such as poses. As the scale is often not immediately observable using IMU data this allows one to initialize the visual–inertial system with an arbitrary scale instead of having to delay the initialization until everything is observable.
We will also cover partial marginalization of old variables, which keeps the computation time bounded as we explore the environment. We also describe dynamic marginalization, which allows us to use partial marginalization even in cases where the initial scale estimate is far from the optimum. We evaluate our system and compare to other visual–inertial methods on a publicly available dataset.
SLAM; Visual–inertial odometry; Sensor fusion
One of the most important tasks in modern robotics is to determine the motion of the robot and a reconstruction of the environment. Applications can be found in robot navigation [1] and autonomous driving [2] [3]. There are several different sensors available for this task, but cameras offer several advantages. They are cheap, lightweight, small and passive, which explains why they have been used very frequently for this task during the last years. While current visual odometry methods are already capable of providing highly accurate pose-estimates, they are not optimal in terms of robustness yet. Especially fast motions and low-textured areas present challenges for current solutions. This is where a combination with an inertial measurement unit (IMU) can help. An IMU contains a gyroscope and an accelerometer, which measure rotational velocities and linear acceleration at a very high rate (typically more than 100 Hz). This provides accurate short-term motion constraints and, unlike vision, is not prone to outliers.
In this chapter we will describe a state-of-the-art direct and sparse approach to visual–inertial odometry. The system is based on Direct Sparse Odometry (DSO) which was developed by Engel et al. [4]. DSO represents the geometry with a sparse set of points, each associated with an estimated inverse depth. They are jointly optimized together with 3D poses and camera parameters in a combined energy functional. We will describe how to complement this function with an additional IMU error term. This greatly improves the robustness of the system, as the inertial error term is convex and not prone to outliers.
A key drawback of monocular visual odometry is that it is not possible to obtain the metric scale of the environment. Adding an IMU enables us to observe the scale. Yet, depending on the performed motions this can take infinitely long, making the monocular visual–inertial initialization a challenging task. Rather than relying on a separate IMU initialization we show how to include the scale as a variable into the model of our system and jointly optimize it together with the other parameters.
Quantitative evaluation on the EuRoC dataset [5] demonstrates that we can reliably determine camera motion and sparse 3D structure (in metric units) from a visual–inertial system on a rapidly moving micro aerial vehicle (MAV) despite challenging illumination conditions (Fig. 7.1).
Note that this book chapter builds up on and significantly extends the paper [6], which was presented at the International Conference on Robotics and Automation (ICRA) 2018.
Motion estimation in 3D is a very important topic in robotics, and an IMU can be very useful when combined with vision data properly. Consequently there have been many works revolving around this topic in the past. In the following subsections we will present a short summary of visual and visual–inertial odometry methods.
When combining a monocular camera with IMU measurements a challenging task is the visual–inertial initialization. Therefore we will also discuss existing approaches, where the scale of the scene, initial orientation and velocity are not known from the beginning.
The task of incrementally tracking the camera motion from video is commonly called visual odometry. It was established by Nister et al. [7]. They proposed to match a sparse set of features from frame to frame, yielding multiple observations of the same point in different images. The 3D camera motion is then computed by minimizing the reprojection error. Afterwards there were many other works based on this feature-based approach. The general idea is to extract and track features over the images and then minimize the reprojection error of them. Later works like MonoSLAM [8] and PTAM [9] added the ability to create a consistent map in real-time. A very recent example is ORB-SLAM [10]. It is capable of tracking the camera motion and creating a consistent map with the ability to close trajectory loops and re-localization, outperforming previous methods in accuracy and robustness.
Another approach to visual odometry are the direct methods. Instead of the tracking features and minimizing the reprojection error, they directly use the image data to obtain the camera motion by optimizing a photometric error. In the work of [11] this error was minimized in real-time in a stereo setting by formulating quadrifocal constraints on the pixels. Especially for tracking with RGB-D cameras direct solutions were proposed by Newcombe et al. [12] and Kerl et al. [13]. Recently, several direct approaches were proposed, which use only a monocular camera. They differ in the density of the image information used. Newcombe et al. [14] proposed a dense approach which uses the complete image for tracking. However, for monocular visual odometry mainly image regions with a sufficient gradient contribute information. Therefore Engel et al. [15] have presented a semi-dense SLAM system exploiting this fact. Although sparse methods have traditionally been keypoint-based, direct sparse methods have advantages discussed in two recent publications. SVO proposed by Forster et al. [16] uses a keypoint-based approach for mapping keyframes and performs sparse direct image alignment to track non-keyframes. In contrast DSO proposed by Engel et al. [4] is a purely direct and sparse system. By using a sparse set of points it can efficiently perform bundle adjustment on a window of keyframes yielding superior accuracy and robustness compared to the semi-dense approach.
IMU sensors are in many ways complementary to vision. The accelerometer and gyroscope measurements provide good short-term motion information. Furthermore they make scale as well as the roll and pitch angles observable, which is relevant for some control tasks, like the navigation of drones. Therefore there were several methods which combine them with vision. They can be classified into two general approaches.
The first one is the so-called loosely coupled fusion. Here the vision system is used as a blackbox, providing pose measurements, which are then combined with the inertial data, usually using an (extended or unscented) Kalman filter. Examples of this are shown in [1], [17] and [18].
Recently, tightly coupled works became more popular. Here the motion of the camera is computed by jointly optimizing the parameters in a combined energy function consisting of a visual and an inertial error term. The advantage is that the two measurement sources are combined on a very deep level of the system yielding a very high accuracy and especially robustness. Especially for vision systems that are based on nonlinear optimization the inertial error term can help overcome nonlinearities in the energy functional. The disadvantage is that the tightly coupled fusion has to be done differently for different vision systems making it much harder to implement. Still a tightly-coupled fusion has been done for filtering-based approaches [19] [20] [21] as well as for energy-minimization based systems [22] [23] [24] [25] [26].
Especially for monocular energy-minimization based approaches it is important to start with a good visual–inertial initialization. In practical use-cases the system does not have any prior knowledge about initial velocities, poses and geometry. Therefore apart from the usual visual initialization procedure most visual–inertial systems need a specific visual–inertial initialization. However, the problem is even more difficult, because several types of motion do not allow one to uniquely determine all variables. In [27] a closed-form solution for visual–inertial initialization was presented, as well as an analysis of the exceptional cases. It was later extended to work with IMU biases by [28]. In [25] a visual–inertial initialization procedure was proposed that works on the challenging EuRoC dataset. They use 15 seconds of flight data for initialization to make sure that all variables are observable on this specific dataset. Also in [26] and [29] IMU initialization methods are proposed which work on the EuRoC dataset, the latter even works without prior initialization of IMU extrinsics.
The goal of the Gauss–Newton algorithm is to solve optimization problems of the form
minxE(x)=minx∑iwi⋅ri(x)2
where each ri is a residual and wi is the associated weight.
We will start with an explanation of the general Newton methods. The idea is based on a Taylor approximation of the energy function around x0,
E(x)≈E(x0)+bT(x−x0)+12(x−x0)TH(x−x0)
where b=dEdx is the gradient and H=d2Edx2 is the Hessian.
Using this approximation the optimal solution is where
0!=dEdx=b+H(x−x0).
The optimal solution is therefore at
x=x0−H−1⋅b.
Of course, this is not the exact solution because we used an approximation to derive it. Therefore the optimization is performed iteratively by starting with an initial solution x0 and successively calculating
xt+1=xt−λ⋅H−1⋅b
where λ is the step size.
Calculating the Hessian is very time-intensive. Therefore the idea of the Gauss–Newton method is to use an approximation for it instead. In the case where the problem has the form of Eq. (7.1) we can compute
bj=2∑iwi⋅ri(x)∂ri∂xj,
Hjk=2∑iwi(∂ri∂xj∂ri∂xk+ri∂2ri∂xj∂xk)≈2∑iwi(∂ri∂xj∂ri∂xk).
We define the stacked residual vector r by stacking all residuals ri. The Jacobian J of r therefore contains at each element
Jij=∂rixj.
We also stack the weights into a diagonal matrix W.
Then the approximated Hessian can be calculated with
H≈2∑iwiJijJik=2JTWJ.
Similarly the gradient of the energy function is
b=2JTWr.
In summary the Gauss–Newton algorithm iteratively computes updates using Eq. (7.5), where the Hessian is approximated using Eq. (7.9).
When initialized to far off from the optimum it can happen that the Gauss–Newton algorithm does not converge. The idea of the Levenberg–Marquandt algorithm is to dampen the Gauss–Newton updates. It still uses the same definition of H and b as the Gauss–Newton algorithm.
When computing the iterative update, however, it uses
xt+1=xt−(H+λ⋅diag(H))−1⋅b.
The rough idea is to create a mixture between the Gauss–Newton algorithm and a simple gradient descent. Intuitively for a value of λ=0 this will degenerate to a normal Gauss–Newton step and for a high value it will be dampened to something similar to gradient descent.
Clearly there is no optimal value for λ so it is usually adapted on the fly. After each update it is calculated whether or not the update reduces the energy (and the update is discarded if not). When an update is discarded λ is increased and when it is accepted λ gets reduced.
Compared to the Gauss–Newton algorithm the Levenberg–Marquandt method is in general more robust to a bad initialization.
Direct sparse odometry (DSO) is a method that was very recently proposed by Engel et al. [4] (see Fig. 7.2). The novelty is that it optimizes a direct error term but still uses only a sparse set of points making it very accurate and at the same time real-time capable. The main advantage over previous direct methods like LSD-SLAM [15] is that it performs a bundle adjustment like optimization, which is computationally feasible because the set of points is sparse. In this section we will explain the theory behind DSO as well as the general structure.
Throughout the chapter we will use the following notation: bold upper case letters H represent matrices, bold lower case x vectors and light lower case λ represent scalars. Transformations between coordinate frames are denoted as Ti_j∈SE(3) where point in coordinate frame i can be transformed to the coordinate frame j using the following equation: pi=Ti_jpj. We denote Lie algebra elements as ˆξ∈se(3), where ξ∈R6, and use them to apply small increments to the 6D pose ξ′i_j=ξi_j⊞ξ:=log(eˆξi_j⋅eˆξ)∨. We define the world as a fixed inertial coordinate frame with gravity acting along the negative Z axis. We also assume that the transformation from camera to IMU frame Timu_cam is fixed and calibrated in advance. Factor graphs are expressed as a set G of factors and we use G1∪G2 to denote a factor graph containing all factors that are either in G1 or in G2.
The main idea behind direct methods is that the intensity of a point should be constant over the images it is observed in. However, changes in illumination and photometric camera parameters such as vignetting, auto-exposure and response function can produce a change of the pixel value even when the real-world intensity is constant. To account for this, DSO can perform a photometric calibration and account for illumination changes in the cost function.
To do this it uses the following error definition. The photometric error of a point p∈Ωi in reference frame i observed in another frame j is defined as follows:
Epj=∑p∈Npωp‖(Ij[p′]−bj)−tjeajtieai(Ii[p]−bi)‖γ,
where Np is a set of eight pixels around the point p, Ii and Ij are images of respective frames, ti, tj are the exposure times, ai, bi, aj, bj are the coefficients to correct for affine illumination changes, γ is the Huber norm, ωp is a gradient-dependent weighting and p′ is the point projected into Ij.
The key idea here is that brightness constancy of a pixel is assumed when observed in different frames. However, the method compensates for illumination changes in the image, the vignette of the camera and the response function and the exposure time.
With that we can formulate the photometric error as
Ephoto=∑i∈F∑p∈Pi∑j∈obs(p)Epj,
where F is a set of keyframes that we are optimizing, Pi is a sparse set of points in keyframe i, and obs(p) is a set of observations of the same point in other keyframes.
For the accuracy of visual odometry it is important to optimize geometry and camera poses jointly. In DSO this can be done efficiently because the Hessian matrix of the system is sparse (see Sect. 7.4.5). However, the joint optimization still takes relatively much time which is why it cannot be done for each frame in real-time. Therefore the joint optimization is only performed for keyframes.
As we want to obtain poses for all frames however, we also perform a coarse tracking. Contrary to the joint estimation, we fix all point depths in the coarse tracking and only optimize the pose of the newest frame, as well as its affine illumination parameters (ai and bi). This is done using normal direct image alignment in a pyramid scheme where we track against the latest keyframe (see Sect. 7.4.4).
In summary the system has the structure depicted in Fig. 7.3. For each frame we perform the coarse tracking yielding a relative pose of the newest frame with respect to the last keyframe. For each frame we also track candidate points (during the “trace points” procedure). This serves as an initialization when some of the candidate points are later activated during the joint optimization.
For some frames we create a new keyframe, depending on the distance and visual change since the last keyframe. Then the joint estimation is performed, optimizing the poses of all active keyframes as well as the geometry. In the real-time version of the system coarse tracking and joint optimization are performed in parallel in two separate threads. Active keyframes are all keyframes which have not been marginalized yet.
The coarse tracking is based on normal direct image alignment.
In order to speed up computation all active points are projected into the frame of the latest keyframe and tracking is performed only with reference to this keyframe.
As in the rest of the system the coarse tracking minimizes the photometric error in order to get a pose estimate. The optimized parameters include the 6 DOF pose which is represented as the transformation from the new frame to the reference frame.
This error is minimized using the Levenberg–Marquandt algorithm. In order to increase the convergence radius the well known pyramid scheme is used. This means that the image is scaled down multiple times (using Gaussian blurring). When tracking, at first the lowest resolution is used. After convergence the optimization is proceeded with the next level of the image pyramid (using the result from the previous level as an initialization). This is done until the original image is reached and the parameters are optimized using it.
This procedure has two advantages. First, tracking on a low resolution first is very fast, as less pixels have to be handled. The main advantage, however, is that it increases the convergence radius. The Taylor approximation of the energy, which in practice uses the image gradient, is only valid in the vicinity of the optimum. When using a smaller-scaled image the gradient contains information from a larger region (in terms of the original image), and therefore increases the convergence basin.
In contrast to the coarse tracking the joint optimization not only optimizes poses, but also the geometry, making the task significantly more difficult.
However, here all variables are initialized close to the optimum: The approximate pose of the newest frame is obtained from the previously executed coarse tracking. Newly activated points have a depth estimated by the tracing-procedure and all other variables have already been optimized in the previous execution of the joint optimization. Therefore DSO uses the Gauss–Newton algorithm for this task (instead of Levenberg–Marquandt), and no image pyramid.
When looking at the update step
δx=H−1⋅b
performed in each iteration it is clear that the most calculation-intensive part is inverting the Hessian matrix.
However, we can use the sparsity of the structure (see Fig. 7.4). The reason for this specific structure is that each residual only depends on the pose of two frames, the affine lighting parameters and one point depth (see also Eq. (7.12)). Therefore there is no residual depending on two different points, meaning that the off-diagonal part of JTJ is 0 for all depths.
This information can be used to efficiently solve the problem. We can partition the linear system into
[HααHαβHβαHββ][δxαδxβ]=[bαbβ]
where xβ are the depth variables and xα are all other variables. Note that this means that Hββ is a diagonal matrix.
We can now apply the Schur complements, which yields the new linear system ˆHααxα=ˆbα, with
ˆHαα=Hαα−HαβH−1ββHβα,
ˆbα=bα−HαβH−1ββbβ.
In this procedure the only inversion needed is H−1ββ, which is trivial because Hββ is diagonal.
The resulting ˆHαα is much smaller than the original H because there are typically much more points, than other variables. With it we can efficiently compute the update for the non-depth variables with
δxα=ˆHαα−1⋅ˆbα.
The update step of for the depth variables is done by reinserting
Hβαδxα+Hββδxβ=bβ⇔δxβ=H−1ββ(bβ−Hβαδxα).
The visual–inertial odometry system we describe in this chapter was proposed in [6]. It is based on DSO [4] and could also be viewed as a direct formulation of [22]. The approach differs from [24] and [23] by jointly estimating poses and geometry in a bundle adjustment like optimization.
As mentioned in the introduction it is important to integrate IMU data on a very tight level to the system. In the odometry system poses and geometry are estimated by minimizing a photometric error function Ephoto. We integrate IMU data by instead optimizing the following energy function:
Etotal=λ⋅Ephoto+Einertial.
The summands Ephoto and Einertial are explained in Sects. 7.4.2 and 7.5.1, respectively.
The system contains two main parts running in parallel:
In contrast to [25] this method does not wait for a fixed amount of time before initializing the visual–inertial system but instead jointly optimizes all parameters including the scale. This yields a higher robustness as inertial measurements are used right from the beginning.
In this section we derive the nonlinear dynamic model that we use to construct the error term, which depends on rotational velocities measured by gyroscope and linear acceleration measured by accelerometer. We decompose the 6D pose from the state s into rotation and translational part:
Tw_imu=(Rp01),
with that we can write the following dynamic model:
˙p=v,
˙v=R(az+ϵa−ba)+g,
˙R=R[ωz+ϵω−bω]×,
˙ba=ϵb,a,
˙bω=ϵb,ω,
where ϵa, ϵω, ϵb,a, and ϵb,ω denote the Gaussian white noise that affects the measurements and ba and bω denote slowly evolving biases. [⋅]× is the skew-symmetric matrix such that, for vectors a, b, [a]×b=a×b.
As IMU data is obtained with a much higher frequency than images, we follow the preintegration approach proposed in [30] and improved in [31] and [23]. We integrate the IMU measurements between timestamps i and j in the IMU coordinate frame and obtain pseudo-measurements Δpi→j, Δvi→j, and Ri→j.
We initialize pseudo-measurements with Δpi→i=0, Δvi→i=0, Ri→i=I, and assuming the time between IMU measurements is Δt we integrate the raw measurements:
Δpi→k+1=Δpi→k+Δvi→kΔt,
Δvi→k+1=Δvi→k+Ri→k(az−ba)Δt,
Ri→k+1=Ri→kexp([ωz−bω]×Δt).
Given the initial state and integrated measurements the state at the next time-step can be predicted:
pj=pi+(tj−ti)vi+12(tj−ti)2g+RiΔpi→j,
vj=vi+(tj−ti)g+RiΔvi→j,
Rj=RiRi→j.
For the previous state si−1 (based on the state definition in Eq. (7.39)) and IMU measurements ai−1, ωi−1 between frames i and i−1, the method yields a prediction
ˆsi:=h(ξi−1,vi−1,bi−1,ai−1,ωi−1)
of the pose, velocity, and biases in frame i with associated covariance estimate ˆΣs,i. Hence, the IMU error function terms are
Einertial(si,sj):=(sj⊟ˆsj)TˆΣ−1s,j(sj⊟ˆsj).
In contrast to a purely monocular system the usage of inertial data enables us to observe more parameters such as scale and gravity direction. While this on the one hand is an advantage, on the other hand it also means that these values have to be correctly estimated as otherwise the IMU-integration will not work correctly. This is why other methods such as [25] need to estimate the scale and gravity direction in a separate initialization procedure before being able to use inertial measurements for tracking.
Especially the metric scale, however, imposes a problem as it is only observable after the camera has performed certain motions. An example of a degenerate case is when the system is initialized while the camera moves with a fixed speed that does not change for the first minute. Then the IMU will measure no acceleration (except gravity and the bias) for the first minute rendering it useless for scale estimation. This shows that in theory it can take an arbitrary amount of time until all parameters are observable. Reference [27, Tables I and II] contains an excellent summary, showing the different cases where a unique solution for the problem cannot be found. Although these completely degenerate scenarios usually do not occur, also in practice it can often take multiple seconds for the scale to be observable. In [25] the authors say that they wait for 15 seconds of camera motion in order to be sure that everything is observable when processing the popular EuRoC dataset. Obviously it is not optimal to wait for such a long time before using any inertial measurements for tracking especially as on other datasets it might take even longer. However, as explained previously, using a wrong scale will render the inertial data useless. Therefore in [6] a novel strategy for handling this issue has been proposed. We explicitly include scale (and gravity direction) as a parameter in our visual–inertial system and jointly optimize them together with the other values such as poses and geometry. This means that we can initialize with an arbitrary scale instead of waiting until it is observable. We initialize the various parameters as follows:
All these parameters are then jointly optimized during a bundle adjustment like optimization.
In order to be able to start tracking and mapping with a preliminary scale and gravity direction we need to include them into our model. Therefore in addition to the metric coordinate frame we define the DSO coordinate frame to be a scaled and rotated version of it. The transformation from the DSO frame to the metric frame is defined as Tm_d∈{T∈SIM(3)|translation(T)=0}, together with the corresponding ξm_d=log(Tm_d)∈sim(3). We add a superscript D or M to all poses denoting in which coordinate frame they are expressed. In the optimization the photometric error is always evaluated in the DSO frame, making it independent of the scale and gravity direction. The inertial error on the other hand has to be evaluated in the metric frame.
We optimize the poses, IMU-biases and velocities of a fixed number of keyframes. Fig. 7.5A shows a factor graph of the problem. Note that there are in fact many separate visual factors connecting two keyframes each, which we have combined to one big factor connecting all the keyframes in this visualization. Each IMU-factor connects two subsequent keyframes using the preintegration scheme described in Sect. 7.5.1. As the error of the preintegration increases with the time between the keyframes we ensure that the time between two consecutive keyframes is not bigger than 0.5 seconds, which is similar to what [25] have done. Note that in contrast to their method, however, we allow the marginalization procedure described in Sect. 7.5.4.2 to violate this constraint which ensures that long-term relationships between keyframes can be properly observed.
An important property of our algorithm is that the optimized poses are not represented in the metric frame but in the DSO frame. This means that they do not depend on the scale of the environment. If we were optimizing them in the metric frame instead a change of the scale would mean that the translational part of all poses also changes which does not happen during nonlinear optimization.
We perform nonlinear optimization using the Gauss–Newton algorithm. For each active keyframe we define a state vector
si:=[(ξDcami_w)T,vTi,bTi,ai,bi,d1i,d2i,...,dmi]T
where vi∈R3 is the velocity, bi∈R6 is the current IMU bias, ai and bi are the affine illumination parameters used in Eq. (7.12) and dji are the inverse depths of the points hosted in this keyframe.
The full state vector is then defined as
s=[cT,ξTm_d,sT1,sT2,...,sTn]T
where c contains the geometric camera parameters and ξm_d denotes the translation-free transformation between the DSO frame and the metric frame as defined in Sect. 7.5.3. In order to simplify the notation we define the operator s⊞s′ to work on state vectors by applying the concatenation operation ξ⊞ξ′ for Lie algebra components and a plain addition for other components.
Using the stacked residual vector r we define the Jacobian
J=dr(s⊞ϵ)dϵ|ϵ=0
where W is a diagonal weight matrix. Now we define H=JTWJ and b=−JTWr where W is a diagonal weight matrix. Then the update that we compute is δ=H−1b.
Note that the visual energy term Ephoto and the inertial error term Eimu do not have common residuals. Therefore we can divide H and b each into two independent parts
H=Hphoto+Himu and b=bphoto+bimu.
As the inertial residuals compare the current relative pose to the estimate from the inertial data they need to use poses in the metric frame relative to the IMU. Therefore we define additional state vectors (containing only IMU-relevant data) for the inertial residuals.
s′i:=[ξMw_imui,vi,bi]T,
s′=[s′T1,s′T2,...,s′Tn]T.
The inertial residuals lead to
H′imu=J′TimuWimuJ′imu
b′imu=−J′TimuWimurimu.
For the joint optimization, however, we need to obtain Himu and bimu based on the state definition in Eq. (7.36). As the two definitions mainly differ in their representation of the poses we can compute Jrel such that
Himu=JTrel⋅H′imu⋅Jrelandbimu=JTrel⋅b′imu.
The computation of Jrel is detailed in Sect. 7.6. Note that we represent all transformations as elements of sim(3) and fix the scale to 1 for all of them except ξm_d.
In order to compute Gauss–Newton updates in a reasonable time-frame we perform partial marginalization for older keyframes. This means that all variables corresponding to this keyframe (pose, bias, velocity and affine illumination parameters) are marginalized out using the Schur complements. Fig. 7.5B shows an example how the factor graph changes after the marginalization of a keyframe.
The marginalization of the visual factors is handled as in [4] by dropping residual terms that affect the sparsity of the system and by first marginalizing all points in the keyframe before marginalizing the keyframe itself.
We denote the set of variables that we want to marginalize as sβ and the ones that are connected to them in the factor graph as sα. Using this the linear system becomes
[HααHαβHβαHββ][sαsβ]=[bαbβ].
The application of the Schur complements yields the new linear system ˆHααsα=ˆbα, with
ˆHαα=Hαα−HαβH−1ββHβα,
ˆbα=bα−HαβH−1ββbβ.
Note that this factor resulting from the marginalization requires the linearization point of all connected variables in sα to remain fixed. However, we can still approximate the energy around further linearization points s′=s⊞Δs using
E(s′⊞ϵ)=E(s⊞Δs⊞ϵ)≈E(s⊞(Δs+ϵ))≈(Δs+ϵ)TˆHαα(Δs+ϵ)+2(Δs+ϵ)Tˆbα+const=ΔsTˆHααΔs+ΔsTˆHααϵ+ϵTˆHααΔs+ϵTˆHααϵ+2ΔsTˆbα+2ϵTˆbα+const=ϵTˆHααϵ+2ϵT(ˆbα+ˆHααΔs)︸=:ˆb′α+const.
In order to maintain consistency of the system it is important that Jacobians are all evaluated at the same value for variables that are connected to a marginalization factor as otherwise the nullspaces get eliminated. Therefore we apply “first estimates Jacobians”. For the visual factors we follow [4] and evaluate Jphoto and Jgeo at the linearization point. When computing the inertial factors we fix the evaluation point of Jrel for all variables which are connected to a marginalization factor. Note that this always includes ξm_d.
The marginalization procedure described in Sect. 7.5.4.2 has two purposes: reduce the computation complexity of the optimization by removing old states and maintain the information about the previous states of the system. This procedure fixes the linearization points of the states connected to the old states, so they should already have a good estimate. In our scenario this is the case for all variables except of scale.
The main idea of “dynamic marginalization” is to maintain several marginalization priors at the same time and reset the one we currently use when the scale estimate moves too far from the linearization point in the marginalization prior. To get an idea of the method we strongly recommend watching the animation at https://youtu.be/GoqnXDS7jbA?t=2m6s.
In our implementation we use three marginalization priors: Mvisual, Mcurr and Mhalf. Mvisual contains only scale independent information from previous states of the vision and cannot be used to infer the global scale. Mcurr contains all information since the time we set the linearization point for the scale and Mhalf contains only the recent states that have a scale close to the current estimate.
When the scale estimate deviates too much from the linearization point of Mcurr, the value of Mcurr is set to Mhalf and Mhalf is set to Mvisual with corresponding changes in the linearization points. This ensures that the optimization always has some information about the previous states with consistent scale estimates. In the remaining part of the section we provide the details of our implementation.
We define Gmetric to contain only the visual–inertial factors (which depend on ξm_d) and Gvisual to contain all other factors, except the marginalization priors. Then
Gfull=Gmetric∪Gvisual.
Fig. 7.6 depicts the partitioning of the factor graph.
We define three different marginalization factors Mcurr, Mvisual and Mhalf. For the optimization we always compute updates using the graph
Gba=Gmetric∪Gvisual∪Mcurr.
When keyframe i is marginalized we update Mvisual with the factor arising from marginalizing frame i in Gvisual∪Mvisual. This means that Mvisual contains all marginalized visual factors and no marginalized inertial factors making it independent of the scale.
For each marginalized keyframe i we define
si:=scale estimate at the time, i was marginalized.
We define i∈M if and only if M contains an inertial factor that was marginalized at time i. Using this we enforce the following constraints for inertial factors contained in Mcurr and Mhalf:
∀i∈Mcurr:si∈[smiddle/di,smiddle⋅di],
∀i∈Mhalf:si∈{[smiddle,smiddle⋅di],if scurr>smiddle,[smiddle/di,smiddle],otherwise,
where smiddle is the current middle of the allowed scale interval (initialized with s0), di is the size of the scale interval at time i, and scurr is the current scale estimate.
We update Mcurr by marginalizing frame i in Gba and we update Mhalf by marginalizing i in Gmetric∪Gvisual∪Mhalf.
In order to preserve the constraints in Eqs. (7.51) and (7.52) we apply Algorithm 1 every time a marginalization happens. By following these steps on the one hand we make sure that the constraints are satisfied which ensures that the scale difference in the currently used marginalization factor stays smaller than d2i. On the other hand the factor always contains some inertial factors so that the scale estimation works at all times. Note also that Mcurr and Mhalf have separate first estimate Jacobians that are employed when the respective marginalization factor is used. Fig. 7.7 shows how the system works in practice.
An important part of this strategy is the choice of di. It should be small, in order to keep the system consistent, but not too small so that Mcurr always contains enough inertial factors. Therefore we chose to dynamically adjust the parameter as follows. At all time steps i we calculate
di=min{djmin|j∈N∖{0},sisi−1<di}.
This ensures that it cannot happen that the Mhalf gets reset to Mvisual at the same time that Mcurr is exchanged with Mhalf. Therefore it prevents situations where Mcurr contains no inertial factors at all, making the scale estimation more reliable. On the one hand dmin should not be too small as otherwise not enough inertial information is accumulated. On the other hand it should not be too big, because else the marginalization can get inconsistent. In our experiments we chose dmin=√1.1 which is a good tradeoff between the two.
For many applications it is very useful to know when the scale has converged, especially as this can take an arbitrary amount of time in theory. The lack of this measure is also one of the main drawbacks of the initialization method described in [25]. Here we propose a simple but effective strategy. With si being the scale at time i and n being the maximum queue size (60 in all our experiments) we compute
c=maxnj:=i−n+1sjminnj:=i−n+1sj−1.
Fig. 7.13B shows a plot of this measure. When c is below a certain threshold cmin (0.005 in our experiments) we consider the scale to have converged and fix ξm_d. While this does not influence the accuracy of the system, fixing the scale might be useful for some applications.
The coarse tracking is responsible for computing a fast pose estimate for each frame that also serves as an initialization for the joint estimation detailed in Sect. 7.5.4. We perform conventional direct image alignment between the current frame and the latest keyframe, while keeping the geometry and the scale fixed. The general factor graph for the coarse tracking is shown in Fig. 7.8A. Inertial residuals using the previously described IMU preintegration scheme are placed between subsequent frames. Note that in contrast to the factor graph for the joint optimization, the visual factor for the coarse tracking does not connect all keyframes and depths, but instead just connects each frame with the current keyframe.
Every time the joint estimation is finished for a new frame, the coarse tracking is reinitialized with the new estimates for scale, gravity direction, bias, and velocity as well as the new keyframe as a reference for the visual factors. Similar to the joint estimation we perform partial marginalization to keep the update time constrained. After estimating the variables for a new frame we marginalize out all variables except the keyframe pose and the variables of the newest frame. Also we immediately marginalize the photometric parameters ai and bi as soon as the pose estimation for frame i is finished. Fig. 7.8B–D shows how the coarse factor graph changes over time. In contrast to the joint optimization we do not need to use dynamic marginalization because the scale is not included in the optimization.
As mentioned in Sect. 7.5.4.1 we want to compute Jrel, such that (7.43) holds.
We convert between the poses using
ξMw_imu=ξm_d⊞(ξDcam_w)−1⊞ξ−1m_d⊞ξMcam_imu=:Ψ(ξDcam_w,ξm_d).
For any function Ω(ξ):sim(3)→sim(3) we define the derivative dΩ(ξ⊞ϵ)dϵ implicitly using
Ω(ξ)−1⊞Ω(ξ⊞ϵ)=dΩ(ξ⊞ϵ)dϵ⋅ϵ+η(ϵ)⋅ϵ
where the error function η(ϵ) goes to 0 as ||ϵ|| goes to 0.
Note that in principle there are three other ways to define this derivative (as you can place the increment with ϵ as well as the multiplication with the inverse on either side). However, it can be shown (see Sect. 7.6.1) that only with this version the following chain rule holds for f(ξ):sim(3)→R:
df(Ω(ξ⊞ϵ))dϵ=df(Ω(ξ)⊞δ)δ⋅dΩ(ξ⊞ϵ)ϵ.
Using these definitions the relevant derivatives for Ψ can be computed (see Sects. 7.6.2 and 7.6.3)
∂Ψ(ξDcam_w⊞ϵ,ξm_d)∂ϵ=−Adj(T−1cam_imu⋅Tm_d⋅Tcam_w),
∂Ψ(ξDcam_w,ξm_d⊞ϵ)∂ϵ=Adj(T−1cam_imu⋅Tm_d⋅Tcam_w)−Adj(T−1cam_imu⋅Tm_d).
Stacking these derivatives correctly we can compute a Jacobian Jrel such that
Jimu=J′imu⋅Jrel.
Using this we can finally compute
Himu=JTimuWimuJimu=JTrel⋅(J′imu)T⋅Wimu⋅J′imu⋅Jrel=JTrel⋅H′imu⋅Jrel,
bimu=−JTimuWimurimu=−JTrel(J′imu)TWimurimu=JTrel⋅b′imu.
In this section we will prove the chain rule (7.57). We define the multivariate derivatives implicitly: f(Ω(ξ)⊞δ)−f(Ω(ξ))=df(Ω(ξ)⊞δ)δ⋅δ+μ(δ)⋅δ, where the error function μ(δ) goes to 0 as δ goes to zero. We can rewrite Eq. (7.56) by multiplying Ω(ξ) from the left, Ω(ξ⊞ϵ)=Ω(ξ)⊞(dlΩ(ξ⊞ϵ)dϵ⋅ϵ+η(ϵ)⋅ϵ). Using this we can compute f(Ω(ξ⊞ϵ))−f(Ω(ξ))(7.64)=f(Ω(ξ)⊞(dlΩ(ξ⊞ϵ)dϵ⋅ϵ+η(ϵ)⋅ϵ)︸=:δϵ)−f(Ω(ξ))=f(Ω(ξ)⊞δϵ)−f(Ω(ξ))(7.63)=df(Ω(ξ)⊞δ)δ⋅δϵ+μ(δϵ)⋅δϵ=df(Ω(ξ)⊞δ)δ⋅(dlΩ(ξ⊞ϵ)dϵ⋅ϵ+η(ϵ)⋅ϵ)+μ(δϵ)⋅(dlΩ(ξ⊞ϵ)dϵ⋅ϵ+η(ϵ)⋅ϵ)=df(Ω(ξ)⊞δ)δ⋅dlΩ(ξ⊞ϵ)dϵ⋅ϵ+df(Ω(ξ)⊞δ)δ⋅η(ϵ)⋅ϵ+μ(δϵ)⋅dlΩ(ξ⊞ϵ)dϵ⋅ϵ+μ(δϵ)⋅η(ϵ)⋅ϵ=df(Ω(ξ)⊞δ)δ⋅dlΩ(ξ⊞ϵ)dϵ⋅ϵ+(df(Ω(ξ)⊞δ)δ⋅η(ϵ)+μ(δϵ)⋅dlΩ(ξ⊞ϵ)dϵ+μ(δϵ)⋅η(ϵ))︸=:γ(ϵ)⋅ϵ. When ϵ goes to 0, then δϵ also goes to 0 (as can be seen from its definition). Using this it follows by the definition of the derivative that η(ϵ) and μ(δϵ) go to 0 as well. This shows that γ(ϵ) goes to 0 when ϵ goes to 0. Therefore the last line of Eq. (7.65) is in line with our definition of the derivative and df(Ω(ξ⊞ϵ))dϵ=df(Ω(ξ)⊞δ)δ⋅dlΩ(ξ⊞ϵ)ϵ. □ In this section we will show how to derive the Jacobians ∂Ψ(ξDcam_w⊞ϵ,ξm_d)∂ϵ using the implicit definition of the derivative shown in Eq. (7.56). In order to do this we need the definition of the adjoint. T⋅exp(ϵ)=exp(AdjT⋅ϵ)⋅T with T∈SIM(3). It follows that log(T⋅exp(ϵ)⋅T−1)=AdjT⋅ϵ. Using this we can compute Ψ(ξDcam_w,ξm_d)−1⊞Ψ(ξDcam_w⊞ϵ,ξm_d)(7.55)=(ξm_d⊞(ξDcam_w)−1⊞ξ−1m_d⊞ξMcam_imu)−1⊞(ξm_d⊞(ξDcam_w⊞ϵ)−1⊞ξ−1m_d⊞ξMcam_imu)=(ξMcam_imu)−1⊞ξm_d⊞ξDcam_w⊞ξ−1m_d⊞ξm_d︸=0⊞ϵ−1⊞(ξDcam_w)−1⊞ξ−1m_d⊞ξMcam_imu=log((TMcam_imu)−1⋅Tm_d⋅TDcam_w⋅exp(ϵ)−1⋅(TDcam_w)−1⋅T−1m_d⋅TMcam_imu)(7.68)=Adj((TMcam_imu)−1⋅Tm_d⋅TDcam_w)⋅(−ϵ). After moving the minus sign to the left this is in line with our definition of the derivative in Eq. (7.56) and it follows that ∂Ψ(ξDcam_w⊞ϵ,ξm_d)∂ϵ=−Adj((TMcam_imu)−1⋅Tm_d⋅TDcam_w). In order to derive the Jacobian ∂Ψ(ξDcam_w,ξm_d⊞ϵ)∂ϵ we need the Baker–Campbell–Hausdorff formula: Let a,b∈sim(3), then log(exp(a)⋅exp(b))=a+b+12[a,b]+112([a,[a,b]]+[b,[b,a]])+148([b,[a,[b,a]]]+[a,[b,[b,a]]])+.... Here [a,b]:=ab−ba denotes the Lie bracket. In this section we will omit the superscripts to simplify the notation. We have Ψ(ξcam_w,ξm_d)−1⊞Ψ(ξcam_w,ξm_d⊞ϵ)(7.55)=(ξm_d⊞ξ−1cam_w⊞ξ−1m_d⊞ξcam_imu)−1⊞(ξm_d⊞ϵ⊞ξ−1cam_w⊞(ξm_d⊞ϵ)−1⊞ξcam_imu)=ξ−1cam_imu⊞ξm_d⊞ξcam_w⊞ξ−1m_d⊞ξm_d⊞ϵ⊞ξ−1cam_w⊞ϵ−1⊞ξ−1m_d⊞ξcam_imu=ξ−1cam_imu⊞ξm_d⊞ξcam_w⊞ϵ⊞ξ−1cam_w⊞(ξ−1cam_imu⊞ξm_d)−1⊞(ξ−1cam_imu⊞ξm_d)︸=0⊞ϵ−1⊞ξ−1m_d⊞ξcam_imu=ξ−1cam_imu⊞ξm_d⊞ξcam_w⊞ϵ⊞ξ−1cam_w⊞ξ−1m_d⊞ξcam_imu︸:=a⊞ξ−1cam_imu⊞ξm_d⊞ϵ−1⊞ξ−1m_d⊞ξcam_imu︸:=b. Using Eq. (7.68) we can compute a=Adj(T−1cam_imu⋅Tm_d⋅Tcam_w)⋅ϵ, and b=−Adj(T−1cam_imu⋅Tm_d)⋅ϵ. Now we have to prove that all terms of the Baker–Campbell–Hausdorff formula which contain a Lie bracket can be written as μ(ϵ)⋅ϵ, where μ(ϵ) goes to zero, when ϵ goes to zero. We can compute a⋅b=Adj(T−1cam_imu⋅Tm_d⋅Tcam_w)⋅ϵ⋅(−Adj(T−1cam_imu⋅Tm_d))︸μ1(ϵ)⋅ϵ, b⋅a=−Adj(T−1cam_imu⋅Tm_d)⋅ϵ⋅Adj(T−1cam_imu⋅Tm_d⋅Tcam_w)︸μ2(ϵ)⋅ϵ, [a,b]=ab−ba=(μ1(ϵ)+μ2(ϵ))⋅ϵ. Obviously μ1(ϵ) and μ2(ϵ) go to zero when ϵ goes to zero. For the remaining summands of Eq. (7.71) the same argumentation can be used. It follows that (7.72)=a+b+μ(ϵ)⋅ϵ=(Adj(T−1cam_imu⋅Tm_d⋅Tcam_w)−Adj(T−1cam_imu⋅Tm_d)⋅ϵ)⋅ϵ+μ(ϵ)⋅ϵ where μ(ϵ) goes to zero when ϵ goes to zero. According to (7.56) this means that ∂Ψ(ξDcam_w,ξm_d⊞ϵ)∂ϵ=Adj(T−1cam_imu⋅Tm_d⋅Tcam_w)−Adj(T−1cam_imu⋅Tm_d).7.6.1 Proof of the Chain Rule
7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)
7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)
We evaluate the described VI-DSO on the publicly available EuRoC dataset [5]. The performance is compared to [4], [20], [10], [24], [22], [32] and [33]. We also recommend watching the video at vision.in.tum.de/vi-dso.
In order to obtain an accurate evaluation we run our method 10 times for each sequence of the dataset (using the left camera). We directly compare the results to visual-only DSO [4] and ROVIO [20]. As DSO cannot observe the scale we evaluate using the optimal ground truth scale in some plots (with the description “gt-scaled”) to enable a fair comparison. For all other results we scale the trajectory with the final scale estimate (our method) or with 1 (other methods). For DSO we use the results published together with their paper. We use the same start and end times for each sequence to run our method and ROVIO. Note that the drone has a high initial velocity in some sequences when using these start times making it especially challenging for our IMU initialization. Fig. 7.9 shows the root mean square error (RMSE) for every run and Fig. 7.10 displays the cumulative error plot. Clearly our method significantly outperforms DSO and ROVIO. Without inertial data DSO is not able to work on all sequences especially on V1_03_difficult and V2_03_difficult and it is also not able to scale the results correctly. ROVIO on the other hand is very robust but as a filtering-based method it cannot provide sufficient accuracy.
Table 7.1 shows a comparison to several other methods. For our results we have displayed the median error for each sequence from the 10 runs plotted in Fig. 7.9C. This makes the results very meaningful. For the other methods unfortunately only one result was reported so we have to assume that they are representative as well. The results for [22] and [32] were taken from [32] and the results for visual–inertial ORB-SLAM were taken from [10]. For VINS-Mono [26] the results were taken from their paper [33]. The results for [10] differ slightly from the other methods as they show the error of the keyframe trajectory instead of the full trajectory. This is a slight advantage as keyframes are bundle adjusted in their method which does not happen for the other frames.
Table 7.1
Accuracy of the estimated trajectory on the EuRoC dataset for several methods. Note that ORB-SLAM does a convincing job showing leading performance on some of the sequences. Nevertheless, since VI-DSO directly works on the sensor data (colors and IMU measurements), we observe similar precision and a better robustness—even without loop closuring. Moreover, the proposed method is the only one not to fail on any of the sequences (except ROVIO).
Sequence | MH1 | MH2 | MH3 | MH4 | MH5 | V11 | V12 | V13 | V21 | V22 | V23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
VI-DSO (RT) (median of 10 runs each) | RMSE | 0.062 | 0.044 | 0.117 | 0.132 | 0.121 | 0.059 | 0.067 | 0.096 | 0.040 | 0.062 | 0.174 |
RMSE gt-scaled | 0.041 | 0.041 | 0.116 | 0.129 | 0.106 | 0.057 | 0.066 | 0.095 | 0.031 | 0.060 | 0.173 | |
Scale Error (%) | 1.1 | 0.5 | 0.4 | 0.2 | 0.8 | 1.1 | 1.1 | 0.8 | 1.2 | 0.3 | 0.4 | |
VI ORB-SLAM (keyframe trajectory) | RMSE | 0.075 | 0.084 | 0.087 | 0.217 | 0.082 | 0.027 | 0.028 | X | 0.032 | 0.041 | 0.074 |
RMSE gt-scaled | 0.072 | 0.078 | 0.067 | 0.081 | 0.077 | 0.019 | 0.024 | X | 0.031 | 0.026 | 0.073 | |
Scale Error (%) | 0.5 | 0.8 | 1.5 | 3.4 | 0.5 | 0.9 | 0.8 | X | 0.2 | 1.4 | 0.7 | |
VINS-MONO [33] | RMSE | 0.12 | 0.12 | 0.13 | 0.18 | 0.21 | 0.068 | 0.084 | 0.19 | 0.081 | 0.16 | X |
VI odometry [22], mono | RMSE | 0.34 | 0.36 | 0.30 | 0.48 | 0.47 | 0.12 | 0.16 | 0.24 | 0.12 | 0.22 | X |
VI odometry [22], stereo | RMSE | 0.23 | 0.15 | 0.23 | 0.32 | 0.36 | 0.04 | 0.08 | 0.13 | 0.10 | 0.17 | X |
VI SLAM [32], mono | RMSE | 0.25 | 0.18 | 0.21 | 0.30 | 0.35 | 0.11 | 0.13 | 0.20 | 0.12 | 0.20 | X |
VI SLAM [32], stereo | RMSE | 0.11 | 0.09 | 0.19 | 0.27 | 0.23 | 0.04 | 0.05 | 0.11 | 0.10 | 0.18 | X |
In comparison with VI ORB-SLAM our method outperforms it in terms of RMSE on several sequences. As ORB-SLAM is a SLAM system while ours is a pure odometry method this is a remarkable achievement especially considering the differences in the evaluation. Note that the Vicon room sequences (V*) are executed in a small room and contain a lot of loopy motions where the loop closures done by a SLAM system significantly improve the performance. Also our method is more robust as ORB-SLAM fails to track one sequence. Even considering only sequences where ORB-SLAM works our approach has a lower maximum RMSE.
Compared to [22] and [32] our method obviously outperforms them. It is better than the monocular versions on every single sequence and it beats even the stereo and SLAM-versions on 9 out of 11 sequences. Also VINS-Mono [33] is outperformed by VI-DSO on every sequence of the EuRoC dataset.
In summary our method is the only one which is able to track all the sequences successfully except ROVIO.
We also compare the relative pose error with [10] and [24] on the V1_0*-sequences of EuRoC (Figs. 7.11 and 7.12). While our method cannot beat the SLAM system and the stereo method on the easy sequence we outperform [24] and we are as good as [10] on the medium sequence. On the hard sequence we outperform both of the contenders even though we neither use stereo nor loop closures.
Especially in the initialization there is a difference in how methods are evaluated on the EuRoC dataset. The machine hall sequences (MH*) all start with a part where the drone is shaken manually and then put onto the floor again. A few seconds later the drone is started and the main part of the sequence begins. In our evaluation we follow visual–inertial ORB-SLAM [10] and monocular DSO [4] and crop the first part of the sequence and only use the part where the drone is flying. On the other hand VINS-Mono [33] uses the whole sequence including the part where the drone is shaken. Note that both starting times involve different challenges for visual–inertial initialization: Using the first part of the sequence is a challenge especially for the visual part, as the drone is accelerated very quickly and there is a lot of motion blur involved. On the other hand the motions in this first part are almost purely translational and are quite fast, which is perfect for a fast initialization of scale and IMU biases.
Using only the cropped sequence makes the task for the visual initialization relatively easy, because the motions are relatively slow. However, the task of visual–inertial initialization becomes very hard, because it takes quite a lot of time for the parameters (especially the scale) to become observable.
This difference can also explain why the initializer of visual–inertial ORB-SLAM [10] needs more than 10 seconds on some sequences of the EuRoC dataset, whereas the initializer of VINS-Mono [33] is much faster. The reason is most likely not so much the algorithm, but more how the dataset is used.
In this comparison we will focus on initialization methods that use only the cropped part of the machine hall sequences—namely VI ORB-SLAM, as only then the problem of late scale observability is present. We do note, however, that an extensive evaluation of different initialization methods has not been done yet and would be a great contribution at this point.
In comparison to [10] our estimated scale is better overall (Table 7.1). On most sequences our method provides a better scale, and our average scale error (0.7% compared to 1.0%) as well as our maximum scale error (1.2% compared to 3.4%) is lower. In addition our method is more robust as the initialization procedure of [10] fails on V1_03_difficult.
Apart from the numbers we argue that our approach is superior in terms of the general structure. While [10] have to wait for 15 seconds until the initialization is performed, our method provides an approximate scale and gravity direction almost instantly that gets enhanced over time. Whereas in [10] the pose estimation has to work for 15 seconds without any IMU data, in our method the inertial data is used to improve the pose estimation from the beginning. This is probably one of the reasons why our method is able to process V1_03_difficult. Finally, our method is better suited for robotics applications. For example an autonomous drone is not able to fly without gravity direction and scale for 15 seconds and hope that afterwards the scale was observable. In contrast our method offers both of them right from the start. The continuous rescaling is also not a big problem as an application could use the unscaled measurements for building a consistent map and for providing flight goals, whereas the scaled measurements can be used for the controller. Finally, the measure c defined in Eq. (7.54) allows one to detect how accurate the current scale estimate is which can be particularly useful for robotics applications. Fig. 7.13 shows the estimated scale and c for the sequence MH_04 of the EuRoC dataset.
In order to further evaluate the method we provide several parameter studies in this section. As in the previous sections, we have run all variants 10 times for each sequence of the EuRoC dataset and present either the accumulation of all results or the median of the 10 runs. First of all we show the dependence on the number of points in Fig. 7.14. The main takeaway of this should be that because of the addition of inertial data, we can significantly reduce the number of points without losing tracking performance.
In order to evaluate the importance of the dynamic marginalization strategy we have replaced it with two alternatives (Fig. 7.15). For one example we have used normal marginalization instead where only one marginalization prior is used. Furthermore we have tried a simpler dynamic marginalization strategy, where we do not use Mhalf, but instead directly reset the marginalization prior with Mvisual, as soon as the scale interval is exceeded. Clearly dynamic marginalization yields the most robust result. Especially on sequences with a large initial scale error, the other strategies do not work well. Fig. 7.16 shows the difference in the scale convergence. When using the simple marginalization strategy the marginalization prior gets reset directly to Mvisual resulting in a slower scale convergence and oscillations (Figs. 7.16C and 7.16D). With a normal marginalization the scale estimate overshoots and it takes a long time for the system to compensate for the factors with a wrong scaled that were initially marginalized (Figs. 7.16E and 7.16F). With our dynamic marginalization implementation, however, the scale converges much faster and with almost no overshoot or oscillations (Figs. 7.16A and 7.16B).
We have also disabled the joint optimization of gravity direction in the main system (Fig. 7.15). Clearly the simple initialization of gravity direction (described in Sect. 7.5.2) is not sufficient to work well, without adding the direction of gravity to the model.
We have described a state-of-the-art approach to direct sparse visual–inertial odometry. The combination of direct image residuals with IMU data enables our system to outperform state of the art methods especially in terms of robustness. We explicitly include scale and gravity direction in our model in order to deal with cases where the scale is not immediately observable. As the initial scale can be very far from the optimum we have described a technique called dynamic marginalization where we maintain multiple marginalization priors and constrain the maximum scale difference. Extensive quantitative evaluation on the EuRoC dataset demonstrates that the described visual–inertial odometry method outperforms other state-of-the-art methods, both the complete system and the IMU initialization procedure. In particular, experiments confirm that inertial information not only provides a reliable scale estimate, but it also drastically increases precision and robustness.