Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 7

A Review and Quantitative Evaluation of Direct Visual–Inertial Odometry

Lukas von Stumberg^⁎^,^†; Vladyslav Usenko^⁎; Daniel Cremers^⁎^,^† ^⁎Department of Informatics, Technical University of Munich, Munich, Germany
^†Artisense GmbH, Garching, Germany

Abstract

Simultaneous location and mapping (SLAM) is an integral part of scene understanding. There have been two sensors that in combination have shown to be particularly useful for this task: cameras and inertial measurement units (IMUs). In the last years a lot of very powerful and precise purely visual SLAM methods have been proposed, yet all of them exhibit a lack of robustness especially to fast motion and textureless areas. This is where IMUs are a great addition. They can provide short-term motion constraints and in contrast to cameras do not contain outliers.

In order to complement vision in the most optimal way, IMU data needs to be injected into the vision algorithm on a very deep layer resulting in a tight integration. In this chapter we will review the fundamentals of visual–inertial sensor fusion and explain a current state-of-the-art method called visual–inertial direct sparse odometry (VI-DSO).

VI-DSO jointly estimates camera poses and sparse scene geometry by minimizing photometric and IMU measurement errors in a combined energy functional. IMU information is accumulated between several frames using measurement preintegration, and it is inserted into the optimization as an additional constraint. In VI-DSO, scale and gravity direction are explicitly included into the model and jointly optimized together with other variables such as poses. As the scale is often not immediately observable using IMU data this allows one to initialize the visual–inertial system with an arbitrary scale instead of having to delay the initialization until everything is observable.

We will also cover partial marginalization of old variables, which keeps the computation time bounded as we explore the environment. We also describe dynamic marginalization, which allows us to use partial marginalization even in cases where the initial scale estimate is far from the optimum. We evaluate our system and compare to other visual–inertial methods on a publicly available dataset.

Keywords

SLAM; Visual–inertial odometry; Sensor fusion

7.1 Introduction

One of the most important tasks in modern robotics is to determine the motion of the robot and a reconstruction of the environment. Applications can be found in robot navigation [1] and autonomous driving [2] [3]. There are several different sensors available for this task, but cameras offer several advantages. They are cheap, lightweight, small and passive, which explains why they have been used very frequently for this task during the last years. While current visual odometry methods are already capable of providing highly accurate pose-estimates, they are not optimal in terms of robustness yet. Especially fast motions and low-textured areas present challenges for current solutions. This is where a combination with an inertial measurement unit (IMU) can help. An IMU contains a gyroscope and an accelerometer, which measure rotational velocities and linear acceleration at a very high rate (typically more than 100 Hz). This provides accurate short-term motion constraints and, unlike vision, is not prone to outliers.

In this chapter we will describe a state-of-the-art direct and sparse approach to visual–inertial odometry. The system is based on Direct Sparse Odometry (DSO) which was developed by Engel et al. [4]. DSO represents the geometry with a sparse set of points, each associated with an estimated inverse depth. They are jointly optimized together with 3D poses and camera parameters in a combined energy functional. We will describe how to complement this function with an additional IMU error term. This greatly improves the robustness of the system, as the inertial error term is convex and not prone to outliers.

A key drawback of monocular visual odometry is that it is not possible to obtain the metric scale of the environment. Adding an IMU enables us to observe the scale. Yet, depending on the performed motions this can take infinitely long, making the monocular visual–inertial initialization a challenging task. Rather than relying on a separate IMU initialization we show how to include the scale as a variable into the model of our system and jointly optimize it together with the other parameters.

Quantitative evaluation on the EuRoC dataset [5] demonstrates that we can reliably determine camera motion and sparse 3D structure (in metric units) from a visual–inertial system on a rapidly moving micro aerial vehicle (MAV) despite challenging illumination conditions (Fig. 7.1).

Figure 7.1 Bottom: example images from the EuRoC dataset: low illumination, strong motion blur and little texture impose significant challenges for odometry estimation. Still our method is able to process all sequences with a RMSE of less than 0.23 m. Top: reconstruction, estimated pose (red camera) and ground-truth pose (green camera) at the end of V1_03_difficult.

Note that this book chapter builds up on and significantly extends the paper [6], which was presented at the International Conference on Robotics and Automation (ICRA) 2018.

7.2 Related Work

Motion estimation in 3D is a very important topic in robotics, and an IMU can be very useful when combined with vision data properly. Consequently there have been many works revolving around this topic in the past. In the following subsections we will present a short summary of visual and visual–inertial odometry methods.

When combining a monocular camera with IMU measurements a challenging task is the visual–inertial initialization. Therefore we will also discuss existing approaches, where the scale of the scene, initial orientation and velocity are not known from the beginning.

7.2.1 Visual Odometry

The task of incrementally tracking the camera motion from video is commonly called visual odometry. It was established by Nister et al. [7]. They proposed to match a sparse set of features from frame to frame, yielding multiple observations of the same point in different images. The 3D camera motion is then computed by minimizing the reprojection error. Afterwards there were many other works based on this feature-based approach. The general idea is to extract and track features over the images and then minimize the reprojection error of them. Later works like MonoSLAM [8] and PTAM [9] added the ability to create a consistent map in real-time. A very recent example is ORB-SLAM [10]. It is capable of tracking the camera motion and creating a consistent map with the ability to close trajectory loops and re-localization, outperforming previous methods in accuracy and robustness.

Another approach to visual odometry are the direct methods. Instead of the tracking features and minimizing the reprojection error, they directly use the image data to obtain the camera motion by optimizing a photometric error. In the work of [11] this error was minimized in real-time in a stereo setting by formulating quadrifocal constraints on the pixels. Especially for tracking with RGB-D cameras direct solutions were proposed by Newcombe et al. [12] and Kerl et al. [13]. Recently, several direct approaches were proposed, which use only a monocular camera. They differ in the density of the image information used. Newcombe et al. [14] proposed a dense approach which uses the complete image for tracking. However, for monocular visual odometry mainly image regions with a sufficient gradient contribute information. Therefore Engel et al. [15] have presented a semi-dense SLAM system exploiting this fact. Although sparse methods have traditionally been keypoint-based, direct sparse methods have advantages discussed in two recent publications. SVO proposed by Forster et al. [16] uses a keypoint-based approach for mapping keyframes and performs sparse direct image alignment to track non-keyframes. In contrast DSO proposed by Engel et al. [4] is a purely direct and sparse system. By using a sparse set of points it can efficiently perform bundle adjustment on a window of keyframes yielding superior accuracy and robustness compared to the semi-dense approach.

7.2.2 Visual–Inertial Odometry

IMU sensors are in many ways complementary to vision. The accelerometer and gyroscope measurements provide good short-term motion information. Furthermore they make scale as well as the roll and pitch angles observable, which is relevant for some control tasks, like the navigation of drones. Therefore there were several methods which combine them with vision. They can be classified into two general approaches.

The first one is the so-called loosely coupled fusion. Here the vision system is used as a blackbox, providing pose measurements, which are then combined with the inertial data, usually using an (extended or unscented) Kalman filter. Examples of this are shown in [1], [17] and [18].

Recently, tightly coupled works became more popular. Here the motion of the camera is computed by jointly optimizing the parameters in a combined energy function consisting of a visual and an inertial error term. The advantage is that the two measurement sources are combined on a very deep level of the system yielding a very high accuracy and especially robustness. Especially for vision systems that are based on nonlinear optimization the inertial error term can help overcome nonlinearities in the energy functional. The disadvantage is that the tightly coupled fusion has to be done differently for different vision systems making it much harder to implement. Still a tightly-coupled fusion has been done for filtering-based approaches [19] [20] [21] as well as for energy-minimization based systems [22] [23] [24] [25] [26].

Especially for monocular energy-minimization based approaches it is important to start with a good visual–inertial initialization. In practical use-cases the system does not have any prior knowledge about initial velocities, poses and geometry. Therefore apart from the usual visual initialization procedure most visual–inertial systems need a specific visual–inertial initialization. However, the problem is even more difficult, because several types of motion do not allow one to uniquely determine all variables. In [27] a closed-form solution for visual–inertial initialization was presented, as well as an analysis of the exceptional cases. It was later extended to work with IMU biases by [28]. In [25] a visual–inertial initialization procedure was proposed that works on the challenging EuRoC dataset. They use 15 seconds of flight data for initialization to make sure that all variables are observable on this specific dataset. Also in [26] and [29] IMU initialization methods are proposed which work on the EuRoC dataset, the latter even works without prior initialization of IMU extrinsics.

7.3 Background: Nonlinear Optimization and Lie Groups

7.3.1 Gauss–Newton Algorithm

The goal of the Gauss–Newton algorithm is to solve optimization problems of the form

$\min_{x} E (x) = \min_{x} \sum_{i} w_{i} \cdot r_{i} {(x)}^{2}$

(7.1)

where each $r_{i}$ is a residual and $w_{i}$ is the associated weight.

We will start with an explanation of the general Newton methods. The idea is based on a Taylor approximation of the energy function around $x_{0}$ ,

$E (x) \approx E (x_{0}) + b^{T} (x - x_{0}) + \frac{1}{2} {(x - x_{0})}^{T} H (x - x_{0})$

(7.2)

where $b = \frac{d E}{d x}$ is the gradient and $H = \frac{d^{2} E}{d x^{2}}$ is the Hessian.

Using this approximation the optimal solution is where

$0 \overset{!}{=} \frac{d E}{d x} = b + H (x - x_{0}) .$

(7.3)

The optimal solution is therefore at

$x = x_{0} - H^{- 1} \cdot b .$

(7.4)

Of course, this is not the exact solution because we used an approximation to derive it. Therefore the optimization is performed iteratively by starting with an initial solution $x_{0}$ and successively calculating

$x_{t + 1} = x_{t} - λ \cdot H^{- 1} \cdot b$

(7.5)

where λ is the step size.

Calculating the Hessian is very time-intensive. Therefore the idea of the Gauss–Newton method is to use an approximation for it instead. In the case where the problem has the form of Eq. (7.1) we can compute

$b_{j} = 2 \sum_{i} w_{i} \cdot r_{i} (x) \frac{\partial r_{i}}{\partial x_{j}},$

(7.6)

$H_{j k} = 2 \sum_{i} w_{i} (\frac{\partial r_{i}}{\partial x_{j}} \frac{\partial r_{i}}{\partial x_{k}} + r_{i} \frac{\partial^{2} r_{i}}{\partial x_{j} \partial x_{k}}) \approx 2 \sum_{i} w_{i} (\frac{\partial r_{i}}{\partial x_{j}} \frac{\partial r_{i}}{\partial x_{k}}) .$

(7.7)

We define the stacked residual vector r by stacking all residuals $r_{i}$ . The Jacobian J of r therefore contains at each element

$J_{i j} = \frac{\partial r_{i}}{x_{j}} .$

(7.8)

We also stack the weights into a diagonal matrix W.

Then the approximated Hessian can be calculated with

$H \approx 2 \sum_{i} w_{i} J_{i j} J_{i k} = 2 J^{T} W J .$

(7.9)

Similarly the gradient of the energy function is

$b = 2 J^{T} W r .$

(7.10)

In summary the Gauss–Newton algorithm iteratively computes updates using Eq. (7.5), where the Hessian is approximated using Eq. (7.9).

7.3.2 Levenberg–Marquandt Algorithm

When initialized to far off from the optimum it can happen that the Gauss–Newton algorithm does not converge. The idea of the Levenberg–Marquandt algorithm is to dampen the Gauss–Newton updates. It still uses the same definition of H and b as the Gauss–Newton algorithm.

When computing the iterative update, however, it uses

$x_{t + 1} = x_{t} - {(H + λ \cdot diag (H))}^{- 1} \cdot b .$

(7.11)

The rough idea is to create a mixture between the Gauss–Newton algorithm and a simple gradient descent. Intuitively for a value of $λ = 0$ this will degenerate to a normal Gauss–Newton step and for a high value it will be dampened to something similar to gradient descent.

Clearly there is no optimal value for λ so it is usually adapted on the fly. After each update it is calculated whether or not the update reduces the energy (and the update is discarded if not). When an update is discarded λ is increased and when it is accepted λ gets reduced.

Compared to the Gauss–Newton algorithm the Levenberg–Marquandt method is in general more robust to a bad initialization.

7.4 Background: Direct Sparse Odometry

Direct sparse odometry (DSO) is a method that was very recently proposed by Engel et al. [4] (see Fig. 7.2). The novelty is that it optimizes a direct error term but still uses only a sparse set of points making it very accurate and at the same time real-time capable. The main advantage over previous direct methods like LSD-SLAM [15] is that it performs a bundle adjustment like optimization, which is computationally feasible because the set of points is sparse. In this section we will explain the theory behind DSO as well as the general structure.

Figure 7.2 Direct sparse odometry (DSO) running on the EuRoC dataset.

7.4.1 Notation

Throughout the chapter we will use the following notation: bold upper case letters H represent matrices, bold lower case x vectors and light lower case λ represent scalars. Transformations between coordinate frames are denoted as $T_{i_j} \in SE (3)$ where point in coordinate frame i can be transformed to the coordinate frame j using the following equation: $p_{i} = T_{i_j} p_{j}$ . We denote Lie algebra elements as $\hat{ξ} \in se (3)$ , where $ξ \in R^{6}$ , and use them to apply small increments to the 6D pose $ξ_{i_j}^{'} = ξ_{i_j} ⊞ ξ : = \log {(e^{{\hat{ξ}}_{i_j}} \cdot e^{\hat{ξ}})}^{\lor}$ . We define the world as a fixed inertial coordinate frame with gravity acting along the negative Z axis. We also assume that the transformation from camera to IMU frame $T_{imu_cam}$ is fixed and calibrated in advance. Factor graphs are expressed as a set G of factors and we use $G_{1} \cup G_{2}$ to denote a factor graph containing all factors that are either in $G_{1}$ or in $G_{2}$ .

7.4.2 Photometric Error

The main idea behind direct methods is that the intensity of a point should be constant over the images it is observed in. However, changes in illumination and photometric camera parameters such as vignetting, auto-exposure and response function can produce a change of the pixel value even when the real-world intensity is constant. To account for this, DSO can perform a photometric calibration and account for illumination changes in the cost function.

To do this it uses the following error definition. The photometric error of a point $p \in Ω_{i}$ in reference frame i observed in another frame j is defined as follows:

$E_{p j} = \sum_{p \in N_{p}} ω_{p} {‖ (I_{j} [p^{'}] - b_{j}) - \frac{t_{j} e^{a_{j}}}{t_{i} e^{a_{i}}} (I_{i} [p] - b_{i}) ‖}_{γ},$

(7.12)

where $N_{p}$ is a set of eight pixels around the point p, $I_{i}$ and $I_{j}$ are images of respective frames, $t_{i}$ , $t_{j}$ are the exposure times, $a_{i}$ , $b_{i}$ , $a_{j}$ , $b_{j}$ are the coefficients to correct for affine illumination changes, γ is the Huber norm, $ω_{p}$ is a gradient-dependent weighting and $p^{'}$ is the point projected into $I_{j}$ .

The key idea here is that brightness constancy of a pixel is assumed when observed in different frames. However, the method compensates for illumination changes in the image, the vignette of the camera and the response function and the exposure time.

With that we can formulate the photometric error as

$E_{photo} = \sum_{i \in F} \sum_{p \in P_{i}} \sum_{j \in obs (p)} E_{p j},$

(7.13)

where $F$ is a set of keyframes that we are optimizing, $P_{i}$ is a sparse set of points in keyframe i, and $obs (p)$ is a set of observations of the same point in other keyframes.

7.4.3 Interaction Between Coarse Tracking and Joint Optimization

For the accuracy of visual odometry it is important to optimize geometry and camera poses jointly. In DSO this can be done efficiently because the Hessian matrix of the system is sparse (see Sect. 7.4.5). However, the joint optimization still takes relatively much time which is why it cannot be done for each frame in real-time. Therefore the joint optimization is only performed for keyframes.

As we want to obtain poses for all frames however, we also perform a coarse tracking. Contrary to the joint estimation, we fix all point depths in the coarse tracking and only optimize the pose of the newest frame, as well as its affine illumination parameters ( $a_{i}$ and $b_{i}$ ). This is done using normal direct image alignment in a pyramid scheme where we track against the latest keyframe (see Sect. 7.4.4).

In summary the system has the structure depicted in Fig. 7.3. For each frame we perform the coarse tracking yielding a relative pose of the newest frame with respect to the last keyframe. For each frame we also track candidate points (during the “trace points” procedure). This serves as an initialization when some of the candidate points are later activated during the joint optimization.

Figure 7.3 Structure of DSO. The coarse tracking is run for every frame and tracks the current image against the reference frame. Meanwhile the joint optimization is performed only when a new keyframe is created. Note that for the real-time-version the coarse tracking runs in a separate thread.

For some frames we create a new keyframe, depending on the distance and visual change since the last keyframe. Then the joint estimation is performed, optimizing the poses of all active keyframes as well as the geometry. In the real-time version of the system coarse tracking and joint optimization are performed in parallel in two separate threads. Active keyframes are all keyframes which have not been marginalized yet.

7.4.4 Coarse Tracking Using Direct Image Alignment

The coarse tracking is based on normal direct image alignment.

In order to speed up computation all active points are projected into the frame of the latest keyframe and tracking is performed only with reference to this keyframe.

As in the rest of the system the coarse tracking minimizes the photometric error in order to get a pose estimate. The optimized parameters include the 6 DOF pose which is represented as the transformation from the new frame to the reference frame.

This error is minimized using the Levenberg–Marquandt algorithm. In order to increase the convergence radius the well known pyramid scheme is used. This means that the image is scaled down multiple times (using Gaussian blurring). When tracking, at first the lowest resolution is used. After convergence the optimization is proceeded with the next level of the image pyramid (using the result from the previous level as an initialization). This is done until the original image is reached and the parameters are optimized using it.

This procedure has two advantages. First, tracking on a low resolution first is very fast, as less pixels have to be handled. The main advantage, however, is that it increases the convergence radius. The Taylor approximation of the energy, which in practice uses the image gradient, is only valid in the vicinity of the optimum. When using a smaller-scaled image the gradient contains information from a larger region (in terms of the original image), and therefore increases the convergence basin.

7.4.5 Joint Optimization

In contrast to the coarse tracking the joint optimization not only optimizes poses, but also the geometry, making the task significantly more difficult.

However, here all variables are initialized close to the optimum: The approximate pose of the newest frame is obtained from the previously executed coarse tracking. Newly activated points have a depth estimated by the tracing-procedure and all other variables have already been optimized in the previous execution of the joint optimization. Therefore DSO uses the Gauss–Newton algorithm for this task (instead of Levenberg–Marquandt), and no image pyramid.

When looking at the update step

$δ x = H^{- 1} \cdot b$

(7.14)

performed in each iteration it is clear that the most calculation-intensive part is inverting the Hessian matrix.

However, we can use the sparsity of the structure (see Fig. 7.4). The reason for this specific structure is that each residual only depends on the pose of two frames, the affine lighting parameters and one point depth (see also Eq. (7.12)). Therefore there is no residual depending on two different points, meaning that the off-diagonal part of $J^{T} J$ is 0 for all depths.

Figure 7.4 Structure of the Hessian in the joint optimization. While the pose-pose and pose-depths correlations can be dense, the depth-depth-part of the Hessian is purely diagonal.

This information can be used to efficiently solve the problem. We can partition the linear system into

$[\begin{matrix} H_{α α} & H_{α β} \\ H_{β α} & H_{β β} \end{matrix}] [\begin{matrix} δ x_{α} \\ δ x_{β} \end{matrix}] = [\begin{matrix} b_{α} \\ b_{β} \end{matrix}]$

(7.15)

where $x_{β}$ are the depth variables and $x_{α}$ are all other variables. Note that this means that $H_{β β}$ is a diagonal matrix.

We can now apply the Schur complements, which yields the new linear system $\hat{H_{α α}} x_{α} = \hat{b_{α}}$ , with

$\hat{H_{α α}} = H_{α α} - H_{α β} H_{β β}^{- 1} H_{β α},$

(7.16)

$\hat{b_{α}} = b_{α} - H_{α β} H_{β β}^{- 1} b_{β} .$

(7.17)

In this procedure the only inversion needed is $H_{β β}^{- 1}$ , which is trivial because $H_{β β}$ is diagonal.

The resulting $\hat{H_{α α}}$ is much smaller than the original H because there are typically much more points, than other variables. With it we can efficiently compute the update for the non-depth variables with

$δ x_{α} = {\hat{H_{α α}}}^{- 1} \cdot \hat{b_{α}} .$

(7.18)

The update step of for the depth variables is done by reinserting

$H_{β α} δ x_{α} + H_{β β} δ x_{β} = b_{β} \Leftrightarrow δ x_{β} = H_{β β}^{- 1} (b_{β} - H_{β α} δ x_{α}) .$

(7.19)

7.5 Direct Sparse Visual–Inertial Odometry

The visual–inertial odometry system we describe in this chapter was proposed in [6]. It is based on DSO [4] and could also be viewed as a direct formulation of [22]. The approach differs from [24] and [23] by jointly estimating poses and geometry in a bundle adjustment like optimization.

As mentioned in the introduction it is important to integrate IMU data on a very tight level to the system. In the odometry system poses and geometry are estimated by minimizing a photometric error function $E_{photo}$ . We integrate IMU data by instead optimizing the following energy function:

$E_{total} = λ \cdot E_{photo} + E_{inertial} .$

(7.20)

The summands $E_{photo}$ and $E_{inertial}$ are explained in Sects. 7.4.2 and 7.5.1, respectively.

The system contains two main parts running in parallel:

• The coarse tracking is executed for every frame and uses direct image alignment combined with an inertial error term to estimate the transformation between the new frame and the latest keyframe.
• When a new keyframe is created we perform a visual–inertial bundle adjustment like optimization that estimates the poses of all active keyframes as well as the geometry.

In contrast to [25] this method does not wait for a fixed amount of time before initializing the visual–inertial system but instead jointly optimizes all parameters including the scale. This yields a higher robustness as inertial measurements are used right from the beginning.

7.5.1 Inertial Error

In this section we derive the nonlinear dynamic model that we use to construct the error term, which depends on rotational velocities measured by gyroscope and linear acceleration measured by accelerometer. We decompose the 6D pose from the state s into rotation and translational part:

$T_{w_i m u} = (\begin{matrix} R & p \\ 0 & 1 \end{matrix}),$

(7.21)

with that we can write the following dynamic model:

$\dot{p} = v,$

(7.22)

$\dot{v} = R (a_{z} + ϵ_{a} - b_{a}) + g,$

(7.23)

$\dot{R} = R {[ω_{z} + ϵ_{ω} - b_{ω}]}_{\times},$

(7.24)

${\dot{b}}_{a} = ϵ_{b, a},$

(7.25)

${\dot{b}}_{ω} = ϵ_{b, ω},$

(7.26)

where $ϵ_{a}$ , $ϵ_{ω}$ , $ϵ_{b, a}$ , and $ϵ_{b, ω}$ denote the Gaussian white noise that affects the measurements and $b_{a}$ and $b_{ω}$ denote slowly evolving biases. ${[\cdot]}_{\times}$ is the skew-symmetric matrix such that, for vectors a, b, ${[a]}_{\times} b = a \times b$ .

As IMU data is obtained with a much higher frequency than images, we follow the preintegration approach proposed in [30] and improved in [31] and [23]. We integrate the IMU measurements between timestamps i and j in the IMU coordinate frame and obtain pseudo-measurements $Δ p_{i \to j}$ , $Δ v_{i \to j}$ , and $R_{i \to j}$ .

We initialize pseudo-measurements with $Δ p_{i \to i} = 0$ , $Δ v_{i \to i} = 0$ , $R_{i \to i} = I$ , and assuming the time between IMU measurements is Δt we integrate the raw measurements:

$Δ p_{i \to k + 1} = Δ p_{i \to k} + Δ v_{i \to k} Δ t,$

(7.27)

$Δ v_{i \to k + 1} = Δ v_{i \to k} + R_{i \to k} (a_{z} - b_{a}) Δ t,$

(7.28)

$R_{i \to k + 1} = R_{i \to k} \exp ({[ω_{z} - b_{ω}]}_{\times} Δ t) .$

(7.29)

Given the initial state and integrated measurements the state at the next time-step can be predicted:

$p_{j} = p_{i} + (t_{j} - t_{i}) v_{i} + \frac{1}{2} {(t_{j} - t_{i})}^{2} g + R_{i} Δ p_{i \to j},$

(7.30)

$v_{j} = v_{i} + (t_{j} - t_{i}) g + R_{i} Δ v_{i \to j},$

(7.31)

$R_{j} = R_{i} R_{i \to j} .$

(7.32)

For the previous state $s_{i - 1}$ (based on the state definition in Eq. (7.39)) and IMU measurements $a_{i - 1}$ , $ω_{i - 1}$ between frames i and $i - 1$ , the method yields a prediction

${\hat{s}}_{i} : = h (ξ_{i - 1}, v_{i - 1}, b_{i - 1}, a_{i - 1}, ω_{i - 1})$

(7.33)

of the pose, velocity, and biases in frame i with associated covariance estimate ${\hat{Σ}}_{s, i}$ . Hence, the IMU error function terms are

$E_{inertial} (s_{i}, s_{j}) : = {(s_{j} ⊟ {\hat{s}}_{j})}^{T} {\hat{Σ}}_{s, j}^{- 1} (s_{j} ⊟ {\hat{s}}_{j}) .$

(7.34)

7.5.2 IMU Initialization and the Problem of Observability

In contrast to a purely monocular system the usage of inertial data enables us to observe more parameters such as scale and gravity direction. While this on the one hand is an advantage, on the other hand it also means that these values have to be correctly estimated as otherwise the IMU-integration will not work correctly. This is why other methods such as [25] need to estimate the scale and gravity direction in a separate initialization procedure before being able to use inertial measurements for tracking.

Especially the metric scale, however, imposes a problem as it is only observable after the camera has performed certain motions. An example of a degenerate case is when the system is initialized while the camera moves with a fixed speed that does not change for the first minute. Then the IMU will measure no acceleration (except gravity and the bias) for the first minute rendering it useless for scale estimation. This shows that in theory it can take an arbitrary amount of time until all parameters are observable. Reference [27, Tables I and II] contains an excellent summary, showing the different cases where a unique solution for the problem cannot be found. Although these completely degenerate scenarios usually do not occur, also in practice it can often take multiple seconds for the scale to be observable. In [25] the authors say that they wait for 15 seconds of camera motion in order to be sure that everything is observable when processing the popular EuRoC dataset. Obviously it is not optimal to wait for such a long time before using any inertial measurements for tracking especially as on other datasets it might take even longer. However, as explained previously, using a wrong scale will render the inertial data useless. Therefore in [6] a novel strategy for handling this issue has been proposed. We explicitly include scale (and gravity direction) as a parameter in our visual–inertial system and jointly optimize them together with the other values such as poses and geometry. This means that we can initialize with an arbitrary scale instead of waiting until it is observable. We initialize the various parameters as follows:

• We use the same visual initializer as [4] which computes a rough pose estimate between two frames as well as approximate depths for several points. They are normalized so that the average depth is 1.
• The initial gravity direction is computed by averaging up to 40 accelerometer measurements, yielding a sufficiently good estimate even in cases of high acceleration.
• We initialize the velocity and IMU-biases with zero and the scale with 1.0.

All these parameters are then jointly optimized during a bundle adjustment like optimization.

7.5.3 $SIM (3)$ -based Model

In order to be able to start tracking and mapping with a preliminary scale and gravity direction we need to include them into our model. Therefore in addition to the metric coordinate frame we define the DSO coordinate frame to be a scaled and rotated version of it. The transformation from the DSO frame to the metric frame is defined as $T_{m_d} \in {T \in SIM (3) | translation (T) = 0}$ , together with the corresponding $ξ_{m_d} = \log (T_{m_d}) \in sim (3)$ . We add a superscript D or M to all poses denoting in which coordinate frame they are expressed. In the optimization the photometric error is always evaluated in the DSO frame, making it independent of the scale and gravity direction. The inertial error on the other hand has to be evaluated in the metric frame.

7.5.4 Scale-Aware Visual–Inertial Optimization

We optimize the poses, IMU-biases and velocities of a fixed number of keyframes. Fig. 7.5A shows a factor graph of the problem. Note that there are in fact many separate visual factors connecting two keyframes each, which we have combined to one big factor connecting all the keyframes in this visualization. Each IMU-factor connects two subsequent keyframes using the preintegration scheme described in Sect. 7.5.1. As the error of the preintegration increases with the time between the keyframes we ensure that the time between two consecutive keyframes is not bigger than 0.5 seconds, which is similar to what [25] have done. Note that in contrast to their method, however, we allow the marginalization procedure described in Sect. 7.5.4.2 to violate this constraint which ensures that long-term relationships between keyframes can be properly observed.

Figure 7.5 Factor graphs for the visual–inertial joint optimization before and after the marginalization of a keyframe. (A) Factor graph for the visual–inertial joint optimization. (B) Factor graph after keyframe 1 has been marginalized.

An important property of our algorithm is that the optimized poses are not represented in the metric frame but in the DSO frame. This means that they do not depend on the scale of the environment. If we were optimizing them in the metric frame instead a change of the scale would mean that the translational part of all poses also changes which does not happen during nonlinear optimization.

7.5.4.1 Nonlinear optimization

We perform nonlinear optimization using the Gauss–Newton algorithm. For each active keyframe we define a state vector

$s_{i} : = {[{(ξ_{c a m_{i}_w}^{D})}^{T}, v_{i}^{T}, b_{i}^{T}, a_{i}, b_{i}, d_{i}^{1}, d_{i}^{2}, . . ., d_{i}^{m}]}^{T}$

(7.35)

where $v_{i} \in R^{3}$ is the velocity, $b_{i} \in R^{6}$ is the current IMU bias, $a_{i}$ and $b_{i}$ are the affine illumination parameters used in Eq. (7.12) and $d_{i}^{j}$ are the inverse depths of the points hosted in this keyframe.

The full state vector is then defined as

$s = {[c^{T}, ξ_{m_d}^{T}, s_{1}^{T}, s_{2}^{T}, . . ., s_{n}^{T}]}^{T}$

(7.36)

where c contains the geometric camera parameters and $ξ_{m_d}$ denotes the translation-free transformation between the DSO frame and the metric frame as defined in Sect. 7.5.3. In order to simplify the notation we define the operator $s ⊞ s^{'}$ to work on state vectors by applying the concatenation operation $ξ ⊞ ξ^{'}$ for Lie algebra components and a plain addition for other components.

Using the stacked residual vector r we define the Jacobian

$J = {\frac{d r (s ⊞ ϵ)}{d ϵ} |}_{ϵ = 0}$

(7.37)

where W is a diagonal weight matrix. Now we define $H = J^{T} W J$ and $b = - J^{T} W r$ where W is a diagonal weight matrix. Then the update that we compute is $δ = H^{- 1} b$ .

Note that the visual energy term $E_{photo}$ and the inertial error term $E_{imu}$ do not have common residuals. Therefore we can divide H and b each into two independent parts

$H = H_{photo} + H_{imu} and b = b_{photo} + b_{imu} .$

(7.38)

As the inertial residuals compare the current relative pose to the estimate from the inertial data they need to use poses in the metric frame relative to the IMU. Therefore we define additional state vectors (containing only IMU-relevant data) for the inertial residuals.

$s_{i}^{'} : = {[ξ_{w_i m u_{i}}^{M}, v_{i}, b_{i}]}^{T},$

(7.39)

$s^{'} = {[s_{1}^{' T}, s_{2}^{' T}, . . ., s_{n}^{' T}]}^{T} .$

(7.40)

The inertial residuals lead to

$H_{imu}^{'} = J_{imu}^{' T} W_{imu} J_{imu}^{'}$

(7.41)

$b_{imu}^{'} = - J_{imu}^{' T} W_{imu} r_{imu} .$

(7.42)

For the joint optimization, however, we need to obtain $H_{imu}$ and $b_{imu}$ based on the state definition in Eq. (7.36). As the two definitions mainly differ in their representation of the poses we can compute $J_{rel}$ such that

$H_{imu} = J_{rel}^{T} \cdot H_{imu}^{'} \cdot J_{rel} and b_{imu} = J_{rel}^{T} \cdot b_{imu}^{'} .$

(7.43)

The computation of $J_{rel}$ is detailed in Sect. 7.6. Note that we represent all transformations as elements of $sim (3)$ and fix the scale to 1 for all of them except $ξ_{m_d}$ .

7.5.4.2 Marginalization using the Schur complement

In order to compute Gauss–Newton updates in a reasonable time-frame we perform partial marginalization for older keyframes. This means that all variables corresponding to this keyframe (pose, bias, velocity and affine illumination parameters) are marginalized out using the Schur complements. Fig. 7.5B shows an example how the factor graph changes after the marginalization of a keyframe.

The marginalization of the visual factors is handled as in [4] by dropping residual terms that affect the sparsity of the system and by first marginalizing all points in the keyframe before marginalizing the keyframe itself.

We denote the set of variables that we want to marginalize as $s_{β}$ and the ones that are connected to them in the factor graph as $s_{α}$ . Using this the linear system becomes

$[\begin{matrix} H_{α α} & H_{α β} \\ H_{β α} & H_{β β} \end{matrix}] [\begin{matrix} s_{α} \\ s_{β} \end{matrix}] = [\begin{matrix} b_{α} \\ b_{β} \end{matrix}] .$

(7.44)

The application of the Schur complements yields the new linear system $\hat{H_{α α}} s_{α} = \hat{b_{α}}$ , with

$\hat{H_{α α}} = H_{α α} - H_{α β} H_{β β}^{- 1} H_{β α},$

(7.45)

$\hat{b_{α}} = b_{α} - H_{α β} H_{β β}^{- 1} b_{β} .$

(7.46)

Note that this factor resulting from the marginalization requires the linearization point of all connected variables in $s_{α}$ to remain fixed. However, we can still approximate the energy around further linearization points $s^{'} = s ⊞ Δ s$ using

$E (s^{'} ⊞ ϵ) = E (s ⊞ Δ s ⊞ ϵ) \approx E (s ⊞ (Δ s + ϵ)) \approx {(Δ s + ϵ)}^{T} \hat{H_{α α}} (Δ s + ϵ) + 2 {(Δ s + ϵ)}^{T} \hat{b_{α}} + const = Δ s^{T} \hat{H_{α α}} Δ s + Δ s^{T} \hat{H_{α α}} ϵ + ϵ^{T} \hat{H_{α α}} Δ s + ϵ^{T} \hat{H_{α α}} ϵ + 2 Δ s^{T} \hat{b_{α}} + 2 ϵ^{T} \hat{b_{α}} + const = ϵ^{T} \hat{H_{α α}} ϵ + 2 ϵ^{T} \underset{= : \hat{b_{α}^{'}}}{\underset{︸}{(\hat{b_{α}} + \hat{H_{α α}} Δ s)}} + const .$

(7.47)

In order to maintain consistency of the system it is important that Jacobians are all evaluated at the same value for variables that are connected to a marginalization factor as otherwise the nullspaces get eliminated. Therefore we apply “first estimates Jacobians”. For the visual factors we follow [4] and evaluate $J_{photo}$ and $J_{geo}$ at the linearization point. When computing the inertial factors we fix the evaluation point of $J_{rel}$ for all variables which are connected to a marginalization factor. Note that this always includes $ξ_{m_d}$ .

7.5.4.3 Dynamic marginalization for delayed scale convergence

The marginalization procedure described in Sect. 7.5.4.2 has two purposes: reduce the computation complexity of the optimization by removing old states and maintain the information about the previous states of the system. This procedure fixes the linearization points of the states connected to the old states, so they should already have a good estimate. In our scenario this is the case for all variables except of scale.

The main idea of “dynamic marginalization” is to maintain several marginalization priors at the same time and reset the one we currently use when the scale estimate moves too far from the linearization point in the marginalization prior. To get an idea of the method we strongly recommend watching the animation at https://youtu.be/GoqnXDS7jbA?t=2m6s.

In our implementation we use three marginalization priors: $M_{visual}$ , $M_{curr}$ and $M_{half}$ . $M_{visual}$ contains only scale independent information from previous states of the vision and cannot be used to infer the global scale. $M_{curr}$ contains all information since the time we set the linearization point for the scale and $M_{half}$ contains only the recent states that have a scale close to the current estimate.

When the scale estimate deviates too much from the linearization point of $M_{curr}$ , the value of $M_{curr}$ is set to $M_{half}$ and $M_{half}$ is set to $M_{visual}$ with corresponding changes in the linearization points. This ensures that the optimization always has some information about the previous states with consistent scale estimates. In the remaining part of the section we provide the details of our implementation.

We define $G_{metric}$ to contain only the visual–inertial factors (which depend on $ξ_{m_d}$ ) and $G_{visual}$ to contain all other factors, except the marginalization priors. Then

$G_{full} = G_{metric} \cup G_{visual} .$

(7.48)

Fig. 7.6 depicts the partitioning of the factor graph.

Figure 7.6 Partitioning of the factor graph from Fig. 7.5A into G_metric (A) and G_visual (B). G_metric contains all IMU-factors while G_visual contains the factors that do not depend on ξ_{m_d}. Note that both of them do not contain any marginalization factors.

We define three different marginalization factors $M_{curr}$ , $M_{visual}$ and $M_{half}$ . For the optimization we always compute updates using the graph

$G_{b a} = G_{metric} \cup G_{visual} \cup M_{curr} .$

(7.49)

When keyframe i is marginalized we update $M_{visual}$ with the factor arising from marginalizing frame i in $G_{visual} \cup M_{visual}$ . This means that $M_{visual}$ contains all marginalized visual factors and no marginalized inertial factors making it independent of the scale.

For each marginalized keyframe i we define

$s_{i} : = scale estimate at the time, i was marginalized.$

(7.50)

We define $i \in M$ if and only if M contains an inertial factor that was marginalized at time i. Using this we enforce the following constraints for inertial factors contained in $M_{curr}$ and $M_{half}$ :

$\forall i \in M_{curr} : s_{i} \in [s_{middle} / d_{i}, s_{middle} \cdot d_{i}],$

(7.51)

$\forall i \in M_{half} : s_{i} \in {\begin{matrix} [s_{middle}, s_{middle} \cdot d_{i}], & if s_{curr} > s_{middle}, \\ [s_{middle} / d_{i}, s_{middle}], & otherwise, \end{matrix}$

(7.52)

where $s_{middle}$ is the current middle of the allowed scale interval (initialized with $s_{0}$ ), $d_{i}$ is the size of the scale interval at time i, and $s_{curr}$ is the current scale estimate.

We update $M_{curr}$ by marginalizing frame i in $G_{ba}$ and we update $M_{half}$ by marginalizing i in $G_{metric} \cup G_{visual} \cup M_{half}$ .

In order to preserve the constraints in Eqs. (7.51) and (7.52) we apply Algorithm 1 every time a marginalization happens. By following these steps on the one hand we make sure that the constraints are satisfied which ensures that the scale difference in the currently used marginalization factor stays smaller than $d_{i}^{2}$ . On the other hand the factor always contains some inertial factors so that the scale estimation works at all times. Note also that $M_{curr}$ and $M_{half}$ have separate first estimate Jacobians that are employed when the respective marginalization factor is used. Fig. 7.7 shows how the system works in practice.

Figure 7.7 The scale estimation running on the V1_03_difficult sequence from the EuRoC dataset. We show the current scale estimate (bold blue), the groundtruth scale (bold red) and the current scale interval (light lines). The vertical dotted lines denote when the side changes (blue) and when the boundary of the scale interval is exceeded (red). In practice this means that M_curr contains the inertial factors since the last blue or red dotted line that is before the last red dotted line. For example at 16 s it contains all inertial data since the blue line at 9 seconds.

An important part of this strategy is the choice of $d_{i}$ . It should be small, in order to keep the system consistent, but not too small so that $M_{curr}$ always contains enough inertial factors. Therefore we chose to dynamically adjust the parameter as follows. At all time steps i we calculate

$d_{i} = \min {d_{min}^{j} | j \in N ∖ {0}, \frac{s_{i}}{s_{i - 1}} < d_{i}} .$

(7.53)

This ensures that it cannot happen that the $M_{half}$ gets reset to $M_{visual}$ at the same time that $M_{curr}$ is exchanged with $M_{half}$ . Therefore it prevents situations where $M_{curr}$ contains no inertial factors at all, making the scale estimation more reliable. On the one hand $d_{min}$ should not be too small as otherwise not enough inertial information is accumulated. On the other hand it should not be too big, because else the marginalization can get inconsistent. In our experiments we chose $d_{min} = \sqrt{1.1}$ which is a good tradeoff between the two.

7.5.4.4 Measuring scale convergence

For many applications it is very useful to know when the scale has converged, especially as this can take an arbitrary amount of time in theory. The lack of this measure is also one of the main drawbacks of the initialization method described in [25]. Here we propose a simple but effective strategy. With $s_{i}$ being the scale at time i and n being the maximum queue size (60 in all our experiments) we compute

$c = \frac{\max_{j : = i - n + 1}^{n} s_{j}}{\min_{j : = i - n + 1}^{n} s_{j}} - 1 .$

(7.54)

Fig. 7.13B shows a plot of this measure. When c is below a certain threshold $c_{min}$ (0.005 in our experiments) we consider the scale to have converged and fix $ξ_{m_d}$ . While this does not influence the accuracy of the system, fixing the scale might be useful for some applications.

7.5.5 Coarse Visual–Inertial Tracking

The coarse tracking is responsible for computing a fast pose estimate for each frame that also serves as an initialization for the joint estimation detailed in Sect. 7.5.4. We perform conventional direct image alignment between the current frame and the latest keyframe, while keeping the geometry and the scale fixed. The general factor graph for the coarse tracking is shown in Fig. 7.8A. Inertial residuals using the previously described IMU preintegration scheme are placed between subsequent frames. Note that in contrast to the factor graph for the joint optimization, the visual factor for the coarse tracking does not connect all keyframes and depths, but instead just connects each frame with the current keyframe.

Figure 7.8 Factor graphs for the coarse tracking. (A) General factor graph for the problem. (B) First pose estimate after new keyframe. (C) Second pose estimate. (D) Third pose estimate.

Every time the joint estimation is finished for a new frame, the coarse tracking is reinitialized with the new estimates for scale, gravity direction, bias, and velocity as well as the new keyframe as a reference for the visual factors. Similar to the joint estimation we perform partial marginalization to keep the update time constrained. After estimating the variables for a new frame we marginalize out all variables except the keyframe pose and the variables of the newest frame. Also we immediately marginalize the photometric parameters $a_{i}$ and $b_{i}$ as soon as the pose estimation for frame i is finished. Fig. 7.8B–D shows how the coarse factor graph changes over time. In contrast to the joint optimization we do not need to use dynamic marginalization because the scale is not included in the optimization.

7.6 Calculating the Relative Jacobians

As mentioned in Sect. 7.5.4.1 we want to compute $J_{rel}$ , such that (7.43) holds.

We convert between the poses using

$ξ_{w_i m u}^{M} = ξ_{m_d} ⊞ {(ξ_{c a m_w}^{D})}^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u}^{M} = : Ψ (ξ_{c a m_w}^{D}, ξ_{m_d}) .$

(7.55)

For any function $Ω (ξ) : sim (3) \to sim (3)$ we define the derivative $\frac{d Ω (ξ ⊞ ϵ)}{d ϵ}$ implicitly using

$Ω {(ξ)}^{- 1} ⊞ Ω (ξ ⊞ ϵ) = \frac{d Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + η (ϵ) \cdot ϵ$

(7.56)

where the error function $η (ϵ)$ goes to 0 as $| | ϵ | |$ goes to 0.

Note that in principle there are three other ways to define this derivative (as you can place the increment with ϵ as well as the multiplication with the inverse on either side). However, it can be shown (see Sect. 7.6.1) that only with this version the following chain rule holds for $f (ξ) : sim (3) \to R$ :

$\frac{d f (Ω (ξ ⊞ ϵ))}{d ϵ} = \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot \frac{d Ω (ξ ⊞ ϵ)}{ϵ} .$

(7.57)

Using these definitions the relevant derivatives for Ψ can be computed (see Sects. 7.6.2 and 7.6.3)

$\frac{\partial Ψ (ξ_{c a m_w}^{D} ⊞ ϵ, ξ_{m_d})}{\partial ϵ} = - Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}),$

(7.58)

$\frac{\partial Ψ (ξ_{c a m_w}^{D}, ξ_{m_d} ⊞ ϵ)}{\partial ϵ} = Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}) - Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d}) .$

(7.59)

Stacking these derivatives correctly we can compute a Jacobian $J_{rel}$ such that

$J_{imu} = J_{imu}^{'} \cdot J_{rel} .$

(7.60)

Using this we can finally compute

$H_{imu} = J_{imu}^{T} W_{imu} J_{imu} = J_{rel}^{T} \cdot {(J_{imu}^{'})}^{T} \cdot W_{imu} \cdot J_{imu}^{'} \cdot J_{rel} = J_{rel}^{T} \cdot H_{imu}^{'} \cdot J_{rel},$

(7.61)

$b_{imu} = - J_{imu}^{T} W_{imu} r_{imu} = - J_{rel}^{T} {(J_{imu}^{'})}^{T} W_{imu} r_{imu} = J_{rel}^{T} \cdot b_{imu}^{'} .$

(7.62)

7.6.1 Proof of the Chain Rule

In this section we will prove the chain rule (7.57).

We define the multivariate derivatives implicitly:

$f (Ω (ξ) ⊞ δ) - f (Ω (ξ)) = \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot δ + μ (δ) \cdot δ,$

(7.63)

where the error function $μ (δ)$ goes to 0 as δ goes to zero. We can rewrite Eq. (7.56) by multiplying $Ω (ξ)$ from the left,

$Ω (ξ ⊞ ϵ) = Ω (ξ) ⊞ (\frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + η (ϵ) \cdot ϵ) .$

(7.64)

Using this we can compute

$f (Ω (ξ ⊞ ϵ)) - f (Ω (ξ)) \overset{(7.64)}{=} f (Ω (ξ) ⊞ \underset{= : δ_{ϵ}}{\underset{︸}{(\frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + η (ϵ) \cdot ϵ)}}) - f (Ω (ξ)) = f (Ω (ξ) ⊞ δ_{ϵ}) - f (Ω (ξ)) \overset{(7.63)}{=} \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot δ_{ϵ} + μ (δ_{ϵ}) \cdot δ_{ϵ} = \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot (\frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + η (ϵ) \cdot ϵ) + μ (δ_{ϵ}) \cdot (\frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + η (ϵ) \cdot ϵ) = \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot \frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot η (ϵ) \cdot ϵ + μ (δ_{ϵ}) \cdot \frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + μ (δ_{ϵ}) \cdot η (ϵ) \cdot ϵ = \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot \frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} \cdot ϵ + \underset{= : γ (ϵ)}{\underset{︸}{(\frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot η (ϵ) + μ (δ_{ϵ}) \cdot \frac{d_{l} Ω (ξ ⊞ ϵ)}{d ϵ} + μ (δ_{ϵ}) \cdot η (ϵ))}} \cdot ϵ .$

(7.65)

When ϵ goes to 0, then $δ_{ϵ}$ also goes to 0 (as can be seen from its definition). Using this it follows by the definition of the derivative that $η (ϵ)$ and $μ (δ_{ϵ})$ go to 0 as well. This shows that $γ (ϵ)$ goes to 0 when ϵ goes to 0. Therefore the last line of Eq. (7.65) is in line with our definition of the derivative and

$\frac{d f (Ω (ξ ⊞ ϵ))}{d ϵ} = \frac{d f (Ω (ξ) ⊞ δ)}{δ} \cdot \frac{d_{l} Ω (ξ ⊞ ϵ)}{ϵ} .$

(7.66)

□

7.6.2 Derivation of the Jacobian with Respect to Pose in Eq. (7.58)

In this section we will show how to derive the Jacobians $\frac{\partial Ψ (ξ_{c a m_w}^{D} ⊞ ϵ, ξ_{m_d})}{\partial ϵ}$ using the implicit definition of the derivative shown in Eq. (7.56).

In order to do this we need the definition of the adjoint.

$T \cdot \exp (ϵ) = \exp ({Adj}_{T} \cdot ϵ) \cdot T$

(7.67)

with $T \in SIM (3)$ . It follows that

$\log (T \cdot \exp (ϵ) \cdot T^{- 1}) = {Adj}_{T} \cdot ϵ .$

(7.68)

Using this we can compute

$Ψ {(ξ_{c a m_w}^{D}, ξ_{m_d})}^{- 1} ⊞ Ψ (ξ_{c a m_w}^{D} ⊞ ϵ, ξ_{m_d}) \overset{(7.55)}{=} {(ξ_{m_d} ⊞ {(ξ_{c a m_w}^{D})}^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u}^{M})}^{- 1} ⊞ (ξ_{m_d} ⊞ {(ξ_{c a m_w}^{D} ⊞ ϵ)}^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u}^{M}) = {(ξ_{c a m_i m u}^{M})}^{- 1} ⊞ ξ_{m_d} ⊞ ξ_{c a m_w}^{D} ⊞ \underset{= 0}{\underset{︸}{ξ_{m_d}^{- 1} ⊞ ξ_{m_d}}} ⊞ ϵ^{- 1} ⊞ {(ξ_{c a m_w}^{D})}^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u}^{M} = \log ({(T_{c a m_i m u}^{M})}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}^{D} \cdot \exp {(ϵ)}^{- 1} \cdot {(T_{c a m_w}^{D})}^{- 1} \cdot T_{m_d}^{- 1} \cdot T_{c a m_i m u}^{M}) \overset{(7.68)}{=} Adj ({(T_{c a m_i m u}^{M})}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}^{D}) \cdot (- ϵ) .$

(7.69)

After moving the minus sign to the left this is in line with our definition of the derivative in Eq. (7.56) and it follows that

$\frac{\partial Ψ (ξ_{c a m_w}^{D} ⊞ ϵ, ξ_{m_d})}{\partial ϵ} = - Adj ({(T_{c a m_i m u}^{M})}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}^{D}) .$

(7.70)

7.6.3 Derivation of the Jacobian with Respect to Scale and Gravity Direction in Eq. (7.59)

In order to derive the Jacobian $\frac{\partial Ψ (ξ_{c a m_w}^{D}, ξ_{m_d} ⊞ ϵ)}{\partial ϵ}$ we need the Baker–Campbell–Hausdorff formula: Let $a, b \in sim (3)$ , then

$\log (\exp (a) \cdot \exp (b)) = a + b + \frac{1}{2} [a, b] + \frac{1}{12} ([a, [a, b]] + [b, [b, a]]) + \frac{1}{48} ([b, [a, [b, a]]] + [a, [b, [b, a]]]) + . . . .$

(7.71)

Here $[a, b] : = a b - b a$ denotes the Lie bracket.

In this section we will omit the superscripts to simplify the notation. We have

$Ψ {(ξ_{c a m_w}, ξ_{m_d})}^{- 1} ⊞ Ψ (ξ_{c a m_w}, ξ_{m_d} ⊞ ϵ) \overset{(7.55)}{=} {(ξ_{m_d} ⊞ ξ_{c a m_w}^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u})}^{- 1} ⊞ (ξ_{m_d} ⊞ ϵ ⊞ ξ_{c a m_w}^{- 1} ⊞ {(ξ_{m_d} ⊞ ϵ)}^{- 1} ⊞ ξ_{c a m_i m u}) = ξ_{c a m_i m u}^{- 1} ⊞ ξ_{m_d} ⊞ ξ_{c a m_w} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{m_d} ⊞ ϵ ⊞ ξ_{c a m_w}^{- 1} ⊞ ϵ^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u} = ξ_{c a m_i m u}^{- 1} ⊞ ξ_{m_d} ⊞ ξ_{c a m_w} ⊞ ϵ ⊞ ξ_{c a m_w}^{- 1} ⊞ \underset{= 0}{\underset{︸}{{(ξ_{c a m_i m u}^{- 1} ⊞ ξ_{m_d})}^{- 1} ⊞ (ξ_{c a m_i m u}^{- 1} ⊞ ξ_{m_d})}} ⊞ ϵ^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u} = \underset{: = a}{\underset{︸}{ξ_{c a m_i m u}^{- 1} ⊞ ξ_{m_d} ⊞ ξ_{c a m_w} ⊞ ϵ ⊞ ξ_{c a m_w}^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u}}} ⊞ \underset{: = b}{\underset{︸}{ξ_{c a m_i m u}^{- 1} ⊞ ξ_{m_d} ⊞ ϵ^{- 1} ⊞ ξ_{m_d}^{- 1} ⊞ ξ_{c a m_i m u}}} .$

(7.72)

Using Eq. (7.68) we can compute

$a = Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}) \cdot ϵ,$

(7.73)

and

$b = - Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d}) \cdot ϵ .$

(7.74)

Now we have to prove that all terms of the Baker–Campbell–Hausdorff formula which contain a Lie bracket can be written as $μ (ϵ) \cdot ϵ$ , where $μ (ϵ)$ goes to zero, when ϵ goes to zero.

We can compute

$a \cdot b = \underset{μ_{1} (ϵ)}{\underset{︸}{Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}) \cdot ϵ \cdot (- Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d}))}} \cdot ϵ,$

(7.75)

$b \cdot a = \underset{μ_{2} (ϵ)}{\underset{︸}{- Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d}) \cdot ϵ \cdot Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w})}} \cdot ϵ,$

(7.76)

$[a, b] = a b - b a = (μ_{1} (ϵ) + μ_{2} (ϵ)) \cdot ϵ .$

(7.77)

Obviously $μ_{1} (ϵ)$ and $μ_{2} (ϵ)$ go to zero when ϵ goes to zero. For the remaining summands of Eq. (7.71) the same argumentation can be used. It follows that

$(7.72) = a + b + μ (ϵ) \cdot ϵ = (Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}) - Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d}) \cdot ϵ) \cdot ϵ + μ (ϵ) \cdot ϵ$

(7.78)

where $μ (ϵ)$ goes to zero when ϵ goes to zero. According to (7.56) this means that

$\frac{\partial Ψ (ξ_{c a m_w}^{D}, ξ_{m_d} ⊞ ϵ)}{\partial ϵ} = Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d} \cdot T_{c a m_w}) - Adj (T_{c a m_i m u}^{- 1} \cdot T_{m_d}) .$

(7.79)

7.7 Results

We evaluate the described VI-DSO on the publicly available EuRoC dataset [5]. The performance is compared to [4], [20], [10], [24], [22], [32] and [33]. We also recommend watching the video at vision.in.tum.de/vi-dso.

7.7.1 Robust Quantitative Evaluation

In order to obtain an accurate evaluation we run our method 10 times for each sequence of the dataset (using the left camera). We directly compare the results to visual-only DSO [4] and ROVIO [20]. As DSO cannot observe the scale we evaluate using the optimal ground truth scale in some plots (with the description “gt-scaled”) to enable a fair comparison. For all other results we scale the trajectory with the final scale estimate (our method) or with 1 (other methods). For DSO we use the results published together with their paper. We use the same start and end times for each sequence to run our method and ROVIO. Note that the drone has a high initial velocity in some sequences when using these start times making it especially challenging for our IMU initialization. Fig. 7.9 shows the root mean square error (RMSE) for every run and Fig. 7.10 displays the cumulative error plot. Clearly our method significantly outperforms DSO and ROVIO. Without inertial data DSO is not able to work on all sequences especially on V1_03_difficult and V2_03_difficult and it is also not able to scale the results correctly. ROVIO on the other hand is very robust but as a filtering-based method it cannot provide sufficient accuracy.

Figure 7.9 RMSE for different methods run 10 times (lines) on each sequence (columns) of the EuRoC dataset. (A) DSO [4] realtime gt-scaled. (B) ROVIO real-time [20]. (C) VI-DSO real-time gt-scaled. (D) VI-DSO real-time.

Figure 7.10 Cumulative error plot on the EuRoC dataset (RT means real-time). This experiment demonstrates that the additional IMU not only provides a reliable scale estimate, but that it also significantly increases accuracy and robustness.

Table 7.1 shows a comparison to several other methods. For our results we have displayed the median error for each sequence from the 10 runs plotted in Fig. 7.9C. This makes the results very meaningful. For the other methods unfortunately only one result was reported so we have to assume that they are representative as well. The results for [22] and [32] were taken from [32] and the results for visual–inertial ORB-SLAM were taken from [10]. For VINS-Mono [26] the results were taken from their paper [33]. The results for [10] differ slightly from the other methods as they show the error of the keyframe trajectory instead of the full trajectory. This is a slight advantage as keyframes are bundle adjusted in their method which does not happen for the other frames.

Table 7.1

Accuracy of the estimated trajectory on the EuRoC dataset for several methods. Note that ORB-SLAM does a convincing job showing leading performance on some of the sequences. Nevertheless, since VI-DSO directly works on the sensor data (colors and IMU measurements), we observe similar precision and a better robustness—even without loop closuring. Moreover, the proposed method is the only one not to fail on any of the sequences (except ROVIO).

Sequence		MH1	MH2	MH3	MH4	MH5	V11	V12	V13	V21	V22	V23
VI-DSO (RT) (median of 10 runs each)	RMSE	0.062	0.044	0.117	0.132	0.121	0.059	0.067	0.096	0.040	0.062	0.174
	RMSE gt-scaled	0.041	0.041	0.116	0.129	0.106	0.057	0.066	0.095	0.031	0.060	0.173
	Scale Error (%)	1.1	0.5	0.4	0.2	0.8	1.1	1.1	0.8	1.2	0.3	0.4
VI ORB-SLAM (keyframe trajectory)	RMSE	0.075	0.084	0.087	0.217	0.082	0.027	0.028	X	0.032	0.041	0.074
	RMSE gt-scaled	0.072	0.078	0.067	0.081	0.077	0.019	0.024	X	0.031	0.026	0.073
	Scale Error (%)	0.5	0.8	1.5	3.4	0.5	0.9	0.8	X	0.2	1.4	0.7
VINS-MONO [33]	RMSE	0.12	0.12	0.13	0.18	0.21	0.068	0.084	0.19	0.081	0.16	X
VI odometry [22], mono	RMSE	0.34	0.36	0.30	0.48	0.47	0.12	0.16	0.24	0.12	0.22	X
VI odometry [22], stereo	RMSE	0.23	0.15	0.23	0.32	0.36	0.04	0.08	0.13	0.10	0.17	X
VI SLAM [32], mono	RMSE	0.25	0.18	0.21	0.30	0.35	0.11	0.13	0.20	0.12	0.20	X
VI SLAM [32], stereo	RMSE	0.11	0.09	0.19	0.27	0.23	0.04	0.05	0.11	0.10	0.18	X

In comparison with VI ORB-SLAM our method outperforms it in terms of RMSE on several sequences. As ORB-SLAM is a SLAM system while ours is a pure odometry method this is a remarkable achievement especially considering the differences in the evaluation. Note that the Vicon room sequences (V*) are executed in a small room and contain a lot of loopy motions where the loop closures done by a SLAM system significantly improve the performance. Also our method is more robust as ORB-SLAM fails to track one sequence. Even considering only sequences where ORB-SLAM works our approach has a lower maximum RMSE.

Compared to [22] and [32] our method obviously outperforms them. It is better than the monocular versions on every single sequence and it beats even the stereo and SLAM-versions on 9 out of 11 sequences. Also VINS-Mono [33] is outperformed by VI-DSO on every sequence of the EuRoC dataset.

In summary our method is the only one which is able to track all the sequences successfully except ROVIO.

We also compare the relative pose error with [10] and [24] on the V1_0*-sequences of EuRoC (Figs. 7.11 and 7.12). While our method cannot beat the SLAM system and the stereo method on the easy sequence we outperform [24] and we are as good as [10] on the medium sequence. On the hard sequence we outperform both of the contenders even though we neither use stereo nor loop closures.

Figure 7.11 Relative pose error (translation) evaluated on three sequences of the EuRoC dataset for visual–inertial ORB-SLAM [10], visual–inertial stereo LSD-SLAM [24] and our method. (A) Translation error V1_01_easy. (B) Translation error V1_02_medium. (C) Translation error V1_03_difficult. Note that [10] does loop closures and [24] uses stereo.

Figure 7.12 Relative pose error (orientation) evaluated on three sequences of the EuRoC dataset. (A) Orientation error V1_01_easy. (B) Orientation error V1_02_medium. (C) Orientation error V1_03_difficult.

7.7.2 Evaluation of the Initialization

Especially in the initialization there is a difference in how methods are evaluated on the EuRoC dataset. The machine hall sequences (MH*) all start with a part where the drone is shaken manually and then put onto the floor again. A few seconds later the drone is started and the main part of the sequence begins. In our evaluation we follow visual–inertial ORB-SLAM [10] and monocular DSO [4] and crop the first part of the sequence and only use the part where the drone is flying. On the other hand VINS-Mono [33] uses the whole sequence including the part where the drone is shaken. Note that both starting times involve different challenges for visual–inertial initialization: Using the first part of the sequence is a challenge especially for the visual part, as the drone is accelerated very quickly and there is a lot of motion blur involved. On the other hand the motions in this first part are almost purely translational and are quite fast, which is perfect for a fast initialization of scale and IMU biases.

Using only the cropped sequence makes the task for the visual initialization relatively easy, because the motions are relatively slow. However, the task of visual–inertial initialization becomes very hard, because it takes quite a lot of time for the parameters (especially the scale) to become observable.

This difference can also explain why the initializer of visual–inertial ORB-SLAM [10] needs more than 10 seconds on some sequences of the EuRoC dataset, whereas the initializer of VINS-Mono [33] is much faster. The reason is most likely not so much the algorithm, but more how the dataset is used.

In this comparison we will focus on initialization methods that use only the cropped part of the machine hall sequences—namely VI ORB-SLAM, as only then the problem of late scale observability is present. We do note, however, that an extensive evaluation of different initialization methods has not been done yet and would be a great contribution at this point.

In comparison to [10] our estimated scale is better overall (Table 7.1). On most sequences our method provides a better scale, and our average scale error (0.7% compared to 1.0%) as well as our maximum scale error (1.2% compared to 3.4%) is lower. In addition our method is more robust as the initialization procedure of [10] fails on V1_03_difficult.

Apart from the numbers we argue that our approach is superior in terms of the general structure. While [10] have to wait for 15 seconds until the initialization is performed, our method provides an approximate scale and gravity direction almost instantly that gets enhanced over time. Whereas in [10] the pose estimation has to work for 15 seconds without any IMU data, in our method the inertial data is used to improve the pose estimation from the beginning. This is probably one of the reasons why our method is able to process V1_03_difficult. Finally, our method is better suited for robotics applications. For example an autonomous drone is not able to fly without gravity direction and scale for 15 seconds and hope that afterwards the scale was observable. In contrast our method offers both of them right from the start. The continuous rescaling is also not a big problem as an application could use the unscaled measurements for building a consistent map and for providing flight goals, whereas the scaled measurements can be used for the controller. Finally, the measure c defined in Eq. (7.54) allows one to detect how accurate the current scale estimate is which can be particularly useful for robotics applications. Fig. 7.13 shows the estimated scale and c for the sequence MH_04 of the EuRoC dataset.

Figure 7.13 Scale estimate (A) and scale accuracy measure c (B) for MH_04_difficult (median result of 10 runs in terms of tracking accuracy). Note how the estimated scale converges to the correct value despite being initialized far from the optimum.

7.7.3 Parameter Studies

In order to further evaluate the method we provide several parameter studies in this section. As in the previous sections, we have run all variants 10 times for each sequence of the EuRoC dataset and present either the accumulation of all results or the median of the 10 runs. First of all we show the dependence on the number of points in Fig. 7.14. The main takeaway of this should be that because of the addition of inertial data, we can significantly reduce the number of points without losing tracking performance.

Figure 7.14 RMSE when running the method (not in real-time) with different numbers of active points. Interestingly focusing on fewer (but more reliable points) actually improves the performance, both in precision and in runtime.

In order to evaluate the importance of the dynamic marginalization strategy we have replaced it with two alternatives (Fig. 7.15). For one example we have used normal marginalization instead where only one marginalization prior is used. Furthermore we have tried a simpler dynamic marginalization strategy, where we do not use $M_{half}$ , but instead directly reset the marginalization prior with $M_{visual}$ , as soon as the scale interval is exceeded. Clearly dynamic marginalization yields the most robust result. Especially on sequences with a large initial scale error, the other strategies do not work well. Fig. 7.16 shows the difference in the scale convergence. When using the simple marginalization strategy the marginalization prior gets reset directly to $M_{visual}$ resulting in a slower scale convergence and oscillations (Figs. 7.16C and 7.16D). With a normal marginalization the scale estimate overshoots and it takes a long time for the system to compensate for the factors with a wrong scaled that were initially marginalized (Figs. 7.16E and 7.16F). With our dynamic marginalization implementation, however, the scale converges much faster and with almost no overshoot or oscillations (Figs. 7.16A and 7.16B).

Figure 7.15 The method run (in real-time) with different changes. “Simple dynamic marginalization” means that we have changed the marginalization to not use M_half, but rather replace M_curr directly with M_visual, when the scale interval is exceeded. For “normal marginalization” we have used the normal marginalization procedure with just one marginalization prior. For the pink line we have turned off the gravity direction optimization in the system (and only optimize for scale). This plot shows that the individual parts presented in this chapter, in particular the joint optimization with scale and gravity, and the dynamic marginalization procedure are important for the accuracy and robustness of the system.

Figure 7.16 Scale estimates for the different marginalization strategies evaluated on V2_03_difficult (left) and MH_04_difficult (right). (A) Scale estimate for V2_03_difficult with our method. After 21 seconds our system considered the scale converged (using the measure defined in Sect. 7.5.4.4) and fixed it for the rest of the sequence. (B) Scale estimate for MH_04_difficult with our method. (C) Scale estimate for V2_03_difficult with the simple dynamic marginalization strategy (see also Fig. 7.15). (D) Scale estimate for MH_04_difficult with the simple dynamic marginalization strategy. (E) Scale estimate for V2_03_difficult with normal marginalization. (F) Scale estimate for MH_04_difficult with normal marginalization. All figures show the median result of the 10 runs that were accumulated in Fig. 7.15.

We have also disabled the joint optimization of gravity direction in the main system (Fig. 7.15). Clearly the simple initialization of gravity direction (described in Sect. 7.5.2) is not sufficient to work well, without adding the direction of gravity to the model.

7.8 Conclusion

We have described a state-of-the-art approach to direct sparse visual–inertial odometry. The combination of direct image residuals with IMU data enables our system to outperform state of the art methods especially in terms of robustness. We explicitly include scale and gravity direction in our model in order to deal with cases where the scale is not immediately observable. As the initial scale can be very far from the optimum we have described a technique called dynamic marginalization where we maintain multiple marginalization priors and constrain the maximum scale difference. Extensive quantitative evaluation on the EuRoC dataset demonstrates that the described visual–inertial odometry method outperforms other state-of-the-art methods, both the complete system and the IMU initialization procedure. In particular, experiments confirm that inertial information not only provides a reliable scale estimate, but it also drastically increases precision and robustness.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for Chapter 7: A Review and Quantitative Evaluation of Direct Visual–Inertial Odometry

Create new playlist

Sign In

Sign Up