Chapter 3
Computer Vision for MAVs

Friedrich Fraundorfer

Institute for Computer Graphics and Vision, Graz University of Technology, Graz, Austria

3.1 Introduction

This chapter discusses recent advances in the use of computer vision for the control of micro aerial vehicles (MAVs). The term MAV typically denotes a class of small-scale unmanned aerial vehicles such as multirotor helicopters, for example, quadrotors. Figure 3.1 shows such an MAV equipped with digital cameras for control and 3D mapping.

MAVs have great potential to be used in a wide variety of applications, for example, search-and-rescue scenarios, surveillance, industrial inspection, delivery services, and so on. MAVs can carry a variety of sensors; however, their small scale imposes a strict weight limit. Almost all types of MAVs are nonetheless able to carry a digital camera, which makes it possible for them to take aerial images or, in general, images from vantage points not reachable otherwise. This capability alone enables an immense variety of applications. Piloting an MAV, however, takes a lot of training and requires the constant attention of the pilot. It also requires that the MAV operate within the line of sight. These limitations can be overcome by an autopilot system whose main task is keeping the MAV hovering on a spot. In outdoor environments, this can be achieved using GPS; an MAV operating indoors, however, cannot rely on GPS and needs an alternative sensor for the autopilot system. Lately, onboard cameras have been used successfully within such autopilot systems: computer vision algorithms compute the ego-motion of the MAV from camera images; in most cases, these measurements are then fused with measurements from inertial measurement units and used in a control loop to hover the MAV. To achieve these results, some serious challenges have had to be overcome:

  • Limited onboard processing power for image processing;
  • The high frame rates required for control;
  • The high reliability required;
  • The inability of ego-motion estimation from a single camera to measure metric scale.

A main insight has been that fusing inertial measurement unit (IMU) and camera-based measurements allows for robust and efficient ego-motion estimation algorithms.


Figure 3.1 A micro aerial vehicle (MAV) equipped with digital cameras for control and environment mapping. The depicted MAV has been developed within the SFLY project (see Scaramuzza et al. 2014)

In addition to ego-motion estimation, the camera images can also be used for environment sensing and interpretation. From camera images, it is possible to compute a 3D map of the environment that can be used for autonomous navigation and exploration. Having the ability to map the environment in 3D allows collision-free navigation and is a prerequisite for autonomous operation of the MAV. In the following, computer vision methods for MAV control, 3D mapping, autonomous navigation, and scene interpretation are introduced and discussed.

In a number of publications, astounding flight performances of MAVs have been demonstrated, ranging from acrobatics to fast and dynamic maneuvers (see Mellinger et al. (2011, 2010), Michael et al. (2010a), Mueller et al. (2011), and Schoellig et al. (2010)). However, until now, these impressive performances could only be achieved within a specially instrumented area (see Lupashin et al. (2011) and Michael et al. (2010b)) where a tracking system (typically cameras observing markers on the MAVs) computes the exact position and orientation of the MAV from outside. Systems with onboard sensors are far from reaching these performance parameters. Naturally, MAV control using onboard sensors is a heavily researched area, with the goal of achieving the aforementioned capabilities with onboard sensors. This research is not limited to digital cameras as onboard sensors but also includes the use of laser range-finders, depth cameras, or combinations of different sensors (see Achtelik et al. (2009, 2011), Ahrens et al. (2009), Bachrach et al. (2009), Bills et al. (2011a), Bloesch et al. (2010), Eberli et al. (2011), Engel et al. (2014), Forster et al. (2014b), Grzonka et al. (2012), Herisse et al. (2008), Hrabar et al. (2005), Klose et al. (2010), Loianno and Kumar (2014), Nieuwenhuisen et al. (2015), Shen et al. (2011, 2012, 2013a), Yang et al. (2014), Zingg et al. (2010), and Zufferey and Floreano (2006)).

3.2 System and Sensors

This section gives a brief overview of the system design and sensors of an MAV. The system design described is from the Pixhawk MAV platform (see Meier et al. (2012)), but it is exemplary of many other MAV platforms, for example, Achtelik et al. (2011), Scaramuzza et al. (2014), and Schmid et al. (2014). The system design is illustrated in Figure 3.2. A main characteristic is the existence of a low-level processing unit for flight control and a high-level processing unit for image processing and high-level tasks (e.g., path planning). The main task of the low-level processing unit is state estimation and control. State estimation has to run at high update rates and takes the measurements of the IMU, the digital compass, and the visual pose as input. An attitude controller and a pose controller use these measurements to steer the motors. State estimation and control have to run in real time; thus, they are implemented directly on a microcontroller. Detailed information about the low-level control process can be found in Bouabdallah et al. (2004) and Mahony et al. (2012).

Image processing and high-level tasks, however, need a powerful onboard computer; thus, the MAV is also equipped with a standard, but small-scale, Linux computer. This computer runs image processing, visual localization, obstacle detection, mapping, and path planning onboard the MAV.

The visual localization module computes the full 6-DOF pose of the MAV. The visual pose is fed to the state estimator for position control. The stereo processing module computes real-time disparity maps from the front-looking stereo pair, which are used by the visual localization module as well as the mapping module.

The minimal sensor set of an MAV comprises an IMU consisting of three-axis accelerometers and three-axis gyroscopes. This IMU tracks the attitude of the MAV, which is used for attitude control. A digital compass and a barometric pressure sensor are typically used to keep a constant heading and a constant height. Position control, which is necessary for automatic hovering and any kind of automatic navigation, requires a sensor capable of full 6-DOF pose measurements; in this case, this is achieved using digital cameras.


Figure 3.2 The system diagram of the autonomous Pixhawk MAV using a stereo system and an optical flow camera as main sensors

3.3 Ego-Motion Estimation

In most approaches, ego-motion estimation is achieved by fusing information from an IMU with visual measurements similar to the work in Scaramuzza et al. (2014). By means of sensor fusion, the pose of the MAV can be computed reliably at high update rates. Visual measurements are essential in this process, as the integration of inertial measurements would accumulate large drifts otherwise.

3.3.1 State Estimation Using Inertial and Vision Measurements

State estimation is a filtering approach using an extended Kalman filter (EKF) for fusing inertial and vision measurements. The inputs for the state estimator are rotational velocities and accelerations from gyroscopes and accelerometers of the IMU at high update rates and camera poses at low update rates. Figure 3.3 illustrates the state estimation process. Camera poses are represented by a 6-DOF transformation consisting of three rotational parameters and three translational parameters and need to include the uncertainty of the measurement. Depending on the visual pose estimator (monocular, stereo vision), the camera poses could include an unknown scale factor. The IMU measurements consist of three parameters for the rotational velocities and three parameters for the accelerations. The state of the EKF contains the vehicle pose, the scale factor, and biases of the IMU. For state estimation, the vehicle state is propagated forward in time using the high-rate IMU measurements. When a vision measurement becomes available (at low update rates), an EKF update step is performed using the vision pose.


Figure 3.3 The state estimation work flow for a loosely coupled visual-inertial fusion scheme

3.3.1.1 Sensor Model

The vehicle pose is represented by the pose of the IMU within the world coordinate frame. The camera sensor is offset from the IMU coordinate frame by a rotation and a translation. This transformation can be computed in a calibration step or it can be included in the state estimation. Figure 3.4 illustrates the different coordinate systems. The IMU measures the angular velocity around each of the three axes, denoted $\omega_m$, and the acceleration along each of the three axes, denoted $a_m$. These measurements are perturbed by noise and contain a bias. The real angular velocity of the system is denoted by $\omega$, and the real acceleration by $a$. The quantities are modeled by

3.1  $\omega_m = \omega + b_\omega + n_\omega$
3.2  $a_m = a + b_a + n_a$

$b_\omega$ and $b_a$ are the biases for the angular velocity and the accelerations, respectively. $n_\omega$ and $n_a$ are the noise terms, modeled as additive white Gaussian noise. The biases are nonstatic and are included in the state to be estimated. The noise parameters can typically be taken from the sensor data sheet.


Figure 3.4 A depiction of the involved coordinate systems for the visual-inertial state estimation

3.3.1.2 State Representation and EKF Filtering

The vehicle state consists of the following necessary parameters:

3.3  $x = \left[\, p_w^i \quad v_w^i \quad q_w^i \quad b_\omega \quad b_a \quad \lambda \quad p_i^s \quad q_i^s \,\right]^T$
  • $p_w^i$: This parameter represents the vehicle's position (i.e., the IMU position) in the world coordinate system. It consists of three parameters (the $x$, $y$, $z$ position).
  • $v_w^i$: This parameter represents the vehicle's velocity. It consists of three parameters, that is, the speeds along all three axes.
  • $q_w^i$: This parameter represents the vehicle's orientation within the world coordinate system, described by a rotation. The world coordinate system typically is chosen to be aligned with the gravity vector.
  • $b_\omega$: This parameter is the bias of the gyroscopes and is represented as a three-vector.
  • $b_a$: This parameter is the bias of the accelerometers and is represented as a three-vector.
  • $\lambda$: This parameter represents the scale factor between the IMU and the camera measurements. If the camera measurements are already in metric scale (e.g., when using a stereo system), then this parameter will converge to 1 (or can be omitted from the state). However, when using a monocular camera system, ego-motion can only be estimated up to an unknown scale factor, which is then represented by $\lambda$.
  • $p_i^s$: This parameter represents the translational offset between the camera sensor and the IMU. When computed in a pre-calibration step, this parameter can be omitted from the state vector.
  • $q_i^s$: This parameter represents the rotational offset between the camera sensor and the IMU. When computed in a pre-calibration step, this parameter can be omitted from the state vector.

The state transition model of the EKF is used without control input and is defined as

3.4  $x_k = f(x_{k-1}) + w_{k-1}$, where $w_{k-1}$ denotes the process noise

The observation model is defined as

3.5  $z_k = h(x_k) + v_k$, where $v_k$ denotes the measurement noise

The current vehicle state $x$ and its uncertainty, represented by the covariance matrix $P$, are propagated forward in time using the inertial measurements at a high update rate. On availability of a vision measurement, the vehicle state and its covariance are updated.

The state propagation needs two steps:

  1. Propagate the state variables $x$.
  2. Propagate the state covariance matrix $P$.

$F$ is the Jacobian of the state transition function and $H$ is the Jacobian of the observation function.

The state transition function $f$ is explained as follows. The new position $p_w^i$ is calculated by adding the doubly integrated accelerations to the current position, and the new orientation $q_w^i$ is calculated by integrating the measured rotational velocities onto the current orientation. The biases, scale, and calibration parameters are kept constant in the prediction step. For the detailed derivations, the reader is referred to Weiss and Siegwart (2011).

On the availability of a visual pose estimate, consisting of a position $p_c$, an orientation $q_c$, and their covariances, the filter update can be carried out in two steps:

  1. Update the state estimate.
  2. Update the covariance estimate.

For the detailed update equations, the reader is referred to Weiss and Siegwart (2011).
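To make the propagate/update structure concrete, the following Python sketch implements a heavily simplified loosely coupled filter: the state contains only position and velocity, the orientation, IMU biases, scale, and camera-IMU offsets of the full filter are omitted, and only the visual position is used in the update. All names and noise values are illustrative assumptions, not the filter of Weiss and Siegwart (2011).

    import numpy as np

    class LooselyCoupledEKF:
        """Heavily simplified sketch of the propagate/update structure: the state
        holds only position and velocity; orientation, IMU biases, scale, and the
        camera-IMU offsets of the full filter are omitted for brevity."""

        def __init__(self, accel_noise=0.5):
            self.x = np.zeros(6)              # state: [position (3), velocity (3)]
            self.P = np.eye(6) * 0.1          # state covariance
            self.accel_noise = accel_noise    # assumed accelerometer noise level

        def propagate(self, a_world, dt):
            """Prediction with one high-rate, gravity-compensated acceleration
            measurement expressed in the world frame."""
            F = np.eye(6)
            F[0:3, 3:6] = np.eye(3) * dt                     # dp/dv Jacobian block
            self.x[0:3] += self.x[3:6] * dt + 0.5 * a_world * dt ** 2
            self.x[3:6] += a_world * dt
            G = np.vstack([0.5 * dt ** 2 * np.eye(3), dt * np.eye(3)])
            Q = G @ G.T * self.accel_noise ** 2              # process noise from IMU noise
            self.P = F @ self.P @ F.T + Q

        def update_vision(self, p_meas, meas_cov):
            """EKF update with one low-rate visual position measurement and its
            3 x 3 covariance (the full filter also uses the visual orientation)."""
            H = np.hstack([np.eye(3), np.zeros((3, 3))])     # position is observed
            y = p_meas - H @ self.x                          # innovation
            S = H @ self.P @ H.T + meas_cov
            K = self.P @ H.T @ np.linalg.inv(S)              # Kalman gain
            self.x = self.x + K @ y
            self.P = (np.eye(6) - K @ H) @ self.P

    ekf = LooselyCoupledEKF()
    ekf.propagate(np.array([0.0, 0.0, 0.1]), dt=0.005)                 # 200 Hz IMU step
    ekf.update_vision(np.array([0.0, 0.0, 0.02]), np.eye(3) * 1e-3)    # ~30 Hz vision step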

Please note that the observation process assumes that the vision system is able to directly observe the full 6-DOF pose. This also means that any computer vision algorithm that delivers a pose estimate plus its uncertainty can be used for the filtering process. In the following sections, three different vision algorithms are explained that have already been used successfully for the control of an MAV.

3.3.2 MAV Pose from Monocular Vision

It is a fundamental property of computer vision that from multiple images of a scene it is possible to compute a 3D reconstruction of the scene and, at the same time, the position and orientation of the cameras. This method is known as structure from motion (SfM) and is also often called Visual SLAM (simultaneous localization and mapping) within the robotics community. By attaching a camera to an MAV, it is therefore possible to keep track of the MAV's movements by computing the camera poses with such an algorithm. If one is only interested in the camera poses, such an algorithm is often referred to as visual odometry. A well-known implementation that also works at high frame rates is the parallel tracking and mapping (PTAM) method (see Klein and Murray (2007)). The basic idea behind such a method is that tracked feature points get triangulated in 3D. For a new camera image, feature points are matched against the already triangulated 3D points, and the camera pose is computed from the resulting 3D-to-2D point correspondences. Afterwards, bundle adjustment is performed for accuracy and new features are triangulated in 3D. The detailed steps are as follows (a code sketch of the main steps is given after the list):

  1. Capture two images $I_0$, $I_1$.
  2. Extract and match features between them.
  3. Triangulate features from $I_0$, $I_1$.
  4. Capture a new image $I_k$.
  5. Extract features and match with the previous frame $I_{k-1}$.
  6. Compute the camera pose from the 3D-to-2D matches using the PnP algorithm and RANSAC as a robust estimator.
  7. Refine the camera pose and 3D points with bundle adjustment.
  8. Triangulate all new feature matches between $I_k$ and $I_{k-1}$.
  9. Repeat at step 4.

For the details of the different steps, the reader is referred to Scaramuzza and Fraundorfer (2011).
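The following Python/OpenCV sketch illustrates the bootstrapping (steps 1-3) and the 3D-to-2D pose computation (step 6). The intrinsic matrix, feature type, and thresholds are placeholder assumptions; the bookkeeping that associates triangulated points with features in later frames, as well as the bundle adjustment step, is omitted.

    import cv2
    import numpy as np

    # Assumed pinhole intrinsics of the calibrated camera (placeholder values).
    K = np.array([[525.0, 0.0, 320.0],
                  [0.0, 525.0, 240.0],
                  [0.0, 0.0, 1.0]])

    orb = cv2.ORB_create(1000)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

    def match_features(img0, img1):
        """Detect and match features between two grayscale images (steps 2 and 5)."""
        k0, d0 = orb.detectAndCompute(img0, None)
        k1, d1 = orb.detectAndCompute(img1, None)
        matches = matcher.match(d0, d1)
        p0 = np.float32([k0[m.queryIdx].pt for m in matches])
        p1 = np.float32([k1[m.trainIdx].pt for m in matches])
        return p0, p1

    def bootstrap(img0, img1):
        """Steps 1-3: relative pose of the first two frames and initial 3D points.
        For a monocular camera the translation, and hence the map, is only known
        up to an unknown scale factor."""
        p0, p1 = match_features(img0, img1)
        E, mask = cv2.findEssentialMat(p0, p1, K, method=cv2.RANSAC, threshold=1.0)
        _, R, t, _ = cv2.recoverPose(E, p0, p1, K, mask=mask)
        P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
        P1 = K @ np.hstack([R, t])
        X_h = cv2.triangulatePoints(P0, P1, p0.T, p1.T)    # homogeneous, 4 x N
        return (X_h[:3] / X_h[3]).T                        # N x 3 points

    def pose_from_3d_2d(X, x_new):
        """Step 6: camera pose of a new frame from 3D-to-2D correspondences,
        using PnP inside a RANSAC loop. X (N x 3) are triangulated points and
        x_new (N x 2) their matched pixel locations in the new image."""
        ok, rvec, tvec, inliers = cv2.solvePnPRansac(
            X.astype(np.float64), x_new.astype(np.float64), K, None,
            reprojectionError=2.0)
        R, _ = cv2.Rodrigues(rvec)
        return R, tvec, inliers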

It is possible to run such an inherently complex algorithm on a low-power computer onboard an MAV if certain criteria are met. First, robust feature matching between images can be computationally very expensive; for high frame rate cameras, however, the feature matching step can be accelerated significantly. At real-time frame rates of about 30 frames per second, the motion between two images is small, which means that image features have almost the same coordinates in a neighboring image pair. In such a case, it is possible to refrain from computing a feature descriptor at all and to perform feature matching as a nearest-neighbor matching of image coordinates. If, in addition, a fast feature detector, for example, FAST corners, is used, then the feature detection and matching step can be carried out very quickly, as in the sketch below. Second, it is possible to skip the pose estimation from 3D-to-2D point correspondences entirely and instead compute the pose by bundle adjustment directly. At real-time frame rates, the camera pose for bundle adjustment can be initialized with the previous camera pose or, in particular for MAVs, with the propagated pose from the state estimator. By using a robust cost function in bundle adjustment, outliers can be removed as well. Third, the computational complexity of bundle adjustment would grow with an increasing number of 3D points and camera poses. By keeping only the last $n$ frames and the 3D points observed within these frames, the runtime can be kept constant. In Achtelik et al. (2011), an algorithm following these principles was able to compute the pose of an MAV at 30 frames per second and has been used successfully to control the MAV.
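A sketch of the descriptor-free matching idea: FAST corners from consecutive frames are associated purely by image-coordinate proximity, which is valid only under the small-motion assumption of high frame rates. In practice, a cheap patch comparison (e.g., SSD over a small window) would be added to reject wrong associations; the radius below is an assumed value.

    import cv2
    import numpy as np
    from scipy.spatial import cKDTree

    fast = cv2.FastFeatureDetector_create(threshold=20)

    def match_by_proximity(img_prev, img_new, radius=15.0):
        """Associate each FAST corner of the new frame with the nearest corner of
        the previous frame, keeping only associations within a small pixel radius."""
        p_prev = np.float32([k.pt for k in fast.detect(img_prev, None)])
        p_new = np.float32([k.pt for k in fast.detect(img_new, None)])
        dist, idx = cKDTree(p_prev).query(p_new, k=1)
        keep = dist < radius
        return p_prev[idx[keep]], p_new[keep]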

However, monocular visual odometry has a significant drawback: the metric scale of the camera position cannot be computed. It is therefore necessary to calibrate the scale upon initialization with a pattern of known metric size (e.g., a checkerboard) or to estimate the metric scale by fusion with inertial sensors. In addition, the scale estimate is prone to drift. These problems can be alleviated using a stereo vision system.

3.3.3 MAV Pose from Stereo Vision

The big advantage of stereo vision is that the metric scale of camera poses and 3D points can be computed directly. This means that scale drift, as present in a monocular system, cannot occur, and, in particular for state estimation, it makes the inclusion of a scale factor in the MAV state unnecessary. Furthermore, the triangulation of 3D points always happens from two precisely pre-calibrated image frames. The camera pose from stereo vision can be computed either from 3D-to-3D correspondences or from 3D-to-2D correspondences. The 3D-to-2D method has recently gained a lot of popularity and is therefore described here. For the 3D-to-3D method, the reader is referred to Scaramuzza and Fraundorfer (2011).

The basic outline of the 3D-to-2D method is to use the PnP algorithm to compute the camera pose from 3D-to-2D correspondences. 3D points are triangulated from each stereo image pair. To compute the current pose, feature matching is performed between one image of the current stereo pair and one image of the previous stereo pair. The PnP algorithm is then used to compute the camera pose from corresponding 2D feature points of the current stereo pair and 3D points from the previous stereo pair (Figures 3.5 and 3.6). The detailed steps are as follows (a sketch of the stereo triangulation step is given after the list):

  1. Capture a stereo image pair $I_{l,0}$, $I_{r,0}$.
  2. Extract, match, and triangulate features between $I_{l,0}$, $I_{r,0}$.
  3. Capture a new stereo image pair $I_{l,1}$, $I_{r,1}$.
  4. Extract features and match between $I_{l,0}$ and $I_{l,1}$.
  5. Compute the camera pose from the 3D-to-2D matches using the PnP algorithm and random sample consensus (RANSAC) as a robust estimator.
  6. Refine the camera pose and 3D points with bundle adjustment.
  7. Triangulate all new feature matches from the stereo image pair $I_{l,1}$, $I_{r,1}$.
  8. Repeat at step 3.
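The triangulation from a rectified, calibrated stereo pair (step 2) reduces to a per-feature disparity computation, as in the following sketch; focal length, principal point, and baseline are placeholder values, and the subsequent pose step reuses the PnP/RANSAC computation shown for the monocular case, now with metrically scaled 3D points.

    import numpy as np

    # Assumed rectified stereo geometry (placeholder values).
    f, cx, cy = 425.0, 320.0, 240.0   # focal length and principal point in pixels
    baseline = 0.11                    # stereo baseline in meters

    def triangulate_rectified(p_left, p_right):
        """Metric 3D points from matched pixels of a rectified stereo pair.
        p_left, p_right are N x 2 arrays; corresponding features lie on the same
        image row, so the disparity is the difference of the u-coordinates."""
        disparity = np.maximum(p_left[:, 0] - p_right[:, 0], 1e-6)
        Z = f * baseline / disparity                  # depth from disparity
        X = (p_left[:, 0] - cx) * Z / f
        Y = (p_left[:, 1] - cy) * Z / f
        return np.stack([X, Y, Z], axis=1)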

Figure 3.5 Illustration of monocular pose estimation. The new camera pose is computed from 3D points triangulated from at least two subsequent images


Figure 3.6 Illustration of stereo pose estimation. At each time index, 3D points can be computed from the left and right images of the stereo pair. The new camera pose can be computed directly from the 3D points triangulated from the previous stereo pair

Stereo vision processing adds additional computational complexity to the pose estimation process. It is necessary to create matches between the left and right images of the stereo pair and between the current and previous image pair. However, feature matching between the left and right images can be done very efficiently as the camera poses are fixed and pre-calibrated. The corresponding feature needs to lie on the epipolar line and the search for a feature correspondence can be restricted to this epipolar line. Stereo processing, however, has been shown to run in real time on dedicated hardware (e.g., FPGAs; see Schmid et al. (2014)), freeing the main computer from the low-level vision tasks.

The described method is not limited to stereo systems but can be used with any RGBD sensor system, that is, a sensor that captures depth (D) together with a camera image (RGB). A popular example of such a sensor is the Kinect sensor. Currently, a wide variety of RGBD sensors are being developed, and all of these sensors work equally well with the above-described algorithm.

3.3.4 MAV Pose from Optical Flow Measurements

The use of optical flow for controlling MAVs has been popularized with the appearance of Parrot's AR Drone. The AR Drone comes with a position control mechanism based on optical flow (see Bristeau et al. (2011)), which allows it to hover stably on the spot when there is no user input. The main benefits of using the optical flow algorithm for pose estimation are the possibility of achieving very high update rates (e.g., >100 Hz) and that it can easily be implemented on resource-limited embedded controllers. However, the algorithm comes with some limiting assumptions. The optical flow algorithm measures pixel-shift in the image plane. If the image plane is, for example, parallel to the ground plane, the observed pixel-shift is proportional to the camera motion. If the distance between the camera and the ground plane is measured (e.g., by an ultrasound sensor), the pixel-shift can be translated into metric camera movements. To disambiguate optical flow induced by rotational motion from that induced by translational motion, IMU measurements can be used to correct for the rotational part. The necessary components for such a metric optical flow sensor are a camera sensor, a distance sensor (e.g., ultrasound or infrared), and a three-axis gyroscope. The images from the camera sensor are used to compute the pixel-shifts in the $u$ and $v$ directions using the optical flow algorithm. The distance sensor is used to measure the distance between the camera and the world plane that induces the optical flow, so that the pixel-shifts can be translated into metric shifts. Furthermore, the distance sensor is used to track changes in the $z$-direction (height). The gyroscope is used to measure the rotational velocity of the sensor between two image frames. In this way, the rotationally induced part of the optical flow can be computed and subtracted from the measured optical flow. The camera movement (translation) can be computed in metric units from the remaining metric optical flow and the distance measurements. The mathematical relation is given in Eq. (3.6). The metric translation of the camera (i.e., the translation of the MAV) is denoted by $t = (t_x, t_y, t_z)^T$, where the components are the movements in the $x$, $y$, and $z$ directions. Eq. (3.6) gives the relation of how to compute $t$ from the optical flow measured between two images $I_0$ and $I_1$, the measured rotation between the two images, and the measured depths. The rotation between the two images is measured by the gyroscopes and denoted by the $3 \times 3$ rotation matrix $R$. $(\dot{u}, \dot{v})$ is the measured optical flow between the two images in pixels. $(u_0, v_0)$ is the image coordinate of the location in image $I_0$ for which the optical flow has been computed. $d_0$ and $d_1$ are the distance measurements in meters to the ground plane for the images $I_0$ and $I_1$, respectively, and $f$ is the focal length of the camera in pixels.

Equation (3.6) can be derived from the simple equation for optical flow given by Eq. (3.7) by substituting the image coordinates before and after the motion with the terms of Eqs (3.8) and (3.9). Equations (3.8) and (3.9) describe the camera projection of a 3D point before and after the movement of the camera with rotation $R$ and translation $t$.
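Written with the quantities defined above, and with pixel coordinates taken relative to the principal point, one consistent form of these relations is the following sketch; sign and frame conventions may differ in the original derivation:

(3.7)  $(\dot{u}, \dot{v}) = (u_1 - u_0,\; v_1 - v_0)$

(3.8)  $P_0 = \dfrac{d_0}{f}\,(u_0,\; v_0,\; f)^T$

(3.9)  $P_1 = \dfrac{d_1}{f}\,(u_0 + \dot{u},\; v_0 + \dot{v},\; f)^T = R\,P_0 + t$

Here $P_0$ and $P_1$ denote the same ground-plane point expressed in the camera frames of $I_0$ and $I_1$, respectively. Substituting Eqs (3.8) and (3.9) into the motion model and solving for the translation gives

(3.6)  $t = \dfrac{d_1}{f}\,(u_0 + \dot{u},\; v_0 + \dot{v},\; f)^T - \dfrac{d_0}{f}\,R\,(u_0,\; v_0,\; f)^T$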

The geometric relations and quantities used in the derivation are illustrated in Figure 3.7.


Figure 3.7 Concept of the optical flow sensor depicting the geometric relations used to compute metric optical flow

Equation (3.6) describes how to compute $t$ for one image location and its optical flow. In practice, multiple flow vectors within an image are computed (e.g., by KLT feature tracking), which gives rise to multiple estimates of $t$ that can be averaged for robustness. The algorithm can now be summarized as follows (a code sketch is given after the list):

  1. Compute the optical flow between consecutive images $I_0$, $I_1$ using the KLT method (see Shi and Tomasi (1994)).
  2. Measure the distances $d_0$, $d_1$ with the distance sensor.
  3. Measure the rotation matrix $R$ with the gyroscopes.
  4. Compute $t$ using Eq. (3.6).
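A Python sketch of this loop, using OpenCV's pyramidal KLT tracker on a fixed grid of locations; it follows the reconstruction of Eq. (3.6) given above, with pixel coordinates taken relative to the principal point, and all parameter values are placeholders.

    import cv2
    import numpy as np

    def metric_flow_translation(img0, img1, R, d0, d1, f, cx, cy, grid_step=16):
        """Camera translation from optical flow (Eq. (3.6) as sketched above).
        img0, img1: consecutive grayscale frames; R: 3 x 3 inter-frame rotation
        from the gyroscopes; d0, d1: distances to the ground plane in meters;
        f, cx, cy: focal length and principal point in pixels."""
        h, w = img0.shape
        # A fixed grid of track locations avoids running a feature detector.
        us, vs = np.meshgrid(np.arange(grid_step, w - grid_step, grid_step),
                             np.arange(grid_step, h - grid_step, grid_step))
        p0 = np.float32(np.stack([us.ravel(), vs.ravel()], axis=1)).reshape(-1, 1, 2)
        p1, status, _ = cv2.calcOpticalFlowPyrLK(img0, img1, p0, None,
                                                 winSize=(15, 15), maxLevel=2)
        ok = status.ravel() == 1
        p0, p1 = p0[ok].reshape(-1, 2), p1[ok].reshape(-1, 2)
        # Homogeneous pixel rays (u - cx, v - cy, f) in both frames.
        x0 = np.stack([p0[:, 0] - cx, p0[:, 1] - cy, np.full(len(p0), f)], axis=1)
        x1 = np.stack([p1[:, 0] - cx, p1[:, 1] - cy, np.full(len(p1), f)], axis=1)
        # t = (d1/f) * x1 - (d0/f) * R x0, averaged over all tracked locations.
        t = (d1 / f) * x1 - (d0 / f) * (R @ x0.T).T
        return t.mean(axis=0)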

To achieve the high update rates needed for MAV control, two suggestions can be followed. First, using a low-resolution camera allows a high image capture rate. This can, for instance, be achieved by turning on hardware binning on a standard VGA camera, which reduces the resolution but increases the image capture rate. Together with the low resolution, a lens with a longer focal length is suggested so that small details on the ground plane can still be resolved for better feature tracking. Second, by computing KLT feature tracks only for a fixed number of image locations defined on a grid, the computations for the feature detection part of KLT can be avoided and the number of computations stays constant.

These ideas are actually implemented in the PX4Flow sensor depicted in Figure 3.8. The PX4Flow sensor computes metric optical flow as described earlier. For this, the sensor is equipped with a camera, gyroscopes, an ultrasound ranging device, and an ARM processor that performs the optical flow computation. The PX4Flow sensor is described in detail in Honegger et al. (2013).


Figure 3.8 The PX4Flow sensor to compute MAV movements using the optical flow principle. It consists of a digital camera, gyroscopes, a range sensor, and an embedded processor for image processing

3.4 3D Mapping

3D mapping, that is, the generation of 3D data from images, is one of the most prominent tasks for MAVs. MAVs provide unique vantage points for digital cameras compared to street-level images and traditional aerial images. For building reconstruction in urban scenarios, images can be taken from very close to the building. Images can be taken facing straight down (so-called nadir images), looking at the facades directly, or from general oblique directions at low heights. This, however, results in nonuniform flight patterns, in contrast to the regularly sampled grids typically used in aerial imaging. Computer vision systems that generate 3D data from image sets are known under the term SfM systems. These systems can also be used to process image data taken by MAVs. However, by working only on the image data, these systems neglect the additional information provided by the sensors onboard the MAV that are used for MAV control. These sensors can be used to improve the quality and efficiency of SfM systems in a number of ways.


Figure 3.9 The different steps of a typical structure from motion (SfM) pipeline to compute 3D data from image data. The arrows on the right depict the additional sensor data provided from a MAV platform and highlight for which steps in the pipeline it can be used

Figure 3.9 shows the basic building steps that are common to most SfM systems. Depicted by the arrows from the right side are the additional sensor measurements available from the MAV that can be used to improve the individual steps.

In the feature extraction step, a lot of computational power is spent on extracting viewpoint-, scale-, and orientation-invariant features and descriptors. Such invariance, however, can also be achieved by using IMU measurements from the MAV. For the case of a downward-looking camera, viewpoint invariance can be achieved by aligning all images with the earth's gravity normal. The deviation of the MAV from this normal is measured by the MAV's IMU. The images can be pre-rotated such that all the image planes are parallel, thus removing viewpoint changes. The remaining rotation invariance to the yaw angle can be achieved by utilizing the MAV's absolute compass measurements and applying an in-plane rotation. Finally, scale invariance can be achieved by measuring the height over ground with the MAV's pressure sensor and correcting for this scale change. After this pre-processing, simpler, less expensive features and descriptors can be used. The extraction process is faster and the matching ambiguity is typically reduced (see Meier et al. (2012)).
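A sketch of the gravity-alignment idea for a downward-looking camera: the roll/pitch rotation reported by the IMU is turned into a pure-rotation homography that warps the image as if the camera were looking exactly along the gravity direction. The camera-IMU extrinsics are assumed to be identity here, which in practice has to be replaced by the calibrated transformation.

    import cv2
    import numpy as np

    def gravity_align(img, K, R_cam_to_level):
        """Warp an image so that its image plane becomes perpendicular to gravity.
        R_cam_to_level is the rotation (from the IMU roll/pitch) that takes the
        current camera frame to a gravity-aligned frame; K is the intrinsic matrix."""
        H = K @ R_cam_to_level @ np.linalg.inv(K)   # homography of a pure rotation
        return cv2.warpPerspective(img, H, (img.shape[1], img.shape[0]))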

In the image matching step, traditionally an exhaustive pairwise matching is computed, with the main goal of establishing an ordering of how to process the images. This step is extremely time consuming; it consists of descriptor matching and two-view geometry estimation for match verification. Here, additional measurements from the MAV's sensors can speed up the process significantly and also increase robustness. The GPS positions of the MAV can be used to avoid full exhaustive matching. With known GPS positions of the images, shortlists of nearby images can be created and matching can be performed on the shortlist images only, avoiding attempts to match images that will not match anyway. In the two-view geometry estimation step, the essential matrix between two images is computed to verify the matching. This requires the use of the 5-point essential matrix estimation algorithm within a RANSAC loop. For the case of MAVs, when the gravity normal is measured by the IMU, the more efficient 3-point essential matrix algorithm, which was developed exactly for this case, can be used instead of the 5-point algorithm. The 3-point algorithm takes two rotation measurements directly from the IMU and only estimates the three unknowns that remain. To do this, the algorithm needs only 3 point correspondences, whereas the 5-point method needs 5. This is the reason for its much higher efficiency: the number of necessary RANSAC iterations grows exponentially with the size of the minimal sample, so 3 versus 5 required point correspondences makes a big difference in runtime.
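The runtime difference can be made concrete with the standard formula for the number of RANSAC samples needed to draw at least one all-inlier minimal set with a given confidence; the inlier ratio below is an assumed example value.

    import math

    def ransac_iterations(sample_size, inlier_ratio=0.5, confidence=0.99):
        """Samples required so that, with probability `confidence`, at least one
        drawn minimal sample of size `sample_size` contains only inliers."""
        return math.ceil(math.log(1.0 - confidence) /
                         math.log(1.0 - inlier_ratio ** sample_size))

    print(ransac_iterations(5))   # 5-point algorithm: about 146 samples
    print(ransac_iterations(3))   # 3-point algorithm: about 35 samples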

In the triangulation and pose estimation step, nearby images are registered to each other and 3D points are triangulated. Image registration typically works by computing a pose hypothesis from 3D-to-2D matches with the PnP algorithm, followed by a nonlinear optimization of the re-projection error. Computing the pose hypothesis from 3D-to-2D matches is typically time consuming and can be replaced when IMU measurements of the MAV are available: the pose hypothesis can instead be obtained by integrating the IMU measurements. It has been shown that such a pose hypothesis is sufficiently accurate for the subsequent state-of-the-art nonlinear pose optimization.

In the final step, bundle adjustment is used to optimize all the camera poses and 3D points in a global manner. Here, the GPS and IMU measurements can be used as additional constraints, as global pose constraints (GPS), global attitude constraints (IMU), and relative orientation constraints (IMU).

Many of these concepts have already been implemented in the open-source SfM software MAVMAP (see Schönberger et al. (2014)), which has been designed specifically for 3D mapping using MAVs. The main benefits are a major speedup of important computer vision algorithms and increased robustness. In areas with low texture, which lead to a low number of extracted image features, additional information such as GPS or IMU measurements provides valuable constraints for solving the camera pose estimation problem. Image data from MAVs can be acquired in a highly irregular manner, which typically puts an SfM system to the test; therefore, the SfM process benefits greatly from using additional constraints. Figure 3.10 shows a 3D map of a rubble field used for the training of search-and-rescue personnel. The 3D map has been computed from more than 3000 images taken by multiple low-flying MAVs. The camera poses have been computed using MAVMAP, and the point cloud has been densified using SURE (see Rothermel et al. (2012)). Figure 3.10a depicts the trajectories of the MAVs in red; the irregularity of the acquisition path is clearly visible. Figure 3.10b shows a detailed view of a part of the 3D map from a viewpoint originally not observed by the MAVs.


Figure 3.10 A 3D map generated from image data of three individual MAVs using MAVMAP. (a) 3D point cloud including MAVs' trajectories (camera poses are shown in red). (b) Detailed view of a part of the 3D map from a viewpoint originally not observed from the MAVs

3.5 Autonomous Navigation

Digital cameras will most likely be the most important sensors for autonomous navigation of MAVs. Digital cameras allow the environment of an MAV to be mapped in 3D using sophisticated computer vision algorithms. Other options for 3D data generation, including laser range-finders, radar, ultrasound, and infrared sensors, still come with significant disadvantages. Full 3D laser range-finders, for instance, are currently simply too heavy to be put on small MAVs. Digital cameras, in contrast, already weigh mere grams and are incredibly tiny. Further miniaturization of both digital cameras and, more importantly, computer technology will allow ever more complex computer vision algorithms to be run onboard the MAV. Depth cameras are also counted toward the digital camera class in this context, as Kinect-style depth cameras resemble a camera-projector system. Online 3D mapping is a key ability for autonomous navigation. The MAV needs to sense obstacles and its surrounding environment in detail to do path planning and obstacle avoidance for safe navigation. The challenge is real-time, high-fidelity depth sensing on a resource-limited platform with strict weight limitations. The current trend is to use specialized hardware embedded in the sensor to let computer vision algorithms such as stereo processing run in real time. The Kinect sensor, for example, performs depth estimation within a system-on-a-chip (SoC). In other work, the semi-global matching algorithm for stereo processing runs on an FPGA chip to achieve real-time depth sensing for MAVs. The integration of ever more complex algorithms into sensor hardware is already in the works; for example, the Skybotix VI-sensor aims to put a complete visual-inertial pose estimation system into the sensor hardware itself, utilizing FPGA processing and dedicated hardware.

The sensor measurements have to be fused into a common map representation of the environment that facilitates path planning, obstacle avoidance, and higher level AI operations such as exploration strategies. In this context, methods that have been developed for ground robots cannot be used directly; they first have to be extended from a 2D representation to a 3D representation, as an MAV can navigate freely in 3D. 3D occupancy grids represent the state of the art in environment representation. They allow the incremental fusion of different depth scans into a common map of user-defined accuracy, and they can be used efficiently for path planning and navigation. However, one quickly runs into scalability problems on the memory-limited computers onboard MAVs, and typically tiling approaches have to be developed.
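A minimal sketch of the log-odds occupancy update that underlies such grids, using a dense array for brevity; real onboard systems use hierarchical (octree-like) storage, ray casting to also mark free space, and the tiling mentioned above. The cell size and the sensor-model constants are assumed values.

    import numpy as np

    class OccupancyGrid3D:
        """Minimal dense 3D occupancy grid with log-odds updates (sketch only)."""

        def __init__(self, size=(128, 128, 64), resolution=0.1):
            self.log_odds = np.zeros(size, dtype=np.float32)
            self.resolution = resolution          # cell edge length in meters
            self.l_occ = 0.85                     # assumed log-odds increment per hit
            self.l_min, self.l_max = -2.0, 3.5    # clamping keeps the map adaptable

        def update(self, hit_points):
            """Mark the cells containing measured surface points (N x 3, in the map
            frame, assumed to lie inside the grid volume) as more likely occupied."""
            idx = np.floor(hit_points / self.resolution).astype(int)
            np.add.at(self.log_odds, (idx[:, 0], idx[:, 1], idx[:, 2]), self.l_occ)
            np.clip(self.log_odds, self.l_min, self.l_max, out=self.log_odds)

        def occupancy(self):
            """Per-cell occupancy probability recovered from the stored log-odds."""
            return 1.0 - 1.0 / (1.0 + np.exp(self.log_odds))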

Figure 3.11 shows a 3D occupancy grid used for MAV navigation as described in Heng et al. (2014). Grid cells that are occupied by an object (walls, floor, furniture, etc.) are visualized as blue cubes. The occupancy grid was computed by fusing a large number of individual depth sensor measurements. Each grid cell further holds a probability that indicates the certainty about the measurement.


Figure 3.11 Environment represented as a 3D occupancy grid suitable for path planning and MAV navigation. Blue blocks are the occupied parts of the environment

3.6 Scene Interpretation

In the previous section, the focus was on how to use digital images for MAV navigation and mapping mainly by extracting 3D information from image data. However, the analysis of the image content itself by computer vision methods to perform scene interpretation is of importance for MAVs as well. Fully autonomous task-driven operation of a MAV needs knowledge about not only the geometry of its environment but also a semantic description and interpretation. The MAV needs to be able to detect and identify objects, people, and so on. It needs to be able to semantically localize itself, that is, recognize if it is in a kitchen, bathroom, or living room. It needs to identify people and understand gestures to interact with them. And it needs to be able to identify passageways, doors, windows, or exits. Computer vision is the most promising technology to achieve this necessary semantic interpretation.

Figure 3.12 shows an example for scene interpretation by a MAV (see Meier et al. (2012)). The system uses a face detector to detect and identify persons in its vicinity. This lets the system always keep a safe distance from people. The system also identifies objects from a pre-trained database. One of these objects is the exit sign attached above the door. The system can use information like this for path planning and navigation.
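As an illustration of the principle (not necessarily the detector used in Meier et al. (2012)), a frontal-face detector can be run on the live video stream with a few lines of OpenCV; the cascade file is the one shipped with the opencv-python package.

    import cv2

    # Haar cascade shipped with the opencv-python package.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    def detect_faces(frame_gray):
        """Return (x, y, w, h) bounding boxes of frontal faces in a grayscale frame."""
        return cascade.detectMultiScale(frame_gray, scaleFactor=1.2, minNeighbors=5)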


Figure 3.12 Live view from a MAV with basic scene interpretation capabilities. The MAV detects faces and pre-trained objects (e.g., the exit sign) and marks them in the live view

3.7 Concluding Remarks

Computer vision has proved to be a key enabling technology for autonomous MAVs. Digital cameras are used for pose estimation and for navigation of MAVs using onboard sensors. The possibility of estimating the pose of an MAV with onboard sensors has immensely widened the range of scenarios in which MAVs can be used. Without onboard pose estimation, autonomous operation would be restricted to outdoor areas where GPS reception is available or to instrumented areas (e.g., areas equipped with a tracking system). Applications like search and rescue in partially collapsed buildings, plant inspection tasks, or indoor metrology for documentation would otherwise simply not be possible. In addition, the emergence of autonomous MAVs can be considered a new driving force for developments in computer vision. Computer vision systems used for autonomous MAVs need to be highly reliable and robust, and many computer vision algorithms still fall short in this respect. One promising route to more reliable computer vision algorithms is visual-inertial fusion, which is currently being heavily investigated. Interest in this direction is largely due to the fact that an MAV is naturally equipped with an IMU, so there is a big incentive to investigate these approaches further. And of course, new insights into computer vision algorithms for MAVs can readily be transferred to other kinds of autonomous systems as well.
