Chapter 1
Computer Vision in Vehicles

Reinhard Klette

School of Engineering, Computer and Mathematical Sciences, Auckland University of Technology, Auckland, New Zealand

This chapter is a brief introduction to academic aspects of computer vision in vehicles. It summarizes basic notation and definitions used in computer vision, and it discusses a few visual tasks of relevance for vehicle control and environment understanding.

1.1 Adaptive Computer Vision for Vehicles

Computer vision designs solutions for understanding the real world by using cameras. See Rosenfeld (1969), Horn (1986), Hartley and Zisserman (2003), or Klette (2014) for examples of monographs or textbooks on computer vision.

Computer vision operates today in vehicles including cars, trucks, airplanes, unmanned aerial vehicles (UAVs) such as multi-copters (see Figure 1.1 for a quadcopter), satellites, or even autonomous driving rovers on the Moon or Mars.


Figure 1.1 (a) Quadcopter. (b) Corners detected from a flying quadcopter using a modified FAST feature detector.

Courtesy of Konstantin Schauwecker

In our context, the ego-vehicle is the vehicle in which the computer vision system operates; ego-motion describes the ego-vehicle's motion in the real world.

1.1.1 Applications

Computer vision solutions are in use today in manned vehicles for improved safety or comfort, in autonomous vehicles (e.g., robots) for supporting motion or action control, and also, as a misuse, in UAVs for killing people remotely. UAV technology also has good potential for helping to save lives, for creating three-dimensional (3D) models of the environment, and so forth. Underwater robots and unmanned sea-surface vehicles are further important applications of vision-augmented vehicles.

1.1.2 Traffic Safety and Comfort

Traffic safety is a dominant application area for computer vision in vehicles. Currently, about 1.24 million people die annually worldwide due to traffic accidents (WHO 2013); that is, on average, about 2.4 people die per minute in traffic accidents. How does this compare to the numbers Western politicians are using for obtaining support for their “war on terrorism?” Computer vision can play a major role in solving these true real-world problems (see Figure 1.2). Traffic-accident fatalities can be reduced by controlling traffic flow (e.g., by triggering automated warning signals at pedestrian crossings or intersections with bicycle lanes) using stationary cameras, or by having cameras installed in vehicles (e.g., for detecting safe distances and adjusting speed accordingly, or by detecting obstacles and constraining trajectories).


Figure 1.2 The 10 leading causes of death in the world. Chart provided online by the World Health Organization (WHO). Road injury ranked number 9 in 2011

Computer vision is also introduced into modern cars for improving driving comfort. Surveillance of blind spots, automated distance control, or compensation of unevenness of the road are just three examples for a wide spectrum of opportunities provided by computer vision for enhancing driving comfort.

1.1.3 Strengths of (Computer) Vision

Computer vision is an important component of intelligent systems for vehicle control (e.g., in modern cars, or in robots). The Mars rovers “Curiosity” and “Opportunity” operate based on computer vision; “Opportunity” has already operated on Mars for more than ten years. The visual system of human beings provides an existence proof that vision alone can deliver nearly all of the information required for steering a vehicle. Computer vision aims at creating comparable automated solutions for vehicles, enabling them to navigate safely in the real world. Additionally, computer vision can work constantly “at the same level of attention,” applying the same rules or programs; a human is not able to do so due to becoming tired or distracted.

A human applies accumulated knowledge and experience (e.g., supporting intuition), and it is a challenging task to embed a computer vision solution into a system that is able to have, for example, intuition. Computer vision offers many more opportunities for future developments in a vehicle context.

1.1.4 Generic and Specific Tasks

There are generic visual tasks such as calculating distance or motion, measuring brightness, or detecting corners in an image (see Figure 1.1b). In contrast, there are specific visual tasks such as detecting a pedestrian, understanding ego-motion, or calculating the free space a vehicle may move in safely in the next few seconds. The borderline between generic and specific tasks is not well defined.

Solutions for generic tasks typically aim at creating one self-contained module for potential integration into a complex computer vision system. But there is no general-purpose corner detector and also no general-purpose stereo matcher. Adaptation to the given circumstances appears to be the general way to optimize the use of given modules for generic tasks.

Solutions for specific tasks are typically structured into multiple modules that interact in a complex system.

1.1.5 Multi-module Solutions

Designing a multi-module solution for a given task does not need to be more difficult than designing a single-module solution. In fact, finding solutions for some single modules (e.g., for motion analysis) can be very challenging. Designing a multi-module solution requires:

  1. that modular solutions are available and known,
  2. tools for evaluating those solutions depending on a given situation (or scenario; see Klette et al. (2011) for a discussion of scenarios), for being able to select (or adapt) solutions,
  3. conceptual thinking for designing and controlling an appropriate multi-module system,
  4. a system optimization including more extensive testing on various scenarios than for a single module (due to the increase in combinatorial complexity of multi-module interactions), and
  5. control of the multiple modules (e.g., when many designers separately insert processors for controlling various operations in a vehicle, no control engineer should be surprised if the vehicle becomes unstable).

1.1.6 Accuracy, Precision, and Robustness

Solutions can be characterized as being accurate, precise, or robust. Accuracy means a systematic closeness to the true values for a given scenario. Precision also considers the occurrence of random errors; a precise solution should lead to about the same results under comparable conditions. Robustness means approximate correctness for a set of scenarios that includes particularly challenging ones: in such cases, it would be appropriate to specify the defining scenarios accurately, for example, by using video descriptors (Briassouli and Kompatsiaris 2010) or data measures (Suaste et al. 2013). Ideally, robustness should address any possible scenario in the real world for a given task.

1.1.7 Comparative Performance Evaluation

An efficient way to perform a comparative performance analysis of solutions for one task is to have different authors test their own programs on identical benchmark data. But we not only need to evaluate the programs, we also need to evaluate the benchmark data used (Haeusler and Klette 2010, 2012) to identify their challenges or relevance.

Benchmarks need to come with measures for quantifying performance such that we can compare accuracy on individual data or robustness across a diversity of different input data.

Figure 1.4 illustrates two possible ways of generating benchmarks, one by using computer graphics for rendering sequences with accurately known ground truth, and the other by using high-end sensors (in the illustrated case, ground truth is provided by the use of a laser range-finder).


Figure 1.4 Examples of benchmark data available for a comparative analysis of computer vision algorithms for motion and distance calculations. (a) Image from a synthetic sequence provided on EISATS with accurate ground truth. (b) Image of a real-world sequence provided on KITTI with approximate ground truth

But those evaluations need to be considered with care since not everything is comparable. Evaluations depend on the benchmark data used; a few summarizing numbers may not really be relevant for particular scenarios possibly occurring in the real world. For some input data we simply cannot answer how a solution performs; for example, in the middle of a large road intersection, we cannot say which lane-border detection algorithm performs best for this scenario.

1.1.8 There Are Many Winners

We are not so naive as to expect an all-time “winner” when comparatively evaluating computer vision solutions. Vehicles operate in the real world (whether on Earth, the Moon, or Mars), which is so diverse that not all possible event occurrences can be modeled in the underlying constraints of a designed program. Particular solutions perform differently for different scenarios, and a winning program for one scenario may fail for another. We can only evaluate how particular solutions perform for particular scenarios. In the end, this might support an optimization strategy based on adaptation to the current scenario that a vehicle experiences at a given time.

1.2 Notation and Basic Definitions

The following basic notation and definitions (Klette 2014) are used in this chapter.

1.2.1 Images and Videos

An image I is defined on a set

Ω = {(x, y) : 1 ≤ x ≤ N_cols ∧ 1 ≤ y ≤ N_rows}   (1.1)

of pairs of integers (pixel locations), called the image carrier, where N_cols and N_rows define the number of columns and rows, respectively. We assume a left-hand coordinate system with the coordinate origin in the upper-left corner of the image, the x-axis to the right, and the y-axis downward. A pixel of an image I combines a location p = (x, y) in the carrier Ω with the value I(p) of I at this location.

A scalar image I takes values in a discrete set {0, 1, ..., G_max}, typically with G_max = 255 for 8-bit data. A vector-valued image has scalar values in a finite number of channels or bands. A video or image sequence consists of frames I_t, for t = 1, 2, ..., all being images on the same carrier Ω.
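As an aside, these conventions map directly onto common array representations. The following minimal NumPy sketch (all array names are illustrative, not taken from this chapter) shows the carrier, pixel access in the left-hand coordinate system, a multi-channel image, and a short frame sequence:

    import numpy as np

    N_cols, N_rows = 640, 480                      # carrier size
    # A scalar 8-bit image I on the carrier Omega; NumPy stores rows first,
    # so the value at pixel location p = (x, y) is accessed as I[y, x].
    I = np.zeros((N_rows, N_cols), dtype=np.uint8)
    x, y = 100, 50
    I[y, x] = 255                                  # set value I(p) at p = (x, y)

    # A vector-valued (3-channel) image and a short sequence of T frames
    # on the same carrier.
    C = np.zeros((N_rows, N_cols, 3), dtype=np.uint8)
    T = 10
    frames = np.zeros((T, N_rows, N_cols), dtype=np.uint8)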

1.2.1.1 Gauss Function

The zero-mean Gauss function, for a scale (standard deviation) σ > 0, is defined as follows:

G_σ(x, y) = (1 / (2πσ²)) · exp( −(x² + y²) / (2σ²) )   (1.2)

A convolution of an image I with the Gauss function produces smoothed images

L_σ = I ∗ G_σ   (1.3)

also known as Gaussians, for σ > 0. (We stay with symbol L here, as introduced by Lindeberg (1994) for “layer”; a given context will prevent confusion with the left image L of a stereo pair.)
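A minimal sketch of computing a few Gaussians L_σ by convolution, here using OpenCV's GaussianBlur (a kernel size of (0, 0) lets OpenCV derive the kernel from σ); the synthetic input image is only a placeholder:

    import cv2
    import numpy as np

    # Placeholder scalar image; in practice, load a recorded frame instead.
    I = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    sigmas = [0.5, 1.0, 2.0, 4.0]
    # L[s] approximates the convolution I * G_sigma of Eq. (1.3) for sigma = s.
    L = {s: cv2.GaussianBlur(I, (0, 0), sigmaX=s) for s in sigmas}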

1.2.1.2 Edges

Step-edges in images are detected based on first- or second-order derivatives, such as values of the gradient ∇I or the Laplacian ΔI given by

∇I = (I_x, I_y)^T = (∂I/∂x, ∂I/∂y)^T   and   ΔI = I_xx + I_yy = ∂²I/∂x² + ∂²I/∂y²   (1.4)

Local maxima of L_1- or L_2-magnitudes ‖∇I‖_1 or ‖∇I‖_2, or zero-crossings of values ΔI, are taken as an indication of a step-edge. The gradient or Laplacian computation is commonly preceded by smoothing, using a convolution with the zero-mean Gauss function.

Alternatively, phase-congruency edges in images are detected based on local frequency-space representations (Kovesi 1993).
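The following sketch implements the derivative-based step-edge indication described above (Gaussian smoothing, gradient magnitude, Laplacian zero-crossings); the threshold and the sample input are arbitrary assumptions:

    import cv2
    import numpy as np

    I = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    # Smooth first (zero-mean Gauss function), then differentiate.
    L1 = cv2.GaussianBlur(I, (0, 0), sigmaX=1.0).astype(np.float64)

    Ix = cv2.Sobel(L1, cv2.CV_64F, 1, 0, ksize=3)   # approximation of dI/dx
    Iy = cv2.Sobel(L1, cv2.CV_64F, 0, 1, ksize=3)   # approximation of dI/dy
    grad_mag = np.sqrt(Ix**2 + Iy**2)               # L2-magnitude of the gradient

    lap = cv2.Laplacian(L1, cv2.CV_64F)             # Laplacian, see Eq. (1.4)

    # Simple step-edge indications: large gradient magnitude, or a sign
    # change (zero-crossing) of the Laplacian between horizontal neighbors.
    edges_grad = grad_mag > 0.5 * grad_mag.max()
    zero_cross = np.signbit(lap[:, :-1]) != np.signbit(lap[:, 1:])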

1.2.1.3 Corners

Let I_xx, I_xy, I_yx, and I_yy denote the second-order derivatives of image I. Corners in images are localized based on high curvature of intensity values, to be identified by two large eigenvalues of the Hessian matrix

H(p) = [[ I_xx(p), I_xy(p) ], [ I_yx(p), I_yy(p) ]]   (1.5)

at a pixel location p in a scalar image I (see Harris and Stephens (1988)). Figure 1.1 shows the corners detected by FAST. Corner detection is often preceded by smoothing using a convolution with the zero-mean Gauss function.
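Since Figure 1.1b refers to a (modified) FAST detector, here is a minimal sketch of detecting corners with OpenCV's standard FAST implementation; the threshold value and the synthetic input are assumptions:

    import cv2
    import numpy as np

    I = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    # Standard FAST corner detector (not the modification used for Figure 1.1b).
    fast = cv2.FastFeatureDetector_create(threshold=30, nonmaxSuppression=True)
    keypoints = fast.detect(I, None)

    corners = [kp.pt for kp in keypoints]   # pixel locations (x, y) of corners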

1.2.1.4 Scale Space and Key Points

Key points or interest points are commonly detected as maxima or minima in a 3 × 3 × 3 subset of the scale space of a given image (Crowley and Sanderson 1987; Lindeberg 1994). A finite set of differences of Gaussians

D_σ = L_{a·σ} − L_σ   (1.6)

for a constant scale factor a > 1, produces a DoG scale space. These differences are approximations to Laplacians of increasingly smoothed versions of an image (see Figure 1.5 for an example of such Laplacians forming an LoG scale space).

nfgz005

Figure 1.5 Laplacians of smoothed copies of the same image using cv::GaussianBlur and cv::Laplacian in OpenCV, with values 0.5, 1, 2, and 4 for the smoothing parameter σ. Linear scaling is used for better visibility of the resulting Laplacians.

Courtesy of Sandino Morales
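A minimal sketch of a DoG scale space as in Eq. (1.6), built from the Gaussians L_σ of Eq. (1.3); the base scale and the factor a = 2 are assumptions for illustration:

    import cv2
    import numpy as np

    I = np.random.randint(0, 256, (480, 640), dtype=np.uint8).astype(np.float64)

    a = 2.0                       # constant scale factor between layers
    sigmas = [0.5, 1.0, 2.0, 4.0] # consecutive sigmas differ by the factor a
    layers = [cv2.GaussianBlur(I, (0, 0), sigmaX=s) for s in sigmas]

    # D_sigma = L_(a*sigma) - L_sigma for each pair of consecutive layers.
    dog = [layers[i + 1] - layers[i] for i in range(len(layers) - 1)]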

1.2.1.5 Features

Finally, an image feature is a location (an interest point), defined by a key point, edge, corner, and so on, together with a descriptor, usually given as a data vector (e.g., a vector of length 128 representing local gradients in the case of the scale-invariant feature transform (SIFT)), but possibly also in other formats such as a graph. For example, the descriptor of a step-edge can be the mean and variance of gradient values along the edge, and the descriptor of a corner can be defined by the eigenvalues of the Hessian matrix.
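A minimal sketch of computing key points together with their SIFT descriptors (vectors of length 128) using OpenCV; the synthetic input image is a placeholder:

    import cv2
    import numpy as np

    I = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    sift = cv2.SIFT_create()
    # keypoints: detected interest points; descriptors: one 128-vector per key point.
    keypoints, descriptors = sift.detectAndCompute(I, None)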

1.2.2 Cameras

We have an X_w Y_w Z_w world coordinate system, which is not defined by a particular camera or other sensor, and a camera coordinate system X_s Y_s Z_s (index “s” for “sensor”), which is described with respect to the chosen world coordinates by means of an affine transform, defined by a rotation matrix R and a translation vector t.

A point in 3D space is given as P = (X_w, Y_w, Z_w) in world coordinates or as P = (X_s, Y_s, Z_s) in camera coordinates. In addition to the coordinate notation for points, we also use vector notation, such as p = [X_w, Y_w, Z_w]^T for a point P = (X_w, Y_w, Z_w).

1.2.2.1 Pinhole-type Camera

The Z_s-axis models the optical axis. Assuming an ideal pinhole-type camera, we can ignore radial distortion and can have undistorted projected points in the image plane with coordinates x_u and y_u. The distance f between the x_u y_u image plane and the projection center is the focal length.

A visible point P = (X_s, Y_s, Z_s) in the world is mapped by central projection into location (x_u, y_u) in the undistorted image plane:

x_u = f · X_s / Z_s   and   y_u = f · Y_s / Z_s   (1.7)

with the origin of the x_u y_u image coordinates at the intersection point of the Z_s-axis with the image plane.

The intersection point (c_x, c_y) of the optical axis with the image plane, expressed in xy image coordinates, is called the principal point. It follows that (x, y) = (x_u + c_x, y_u + c_y). A pixel location (x, y) in the 2D xy image coordinate system has 3D coordinates (x − c_x, y − c_y, f) in the X_s Y_s Z_s camera coordinate system.
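A small numeric sketch of the central projection of Eq. (1.7) together with the shift by the principal point; focal length, principal point, and the 3D point are arbitrary example values:

    # Central projection of a point given in camera coordinates (Eq. (1.7)).
    def project_pinhole(Xs, Ys, Zs, f, cx, cy):
        xu = f * Xs / Zs                 # undistorted coordinates, origin at
        yu = f * Ys / Zs                 # the principal point
        return xu + cx, yu + cy          # xy image coordinates

    # Example: f in pixel units, principal point near the image center.
    x, y = project_pinhole(Xs=1.5, Ys=-0.5, Zs=10.0, f=800.0, cx=320.0, cy=240.0)
    print(x, y)                          # -> (440.0, 200.0)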

1.2.2.2 Intrinsic and Extrinsic Parameters

Assuming multiple cameras C_i, for some indices i (e.g., just i = 1 and i = 2 for binocular stereo), camera calibration specifies intrinsic parameters such as edge lengths e_x and e_y of the camera sensor cells (defining the aspect ratio), a skew parameter s, coordinates (c_x, c_y) of the principal point where the optical axis of camera C_i and the image plane intersect, the focal length f, possibly refined as f_x = f/e_x and f_y = f/e_y, and lens distortion parameters starting with κ_1 and κ_2. In general, it can be assumed that lens distortion has been calibrated before and does not need to be included anymore in the set of intrinsic parameters. Extrinsic parameters are defined by rotation matrices and translation vectors, for example, a matrix R_ij and a vector t_ij for the affine transform between the camera coordinate systems of C_i and C_j, or a matrix R_i and a vector t_i for the affine transform between the camera coordinate system of C_i and the world coordinate system.

1.2.2.3 Single-Camera Projection Equation

The camera projection equation in homogeneous coordinates, mapping a 3D point P = (X_w, Y_w, Z_w) into image coordinates (x_i, y_i) of the i-th camera, is as follows:

k · [x_i, y_i, 1]^T = K_i [R_i | t_i] [X_w, Y_w, Z_w, 1]^T   (1.8)

with

K_i = [[ f_x, s, c_x ], [ 0, f_y, c_y ], [ 0, 0, 1 ]]   (1.9)

where k ≠ 0 is a scaling factor. This defines a 3 × 3 matrix K_i of intrinsic camera parameters and a 3 × 4 matrix [R_i | t_i] of extrinsic parameters (of the affine transform) of camera C_i. The 3 × 4 camera matrix P_i = K_i [R_i | t_i] is defined by 11 parameters if we allow for an arbitrary scaling of parameters; otherwise it is 12.
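A minimal NumPy sketch of Eqs. (1.8) and (1.9): building K_i and [R_i | t_i], projecting a homogeneous world point, and dividing by the scaling factor k; all parameter values are arbitrary examples:

    import numpy as np

    fx, fy, s, cx, cy = 800.0, 800.0, 0.0, 320.0, 240.0
    K = np.array([[fx,  s, cx],
                  [ 0, fy, cy],
                  [ 0,  0,  1]])                # intrinsic matrix, Eq. (1.9)

    R = np.eye(3)                               # rotation (here: identity)
    t = np.array([[0.1], [0.0], [0.0]])         # translation in world units
    Rt = np.hstack([R, t])                      # 3 x 4 extrinsic matrix [R | t]

    P = K @ Rt                                  # 3 x 4 camera matrix

    Xw = np.array([1.5, -0.5, 10.0, 1.0])       # homogeneous world point
    k_xy = P @ Xw                               # = k * (x, y, 1)^T, Eq. (1.8)
    x, y = k_xy[0] / k_xy[2], k_xy[1] / k_xy[2]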

1.2.3 Optimization

We specify one popular optimization strategy that has various applications in computer vision. In an abstract sense, we assign to each pixel a label l (e.g., an optical flow vector u = (u, v), a disparity d, a segment identifier, or a surface gradient) out of a set L of possible labels (e.g., all vectors pointing from a pixel p to points within a Euclidean distance to p of less than a given threshold). In this example, labels l are thus elements of the 2D continuous plane.

1.2.3.1 Optimizing a Labeling Function

Labels are assigned to all pixels in the carrier Ω by a labeling function f : Ω → L. Solving a labeling problem means to identify a labeling f that approximates somehow an optimum of a defined error or energy

E(f) = E_data(f) + λ · E_smooth(f)   (1.10)

where λ > 0 is a weight. Here, E_data is the data-cost term and E_smooth is the smoothness-cost term. A decrease in λ works toward reduced smoothing of calculated labels. Ideally, we search for an optimal (i.e., of minimal total error) labeling f in the set of all possible labelings, which defines a total variation (TV).

We detail Eq. (1.10) by adding costs at pixels. In a current image, label f_p is assigned by the value of the labeling function f at pixel position p. Then we have that

E(f) = Σ_{p ∈ Ω} E_data(p, f_p) + λ · Σ_{(p,q) ∈ A} E_smooth(f_p, f_q)   (1.11)

where A is an adjacency relation between pixel locations.

In optical flow or stereo vision, label f_p (i.e., an optical flow vector or a disparity) defines a pixel location q in another image (i.e., in the following image, or in the left or right image of a stereo pair); in this case, we can also write E_data(p, q) instead of E_data(p, f_p).
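A small sketch of evaluating the energy of Eq. (1.11) for a given labeling on a tiny image, with a Potts smoothness-cost term and 4-adjacency; the random data costs stand in for any real data-cost term:

    import numpy as np

    rng = np.random.default_rng(0)
    H, W, num_labels = 4, 5, 3
    data_cost = rng.random((H, W, num_labels))   # E_data(p, l) for each label l
    f = rng.integers(0, num_labels, (H, W))      # a candidate labeling f
    lam = 0.5                                    # smoothness weight lambda

    def potts(a, b):
        return 0.0 if a == b else 1.0            # Potts smoothness cost

    E_data = sum(data_cost[y, x, f[y, x]] for y in range(H) for x in range(W))
    E_smooth = 0.0
    for y in range(H):                           # 4-adjacency: right and down
        for x in range(W):
            if x + 1 < W:
                E_smooth += potts(f[y, x], f[y, x + 1])
            if y + 1 < H:
                E_smooth += potts(f[y, x], f[y + 1, x])

    E = E_data + lam * E_smooth                  # total energy E(f), Eq. (1.11)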

1.2.3.2 Invalidity of the Intensity Constancy Assumption

Data-cost terms are defined for windows that are centered at the considered pixel locations. The data in both windows, around the start pixel location p and around the pixel location q in the other image, are compared for understanding “data similarity.”

For example, in the case of stereo matching, we have p = (x, y) in the right image R and p + (d, 0) = (x + d, y) in the left image L, for a disparity d, and the data in both (2k + 1) × (2k + 1) windows are identical if and only if the data-cost measure

E_SSD(p, d) = Σ_{i = −k..k} Σ_{j = −k..k} [ R(x + i, y + j) − L(x + d + i, y + j) ]²   (1.12)

results in value 0, where SSD stands for sum of squared differences.

The use of such a data-cost term would be based on the intensity constancy assumption (ICA), that is, the assumption that intensity values around corresponding pixel locations p and q are (basically) identical within a window of specified size. However, the ICA is invalid for real-world recording. Intensity values at corresponding pixels and in their neighborhoods are typically impacted by lighting variations, or just by image noise. There are also impacts of differences in local surface reflectance, differences in cameras when comparing images recorded by different cameras, or effects of perspective distortion (the local neighborhood around a surface point is projected differently into different cameras). Thus, energy optimization needs to apply data-cost measures that perform better than SSD or other measures that are also defined based on the ICA.
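A minimal sketch of the SSD data-cost of Eq. (1.12) for one pixel and one disparity hypothesis; window size, the synthetic images, and the tested location are assumptions:

    import numpy as np

    rng = np.random.default_rng(1)
    L = rng.integers(0, 256, (480, 640)).astype(np.float64)   # left image
    R = rng.integers(0, 256, (480, 640)).astype(np.float64)   # right image

    def ssd_cost(R, L, x, y, d, k=2):
        """SSD between (2k+1)x(2k+1) windows at (x, y) in R and (x+d, y) in L."""
        wr = R[y - k:y + k + 1, x - k:x + k + 1]
        wl = L[y - k:y + k + 1, x + d - k:x + d + k + 1]
        return float(np.sum((wr - wl) ** 2))

    cost = ssd_cost(R, L, x=100, y=200, d=10)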

1.2.3.3 Census Data-Cost Term

The census-cost function has been identified as being able to successfully compensate for brightness variations in the input images of a recorded video (Hermann and Klette 2009; Hirschmüller and Scharstein 2009). The mean-normalized census-cost function is defined by comparing a (2k + 1) × (2k + 1) window centered at pixel location p in a frame I with a window of the same size centered at a pixel location q in a frame J. Let μ_p (or μ_q) be the mean of the window around p in I (or around q in J). Then we have that

E_MCEN(p, q) = Σ_{i = −k..k} Σ_{j = −k..k} ρ( I(p + (i, j)) − μ_p , J(q + (i, j)) − μ_q )   (1.13)

with

ρ(a, b) = 0 if sgn(a) = sgn(b), and ρ(a, b) = 1 otherwise   (1.14)

Note that value 0 corresponds to consistency in both comparisons. If the comparisons are performed with respect to the central values I(p) and J(q), rather than the means μ_p and μ_q, then we have the census-cost function E_CEN as a candidate for a data-cost term.

Let a_p be the vector listing the results sgn(I(p + (i, j)) − μ_p) in a left-to-right, top-to-bottom order (with respect to the applied (2k + 1) × (2k + 1) window), where sgn is the signum function; analogously, b_q lists the values sgn(J(q + (i, j)) − μ_q). The mean-normalized census data-cost E_MCEN(p, q) equals the Hamming distance between the vectors a_p and b_q.
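A minimal sketch of the mean-normalized census cost of Eqs. (1.13) and (1.14), computed as the Hamming distance between the sign vectors a_p and b_q; window size and the test locations are assumptions:

    import numpy as np

    rng = np.random.default_rng(2)
    I = rng.integers(0, 256, (480, 640)).astype(np.float64)   # frame I
    J = rng.integers(0, 256, (480, 640)).astype(np.float64)   # frame J

    def census_signature(img, x, y, k=2):
        """Signs of (window values - window mean), listed row by row."""
        w = img[y - k:y + k + 1, x - k:x + k + 1]
        return np.sign(w - w.mean()).flatten()

    def mcen_cost(I, J, p, q, k=2):
        """Hamming distance between the mean-normalized census signatures."""
        a = census_signature(I, p[0], p[1], k)
        b = census_signature(J, q[0], q[1], k)
        return int(np.count_nonzero(a != b))

    cost = mcen_cost(I, J, p=(100, 200), q=(110, 200))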

1.3 Visual Tasks

This section briefly outlines some of the visual tasks that need to be solved by computer vision in vehicles.

1.3.1 Distance

Laser range-finders are increasingly used for estimating distance, mainly based on the time-of-flight principle. Assuming sensor arrays of larger density in the near future, laser range-finders will become a standard option for cost-efficient, accurate distance calculations. Combining stereo vision with distance data provided by laser range-finders is a promising multi-module approach toward distance calculations.

Stereo vision is the dominant approach in computer vision for calculating distances. Corresponding pixels are here defined by projections of the same surface point in the scene into the left and right images of a stereo pair. After recorded stereo pairs have been rectified into canonical stereo geometry, the one-dimensional (1D) correspondence search can be limited to identical image rows.

1.3.1.1 Stereo Vision

We address the detection of corresponding points in a stereo image pair, a basic task for distance calculation in vehicles using binocular stereo.

Corresponding pixels define a disparity, which is mapped based on camera parameters into distance or depth. There are already very accurate solutions for stereo matching, but challenging input data (rain, snow, dust, sun strike, running wipers, and so forth) still pose unsolved problems (see Figure 1.6 for an example of a depth map).


Figure 1.6 (a) Image of a stereo pair (from a test sequence available on EISATS). (b) Visualization of a depth map using the color key shown at the top for assigning distances in meters to particular colors. A pixel is shown in gray if there was low confidence for the calculated disparity value at this pixel.

Courtesy of Simon Hermann

1.3.1.2 Binocular Stereo Vision

After camera calibration, we have two virtually identical cameras C_L and C_R, which are perfectly aligned, defining canonical stereo geometry. In this geometry, we have an identical copy of the left camera translated by the base distance b along the X_s-axis of the X_s Y_s Z_s camera coordinate system of the left camera. The projection center of the left camera is at (0, 0, 0), and the projection center of the cloned right camera is at (b, 0, 0). A 3D point P = (X_s, Y_s, Z_s) is mapped into the undistorted image points

p_L = (x_L, y_L) = ( f · X_s / Z_s , f · Y_s / Z_s )   (1.15)

p_R = (x_R, y_R) = ( f · (X_s − b) / Z_s , f · Y_s / Z_s )   (1.16)

in the left and right image planes, respectively. Considering p_L and p_R in homogeneous coordinates, we have that

p_R^T · F · p_L = 0   (1.17)

for the 3 × 3 bifocal tensor F, defined by the configuration of the two cameras. The product F · p_L defines an epipolar line in the image plane of the right camera; any stereo point corresponding to p_L needs to be on that line.
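A small sketch of the epipolar constraint of Eq. (1.17) for canonical stereo geometry, where the bifocal tensor takes the well-known skew-symmetric form and epipolar lines coincide with image rows; the point coordinates are arbitrary examples:

    import numpy as np

    # Bifocal (fundamental) matrix for canonical stereo geometry:
    # a pure translation along the X-axis yields horizontal epipolar lines.
    F = np.array([[0.0, 0.0, 0.0],
                  [0.0, 0.0, -1.0],
                  [0.0, 1.0, 0.0]])

    pL = np.array([150.0, 120.0, 1.0])      # point in the left image (homogeneous)
    pR = np.array([130.0, 120.0, 1.0])      # candidate correspondence on the right

    line = F @ pL                           # epipolar line (a, b, c): a*x + b*y + c = 0
    residual = pR @ line                    # p_R^T F p_L; zero if pR lies on the line
    print(line, residual)                   # residual == 0.0 since both y-values match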

1.3.1.3 Binocular Stereo Matching

Let B be the base image and M be the match image of a stereo pair. We calculate corresponding pixels p = (x, y) in B and p + (d, 0) = (x + d, y) in M, in the xy image coordinates of the carrier Ω, following the optimization approach as expressed by Eq. (1.11). A labeling function f assigns a disparity d = f_p to a pixel location p, which specifies the corresponding pixel p + (d, 0).

For example, we can use the census data-cost term E_MCEN as defined in Eq. (1.13); for the smoothness-cost term, either the Potts model, a linear truncated cost, or a quadratic truncated cost is used (see Chapter 5 in Klette (2014)). Chapter 6 of Klette (2014) also discusses different algorithms for stereo matching, including belief-propagation matching (BPM) (Sun et al. 2003) and dynamic-programming stereo matching (DPSM). DPSM can be based on scanning along the epipolar line only, using either an ordering or a smoothness constraint, or it can be based (for symmetry reasons) on scanning along multiple scanlines using a smoothness constraint along those lines; the latter case is known as semi-global matching (SGM) if multiple scanlines are used for error minimization (Hirschmüller 2005). A variant of SGM is used in Daimler's stereo vision system, available since March 2013 in their Mercedes cars (see also Chapter 2 by U. Franke in this book).

Iterative SGM (iSGM) is an example of a modification of baseline SGM; for example, error minimization along the horizontal scanline should in general contribute more to the final result than optimization along other scanlines (Hermann and Klette 2012). Figure 1.7 also addresses confidence measurement; for a comparative discussion of confidence measures, see Haeusler and Klette (2012). Linear BPM (linBPM) applies the MCEN data-cost term and the linear truncated smoothness-cost term (Khan et al. 2013).


Figure 1.7 Resulting disparity maps for stereo data when using only one scanline for DPSM with the SGM smoothness constraint and an MCEN data-cost function. From top to bottom and left to right: left-to-right horizontal scanline, lower-left to upper-right diagonal scanline, top-to-bottom vertical scanline, and upper-left to lower-right diagonal scanline. Pink pixels indicate low-confidence locations (here identified by inhomogeneous disparity locations).

Courtesy of Simon Hermann; the input data have been provided by Daimler A.G.
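A minimal sketch of computing a dense disparity map with OpenCV's semi-global matcher (StereoSGBM, which implements an SGM variant with its own matching cost, not the MCEN cost discussed above); the parameter values and the synthetic image pair are assumptions:

    import cv2
    import numpy as np

    # Placeholder rectified stereo pair; in practice, load calibrated images.
    left = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    right = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    block = 5
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=64,              # must be divisible by 16
        blockSize=block,
        P1=8 * block * block,           # smoothness penalty for small disparity steps
        P2=32 * block * block,          # larger penalty for bigger disparity jumps
    )
    # compute() returns fixed-point disparities scaled by 16.
    disparity = sgbm.compute(left, right).astype(np.float32) / 16.0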

1.3.1.4 Performance Evaluation of Stereo Vision Solutions

Figure 1.8 provides a comparison of iSGM and linBPM on four frame sequences, each 400 frames long. It illustrates that iSGM performs better (with respect to the used measure; see the following section for its definition) on the bridge sequence, which is characterized by many structural details in the scene, but not as well as linBPM on the other three sequences. For the sequences dusk and midday, both performances are highly correlated, but not for the other two sequences. Of course, evaluating on only a few sequences of 400 frames each is insufficient for drawing substantial conclusions, but it does illustrate performance.


Figure 1.8 Normalized cross-correlation results when applying the third-eye technology for stereo matchers iSGM and linBPM for four real-world trinocular sequences of Set 9 of EISATS.

Courtesy of Waqar Khan, Veronica Suaste, and Diego Caudillo

The diagrams in Figure 1.8 are defined by the normalized cross-correlation (NCC) between a recorded third-frame sequence and a virtual sequence calculated based on the stereo matching results of the two other frame sequences. This third-eye technology (Morales and Klette 2009) also uses masks such that only image values are compared that are close to step-edges in the third frame (e.g., see Figure 1.5 for detected edges at bright pixels in LoG scale space). It enables us to evaluate performance on any calibrated trinocular frame sequence recorded in the real world.
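A minimal sketch of the normalized cross-correlation measure used in the diagrams of Figure 1.8, here for two full images and without the edge masks mentioned above; the inputs are placeholders:

    import numpy as np

    def ncc(A, B):
        """Normalized cross-correlation between two images of equal size."""
        a = A.astype(np.float64) - A.mean()
        b = B.astype(np.float64) - B.mean()
        return float(np.sum(a * b) / (np.sqrt(np.sum(a * a)) * np.sqrt(np.sum(b * b))))

    rng = np.random.default_rng(3)
    recorded = rng.integers(0, 256, (480, 640))
    virtual = rng.integers(0, 256, (480, 640))
    score = ncc(recorded, virtual)          # in [-1, 1]; 1 means perfect agreement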

1.3.2 Motion

A sequence of video frames I_t, all defined on the same carrier Ω, is recorded with a time difference δt between two subsequent frames; frame I_t is recorded at time t · δt, counted from the start of the recording.

The projection of a static or moving surface point into pixel p in frame I_t, and into pixel q in frame I_{t+1}, defines a pair of corresponding pixels, represented by a motion vector u = q − p from p to q in Ω.

1.3.2.1 Dense or Sparse Motion Analysis

Dense motion analysis aims at calculating approximately correct motion vectors for “basically” every pixel location p in frame I_t (see Figure 1.10 for an example). Sparse motion analysis is designed for having accurate motion vectors at a few selected pixel locations.


Figure 1.10 Visualization of optical flow using the color key shown around the border of the image for assigning a direction to particular colors; the length of the flow vector is represented by saturation, where value “white” (i.e., undefined saturation) corresponds to “no motion.” (a) Calculated optical flow using the original Horn–Schunck algorithm. (b) Ground truth for the image shown in Figure 1.4a.

Courtesy of Tobi Vaudrey

Motion analysis is a difficult 2D correspondence problem, and it might become easier once high-resolution images are recorded at higher frame rates in the future. For example, motion analysis is approached as a single-module solution by optical flow calculation, or as a multi-module solution by combining image segmentation with subsequent estimation of motion vectors for image segments.

1.3.2.2 Optical Flow

Optical flow is the result of dense motion analysis. It represents motion vectors u = (u, v) between corresponding pixels in frames I_t and I_{t+1}. Figure 1.10 shows the visualization of an optical flow map.

1.3.2.3 Optical Flow Equation and Image Constancy Assumption

The derivation of the optical flow equation (Horn and Schunck 1981)

I_x · u + I_y · v + I_t = 0   (1.18)

for u = (u, v) and first-order derivatives I_x, I_y, and I_t follows from the ICA, that is, by assuming that corresponding 3D world points are represented in the frames at times t and t + 1 by the same intensity. This is actually not true for computer vision in vehicles. Light intensities change frequently due to lighting artifacts (e.g., driving below trees), changing angles to the Sun, or simply due to sensor noise. However, the optical flow equation is often used as a data-cost term in an optimization approach (minimizing an energy as defined in Eq. (1.10)) for solving the optical flow problem.
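A minimal sketch of dense optical flow between two subsequent frames using OpenCV's Farnebäck method (a polynomial-expansion algorithm, not the Horn–Schunck method behind Eq. (1.18)); the parameter values and the synthetic frames are assumptions:

    import cv2
    import numpy as np

    prev = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    curr = np.random.randint(0, 256, (480, 640), dtype=np.uint8)

    # Positional arguments: pyr_scale, levels, winsize, iterations,
    # poly_n, poly_sigma, flags. flow[y, x] = (u, v) at pixel location (x, y).
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)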

1.3.2.4 Examples of Data and Smoothness Costs

If we accept Eq. (1.18) due to Horn and Schunck (and thus the validity of the ICA) as the data constraint, then we derive

E_data(p) = ( I_x(p) · u + I_y(p) · v + I_t(p) )²   (1.19)

as a possible data-cost term for any given time t.

We introduced above the mean-normalized census-cost function E_MCEN. The sum of these census costs, taken over all pixel locations and their corresponding locations, can replace the data-cost term in an optimization approach as defined by Eq. (1.10) (see Hermann and Werner (2013)). This accounts for the invalidity of the ICA for video data recorded in the real world.

For the smoothness-error term, we may use

E_smooth(u, v) = ‖∇u‖² + ‖∇v‖² = u_x² + u_y² + v_x² + v_y²   (1.20)

This smoothness-error term applies squared penalties to first-order derivatives, that is, it penalizes in the L_2 sense. Applying a smoothness term in an approximate L_1 sense reduces the impact of outliers (Brox et al. 2004).

The terms in Eqs. (1.19) and (1.20) define the L_2 optimization problem as originally considered by Horn and Schunck (1981).
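A compact sketch of the classical Horn–Schunck iteration that minimizes an energy built from costs as in Eqs. (1.19) and (1.20); the derivative approximations, the weight value, and the iteration count are assumptions:

    import cv2
    import numpy as np

    def horn_schunck(I1, I2, alpha=10.0, iters=100):
        """Dense flow (u, v) minimizing quadratic data and smoothness costs."""
        I1 = I1.astype(np.float64)
        I2 = I2.astype(np.float64)
        Ix = cv2.Sobel(I1, cv2.CV_64F, 1, 0, ksize=3)   # approx. of I_x
        Iy = cv2.Sobel(I1, cv2.CV_64F, 0, 1, ksize=3)   # approx. of I_y
        It = I2 - I1                                    # approx. of I_t
        u = np.zeros_like(I1)
        v = np.zeros_like(I1)
        # Averaging kernel for the local flow means used in the update step.
        avg = np.array([[1, 2, 1], [2, 0, 2], [1, 2, 1]], dtype=np.float64) / 12.0
        for _ in range(iters):
            u_avg = cv2.filter2D(u, -1, avg)
            v_avg = cv2.filter2D(v, -1, avg)
            num = Ix * u_avg + Iy * v_avg + It
            den = alpha**2 + Ix**2 + Iy**2
            u = u_avg - Ix * num / den
            v = v_avg - Iy * num / den
        return u, v

    prev = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    curr = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    u, v = horn_schunck(prev, curr)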

1.3.2.5 Performance Evaluation of Optical Flow Solutions

Apart from using data with provided ground truth (see EISATS and KITTI, and Figure 1.4), there is also a way of evaluating calculated flow vectors on recorded real-world video, assuming that the recording speed is sufficiently high: for flow vectors calculated between the frames at times t and t + 2, we calculate an image “half-way” between them using the mean of image values at corresponding pixels, and we compare this calculated image with the recorded frame at time t + 1 (see Szeliski (1999)). Limitations in the recording frequencies of current cameras make this technique not yet practically appropriate, but it is certainly appropriate for fundamental research.

1.3.3 Object Detection and Tracking

In general, an object detector is defined by applying a classifier to an object detection problem. We assume that any decision made can be evaluated as being either correct or false.

1.3.3.1 Measures for Object Detection

Let tp and fp denote the numbers of true-positives and false-positives, respectively. Analogously, we define tn and fn for the negatives; tn is not a common entry in performance measures.

Precision (PR) is the ratio of true-positives compared to all detections. Recall (RC) (or sensitivity) is the ratio of true-positives compared to all potentially possible detections (i.e., to the number of all visible objects):

PR = tp / (tp + fp)   and   RC = tp / (tp + fn)   (1.21)

The miss rate (MR) is the ratio of false-negatives compared to all objects in an image. False-positives per image (FPPI) is the ratio of false-positives compared to all detected objects in an image:

MR = fn / (tp + fn)   and   FPPI = fp / (tp + fp)   (1.22)

In the case of multiple images, the means of these measures can be used (i.e., averaged over all processed images).
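A minimal sketch of the measures in Eqs. (1.21) and (1.22) computed from given counts; the counts are arbitrary example values:

    def detection_measures(tp, fp, fn):
        pr = tp / (tp + fp)          # precision
        rc = tp / (tp + fn)          # recall (sensitivity)
        mr = fn / (tp + fn)          # miss rate
        fppi = fp / (tp + fp)        # false-positives per image, as defined above
        return pr, rc, mr, fppi

    print(detection_measures(tp=80, fp=20, fn=40))   # -> (0.8, 0.666..., 0.333..., 0.2)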

How do we decide whether a detected object is a true-positive? Assume that objects in images have been manually identified by bounding boxes, serving as the ground truth. All detected objects are matched with these ground-truth boxes by calculating the ratio of areas of overlapping regions

a = A(B_D ∩ B_GT) / A(B_D ∪ B_GT)   (1.23)

where A(·) denotes the area of a region in an image, B_D is the detected bounding box of the object, and B_GT is the bounding box of the matched ground-truth object. If a is larger than a threshold T, say T = 0.5, then the detected object is taken as a true-positive.
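A small sketch of the overlap ratio of Eq. (1.23) for two axis-aligned bounding boxes given as (x_min, y_min, x_max, y_max); the example boxes and the threshold are assumptions:

    def overlap_ratio(box_d, box_gt):
        """Area of intersection divided by area of union of two boxes."""
        ax1, ay1, ax2, ay2 = box_d
        bx1, by1, bx2, by2 = box_gt
        iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
        ih = max(0.0, min(ay2, by2) - max(ay1, by1))
        inter = iw * ih
        union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
        return inter / union if union > 0 else 0.0

    a = overlap_ratio((10, 10, 50, 50), (30, 30, 70, 70))
    is_true_positive = a > 0.5            # threshold T as used above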

1.3.3.2 Object Tracking

Object tracking is an important task for understanding the motion of a mobile platform or of other objects in a dynamic environment. The mobile platform with the installed system is also called the ego-vehicle, whose ego-motion needs to be calculated for understanding the movement of the installed sensors in the 3D world.

Calculated features in subsequent frames can be tracked (e.g., by using RANSAC for identifying an affine transform between feature points) and then used for estimating ego-motion based on bundle adjustment. This can also be combined with another module using nonvisual sensor data such as GPS data or data of an inertial measurement unit (IMU). For example, see Geng et al. (2015) for an integration of GPS data.

Other moving objects in the scene can be tracked using repeated detections, or by following a detected object from frame I_t to frame I_{t+1}. A Kalman filter (e.g., linear, general, or unscented) can be used for building a model of the motion as well as of the involved noise. A particle filter can also be used, based on extracted weights for potential moves of a particle in particle space. Kalman and particle filters are introduced, with references to related original sources, in Klette (2014).
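A minimal sketch of a linear Kalman filter for tracking an object's image position under a constant-velocity model, using OpenCV's cv2.KalmanFilter; the noise settings and the simulated measurements are assumptions:

    import cv2
    import numpy as np

    # State: (x, y, vx, vy); measurement: (x, y).
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], dtype=np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], dtype=np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1.0
    kf.errorCovPost = np.eye(4, dtype=np.float32)

    # Feed noisy detections of an object moving to the right.
    for t in range(10):
        prediction = kf.predict()                          # predicted (x, y, vx, vy)
        z = np.array([[100.0 + 5.0 * t], [200.0]], dtype=np.float32)
        estimate = kf.correct(z)                           # corrected state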

1.3.4 Semantic Segmentation

When segmenting a scene, the obtained segments should ideally correspond to defined objects in the scene, such as a house, a person, or a car in a road scene. Such segments define a semantic segmentation. Segmentation for vehicle technology aims at semantic segmentation (Floros and Leibe 2012; Ohlich et al. 2012) with temporal consistency along a recorded video sequence. Appearance is an important concept for semantic segmentation (Mohan 2014). The concept of superpixels (see, e.g., Liu et al. (2012)) might be useful for achieving semantic segmentation. Temporal consistency requires tracking of segments and similarity calculations between tracked segments.

1.3.4.1 Environment Analysis

There are static (i.e., fixed with respect to the Earth) or dynamic objects in a scenario which need to be detected, understood, and possibly further analyzed.

A flying helicopter (or just a multi-copter) should be able to detect power lines or other potential objects defining a hazard. Detecting traffic signs or traffic lights, or understanding lane borders of highways or suburban roads, are example tasks for road vehicles. Boats need to detect buoys and beacons.

Pedestrian detection has become a common subject of road-analysis projects. After detecting a pedestrian on a pathway next to an inner-city road, it would be helpful to understand whether this pedestrian intends to step onto the road in the next few seconds.

After detecting more and more objects, we may have the opportunity to model and understand a given environment.

1.3.4.2 Performance Evaluation of Semantic Segmentation

There is a lack of provided ground truth for semantic segmentations in traffic sequences. Work reported in current publications on semantic segmentation, such as Floros and Leibe (2012) and Ohlich et al. (2012), can be used for creating test databases. There is also current progress in available online data; see www.cvlibs.net/datasets/kitti/eval_road.php, www.cityscapes-dataset.net, and Ros et al. (2015) for a study of such data.

Barth et al. (2010) proposed a method for segmentation that is based on evaluating pixel probabilities of whether they are in motion in the real world or not (using scene flow and ego-motion). Barth et al. (2010) also provide ground truth for image segmentation in Set 7 of EISATS, illustrated by Figure 1.12. Figure 1.12 also shows resulting SGM stereo maps and segments obtained when following the multi-module approach briefly sketched earlier.


Figure 1.12 Two examples from Set 7 of EISATS, illustrated by preprocessed depth maps following the described method (Steps 1 and 2). Ground truth for segments is provided by Barth et al. (2010) and shown on top in both cases. Resulting segments using the described method are shown below in both cases.

Courtesy of Simon Hermann

Modifications in the involved modules for stereo matching and optical flow calculation influence the final result. There might be dependencies between performances of contributing programs.

1.4 Concluding Remarks

The vehicle industry worldwide has assigned major research and development resources to offering competitive vision-based components for vehicles. Research at academic institutions needs to address future or fundamental tasks, and challenges that are not of immediate interest to the vehicle industry, in order to continue contributing to this area.

The chapter introduced basic notation and selected visual tasks. It reviewed work in the field of computer vision in vehicles. There are countless open questions in this area, often related to

  1. adding further alternatives to only a few existing robust solutions for one generic or specific task,
  2. a comparative evaluation of such solutions,
  3. ways of analyzing benchmarks for their particular challenges,
  4. the design of more complex systems, and
  5. ways to test such complex systems.

Specifying and solving a specific task might be a good strategy for defining fundamental research, ahead of the currently extremely intense industrial research and development within the area of computer vision for vehicles. Aiming at robustness, including for challenging scenarios, and understanding interactions between multiple moving objects in dynamic scenes are certainly examples where further research is required.

Computer vision can help to solve true problems in society or industry, thus contributing to the prevention of social harms or atrocities; it is a fundamental ethical obligation of researchers in this field not to contribute to those, for example, by designing computer vision solutions for use in UAVs for killing people. Academics often identify ethics in research with subjects such as plagiarism, competence, or objectivity, but a main principle is also social responsibility. Computer vision in road vehicles can, for example, play a major role in reducing casualties in traffic accidents, which claim more than a million lives worldwide each year; it is a very satisfying task for a researcher to contribute to improved road safety.

Acknowledgments

The author thanks Simon Hermann, Mahdi Rezaei, Konstantin Schauwecker, Junli Tao, and Garry Tee for comments on drafts of this chapter.
