Chapter 10

3D Urban Scene Reconstruction and Interpretation from Multisensor Imagery

Hai Huang; Andreas Kuhn; Mario Michelini; Matthias Schmitz; Helmut Mayer    Institute for Applied Computer Science, Bundeswehr University Munich, Neubiberg, Germany

Abstract

We present an approach for 3D urban scene reconstruction and interpretation based on the fusion of terrestrial and unmanned aerial vehicle (UAV) imagery. The terrestrial close-range images acquired with high-resolution cameras reveal details of buildings, particularly of the facades. Yet, they cover the roofs and the ground only poorly due to the unfavorable viewing angles. Thus, they are complemented by UAV imagery taken from larger distances using nadir and oblique views, with a clear view of the ground and the roofs. The resulting wide-baseline images are fused by a precise and reliable pose estimation approach and dense 3D point clouds including color are reconstructed.

The colored 3D points are the input to semantic scene classification, which effectively fuses color and 3D geometric information. A set of “relative” features is proposed to provide an intra-class stable as well as inter-class discriminative description of objects. In comparison with the conventional “absolute” attributes, relative features provide context-sensitive measurements of both color and 3D geometry. The classification results for buildings are input to an automatic pipeline for level of detail 2 (LoD2) building model reconstruction combining a reliable scene as well as building decomposition with a subsequent primitive-based reconstruction and assembly. Finally, LoD3 models are obtained by integrating the results of facade image interpretation with an adapted convolutional neural network, which employs the 3D point cloud as well as the terrestrial images.

Keywords

Automatic pose estimation; Dense 3D reconstruction; Wide-baseline; Scene classification; Generative models; 3D building modeling; Convolutional neural network

10.1 Introduction

Related work on 3D (urban) scene reconstruction goes back to the 1970s and 1980s. The Ascona workshop in 1995 [1] gave a good overview of the state reached by the middle of the 1990s. Since then, numerous developments have occurred.

This chapter deals with the complete chain from pose estimation via dense 3D reconstruction, scene classification, and scene and building decomposition to building modeling at levels of detail (LoD) 2 and 3. Because the related work for all these different areas is rather disparate, we decided to include it in the respective sections. In the following we start with the description of our work on pose estimation suitable for unordered image sets possibly containing wide-baseline connections, whose closing is essential to avoid splitting the image sets into disjoint partitions.

10.2 Pose Estimation for Wide-Baseline Image Sets

Pose estimation was a focus of computer vision in the 1990s, with the book of Hartley and Zisserman [2] summing up the theoretical findings. Arguably, random sample consensus (RANSAC) [3] and the five-point algorithm [4] introduced the two most important ingredients to make fully automatic pose estimation a reality without approximate values for the pose as in classical photogrammetric setups [5].

Work that is by now classical, such as that of Pollefeys et al. [6], was based on (linear) image sequences. Later, the focus shifted to (extremely) large sets of images, so-called “community photo collections” [7]. While the feasibility has been shown in [8], particularly [9] and later [10] have demonstrated how to deal with millions or even a hundred million images.

Compared to the latter, our focus is rather different. We only deal with thousands of images, but we want to link as many of them as possible even if they are connected by a large baseline in comparison to the distance of the scene, as is needed when combining images from the ground with images from unmanned aerial vehicles (UAVs). This implies rather different viewing angles on the scene and, thus, makes the matching of points difficult.

Our approach requires no additional information except (an approximate) camera calibration. The latter can often be obtained from the meta-data of the images. The presentation in the following is split into two parts: In Sects. 10.2.1 and 10.2.2, means for pose estimation suitable for wide baselines are given, assuming that the overlap between the images is known. This first part basically follows [11,12], but with a couple of extensions and improvements added over the years. The second part in Sect. 10.2.3 consists of the capability to determine the overlap and, thus, to estimate the poses even when the ordering of the images is unknown, following [13,14].

10.2.1 Pose Estimation for Wide-Baseline Pairs and Triplets

As detailed below in Sect. 10.2.2, triplets are the basic building blocks for the determination of the poses of an image block by merging. While one could derive the poses for triplets directly from matches in three images, we start pose estimation with the smallest possible image sets, namely pairs. By this means, the complexity of matching in three images can be reduced by constraining it to the areas around the epipolar lines estimated for the pairs.

Because we want to deal with rather large baselines, we decided to employ affine least-squares matching [15,16] as the basic means for matching. “Affine” means that a patch is rotated by two angles (rotation and shear), different scales are employed for the two basic axes and two translations (in x- and y-direction) are used. This enables a much better adaptation of two patches than the usually employed translations in combination with one rotation and one scale. What is more, least-squares matching does not only produce very accurate results, but also provides estimates of the precision in the form of covariance matrices.

While the mathematically correct model for matching a planar patch would be the eight-parameter projective transformation, we found empirically that it cannot be reliably estimated from the typical patch size of 13 × 13 pixels. The latter has also been determined empirically and gives a good balance between sufficient information content and not being too large, so that disturbances, e.g., due to occlusions or non-planarity, do not affect the matching too adversely.

Unfortunately, least-squares matching is computationally very demanding and, thus, has to be limited to regions where it is likely to be successful. To this end, points are first detected by means of the scale invariant feature transform (SIFT) [17] in images reduced to a resolution of about 200 × 150 pixels. The points are then matched via (normalized) cross-correlation, because we found that the results of matching by cross-correlation do not degrade as abruptly as for matching by means of SIFT. Pairs with a correlation coefficient higher than an empirically determined low threshold of 0.5 are then input to least-squares matching, the results of which are only accepted if a high threshold of 0.95 is exceeded.
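The following is a minimal sketch of this candidate generation step, assuming grayscale input images and OpenCV with SIFT support; the affine least-squares matching itself is only indicated, and all names and parameters besides the quoted 0.5/0.95 thresholds and the 13 × 13 patch size are illustrative.

```python
# Sketch: SIFT points in strongly reduced images, matched by normalized
# cross-correlation (NCC) with the low threshold 0.5; only these candidates
# would be passed on to affine least-squares matching (not shown here),
# whose results are accepted above 0.95.
import cv2
import numpy as np

PATCH = 13       # patch size used for matching (cf. text)
NCC_LOW = 0.5    # prefilter threshold for cross-correlation
TARGET_W = 200   # approximate reduced resolution (about 200 x 150 pixels)

def reduced(img):
    s = TARGET_W / img.shape[1]
    return cv2.resize(img, None, fx=s, fy=s, interpolation=cv2.INTER_AREA)

def ncc(a, b):
    a = a.astype(np.float32) - a.mean()
    b = b.astype(np.float32) - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom > 0 else -1.0

def patch(img, kp, half=PATCH // 2):
    x, y = map(int, kp.pt)
    p = img[y - half:y + half + 1, x - half:x + half + 1]
    return p if p.shape == (PATCH, PATCH) else None

def candidate_pairs(img1, img2):
    """For every SIFT point in img1, keep the best NCC partner in img2."""
    g1, g2 = reduced(img1), reduced(img2)
    sift = cv2.SIFT_create()
    k1, k2 = sift.detect(g1, None), sift.detect(g2, None)
    pts2 = [(kp, patch(g2, kp)) for kp in k2]
    out = []
    for kp1 in k1:
        p1 = patch(g1, kp1)
        if p1 is None:
            continue
        best_score, best_kp = -1.0, None
        for kp2, p2 in pts2:
            if p2 is None:
                continue
            s = ncc(p1, p2)
            if s > best_score:
                best_score, best_kp = s, kp2
        if best_score > NCC_LOW:
            out.append((kp1.pt, best_kp.pt, best_score))
    return out   # input to least-squares matching (accept only if NCC > 0.95)
```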

The resulting point pairs are input to the estimation of the relative pose of the pairs based on RANSAC and the five-point algorithm. We have extended this by a variant of the expectation maximization (EM) algorithm, alternating between expectation in the form of the determination of the inliers and maximization realized by means of robust bundle adjustment [18].
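A drastically simplified sketch of this scheme is given below, assuming point correspondences as NumPy arrays and OpenCV for the RANSAC/five-point initialization; the maximization step is approximated here by re-fitting the essential matrix to the current inliers, whereas the actual implementation uses a robust bundle adjustment [18]. The inlier threshold is illustrative.

```python
# Simplified EM-like alternation: the expectation step re-determines inliers
# via the Sampson distance, the maximization step is approximated by
# re-estimating the essential matrix from the current inliers.
import cv2
import numpy as np

def sampson_distance(E, x1, x2):
    """Sampson distances for normalized homogeneous points (3 x N)."""
    Ex1, Etx2 = E @ x1, E.T @ x2
    num = np.sum(x2 * Ex1, axis=0) ** 2
    den = Ex1[0] ** 2 + Ex1[1] ** 2 + Etx2[0] ** 2 + Etx2[1] ** 2
    return num / den

def em_pose(pts1, pts2, K, thr=1e-4, iters=5):
    """pts1, pts2: (N, 2) float arrays of corresponding image points."""
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    Kinv = np.linalg.inv(K)
    x1 = Kinv @ np.vstack([pts1.T, np.ones(len(pts1))])
    x2 = Kinv @ np.vstack([pts2.T, np.ones(len(pts2))])
    inl = mask.ravel().astype(bool)
    for _ in range(iters):
        inl = sampson_distance(E, x1, x2) < thr           # expectation step
        E_new, _ = cv2.findEssentialMat(pts1[inl], pts2[inl], K,
                                        method=cv2.LMEDS)  # approx. maximization
        if E_new is None or E_new.shape != (3, 3):
            break
        E = E_new
    _, R, t, _ = cv2.recoverPose(E, pts1[inl], pts2[inl], K)
    return R, t, inl
```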

Once the relative pose has been determined for the pairs, the matching is repeated at double the resolution for triplets, reducing the search space considerably by means of the epipolar lines derived from the essential matrices estimated for the pairs. Pose estimation employs one “master” image, for which the relative pose to the two other “slave” images is determined as above with RANSAC and the five-point algorithm using the same five points in the “master” image. To link the pairs, the relative scale is computed by triangulating the five points and computing the relative distances in both pairs. The median of the ratio of the distances is taken as the relative scale, i.e., the relative length of the baselines. This basic result for triplets is refined once more at double the resolution, projecting points into the three images via the trifocal tensors derived from the relative poses of the triplets.

In addition to the above approach for pairs, we have developed means to also estimate the poses for pairs with extreme baselines [19]. It is based on distorting the images by means of a large set of different projective transformations and detecting and matching points in the transformed images. While the results demonstrate that it is possible to estimate the pose for these extreme configurations, the computational complexity is extremely high, making the approach only useful if no more than a couple of image pairs need to be matched. In [20] an approach has been presented addressing the problem that often the pose is precisely determined only for parts of an image, as not enough points can be matched in difficult areas with strong perspective distortions such as roads seen from ground level. While [20] demonstrates the feasibility, it is limited to specific setups (roads seen from ground level in the lower part of the image).

10.2.2 Hierarchical Merging of Triplets

The derived triplets are the basis for the determination of the poses for the whole image set. The approach is described in detail in [21]. At the core is the hierarchical merging of image sets starting with triplets as the minimum sets.

While in principle 3D points alone or combinations of 3D points and camera poses can be used to align the coordinate system of one image set to another image set, here the poses of two images included in both image sets (e.g., images 2 and 3 in the two triplets consisting of images 1, 2, 3 and 2, 3, 4, respectively) are used to compute a 3D Euclidean transformation. Even though the pose of one image is enough to compute the relative translation and rotation, the relative scale of both coordinate systems can only be computed using additional information such as 3D points seen in two or more images in both sets. As individual 3D points can be wrong due to matching errors, particularly for wide baselines, we opted to use the pose of an additional image included in both sets (i.e., image 3 in addition to image 2 in the above example) as an accumulator for the relative distances of the 3D points: the relative scale of both sets is computed as the ratio between the lengths of the baselines (between images 2 and 3 in the above example) in both sets.

Because two images included in both sets are used, a merge of two image sets with n_1 and n_2 images leads to a combined set consisting of n_1 + n_2 − 2 images.
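A minimal NumPy sketch of the merging transformation is given below, assuming each pose is available as a world-to-camera rotation R and a camera center C; the scale follows from the ratio of the baseline lengths between the two shared images, rotation and translation from the pose of one shared image.

```python
# Sketch of aligning image set A to image set B via the two shared images.
# Poses are assumed as world-to-camera rotation R (3x3) and camera center C (3,).
import numpy as np

def merge_transform(R2_a, C2_a, C3_a, R2_b, C2_b, C3_b):
    """Transformation X_b = s * R * X_a + t mapping set A into set B."""
    s = np.linalg.norm(C3_b - C2_b) / np.linalg.norm(C3_a - C2_a)  # baseline ratio
    R = R2_b.T @ R2_a          # relative rotation from the shared image 2
    t = C2_b - s * R @ C2_a    # translation from the shared camera center
    return s, R, t
```

Applying s, R and t to all camera centers and 3D points of set A expresses them in the frame of set B; the world-to-camera rotations of set A transform accordingly as R_i R^T.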

While in earlier work [11] we had employed a serial extension by one triplet at a time, this was replaced by a hierarchical approach in [21]. Here, sets of approximately the same size are merged. This has two advantages: First, because we have found that after merging bundle adjustment is a must to obtain a reliable result by staying close to an optimal solution, the hierarchical approach avoids numerous bundle adjustments for large sets only slightly smaller than the complete set (in the serial case one has to adjust the complete set, the set minus one image, the set minus two images, etc.). Second, when the sets are of significantly different size, the employed robust bundle adjustment (cf. Sect. 10.2.1) tends to throw out the points from the smaller set, which due to a lower redundancy are less precise and thus more likely regarded as outliers.

In [21] it is also shown how one should reduce the points for merging if one wants to obtain a higher efficiency. The only way found to reduce the number of points that avoids a bias in the result is to randomly delete 3D points. Intuitive strategies such as keeping points seen in as many images as possible, also in combination with a policy to keep points well spread over the images, give a better precision, but it was found that they can lead to a severe bias.

Recently, in [14] it has been shown that a search for maximum matching in the graph describing the connections between the images leads to a higher degree of parallelization and, therefore, a higher efficiency.

10.2.3 Automatic Determination of Overlap

The approach introduced in Sects. 10.2.1 and 10.2.2 assumes that the relation, i.e., the overlap, between the images is known. Determining the overlap for larger image sets is a difficult problem, because basically all images have to be compared to all others. The least-squares matching approach of Sect. 10.2.1 can deal with large baselines, but its complexity is far too high to be used for all possible pairs.

Thus, a multistage approach has been devised [13,14], which first matches SIFT points in image pairs highly efficiently. The number of matches in each pair is the basis to restrict the approach introduced in Sect. 10.2.1 to a small subset of promising pairs and triplets. This allows one to estimate the pose for image sets containing essential wide-baseline connections which are the only means to connect certain subsets, but at the same time limits the computational complexity.

The relationships between images are modeled by an undirected weighted graph G_I = (V_I, E_I), where a node v ∈ V_I represents an image and an edge (v_i, v_j) ∈ E_I connects two overlapping images corresponding to the nodes v_i and v_j. The edges are weighted using the Jaccard distance [22] with correspondences determined by matching binary SIFT descriptors. The latter are obtained by embedding SIFT descriptors [17] from the continuous space R^128 into the Hamming space H^128 = {0,1}^128 based on orthants. By this means, a compact representation of the descriptors as bit vectors is achieved, resulting in a very fast comparison using the Hamming distance followed by the distance ratio test [17].
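The sketch below illustrates this fast pairwise matching; the orthant-style binarization shown here (thresholding each descriptor dimension at the descriptor mean) and the Jaccard weighting over the matched feature sets are plausible simplifications of the exact formulation in [13,14,22].

```python
# Sketch: binary SIFT descriptors, Hamming matching with ratio test, and
# Jaccard distance as edge weight for the image graph G_I.
import numpy as np

def binarize(desc):
    """Map SIFT descriptors (N x 128) to packed bit vectors (N x 16 bytes)."""
    bits = desc > desc.mean(axis=1, keepdims=True)   # assumed orthant mapping
    return np.packbits(bits, axis=1)

def hamming(a, B):
    """Hamming distances between one packed descriptor and a set of them."""
    return np.unpackbits(np.bitwise_xor(a, B), axis=1).sum(axis=1)

def match(binA, binB, ratio=0.8):
    """Ratio-test matches between two sets of binary descriptors."""
    matches = []
    for i, a in enumerate(binA):
        d = hamming(a, binB)
        j, k = np.argsort(d)[:2]
        if d[j] < ratio * d[k]:
            matches.append((i, j))
    return matches

def jaccard_edge_weight(nA, nB, n_matches):
    """Jaccard distance used to weight the edge between two images."""
    return 1.0 - n_matches / float(nA + nB - n_matches)
```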

The graph G_I allows for a straightforward modeling if pairs are used for linking. However, it lacks descriptiveness in the case of triplet-based linking, which requires higher-order relationships. In addition, the employed hierarchical merging of triplets (cf. Sect. 10.2.2) uses pairs to propagate the geometry (i.e., linkable triplets must have two images in common) and this constraint cannot be modeled by G_I.

Therefore, we employ the line graph L(G_I) = (V_L, E_L) of the graph G_I to describe the linking of images. Its set of nodes V_L consists of the edges of G_I. An edge (v_i, v_j) ∈ E_L exists iff the incident nodes v_i and v_j, corresponding to edges in E_I, have exactly one node v ∈ V_I in common. Hence, V_L contains nodes corresponding to pairs of overlapping images, where two nodes are adjacent if the pairs have an image in common. Because of this, the linking of triplets using pairs for geometry propagation corresponds to a traversal through L(G_I). An edge in L(G_I) corresponds to a triplet and its weighting is based on the lowest quality of the three pairs of the triplet, where the quality of a pair is defined as the number of correspondences weighted by the roundness [23] of the corresponding reconstructed 3D points.
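With networkx, the line graph and the triplet weighting can be sketched as follows; the per-pair quality is assumed to be stored as an edge attribute "quality" of G_I, and a missing third pair is simply assigned the lowest possible quality.

```python
# Sketch: nodes of L(G_I) are image pairs, its edges are triplets, each
# weighted by the lowest quality of the three pairs of the triplet.
import networkx as nx

def build_line_graph(G_I):
    L = nx.line_graph(G_I)                       # nodes = pairs, edges = triplets
    for pair_u, pair_v in L.edges():
        third = tuple(set(pair_u) ^ set(pair_v))     # the two non-shared images
        qualities = [G_I.edges[pair_u]["quality"],
                     G_I.edges[pair_v]["quality"]]
        if G_I.has_edge(*third):
            qualities.append(G_I.edges[third]["quality"])
        else:
            qualities.append(0.0)    # third pair not observed: lowest quality
        L.edges[pair_u, pair_v]["weight"] = min(qualities)
    return L
```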

For a direct modeling of images, we extend L(G_I) to explicitly represent the images using a second node type, leading to the so-called linking graph G = (V_I ∪ V_L, E). It comprises pair nodes n_p ∈ V_L corresponding to image pairs and image nodes n_i ∈ V_I corresponding to images. An image node and a pair node are adjacent if the pair contains the image corresponding to the image node. Because the image nodes are only required for a complete modeling, no edges exist between them.

The linking graph completely describes the image linking, but can contain links of varying quality, e.g., due to critical camera configurations [24], as well as redundant links which are either less suitable or not required for pose estimation. Thus, we determine a linking subgraph (LSG) containing only essential links by searching for a terminal minimum Steiner tree [25] in the linking graph.

However, due to the tree-like structure, potential image loops in the LSG are not closed. At this stage, we can obtain additional information in the form of approximate poses very fast by hierarchically merging the triplets contained in the LSG without bundle adjustment. By this means we are able to efficiently search for image pairs which can be used to close loops by restricting the search space using a Euclidean neighborhood as well as the view direction difference between images. In addition, we determine the length of a potential image loop in the form of the graph distance between its outermost image nodes and restrict it as well.

For an efficient pose estimation, the linking graph is iteratively constructed using minimum spanning trees of G_I, and its subgraph LSG is determined, until all images are suitably linked or a further search is no longer meaningful. As the terminal Steiner tree problem has been shown to be NP-complete [25], an approximation [26] is used for the determination of the LSG. After closing the image loops, hierarchical merging of triplets (cf. Sect. 10.2.2) is employed to determine the actual poses.
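A sketch of the LSG determination with a generic Steiner tree approximation from networkx is given below; it ignores the terminal-as-leaf constraint of the terminal Steiner tree as well as the iterative construction and loop closing described above, and thus only indicates the principle (the chapter uses the approximation of [26]).

```python
# Sketch: approximate Steiner tree over the linking graph, with the image
# nodes as terminals and the triplet quality encoded in the edge weights.
import networkx as nx
from networkx.algorithms.approximation import steiner_tree

def linking_subgraph(linking_graph, image_nodes):
    """Approximate minimum Steiner tree spanning the image nodes."""
    return steiner_tree(linking_graph, terminal_nodes=image_nodes,
                        weight="weight")
```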

Fig. 10.1 shows the poses (left) as well as the links between them (right) for our running example scene Building 1.

Figure 10.1 Pose estimation for Building 1. Camera poses are presented as pyramids and the colors symbolize a different camera type/calibration. The links between the poses visualize detected overlap between the respective images.

10.3 Dense 3D Reconstruction

Accurately estimated camera poses for each image are the basis for the generation of dense depth maps considering geometric epipolar constraints. For surface reconstruction from the depth maps a multitude of methods has been developed in recent decades.

In particular, volumetric methods based on local optimization have demonstrated their potential for 3D reconstruction on a larger scale. The basic idea of extracting an iso-surface from voxels numerically occupied by propagated truncated signed distance functions was first presented by Curless and Levoy [27]. Goesele et al. [28] adapted it to 3D reconstruction from Multi-View Stereo (MVS) depth maps, while Sagawa et al. [29] extended the fusion to varying voxel sizes. Finally, Fuhrmann and Goesele [30] employed variable surface qualities for large-scale volumetric reconstruction from MVS images.

Instead of linear distance functions used in the latter methods, we employ an alternative probabilistic distance function for varying voxel resolutions. An additional filtering step allows for outlier removal in challenging configurations from noisy depth maps. The filtering is based on the idea of free-space constraints, which was originally proposed by Merrell et al. [31]. In contrast to the image-based consistency checks, we propose a probabilistic filtering in the volumetric 3D space.

10.3.1 Dense Depth Map Generation and Uncertainty Estimation

We use semi-global matching (SGM) [32] for MVS matching as it allows for an efficient but still effective processing even for high-resolution images. In our pipeline, stereo pairs that have a sufficient overlap determined from the sparse 3D point cloud generated by pose estimation are selected from the entire image set. Subsequently, the pairwise estimated stereo depth maps are fused for each image to compensate for noise and to filter outliers. In addition, SGM makes use of peak filtering by means of depth map clustering.

In general, the optimization scheme employed in SGM allows for a dense reconstruction of disparities even in weakly textured areas. Because of the lower quality in weakly textured areas, we additionally employ a pixelwise uncertainty estimation of disparities as post-processing [33,20]. It has been shown that besides the lack of texture, especially the orientation of the surface relative to the cameras' line-of-sight influences the disparity uncertainty. This can be traced back to the fronto-parallel prior assumptions embedded in the SGM optimization framework. Therefore, we classify the disparity uncertainties to consider their quality subsequently in the 3D surface reconstruction (cf. Fig. 10.2).

Figure 10.2 From left to right: 1. Input image, 2. Disparity map estimated by SGM and 3. Uncertainty map derived from the disparity map. The disparity map represents distances from light gray (near) to dark gray (far). The uncertainty map shows uncertainty classes of the disparity map from red (high uncertainty) to blue (low uncertainty). Especially slanted surfaces and low-texture areas are classified as having a relatively higher uncertainty.

While the estimation of the surface orientation (e.g., the normal vector) cannot be done in a stable way on disparity maps, we found a feature based on the local Total Variation (TV) of disparities to work well for uncertainty estimation. More precisely, we determine the TV in an iteratively growing pixel neighborhood (window) until it exceeds a given threshold. We define the resulting window size as a TV class and use its uncertainty for the estimation of the spatial error. To assign a disparity uncertainty to a TV class, we make use of stereo ground truth data [34] and learn a Gaussian-uniform mixture function representing noise and outliers with an EM approach for each TV class. The finally employed disparity uncertainties range from a quarter of a pixel to a couple of pixels, demonstrating the span of uncertainty of individual disparities, which strongly influences the uncertainty of the 3D points.
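The TV class computation can be sketched as follows; threshold and maximum window radius are illustrative, and the subsequent mapping of classes to disparity uncertainties via the learned Gaussian-uniform mixtures is not shown.

```python
# Sketch of the TV-based classes: the window around each pixel is grown until
# the local total variation of the disparities exceeds a threshold; the
# reached window radius defines the TV class.
import numpy as np

def tv_classes(disp, tv_threshold=10.0, max_radius=8):
    """Return per-pixel TV class (larger class = smoother neighborhood)."""
    h, w = disp.shape
    classes = np.zeros((h, w), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            for r in range(1, max_radius + 1):
                win = disp[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
                # total variation: sum of absolute neighbor differences
                tv = np.abs(np.diff(win, axis=0)).sum() + \
                     np.abs(np.diff(win, axis=1)).sum()
                if tv > tv_threshold:
                    break
            classes[y, x] = r
    return classes
```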

10.3.2 3D Uncertainty Propagation and 3D Reconstruction

From dense disparity maps with pixelwise disparity uncertainty we estimate the 3D coordinates and the corresponding 3D error functions [35,20]. To this end, we use well-known analytic error propagation [36]. Besides the disparity uncertainty, the 3D error depends on the focal length, the baseline length, the image resolution and the distance of the 3D point to the camera.

For an unrestricted 3D reconstruction from 3D point clouds with varying quality, especially volumetric approaches have shown their potential for 3D reconstruction on a large scale [20]. For a volumetric representation, the 3D space is discretized into voxels in the surrounding of a 3D point. Depending on the spatial distance to the 3D point, individual voxels are assigned values that describe their likelihood of being behind or in front of a surface (cf. Fig. 10.3). We use probabilistic Gaussian cumulative distribution functions (CDFs) for the estimation of the voxel values [35,20]. For measurements from multiple cameras, the individual values are fused using a sound theoretical model employing binary Bayes theory. The final volumetric space can be transformed into 3D point clouds or triangle meshes by a Gaussian regression considering neighboring voxel values.
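A minimal sketch of the voxel update and fusion along one line-of-sight is given below; the values are illustrative and the exact formulation is given in [35,20].

```python
# Sketch: the probability of a voxel lying behind the surface is the Gaussian
# CDF of its signed distance to the measured 3D point along the ray, and
# measurements from several cameras are fused as a binary Bayes filter in
# log-odds form.
import numpy as np
from scipy.stats import norm

def behind_surface_probability(voxel_depths, point_depth, sigma):
    """P(voxel behind surface) for voxels at given depths along the ray."""
    return norm.cdf((voxel_depths - point_depth) / sigma)

def fuse_log_odds(probabilities, prior=0.5, eps=1e-6):
    """Binary Bayes fusion of per-camera probabilities for one voxel."""
    p = np.clip(np.asarray(probabilities), eps, 1.0 - eps)
    log_odds = np.log(p / (1.0 - p)).sum() + np.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + np.exp(-log_odds))

# Example: a voxel 0.02 m behind a point measured with sigma = 0.01 m,
# fused with two further (illustrative) observations.
p = behind_surface_probability(np.array([0.52]), 0.50, 0.01)   # ~0.98
print(fuse_log_odds([float(p[0]), 0.9, 0.8]))
```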

Figure 10.3 The accuracy of a 3D point on the line-of-sight can be represented by a Gaussian distribution (blue). The integral over the function (green) describes the probability of the actual point being behind a surface. We use this function to assign voxels on the line-of-sight a surface probability (right). Blue voxels have a low probability while red voxels have a high probability of being behind a surface.

The volumetric reconstruction is implemented in a runtime- and memory-efficient way using octree data structures. For the generation of 3D models from images captured at varying distances it is essential to propagate the depths into multiple octree levels. To this end, the voxel size (cf. Fig. 10.3) corresponding to the octree depth is estimated for each measurement. More precisely, individual 3D points are included in the octree on varying levels, which are selected considering the scale implied by the 3D error derived from the disparity uncertainty and the camera configuration. At this point, our TV prior functions act as a local regularization term and points on lower octree depths are filtered by points with relatively lower uncertainty.

We have also demonstrated that the employed local optimization allows for the reconstruction of very large scenes by means of a division of the entire 3D reconstruction space into smaller subsets [37] (cf. Fig. 10.4). To this end, the space is incrementally split until the number of 3D points is below a given number in each subset. While a globally optimized surface would lack consistent scalability, our local optimization paradigm avoids complex divide and conquer strategies for fusing neighboring subspaces.

Figure 10.4 Village Bonnland reconstructed from hundreds of high-resolution images. Employing terrestrial images capturing the scene at close distance our method allows for the reconstruction of important details (zoomed area). The reconstruction space was split in hundreds of subspaces and processed in parallel on a cluster system in a couple of hours. Due to the local optimization employed in the surface reconstruction neighboring surface parts are consistent even though they have been processed independently.

10.4 Scene Classification

We propose a robust and efficient analytical approach for automatic urban scene classification based on imagery and elevation (2.5D) data. Extending the approach described in [38], relative features for both color and geometry, i.e., color coherence, relative height and 3D point coplanarity, are proposed to deal with object diversity and scene complexity. The classification is conducted using a random forest (RF) [39] classifier. Feature extraction and classification can be performed in parallel on independent partitions of the scene, speeding up the processing significantly.

10.4.1 Relative Features

Replacing absolute by relative features to obtain more stable classifications is not a novel idea. E.g., relative heights of buildings and trees in relation to the ground derived based on an estimated DTM (digital terrain model) have become standard and are even considered the most significant features [40] for urban scene classification. Previous work, however, suffers from (1) the heterogeneous appearance of objects of the same class, e.g., diverse sizes and types of buildings as well as different materials/colors of streets, and (2) the similarity of features of objects from different classes, e.g., heights of buildings and trees. Shadows and variable lighting conditions during data acquisition make the situation even more difficult. The challenge is to devise more intra-class stable and, at the same time, more inter-class discriminative features concerning both color and geometry information, which is one of the goals of this work.

10.4.1.1 Color coherence

In urban scene classification changing lighting conditions usually cannot be avoided and may considerably influence the results. For instance, shadows of buildings and trees are cast on the streets, lawns and roofs. These regions are often misclassified because of their substantially different color values. Color coherence allows for intra-class stability by dealing with objects under various lighting conditions. It measures the difference between two colors with a single value and has been proposed for image segmentation in [41]. We employ the Lab color space, which has an independent channel (L) for lightness/luminance, i.e., it can deal with various illumination conditions. The other two channels a and b represent red/green and yellow/blue, respectively. We assume that (1) the a channel is sensitive to vegetation objects, i.e., trees and lawn, and (2) the b channel can help to analyze objects lying in the shadow. This is because Rayleigh scattering, i.e., diffuse sky radiation (scattered solar radiation), makes the sky blue, while the sun itself is more yellowish. Thus, as the bright side of a roof takes more direct sun light, it has a more yellow tone, while the dark side of a roof reflects more sky radiation and, therefore, has a more blue tone.

Color coherence is a relative feature concerning a predefined color. In contrast to [41], we do not measure the direct color distance between two arbitrary objects, but the distance from the current data point to a given object class. As the “reference” class we choose vegetation because of its relatively invariant appearance concerning color for both RGB and multispectral images. Fig. 10.5 (right) shows a map of color coherence concerning vegetation for a − b, where the vegetation areas are dark while all other objects are lighter, i.e., have a significant difference to vegetation. Fig. 10.5 also shows that in comparison with only the a channel (center right), subtracting the b channel gives a much better result concerning the influence of shadows (right).
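A minimal sketch of this feature is given below, assuming 8-bit BGR input and OpenCV's Lab conversion; the vegetation reference value for a − b is purely illustrative and would in practice be derived from labeled vegetation samples.

```python
# Sketch of the color coherence feature: convert to Lab, drop the lightness
# channel, and measure the distance of (a - b) to a vegetation reference.
import cv2
import numpy as np

def color_coherence(bgr_image, veg_reference_ab=-30.0):
    lab = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2LAB).astype(np.float32)
    a = lab[:, :, 1] - 128.0   # OpenCV stores a and b with an offset of 128
    b = lab[:, :, 2] - 128.0
    feature = a - b            # lightness-free, shadow-tolerant measure
    return np.abs(feature - veg_reference_ab)   # small = similar to vegetation
```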

Figure 10.5 Color coherence concerning vegetation for an RGB image.

10.4.1.2 Definition of neighborhood

The spatial neighborhood relationship is employed in many approaches to integrate contextual information for more plausible results. The definition of neighborhood shared by most related work consists of the data points around the current point. For a square image grid, the range of neighbors is a simple (2m+1) × (2m+1) matrix with m the order of the neighborhood (Fig. 10.6, left). While this is a general setup widely used, it only works well under the assumption that the meaningful neighbors are isotropically distributed or there is no direction-related information to consider. Another tricky problem is the order/size of the neighborhood: A larger window is desirable to include more contextual information, but it also implies a more time-consuming processing. The latter becomes particularly critical when graphical models are used, as higher-order neighbors mean an exponentially increased computational effort.

Figure 10.6 Definitions of neighborhood: Conventional definition with second-order neighbors (left) and two radial patterns (center and right).

Inspired by [32], radial neighborhood patterns (Fig. 10.6, center and right) are proposed. The lengths of the “beams” can be set to “infinity” [32] to reach the boundary of the data and cross as many different objects as possible, while the size of the whole population of neighbors is still acceptable. The length as well as the thickness of the “beams” (Fig. 10.6, right) can be adapted to the object characteristics, e.g., height information in undulating terrain (cf. Sect. 10.4.1.3), where a finite setup is appropriate.

10.4.1.3 Relative height

Relative height is defined as the elevation difference between the current data point and the estimated local ground level. By this means, the differences between classes become more discriminative and stable. A radial neighborhood pattern (cf. above) is employed for relative height computation. For scenes with flat terrain, for instance Fig. 10.7 (left), an “infinite” length of the “beams” is used to gather as much information as possible by including all available ground data.

Figure 10.7 Employed neighborhoods for relative height for flat (left) and undulating terrain (right).

In contrast, a limited length is mandatory for scenarios with larger height differences, such as undulating terrain (Fig. 10.7, right). On hillsides or in valleys, the ground level of one building can be similar to or even higher than the roof of another building in the neighborhood. This implies that the absolute height from the training data is not a valid prior for buildings any more. In this case, the range of neighbors is set slightly larger than the average building length.

As shown in Fig. 10.8, the height maps for valley and hillside areas are enhanced using the relative height. What is more important, the height values of different objects, e.g., roofs and ground, are directly comparable in the relative height maps. That is, a classifier trained on one dataset can be applied to both areas.
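A simple sketch of the relative height computation with finite radial beams is given below; the array name dsm, the beam length, the number of beams and the ground percentile are illustrative assumptions.

```python
# Sketch: for each cell the local ground level is estimated as a low
# percentile of the heights sampled along a few radial "beams" of limited
# length; the relative height is the height above this estimate.
import numpy as np

def relative_height(dsm, beam_length=100, n_beams=8, ground_percentile=5):
    h, w = dsm.shape
    angles = np.linspace(0.0, 2.0 * np.pi, n_beams, endpoint=False)
    steps = np.arange(1, beam_length + 1)
    rel = np.zeros_like(dsm, dtype=np.float32)
    for y in range(h):
        for x in range(w):
            samples = []
            for a in angles:
                ys = (y + np.round(np.sin(a) * steps)).astype(int)
                xs = (x + np.round(np.cos(a) * steps)).astype(int)
                valid = (ys >= 0) & (ys < h) & (xs >= 0) & (xs < w)
                samples.append(dsm[ys[valid], xs[valid]])
            ground = np.percentile(np.concatenate(samples), ground_percentile)
            rel[y, x] = dsm[y, x] - ground
    return rel
```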

Figure 10.8 Absolute and relative height for valley and hillside.

10.4.1.4 Coplanarity of 3D points

Coplanarity measures how well the current point and its neighbors form a plane. A common plane for the given point and its neighbors is estimated using RANSAC [3] and their coplanarity is quantified by calculating the percentage of inliers to the estimated plane. Coplanarity is employed as a feature to infer the probability of the current point being a part of a planar object. In this work, it was found to be very effective to distinguish trees from other objects, especially roofs, which might have very similar heights in most European urban areas. Fig. 10.9 presents a coplanarity map calculated from the direct neighbors of each pixel.
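A minimal sketch of this coplanarity measure is given below; the inlier threshold and the number of RANSAC iterations are illustrative.

```python
# Sketch: fit a plane to a point and its neighbors with RANSAC and return the
# inlier fraction as the coplanarity value.
import numpy as np

def coplanarity(points, inlier_threshold=0.05, iterations=100, rng=None):
    """points: (N, 3) array of the current point and its neighbors."""
    rng = rng or np.random.default_rng(0)
    best_inliers = 0
    for _ in range(iterations):
        p0, p1, p2 = points[rng.choice(len(points), size=3, replace=False)]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        distances = np.abs((points - p0) @ normal)
        best_inliers = max(best_inliers,
                           int((distances < inlier_threshold).sum()))
    return best_inliers / float(len(points))
```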

Figure 10.9 Coplanarity map.

In summary, relative features integrate both local and contextual information. The definition of “context” is extended from a geometric neighborhood to a more semantic description of environment (ground height) and class (vegetation). In comparison to conventional features with absolute values, the proposed relative features are more discriminative between different classes, and at the same time more stable for objects of the same class.

10.4.2 Classification and Results

A standard RF classifier is employed for the classification. The calculation of the features and the classification with the trained classifier are both implemented for parallel processing splitting the dataset into independent partitions. Because the proposed features are robust for various scenarios, we empirically found that a generally trained classifier can be directly applied to all partitions from the same dataset and provide reasonable results without additional local or incremental training. The computation time can be reduced considerably because the partitions can be distributed. As long as the partitions have a reasonable size, i.e., are large enough to contain whole major objects like buildings and road segments with full width, the division will only marginally deteriorate the results.
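The classification stage can be sketched as follows with scikit-learn and joblib; feature extraction is assumed to yield one feature vector per cell, and the tile layout and worker count are illustrative.

```python
# Sketch: train one random forest and apply it to independent partitions
# (tiles) of the feature data in parallel.
from joblib import Parallel, delayed
from sklearn.ensemble import RandomForestClassifier

def train_classifier(features, labels, n_trees=100):
    rf = RandomForestClassifier(n_estimators=n_trees, n_jobs=-1)
    rf.fit(features, labels)         # features: (N, F), labels: (N,)
    return rf

def classify_tiles(rf, tiles, n_workers=4):
    """tiles: list of (cells_i, F) feature arrays, one per partition."""
    return Parallel(n_jobs=n_workers)(
        delayed(rf.predict)(tile) for tile in tiles
    )
```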

10.4.2.1 Post-processing

Post-processing with blob filters is conducted to correct local errors based on semantic consistency constraints: e.g., the regions of roads and buildings should be homogeneous without small spots of tree or lawn inside. On the other hand, gaps are allowed for trees and lawn. For instance, ground or lawn may be visible in a gap between trees and trees may be found in the middle of a lawn.

The post-processing also corrects errors caused by data artifacts and improves the plausibility of the results. The size of the blobs to be filtered is determined based on the empirically derived object size of each class and has to be adapted to varying data resolution and quality.

10.4.2.2 Results for Bonnland

The Bonnland dataset (cf. also Sect. 10.2 and Fig. 10.10) includes buildings in the valley and on the hillside. The height of the hillside ground level can be greater than that of some building roofs in the valley. Accordingly, a classifier trained with absolute height values for ground and buildings will not work in this area. The proposed relative feature for height, by contrast, is stable in the undulating terrain without need to train the classifier locally. Fig. 10.11 presents the classification result for the whole dataset.

Figure 10.10 Bonnland data with undulating terrain: The ground height (solid line) of a hillside building can be higher than the roof top (dashed line) of a building in the valley.
Figure 10.11 Classification of Bonnland data with classes ground (gray), building (red), high vegetation (green) and low vegetation (blue).

We use a reduced version of the reconstructed 3D point cloud (cf. Sect. 10.2.2; about 8.8 million out of over 1 billion points), which is rasterized into a 2.5D point cloud with a resolution of 0.2 meters. Fig. 10.11 presents the whole Bonnland dataset and the classification result. We have defined four object classes, i.e., ground (gray), building (red), high vegetation/tree (green) and low vegetation/lawn (blue).

10.5 Scene and Building Decomposition

The parsing of complex scenes including buildings is one of the main challenges towards fully automatic city model reconstruction and the automatic reconstruction of larger urban areas. We assume that a reasonable decomposition, subdividing the whole scene as well as heterogeneous buildings into regular components, is the key to tackling this.

Fig. 10.12 shows a workflow which links the approaches for scene classification [38] and primitive-based reconstruction [42,43] by

  •  preprocessing/decomposition—a reliable decomposition of the whole scene as well as of complex buildings into regular primitives precedes model reconstruction—and
  •  post-processing/assembly of primitives—the assembly of the primitives performs a true model merging in CAD (computer aided design) style—

to complete the pipeline.

Figure 10.12 Workflow of building reconstruction with decomposition and assembly.

10.5.1 Scene Decomposition

The goal of scene decomposition is to extract individual buildings from the building mask derived by scene classification (cf. Sect. 10.4). As the separation of buildings in the building mask may be adversely affected by data as well as labeling errors (cf. Fig. 10.13, red dashed contour), a mathematical morphological “opening” operation is conducted, as shown in Fig. 10.13, to remove trivial connections for a better isolation of the buildings and to eliminate small outliers. A disk-shaped structuring element is employed with a radius of 1 meter.

Figure 10.13 The building mask is decomposed into connected components. Mathematical morphological (opening) is used for a better separation of adjacent buildings.

We detect individual buildings via connected components (CCs). The input building mask, as presented in Fig. 10.13, is segmented into rectangular tiles accordingly. Overlap between tiles is allowed to make sure that the buildings are completely included in the tiles. Please note that global coordinates and the height (for undulating areas) are tagged to each tile for the final model assembly.
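A minimal sketch of this decomposition with scikit-image is given below; the footprint parameter name assumes a recent scikit-image version, and the 1 m radius corresponds to 5 cells at the 0.2 m resolution used later.

```python
# Sketch: morphological opening with a disk-shaped structuring element,
# connected component labeling, and one slightly enlarged tile per component.
from skimage import measure, morphology

def decompose_scene(building_mask, radius_cells=5, margin=10):
    opened = morphology.binary_opening(building_mask,
                                       footprint=morphology.disk(radius_cells))
    labels = measure.label(opened, connectivity=2)
    tiles = []
    for region in measure.regionprops(labels):
        y0, x0, y1, x1 = region.bbox
        tiles.append((max(0, y0 - margin), max(0, x0 - margin),
                      min(labels.shape[0], y1 + margin),
                      min(labels.shape[1], x1 + margin)))
    return opened, labels, tiles   # tile coordinates include an overlap margin
```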

Due to the complexity of the scene and the errors of the classification, however, it cannot be guaranteed that each CC contains exactly one individual building. That is, as Fig. 10.14 shows, after decomposition, a tile may contain a single detached building, a complex building (a building consisting of multiple building components), or even multiple buildings which are closely adjacent to each other. The last case is often found in densely inhabited areas.

Figure 10.14 The results of scene decomposition may contain individual buildings, complex buildings as well as building groups consisting of adjacent buildings.

In this work we call both complex buildings and adjacent building groups “building complexes”. The further parsing of building complexes is described in the following.

10.5.2 Building Decomposition

Building decomposition works on the results of scene decomposition. It further divides building complexes into standard building components—primitives, which are employed in the following building model reconstruction. Although our approach presented in [43] allows one to model multiple building components by means of the “birth” and “death” jumps in a reversible jump Markov chain Monte Carlo (RJMCMC) framework, a preceding decomposition makes the statistical search more efficient and robust.

We employ a combined bottom-up and top-down scheme for building decomposition, which uses 3D geometry parsing based on a predefined primitive library. Conventional building decomposition is conducted bottom-up based on 2D footprints that are either already available [44,45] or derived from the data [46]. The performance of footprint-based decomposition is limited where 3D geometric information, e.g., different heights or roof types of buildings, has to be considered [47]. 3D geometric parsing, however, cannot be conducted on 3D building models, because the latter do not yet exist. The best decomposition is found by means of statistical model selection and optimization.

Please note that the decomposition and the following building modeling have to share a common construction principle to guarantee a consistent pipeline. Two basic strategies for building (footprint) decomposition can be differentiated based on one key question: whether the components of a building are allowed to overlap [48,42] or not [44,45]. Different definitions of the primitives are employed correspondingly. In [43], we have demonstrated that the first concept, allowing for overlap, fits better to generative modeling because of its flexibility and, most importantly, its potential to generate complete and plausible models.

10.5.2.1 Ridge extraction

From the 3D point cloud, we employ bottom-up methods to extract ridge lines, which can be considered as key geometric features of buildings. As shown in Fig. 10.15, we define the edges on the roof as: (1) Horizontal ridge line (red), which connects two apexes of the roof, (2) diagonal ridge line (green) connecting one apex and one eave corner, and (3) eave line (blue), which links two eave corners. The roof contour consisting of eave lines can be used as approximation of the building footprint when the roof overhang is ignored.

Figure 10.15 Ridge lines of hipped (left), gable (center) and flat (right) roofs. For flat roofs, central lines (dashed) are used to describe the height of the building.

For non-flat roofs, ridge lines can be determined as the intersection lines of the individual roof planes, as also demonstrated in recent work [49,50]. Planes of a building complex are detected from the 3D point cloud by a RANSAC-based approach. All intersection lines are filtered by means of the “Relation Matrix” [51] to separate ridge lines from other intersection lines as well as to determine the type of the ridge lines, i.e., horizontal or diagonal ridges. “Central lines” are used for flat roofs, which do not have any ridge. In comparison with building skeletons, central lines of roofs are 3D. That is, as shown in Fig. 10.15 (right), they have a height value and in the case of adjacent flat roofs they can differentiate multiple building components by means of their heights.
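The intersection of two robustly determined roof planes can be sketched as follows, assuming the planes are given in Hessian normal form n · x = d as produced by the RANSAC plane detection.

```python
# Sketch: the direction of the intersection line is the cross product of the
# plane normals; a point on the line is obtained as a least-squares solution
# of the two plane equations.
import numpy as np

def plane_intersection(n1, d1, n2, d2):
    """Return (point, direction) of the intersection line of two planes."""
    direction = np.cross(n1, n2)
    if np.linalg.norm(direction) < 1e-9:
        raise ValueError("planes are (nearly) parallel")
    A = np.vstack([n1, n2])
    point, *_ = np.linalg.lstsq(A, np.array([d1, d2]), rcond=None)
    return point, direction / np.linalg.norm(direction)
```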

Ridge lines have the following advantages for the decomposition of complicated buildings:

  •  Full 3D information: Different heights can be used to distinguish different building components even though they have an identical width (cf. Fig. 10.15, bottom).
  •  High accuracy and reliability: The ridge lines are calculated by the intersection of robustly determined planes, i.e., they are actually derived from all inliers of the involved planes.
  •  The horizontal ridges consist of straight line segments, which can be regarded as ridge lines of individual primitives and, thus, can guide the decomposition.

10.5.2.2 Primitive-based building decomposition

A statistical primitive-based decomposition guided by the ridges is illustrated by Fig. 10.16. The detected horizontal ridge lines (Fig. 10.16, center, red) consist of straight line segments (Fig. 10.16, right, bold). The primitives selected from a predefined library [52] have a rectangular contour as well as a single straight horizontal ridge line (except for flat roof and shed roof). The end points of the segments are determined by intersection with diagonal ridges or the boundary of the building mask.

Figure 10.16 Building decomposition: Determination of a combination of primitives (right) derived from ridge lines (center, red) to represent the underlying model (left).

The horizontal ridges guide the decomposition: The most appropriate primitives (cf. Fig. 10.16, right) are statistically selected with the goals to fit (1) to the already extracted diagonal ridges and (2) to the rest of the edges, without conflicts with the known planes and the boundary of the building mask.

The building decomposition does not deal with actual building models, because they do not yet exist. The goal of the decomposition is to find an optimal combination of primitives to approximate the underlying model (Fig. 10.16, left). That is, the decomposition determines the number and types of primitives as well as the way of their combination (Fig. 10.16, right). Additionally, building decomposition also results in initial values of parameters for the primitives (cf. Sect. 10.6.1).

10.6 Building Modeling

10.6.1 Primitive Selection and Optimization

We propose a statistical building modeling based on generative primitive models [43]. The primitives are defined with parameters (cf. Fig. 10.17):

θ ∈ Θ;  Θ = {P, C, S},    (10.1)

where the parameter space Θ consists of position parameters P = {x, y, azimuth}, contour parameters C = {length, width} (rectangular footprint), and shape parameters S: ridge/eave height and the depth parameters of the hips.

Figure 10.17 Parameters of a primitive for generative modeling.

The maximum a posteriori (MAP) estimate of Θ is employed to find the optimal model fitting the data:

Θ̂_MAP = argmax_Θ {L(D|Θ) p(Θ) / P(D)} = argmax_Θ {L(D|Θ) p(Θ)},    (10.2)

where L(D|Θ) is the likelihood function representing the goodness of fit of the model to the data D and p(Θ) is the prior for Θ, which is derived from empirical knowledge and incrementally improved during the reconstruction. That is, the parameter values of already found building components or of adjacent buildings are used to update the priors. P(D) is the marginal probability, which is regarded as constant in the optimization as it does not depend on Θ.

Reversible jump Markov chain Monte Carlo (RJMCMC) is used for the statistical search of the parameters, allowing for an efficient exploration of the high-dimensional search space (its dimension is determined by the number of parameters). The reversible jumps allow one to switch between different search spaces, i.e., different types of primitives. Model selection is integrated into the transition kernel of the reversible jumps to guide the search for the optimal primitive type. Multiple hypothetical models are generated via statistical sampling of the primitive type as well as the corresponding parameters. The final model is the verified candidate with the best goodness of fit to the data.
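The following is a drastically simplified sketch of such a search: a Metropolis-Hastings chain over the primitive parameters with occasional jumps between primitive types. The dimension-matching and Jacobian terms of proper reversible jumps are omitted, and the score function, which stands for L(D|Θ) p(Θ), must be supplied by the caller.

```python
# Simplified sketch of the statistical search over primitive type and
# parameters; not the full RJMCMC formulation used in the chapter.
import numpy as np

def mh_search(score, init_type, init_params, primitive_types,
              n_iter=5000, step=0.1, jump_prob=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    ptype, params = init_type, np.asarray(init_params, dtype=float)
    current = score(ptype, params)
    best = (current, ptype, params.copy())
    for _ in range(n_iter):
        if rng.random() < jump_prob:
            # "jump": propose a different primitive type, reuse shared params
            new_type = rng.choice([t for t in primitive_types if t != ptype])
            new_params = params.copy()
        else:
            # "diffusion": perturb the parameters of the current primitive
            new_type = ptype
            new_params = params + rng.normal(scale=step, size=params.shape)
        proposal = score(new_type, new_params)
        if rng.random() < min(1.0, proposal / max(current, 1e-12)):
            ptype, params, current = new_type, new_params, proposal
            if current > best[0]:
                best = (current, ptype, params.copy())
    return best   # (best score, primitive type, parameters)
```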

As mentioned above in Sect. 10.5.2.2, certain initial values of the parameters can be derived from the primitive-based building decomposition. The following parameters can be determined directly from nothing but the horizontal ridges (cf. Sect. 10.5.2.1):

  •  Coordinates of the centroid (x and y) from the center of the ridge line.
  •  Orientation (azimuth): corresponds to the orientation of the ridge line.
  •  Ridge height (z_2) of the roof.

Additional parameters can be determined by taking the building footprints into account, which either stem from given building masks or can be derived from the diagonal ridges (cf. Sect. 10.5.2.1):

  •  Length and width derived from orientation and footprint.
  •  Depths of the hips (hip_l1, hip_l2, hip_d1, and hip_d2), i.e., the longitudinal and radial distances from the end points of the horizontal ridge to the boundary of the building mask.

Known initial values can significantly improve the performance of the statistical search. They are, to a certain extent, reliable and specific, as the ridge lines can be precisely determined (cf. Sect. 10.5.2.1).

Fig. 10.18 presents the reconstruction result of the running example Building 1 (top) and demonstrates the robustness of the proposed modeling approach against data flaws (bottom). 3D point clouds from image matching may contain flaws because of the quality and coverage of the images, homogeneous texture of the objects (i.e., points cannot be matched), occlusion, etc. They lead not only to false colors or incorrect positions of points, but also to gaps (missing points) in the objects. While conventional bottom-up methods may encounter insurmountable difficulties in this case, resulting in irregular and/or incomplete building components, the proposed method guarantees plausible results despite such flaws.

Figure 10.18 Reconstruction of Building 1 (top) and robust reconstruction despite flaws (bottom): Input point clouds (left), detected primitives shown as wire-frames (center), and final building models (right).

10.6.2 Primitive Assembly

If a building consists of multiple primitives, the reconstructed primitives are assembled into a single model in two consecutive steps: (1) Joint parametric adjustment and (2) geometrical model merging.

The “joint parametric adjustment” helps to remove trivial conflicts between primitives and compensates for small deviations (cf. Fig. 10.19, top left), which can occur during the reconstruction driven by a stochastic process. In the adjustment, the change of each side of a primitive is proportional to its size, i.e., footprint area. The parameters of all building components are jointly adjusted using two rules [43]:

  •  Rule 1: The intersection angles of the primitives are jointly regularized to 0° or 90° if the deviation is less than a given threshold.
  •  Rule 2: Heights of flat roofs or ridge and eave heights of other roofs are adjusted to one value if the deviation is less than a given threshold,

where the thresholds are determined according to data resolution and quality.
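A minimal sketch of the two rules is given below; the thresholds are illustrative and the weighting of the changes by footprint area mentioned above is omitted.

```python
# Sketch: snap intersection angles close to 0 or 90 degrees and merge
# heights whose mutual deviation is below a tolerance.
import numpy as np

def snap_angle(angle_deg, threshold_deg=5.0):
    """Rule 1: regularize an intersection angle to 0 or 90 degrees."""
    for target in (0.0, 90.0, 180.0):
        if abs(angle_deg - target) < threshold_deg:
            return target % 180.0
    return angle_deg

def snap_heights(heights, threshold=0.3):
    """Rule 2: merge heights whose mutual deviation is below the threshold."""
    heights = np.asarray(heights, dtype=float)
    if heights.max() - heights.min() < threshold:
        return np.full_like(heights, heights.mean())
    return heights
```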

Figure 10.19 Primitive assembly from individual primitives (left) to a watertight CSG model (right).

The parameter adjustment, however, cannot deal with all mismatching positions of the primitives. The mismatching is the result of deviations caused by stochastic processes and the uncertainty of the data, which in principle cannot be corrected in this step. Therefore, a further geometrical adjustment is required. “Geometrical model merging” generates the final single model of a building complex. Similar to [43], we conduct a simple vertex-shifting to correct the geometrical mismatching and all the primitives are matched to each other (Fig. 10.19, bottom left). The primitives are originally generated as Boundary Representation (B-Rep) models, as shown in Fig. 10.19 (center), and simply placed together as separate models which overlap. Although in the rendered model (left) the intersecting part is hidden and does not affect the appearance, the model is ontologically not a single “subject” and geometrically not watertight. To fix this, our model merging employs Constructive Solid Geometry (CSG) modeling. The B-Rep primitives are first converted into CSG models and merged with a “union” operation into a single solid body (cf. Fig. 10.19, right). The latter is then converted back to a single and watertight B-Rep model, i.e., the final model.

10.6.3 LoD2 Models

We have performed experiments on the Bonnland dataset (cf. Sect. 10.4), representing a complete and typical central European village with a mixture of detached buildings and building complexes, a church, as well as a small castle on a hill. The 3D point cloud has been reconstructed from UAV imagery [33] and covers about 0.12 square kilometers of undulating terrain. We use a reduced and rasterized version with a resolution of 0.2 meters.

The building mask provided by scene classification (cf. Sect. 10.4) is decomposed into 62 connected components and corresponding data tiles (cf. Fig. 10.14), which are processed in parallel. The reconstructed building models are assembled in the global coordinate system. Bird's-eye views of input point cloud (top) and the reconstructed model (bottom) are presented in Fig. 10.20. Along with the watertight building models a mesh model is generated from the non-building points to model the ground.

Figure 10.20 Building reconstruction for Bonnland: Input point cloud (top) and reconstructed models (bottom).

The runtime of the building modeling for 33 detached buildings and 29 building complexes, which consist of 112 primitives, is about 14 minutes on a laptop with a four-core/eight-thread CPU at 2.3 GHz. Except for buildings with large flaws in the data (cf. Fig. 10.18, bottom) or occlusions, where the average deviation does not reflect the reconstruction accuracy, the reconstruction error of the majority of the buildings is less than half of the resolution, i.e., 10 centimeters.

10.6.4 Detection of Facade Elements

As the original input to the described workflow consists of images and because semantic analysis by Convolutional Neural Networks (ConvNets) has made significant progress in recent years, it seems promising to use this technique for the detection of facade elements. To reduce the necessary computational complexity when using a ConvNet, we restrict the image regions to the facades.

The building primitives define planar elements for roofs and facades. Once the optimum primitives have been determined, the facade planes can be derived in the form of polygons defined by vertices. The corresponding regions of a facade can then be extracted from the images and projected via a planar homography onto the same virtual fronto-parallel plane. Assuming that the facade including all elements, such as windows and doors, is almost planar, the projections from all images should have a similar position on the virtual plane. This reduces the search space for our ConvNet to a limited two-dimensional space. Fig. 10.21 shows the projections of three different views of one facade onto its corresponding virtual plane.
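A minimal OpenCV sketch of this rectification is given below, assuming the four facade corners projected into the image as well as the metric facade extent are known from the LoD2 primitive and the camera pose; the pixel density of the virtual plane is illustrative.

```python
# Sketch: map the projected facade corners to a metric rectangle via a planar
# homography and warp the image region onto the virtual fronto-parallel plane.
import cv2
import numpy as np

def rectify_facade(image, corners_px, facade_w_m, facade_h_m, px_per_m=100):
    """corners_px: 4x2 image points (top-left, top-right, bottom-right,
    bottom-left) of the facade polygon."""
    w = int(round(facade_w_m * px_per_m))
    h = int(round(facade_h_m * px_per_m))
    target = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]],
                      dtype=np.float32)
    H, _ = cv2.findHomography(np.asarray(corners_px, dtype=np.float32), target)
    return cv2.warpPerspective(image, H, (w, h))
```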

Figure 10.21 Projection of different images onto the same virtual fronto-parallel plane.

An adapted ConvNet [53] is employed to detect the facade elements in the images (cf. Fig. 10.22). The network is based on AlexNet [54], which was pretrained on the ImageNet dataset [55] and is extended by a set of convolutional (Conv) and deconvolutional (DeConv) layers to achieve pixelwise classification.

Figure 10.22 Architecture of the Convolutional Neural Network (ConvNet).

Rectified facades from the eTRIMS dataset [56] were used for fine-tuning the network to classify facade images into the four classes window, door, wall, and other. The results of the ConvNet for the individual images are combined by summation of the output of the last fully connected (FC) layer and application of the softmax function to the sum. A rectangle is fitted to each window and door and, finally, the facade elements are projected back into the original 3D model, which allows one to generate the LoD3 model (cf. next section). In Fig. 10.23 the process is visualized using one of several images.
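The fusion of the per-view outputs can be sketched as follows; per-view score maps of shape (n_classes, H, W) on the common virtual plane are assumed, and the network itself is not shown.

```python
# Sketch: sum the pixelwise scores of all rectified views and apply a softmax
# to the sum to obtain the combined per-class probabilities.
import numpy as np

def fuse_views(score_maps):
    """score_maps: list of (n_classes, H, W) arrays, one per view."""
    summed = np.sum(score_maps, axis=0)
    e = np.exp(summed - summed.max(axis=0, keepdims=True))   # stable softmax
    probabilities = e / e.sum(axis=0, keepdims=True)
    return probabilities.argmax(axis=0), probabilities       # labels, probs
```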

Figure 10.23 Detection of windows and doors with a ConvNet on a planar facade projected into one image.

10.6.5 Shell Model

We propose the so-called “shell model”, a hybrid model combining elements of CSG and B-Rep, for the modeling of buildings in LoD3. The motivations for the “shell model” are:

  •  3D measurement data (from almost all acquisition technologies) reveal only the surface of the objects instead of the solid body.
  •  Data with deviations do not form a perfect surface but rather a layer with a certain thickness representing the data noise.

A shell is, therefore, considered a more reasonable and practical geometric assumption (model) compared to a simple surface or a solid body. The concept of the shell model is presented in Fig. 10.24. It consists of parallel inner and outer layers and the solid body defined by them.

Figure 10.24 Comparison of the shell model (right) with B-Rep (left) and CSG (center) models.

The shell model is specifically designed for our application, in which the images from both airborne and terrestrial cameras are fused and the reconstructed point cloud (cf. Sect. 10.3) is available for both roofs and facades. In comparison with the modeling approaches for either roof or facade data (Fig. 10.25, top), the advantages of the shell model can be summarized as follows:

  1.  Roof as well as facades can both be precisely modeled. In roof-based modeling, the walls are approximated by extruding the boundary of the roof (Fig. 10.25, top left). The model based on facades does not deal with the roof overhang (Fig. 10.25, top right).
  2.  Roof and facades are inherently assembled. No additional effort is required for the adjustment and adaption of roof and facade planes.
  3.  Windows and doors can be modeled quite naturally as openings.

Figure 10.25 Top—modeling for roof (extrusion to the ground) and facades (no overhang considered). On the bottom it is shown how eaves and windows as well as doors can be represented by the shell model.

Fig. 10.26 presents the reconstruction of Building 1 using the shell model. The shell model (center) is initially used for building detection from the input point cloud (left) with inner (red) and outer (blue) layers. The model of best fit is then assumed to be the layer (green) between them. The final model (right) is generated with integrated windows as well as door and a given thickness of the walls.

Figure 10.26 LoD3 modeling of Building 1: Input point cloud (left), shell model used for building detection (center) and the final model (right) with windows and door modeled as openings.
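
To illustrate the underlying idea, the sketch below scores a candidate plane by the fraction of points lying between its inner and outer layer, i.e., within a band of given half-thickness around the plane. The plane parameterization, the threshold, and the function name are assumptions for illustration and not the actual detection procedure.

# Minimal sketch: support of a candidate plane under the shell model,
# i.e., the fraction of points between the inner and outer layer (illustrative).
import numpy as np

def shell_support(points, normal, d, half_thickness=0.05):
    """points: (N, 3) array; plane n.x + d = 0 with unit normal n;
    half_thickness: distance of the inner/outer layer from the plane (meters)."""
    dist = points @ normal + d                  # signed point-to-plane distances
    inside = np.abs(dist) <= half_thickness     # between inner and outer layer
    return inside.mean()                        # fraction of supporting points

# Among competing hypotheses, the plane with the largest support would be kept.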

10.7 Conclusion and Future Work

We have presented an approach for 3D urban scene reconstruction and interpretation based on the fusion of airborne and terrestrial images. It is a step towards a complete and fully automatic pipeline for large-scale urban reconstruction. The main contributions of this work can be summarized as follows:

  •  Fusion of images from different platforms (terrestrial, UAV) by means of pose estimation and 3D reconstruction of the scene.
  •  Combination of color and 3D geometric information for scene classification.
  •  Primitive-based model decomposition and assembly for the parsing of complex scenes as well as buildings.
  •  ConvNets for the detection of windows and doors on facades.
  •  Automatic modeling of buildings at LoD2 and LoD3.

We are aware that many challenges remain, for instance, the reliable parsing and decomposition of building blocks in densely inhabited urban areas with an intricate neighborhood of individual but similar buildings. Many public and commercial buildings also have special shapes that cannot be represented by the introduced rectangular primitives. Concerning future work, we thus consider extending the primitives towards more flexible geometry. Further building elements such as balconies, dormers, and chimneys could be modeled with (variants of) the given primitives. Deep neural networks can be extended to fully utilize the available multimodal data (e.g., color and depth information in the scope of this paper) not only for scene classification but also for the detection of facade elements. Reference [57] marks the start of our exploration in this direction.

References

[1] A. Grün, O. Kübler, P. Agouris, eds. Automatic Extraction of Man-Made Objects from Aerial and Space Images. 1995.

[2] R. Hartley, A. Zisserman, Multiple View Geometry in Computer Vision. second edition Cambridge, UK: Cambridge University Press; 2003.

[3] M. Fischler, R. Bolles, Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Communications of the ACM 1981;24(6):381–395.

[4] D. Nistér, An efficient solution to the five-point relative pose problem, IEEE Transactions on Pattern Analysis and Machine Intelligence 2004;26(6):756–770.

[5] C. McGlone, ed. Manual of Photogrammetry. sixth edition Bethesda, USA: American Society of Photogrammetry and Remote Sensing; 2013.

[6] M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, Visual modeling with a hand-held camera, International Journal of Computer Vision 2004;59(3):207–232.

[7] M. Goesele, J. Ackermann, S. Fuhrmann, R. Klowsky, F. Langguth, P. Muecke, M. Ritz, Scene reconstruction from community photo collections, IEEE Computer 2010;43(6):48–53.

[8] S. Agarwal, N. Snavely, I. Simon, S. Seitz, R. Szeliski, Building Rome in a day, IEEE International Conference on Computer Vision. 2009:72–79.

[9] J.-M. Frahm, D. Gallup, T. Johnson, R. Raguram, C. Wu, Y.-H. Jen, E. Dunn, B. Clipp, S. Lazebnik, M. Pollefeys, Building Rome on a cloudless day, Eleventh European Conference on Computer Vision, vol. IV. 2010:368–381.

[10] J. Heinly, J. Schönberger, E. Dunn, J.-M. Frahm, Reconstructing the world in six days (as captured by the Yahoo 100 million image dataset), IEEE Conference on Computer Vision and Pattern Recognition. 2015:3287–3295.

[11] J. Bartelsen, H. Mayer, H. Hirschmüller, A. Kuhn, M. Michelini, Orientation and dense reconstruction from unordered wide baseline image sets, Photogrammetrie—Fernerkundung—Geoinformation 2012;4(12):421–432.

[12] H. Mayer, J. Bartelsen, H. Hirschmüller, A. Kuhn, Dense 3D reconstruction from wide baseline image sets, F. Dellaert, J.-M. Frahm, M. Pollefeys, L. Leal-Taixé, B. Rosenhahn, eds. Outdoor and Large-Scale Real-World Scene Analysis. Berlin, Heidelberg: Springer Berlin Heidelberg; 2012:285–304.

[13] M. Michelini, H. Mayer, Efficient wide baseline structure from motion, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2016;III-3:99–106.

[14] M. Michelini, Automatische Kameraposeschätzung für komplexe Bildmengen. [Dissertation] Universität der Bundeswehr München; 2018.

[15] W. Förstner, On the geometric precision of digital correlation, International Archives of Photogrammetry and Remote Sensing 1982;24(3):176–189.

[16] A. Grün, Adaptive least squares correlation: a powerful image matching technique, South African Journal of Photogrammetry, Remote Sensing and Cartography 1985;14(3):175–187.

[17] D. Lowe, Distinctive image features from scale-invariant keypoints, International Journal of Computer Vision 2004;60(2):91–110.

[18] W. Förstner, B. Wrobel, eds. Photogrammetric Computer Vision—Statistics, Geometry, Orientation and Reconstruction. Berlin, Germany: Springer-Verlag; 2016.

[19] L. Roth, A. Kuhn, H. Mayer, Wide-baseline image matching with projective view synthesis and calibrated geometric verification, PFG—Journal of Photogrammetry, Remote Sensing and Geoinformation Science 2017;85(2):85–95.

[20] A. Kuhn, H. Hirschmüller, D. Scharstein, H. Mayer, A TV prior for high-quality scalable multi-view stereo reconstruction, International Journal of Computer Vision 2017;124(1):2–17.

[21] H. Mayer, Efficient hierarchical triplet merging for camera pose estimation, German Conference on Pattern Recognition. Berlin, Germany: Springer-Verlag; 2014:399–409.

[22] M. Levandowsky, D. Winter, Distance between sets, Nature 1971;234(5):34–35.

[23] C. Beder, R. Steffen, Determining an initial image pair for fixing the scale of a 3D reconstruction from an image sequence, Pattern Recognition. DAGM 2006. Berlin, Germany: Springer-Verlag; 2006:657–666.

[24] M. Michelini, H. Mayer, Detection of critical camera configurations for structure from motion, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2014;XL-3/W1:73–78.

[25] G. Lin, G. Xue, On the terminal Steiner tree problem, Information Processing Letters 2002;84(2):103–107.

[26] Y. Chen, An improved approximation algorithm for the terminal Steiner tree problem, Computational Science and Its Applications. ICCSA 2011. Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag; 2011:141–151.

[27] B. Curless, M. Levoy, A volumetric method for building complex models from range images, SIGGRAPH, ACM Transactions on Graphics. 1996.

[28] M. Goesele, B. Curless, S. Seitz, Multi-view stereo revisited, IEEE Conference on Computer Vision and Pattern Recognition, vol. 2. 2006:2402–2409.

[29] R. Sagawa, K. Nishino, K. Ikeuchi, Adaptively merging large-scale range data with reflectance properties, IEEE Transactions on Pattern Analysis and Machine Intelligence 2005;27(3):392–405.

[30] S. Fuhrmann, M. Goesele, Fusion of depth maps with multiple scales, SIGGRAPH Asia, ACM Transactions on Graphics. 2011.

[31] P. Merrell, A. Akbarzadeh, L. Wang, P. Mordohai, J.-M. Frahm, R. Yang, D. Nistér, M. Pollefeys, Real-time visibility-based fusion of depth maps, IEEE Conference on Computer Vision and Pattern Recognition. 2007:1–8.

[32] H. Hirschmüller, Stereo processing by semiglobal matching and mutual information, IEEE Transactions on Pattern Analysis and Machine Intelligence 2008;30(2):328–341.

[33] A. Kuhn, H. Mayer, H. Hirschmüller, D. Scharstein, A TV prior for high-quality local multi-view stereo reconstruction, 2nd International Conference on 3D Vision. 3DV. 2014:65–72.

[34] D. Scharstein, H. Hirschmüller, Y. Kitajima, G. Krathwohl, N. Nesic, X. Wang, P. Westling, High-resolution stereo datasets with subpixel-accurate ground truth, German Conference on Pattern Recognition. Springer; 2014.

[35] A. Kuhn, H. Hirschmüller, H. Mayer, Multi-resolution range data fusion for multi-view stereo reconstruction, German Conference on Pattern Recognition. Berlin, Germany: Springer-Verlag; 2013:41–50.

[36] N. Molton, M. Brady, Practical structure and motion from stereo when motion is unconstrained, International Journal of Computer Vision 2000;39(1):5–23.

[37] A. Kuhn, H. Mayer, Incremental division of very large point clouds for scalable 3D surface reconstruction, IEEE International Conference on Computer Vision Workshops. ICCVW. 2015:157–165.

[38] H. Huang, H. Mayer, Robust and efficient urban scene classification using relative features, 23rd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. GIS '15. New York, NY, USA: ACM; 2015:81:1–81:4.

[39] L. Breiman, Random forests, Machine Learning 2001;45(1):5–32.

[40] L. Guo, N. Chehata, C. Mallet, S. Boukir, Relevance of airborne lidar and multispectral image data for urban scene classification using random forests, ISPRS Journal of Photogrammetry and Remote Sensing 2011;66(1):56–66.

[41] H. Huang, H. Jiang, C. Brenner, H. Mayer, Object-level segmentation of RGBD data, ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2014;II-3:73–78.

[42] H. Huang, C. Brenner, M. Sester, 3D building roof reconstruction from point clouds via generative models, 19th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems. GIS, 1-4 November, ACM Press, Chicago, IL, USA. 2011:16–24.

[43] H. Huang, C. Brenner, M. Sester, A generative statistical approach to automatic 3D building roof reconstruction from laser scanning data, ISPRS Journal of Photogrammetry and Remote Sensing 2013;79:29–43.

[44] F. Lafarge, X. Descombes, J. Zerubia, M. Pierrot-Deseilligny, Structural approach for building reconstruction from a single DSM, IEEE Transactions on Pattern Analysis and Machine Intelligence 2010;32(1):135–147.

[45] M. Kada, L. McKinley, 3D building reconstruction from LiDAR based on a cell decomposition approach, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2009;38(3/W4):47–52.

[46] T. Partovi, H. Huang, T. Krauß, H. Mayer, P. Reinartz, Statistical building roof reconstruction from worldview-2 stereo imagery, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2015;XL(3/W2):161–167.

[47] H. Huang, H. Mayer, Towards automatic large-scale 3D building reconstruction: primitive decomposition and assembly, Societal Geo-Innovation, the 20th AGILE Conference on Geographic Information Science. Lecture Notes in Geoinformation and Cartography. Springer; 2017.

[48] C. Brenner, N. Haala, Erfassung von 3D Stadtmodellen, Photogrammetrie—Fernerkundung—Geoinformation 2000;2:109–117.

[49] W. Nguatem, H. Mayer, Contiguous patch segmentation in pointclouds, German Conference on Pattern Recognition. 2016:131–142.

[50] W. Nguatem, H. Mayer, Modeling urban scenes from pointclouds, International Conference on Computer Vision. 2017:3837–3846.

[51] H. Huang, C. Brenner, Rule-based roof plane detection and segmentation from laser point clouds, Joint Urban Remote Sensing Event. JURSE, 2011, 11-13 April, IEEE, Munich, Germany. 2011:293–296.

[52] H. Huang, B. Kieler, M. Sester, Urban building usage labeling by geometric and context analyses of the footprint data, 26th International Cartographic Conference. ICC. Dresden, Germany: International Cartographic Association (ICA); 2013.

[53] M. Schmitz, H. Mayer, A convolutional network for semantic facade segmentation and interpretation, The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences 2016;XLI-B3:709–715.

[54] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Advances in Neural Information Processing Systems. 2012:1097–1105.

[55] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: a large-scale hierarchical image database, IEEE Conference on Computer Vision and Pattern Recognition. 2009:248–255.

[56] F. Korč, W. Förstner, eTRIMS Image Database for Interpreting Images of Man-Made Scenes. [Technical report TR-IGG-P-2009-01] Dept. of Photogrammetry, University of Bonn; 2009. http://www.ipb.uni-bonn.de/projects/etrims_db/.

[57] W. Zhang, H. Huang, M. Schmitz, X. Sun, H. Wang, H. Mayer, Effective fusion of multi-modal remote sensing data in a fully convolutional network for semantic labeling, Remote Sensing 2018;10(1):52.
