Chapter 3

Multimodal Semantic Segmentation: Fusion of RGB and Depth Data in Convolutional Neural Networks

Zoltan Koppanyi^⁎; Dorota Iwaszczuk^†; Bing Zha^⁎; Can Jozef Saul^‡; Charles K. Toth^⁎; Alper Yilmaz^⁎
^⁎Department of Civil, Envir. & Geod Eng., The Ohio State University, Columbus, OH, United States
^†Photogrammetry & Remote Sensing, Technical University of Munich, Munich, Germany
^‡Robert College, Istanbul, Turkey

Abstract

Semantic segmentation has been an active field in computer vision and photogrammetry communities for over a decade. Pixel-level semantic labeling of images is generally achieved by assigning labels to pixels using machine learning techniques. Among others, the encoder–decoder convolutional neural networks have become the baseline approach for this problem recently. The majority of papers on this topic use only RGB images as input, despite the availability of other data sources, such as depth, which can improve segmentation and labeling. In this chapter, we investigate a number of encoder–decoder CNN architectures for semantic labeling, where the depth data is fused with the RGB data using three different approaches: (1) fusion with RGB image through color space transformation, (2) stacking depth images and RGB images, and (3) using Siamese network structures, such as FuseNet or VNet. The chapter also presents our approach to using surface normals in place of depth data. The advantage of using surface normal representation is to introduce viewpoint independency, meaning that the direction of a surface normal vector remains the same when the camera pose changes. This is a clear advantage over raw depth data where the depth value for a single scene point changes when the camera moves. The chapter provides a comprehensive analysis of three different fusion approaches using SegNet, FuseNet and VNet deep learning architectures. The analysis is conducted on both the Stanford 2D-3D-Semantics indoor dataset and aerial images from the ISPRS's Vaihingen dataset. Depth images of the Stanford dataset are acquired directly by flash LiDAR, and the ISPRS dataset depth images are generated by dense 3D reconstruction. We show that the surface normal representation better generalizes to different scenes. In our experiments, using surface normals with FuseNet achieved 5% improvement compared to using depth, and this resulted in 81.5% global accuracy on the Stanford dataset.

Keywords

Deep learning; CNN; Sensor fusion; Semantic labeling

3.1 Introduction

Semantic segmentation of an image is an easy task for humans; for instance, if asked, a child can easily delineate an object in an image. It is trivial for us, but not for computers as the high-level cognitive process of semantically segmenting an image cannot be transformed into exact mathematical expressions. Additionally, there is no strict definition of object categories or “boundaries”. To tackle these “soft” problems, various machine learning algorithms were developed for semantic segmentation in the past decades. Early studies used “descriptors” for representing object categories. Recent advances in neural networks allow for training deep learning models which “learn” these “descriptors”. These types of networks are called convolutional neural networks (CNN or ConvNet).

This chapter addresses the problem of semantic labeling using deep CNNs. We use the terms “semantic labeling”, “semantic scene understanding”, “semantic image understanding” and “segmentation and labeling” interchangeably to mean the segmentation of an image into object categories by labeling pixels semantically. Conceptually, this process involves two steps: first, the objects and their boundaries are localized in the image, and then they are labeled. As we will see later, a CNN can solve these two steps simultaneously.

Both the earlier and the more recent studies on semantic labeling use three-channel red, green, blue (RGB) images. Aside from the common adoption of the RGB channels, one can expect that additional scene observations, such as spectral bands or 3D information, can improve image segmentation and labeling. Note that any 3D data acquired from LiDAR or another sensor can be converted to a 2.5D representation, which is often a depth image. Depth images extend the RGB images by incorporating 3D information. This information helps to distinguish objects with different 3D shapes. However, finding ways to combine or fuse these modalities in the CNN framework remains an open question.

This chapter presents the concepts and CNN architectures that tackle this problem. In addition to an overview of the RGB and depth fusion networks and their comparison, we report two main contributions. First, we investigate various color space transformations for RGB and depth fusion. The idea behind this approach is to eliminate the redundancy of the RGB space by converting the image to either normalized-RGB space or HSI space, and then substituting one of the channels with depth. Second, we report results that indicate that using normals instead of depth can significantly improve the generalization ability of the networks in semantic segmentation.

The rest of the chapter is organized as follows. First, we give an overview of existing CNN structures that fuse RGB and depth images. We present the basic ideas of building up a network from a single image classifier to a Siamese network architecture, and then three approaches, namely, fusion through color space transformations, networks with four-channel inputs, and Siamese networks for RGB and depth fusion, are discussed. The “Methods” section provides the description of experiments conducted to investigate the three approaches, including datasets used, the data splitting and preprocessing, color space transformations, and the parametrization of the investigated network architectures. The investigation is carried out on the Stanford indoor dataset and the ISPRS aerial dataset. Both datasets provide depth images, however, these images are derived differently for each dataset; in particular, flash LiDAR sensor is used to acquire the Stanford dataset which directly provides depth images. For the ISPRS dataset we apply dense reconstruction to generate the depth images. In the “Results and Discussion” section, we present our results, comparison of the approaches, and discussion on our findings. The chapter ends with a conclusion and possible future research directions.

3.2 Overview

CNNs are used for problems where the input can be defined as array-like data, such as images, and there is a spatial relationship between the array elements. Three research problems are considered to be the main applications of CNNs in image analysis.

The first of these research problems is the labeling of an entire image based on its content; see Fig. 3.1A. This problem is referred to as image classification. It is well suited for interpreting the content of an entire image, and thus can be used for image indexing and image retrieval. We will see later that networks that solve single image classification problems are the basis of other networks that are designed for solving more complex segmentation problems. The second research problem is to detect and localize objects of interest by providing bounding boxes around them, and then labeling the boxes [1]; see Fig. 3.1B. It is noteworthy that these two research areas have been very active due to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1]. Finally, the third research problem, which is the focus of this chapter, is pixel-level labeling of images, where the task is to annotate all pixels of an image; see Fig. 3.1C.

Figure 3.1 Examples of research problems in image labeling and segmentation: (A) a keyword is associated with the content of an image for image classification (©Creative Commons), (B) regional convolutional neural networks often used for object detection and localization provide bounding boxes around multiple objects of interest and their labels [2–5], and (C) an example of pixel-level labeling of an indoor image [6].

Pixel-level labeling, and object detection and localization problems overlap, because annotation of all pixels eventually leads to identifying regions where the objects are located [7]. The choice of which approach should be used, however, depends on the application. Networks developed for object detection, such as the regional convolutional networks [2–5], might be more suitable for recognizing multiple objects in an image, such as chair, table, trash bin, and provide less accurate boundaries or contours of the objects. In contrast, pixel-level labeling allows for identifying larger image regions, such as ceilings and floors, and provides more accurate object boundaries [7]. In general, pixel-level labeling is better suited for remote sensing problems, such as land use classification [8–11]. Note that some remote sensing applications may require object detection, such as detecting vehicles [12], aircraft [13], or ships [14] from aerial photos.

This chapter discusses networks that solve pixel-level labeling tasks using multiple data sources, such as RGB and depth information. The most common architectures for pixel-level labeling consist of an encoder that transforms an image into a lower dimensional feature space, and a decoder that converts this feature space back to the image space [6]. The encoder part of this network architecture is often a variant of an image classification network. For this reason, in the next subsection, we briefly present some notable CNNs for image classification tasks. This discussion will be followed with an overview of encoder–decoder network structures for pixel-level labeling. Finally, we discuss the latest network architectures for fusing RGB and depth data. At the end of this section, we present the most notable datasets that might be used for developing deep models for RGB and depth fusion. Detailed discussion on various layer types, such as convolution, pooling, unpooling, ReLU layers, is not the intent of this chapter, and thus, it is not included in the remainder of the text. The reader can find detailed information on these subjects in [15, p. 321].

3.2.1 Image Classification and the VGG Network

Network architectures developed for data fusion heavily rely on the results achieved in image classification using CNNs. Image classification is a very active research field, and recent developments can be tracked from the results reported at the annual ILSVRC competition [1]. The most notable networks from this competition are LeNet-5 [16], AlexNet [17], VGG [18], GoogLeNet [19], DenseNet [20] and ResNet [21].

The most commonly used image classifier, which is also adopted in many data fusion architectures, is VGG, developed by the Visual Geometry Group at the University of Oxford. VGG is composed of blocks of 2 to 3 convolution-ReLU (Rectified Linear Unit) pairs with a max-pooling layer at the end of each block; see Fig. 3.2A. The depth, i.e. the feature channel size, of the blocks increases with the depth of the network. Finally, three fully connected layers and a soft-max function map the features into label probabilities at the tail of the VGG network. Several variations of the VGG network exist, which differ in the number of layers. For example, VGG-16 indicates that the network consists of 13 convolution-ReLU pairs and three fully connected layers, resulting in 16 weight layers; see Fig. 3.2A. The latest implementations add batch normalization and drop-out layers to the originally reported VGG network in order to improve training and the generalization of the network [22].

Figure 3.2 Deep neural network architectures: (A) VGG-16 is a single image classifier [18], (B) SegNet is used for pixel-level image segmentation and labeling using an encoder–decoder type structure [6], (C) FuseNet developed for combining RGB and depth data [26], and (D) VNet is used for remote sensing data fusion of infra-red images and digital surface models [11]. For legend, see the left bottom corner.

In VGG and other networks, the convolution is implemented as a shared convolution, not as a locally connected layer [15, p. 341], in order to handle fewer parameters, speed up the training, and use less memory. Sharing weights within one channel assumes that the same features appear at various locations of the input images. This assumption is generally valid for remote sensing and indoor images. It is worth mentioning that, since the ceiling and floor regions are located at the top and bottom of the images, using locally connected layers might provide an interesting insight and research direction for the segmentation of indoor images.

The channels in the convolutional layers can be interpreted as filters, highlighting patterns in an image that the network has learned from the training data. In the first convolution layer, these patterns, referred to as features, are associated with basic/primitive image content; for instance, corners, edges, etc. Deeper in the network, the features describe complex relationships of these basic image content; for more details see [15, p. 360].

It is noteworthy that VGG networks are not considered the best network architecture for image classification problems in terms of accuracy, the number of parameters learned, and inference time. The reader can find comparisons of various CNNs in [23]. The latest deep models that outperform VGG contain significantly more convolutional layers. Training these deeper networks is harder due to the propagation of potential numerical errors stemming from very low residuals in the training step. For this reason, ResNet utilizes residual connections that allow for training a 1000-layer deep network; see [21]. While this field is rapidly evolving, current data fusion networks typically adopt the VGG network due to its simple architecture.

3.2.2 Architectures for Pixel-level Labeling

As opposed to image classification, pixel-level labeling requires annotating the pixels of the entire image. Consequently, the output is an array of the same width and height as the input. This can be achieved by utilizing a single image classifier network and discarding the classifier tail of VGG, i.e. removing the fully connected layers and soft-max, and instead utilizing a series of unpooling operations along with additional convolutions. An unpooling operation increases the width and height of the feature maps, and the subsequent convolutions decrease the number of channels. This results in an output at the end of the network that has the original image size; see Fig. 3.2B. This concept is referred to as an encoder–decoder network, of which SegNet [6] is an example. SegNet adopts a VGG network as encoder, and mirrors the encoder for the decoder, except that the pooling layers are replaced with unpooling layers; see Fig. 3.2B. An advantage of utilizing an image classifier is that the weights trained on image classification datasets can be used for the encoder. This is a benefit because image classification datasets are typically larger, so the weights learned from them are likely to be more accurate. Adopting these weights as initial weights in the encoder part of the network is referred to as transfer learning. Here, we mention another notable network, called U-Net, which has a similar encoder–decoder structure, but the encoder and decoder features are connected, forming a U-shaped network topology [24].

The main disadvantage of encoder–decoder networks is the pooling-unpooling strategy which introduces errors at segment boundaries [6]. Extracting accurate boundaries is generally important for remote sensing applications, such as delineating small patches corresponding to buildings, trees or cars. SegNet addresses this issue by tracking the indices of max-pooling, and uses these indices during unpooling to maintain boundaries. DeepLab, a recent pixel-level labeling network, tackles the boundary problem by using atrous spatial pyramid pooling and a conditional random field [25]. Additionally, the latest DeepLab version integrates ResNet into its architecture, and thus benefits from a more advanced network structure.
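The index-tracking idea described above can be illustrated on a single feature map. The following is a minimal NumPy sketch, not the SegNet implementation: the function names are hypothetical, the pooling window is fixed to 2×2, and the input height and width are assumed to be even.

```python
import numpy as np

def max_pool_with_indices(x):
    """2x2 max pooling on an (H, W) map; also returns the argmax slot of each window."""
    h, w = x.shape
    # Rearrange so that each 2x2 window becomes a row of 4 values.
    windows = x.reshape(h // 2, 2, w // 2, 2).transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    idx = windows.argmax(axis=-1)
    pooled = np.take_along_axis(windows, idx[..., None], axis=-1)[..., 0]
    return pooled, idx

def max_unpool(pooled, idx):
    """Put each pooled value back at its recorded slot; all other cells stay zero."""
    h2, w2 = pooled.shape
    windows = np.zeros((h2, w2, 4), dtype=pooled.dtype)
    np.put_along_axis(windows, idx[..., None], pooled[..., None], axis=-1)
    return windows.reshape(h2, w2, 2, 2).transpose(0, 2, 1, 3).reshape(h2 * 2, w2 * 2)

x = np.array([[1, 2, 0, 0],
              [3, 4, 0, 5],
              [0, 0, 9, 0],
              [0, 6, 0, 0]])
pooled, idx = max_pool_with_indices(x)
restored = max_unpool(pooled, idx)
print(pooled)
print(restored)
```

Because the maxima return to their original pixel positions rather than to a fixed corner of each window, the upsampled map keeps the location of strong responses, which is what helps preserve segment boundaries.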

3.2.3 Architectures for RGB and Depth Fusion

It is reasonable to expect that using additional information about the scene, such as depth, would improve the segmentation and labeling tasks. In this chapter, our focus is on end-to-end deep learning models that fuse RGB and depth data using CNNs; hence we investigate three different approaches:

• Approach 1: The depth image can be incorporated into the network input by modifying the color space of the RGB image. In this case, one of the three image channels in the modified color space is replaced with the one-channel depth image. The main idea behind this modification is that the RGB color space carries redundant information, hence removing the redundancy does not influence scene segmentation. In our experiments, we exploit this redundancy by using the “normalized” RGB space, where one channel is redundant and can therefore be replaced with scene depth or normals, which augments the geometric information present in the image. Another color representation, namely HSI, is also investigated, where the HS channels contain the geometric scene information, and the intensity channel contains “scene” illumination. Considering that the illumination component is not descriptive of scene geometry, it can be replaced with depth or normal information. Using color transformations from RGB to the adopted color spaces allows training the existing neural networks, such as SegNet, DeepLab or U-Net [24], without any modifications. This integration has the advantage of using the same number of weights as would be used for RGB images. In addition, weights learned on large datasets for image classification can be used directly, conjecturing that transfer learning holds for this problem. Here, we investigate several color space modifications, such as when the last channel of the HSI representation is replaced by depth values or surface normals.
• Approach 2: The three-channel RGB image can be stacked together with the depth image resulting in a four-channel input. In this case, the network structure requires small changes at the head of the network and the number of learnable parameters is almost the same, which is an advantage. However, it is not clear whether the depth information can be successfully incorporated using this approach due to both the different nature of and the correlation between data sources. We investigate the SegNet network with four-channel input, where the fourth channel is the one-channel depth or the surface normal image.
• Approach 3: Siamese network structures are designed to extract features from two data sources independently, and then the features of the independent network branches can be combined at the end of certain convolutional blocks; see Figs. 3.2C and 3.2D. Clearly, these networks have more parameters to be estimated, but the data structures treat the data sources differently. We investigate two Siamese network structures, namely FuseNet [26] and VNet [11]. FuseNet utilizes two parallel VGG-16 networks as encoders for RGB and depth, see Fig. 3.2C, where the RGB encoder–decoder part is the same as SegNet, and the depth branch is identical to a VGG with one-channel input without the classifier head. The outputs of the two networks are combined with a simple addition of the two tensors from the RGB and depth branches at the end of each convolutional block. VNet is a more complex network in terms of structure; see Fig. 3.2D. While it still has two separate VGG-16 branches for RGB and depth, there is a middle branch which integrates the features generated at the end of each convolutional block of the two VGG-16 networks. This middle branch concatenates the tensors along the feature (depth) channel, doubling the length of the feature channel. The next convolutional layer reduces the length of the feature channels, and thus enables adding the three tensors, i.e. the two tensors of the VGG-16 convolutional blocks and the middle tensor, in the next step. The decoder operates on the middle sub-network. Clearly, VNet has more parameters than FuseNet. Our implementation of the network architecture presented in Fig. 3.2D is based on [11].
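The tensor operations behind Approaches 2 and 3 can be made concrete at the level of array shapes. The sketch below uses NumPy with random placeholder data; the sizes `H`, `W`, `C` are hypothetical, and the channel reduction is approximated by a 1×1 convolution written as a matrix product, which is an assumption for illustration rather than the layer used in [11].

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C = 8, 8, 64  # toy spatial size and channel count (hypothetical)

# Approach 2: stack RGB (H, W, 3) and depth (H, W, 1) into a four-channel input.
rgb = rng.random((H, W, 3))
depth = rng.random((H, W, 1))
rgbd = np.concatenate([rgb, depth], axis=-1)          # (H, W, 4)

# Approach 3, FuseNet-style: element-wise addition of the two branch outputs.
feat_rgb = rng.random((H, W, C))
feat_depth = rng.random((H, W, C))
fused_sum = feat_rgb + feat_depth                     # (H, W, C)

# Approach 3, VNet-style: concatenate along the channel axis, then reduce back to C.
fused_cat = np.concatenate([feat_rgb, feat_depth], axis=-1)   # (H, W, 2C)
w_reduce = rng.standard_normal((2 * C, C)) * 0.01     # stand-in for learned 1x1 conv weights
middle = fused_cat @ w_reduce                         # (H, W, C)
next_input = feat_rgb + feat_depth + middle           # the three tensors now have equal shape

print(rgbd.shape, fused_sum.shape, fused_cat.shape, next_input.shape)
```

Addition keeps the channel count fixed and adds no parameters, whereas concatenation doubles the channels and needs the extra reduction layer, which is one source of the additional VNet parameters.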

Furthermore, instead of depth representation, one can also derive 3D normal vector (needle) images from the depth images and use them as input. Note that surface normal vector representation of the 3D space might be more general in terms of describing objects, since it does not depend on camera pose. It is clear that the depth values change when the camera pose changes, but the normals are fixed to the local plane of the object; hence, they are view independent. This property is expected to allow for more general learning of various objects. 3D normal vectors can be represented as a three-channel image that contains the X, Y, Z components of the vector at each pixel. We also derive a one-channel input by labeling the normal vectors based on their quantized directions. This one-channel normal label image allows for simpler representation, and thus requires fewer network parameters. These two normal representations are investigated for all fusion approaches presented above.

3.2.4 Datasets and Benchmarks

Successful training of deep learning methods requires a large set of labeled segments. Generating accurate labels is labor intensive, and therefore, open datasets and benchmarks are important for developing and testing new network architectures. Table 3.1 lists the most notable benchmark datasets that contain both RGB and depth data. Currently, the Stanford 2D-3D-Semantics Dataset [27] is the largest database of images captured from real scenes along with depth data as well as ground truth labels. In the remote sensing domain, the ISPRS 2D Semantic Labeling benchmark [28] is the most recent overhead imagery dataset that contains high resolution orthophotos, images with labels of six land cover categories, digital surface models (DSM) and point clouds captured in urban areas.

Table 3.1

Overview of datasets for RGB and depth fusion; datasets include annotated images; the size of the dataset is the number of annotated images.
Dataset	Size	Description
NYUv2 [29]	1.5k	Indoor scenes, captured by Kinect v1
SUN-RGBD [30]	10k	Indoor scenes, 47 indoor and 800 object categories, captured by Kinect v2 and Intel RealSense, includes other datasets, such as NYUv2.
SceneNet RGB-D [30]	>1M	Synthetic, photorealistic indoor videos from 15k trajectories
Stanford 2D-3D-Semantics Dataset [27]	70k	6000 m² indoor area; dataset includes depths, surface normals, semantic annotations, global and metadata from a 360° camera, depth obtained by structured-light sensor
ISPRS 2D Semantic Labeling [28]	33+38	High resolution aerial images; urban scene of a village (Vaihingen) and city (Potsdam) captured from aircraft; three-channel images (infra-red, red, green), DSM derived with dense reconstruction, and LiDAR point cloud

3.3 Methods

This section presents our experimental setup, used for investigating the semantic segmentation of indoor and airborne images. First, we review the Stanford 2D-3D-Semantics and ISPRS 2D Semantic Labeling datasets, and then the data split for training and testing applied in our investigation, including data preprocessing, and the derivation of 3D normals. Next, we present our normal labeling concept that converts the 3D normal vectors into labels. The section continues by introducing the color space transformations for RGB and depth fusion. Finally, the hyper-parameters used during the network training are presented.

3.3.1 Datasets and Data Splitting

The Stanford 2D-3D-Semantics Dataset (2D-3D-S) [27] contains RGB and depth images of 6 different indoor areas. The dataset was mainly captured in office rooms and hallways, and a small part of it in a lobby and an auditorium. The dataset was acquired by a Matterport Camera system, which combines cameras and three structured-light sensors to capture 360-degree RGB and depth images. The sensor provides reconstructed 3D textured meshes and raw RGB-D images. The dataset contains annotated pixel-level labels for each image. The objects are categorized into 13 classes: ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, and clutter. The sofa class is underrepresented in the dataset, and is thus categorized as clutter in our tests.

Due to the large size of the dataset, we use Area 1 of the Stanford dataset for training, and Area 3 for testing. This strategy allows for estimating the generalization error of the networks, because the training and testing are performed in two independent scenes. The training dataset is divided into 90% training and 10% validation sets, respectively, resulting in 9294 and 1033 images. The validation set is used for evaluating the network performance at each epoch of the training. The test data, i.e. Area 3, contains 3704 images. An overview of the training, validation and test datasets can be found in Table 3.2.

Table 3.2

Overview of the datasets used for training, validation and testing.

Dataset	Training [# of images]	Validation [# of images]	Testing [# of images]
Stanford 2D-3D-S	9294 (90% of A1)	1033 (10% of A1)	3704 (100% of A3)
ISPRS Vaihingen	8255 (65%)	1886 (15%)	2523 (20%)

The ISPRS 2D Semantic Labeling Dataset is an airborne image collection, consisting of high resolution true orthophotos and corresponding digital surface models (DSMs) derived with dense image matching. The images are annotated manually into six land use and object categories, i.e. impervious surfaces, building, low vegetation, tree, car and clutter/background. The dataset was acquired in a city (Potsdam) and a village (Vaihingen). Here, the Vaihingen dataset is investigated, which contains 33 high resolution orthophoto mosaics. The images have a ground sampling distance (GSD) of 9 cm, and are false-colored using the near-infra-red, red and green spectral bands (NRG). Table 3.2 lists the data used for training, validation, and testing.

3.3.2 Preprocessing of the Stanford Dataset

This subsection presents the methods used for image resizing, depth filtering and normalization. Sample images can be seen in the first two columns of Fig. 3.3.

Figure 3.3 Different data modality representations. The first two columns are samples from the Stanford dataset, and the last two columns are samples from the ISPRS dataset. Each row presents a different modality.

Resizing. The original image size of the dataset is $1080 \times 1080$ pixels. Deep learning models use significantly smaller image sizes due to memory and computation considerations. We should note that a larger image size does not necessarily lead to more accurate results. We adopt the strategies developed by other authors [6,26], and resize the images to $224 \times 224$ pixels. Sample RGB images can be seen in the first row of Fig. 3.3.

Depth filtering. Depth images in the dataset contain missing pixels. Since missing pixels can significantly influence the training and testing process, depth interpolation is performed by applying image inpainting strategies [31,32]. This approach first calculates initial guesses of the missing depth values using local statistical analysis, and then iteratively estimates the missing depth values using the discrete cosine transform. Sample depth images can be seen in the second row of Fig. 3.3.

Data normalization. RGB values are normalized according to the input layer of the neural networks. The range of depth values depends on the scene and the maximum operating range of the sensor. For indoor scenes, depth values are typically up to a few meters. Therefore, we use a range of 0 to 10 m to normalize the data. Values larger than 10 m are truncated to this maximum value.
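A minimal sketch of this truncation and scaling is given below. The function name is hypothetical, and mapping the 0 to 10 m range linearly onto [0, 1] is an assumption; the text only states the range used for normalization.

```python
import numpy as np

def normalize_depth(depth_m, max_range_m=10.0):
    """Clip depth values (meters) to [0, max_range_m] and scale them to [0, 1]."""
    depth = np.asarray(depth_m, dtype=np.float64)
    return np.clip(depth, 0.0, max_range_m) / max_range_m

# Values beyond 10 m are truncated to the maximum before scaling.
print(normalize_depth([0.0, 2.5, 10.0, 14.2]))
```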

Calculation of normal vectors. Surface normal vectors are provided with the dataset. Examples of 3D normal images can be seen in the third row of Fig. 3.3.

3.3.3 Preprocessing of the ISPRS Dataset

This subsection presents the preprocessing steps for the ISPRS dataset. These steps include resizing and cropping of the orthophotos and the calculation of the normal vectors. Sample images can be seen in the last two columns of Fig. 3.3.

Resizing and cropping. The orthophoto mosaics from the ISPRS dataset are roughly $2000 \times 2500$ pixels. In order to reduce the images to the desired $224 \times 224$ pixel size, the images are first cropped into $448 \times 448$ tiles with an overlap of about 60%. Then the cropped areas are resized to $224 \times 224$ using nearest neighbor interpolation. The final dataset consists of 12,664 images. Sample images can be seen in the first row of Fig. 3.3.
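The cropping and resizing steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to produce the 12,664 images: the function name is hypothetical, the stride is derived from the overlap as `tile * (1 - overlap)`, and border strips that do not fit a full tile are simply dropped.

```python
import numpy as np

def tile_and_downsample(image, tile=448, overlap=0.6, out=224):
    """Cut overlapping square tiles and shrink each one with nearest-neighbor sampling."""
    stride = max(1, int(round(tile * (1.0 - overlap))))
    # Source pixel index for every output pixel (nearest-neighbor resize).
    idx = np.arange(out) * tile // out
    tiles = []
    for top in range(0, image.shape[0] - tile + 1, stride):
        for left in range(0, image.shape[1] - tile + 1, stride):
            crop = image[top:top + tile, left:left + tile]
            tiles.append(crop[idx][:, idx])
    return np.stack(tiles)

mosaic = np.zeros((2000, 2500, 3), dtype=np.uint8)  # placeholder for one orthophoto mosaic
tiles = tile_and_downsample(mosaic)
print(tiles.shape)
```

Nearest-neighbor sampling copies existing pixel values instead of averaging them, so the same routine can be applied to the label images without creating invalid class ids.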

Depth filtering. The data are already filtered for outliers, and therefore, no data preprocessing is needed. Sample depth images can be seen in the second row of Fig. 3.3.

Calculation of normal vectors. Surface normal data are not directly available from the ISPRS dataset. Therefore, surface normals are derived from the DSM using least squares plane fitting. The normal at a given point is computed as follows. First, the eight neighboring grid points ($z_{i}$, $i = 1, \ldots, 8$) around the point ($z_{c}$) are obtained, and their differences from the point are calculated, i.e. $\overline{z}_{i} = z_{i} - z_{c}$. Then the points are filtered based on the vertical differences and a predefined threshold. Given the remaining $\overline{z}_{i}$, $i = 1, \ldots, k$, $k \leqslant 8$, we formulate a constrained optimization problem:

$\hat{n} = \underset{\lVert n \rVert = 1}{\operatorname{argmin}} \; n^{\top} \left( \sum_{i = 1}^{k} \overline{z}_{i} \overline{z}_{i}^{\top} \right) n,$

(3.1)

where $n$ is the surface normal and $\hat{n}$ is its estimate. This problem can be solved with singular value decomposition; the solution $\hat{n}$ is the eigenvector of $\sum_{i = 1}^{k} \overline{z}_{i} \overline{z}_{i}^{\top}$ corresponding to the smallest eigenvalue. Sample 3D normal images can be seen in the third row of Fig. 3.3.
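A minimal NumPy sketch of Eq. (3.1) for a single interior DSM cell is shown below. Several details are assumptions made for illustration: the function name, the threshold value `max_dz`, the use of the 9 cm GSD as horizontal grid spacing, the mapping of image rows to the y axis, and the upward orientation of the result. The symmetric eigen-decomposition `np.linalg.eigh` is used in place of an explicit SVD; for this positive semi-definite matrix both yield the same minimizing direction.

```python
import numpy as np

def estimate_normal(dsm, row, col, gsd=0.09, max_dz=1.0):
    """Estimate the unit normal at an interior cell dsm[row, col] from its 8 neighbors."""
    diffs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            dz = dsm[row + dr, col + dc] - dsm[row, col]
            if abs(dz) <= max_dz:  # discard neighbors across large height jumps
                diffs.append((dc * gsd, dr * gsd, dz))
    if len(diffs) < 3:
        raise ValueError("too few neighbors left after filtering")
    diffs = np.asarray(diffs, dtype=np.float64)
    scatter = diffs.T @ diffs                   # sum of outer products, 3x3 symmetric
    eigvals, eigvecs = np.linalg.eigh(scatter)  # eigenvalues in ascending order
    normal = eigvecs[:, 0]                      # eigenvector of the smallest eigenvalue
    return normal if normal[2] >= 0 else -normal  # assumed convention: point upward

flat = np.zeros((3, 3))
print(estimate_normal(flat, 1, 1))
```

For a flat patch the result is the vertical direction, and for a tilted plane the normal leans away from the rising side.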

3.3.4 One-channel Normal Label Representation

The one-channel surface normals are derived from the 3D normals by mapping the three-channel normal vectors onto a unit sphere. This mapping can be done simply by transforming the normal unit vectors into spherical coordinates, i.e. $\theta = \tan^{-1} \left( \frac{n_{y}}{n_{x}} \right)$ and $\phi = \cos^{-1} (n_{z})$, since $r = \lVert n \rVert = \sqrt{n_{x}^{2} + n_{y}^{2} + n_{z}^{2}} = 1$. At this point, the angle values have to be associated with integer numbers (labels) in order to represent the normal values in 8-bit image format, i.e. $(\theta, \phi) \to [0, \ldots, 255]$. Here, we opted to divide the sphere at every 18 degrees, which results in 200 values (20 intervals in $\theta$ times 10 intervals in $\phi$). These values are saved as a grayscale image. This one-channel representation was calculated for both the Stanford and ISPRS datasets. Sample one-channel normal images can be seen in the fourth row of Fig. 3.3.
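The quantization can be sketched for a single normal vector as follows. The function name and the ordering of the labels (polar bin times 20 plus azimuth bin) are assumptions, since the text does not specify how the angle pairs are enumerated. The sketch also uses `atan2` rather than a plain arctangent so that the azimuth covers the full 0 to 360 degree range.

```python
import math

def normal_to_label(nx, ny, nz, step_deg=18):
    """Map a unit normal to an integer label by binning its spherical angles."""
    theta = math.degrees(math.atan2(ny, nx)) % 360.0        # azimuth in [0, 360)
    phi = math.degrees(math.acos(max(-1.0, min(1.0, nz))))  # polar angle in [0, 180]
    n_theta = 360 // step_deg                                # 20 azimuth bins
    n_phi = 180 // step_deg                                  # 10 polar bins
    t_bin = min(int(theta // step_deg), n_theta - 1)
    p_bin = min(int(phi // step_deg), n_phi - 1)             # phi = 180 goes to the last bin
    return p_bin * n_theta + t_bin                           # label in [0, 199]

print(normal_to_label(0.0, 0.0, 1.0))    # normal pointing along +z
print(normal_to_label(0.0, 0.0, -1.0))   # normal pointing along -z
```

With 18 degree steps every label fits comfortably into an 8-bit image, leaving the values from 200 to 255 unused.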

3.3.5 Color Spaces for RGB and Depth Fusion

We investigate two possibilities to fuse RGB and depth information through color space transformations. The goal is to keep the three-channel structure of RGB images, but substitute one of the channels with either the depth or normal labels.

Normalized RGB color space. Let r, g and b be the normalized RGB values, such that

$r = \frac{R}{R + G + B}; \quad g = \frac{G}{R + G + B}; \quad b = \frac{B}{R + G + B}.$

(3.2)

In this color space representation, one of the channels is redundant because it can be recovered from the other two using the condition $r + g + b = 1$. Therefore, for example, we can replace the blue channel with depth $D$ or normals (one-channel representation) $N$, normalized to $[0, 1]$. This results in the fused images rgD and rgN. Sample rgD and rgN images can be seen in the fifth and sixth rows of Fig. 3.3. Note that, in terms of energy, blue represents about 10% of the visible spectrum on average.
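The rgD construction can be sketched as follows. The text only says that depth is normalized to $[0, 1]$; the min-max scaling over the image used here, the handling of black pixels and the name `rgb_to_rgd` are assumptions.

```python
import numpy as np

def rgb_to_rgd(rgb, depth):
    """Build an rgD image: normalized r and g channels plus depth in [0, 1]."""
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1, keepdims=True)
    # Avoid division by zero for black pixels (assumed handling).
    chroma = rgb / np.where(total == 0, 1.0, total)
    d = np.asarray(depth, dtype=float)
    span = d.max() - d.min()
    # Assumed min-max normalization of depth over the image.
    d_norm = (d - d.min()) / span if span > 0 else np.zeros_like(d)
    return np.dstack([chroma[..., 0], chroma[..., 1], d_norm])
```

Passing the one-channel normal labels instead of `depth` gives the rgN variant in the same way.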

HSI color space. The second color space fusion is based on the hue, saturation and intensity (HSI) representation. In this representation, intensity $I$ is replaced with depth $D$ or normals $N$, which results in the HSD and HSN representations, respectively, where

$H = \begin{cases} \cos^{-1} \dfrac{0.5 [(r - g) + (r - b)]}{\sqrt{(r - g)^{2} + (r - b)(g - b)}}, & b \leqslant g, \\ 2\pi - \cos^{-1} \dfrac{0.5 [(r - g) + (r - b)]}{\sqrt{(r - g)^{2} + (r - b)(g - b)}}, & b > g, \end{cases}$

(3.3)

$S = 1 - 3 \min (r, g, b).$

(3.4)

Sample HSD and HSN images can be seen in the seventh and eighth rows of Fig. 3.3.
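The HSD construction can be sketched as below, using the standard HSI hue definition with the $(r - b)(g - b)$ term under the square root and the $b \leqslant g$ branch condition. Setting hue to zero for gray pixels, setting saturation to zero for black pixels, and the name `rgb_to_hsd` are assumptions made for illustration.

```python
import numpy as np

def rgb_to_hsd(rgb, depth, eps=1e-12):
    """Compute hue and saturation from normalized rgb and stack them with depth."""
    rgb = np.asarray(rgb, dtype=float)
    total = rgb.sum(axis=-1)
    safe = np.where(total == 0, 1.0, total)
    r, g, b = (rgb[..., k] / safe for k in range(3))
    num = 0.5 * ((r - g) + (r - b))
    # The radicand is non-negative in exact arithmetic; clamp rounding noise.
    den = np.sqrt(np.maximum((r - g) ** 2 + (r - b) * (g - b), 0.0))
    angle = np.arccos(np.clip(num / np.maximum(den, eps), -1.0, 1.0))
    hue = np.where(b <= g, angle, 2.0 * np.pi - angle)
    hue = np.where(den < eps, 0.0, hue)          # gray pixels: hue undefined, use 0
    sat = 1.0 - 3.0 * np.minimum(np.minimum(r, g), b)
    sat = np.where(total == 0, 0.0, sat)         # black pixels: assumed S = 0
    return np.stack([hue, sat, np.asarray(depth, dtype=float)], axis=-1)
```

Hue is returned in radians in $[0, 2\pi)$; any rescaling to an 8-bit range before feeding the network is left out of this sketch.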

Finally, we investigate a representation which is a transformation of the HSD or HSN representation back to the RGB space. Therefore, we calculate

$(R_{d}, G_{d}, B_{d}) = \begin{cases} (b, c, a), \; h = H, & \text{for } H < \frac{2}{3} \pi, \\ (a, b, c), \; h = H - \frac{2}{3} \pi, & \text{for } \frac{2}{3} \pi \leqslant H < \frac{4}{3} \pi, \\ (c, a, b), \; h = H - \frac{4}{3} \pi, & \text{for } \frac{4}{3} \pi \leqslant H < 2 \pi, \end{cases}$

(3.5)

where

$a = D (1 - S),$

(3.6)

$b = D \left(1 + \frac{S \cos (h)}{\cos (\frac{\pi}{3} - h)}\right),$

(3.7)

$c = 3 D - (a + b),$

(3.8)

for the color space modification using depth $D$; $(R_{n}, G_{n}, B_{n})$ is calculated analogously using (3.5), where

$a = N (1 - S),$

(3.9)

$b = N \left(1 + \frac{S \cos (h)}{\cos (\frac{\pi}{3} - h)}\right),$

(3.10)

$c = 3 N - (a + b),$

(3.11)

for the color space modification using normals $N$. Sample $R_{d}G_{d}B_{d}$ and $R_{n}G_{n}B_{n}$ images can be seen in the last two rows of Fig. 3.3.
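The back-transformation in (3.5) to (3.8) can be written compactly in NumPy. The sketch below assumes hue in radians and depth (or normal labels) already scaled to $[0, 1]$; the function name `hsd_to_rgb` is hypothetical.

```python
import numpy as np

def hsd_to_rgb(hue, sat, d):
    """Transform (H, S, D) back to RGB space following the sector rule of (3.5)."""
    hue = np.mod(np.asarray(hue, dtype=float), 2.0 * np.pi)
    sat = np.asarray(sat, dtype=float)
    d = np.asarray(d, dtype=float)
    third = 2.0 * np.pi / 3.0
    # Sector 0, 1 or 2 selects the row of (3.5); h is the hue within the sector.
    sector = np.minimum((hue // third).astype(int), 2)
    h = hue - sector * third
    a = d * (1.0 - sat)
    b = d * (1.0 + sat * np.cos(h) / np.cos(np.pi / 3.0 - h))
    c = 3.0 * d - (a + b)
    red = np.choose(sector, [b, a, c])      # (b, c, a), (a, b, c), (c, a, b)
    green = np.choose(sector, [c, b, a])
    blue = np.choose(sector, [a, c, b])
    return np.stack([red, green, blue], axis=-1)
```

By construction $a + b + c = 3D$, so the mean of the three output channels equals the injected depth value. Saturated pixels can produce channel values above 1, and whether the authors clip them is not stated in the text.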

3.3.6 Hyper-parameters and Training

Because the investigated networks have similar structures, the same hyper-parameters are used in the experiments for all three networks. We use pretrained weights for the VGG-16 part of the networks, which were trained on the large-scale ImageNet dataset [1]. When the network has an additional one-channel branch, such as the depth branch of FuseNet, or an additional input channel, i.e. a four-channel network, the three-channel pretrained weights of the input convolution layer of VGG-16 are averaged and used to initialize the extra channel. A stochastic gradient descent (SGD) solver and cross-entropy loss are used for training. The entire training dataset is passed through the network in each epoch. Within an epoch, the mini-batch size is set to 4 and the batch is randomly shuffled at each iteration. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The learning rate is initially 0.005 and is decreased by 10% every 30 epochs. The best model is chosen according to the performance on the validation set, measured after each epoch. We run 300 epochs for all tests, but the best loss is typically achieved between 100 and 150 epochs.
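Two of these choices are easy to misread, so a framework-independent sketch may help. It assumes the first-layer weights are stored as an array of shape (output channels, 3, kernel height, kernel width) and reads "decreased by 10%" as a multiplication by 0.9; both are interpretations, and the function names are hypothetical.

```python
import numpy as np

def extend_first_conv(weights_rgb, extra_channels=1):
    """Append extra input channels initialized with the mean of the RGB filters.

    weights_rgb has the assumed shape (out_channels, 3, kh, kw).
    """
    mean_w = weights_rgb.mean(axis=1, keepdims=True)
    return np.concatenate([weights_rgb] + [mean_w] * extra_channels, axis=1)

def depth_branch_first_conv(weights_rgb):
    """One-channel filters for a separate depth branch (mean over RGB)."""
    return weights_rgb.mean(axis=1, keepdims=True)

def learning_rate(epoch, base_lr=0.005, drop=0.10, step=30):
    """Step schedule: the rate is reduced by `drop` (10%) every `step` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // step)
```

Under this reading the learning rate after 300 epochs is still about 0.0017, roughly one third of its initial value.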

3.4 Results and Discussion

The prediction images calculated from the test set are evaluated based on the elements of the confusion matrix. ${TP}_{i}$, ${FP}_{i}$ and ${FN}_{i}$ denote the total number of true positive, false positive and false negative pixels of the $i$th label, respectively, where $i = 1, \ldots, K$ and $K$ is the number of label classes. We use three standard metrics to measure the overall performance of the networks. Global accuracy represents the percentage of all pixels that are correctly classified and is defined as

$GlobAcc = \frac{1}{N} \sum_{i = 1}^{K} {TP}_{i}$

(3.12)

and N denotes the total number of annotated pixels. The mean accuracy measures the average accuracy over all classes:

$MeanAcc = \frac{1}{K} \sum_{i = 1}^{K} \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i}} .$

(3.13)

Finally, intersection over union (IoU) is defined as the average value of the intersection of the prediction and ground truth over the union of them:

$IoU = \frac{1}{K} \sum_{i = 1}^{K} \frac{{TP}_{i}}{{TP}_{i} + {FP}_{i} + {FN}_{i}} .$

(3.14)
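The three metrics (3.12) to (3.14) follow directly from a $K \times K$ confusion matrix. The sketch below assumes that rows index the ground truth and columns the prediction, and that classes with an undefined ratio are skipped in the averages; neither convention is stated in the text.

```python
import numpy as np

def segmentation_metrics(conf):
    """Return (GlobAcc, MeanAcc, IoU) from a K x K confusion matrix.

    Assumed layout: conf[i, j] counts pixels of true class i predicted as j.
    """
    conf = np.asarray(conf, dtype=float)
    tp = np.diag(conf)
    fp = conf.sum(axis=0) - tp        # predicted as class i, labeled otherwise
    fn = conf.sum(axis=1) - tp        # labeled class i, predicted otherwise
    with np.errstate(divide="ignore", invalid="ignore"):
        per_class_acc = tp / (tp + fp)
        per_class_iou = tp / (tp + fp + fn)
    glob_acc = tp.sum() / conf.sum()
    # Classes with a zero denominator yield NaN and are left out of the mean.
    return glob_acc, np.nanmean(per_class_acc), np.nanmean(per_class_iou)
```

For the two-class matrix `[[3, 1], [2, 4]]` this gives a global accuracy of 0.7, a mean accuracy of 0.7 and a mean IoU of 15/28.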

3.4.1 Results and Discussion on the Stanford Dataset

Results for the various network and input structures are presented in Table 3.3. Fig. 3.4 shows the accuracies of each class for the different networks. The baseline solution, SegNet with RGB inputs, achieves 76.4% global accuracy in our tests. If only depth images are used in SegNet, the accuracy drops by about 5%. Fusing RGB and depth through color space transformations results in a noticeable decrease ($\sim$1–3%) in network performance.

Table 3.3

Results on the Stanford indoor dataset; Training: 90% of Area 1, 9294 images; Validation: 10% of Area 1, 1033 images; Testing: 100% of Area 3, 3704 images. *Network abbreviations: the name of each network consists of two parts: the base network, i.e. SegNet, FuseNet or VNet (see Fig. 3.2), followed by the input type. Note that, in some cases, the networks are slightly changed to be able to accept the different input size. For instance, RGBD and RGBN represent four-channel inputs. RGB+D and RGB+N indicate that the network has a separate branch for depth or normals, respectively. X_nY_nZ_n is the three-channel input of 3D surface normals.

Network*	GlobalAcc	MeanAcc	Mean IoU
SegNet RGB	76.4%	56.0%	70.8%
SegNet Depth Only	71.0%	50.8%	67.5%
SegNet HSD	73.4%	53.6%	74.2%
SegNet HSN	77.4%	57.1%	71.2%
SegNet R_dG_dB_d (HSI)	73.5%	49.7%	64.1%
SegNet R_nG_nB_n (HSI)	74.6%	51.4%	67.2%
SegNet rgD	76.1%	55.6%	71.3%
SegNet rgN	76.3%	53.4%	69.2%
SegNet RGBD	75.4%	53.6%	68.5%
SegNet RGBN	76.7%	57.4%	74.4%
SegNet RGBDN	68.7%	45.2%	58.9%
FuseNet RGB+D	76.0%	54.9%	73.7%
FuseNet RGB+N	81.5%	62.4%	76.1%
FuseNet RGB+X_nY_nZ_n	81.4%	62.4%	78.1%
VNet RGB+D	78.3%	54.2%	68.5%
VNet RGB+N	80.0%	56.1%	70.4%

Figure 3.4 Per-class accuracies for the Stanford dataset.

The four-channel variants of SegNet, i.e. SegNet RGBD and SegNet RGBN, produce results similar to those of SegNet RGB. In these cases, the information from the depth data clearly makes no contribution to the segmentation process; the network uses only the pretrained weights of SegNet RGB, and the weights associated with the depth are neglected.

The FuseNet architecture with depth input, i.e. FuseNet RGB+D, produces slightly worse results than the baseline solution in our tests. Specifically, it achieves 76.0% global accuracy, which is 0.4% less than the accuracy produced by SegNet; this difference can be considered insignificant. The mean IoU in this case is better than the baseline by $\sim$3%. Note that this 76% accuracy is similar to the result reported in [26] on the SUN RGB-D benchmark dataset. In our tests, we found that FuseNet with normals, i.e. FuseNet RGB+N, provides the best results in terms of global and mean accuracy. The $\sim$5% margin between FuseNet RGB+N and either SegNet RGB or FuseNet RGB+D is significant. The 3D normal representation provides the same global accuracy as FuseNet RGB+N, while it outperforms it by a 2% margin in the mean IoU.

Discussion. The results suggest that depth does not improve the overall accuracy for the adopted dataset. Using normals instead of depth, however, improves almost all types of RGB and depth fusion approaches by a significant margin. Since the test dataset comes from a different area (Area 3) than the training dataset (Area 1), these results indicate that the normals improve the generalization ability of the network. In other words, normals are a better general representation of indoor scenes than depth.

In order to support this hypothesis, we also evaluated the networks using data from Area 1 only. No samples were taken from Area 3 for either training or testing. These results are presented in Table 3.4. In this training scenario, the images from Area 1 are divided into 10% (1033 images) training and 90% (9294 images) validation sets. We used this extreme split in order to rely on the pretrained SegNet weights; consequently, the accuracies presented in Table 3.4 are significantly lower than in the first test scenario. Table 3.4 shows no gap, or only a slight performance gap, between depth and normal data for all approaches. FuseNet shows the best results by effectively combining RGB and depth; FuseNet outperforms SegNet by a 6% margin, in contrast to the results presented in Table 3.3. Note that similar improvements were reported by previous studies [26]. It is important to note that FuseNet is able to combine RGB and depth with a very small number of training samples. This prompts the question of whether the network actually learns the depth features, or whether the addition operator applied between the network branches is simply an efficient but deterministic way to share depth information with the VGG-16 network.

Table 3.4

Results for Area 1 only; Training: 10% of Area 1, 1033 images; Testing and validation: 90% of Area 1, 9294 images.

Channels	GlobalAcc	MeanAcc	Mean IoU
SegNet RGB	65.0%	61.8%	45.2%
SegNet rgD	58.5%	53.1%	36.9%
SegNet rgN	58.7%	53.1%	36.5%
SegNet RGBD	68.6%	58.8%	44.9%
SegNet RGBN	68.8%	59.1%	45.4%
FuseNet RGB+D	71.9%	73.0%	48.8%
FuseNet RGB+N	69.1%	58.0%	43.6%

The per-class accuracies presented in Fig. 3.4 show that each network has a different accuracy for each class. FuseNet RGB+D provides the best overall accuracy metrics, but other networks, such as SegNet HSD, perform better in a majority of classes, such as column, beam and chair. This indicates that an ensemble classifier using these networks might outperform any individual network.

3.4.2 Results and Discussion on the ISPRS Dataset

Results for the ISPRS Vaihingen dataset are listed in Table 3.5. The baseline solution is the three-channel SegNet NRG network, which achieves 85.1% global accuracy. Note that SegNet NRG is the same network as the SegNet RGB discussed in the previous section; we use the term NRG to emphasize that the ISPRS images are false-color infrared, red, green images. As expected, the accuracy drops by about 10% when only depth data is used. All network solutions produce results close to the baseline solution. SegNet NRGD, NRG+N and VNet NRG+N slightly outperform SegNet NRG in all three metrics, though the difference is not significant. The best results are provided by VNet NRG+N, which achieves 85.6% global accuracy, 50.8% mean accuracy and 60.4% mean IoU. Per-class accuracies are presented in Fig. 3.5. Here, we see that all networks perform very similarly in all classes. A slight difference can be seen in the clutter class, where SegNet NRGD outperforms the other networks.

Table 3.5

Results for the ISPRS Vaihingen dataset; Training: 65%, 8255 images; Validation: 15%, 1886 images; Testing: 20%, 2523 images. *Network abbreviations: the name of each network in the first column consists of two parts: the base network, i.e. SegNet, FuseNet or VNet (see Fig. 3.2), followed by the input type. NRG denotes the infrared, red and green bands, which is the image format of the ISPRS dataset. Note that, in some cases, the first layer of the networks is slightly changed to be able to accept the different input size. For instance, NRGD and NRGN represent four-channel inputs. NRG+D and NRG+N indicate that the network has a separate branch for depth or normals, respectively. X_nY_nZ_n is the three-channel input of 3D surface normals.

Networks*	GlobalAcc	MeanAcc	Mean IoU
SegNet NRG	85.1%	49.7%	59.3%
SegNet Depth Only	75.9%	38.9%	49.9%
SegNet HSD	84.7%	46.6%	56.4%
SegNet HSN	82.8%	45.2%	54.8%
SegNet NRGD	74.2%	39.8%	50.9%
SegNet NRGN	85.2%	48.5%	58.4%
SegNet NRGDN	84.8%	46.7%	57.1%
FuseNet NRG+D	84.4%	47.3%	57.4%
FuseNet NRG+N	85.5%	48.1%	57.9%
FuseNet NRG+X_nY_nZ_n	85.1%	47.2%	57.3%
VNet NRG+D	85.6%	48.2%	58.4%
VNet NRG+N	85.6%	50.8%	60.4%

Figure 3.5 Per-class accuracies for the ISPRS dataset.

Discussion. Overall, almost all accuracies presented in Table 3.5 are within the error margin and are around 85%. Using normals yields only an insignificant improvement because almost all surface normals point in the same direction (upward).

Table 3.5 shows a very slight improvement of VNet over FuseNet, while FuseNet NRG+D does not reach the accuracy of the baseline SegNet architecture. It is noteworthy that the global accuracies reported in [11] on the ISPRS Vaihingen dataset using either SegNet or FuseNet are around 91%. This relatively large performance gap between Table 3.5 and [11] is potentially due to both the difference in evaluation technique and the random selection of images in the training process.

Although SegNet NRG has results similar to those of VNet or FuseNet, there are examples where data fusion using depth outperforms SegNet; see an example in Fig. 3.6. The red rectangle marks a section where SegNet NRG does not detect a building because the color of the building roof is very similar to that of the road; see Fig. 3.6D, and Fig. 3.6C for the ground truth. Unlike SegNet, FuseNet NRG+D partially segments that building (see Fig. 3.6E) due to the depth difference between the roof and the road surface in the depth image presented in Fig. 3.6B. Also, note that the depth image is not sharp around the building contours, and the prediction image of FuseNet NRG+D follows this irregularity; see Figs. 3.6B and 3.6E. This indicates that the quality of the depth image is important for obtaining reliable segment contours.

Figure 3.6 (A) NRG image, (B) depth, (C) ground truth, (D) SegNet NRG, (E) FuseNet NRG+D, (F) FuseNet NRG+N; Label colors: white – impervious surface; red – building; light green – low vegetation, dark green – tree, blue – car, black – background.

Using normals instead of depth as input also does not improve the global accuracy significantly. Most object categories, such as grass or roads, are horizontal; therefore, their normals point upwards and contain no distinctive information about the object class itself. In fact, after analyzing the confusion matrices, we found that FuseNet NRG+N slightly decreased (by 0.7%) the false positives of the building class and improved the true positives for the impervious surface class by the same ratio. This indicates that the network better differentiates between buildings and roads. The reason for this is that some roofs in the dataset are not flat, such as gable or hipped roofs. In these cases, normals have a distinguishable direction that guides the network to correctly identify some buildings. Two examples can be seen in the first two rows of Fig. 3.7. The figures demonstrate that FuseNet NRG+N produces sharper object edges than a single SegNet NRG classifier. Finally, the last row in Fig. 3.7 presents a case where the NRG-based network does not correctly classify an object according to the ground truth; see the object boxed by the red rectangle. The ground truth and FuseNet NRG+N mark that object as a car, and SegNet marks it as a building. This example demonstrates that normal or depth data has a semantic impact on interpreting a patch.

Figure 3.7 (A) NRG image, (B) 3D normals, (C) ground truth labels, (D) SegNet NRG, (E) FuseNet NRG+N; Label colors: white – impervious surface; red – building; light green – low vegetation, dark green – tree, blue – car, black – background.

3.5 Conclusion

The chapter reviewed the current end-to-end learning approaches to combining RGB and depth for image segmentation and labeling. We conducted tests with the SegNet, FuseNet and VNet architectures using various inputs, including RGB, depth, normal labels, HSD, HSN and 3D surface normals. Networks were trained on indoor and remote sensing images. We reported results similar to those of other authors in terms of global accuracy: around 78% on the Stanford dataset and 85% on the ISPRS dataset. For indoor scenes, when the training and testing datasets were captured from different areas, we noticed a 1–5% improvement when using normal labels instead of depth in almost all fusion approaches. This indicates that surface normals are a more general representation of the scene, which can be exploited in neural networks trained for semantic segmentation of indoor images. The best global accuracy, 81.5% on the Stanford indoor dataset, was achieved by FuseNet RGB+N. Similar performance gaps between networks using depth or normals were not observed for aerial images due to differences between indoor and aerial image scenes.

In summary, the accuracy provided by the simpler SegNet RGB network is close to that of both FuseNet and VNet, indicating that there is still room for improvement, especially for remote sensing data. Future models might consider more recent network designs, such as the residual connections used in ResNet or the conditional random field implemented in DeepLab. It is also expected that future networks will become deeper. Directly incorporating some depth knowledge into the network, for instance through directed dropouts of weights in the convolution layer, might be an interesting research direction. For high-resolution images, image segmentation at various resolutions (image pyramid) might increase the global consistency of the inference.

Augmenting Siamese networks by adding tensors, as in FuseNet or VNet, clearly improves results and is a simple way to fuse RGB and depth data. These networks outperform the original SegNet network on indoor datasets. In the future, more powerful neural network architectures are expected to be developed, and these can then be adopted for these two modalities.
