Zoltan Koppanyi⁎; Dorota Iwaszczuk†; Bing Zha⁎; Can Jozef Saul‡; Charles K. Toth⁎; Alper Yilmaz⁎ ⁎Department of Civil, Envir. & Geod Eng., The Ohio State University, Columbus, OH, United States
†Photogrammetry & Remote Sensing, Technical University of Munich, Munich, Germany
‡Robert College, Istanbul, Turkey
Semantic segmentation has been an active field in the computer vision and photogrammetry communities for over a decade. Pixel-level semantic labeling of images is generally achieved by assigning labels to pixels using machine learning techniques. Among these techniques, encoder–decoder convolutional neural networks (CNNs) have recently become the baseline approach for this problem. The majority of papers on this topic use only RGB images as input, despite the availability of other data sources, such as depth, which can improve segmentation and labeling. In this chapter, we investigate a number of encoder–decoder CNN architectures for semantic labeling, where the depth data is fused with the RGB data using three different approaches: (1) fusion with the RGB image through color space transformation, (2) stacking depth images and RGB images, and (3) using Siamese network structures, such as FuseNet or VNet. The chapter also presents our approach to using surface normals in place of depth data. The advantage of the surface normal representation is viewpoint independence: the direction of a surface normal vector remains the same when the camera pose changes. This is a clear advantage over raw depth data, where the depth value for a single scene point changes when the camera moves. The chapter provides a comprehensive analysis of the three fusion approaches using the SegNet, FuseNet and VNet deep learning architectures. The analysis is conducted on both the Stanford 2D-3D-Semantics indoor dataset and aerial images from the ISPRS Vaihingen dataset. Depth images of the Stanford dataset are acquired directly by flash LiDAR, and the ISPRS dataset depth images are generated by dense 3D reconstruction. We show that the surface normal representation generalizes better to different scenes. In our experiments, using surface normals with FuseNet achieved a 5% improvement compared to using depth, resulting in 81.5% global accuracy on the Stanford dataset.
Deep learning; CNN; Sensor fusion; Semantic labeling
Semantic segmentation of an image is an easy task for humans; for instance, if asked, a child can easily delineate an object in an image. It is trivial for us, but not for computers as the high-level cognitive process of semantically segmenting an image cannot be transformed into exact mathematical expressions. Additionally, there is no strict definition of object categories or “boundaries”. To tackle these “soft” problems, various machine learning algorithms were developed for semantic segmentation in the past decades. Early studies used “descriptors” for representing object categories. Recent advances in neural networks allow for training deep learning models which “learn” these “descriptors”. These types of networks are called convolutional neural networks (CNN or ConvNet).
This chapter addresses the problem of semantic labeling using deep CNNs. We use the terms “semantic labeling”, “semantic scene understanding”, “semantic image understanding” and “segmentation and labeling” interchangeably; all of them refer to the segmentation of an image into object categories by labeling its pixels semantically. Conceptually, this process involves two steps: first, the objects and their boundaries are localized in the image, and second, they are labeled. As we will see later, a CNN can solve these two steps simultaneously.
Both the earlier and more recent studies on semantic labeling use three-channel red, green, blue (RGB) images. Aside from the common adoption of the RGB channels, one can expect that additional scene observations, such as spectral bands or 3D information, can improve image segmentation and labeling. Note that any 3D data acquired from LiDAR or other sensors can be converted to a 2.5D representation, which is often a depth image. Depth images extend the RGB images by incorporating 3D information. This information is helpful to distinguish objects with different 3D shapes. However, finding potential ways to combine or fuse these modalities in the CNN framework remains an open question.
This chapter presents the concepts and CNN architectures that tackle this problem. In addition to an overview of the RGB and depth fusion networks and their comparison, we report two main contributions. First, we investigate various color space transformations for RGB and depth fusion. The idea behind this approach is to eliminate the redundancy of the RGB space by converting the image to either normalized-RGB space or HSI space, and then substituting one of the channels with depth. Second, we report results that indicate that using normals instead of depth can significantly improve the generalization ability of the networks in semantic segmentation.
The rest of the chapter is organized as follows. First, we give an overview of existing CNN structures that fuse RGB and depth images. We present the basic ideas of building up a network from a single image classifier to a Siamese network architecture, and then three approaches, namely, fusion through color space transformations, networks with four-channel inputs, and Siamese networks for RGB and depth fusion, are discussed. The “Methods” section provides the description of experiments conducted to investigate the three approaches, including datasets used, the data splitting and preprocessing, color space transformations, and the parametrization of the investigated network architectures. The investigation is carried out on the Stanford indoor dataset and the ISPRS aerial dataset. Both datasets provide depth images, however, these images are derived differently for each dataset; in particular, flash LiDAR sensor is used to acquire the Stanford dataset which directly provides depth images. For the ISPRS dataset we apply dense reconstruction to generate the depth images. In the “Results and Discussion” section, we present our results, comparison of the approaches, and discussion on our findings. The chapter ends with a conclusion and possible future research directions.
CNNs are used for problems where the input can be defined as array-like data, such as images, and there is a spatial relationship between the array elements. Three research problems are considered to be the main applications of CNNs in image analysis.
The first of these research problems is the labeling of an entire image based on its content; see Fig. 3.1A. This problem is referred to as image classification; it is well suited for interpreting the content of an entire image, and thus can be used for image indexing and image retrieval. We will see later that networks that solve single image classification problems are the basis of other networks designed for solving more complex segmentation problems. The second research problem is to detect and localize objects of interest by providing bounding boxes around them, and then labeling the boxes [1]; see Fig. 3.1B. It is noteworthy that these two research areas have been very active due to the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [1]. Finally, the third research problem, which is the focus of this chapter, is pixel-level labeling of images, where the task is to annotate all pixels of an image; see Fig. 3.1C.
Pixel-level labeling, and object detection and localization problems overlap, because annotation of all pixels eventually leads to identifying regions where the objects are located [7]. The choice of which approach should be used, however, depends on the application. Networks developed for object detection, such as the regional convolutional networks [2–5], might be more suitable for recognizing multiple objects in an image, such as chair, table, trash bin, and provide less accurate boundaries or contours of the objects. In contrast, pixel-level labeling allows for identifying larger image regions, such as ceilings and floors, and provides more accurate object boundaries [7]. In general, pixel-level labeling is better suited for remote sensing problems, such as land use classification [8–11]. Note that some remote sensing applications may require object detection, such as detecting vehicles [12], aircraft [13], or ships [14] from aerial photos.
This chapter discusses networks that solve pixel-level labeling tasks using multiple data sources, such as RGB and depth information. The most common architectures for pixel-level labeling consist of an encoder that transforms an image into a lower dimensional feature space, and a decoder that converts this feature space back to the image space [6]. The encoder part of this network architecture is often a variant of an image classification network. For this reason, in the next subsection, we briefly present some notable CNNs for image classification tasks. This discussion will be followed with an overview of encoder–decoder network structures for pixel-level labeling. Finally, we discuss the latest network architectures for fusing RGB and depth data. At the end of this section, we present the most notable datasets that might be used for developing deep models for RGB and depth fusion. Detailed discussion on various layer types, such as convolution, pooling, unpooling, ReLU layers, is not the intent of this chapter, and thus, it is not included in the remainder of the text. The reader can find detailed information on these subjects in [15, p. 321].
Network architectures developed for data fusion heavily rely on the results achieved in image classification using CNNs. Image classification is a very active research field, and recent developments can be tracked from the results reported at the annual ILSVRC competition [1]. The most notable networks from this competition are LeNet-5 [16], AlexNet [17], VGG [18], GoogLeNet [19], DenseNet [20] and ResNet [21].
The most commonly used image classifier, which is also adopted in many data fusion architectures, is VGG, developed by the Visual Geometry Group at the University of Oxford. VGG is composed of blocks of 2 to 3 convolution-ReLU (Rectified Linear Unit) pairs and a max-pooling layer at the end of each block; see Fig. 3.2A. The number of feature channels in the blocks increases with the depth of the network. Finally, three fully connected layers and a soft-max function map the features into label probabilities at the tail of the VGG network. Several variations of the VGG network exist, which differ in the number of layers. For example, the name VGG-16 indicates that the network consists of 13 convolution-ReLU layers as well as three fully connected layers, resulting in 16 layers; see Fig. 3.2A. The latest implementations add batch normalization and drop-out layers to the originally reported VGG network in order to improve the training and the generalization of the network [22].
In VGG and other networks, the convolution is implemented as a shared convolution, not as a locally connected layer [15, p. 341], in order to handle fewer parameters, speed up the training, and use less memory. Sharing weights within one channel assumes that the same features appear at various locations of the input images. This assumption is generally valid for remote sensing and indoor images. It is worth mentioning that, since the ceiling and floor regions are located at the top and bottom of the images, using locally connected layers might provide an interesting research direction for the segmentation of indoor images.
The channels in the convolutional layers can be interpreted as filters, highlighting patterns in an image that the network has learned from the training data. In the first convolution layer, these patterns, referred to as features, are associated with basic/primitive image content; for instance, corners, edges, etc. Deeper in the network, the features describe complex relationships of these basic image content; for more details see [15, p. 360].
It is noteworthy that VGG networks are not considered the best architecture for image classification in terms of accuracy, number of learned parameters, and inference time. The reader can find comparisons of various CNNs in [23]. The latest deep models that outperform VGG contain significantly more convolutional layers. Training these deeper networks is harder due to the propagation of potential numerical errors stemming from very low residuals in the training step. For this reason, ResNet utilizes residual connections that allow for training a 1000-layer deep network; see [21]. While this field is rapidly evolving, current data fusion networks typically adopt the VGG network due to its simple architecture.
As opposed to image classification, pixel-level labeling requires annotating the pixels of the entire image. Consequently, the output is an array similar to the size of the input. This can be achieved by utilizing a single image classifier network and by discarding the classifier tail of VGG, i.e. removing the fully connected layer and soft-max, and instead, utilizing a series of unpooling operations along with additional convolutions. An unpooling operation allows for increasing the width and height of the convolutional layer and decreases the number of channels. This results in generating an output at the end of the network that has the original image size; see Fig. 3.2B. This concept is referred to as encoder–decoder network, such as SegNet [6]. SegNet adopts a VGG network as encoder, and mirrors the encoder for the decoder, except the pooling layers are replaced with unpooling layers; see Fig. 3.2B. An advantage of utilizing an image classifier is that the weights trained on image classification datasets can be used for the encoder. This can be considered a benefit as the image classification datasets are typically larger, such that the weights learned using these datasets are likely to be more accurate. Adopting these weights as initial weights in the encoder part of the network is referred to as transfer learning. Here, we mention another notable network, called U-Net, which has a similar encoder–decoder structure but the encoder and decoder features are connected forming a U-shaped network topography [24].
The main disadvantage of encoder–decoder networks is the pooling-unpooling strategy which introduces errors at segment boundaries [6]. Extracting accurate boundaries is generally important for remote sensing applications, such as delineating small patches corresponding to buildings, trees or cars. SegNet addresses this issue by tracking the indices of max-pooling, and uses these indices during unpooling to maintain boundaries. DeepLab, a recent pixel-level labeling network, tackles the boundary problem by using atrous spatial pyramid pooling and a conditional random field [25]. Additionally, the latest DeepLab version integrates ResNet into its architecture, and thus benefits from a more advanced network structure.
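The index-tracking idea used by SegNet can be illustrated with a minimal NumPy sketch. It is not the SegNet implementation: it assumes a single-channel 2D feature map, non-overlapping k × k windows, and image dimensions divisible by k, and the function names `max_pool_with_indices` and `max_unpool` are hypothetical. The point is that unpooling puts each maximum back at the location it came from, instead of spreading it over the window.

```python
import numpy as np

def max_pool_with_indices(x, k=2):
    """k x k max-pooling with stride k on a 2D array.

    Returns the pooled map and, per window, the flat position (0..k*k-1)
    of the maximum inside that window.
    """
    h, w = x.shape
    assert h % k == 0 and w % k == 0, "sketch assumes dimensions divisible by k"
    # Rearrange so that the last axis enumerates the k*k pixels of one window.
    windows = (x.reshape(h // k, k, w // k, k)
                .transpose(0, 2, 1, 3)
                .reshape(h // k, w // k, k * k))
    idx = windows.argmax(axis=2)
    pooled = np.take_along_axis(windows, idx[..., None], axis=2)[..., 0]
    return pooled, idx

def max_unpool(pooled, idx, k=2):
    """Place each pooled value back at its recorded position; other pixels are zero."""
    ph, pw = pooled.shape
    windows = np.zeros((ph, pw, k * k), dtype=pooled.dtype)
    np.put_along_axis(windows, idx[..., None], pooled[..., None], axis=2)
    # Invert the window rearrangement used in max_pool_with_indices.
    return (windows.reshape(ph, pw, k, k)
                   .transpose(0, 2, 1, 3)
                   .reshape(ph * k, pw * k))
```

In an encoder–decoder network the zeros left by the unpooling step are subsequently filled in by the decoder convolutions.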
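It is reasonable to expect that using additional information about the scene, such as depth, would improve the segmentation and labeling tasks. In this chapter, our focus is on end-to-end deep learning models that fuse RGB and depth data using CNNs; hence we investigate three different approaches: (1) fusion of depth with the RGB image through color space transformations, (2) stacking depth and RGB images into a four-channel network input, and (3) Siamese network structures, such as FuseNet or VNet, with separate branches for RGB and depth.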
Furthermore, instead of depth representation, one can also derive 3D normal vector (needle) images from the depth images and use them as input. Note that surface normal vector representation of the 3D space might be more general in terms of describing objects, since it does not depend on camera pose. It is clear that the depth values change when the camera pose changes, but the normals are fixed to the local plane of the object; hence, they are view independent. This property is expected to allow for more general learning of various objects. 3D normal vectors can be represented as a three-channel image that contains the X, Y, Z components of the vector at each pixel. We also derive a one-channel input by labeling the normal vectors based on their quantized directions. This one-channel normal label image allows for simpler representation, and thus requires fewer network parameters. These two normal representations are investigated for all fusion approaches presented above.
Successful training of deep learning methods requires a large set of labeled segments. Generating accurate labels is labor intensive, and therefore, open datasets and benchmarks are important for developing and testing new network architectures. Table 3.1 lists the most notable benchmark datasets that contain both RGB and depth data. Currently, the Stanford 2D-3D-Semantics Dataset [27] is the largest database of images captured from real scenes along with depth data as well as ground truth labels. In the remote sensing domain, the ISPRS 2D Semantic Labeling benchmark [28] is the most recent overhead imagery dataset; it contains high resolution orthophotos, images with labels of six land cover categories, digital surface models (DSMs) and point clouds captured in urban areas.
Table 3.1
Dataset | Size | Description |
---|---|---|
NYUv2 [29] | 1.5k | Indoor scenes, captured by Kinect v1 |
SUN-RGBD [30] | 10k | Indoor scenes, 47 indoor and 800 object categories, captured by Kinect v2 and Intel RealSense, includes other datasets, such as NYUv2. |
SceneNet RGB-D [30] | >1M | Synthetic, photorealistic indoor videos from 15k trajectories |
Stanford 2D-3D-Semantics Dataset [27] | 70k | 6000 m² indoor area; dataset includes depths, surface normals, semantic annotations, global and metadata from a 360° camera, depth obtained by structured-light sensor |
ISPRS 2D Semantic Labeling [28] | 33+38 | High resolution aerial images; urban scene of a village (Vaihingen) and city (Potsdam) captured from aircraft; three-channel images (infra-red, red, green), DSM derived with dense reconstruction, and LiDAR point cloud |
This section presents our experimental setup, used for investigating the semantic segmentation of indoor and airborne images. First, we review the Stanford 2D-3D-Semantics and ISPRS 2D Semantic Labeling datasets, and then the data split for training and testing applied in our investigation, including data preprocessing, and the derivation of 3D normals. Next, we present our normal labeling concept that converts the 3D normal vectors into labels. The section continues by introducing the color space transformations for RGB and depth fusion. Finally, the hyper-parameters used during the network training are presented.
The Stanford 2D-3D-Semantics Dataset (2D-3D-S) [27] contains RGB and depth images of 6 different indoor areas. The dataset was mainly captured in office rooms and hallways, with a small part captured in a lobby and an auditorium. The dataset was acquired by a Matterport Camera system, which combines cameras and three structured-light sensors to capture 360-degree RGB and depth images. The sensor provides reconstructed 3D textured meshes and raw RGB-D images. The dataset contains annotated pixel-level labels for each image. The objects are categorized into 13 classes: ceiling, floor, wall, column, beam, window, door, table, chair, bookcase, sofa, board, and clutter. The sofa class is underrepresented in the dataset, and is thus categorized as clutter in our tests.
Due to the large size of the dataset, we use Area 1 of the Stanford dataset for training, and Area 3 for testing. This strategy allows for estimating the generalization error of the networks, because the training and testing are performed in two independent scenes. The training dataset is divided into 90% training and 10% validation sets, respectively, resulting in 9294 and 1033 images. The validation set is used for evaluating the network performance at each epoch of the training. The test data, i.e. Area 3, contains 3704 images. An overview of the training, validation and test datasets can be found in Table 3.2.
Table 3.2
Overview of the datasets used for training, validation and testing.
Dataset | Training [# of images] | Validation [# of images] | Testing [# of images] |
---|---|---|---|
Stanford 2D-3D-S | 9294 (90% of A1) | 1033 (10% of A1) | 3704 (100% of A3) |
ISPRS Vaihingen | 8255 (65%) | 1886 (15%) | 2523 (20%) |
The ISPRS 2D Semantic Labeling Dataset is an airborne image collection, consisting of high resolution true orthophotos and corresponding digital surface models (DSMs) derived with dense image matching. The images are annotated manually into six land use and object categories, i.e. impervious surfaces, building, low vegetation, tree, car and clutter/background. The dataset was acquired in a city (Potsdam) and a village (Vaihingen). Here, the Vaihingen dataset is investigated, which contains 33 high resolution orthophoto mosaics. The images have a ground sampling distance (GSD) of 9 cm, and are false-colored using the near-infra-red, red and green spectral bands (NRG). Table 3.2 lists the data used for training, validation, and testing.
This subsection presents the methods used for image resizing, depth filtering and normalization. Sample images can be seen in the first two columns of Fig. 3.3.
Resizing. Original image size of the dataset is pixels. Deep learning models use significantly smaller image size due to memory and computation considerations. We should note that larger image size does not necessarily lead to more accurate result. We adopt the strategies developed by other authors [26,6], and resize the images to pixels. Sample RGB images can be seen in the first row of Fig. 3.3.
Depth filtering. Depth images in the dataset contain missing pixels. Since missing pixels can significantly influence the training and testing process, depth interpolation is performed by applying image inpainting strategies [31,32]. This approach first calculates initial guesses of the missing depth values using local statistical analysis, and then iteratively estimates the missing depth values using the discrete cosine transform. Sample depth images can be seen in the second row of Fig. 3.3.
Data normalization. RGB values are normalized according to the input layer of the neural networks. The range of depth values depends on the scene and the maximum operating range of the sensor. For indoor scenes, depth values are typically up to a few meters. Therefore, we use the range from 0 to 10 m to normalize the data. Values larger than 10 m are truncated to this maximum value.
Calculation of normal vectors. Surface normal vectors are provided with the dataset. Examples of 3D normal images can be seen in the third row of Fig. 3.3.
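The depth normalization described above can be sketched in a few lines of NumPy. The 10 m limit comes from the text; mapping the truncated values linearly to [0, 1] is an assumption, since the target range is not stated, and the function name `normalize_depth` is hypothetical.

```python
import numpy as np

def normalize_depth(depth_m, max_range_m=10.0):
    """Truncate depth (in meters) to [0, max_range_m] and scale it.

    Assumption: the normalized range is [0, 1]; the text only states that
    the 0-10 m range is used and that larger values are truncated.
    """
    depth = np.asarray(depth_m, dtype=np.float32)
    clipped = np.clip(depth, 0.0, max_range_m)  # values beyond 10 m become 10 m
    return clipped / max_range_m
```

For example, depths of 2.5 m and 25 m map to 0.25 and 1.0, respectively.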
This subsection presents the preprocessing steps for the ISPRS dataset. These steps include resizing and cropping of the orthophotos and the calculation of the normal vectors. Sample images can be seen in the last two columns of Fig. 3.3.
Resizing and cropping. The orthophoto mosaics from the ISPRS dataset are roughly pixels. In order to reduce the images to the desired pixel size, the images are first cropped into tiles, with about 60% overlap. Then the cropped areas are resized to using nearest neighbor interpolation. The final dataset consists of 12,664 images. Sample images can be seen in the first row of Fig. 3.3.
Depth filtering. The data are already filtered for outliers, and therefore, no data preprocessing is needed. Sample depth images can be seen in the second row of Fig. 3.3.
Calculation of normal vectors. Surface normal data is not directly available from the ISPRS dataset. Therefore, surface normals are derived from the DSM using least squares plane fitting. Given a point, the normal at that point is computed as follows. First, the eight neighboring grid points $\mathbf{p}_i$, $i = 1, \dots, 8$, are obtained around the point $\mathbf{p}_0$, and then their differences from the point are calculated, i.e. $\Delta\mathbf{p}_i = \mathbf{p}_i - \mathbf{p}_0$. Then the points are filtered based on the vertical differences and a predefined threshold. Given the $k \le 8$ remaining differences, stacked row-wise into the matrix $\mathbf{A} = [\Delta\mathbf{p}_1, \dots, \Delta\mathbf{p}_k]^T$, we formulate a constrained optimization problem:
$$\hat{\mathbf{n}} = \arg\min_{\mathbf{n}} \|\mathbf{A}\mathbf{n}\|^2 \quad \text{subject to} \quad \|\mathbf{n}\| = 1,$$
where $\mathbf{n}$ is the surface normal and $\hat{\mathbf{n}}$ is its estimate. This problem can be solved with singular value decomposition; the solution is the eigenvector of $\mathbf{A}^T\mathbf{A}$ corresponding to the smallest eigenvalue. Sample 3D normal images can be seen in the third row of Fig. 3.3.
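The plane-fitting step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `dsm_normal`, the `gsd` and `max_dz` parameters, the axis convention (x along columns, y along rows), the upward orientation of the result, and the vertical fallback when too few neighbors survive the filtering are all assumptions. The unit vector that minimizes the squared norm of the stacked neighbor differences times the normal is the right singular vector belonging to the smallest singular value.

```python
import numpy as np

def dsm_normal(dsm, row, col, gsd=1.0, max_dz=None):
    """Estimate the unit surface normal at an interior DSM cell (row, col).

    gsd    : grid spacing in the same units as the heights (assumed parameter)
    max_dz : optional threshold on the vertical difference used to drop
             neighbors that lie across a height discontinuity
    """
    diffs = []
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            dz = dsm[row + dr, col + dc] - dsm[row, col]
            if max_dz is not None and abs(dz) > max_dz:
                continue  # filtered out by the vertical-difference threshold
            diffs.append((dc * gsd, dr * gsd, dz))
    if len(diffs) < 2:
        # Assumption: fall back to a vertical normal if the plane is undetermined.
        return np.array([0.0, 0.0, 1.0])
    A = np.asarray(diffs, dtype=np.float64)
    # Rows of vt are right singular vectors, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(A)
    n = vt[-1]
    return n if n[2] >= 0 else -n  # orient the normal upward
```

Applying the function to every interior cell yields the three-channel normal image; a vectorized implementation would be preferable for full-size orthophoto tiles.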
The one-channel surface normals are derived from the 3D normals by mapping the three-channel normal vectors onto a unit sphere. This mapping can be done simply by transforming the unit normal vectors $\mathbf{n} = (n_x, n_y, n_z)$ into spherical coordinates, i.e. $\theta = \arccos(n_z)$ and $\varphi = \operatorname{atan2}(n_y, n_x)$, since $\|\mathbf{n}\| = 1$. At this point, the angle values have to be associated with integer numbers (labels) in order to represent the normal values in 8-bit image format, i.e. $[0, 255]$. Here, we opted to divide the sphere at every 18 degrees, which results in 200 values (20 azimuth bins times 10 polar bins). These values are saved as a gray image. This one-channel representation was calculated for both the Stanford and ISPRS datasets. Sample one-channel normal images can be seen in the fourth row of Fig. 3.3.
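A possible implementation of this labeling is sketched below. The 18-degree step and the resulting 200 labels follow the text; the angle conventions (polar angle measured from the Z axis, azimuth from the X axis) and the ordering of the labels (polar bin times 20 plus azimuth bin) are assumptions made for illustration, and `normal_labels` is a hypothetical name.

```python
import numpy as np

def normal_labels(normals, step_deg=18.0):
    """Map normal vectors (..., 3) to integer labels in [0, 199] stored as uint8."""
    n = np.asarray(normals, dtype=np.float64)
    n = n / np.linalg.norm(n, axis=-1, keepdims=True)           # enforce unit length
    theta = np.degrees(np.arccos(np.clip(n[..., 2], -1.0, 1.0)))  # polar, [0, 180]
    phi = np.degrees(np.arctan2(n[..., 1], n[..., 0])) % 360.0    # azimuth, [0, 360)
    n_phi = int(round(360.0 / step_deg))    # 20 azimuth bins
    n_theta = int(round(180.0 / step_deg))  # 10 polar bins
    # Clamp so that the closed upper ends (theta = 180, phi rounding to 360)
    # fall into the last bin instead of creating an extra one.
    theta_bin = np.minimum((theta // step_deg).astype(np.int64), n_theta - 1)
    phi_bin = np.minimum((phi // step_deg).astype(np.int64), n_phi - 1)
    return (theta_bin * n_phi + phi_bin).astype(np.uint8)
```

Since all labels are below 200, they fit in a single 8-bit channel and can be written directly as a grayscale image.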
We investigate two possibilities to fuse RGB and depth information through color space transformations. The goal is to keep the three-channel structure of RGB images, but substitute one of the channels with either the depth or normal labels.
Normalized RGB color space. Let r, g and b be the normalized RGB values, such that
$$r = \frac{R}{R+G+B}, \quad g = \frac{G}{R+G+B}, \quad b = \frac{B}{R+G+B}.$$
In this color space representation, one of the channels is redundant because it can be recovered from the other two channels using the condition $r + g + b = 1$. Therefore, for example, we can replace the blue channel with depth D or normals (one-channel representation) N, normalized to $[0, 1]$. This results in the fused images rgD and rgN. Sample rgD and rgN images can be seen in the fifth and sixth rows of Fig. 3.3. Note that, in terms of energy, blue represents about 10% of the visible spectrum on average.
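The channel substitution can be sketched as below. The handling of black pixels (where R + G + B = 0) through a small `eps` is an assumption, as is the requirement that the substituted channel is already scaled to [0, 1]; `fuse_rg_channel` is a hypothetical name.

```python
import numpy as np

def fuse_rg_channel(rgb, extra, eps=1e-8):
    """Build an rgD or rgN image from an RGB image and a one-channel image.

    rgb   : array of shape (H, W, 3)
    extra : array of shape (H, W), depth D or normal labels N scaled to [0, 1]
    """
    rgb = np.asarray(rgb, dtype=np.float64)
    total = rgb.sum(axis=-1, keepdims=True)
    # Normalized chromaticities: r + g + b = 1 wherever the pixel is not black.
    chroma = rgb / np.maximum(total, eps)
    fused = chroma.copy()
    fused[..., 2] = extra  # the redundant blue channel is replaced
    return fused
```

The original blue chromaticity remains recoverable as 1 - r - g, which is why the substitution loses no chromatic information.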
HSI color space. The second color space fusion is based on hue, saturation and intensity (HSI) representation. In this representation, intensity I is replaced with depth D or normals N, which results in HSD and HSN representation, respectively, where
Sample HSD and HSN images can be seen in the seventh and eighth rows of Fig. 3.3.
Finally, we investigate a representation which is a transformation of the HSD or HSN representation back to the RGB space. Therefore, we calculate
where
for color space modification using depth D or calculated using (3.5), where
for color space modification using normals N. Sample images of this representation can be seen in the last two rows of Fig. 3.3.
Due to the similarity in the structures of the investigated networks, the same hyper-parameters are used in the experiments for all three networks. We use pretrained weights for the VGG-16 part of the networks, which are trained on the large-scale ImageNet dataset [1]. When the network has an additional one-channel branch, such as the depth branch of FuseNet, or an additional input channel, i.e. a four-channel network, the three-channel pretrained weights of the input convolution layer of VGG-16 are averaged and used for initializing the extra channel. A stochastic gradient descent (SGD) solver and cross-entropy loss are used for training. The entire training dataset is passed through the network at each epoch. Within an epoch, the mini-batch size is set to 4 and the batch is randomly shuffled at each iteration. The momentum and weight decay are set to 0.9 and 0.0005, respectively. The learning rate is 0.005 and is decreased by 10% every 30 epochs. The best model is chosen according to the performance on the validation set measured after each epoch. We run 300 epochs for all tests, but the best loss is typically achieved between 100 and 150 epochs.
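Two of these settings lend themselves to a short sketch: initializing the extra input channel from the pretrained three-channel weights, and the stepwise learning rate. The weight layout (output channels, input channels, kernel height, kernel width) is an assumption, and "decreased by 10%" is read here as multiplying the rate by 0.9 every 30 epochs; both function names are hypothetical.

```python
import numpy as np

def extend_first_conv(weights_rgb):
    """Append a fourth input channel initialized with the mean of the RGB weights.

    weights_rgb : array of shape (out_channels, 3, kh, kw)  (assumed layout)
    returns     : array of shape (out_channels, 4, kh, kw)
    """
    w = np.asarray(weights_rgb)
    extra = w.mean(axis=1, keepdims=True)  # average over the three input channels
    return np.concatenate([w, extra], axis=1)

def learning_rate(epoch, base_lr=0.005, drop=0.10, step=30):
    """Step schedule: the rate is reduced by `drop` after every `step` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // step)
```

For a separate one-channel branch, the averaged slice alone (the `extra` array) would serve as the initial weights of its first convolution.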
The prediction images calculated from the test set are evaluated based on the elements of the confusion matrix. $TP_i$, $FP_i$ and $FN_i$, respectively, denote the total number of true positive, false positive and false negative pixels of the ith label, where $i = 1, \dots, K$ and K is the number of label classes. We use three standard metrics to measure the overall performance of the networks. Global accuracy represents the percentage of all pixels that are correctly classified and is defined as
$$\text{GlobalAcc} = \frac{1}{N} \sum_{i=1}^{K} TP_i,$$
where N denotes the total number of annotated pixels. The mean accuracy measures the average accuracy over all classes:
$$\text{MeanAcc} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FN_i}.$$
Finally, intersection over union (IoU) is defined as the average value of the intersection of the prediction and ground truth over the union of them:
$$\text{IoU} = \frac{1}{K} \sum_{i=1}^{K} \frac{TP_i}{TP_i + FP_i + FN_i}.$$
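The three metrics can be computed from a confusion matrix as sketched below. Two assumptions are made: per-class accuracy is taken as the fraction of ground-truth pixels of a class that are labeled correctly, i.e. true positives over true positives plus false negatives, and classes that never occur are skipped in the averages. The function name `segmentation_metrics` is hypothetical.

```python
import numpy as np

def segmentation_metrics(conf):
    """Global accuracy, mean class accuracy and mean IoU from a confusion matrix.

    conf[i, j] : number of pixels with ground-truth class i predicted as class j
    """
    conf = np.asarray(conf, dtype=np.float64)
    tp = np.diag(conf)
    fn = conf.sum(axis=1) - tp  # pixels of class i that were missed
    fp = conf.sum(axis=0) - tp  # pixels wrongly assigned to class i
    global_acc = tp.sum() / conf.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        class_acc = tp / (tp + fn)
        iou = tp / (tp + fp + fn)
    # Assumption: absent classes produce NaN and are excluded from the means.
    return global_acc, np.nanmean(class_acc), np.nanmean(iou)
```

The confusion matrix itself can be accumulated over all test images before calling the function, so that the metrics are pixel-weighted rather than image-weighted.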
Results for the various network and input structures are presented in Table 3.3. Fig. 3.4 shows the accuracies of each class for different networks. The baseline solution, SegNet with RGB inputs, achieves 76.4% global accuracy in our tests. If only depth images are used in SegNet, then the accuracy drops by 5%. Fusing RGB and depth through color space transformations results in a noticeable decrease (–3%) in the network performance.
Table 3.3
Results on the Stanford indoor dataset; Training: 90% of Area 1, 9294 images; Validation: 10% of Area 1, 1033 images; Testing: 100% of Area 3, 3704 images.
*Network abbreviations: the name of each network consists of two parts: the base network, i.e. SegNet, FuseNet or VNet (see Fig. 3.2), followed by the input type. Note that, in some cases, the networks are slightly changed to be able to accept the different input size. For instance, RGBD and RGBN represent four-channel inputs. RGB+D and RGB+N indicate that the network has separate branches for depth or normals, respectively. XnYnZn is the three-channel input of 3D surface normals.
The four-channel variants of SegNet, i.e. SegNet RGBD and SegNet RGBN, produce similar results to SegNet RGB. In these cases, the information from the depth data clearly makes no contribution to the segmentation process; the network uses only the pretrained weights of SegNet RGB, and the weights associated with the depth are neglected. The FuseNet architecture with depth input, i.e. FuseNet RGB+D, produces slightly worse results than the baseline solution in our tests; specifically, it achieves 76.0% global accuracy, which is 0.4% lower than the accuracy produced by SegNet, a difference that can be considered insignificant. The mean IoU in this case is better than the baseline. Note that this 76% accuracy is similar to the result reported in [26] on the SUN RGB-D benchmark dataset. In our tests, we found that FuseNet with normals, i.e. FuseNet RGB+N, provides the best results in terms of global and mean accuracy. The margin between either SegNet RGB or FuseNet RGB+D and FuseNet RGB+N is significant. The 3D normal representation provides the same global accuracy as FuseNet RGB+N, while it outperforms FuseNet RGB+N by a 2% margin in the mean IoU.
Discussion. The results suggest that depth does not improve the overall accuracy for the adopted dataset. Using normals instead of depth, however, improves almost all types of RGB and depth fusion approaches by a significant margin. Since the test dataset is from a different area (Area 3) than the training dataset (Area 1), these results indicate that the normals improve the generalization ability of the network. In other words, normals are a better general representation of indoor scenes than depth. In order to support this hypothesis, we also evaluated the networks using data from Area 1 only. No samples were taken from Area 3 either for training or for testing. These results are presented in Table 3.4. In this training scenario, the images from Area 1 are divided into 10% (1033 images) training and 90% (9294 images) validation sets. We used this extreme split in order to rely on the pretrained SegNet weights; consequently, the achieved accuracies presented in Table 3.4 are significantly smaller than in the first test scenario. Table 3.4 shows no, or only a slight, performance gap between depth and normal data for all approaches. FuseNet shows the best results by effectively combining RGB and depth; FuseNet outperforms SegNet by a 6% margin, in contrast to the results presented in Table 3.3; note that similar improvements were reported by previous studies [26].
It is important to note that FuseNet is able to combine RGB and depth with a very low number of training samples. This prompts the question of whether the network actually learns the depth features, or whether the addition operator applied between the network branches is simply an efficient but deterministic way to share depth information with the VGG-16 network.

Table 3.4 Results for Area 1 only; Training: 10% of Area 1, 1033 images; Testing and validation: 90% of Area 1, 9294 images.

The per-class accuracies in Fig. 3.4 show that the networks perform differently in each class. FuseNet RGB+D provides the best overall accuracy metrics, but other networks, such as SegNet HSD, perform better in the majority of classes, such as column, beam and chair. This indicates that an ensemble classifier using these networks might outperform any individual network.

Results for the ISPRS Vaihingen dataset are listed in Table 3.5. The baseline solution is the three-channel SegNet NRG network, which achieves 85.1% global accuracy. Note that SegNet NRG is the same network as SegNet RGB discussed in the previous section; we use the term NRG to emphasize that the ISPRS images are false-color infrared, red and green images. As expected, the accuracy drops by about 10% when only depth data are used. All network solutions produce results close to the baseline. SegNet NRGD, NRG+N and VNet NRG+N slightly outperform SegNet NRG in all three metrics, though the difference is not significant. The best results are provided by VNet NRG+N, which achieves 85.6% global accuracy, 50.8% mean accuracy and 60.4% mean IoU. Per-class accuracies are presented in Fig. 3.5. Here, all networks perform very similarly in all classes. A slight difference can be seen in the clutter class, where SegNet NRGD outperforms the other networks.

Table 3.5 Results for the ISPRS Vaihingen dataset; Training: 65%, 8255 images; Validation: 15%, 1886 images; Testing: 20%, 2523 images.
*Network abbreviations: the name of each network in the first column consists of two parts: the base network, i.e. SegNet, FuseNet or VNet (see Fig. 3.2), followed by the input type. NRG denotes the infrared, red and green bands, which is the image format of the ISPRS dataset. Note that, in some cases, the first layer of the networks is slightly changed so that it can accept a different input size. For instance, NRGD and NRGN denote four-channel inputs. NRG+D and NRG+N indicate that the network has a separate branch for depth or normals, respectively. XnYnZn is a three-channel input of 3D surface normals.

Discussion. Overall, almost all accuracies presented in Table 3.5 are within the error margin and are around 85%. Using normals yields an insignificant improvement because almost all surface normals point in the same direction (upward). Table 3.5 shows a very slight improvement of VNet over FuseNet, while FuseNet NRG+D does not reach the accuracy of the baseline SegNet architecture. It is noteworthy that the global accuracies reported on the ISPRS Vaihingen dataset using either SegNet or FuseNet are around 91% [11]. This relatively large performance gap between Table 3.5 and [11] is potentially due both to the difference in evaluation technique and to the random selection of images in the training process.

Although SegNet NRG has results similar to VNet or FuseNet, there are examples where data fusion using depth outperforms SegNet; see Fig. 3.6. The red rectangle marks a section where SegNet NRG does not detect a building because the color of the roof is very similar to that of the road; see Figs. 3.6D and 3.6C for the ground truth. Unlike SegNet, FuseNet NRG+D partially segments that building, see Fig. 3.6E, owing to the depth difference between the roof and the road surface in the depth image presented in Fig. 3.6B.
Also, note that the depth image is not sharp around the building contours, and the prediction of FuseNet NRG+D follows this irregularity; see Figs. 3.6B and 3.6E. This indicates that the quality of the depth image is important for obtaining reliable segment contours.

Using normals instead of depth as input also does not improve the global accuracy significantly. Most object categories, such as grass or road, are horizontal; their normals therefore point upward and contain no distinctive information about the object class itself. In fact, after analyzing the confusion matrices, we found that FuseNet NRG+N slightly decreased (0.7%) the false positives of the building class and improved the true positives of the impervious surface class by the same ratio. This indicates that the network differentiates better between buildings and roads. The reason is that some roofs in the dataset are not flat, such as gable or hipped roofs. In these cases, the normals have a distinguishable direction that guides the network to correctly identify some buildings. Two examples can be seen in the first two rows of Fig. 3.7. The figures demonstrate that FuseNet NRG+N produces sharper object edges than a single SegNet NRG classifier. Finally, the last row of Fig. 3.7 presents a case in which the NRG-based network does not correctly classify an object according to the ground truth; see the object boxed by the red rectangle. The ground truth and FuseNet NRG+N mark that object as a car, whereas SegNet marks it as a building. This example demonstrates that normal or depth data has a semantic impact on the interpretation of a patch.

3.4.1 Results and Discussion on the Stanford Dataset
Table 3.3

| Network* | GlobalAcc | MeanAcc | Mean IoU |
|---|---|---|---|
| SegNet RGB | 76.4% | 56.0% | 70.8% |
| SegNet Depth Only | 71.0% | 50.8% | 67.5% |
| SegNet HSD | 73.4% | 53.6% | 74.2% |
| SegNet HSN | 77.4% | 57.1% | 71.2% |
| SegNet RdGdB | 73.5% | 49.7% | 64.1% |
| SegNet RnGnB | 74.6% | 51.4% | 67.2% |
| SegNet rgD | 76.1% | 55.6% | 71.3% |
| SegNet rgN | 76.3% | 53.4% | 69.2% |
| SegNet RGBD | 75.4% | 53.6% | 68.5% |
| SegNet RGBN | 76.7% | 57.4% | 74.4% |
| SegNet RGBDN | 68.7% | 45.2% | 58.9% |
| FuseNet RGB+D | 76.0% | 54.9% | 73.7% |
| FuseNet RGB+N | 81.5% | 62.4% | 76.1% |
| FuseNet RGB+XnYnZn | 81.4% | 62.4% | 78.1% |
| VNet RGB+D | 78.3% | 54.2% | 68.5% |
| VNet RGB+N | 80.0% | 56.1% | 70.4% |

Table 3.4

| Channels | GlobalAcc | MeanAcc | Mean IoU |
|---|---|---|---|
| SegNet RGB | 65.0% | 61.8% | 45.2% |
| SegNet rgD | 58.5% | 53.1% | 36.9% |
| SegNet rgN | 58.7% | 53.1% | 36.5% |
| SegNet RGBD | 68.6% | 58.8% | 44.9% |
| SegNet RGBN | 68.8% | 59.1% | 45.4% |
| FuseNet RGB+D | 71.9% | 73.0% | 48.8% |
| FuseNet RGB+N | 69.1% | 58.0% | 43.6% |

3.4.2 Results and Discussion on the ISPRS Dataset
Table 3.5

| Networks* | GlobalAcc | MeanAcc | Mean IoU |
|---|---|---|---|
| SegNet NRG | 85.1% | 49.7% | 59.3% |
| SegNet Depth Only | 75.9% | 38.9% | 49.9% |
| SegNet HSD | 84.7% | 46.6% | 56.4% |
| SegNet HSN | 82.8% | 45.2% | 54.8% |
| SegNet NRGD | 74.2% | 39.8% | 50.9% |
| SegNet NRGN | 85.2% | 48.5% | 58.4% |
| SegNet NRGDN | 84.8% | 46.7% | 57.1% |
| FuseNet NRG+D | 84.4% | 47.3% | 57.4% |
| FuseNet NRG+N | 85.5% | 48.1% | 57.9% |
| FuseNet NRG+XnYnZn | 85.1% | 47.2% | 57.3% |
| VNet NRG+D | 85.6% | 48.2% | 58.4% |
| VNet NRG+N | 85.6% | 50.8% | 60.4% |
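The GlobalAcc, MeanAcc and Mean IoU columns reported in this section are typically derived from a confusion matrix. The sketch below assumes the common definitions: global accuracy as the fraction of correctly labeled pixels, mean accuracy as the average per-class recall, and mean IoU as the unweighted average of per-class intersection over union. The chapter's own evaluation code is not shown and may differ, for example in how classes are weighted; the function name and the toy matrix are hypothetical.

```python
def segmentation_metrics(confusion):
    """Return (global accuracy, mean accuracy, mean IoU) for a square confusion matrix.

    confusion[i][j] counts pixels whose true class is i and predicted class is j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n))

    per_class_acc = []
    per_class_iou = []
    for i in range(n):
        true_pos = confusion[i][i]
        actual = sum(confusion[i])                          # ground-truth pixels of class i
        predicted = sum(confusion[r][i] for r in range(n))  # pixels predicted as class i
        union = actual + predicted - true_pos
        if actual > 0:      # skip classes absent from the ground truth
            per_class_acc.append(true_pos / actual)
        if union > 0:       # skip classes that never occur at all
            per_class_iou.append(true_pos / union)

    global_acc = correct / total
    mean_acc = sum(per_class_acc) / len(per_class_acc)
    mean_iou = sum(per_class_iou) / len(per_class_iou)
    return global_acc, mean_acc, mean_iou


# Toy two-class example (made-up counts, not taken from the experiments).
toy = [[8, 2],
       [1, 9]]
global_acc, mean_acc, mean_iou = segmentation_metrics(toy)
```

Because mean accuracy and mean IoU average over classes rather than pixels, rare classes influence them much more than they influence global accuracy, which is one reason the three columns can rank the networks differently.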
The chapter reviewed current end-to-end learning approaches to combining RGB and depth for image segmentation and labeling. We conducted tests with the SegNet, FuseNet and VNet architectures using various inputs, including RGB, depth, normal labels, HSD, HSN and 3D surface normals. Networks were trained on indoor and remote sensing images. We obtained global accuracies similar to those reported by other authors: around 78% on the Stanford dataset and 85% on the ISPRS dataset. For indoor scenes, when the training and testing datasets were captured in different areas, we observed a 1–5% improvement from using normal labels instead of depth in almost all fusion approaches. This indicates that surface normals are a more general representation of the scene, which can be exploited in neural networks trained for semantic segmentation of indoor images. The best global accuracy, 81.5% on the Stanford indoor dataset, was achieved by FuseNet RGB+N. Similar performance gaps between networks using depth and normals were not observed for aerial images, owing to the differences between indoor and aerial scenes.
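The 1–5% figure can be checked against the global accuracies in Table 3.3 by subtracting each depth-based variant from its normal-based counterpart. The snippet below is only an arithmetic check: the values are copied from Table 3.3, while the pairing labels and variable names are illustrative.

```python
# Global accuracy (%) from Table 3.3 as (depth-based variant, normal-based variant).
table_3_3 = {
    "SegNet HSD / HSN": (73.4, 77.4),
    "SegNet RdGdB / RnGnB": (73.5, 74.6),
    "SegNet rgD / rgN": (76.1, 76.3),
    "SegNet RGBD / RGBN": (75.4, 76.7),
    "FuseNet RGB+D / RGB+N": (76.0, 81.5),
    "VNet RGB+D / RGB+N": (78.3, 80.0),
}

# Gain in percentage points when depth is replaced by normals
# (rounded to one decimal to avoid floating-point noise).
gains = {name: round(normal - depth, 1) for name, (depth, normal) in table_3_3.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

All six gains are positive, ranging from 0.2 percentage points (rgD to rgN) to 5.5 (FuseNet), which is consistent with the stated 1–5% improvement in almost all fusion approaches.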
In summary, the accuracy provided by the simpler SegNet RGB network is close to that of both FuseNet and VNet, indicating that there is still room for improvement, especially for remote sensing data. Future models might adopt elements of more recent networks, such as the residual connections used in ResNet or the conditional random fields implemented in DeepLab. It is also expected that future networks will become deeper. Directly incorporating some depth knowledge into the network, for instance through directed dropout of weights in the convolution layers, might be an interesting research direction. For high-resolution images, segmentation at various resolutions (image pyramid) might increase the global consistency of the inference.
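An image pyramid of the kind mentioned above can be built by repeatedly halving the image resolution. The following is a minimal sketch that uses 2×2 average pooling on a single-channel image stored as nested lists. The function names are hypothetical, and how the per-level predictions would be merged is left open, since the chapter does not specify it.

```python
def downsample_2x(image):
    """Halve the resolution by averaging non-overlapping 2x2 blocks (odd edges are dropped)."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [
        [
            (image[2 * r][2 * c] + image[2 * r][2 * c + 1]
             + image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(cols)
        ]
        for r in range(rows)
    ]


def build_pyramid(image, levels):
    """Return [full resolution, 1/2, 1/4, ...], stopping at `levels` entries or a 1-pixel side."""
    pyramid = [image]
    while len(pyramid) < levels and len(pyramid[-1]) >= 2 and len(pyramid[-1][0]) >= 2:
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid


# Toy 4x4 single-channel image (made-up intensities).
image = [[0, 0, 4, 4],
         [0, 0, 4, 4],
         [8, 8, 2, 2],
         [8, 8, 2, 2]]
pyramid = build_pyramid(image, levels=3)
```

In such a scheme, each level could be passed through the segmentation network separately, with coarse levels contributing context and fine levels contributing detail.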
Augmenting Siamese networks by adding tensors, as in FuseNet or VNet, clearly improves results and is a simple way to fuse RGB and depth data. These networks outperform the original SegNet network on indoor datasets. In the future, more powerful neural network architectures are expected to be developed, and these can then be adopted for the two modalities.
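As a rough illustration of fusion by tensor addition, the sketch below adds a depth-branch feature map to an RGB-branch feature map element by element. It uses plain nested lists instead of a deep learning framework, and the function name, shapes and values are hypothetical; it is not the FuseNet or VNet implementation.

```python
def fuse_by_addition(rgb_features, depth_features):
    """Element-wise sum of two feature maps with identical [channel][row][col] shape."""
    if len(rgb_features) != len(depth_features):
        raise ValueError("both branches must produce the same number of channels")
    fused = []
    for rgb_channel, depth_channel in zip(rgb_features, depth_features):
        fused.append([
            [r + d for r, d in zip(rgb_row, depth_row)]
            for rgb_row, depth_row in zip(rgb_channel, depth_channel)
        ])
    return fused


# Toy feature maps: 2 channels of 2x2 activations (made-up numbers).
rgb = [[[1.0, 2.0], [3.0, 4.0]],
       [[0.5, 0.5], [0.5, 0.5]]]
depth = [[[0.0, 1.0], [0.0, 1.0]],
         [[2.0, 0.0], [0.0, 2.0]]]

fused = fuse_by_addition(rgb, depth)
```

The addition itself has no trainable parameters, which is why the discussion of Table 3.4 asks whether the network learns depth features or merely benefits from this deterministic way of sharing depth information.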