Arnaud Le Bris⁎; Nesrine Chehata⁎,†; Walid Ouerghemmi⁎,¶; Cyril Wendl⁎,‡; Tristan Postadjian⁎; Anne Puissant§; Clément Mallet⁎ ⁎Univ. Paris-Est, LASTIG STRUDEL, IGN, ENSG, Saint-Mande, France
†EA G&E Bordeaux INP, Université Bordeaux Montaigne, Pessac, France
‡Student at Ecole Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland
§CNRS UMR 7362 LIVE-Université de Strasbourg, Strasbourg, France
¶Aix-Marseille Université, CNRS ESPACE UMR 7300, Aix-en-Provence, France
Very high spatial resolution (VHR) multispectral imagery enables a fine delineation of objects and a possible use of texture information. Other sensors provide a lower spatial resolution but an enhanced spectral or temporal information, permitting one to consider richer land cover semantics. So as to benefit from the complementary characteristics of these multimodal sources, a decision late fusion scheme is proposed. This makes it possible to benefit from the full capacities of each sensor, while dealing with both semantic and spatial uncertainties. The different remote-sensing modalities are first classified independently. Separate class membership maps are calculated and then merged at the pixel level, using decision fusion rules. A final label map is obtained from a global regularization scheme in order to deal with spatial uncertainties while conserving the contrasts from the initial images. It relies on a probabilistic graphical model involving a fit-to-data term related to merged class membership measures and an image-based contrast-sensitive regularization term. Conflict between sources can also be integrated into this scheme.
Two experimental cases are presented. The first case considers the fusion of VHR multispectral imagery with lower spatial resolution hyperspectral imagery for a fine-grained land cover classification problem in dense urban areas. The second case uses SPOT 6/7 satellite imagery and Sentinel-2 time series to extract urban area footprints through a two-step process: classifications are first merged in order to detect building objects, from which an urban area prior probability is derived and eventually merged with the Sentinel-2 classification output for urban footprint detection.
Late fusion; Decision fusion; Multimodal remote sensing; Classification; Land cover; Very high spatial resolution; Hyperspectral; Time series; Urban area; Urban footprint
Recent years have witnessed the emergence of a large variety of new sensors with various characteristics. The possibility to collect different kinds of observations over the same area has considerably increased: remote sensing can now be considered generically multimodal [1]. Those sensors can use different modalities (radar, Lidar, or optical). They can be airborne or satellite-borne. Even for the same modality, they can exhibit very distinct characteristics: for instance, optical sensors show a large range of spectral configurations (number, position, and width of the spectral bands), spatial resolutions, coverage and, for spaceborne sensors, revisit times (i.e., the minimum delay between two possible consecutive acquisitions over the same area, thus conditioning the possibility to capture genuine time series). As a consequence, combining remote-sensing data with different characteristics is a standard remote-sensing problem that has been extensively investigated in the literature [2]. The overall aim consists in fusing multisensor information as a means of combining the respective advantages of each sensor. Complementary observations can thus be exploited for land cover mapping purposes, which is a core remote-sensing application and the necessary input for a large number of public policies and environmental models. Combining existing sensors can mitigate the limitations of any one particular sensor for various land cover issues [3,4].
This chapter specifically focuses on the fusion of one data type, exhibiting a very high spatial resolution, with another one, exhibiting a lower spatial resolution but enhanced complementary characteristics. Indeed, very high spatial resolution (VHR) multispectral imagery enables an accurate spatial delineation of objects and a possible use of texture information for enhanced class discrimination [5]. On the other hand, sensors with lower spatial resolutions offer enhanced spectral or temporal information, making it possible to consider richer land cover semantics. To illustrate this problem, two use cases will be considered, accompanied by two methodological contributions:
In both cases, the land cover fusion scheme targets benefiting from the complementary characteristics of these multimodal sources.
Existing data fusion approaches will be analyzed in Sect. 11.1.1. From this review, existing methods will be discussed, and a fusion strategy elaborated (Sect. 11.1.2). This proposed framework will then be presented in detail (Sect. 11.2), before being applied to the two above-mentioned use cases (Sects. 11.3 and 11.4, respectively).
Fusion of heterogeneous data sources has been widely investigated in the remote-sensing literature (e.g., [6–9]). Fusion can be carried out at three different levels [10]:
11.1.1.1 Early fusion – fusion at the observation level
Fusion can be achieved at the observation level, i.e., through the direct joint analysis of pixel values (with or without calibration procedures). Pan-sharpening is a well-known technique of this kind: it integrates the geometric details of a high resolution panchromatic image and the color information of a lower spatial resolution multispectral (or hyperspectral) image to produce a high spatial resolution multispectral (or hyperspectral) image. Pan-sharpening methods usually use the panchromatic image to replace the high-frequency part of the low resolution image [11]. Other fusion algorithms have been proposed to merge multispectral (or hyperspectral) and panchromatic images (but also multispectral and hyperspectral images) so as to combine complementary characteristics in terms of spatial and spectral resolutions [12–14]. A review of such methods can be found in [14]. Finally, super-resolution is another approach relying on early fusion of several sensors [15].
11.1.1.2 Intermediate fusion – fusion at the attribute/feature level
Data sources can also be merged at the feature level. Features (spectral indices, texture measures, etc.) are computed for each source separately, or jointly for both, and fed into a single classifier through a unique feature set [16]. Examples of remote-sensing pipelines involving fusion at the attribute level can be found in [17–22]. For instance, [18] proposed a conditional random field (CRF) model for building detection using InSAR and orthoimage features. Reference [20] merged Lidar and optical aerial image features for forest stand extraction: the proposed approach involves several steps (segmentation, classification, and regularization), fusion is performed at each of them, and an improvement is observed at each step. More recently, deep convolutional neural networks have been used to perform data fusion at the attribute level [23]. For instance, [22] applied deep forests to Lidar and hyperspectral imagery features. A detailed review can be found in [24]. Several related datasets and challenges have also been released in the last decades, under the aegis of the IEEE GRSS society [25–27].
11.1.1.3 Late fusion – fusion at the decision level
Late decision fusion happens after the classification process: the outputs of multiple independent classifiers are combined in order to provide a more reliable decision. Such classification results can be either label maps or class membership probability maps. Various late decision fusion methods have been proposed. Most of them fall into four categories: consensus rules (e.g., majority voting), probabilistic approaches (e.g., Bayesian fusion), credibilist or evidential approaches, and possibilist ones. The probabilistic, evidential, and credibilist decision fusion approaches are generic and can be applied to different fusion problems. They only require class membership measures (probabilities or belief masses, depending on the approach) for each source and for each class, or at least a confidence measure for each source. Possibilist methods use fuzzy logic-based fusion rules [28–30]. They require one to define weights [31] in order to better deal with the uncertainty of the different sources. Such generic approaches have been applied to remote-sensing data [32]. Evidential approaches are a generalization of probabilistic ones. They include the well-known Dempster–Shafer fusion rule [33]. In remote sensing, this rule has often been used to merge classifications or alarm detection results. For instance, [34–36] applied the Dempster–Shafer rule to combine several supervised building detectors based on different remote-sensing modalities (optical, Lidar, radar). This rule was also used by [37] for the fusion of different road obstacle detectors in the context of intelligent vehicle development. Reference [38] used the Dempster–Shafer rule for the fusion of urban footprints detected at different dates from satellite archival images. References [39,40] applied this rule to detecting changes in an unsupervised classification context.
Another evidential fusion rule is the Yager rule, applied by [41] to combine different road obstacle detectors for vehicle navigation. Further evidential rules have been proposed more recently, such as the Dezert–Smarandache rule [42]. The rules introduced in [43,44] extend the Dempster–Shafer rule toward a better management of the conflict between sources. However efficient in some cases, Dempster–Shafer remains a theoretically complex framework that does not apply easily to heterogeneous and multiple data. Another important issue consists in defining the input of the different fusion rules. Two situations can be distinguished, depending on whether a global confidence measure is assigned to each source or, conversely, per-pixel class posterior probabilities are directly available for each source (provided by the initial classifier). In the former situation, the global confidence measures assigned to each source are calculated from confusion matrices (indeed, a confusion matrix provides the probability that an object labeled as class A by one source belongs in fact to class B). Validation data are thus necessary to calculate these weights. Furthermore, for evidential methods, and especially Dempster–Shafer ones, uncertainty classes (i.e., unions of original classes) must be defined, and the belief masses associated to these classes must be computed. If a global confidence is associated to a source, solutions such as Appriou's method [45] or, more recently, the method presented in [46,47] have been proposed. If a class membership measure is available for each pixel, each source, and each class, it is also possible to derive belief masses for these uncertainty (union) classes. Finally, [38] proposes an alternative way to integrate these two kinds of information.
Last but not least, the last category of late fusion approaches consists in supervised learning: the best way to merge the input sources is learned automatically from training examples. Per-source posterior class probabilities are concatenated and considered as a feature vector, which is then provided as input to a classifier trained to perform the best possible fusion. Such methods can thus be considered to lie at the interplay between late and intermediate fusion (the classifiers previously applied to each source act as a kind of feature generator, a setting also referred to as auto-context classification [48]). Such approaches have been used with different classifiers: random forests [49,50], Adaboost [50], and support vector machines [51,52]. These supervised learning-based methods yield good results but require a sufficient amount of training data to model the classes and avoid over-fitting (especially for deep learning). In addition, most fusion methods mentioned in this section generate measures assessing the conflict between two sources. Late approaches have often been applied to remote-sensing fusion problems [53–56,32,51,57–61]. Fusion methods operating at the decision level can be applied in two situations, depending on whether they merge multiple classifiers applied to the same data source or to multiple sources. For instance, [53] combined neural network and statistical maximum likelihood classifiers using several consensus theory rules (i.e., majority voting, complete agreement) to classify multispectral and hyperspectral images. References [57,58] merged posterior probabilities from maximum likelihood classifications of optical images with prior class information derived from, respectively, digital terrain models or digital surface models, as well as information from existing land cover databases.
Reference [55] investigated the fusion of multitemporal Thematic Mapper images using decision fusion-based methods (i.e., joint likelihood and weighted majority fusion). A characterization of the spatial organization of SAR image elements was investigated by [54], merging the responses of multiple low-level detectors applied to the same image within a Dempster–Shafer scheme. Reference [32] investigated the use of fuzzy decision rules to combine the classification results of a conjugate gradient neural network and a fuzzy classifier over an IKONOS image. Reference [61] combined convolutional neural networks and random forest classifiers using a multiplicative Bayesian scheme.
Fusion can be performed at different levels. Fusion at the observation level, e.g., pan or multisharpening, is limited to specific situations where it has a real physical meaning (e.g., hyperspectral and multispectral images acquired simultaneously). It is not generic enough.
Fusion at the feature level or at the decision level is more generic and thus applicable to the present fusion problem. Two main issues remain: (i) for both, the spatial scale of analysis and, subsequently, the interpolation process; (ii) for feature-based approaches, the ability to correctly handle the various data sources in the decision process. In the case of imbalanced feature sets, supervised techniques such as random forests or support vector machines, even with feature selection strategies, may favor the data source generating the larger number of attributes. Thus, the process will not fully benefit from the advantages of all datasets.
As a consequence, a late fusion strategy is adopted in this chapter. Indeed, contrary to intermediate-level fusion methods, it makes it possible to first process each input data source independently with specific, optimal methods. Moreover, it even enables one to use already existing results from available operational land cover classification services, as long as they provide class membership confidence measures. Besides, especially in this last situation, it can be used without any ground truth (training) information, contrary to intermediate-level fusion methods.
Most existing decision fusion methods do not explicitly take into account the fact that input data sources have different spatial resolutions, and thus do not explicitly deal with both semantic and spatial uncertainties. Spatial uncertainty handling here consists in removing classification noise, and enforcing that the classification result follows as closely as possible the natural borders in the original images. Such a task can be cast in the form of a smoothing problem. Local smoothing methods exist: majority voting, Gaussian and bilateral filtering [62], as well as probabilistic relaxation [63] are possible. The majority vote can be used in particular when a segmentation of the area is available: the major class is assigned to the segment. The vote can also be weighted by class probabilities of the different pixels. The probabilistic relaxation is another local smoothing method that aims at homogenizing probabilities of a pixel according to its neighbors. It is an iterative algorithm in which the class probability at each pixel is updated at each iteration in order to have it closer to the probabilities of its neighbors.
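To make the local smoothing idea concrete, the sketch below runs one iteration of a probabilistic-relaxation-style update on a small grid of per-pixel class probability vectors. The averaging update, the `alpha` mixing weight, and the function name are illustrative assumptions, not the exact scheme of [63]:

```python
# Illustrative sketch: one iteration of probabilistic relaxation on a grid of
# per-pixel class probability vectors (4-neighborhood). The simple averaging
# update and the names are assumptions, not the exact scheme of [63].

def relax_once(probs, alpha=0.5):
    """Pull each pixel's class probabilities toward the mean of its neighbors."""
    h, w = len(probs), len(probs[0])
    k = len(probs[0][0])
    out = [[None] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            nbrs = [probs[i + di][j + dj]
                    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                    if 0 <= i + di < h and 0 <= j + dj < w]
            mean = [sum(p[c] for p in nbrs) / len(nbrs) for c in range(k)]
            mixed = [(1 - alpha) * probs[i][j][c] + alpha * mean[c]
                     for c in range(k)]
            s = sum(mixed)  # renormalize to keep a probability vector
            out[i][j] = [m / s for m in mixed]
    return out

# A noisy pixel surrounded by confident neighbors gets smoothed:
grid = [[[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]],
        [[0.9, 0.1], [0.2, 0.8], [0.9, 0.1]],
        [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]]
smoothed = relax_once(grid)
```

After the update, the central pixel's probabilities are pulled toward those of its neighborhood, so its label flips to the locally dominant class; iterating the procedure amplifies this homogenization.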
However, these local smoothing methods are generally outperformed by global regularization strategies [64,20]. Global regularization methods consider the whole image by connecting each pixel to its neighbors. They traditionally adopt Markov random fields (MRFs): the labels at different locations are not considered to be independent and the global solution can be retrieved with the simple knowledge of the close neighborhood for each pixel. The optimal configuration of labels is retrieved when finding the maximum a posteriori over the entire field [65,64]. The problem is therefore considered as a minimization procedure of a global energy over the whole image. Despite a simple neighborhood encoding (pairwise relations are often preferred), the optimization procedure propagates over large distances. Global regularization is often considered as a post-processing step within a classification process. It has been associated to late fusion in recent works, as for instance in [66,67].
As a consequence, so as to benefit from the complementarity of a very high spatial resolution sensor and another one exhibiting a lower spatial resolution but enhanced complementary characteristics, the proposed fusion framework involves (i) fusion at the decision level, (ii) associated with a global regularization. It mostly relies on existing state-of-the-art methods, but combines them in order to cope with both semantic and spatial uncertainties. Besides, it is flexible enough to integrate several fusion rules and be applied to various use cases.
A late fusion framework is proposed in order to benefit both from low spatial resolution data (but enhanced spectrally or temporally) and from very high spatial resolution monodate multispectral data. It aims at dealing with both semantic and spatial uncertainties. It consists of three main steps, presented in Fig. 11.1.
The first tested fusion approach is based on fuzzy rules [30]. Fuzzy set theory states that a fuzzy set A in a reference set of classes $\Omega$ is characterized by its membership function $\mu_A : \Omega \rightarrow [0,1]$, where $\mu_A(c)$ gives the membership degree of class c in A.
In order to account for the fact that fuzzy sets with a strong fuzziness possibly hold unreliable information, each fuzzy set i (among the n sources) is weighted according to a pointwise confidence measure derived from its class membership probabilities.
In the subsequent experiments, all fuzzy rules were tested using as input the probabilities weighted by this pointwise confidence measure.
The following fusion rules based on fuzzy logic were considered:
A straightforward approach is to sum or multiply the input class membership probabilities, as a Bayesian sum (akin to a majority vote) or a Bayesian product [70]:

$P_{\mathrm{sum}}(c \mid x) = \frac{1}{n} \sum_{s=1}^{n} P_s(c \mid x), \qquad P_{\mathrm{prod}}(c \mid x) \propto \prod_{s=1}^{n} P_s(c \mid x).$
Those rules will be referred to as Bayesian sum and Bayesian product, respectively.
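These two rules can be sketched in a few lines, operating on per-pixel class probability vectors (the function names and the two-source example are illustrative):

```python
# Minimal sketch of the Bayesian sum and product fusion rules for n sources,
# operating on per-pixel class probability vectors (names are illustrative).

def bayes_sum(prob_vectors):
    """Average the class probabilities of all sources."""
    n = len(prob_vectors)
    k = len(prob_vectors[0])
    return [sum(p[c] for p in prob_vectors) / n for c in range(k)]

def bayes_product(prob_vectors):
    """Multiply per-class probabilities across sources, then renormalize."""
    k = len(prob_vectors[0])
    prod = [1.0] * k
    for p in prob_vectors:
        for c in range(k):
            prod[c] *= p[c]
    s = sum(prod)
    return [v / s for v in prod] if s > 0 else [1.0 / k] * k

# Two sources disagreeing on a 3-class pixel:
p_ms = [0.6, 0.3, 0.1]   # e.g., a VHR multispectral source
p_hs = [0.2, 0.7, 0.1]   # e.g., a hyperspectral source
fused_sum = bayes_sum([p_ms, p_hs])
fused_prod = bayes_product([p_ms, p_hs])
```

Note that the product rewards classes that are jointly plausible for all sources, while the sum behaves more like an averaged vote.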
The aim of these rules is to take into account the confidence of each source, measured by the classification margin. The margin is defined for each pixel x and each source s as the difference between the two highest class probabilities:

$M_s(x) = P_s(c_{(1)} \mid x) - P_s(c_{(2)} \mid x),$

with $c_{(1)}$ and $c_{(2)}$ the classes receiving the first and second highest probabilities from source s.
Fusion can then be carried out by preferring, for each pixel, the most confident source, i.e., the one with the highest margin. This fusion rule (referred to as margin-Max) selects, for each pixel x, the source for which the margin between the two highest probabilities is the largest:

$P_{\mathrm{fused}}(c \mid x) = P_{s^{*}}(c \mid x), \quad \text{where } s^{*} = \arg\max_{s} M_s(x).$
The classifier confidence information provided by the margin can also be used to weight the class probabilities of each source in the Bayesian sum and product (referred to as margin-weighted Bayesian sum and margin-weighted Bayesian product, respectively):

$P_{\mathrm{sum}}(c \mid x) \propto \sum_{s} M_s(x)\, P_s(c \mid x), \qquad P_{\mathrm{prod}}(c \mid x) \propto \prod_{s} P_s(c \mid x)^{M_s(x)}.$
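The margin-based rules can be sketched as follows; the margin-weighted sum shown here is one assumed form of margin weighting, and all names are illustrative:

```python
# Sketch of margin-based fusion (illustrative names): the margin of a source
# at a pixel is the gap between its two highest class probabilities; the
# margin-Max rule keeps the probabilities of the most confident source.

def margin(p):
    """Difference between the two highest class probabilities."""
    top_two = sorted(p, reverse=True)[:2]
    return top_two[0] - top_two[1]

def margin_max(prob_vectors):
    """Select, for the pixel, the source with the largest margin."""
    return max(prob_vectors, key=margin)

def margin_weighted_sum(prob_vectors):
    """Bayesian sum with each source weighted by its margin (assumed form)."""
    k = len(prob_vectors[0])
    total = sum(margin(p) for p in prob_vectors) or 1.0
    return [sum(margin(p) * p[c] for p in prob_vectors) / total
            for c in range(k)]

p_ms = [0.5, 0.4, 0.1]   # hesitant source, margin 0.1
p_hs = [0.1, 0.8, 0.1]   # confident source, margin 0.7
chosen = margin_max([p_ms, p_hs])
blended = margin_weighted_sum([p_ms, p_hs])
```

The hesitant source contributes little to the blended probabilities, while the margin-Max rule discards it entirely for this pixel.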
According to the Dempster–Shafer (DS) formalism, the information from a source s is given as a mass function $m_s$ defined over the set of simple classes and their unions. Masses associated to each simple class are directly the class membership probabilities: $m_s(\{c\}) = P_s(c \mid x)$. For mixed classes (i.e., unions of simple classes), a mass accounting for the uncertainty of the source is derived from these probabilities. The fusion rule is based on the following conflict measure between two sources A and B:

$K = \sum_{b \cap c = \emptyset} m_A(b)\, m_B(c).$

The fusion is performed by the normalized orthogonal sum

$m(c) = \frac{1}{1 - K} \sum_{b \cap b' = c} m_A(b)\, m_B(b').$
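For intuition, the sketch below combines two sources whose masses sit on singleton classes only (mixed/union classes are omitted for brevity, and the names are illustrative):

```python
# Sketch of Dempster's rule for two sources whose masses are placed on
# singleton classes only (union classes omitted for brevity).

def ds_combine(m_a, m_b):
    """Return (combined masses, conflict K) via the normalized orthogonal sum."""
    k = len(m_a)
    agree = [m_a[c] * m_b[c] for c in range(k)]
    conflict = 1.0 - sum(agree)   # mass assigned to incompatible class pairs
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return [a / (1.0 - conflict) for a in agree], conflict

m_ms = [0.6, 0.3, 0.1]
m_hs = [0.2, 0.7, 0.1]
fused, K = ds_combine(m_ms, m_hs)
```

In this degenerate singleton-only setting Dempster's rule reduces to a renormalized Bayesian product; the union classes of the full formalism are precisely what allows it to represent source uncertainty beyond that.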
In addition to the previous standard fusion rules, learning-based supervised methods were also tested [51,52,23]. Such methods consist in learning how to best merge both sources (based on a ground truth). A classifier is trained to label feature vectors corresponding to the concatenation of class membership measures from both sources. Thus, such a strategy can be considered at the interplay between late and intermediate fusion (classifiers applied independently to each source can then be considered as a kind of feature generators). It is similar to auto-context approaches. A drawback stems from the fact that they require a significant amount of reference data.
In the following experiments, two classifiers were considered for supervised fusion: random forests (RFs) [71] and support vector machines (SVMs) [72] with a linear or a radial basis function (RBF) kernel.
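The supervised fusion principle can be sketched with a deliberately tiny stand-in model: per-source class probabilities are concatenated into a feature vector and a classifier is trained on them. A two-class logistic model replaces the RF/SVM of the text here purely for self-containedness; all names and the toy data are illustrative:

```python
import math

# Sketch of supervised (learned) fusion: per-source class probabilities are
# concatenated into a feature vector on which a classifier is trained. A tiny
# logistic model stands in for the RF/SVM classifiers mentioned in the text.

def train_fusion(features, labels, lr=0.5, epochs=200):
    """Stochastic gradient descent for a two-class logistic fusion model."""
    w, b = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            g = 1.0 / (1.0 + math.exp(-z)) - y      # prediction error
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Fused posterior probability for the positive class."""
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Toy training set: [P_ms(urban), P_hs(urban)] per pixel, label = urban or not.
X = [[0.9, 0.8], [0.8, 0.9], [0.2, 0.1], [0.1, 0.3], [0.7, 0.6], [0.3, 0.2]]
y = [1, 1, 0, 0, 1, 0]
w, b = train_fusion(X, y)
```

The learned weights encode how much each source's opinion should count, which is exactly the quantity the hand-crafted rules above fix a priori.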
After fusion rules have been applied at the pixel level, a spatial regularization of the obtained classification map is performed. This regularization here aims at dealing with spatial uncertainties between both sources. Indeed, the fusion result still contains noisy patches, especially in transition areas between neighboring classes. Besides, considering original image information, such a regularization also enables one to preserve real-world contours more accurately.
A global regularization strategy is adopted [66]. The problem is expressed using an energy-based graphical model and solved as a min-cut problem. Indeed, such a formulation has been successfully used for many image processing purposes in recent years [73].
The problem is formulated in terms of an energy E that has to be minimized over the whole image I in order to retrieve a labeling C of the entire image corresponding to a minimum of E. As commonly adopted in the literature, E consists of two terms, one related to data fidelity and one to prior spatial knowledge, setting constraints on class transitions between the pairs of neighboring pixels $\mathcal{N}$. Several options were considered for the different energy terms. We have

$E(C) = \sum_{x \in I} E_{\mathrm{data}}(c_x) + \lambda \sum_{(x, y) \in \mathcal{N}} E_{\mathrm{pairwise}}(c_x, c_y),$

where $c_x$ denotes the label of pixel x and $\lambda$ balances the fit-to-data and regularization terms.
The final label map corresponds to the configuration C which minimizes E over I.
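The energy of a candidate labeling can be computed explicitly; the sketch below uses a 4-neighborhood and a `1 - P` data cost (one assumed option among those discussed below), with illustrative names:

```python
# Sketch of the global energy: a fit-to-data term derived from the merged
# class probabilities plus a pairwise penalty on label transitions between
# 4-neighbors, balanced by lam. The 1 - P data cost is one assumed option.

def energy(labels, probs, lam=1.0):
    h, w = len(labels), len(labels[0])
    data = sum(1.0 - probs[i][j][labels[i][j]]      # unlikely label => high cost
               for i in range(h) for j in range(w))
    pair = sum(1 for i in range(h) for j in range(w)
               for di, dj in ((1, 0), (0, 1))       # each edge counted once
               if i + di < h and j + dj < w
               and labels[i][j] != labels[i + di][j + dj])
    return data + lam * pair

# A noisy labeling pays for its isolated pixel through the pairwise term:
probs = [[[0.9, 0.1], [0.9, 0.1]], [[0.9, 0.1], [0.2, 0.8]]]
noisy = [[0, 0], [0, 1]]
smooth = [[0, 0], [0, 0]]
```

With `lam=1.0` the smooth labeling has the lower energy despite a worse data fit at one pixel; with `lam=0.0` the order reverses, which is the trade-off $\lambda$ controls.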
The data term is a fit-to-data attachment term. It relies on the merged class membership probabilities $P(c \mid x)$ obtained at the decision fusion step.
Earlier experiments [20] verified that Option 2 tends to smooth the classification map more than Option 1. Thus, the data term will be selected among these options depending on the targeted application and on the input data. Option 1 will be selected to keep small regions as long as they are relevant according to class probabilities, while Option 2 will be used to obtain smoother maps with wider flat areas [20].
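Since the exact Option 1 / Option 2 formulas are not reproduced in this excerpt, the sketch below contrasts two standard fit-to-data costs (a bounded linear one and an unbounded negative-log one) to illustrate how such a choice changes the smoothing behavior; mapping them to the chapter's options is an assumption:

```python
import math

# Two commonly used fit-to-data costs for assigning class c with merged
# probability p. Linking them to the chapter's Option 1 / Option 2 is an
# assumption: the exact formulas are not reproduced in this excerpt.

def data_linear(p):
    """Linear cost, bounded in [0, 1]."""
    return 1.0 - p

def data_neglog(p, eps=1e-6):
    """Negative log-likelihood cost, unbounded for unlikely labels."""
    return -math.log(max(p, eps))

cheap = data_linear(0.05)    # bounded even for a very unlikely label
steep = data_neglog(0.05)    # grows without bound as p -> 0
```

A bounded cost lets the pairwise term overrule even a confident pixel fairly easily (smoother maps), whereas the unbounded cost lets small but strongly supported regions resist regularization.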
The regularization term $E_{\mathrm{pairwise}}$ penalizes class transitions between neighboring pixels. It is contrast-sensitive: the penalty is modulated by the local contrast of the original images, so that the resulting label borders are encouraged to follow real-world contours.
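A contrast-sensitive pairwise penalty can be sketched as follows; the exponential decay form and the `gamma` parameter are illustrative assumptions, not the chapter's exact formulation:

```python
import math

# Sketch of a contrast-sensitive pairwise penalty (the exponential form and
# gamma are assumptions): a label change between neighbors costs less across
# a strong image gradient, so classification borders can snap to contours.

def pairwise(label_p, label_q, intensity_p, intensity_q, gamma=0.05):
    """Zero cost for equal labels; otherwise decay with local contrast."""
    if label_p == label_q:
        return 0.0
    return math.exp(-gamma * abs(intensity_p - intensity_q))

flat_cost = pairwise(0, 1, 100, 102)   # label change inside a flat area
edge_cost = pairwise(0, 1, 100, 200)   # label change on an image edge
```

Because a transition on a real image edge is cheap while one inside a homogeneous area is expensive, the minimization pushes class borders onto the contours of the input images.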
Once the energy E has been defined, it is minimized in order to obtain a labeling configuration C (i.e., a classification map) of the entire image. This model can be expressed as a graphical model and solved as a min-cut problem [73,75].
The graph-cut algorithm employed here is the quadratic pseudo-Boolean optimization (QPBO)1 [75,76]. QPBO is a classical graph-cut method that builds a graph where each pixel is a node; the minimization is computed by finding the minimal cut. Contrary to several standard graph-cut methods, for which the pairwise term must be submodular, QPBO can also cope with nonsubmodular energies (possibly leaving some nodes unlabeled).
QPBO performs binary classification. Extension to the multiclass problem is performed using an α-expansion routine [73]. Each label α is visited in turn and a binary labeling is solved between that label and all others, thus flipping the labels of some pixels to α. These expansion steps are iterated until convergence and at the end the algorithm returns a labeling C of the entire image which corresponds to a minimum of the energy E.
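The α-expansion control flow described above can be sketched as follows; note that the binary expansion move is solved here by a naive per-pixel acceptance test rather than a true min-cut/QPBO solver, so only the iteration structure (visiting each label α in turn until the labeling stabilizes) is faithful:

```python
# Sketch of the alpha-expansion control flow. The binary expansion move is
# solved here by a naive per-pixel acceptance test instead of a true min-cut
# (QPBO); only the iteration structure is faithful to the text.

def alpha_expansion(labels, probs, lam=1.0, sweeps=3):
    h, w = len(labels), len(labels[0])
    n_classes = len(probs[0][0])

    def local_cost(i, j, c):
        """Unary (1 - P) cost plus pairwise disagreement with 4-neighbors."""
        cost = 1.0 - probs[i][j][c]
        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            if 0 <= i + di < h and 0 <= j + dj < w:
                cost += lam * (c != labels[i + di][j + dj])
        return cost

    for _ in range(sweeps):
        for alpha in range(n_classes):     # visit each label alpha in turn
            for i in range(h):
                for j in range(w):         # flip pixel to alpha if it lowers E
                    if local_cost(i, j, alpha) < local_cost(i, j, labels[i][j]):
                        labels[i][j] = alpha
    return labels

# An isolated, weakly supported label is absorbed by its neighborhood:
probs = [[[0.9, 0.1]] * 3 for _ in range(3)]
probs[1][1] = [0.4, 0.6]
result = alpha_expansion([[0, 0, 0], [0, 1, 0], [0, 0, 0]], probs)
```

A production implementation would replace the inner pixel loop with one global binary graph cut per α, which is what guarantees a large move toward the optimum at each expansion step.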
The regularization part of the energy E (Eq. (11.21)) is controlled by up to four parameters, depending on the formulation retained for the pairwise term.
A simple Potts model can be obtained using the following parameterization: $E_{\mathrm{pairwise}}(c_x, c_y) = \mathbb{1}_{[c_x \neq c_y]}$, i.e., a constant penalty for any pair of distinct neighboring labels, independent of the image contrast.
A greedy strategy for parameter optimization was presented in [66]: parameter values are selected one after another, the others being kept fixed.
Such a strategy can be performed by quantitative cross-validation when sufficient reference data are available. Otherwise, it can be performed empirically, by qualitative (visual) evaluation of the results: the set of parameters yielding the smoothest possible result while still following the real object contours can thus be identified. This solution is relevant when regularization also targets improving the visual quality and the interpretability of classification results in operational contexts.
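The quantitative variant of this selection can be sketched as a search over candidate regularization weights scored against held-out reference labels; the scoring function and names below are illustrative, and the greedy per-parameter sweep of [66] would repeat this for each parameter in turn:

```python
# Sketch of selecting the regularization weight against held-out reference
# labels (illustrative scoring; a greedy procedure would tune each of the
# regularization parameters in turn this way).

def select_lambda(candidates, regularize, reference):
    """Return the candidate whose regularized map best matches the reference."""
    def accuracy(pred):
        hits = sum(p == r for row_p, row_r in zip(pred, reference)
                   for p, r in zip(row_p, row_r))
        return hits / sum(len(row) for row in reference)
    return max(candidates, key=lambda lam: accuracy(regularize(lam)))

# Toy check: pretend stronger smoothing moves a noisy map toward reference.
reference = [[0, 0], [0, 0]]
maps = {0.5: [[0, 1], [1, 0]], 1.0: [[0, 0], [1, 0]], 2.0: [[0, 0], [0, 0]]}
best = select_lambda([0.5, 1.0, 2.0], maps.get, reference)
```

Passing `maps.get` as the `regularize` callable simply simulates rerunning the regularization with each candidate weight.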
In practice, a set of parameters defined for a classification problem and a decision fusion rule is stable enough to be used in other, similar situations.
This first use case concerns the joint use of hyperspectral and very high resolution (VHR) multispectral imagery for fine urban land cover classification. Indeed, several applications require fine-grained knowledge of urban land cover, and especially urban material maps [77,78]. As no geodatabase contains such information, remote-sensing techniques are required.
Mapping urban environments requires VHR optical images. Indeed, such a spatial resolution is necessary to individualize and precisely delineate urban objects and to consider sharper geometrical details (e.g., [79,80]). However, VHR sensors generally have a poor spectral configuration (usually four bands, blue–green–red–near infrared), limiting their ability to discriminate fine classes [81–84] compared to superspectral or hyperspectral (HS) sensors. Unfortunately, the latter generally exhibit a lower spatial resolution. To overcome the weaknesses of both sensors, HS and VHR multispectral (MS) images can be jointly used to benefit from their complementary characteristics and subsequently efficiently separate the classes of interest. Thus, the fusion of such sensors should enhance the classification performance at the highest spatial resolution.
It may be recalled here that early fusion (at the observation level), i.e., image sharpening [14], could be applied within this context. However, late fusion is more generic and remains valid even for images not acquired simultaneously and processed by specific land cover labeling approaches.
As mentioned earlier, the method is based on three main steps:
Experiments were performed over three datasets captured over the cities of Pavia (Italy) and Toulouse (France; see Fig. 11.2). For all datasets, an SVM classifier was trained using 50 samples per class extracted from the images.
Concerning the city of Pavia (Italy), two datasets called “Pavia University” and “Pavia Center” were used. They are free datasets, widely used by the hyperspectral community and available online.2 Initially captured by a ROSIS hyperspectral sensor, these datasets have, respectively, 103 and 102 spectral bands from 430 to 860 nm. Pavia University is a 335 × 605 pixel image, Pavia Center is a 715 × 1096 pixel image, and both have a GSD of 1.3 m. Both scenes are composed of nine land cover classes (Fig. 11.2): Asphalt, Meadows, Gravel, Trees, Painted Metal Sheets, Bare Soil, Bitumen, Self-Blocking Bricks, and Shadows for Pavia University; and Water, Trees, Meadows, Self-Blocking Bricks, Bare Soil, Asphalt, Bitumen roofing, Tiles roofing, and Shadows for Pavia Center. MS images were generated for a Pleiades satellite spectral configuration (limited to three bands, red–green–blue), with a GSD of 1.3 m, while HS images were resampled at a lower spatial resolution of 7.8 m over the full original spectral range (i.e., 103 and 102 bands), so that their pan-sharpening ratio would be the same as for the Toulouse dataset.
The third dataset is called “Toulouse Center” (France). It was captured over the city of Toulouse in 2012 by HySpex sensors [87]. It has 405 spectral bands ranging from 400 to 2500 nm, and an initial GSD of 1.6 m. Its associated land cover is composed of 15 classes (Fig. 11.2): Slate roofing, Asphalt, Cement, Water, Pavements, Bare soil, Gravel roofing, Metal roofing 1, Metal roofing 2, Tiles roofing, Grass, Trees, Railway tracks, Rubber roofing, and Shadows. MS and HS images were created for the fusion purpose: an MS image using the Pleiades satellite spectral configuration (four bands, red (R)–green (G)–blue (B)–near infrared (NIR)), with a GSD of 1.6 m, and a HS image which is a resampled version of the original image at a spatial resolution of 8 m [88].
The MS image is characterized by a high spatial resolution and few bands, while the HS one has a low spatial resolution and hundreds of bands. As expected, the SVM classifier applied to these images led to complementary results.
The corresponding classification accuracies are listed in Table 11.2: better results are retrieved using the HS image.
Ten different decision fusion rules were first tested and compared over the three datasets. The quantitative results provided in Table 11.1 lead us to consider the compromise, Bayesian product, margin-Max, and Dempster–Shafer rules as the most efficient. The comparison must also take into account a visual inspection of the results, as ground truth data remain very limited on these datasets. For Pavia University, the four best accuracies were reached for the Min, compromise, Bayesian product, and Dempster–Shafer rules. In practice, the Min/compromise rules give the most satisfactory rendering, especially regarding the Self-Blocking Bricks class, which is a conflicting class (see Fig. 11.3, magenta class). The two other rules seem to overestimate this class and give more weight to the HS classification map in the fusion process, which explains their higher accuracy (Table 11.1). The Min rule acts in a cautious way by retaining the lowest memberships, while the compromise rule acts depending on the degree of conflict between sources. The Bayesian product rule is a good and simple trade-off if the initial classification maps are not highly conflicting; otherwise, the result will be degraded by wrong information.
Table 11.1
Classification accuracies (in %) after the fusion procedure; 10 decision-level fusion rules are compared. (OA = Overall Accuracy; F-score = mean F-score.)
Rule | Pavia University | | | Pavia Center | | | Toulouse Center | | |
---|---|---|---|---|---|---|---|---|---|
 | OA | Kappa | F-score | OA | Kappa | F-score | OA | Kappa | F-score |
Max | 92.8 | 90.7 | 90.6 | 98.5 | 97.8 | 96.0 | 75.6 | 62.4 | 69.8 |
Min | 96.1 | 94.9 | 95.1 | 98.6 | 98.0 | 96.3 | 72.2 | 58.7 | 65.8 |
Compromise | 96.1 | 95.0 | 95.0 | 98.8 | 98.3 | 96.7 | 73.6 | 60.2 | 68.0 |
Prior1 | 94.7 | 93.1 | 93.4 | 98.2 | 97.5 | 95.3 | 71.3 | 57.7 | 65.5 |
Prior2 | 92.8 | 90.7 | 90.6 | 98.5 | 97.8 | 96.0 | 75.6 | 62.4 | 69.8 |
AD | 95.0 | 93.5 | 93.5 | 99.0 | 98.7 | 97.7 | 75.8 | 58.1 | 28.3 |
Sum Bayes | 95.0 | 93.5 | 93.2 | 98.7 | 98.1 | 96.5 | 75.7 | 62.7 | 70.5 |
Prod Bayes | 96.6 | 95.5 | 95.6 | 99.0 | 98.6 | 97.2 | 74.5 | 61.4 | 69.8 |
Margin-Max | 94.0 | 92.2 | 92.0 | 98.8 | 98.3 | 96.6 | 75.6 | 62.5 | 69.6 |
Dempster–Shafer V1 | 96.4 | 95.4 | 95.3 | 98.9 | 98.5 | 97.1 | 74.6 | 61.5 | 69.8 |
Concerning Pavia Center, all rules appear accurate (Fig. 11.4, e.g., with the Dempster–Shafer rule), with an overall accuracy higher than 98% (Table 11.1). When visually inspecting the results, all rules gave similarly good results except Prior 1, whose result is guided by the HS classification map rather than the MS one.
The Toulouse dataset is the largest one, with up to 15 classes, which explains the lower accuracies reached for it. The best results were given by the Max, Prior 2, Bayesian sum, and Dempster–Shafer rules. In practice, the Max, prior, and sum rules seem to overestimate certain classes, especially tile roofing and vegetation. The best qualitative results are given by the Min, compromise, and Dempster–Shafer rules. Despite a satisfactory accuracy, the AD rule exhibits many misclassifications regarding tile roofing (underestimation) and metal roofing 1 (overestimation), as well as an erroneous detection of gravel roofing. This is mainly due to the global accuracy measure included in the rule, which is calculated from limited ground truth data.
However, due to the very limited amount of reference data, the quantitative accuracies do not necessarily reflect the real potential of the fusion rules. The best rules, from both quantitative and qualitative points of view, are the compromise, Bayesian product, and Dempster–Shafer rules.
In this study, both the VHR-MS images and the lower-resolution HS ones were generated from original VHR HS images. Working on such synthetic datasets leads to somewhat optimistic results, but it is sufficient to assess the different fusion rules. Besides, the fusion method is flexible enough to integrate, for instance, a specific process dealing with shadows in a diachronic acquisition context.
Global regularization was applied to enhance the classification results and eliminate artifacts. Table 11.2 presents the optimization results for the best fusion rules per dataset. Quantitatively, the optimization only slightly improves the decision fusion classification (by 1–2%), but it offers a better visual rendering: artifacts are eliminated, class borders are better delineated, and scattered pixels are regularized (Figs. 11.3, 11.4 and 11.5). The optimized maps better model the real scene. The optimization effect is most visible over Pavia University and Toulouse Center. Concerning Pavia Center, the decision fusion already gives good results, so the optimized maps are only slightly improved (Table 11.2). Results obtained over the Pavia datasets are comparable to other studies (e.g., [17]). For Pavia University, the painted metal sheets are better recovered, with no noticeable mismatches with the surrounding road. The proposed method extracts some bitumen buildings that were difficult to differentiate from roads (upper right and lower right, Fig. 11.5), even if the gravel buildings could still be better refined. For Pavia Center, the global rendering is enhanced, with fewer classification artifacts.
Table 11.2
Classification accuracy of the HS and MS images separately, after decision fusion, and after global regularization. For each dataset, results are given for the fusion rule achieving the best final result after global regularization.
Metric | HS classification | MS classification | Decision fusion | After regularization |
---|---|---|---|---|
Pavia University (Min rule) | ||||
OA (%) | 94.7 | 68.8 | 96.1 | 97.0 |
Kappa (%) | 93.1 | 61.6 | 94.9 | 96.1 |
F-score (%) | 93.4 | 72.8 | 95.1 | 96.3 |
Pavia Center (Dempster–Shafer V1 rule) | ||||
OA (%) | 98.2 | 92.0 | 98.9 | 99.3 |
Kappa (%) | 97.5 | 89.0 | 98.5 | 99.0 |
F-score (%) | 95.3 | 83.5 | 97.1 | 98.0 |
Toulouse Center (Compromise rule) | ||||
OA (%) | 71.2 | 69.2 | 73.5 | 74.6 |
Kappa (%) | 57.6 | 53.8 | 60.2 | 61.5 |
F-score (%) | 65.4 | 55.9 | 68.0 | 70.9 |
Several decision fusion methods were tested and compared. Among the fuzzy rules, the Min and compromise rules are the most efficient. The Max rule often leads to misclassifications because it puts more confidence in the highest membership. The prioritized rules favor one source over the other; their reliability is not ensured, as noticed for Prior 1, which gives confidence to the less reliable source. The accuracy of the AD rule depends too strongly on the reliability of the ground truth: it gives encouraging results on the Pavia datasets, but its accuracy is not sufficient on the Toulouse dataset. The Bayesian sum and product rules can be interesting when the conflict between sources is low, since they give acceptable results over Pavia Center and Toulouse. The proposed margin-based rule performs well over Pavia Center and adequately over Toulouse, but is not sufficient over Pavia University. Finally, the Dempster–Shafer rule performs homogeneously over the three datasets, consistently leading to good results.
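For concreteness, several of the per-pixel rules compared above can be sketched as follows. This is a minimal illustration on two membership vectors; the `compromise` variant shown here (Min when the two sources agree on the top class, Max otherwise) is a simplified stand-in, not necessarily the exact rule of Sect. 11.2.1.

```python
import numpy as np

def fuse_pixel(p1, p2, rule="min"):
    """Merge two per-pixel class membership vectors (one per source).

    p1, p2: 1-D arrays of length n_classes (e.g., posterior probabilities).
    Returns the fused membership vector (not necessarily normalized).
    """
    if rule == "min":          # fuzzy Min: consensus, pessimistic
        return np.minimum(p1, p2)
    if rule == "max":          # fuzzy Max: optimistic, trusts peaks
        return np.maximum(p1, p2)
    if rule == "sum":          # Bayesian sum (average of the sources)
        return 0.5 * (p1 + p2)
    if rule == "product":      # Bayesian product
        return p1 * p2
    if rule == "compromise":   # simplified: Min if sources agree, Max otherwise
        agree = np.argmax(p1) == np.argmax(p2)
        return np.minimum(p1, p2) if agree else np.maximum(p1, p2)
    raise ValueError(rule)

# Example: two sources disagreeing on a 3-class pixel
p_hs = np.array([0.7, 0.2, 0.1])   # hyperspectral source
p_ms = np.array([0.3, 0.6, 0.1])   # multispectral source
label = int(np.argmax(fuse_pixel(p_hs, p_ms, "product")))
```

Applying the chosen rule at every pixel yields the merged membership maps used as the fit-to-data term of the subsequent regularization.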
Even if decision fusion increases the classification accuracy compared to the initial maps, the results remain affected by classification artifacts and unclear borders. The final map is guided by one of the initial maps or by both: it is, therefore, a better version of the initial maps. The optimization procedure gives encouraging results, with clear borders between the different classes and elimination of artifacts.
The method can also integrate other decision rules in a fully tunable way. The optimization model is simple and flexible, and could be modified according to the dataset and the spatial resolution of the data sources. Further work will investigate the explicit use of conflict measures from the fusion step within the regularization framework. At the moment, the selection of the optimization parameters is rather manual; some automation could be included, and other contrast measures could be tested to improve the accuracy.
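The regularization itself combines a fit-to-data term (the negative log of the merged memberships) with a contrast-sensitive Potts pairwise term. The toy implementation below uses Iterated Conditional Modes as an easily readable stand-in for the graph-based solvers usually applied to such energies; the parameter names (`lam`, `beta`) and the exponential contrast weight are illustrative choices, not the chapter's exact model.

```python
import numpy as np

def regularize_icm(prob, image, lam=1.0, beta=10.0, n_iter=5):
    """Contrast-sensitive Potts regularization of a label map, minimized
    with Iterated Conditional Modes (ICM).

    prob  : (H, W, C) merged class membership maps (fit-to-data term).
    image : (H, W, B) image guiding the contrast-sensitive term.
    """
    H, W, C = prob.shape
    unary = -np.log(np.clip(prob, 1e-10, 1.0))        # fit-to-data cost
    labels = unary.argmin(axis=2)                      # initial label map
    for _ in range(n_iter):
        for i in range(H):
            for j in range(W):
                cost = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        d = image[i, j] - image[ni, nj]
                        # Penalize label changes less across image contrasts
                        w = np.exp(-beta * float(d @ d))
                        cost += lam * w * (np.arange(C) != labels[ni, nj])
                labels[i, j] = int(cost.argmin())
    return labels
```

Increasing `lam` smooths the map more aggressively, while `beta` controls how strongly image contrasts (e.g., object borders) relax the smoothing.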
This second use case focuses on the detection of urban areas from SPOT 6/7 and Sentinel-2 satellite imagery. Mapping urban areas is important to monitor urban sprawl and soil imperviousness, and to predict their further evolution [89,90]. Remote sensing is highly relevant for such regular and continuous monitoring over time. Supervised classification approaches using satellite imagery have been extensively studied in order to automate the process of land cover (LC) classification [91–93,38], but they often rely on a single sensor.
Urban and peri-urban areas are complex and heterogeneous landscapes containing impervious areas, trees, grass, bare ground, and water [94,95]. “Artificialized areas” can be defined as irreversibly impervious areas, including buildings and roads, but also small enclosed pervious structures such as gardens, backyards, and green public spaces [96]. There is no unique clear definition of the urban area or footprint. It generally corresponds to a simplification of the artificialized area: road networks outside of built areas are then excluded.
This study aims at detecting such areas automatically from multisource remote-sensing data, following the real-world city boundary contour as closely as possible. Isolated built-up areas should also be retrieved.
The remote-sensing paradigm has drastically changed in recent years with the advent of new sensors exhibiting enhanced spectral, spatial, swath, or revisit characteristics, making it possible to acquire datasets at country scale in a limited time. SPOT 6/7 and Sentinel-2 are examples of these new sensors, and they are used in this use case. On one hand, they are freely available over the whole French territory thanks to the Théia initiative3 and GEOSUD Equipex.4 On the other hand, they exhibit complementary characteristics:
Recent years have witnessed the advent of deep learning methods, and especially Convolutional Neural Networks (CNNs) [97–99]. Such approaches have shown their superiority over standard classification processes. Thanks to their end-to-end design, CNNs directly learn (spectral and textural) features (convolution filters) optimized for each classification problem, as well as the best way to use them (i.e., the classification model). Besides, these implicit features directly take the context into account and thus perform a multiscale analysis of the image. In return, CNNs require a huge amount of training data.
New studies on urban footprint detection have been initiated by the advent of Sentinel data. Sentinel-2 optical images exhibit excellent spectral and temporal characteristics and are well tailored for land cover production. In [91], Sentinel-2 time series are directly classified by a Random Forest for the yearly extraction of 20-class land cover maps. The method presented in [38] can also be applied to such time series, classifying each date independently before merging the results by a Dempster–Shafer process. Since Sentinel-2 offers a 10 m GSD for some bands, several studies have tried to use both its radiometric and texture information to detect urban areas [92,100]. Here, however, it was decided to focus on the specificities of Sentinel-2 (enhanced spectral characteristics and time series) and not to exploit its texture information, which is poorer than that of SPOT 6/7.
To summarize, deep learning approaches are optimal to analyze SPOT 6/7 images: their spatial resolution makes it possible to detect urban elements (e.g., buildings) and to use them to derive urban areas. For Sentinel-2 data, it is more interesting to focus on their specificities (enhanced spectral characteristics and time series). The fusion of these sources thus combines their advantages and reduces both spatial and semantic uncertainties. The late fusion scheme proposed in Sect. 11.2 is adopted, considering again that the original data have been classified beforehand and independently by specific methods. Besides, it enables the integration of existing land cover maps, such as Théia's maps based on [91].
The proposed workflow (Fig. 11.6) consists of three steps.
Both sources are classified individually. The Sentinel-2 time series is labeled using a random forest (RF) classifier trained with 50,000 samples per class. RF is used to keep a framework similar to that of [91], whose LC maps are intended to be available at national scale. The SPOT 6/7 image is classified using a deep Convolutional Neural Network (CNN) [98], because of its ability to efficiently exploit context and texture information from VHR images. The CNN was trained with 10,000 samples per class (of which 10% were kept for cross-validation). Both classifiers produce membership probabilities for the five classes.
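The Sentinel-2 branch can be sketched with scikit-learn as below. The feature layout (the 10 bands stacked over 6 dates, 60 values per pixel) and the random data are purely illustrative; what matters is that `predict_proba` yields the per-class membership maps consumed by the fusion step.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-pixel features: 10 Sentinel-2 bands stacked over
# 6 dates (60 values per pixel). Real features would come from the imagery.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 60))
y_train = np.tile(np.arange(5), 100)          # 5 land cover classes

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Per-class membership probabilities for 4 pixels, as fed to the fusion.
proba = rf.predict_proba(rng.normal(size=(4, 60)))
```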
Both fusions (i.e., for the 5-class nomenclature and the urban/non-urban one) are performed according to Sect. 11.2, involving a per-pixel decision fusion followed by a spatial regularization.
The “urban/non-urban” fusion requires one to derive binary class probabilities from the results of the previous steps. Buildings from the 5-class fusion result are considered as seeds of urban areas and are used to define a prior probability of being in an urban area (see Fig. 11.7): a linearly decreasing probability is assigned around all buildings, starting from 1 at a building and reaching 0 at a distance of 100 m from the nearest building.
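This linearly decreasing prior can be sketched with a Euclidean distance transform; the helper name and the `pixel_size` parameter are illustrative, while the 100 m decay follows the description above.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def urban_prior(building_mask, pixel_size=1.5, max_dist=100.0):
    """Prior probability of being in an urban area from a building mask.

    The prior is 1 on buildings and decreases linearly to 0 at
    `max_dist` meters from the nearest building.
    building_mask: (H, W) boolean array (True on detected buildings).
    """
    # Distance (in meters) from each pixel to the nearest building pixel.
    dist = distance_transform_edt(~building_mask) * pixel_size
    return np.clip(1.0 - dist / max_dist, 0.0, 1.0)
```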
Class posterior probabilities from the Sentinel-2 image RF 5-class classification are converted to a binary classification with an urban area class (u), and a non-urban area class (¬u):
Then, the prior probability map derived from the building detection is merged with the binary class probabilities from the Sentinel-2 RF classifier.
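For the Min fuzzy rule, which is the one eventually retained for this binary step, the merge reduces to a per-pixel minimum over both classes; the helper below is a minimal sketch with illustrative names.

```python
import numpy as np

def merge_urban(prior_u, p_u_s2):
    """Min-rule fusion of the building-derived urban prior with the
    Sentinel-2 binary urban probability (both (H, W) arrays in [0, 1]).
    Returns fused memberships for the urban and non-urban classes."""
    fused_u = np.minimum(prior_u, p_u_s2)
    fused_nu = np.minimum(1.0 - prior_u, 1.0 - p_u_s2)
    return fused_u, fused_nu
```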
Sentinel-2 offers both a spectral configuration richer than those of usual multispectral sensors and the ability to acquire time series (5-day revisit). In the following experiments, only the 10 spectral bands with a 10 or 20 m GSD are used, all upsampled to 10 m GSD. Six dates (among them August 15th 2016, January 25th 2017, March 16th 2017, April 12th 2017 and May 25th 2017) are kept, retained both for their low cloud cover and to cover different seasons/appearances of the land cover classes.
SPOT 6/7 includes four spectral bands (red–green–blue–near infrared) pan-sharpened to 1.5 m. A single date (April 16th) is used.
A ground truth of five classes is generated (Fig. 11.8) from available national reference geodatabases and is used for both training and evaluation (the number of pixels used as training samples is very small compared to the total number of samples). Buildings, roads, and water areas are extracted from IGN's BD Topo®5 topographic database, forests from IGN's BD Forêt®6 database, and crops from the Référentiel Parcellaire Graphique7 of the French Ministry of Agriculture.
Experiments are performed over a test area spanning 648 km2 in Finistère, North Western France. This study area contains urban, peri-urban, rural, and natural landscapes.
For the sake of readability, the results are shown over a restricted area of 0.64 km2 (Figs. 11.9, 11.10 and 11.11), where the original classifications exhibit several errors: the impact of data fusion can thus be clearly demonstrated. Quantitative evaluation is performed over five tiles of 3000 × 3000 m (45 km2 in total), distributed over the entire 648 km2 study zone. Each classification is compared to the class labels of the ground truth (Fig. 11.8). Evaluation measures for the individual classifications, fusion, and regularization, all using five classes, are given in Table 11.3.
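The reported scores can be computed per map as sketched below; the helper assumes macro-averaged per-class F-scores for the mean F-score, and the `ignore` label for pixels without reference data is an illustrative convention.

```python
import numpy as np
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

def evaluate(pred, truth, ignore=-1):
    """Accuracy scores as reported in the tables: overall accuracy,
    Cohen's kappa, and the mean (macro) per-class F-score. Pixels
    labeled `ignore` in the ground truth are excluded."""
    mask = truth != ignore
    p, t = pred[mask], truth[mask]
    return {
        "OA": accuracy_score(t, p),
        "Kappa": cohen_kappa_score(t, p),
        "Fm": f1_score(t, p, average="macro"),
    }
```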
Table 11.3
Accuracy scores (in %) for the first step. OA = overall accuracy, AA = average accuracy, Fm = Mean F-score, Fb = F-score for buildings. Fusion rules are described in Sect. 11.2.1.
Method | Kappa | OA | AA | Fm | Fb |
---|---|---|---|---|---|
Original classifications | |||||
Sentinel-2 | 72.0 | 83.7 | 81.5 | 64.6 | 52.2 |
SPOT 6 | 73.3 | 85.2 | 70.8 | 63.4 | 62.5 |
Per pixel fusion (before regularization) | |||||
Fuzzy Min | 79.1 | 88.4 | 84.7 | 76.7 | 73.8 |
Fuzzy Max | 77.3 | 87.4 | 84.2 | 73.2 | 70.5 |
Compromise | 78.8 | 88.2 | 84.5 | 76.2 | 72.5 |
Prior 1 | 73.3 | 85.2 | 70.8 | 63.4 | 62.5 |
Prior 2 | 73.3 | 85.2 | 70.8 | 63.4 | 62.5 |
Bayesian Sum | 78.3 | 87.9 | 84.8 | 74.8 | 71.9 |
Bayesian Product | 79.1 | 88.5 | 85.0 | 76.7 | 73.8 |
Dempster–Shafer V1 | 79.1 | 88.4 | 85.1 | 76.5 | 73.7 |
Dempster–Shafer V2 | 79.0 | 88.4 | 85.1 | 76.4 | 73.7 |
Margin Maximum | 77.7 | 87.6 | 84.5 | 73.4 | 70.2 |
Margin Bayesian Product | 78.4 | 88.0 | 84.7 | 75.1 | 72.0 |
Margin Bayesian Sum | 78.0 | 87.7 | 84.6 | 74.0 | 71.0 |
RF | 81.8 | 90.0 | 90.1 | 81.2 | 81.6 |
SVM linear | 80.5 | 89.3 | 88.6 | 77.9 | 80.4 |
SVM rbf | 81.0 | 89.6 | 89.1 | 79.1 | 83.4 |
Fusion and regularization | |||||
Fuzzy Min | 75.1 | 85.8 | 82.3 | 73.9 | 73.8 |
SVM rbf | 81.4 | 89.8 | 89.3 | 79.8 | 83.9 |
Initial classifications. The original classifications confirm the initial observation that the SPOT 6/7 CNN result tends to preserve small objects, although some confusions between (bare soil) crops and built-up areas occur. The Sentinel-2 RF classification behaves better overall, but it mixes buildings and roads due to its coarse spatial resolution. The results of the individual classifications on the SPOT 6/7 and Sentinel-2 images are shown in Fig. 11.9. This area was selected to illustrate the problems of both classifications and the improvements obtained with fusion and regularization (better results are observed in all other areas). Confusion between water and built areas can be noticed: the water area database used to generate training samples included ponds at the bottom of quarries, which appear white and very similar to built-up areas.
Per pixel fusion. This first fusion is performed at the SPOT 6/7 resolution (1.5 m), as it aims at the finest possible detection of building objects.
Among the classic fusion rules proposed by [69], the Min fuzzy rule produces the best results, following the objects' borders most precisely while producing the least class confusions (Fig. 11.10 and Table 11.3). Considering Fig. 11.10, all rules managed to eliminate the wrongly classified building patch (top-left), preferring the Sentinel-2 classification over the SPOT 6/7 one. The Min fusion rule follows the field contours a bit less smoothly than the other ones. The industrial area (at the center of the displayed area) is still confused with water in both results (such a confusion can be explained by the presence of very white water area training samples).
The RF/SVM supervised fusions initially tend to produce patches of buildings rather than separate buildings, due to missing training data (and thus constraints) between buildings and roads (Fig. 11.8): the ground truth contains gaps between buildings, and with no training data available around buildings, the classifiers tend to aggregate individual buildings. Adding a sixth, artificial “around buildings” buffer class helps to refine the contours and to obtain a higher level of detail than the Min or Bayes rules alone. It preserves more details of individual buildings, but can cause some confusion between buildings and this buffer class in the center of very wide buildings in industrial areas (see Fig. 11.10). It can also erase small patches of buildings. However, this remains exceptional and is not a real problem for the final goal of urban area detection. The same observations are made on all areas, even those on which the supervised fusion model is not trained. Thus, the result of the supervised fusion using an RBF SVM classifier with the buffer class is used in the subsequent steps.
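The buffer class can be built morphologically, as sketched below; the ring width (`radius`, in pixels) is an illustrative parameter, since the chapter does not specify the buffer size.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def add_buffer_class(labels, building_id, buffer_id, radius=5):
    """Add an artificial "around buildings" class to training labels:
    a ring of `radius` pixels around every building is relabeled so the
    supervised fusion classifier learns to separate adjacent buildings.
    `buffer_id` is assumed to be an unused class index."""
    buildings = labels == building_id
    ring = binary_dilation(buildings, iterations=radius) & ~buildings
    out = labels.copy()
    out[ring] = buffer_id
    return out
```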
Regularization. The parameters are as follows:
Raw urban area maps directly derived from the binary class probabilities are shown in Fig. 11.12. Again, they are first merged using a per-pixel fusion rule. The supervised learning-based fusion approaches cannot be used here, since no training data for urban/non-urban areas is available. Several rules yield visually similar results; the Min fuzzy fusion rule is eventually chosen.
Global regularization is then performed with the same parameters.
As no true reference data for urban areas is available, strict evaluation is not possible. However, the detected artificialized area can be compared to binary ground truth maps derived from the following other related databases:
Strict quantitative evaluation is not possible, but such data can provide some hints: accuracies are provided in Table 11.4. Besides, a visual comparison with the different sources is shown in Fig. 11.13.
Table 11.4
Accuracy measures: F-score for the urban class (F-scoreu), Kappa, overall accuracy (OA), and intersection over union for the urban class (IoUu).
Classification | Ground Truth | F-Scoreu [%] | Kappa [%] | OA [%] | IoUu [%] |
---|---|---|---|---|---|
Binary dilated buildings | BD Topo® | 86.7 | 83.2 | 94.5 | 76.5 |
 | OSO | 56.8 | 50.1 | 86.8 | 39.7 |
 | OSM | 58.3 | 51.9 | 87.3 | 41.2 |
Binary Sentinel-2 | BD Topo® | 65.2 | 56.4 | 85.9 | 48.3 |
 | OSO | 63.9 | 58.3 | 89.3 | 46.9 |
 | OSM | 52.6 | 45.5 | 86.0 | 35.7 |
Fusion Min | BD Topo® | 79.8 | 75.2 | 92.4 | 66.3 |
 | OSO | 66.9 | 62.2 | 91.2 | 50.3 |
 | OSM | 62.9 | 57.6 | 90.2 | 45.9 |
Regularization | BD Topo® | 79.7 | 75.1 | 92.5 | 66.2 |
 | OSO | 67.4 | 62.8 | 91.5 | 50.9 |
 | OSM | 64.4 | 59.4 | 90.7 | 47.4 |
The following aspects can be underlined:
The dilated BD Topo® ground truth generally yields the highest agreement with the classifications in terms of accuracy measures. The fusion and regularization steps improve the accuracy measures over the individual input classifications, with the exception of the binary dilated-buildings input, whose very high agreement with the BD Topo® ground truth can be explained by the fact that both were produced by dilation.
A framework was proposed to detect urban areas. Sentinel-2 and SPOT 6/7 data were classified individually into five topographic classes. Decision-level fusion and regularization then yield a result preserving high geometric detail while reducing misclassifications. The results in this study were presented for one dataset, but the processing chain was also applied to another region (the Gironde department in South Western France) exhibiting a different (climatic and topographic) landscape. It led to similar conclusions, showing a high generalization potential despite varying behaviors of the initial classifiers.
Traditional fusion methods enable artifact removal but can keep confusion between buildings and roads. In contrast, supervised fusion methods enable an enhanced detection of buildings at the price of a new artificial “building buffer” class, which introduces some additional semantic uncertainty. However, the new class confusions mostly occur between the vegetation and forest classes, which is not a problem here. Buildings could thus be extracted with a higher amount of semantic detail.
Second, the urban area can be approximated by fusing and regularizing the urban/non-urban class membership probabilities of the Sentinel-2 classification together with an urban prior measure derived from the previously detected buildings. A simple function was used to derive the probability map of urbanized areas; improvements could be made with a more advanced urban membership prior, decreasing faster around uncertain buildings.
Furthermore, the promising results of supervised fusion would justify the use of CNNs for fusion. Such a strategy also looks promising for the identification of artificialized areas, but it would have to face missing ground truth data and heterogeneous, user- and application-dependent definitions of such areas.
A fusion framework was proposed to merge very high spatial resolution, monodate, multispectral images with time series of images exhibiting an enhanced spectral configuration but a lower spatial resolution. It mostly relies on existing state-of-the-art methods, combined in order to cope with both semantic and spatial uncertainties, and is flexible enough to integrate other fusion rules. It is a late fusion strategy, permitting one to initially process each input data source independently through specific methods, and even to use already computed results from existing operational land cover classification services. Besides, in this case, it can be used without any ground truth information, contrary to intermediate-level fusion methods.
The proposed framework was applied to two different use cases. For each of them, classification results were improved. Several fusion rules were tested. Good results were reached for several of them, but the best results were obtained by the “Minimum” fuzzy rule, by Dempster–Shafer rules, and, when sufficient training data is available, by supervised learning-based methods.
At present, the proposed fusion framework has been applied to only two sources, but it could easily be extended to more sources. Besides, for the moment, it has been tested only for optical images exhibiting different characteristics, but it is generic enough to be applied to input data of different modalities. It has also been used only to merge class membership probability maps from classification results, but it would be interesting to apply it to other kinds of results. Especially for the lower spatial resolution source, it would be relevant to use class abundances obtained from an unmixing process (applied to the hyperspectral case or time series of data).