7.3 Object Detection and Recognition in Satellite Imagery

Remote sensing is a technique that observes the earth and its surface from platforms far from the earth, such as satellites or space shuttles. Man-made satellites are one kind of platform; they take digital images of the earth's surface with cameras, video cameras, multispectral scanners, hyperspectral sensors, radar and so on. Multispectral scanners simultaneously record the reflected spectral signals of the same scene and form several images in different spectral bands (400–1000 nm), from visible light to invisible (near-infrared) light; the technique has been extended to hyperspectral imagery with higher spectral resolution for a scene. Since optical spectral imagery contains rich geological information in the spectral and spatial domains, it can be used to study earth materials and to detect hidden objects. However, optical spectral imagery is often disturbed by cloud cover or by changes of sunlight, so it cannot work in cloudy or rainy conditions or on a dark night. Electromagnetic wave signals in a lower frequency range can pass through the cloud layer independently of changes in visible light intensity (they work in almost all weather conditions, day or night), so satellites with synthetic aperture radar (SAR for short), which transmit and receive electromagnetic wave signals, have been developed. SAR can directly generate two-dimensional imagery with orthogonal dimensions (range and azimuth), but it is often disturbed by speckle noise due to the interference of scattering signals. The two kinds of satellite imagery (hyperspectral and SAR images) are mutually complementary for monitoring a wide extent of the earth and its surface across many spectral bands. With the development of satellite remote sensing, a great deal of data is sampled every second as a satellite moves along its orbit. The huge amount of image information renders classical image analysis techniques impractical, so exploring automatic object search methods on raw imagery is becoming increasingly important. Visual attention can rapidly select the interesting locations in images, which makes it a helpful tool for sifting through this information and looking for hidden targets in satellite imagery.

However, the appearance of satellite imagery is totally different from that of natural images. Each pixel of a remote sensing image represents an area of the earth's surface (e.g., one pixel covers a 30 m × 30 m area for mid-resolution multispectral imagery). It is known that oceans cover more than 70% of the earth's surface. In ocean regions a target (a commercial vessel or a warship) may be detected easily, but on land the complicated road networks, farmland, city buildings, rivers, mountains, vegetation and so on result in a complex background for satellite imagery, making it difficult to find a hidden target. Figure 7.3(a) and (b) show two images from Google Earth, which combines aerial photographs with satellite imagery collected by Landsat-7 (30 m × 30 m resolution) for the land regions. Thus, the analysis methods for satellite imagery should differ from those for natural images or images of man-made objects. This section mainly introduces the applications of visual attention to object search and recognition in satellite imagery.

Figure 7.3 Satellite image from Google Earth


First, we present ship target detection in ocean regions for multispectral imagery and for SAR imagery using visual attention; then we introduce airport detection in land regions, with the help of saliency maps, prior knowledge and decision trees. Finally, an object recognition strategy for satellite imagery is presented, which employs the saliency map together with other features and combines computer vision methods to identify arbitrary object patches in satellite imagery.

7.3.1 Ship Detection based on Visual Attention

Ship detection in satellite imagery (multispectral imagery and SAR imagery) is becoming more and more important because of its potential applications in searching for missing fishing vessels, region estimation of oil pollution, vessel traffic services, military target detection and so on. However, satellite imagery is often corrupted by noise and illumination changes, which cause detection errors in most conventional methods.

Conventional ship target detection in multispectral or SAR imagery mainly includes the constant false alarm rate (CFAR) algorithm and its improvements [52–56], the maximum likelihood ratio test [57], the support vector machine (SVM) [58], the cross-correlation based method [59] and the polarization cross-entropy approach [60]. Among them, the CFAR algorithm is the most commonly used in both multispectral and SAR imagery; it utilizes statistical information about the target and background to find a detection threshold that keeps the probability of false alarm invariant as the light intensity changes or strong noise appears. However, estimating this statistical information has high computational complexity. Attention computational models, especially the frequency domain models, can locate salient regions very quickly against a relatively simple background, which is suitable for ship detection in ocean regions. The following two subsections introduce ship target detection based on frequency domain bottom-up attention models for multispectral imagery and SAR imagery, respectively.
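For orientation, a minimal cell-averaging CFAR sketch in Python/NumPy is given below. It only illustrates the principle, not the specific algorithms of [52–56]: the window sizes and the scaling factor alpha are illustrative assumptions (in a real CFAR detector, alpha is derived from the desired false alarm probability and an assumed clutter distribution).

```python
import numpy as np
from scipy.ndimage import uniform_filter

def ca_cfar(img, bg=21, guard=9, alpha=5.0):
    """Cell-averaging CFAR sketch: a pixel is declared a target when it
    exceeds alpha times the mean of a background ring around it."""
    x = img.astype(np.float64)
    # Window means over the full background window and the inner guard window.
    big = uniform_filter(x, size=bg)
    small = uniform_filter(x, size=guard)
    # Subtract the guard-cell contribution to get the background ring mean.
    n_big, n_small = bg * bg, guard * guard
    ring_mean = (big * n_big - small * n_small) / (n_big - n_small)
    # alpha would normally follow from the desired false alarm rate and an
    # assumed clutter distribution; here it is a free parameter.
    return x > alpha * ring_mean
```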

7.3.1.1 Ship Detection for Multispectral Imagery

Multispectral remote sensing imagery includes several spectral images across different spectral bands (6–8 bands); that is, there are 6–8 images of a given scene. Different ships display different intensities in different spectral bands: a ship may pop out in some band images but disappear in others, while for another ship the reverse may hold, so considering only a single spectral band cannot detect all ships, owing to the missing information. As we saw in Chapter 4, a quaternion can represent four values (intensity, two colour-opponent components and motion) at one image pixel, and the phase spectrum of the quaternion Fourier transform (PQFT) can be used to compute the saliency map of the image. In multispectral remote sensing imagery there are several values in different bands at each pixel, so, extending the idea of the PQFT, the several values at one pixel can be represented as a biquaternion [10, 61]. Using the phase spectrum of the biquaternion Fourier transform (PBFT), we can likewise obtain the saliency map of the multispectral remote sensing imagery. According to the saliency map, an adaptive detection threshold is set to detect the locations of ships in ocean regions. The steps of ship detection [10] are as follows.

1. Ocean region detection
Remote sensing imagery covers a wide area including the land and ocean regions of the earth. In general the size of remote sensing images is very large (more than 20 000 pixels in both length and width), and ship targets are located in ocean regions, so finding the ocean regions is a prerequisite. Fortunately, the ocean absorbs optical spectral energy in all bands, so the intensity of ocean regions is lower than that of land regions. A simple threshold on the histogram of the multispectral imagery can be found to segment the ocean regions. After segmentation, multispectral imagery of small spatial size covering just the ocean region is built.
2. Saliency map computation based on PBFT
The multispectral remote sensing imagery data from Landsat 7 ETM+ consists of eight bands. Each pixel of the imagery can be represented as a biquaternion (a hypercomplex number), and denoted as

(7.29) q = q0 + q1·i + q2·j + q3·k

where q0 ~ q3 are complex numbers whose real and imaginary parts respectively represent the values of the same pixel in two different spectral band images, and i, j, k are imaginary units that satisfy i² = j² = k² = ijk = −1. Equation 7.29 can also be written as a combination of two quaternions:

(7.30) q = qr + I·qi

where I denotes the axis of imaginaries of the biquaternion, which satisfies I² = −1 and commutes with i, j and k; qr and qi are the real part and the imaginary part of the biquaternion, and both of them are quaternions, denoted as

(7.31) qr = a0 + a1·i + a2·j + a3·k,  qi = b0 + b1·i + b2·j + b3·k

with qn = an + I·bn (n = 0, 1, 2, 3).

As with the PQFT model mentioned in Section 4.4, the multispectral imagery can be represented as a biquaternion image in which the quantity at each pixel is a biquaternion (Equation 7.30). Applying the Fourier transforms of the quaternion and the biquaternion [62, 63] to the biquaternion image, keeping the phase spectrum while setting the amplitude spectrum to one, the SM of the multispectral imagery is obtained by convolving the squared modulus of the inverse Fourier transform with a Gaussian low-pass filter.
In [10], only the six spectral bands with a spatial resolution of 30 m × 30 m per pixel (bands 1–5 and band 7) are used in the multispectral imagery; the other two bands, band 6 with a spatial resolution of 60 m × 60 m per pixel and the panchromatic band with a spatial resolution of 15 m × 15 m per pixel, are not considered. Thus, letting q0 = 0, the imagery of the six spectral bands is represented as a pure biquaternion image. If there are more than eight spectral bands in the multispectral imagery (so that the eight real values of a biquaternion are no longer enough to represent it), an extended hypercomplex representation, Clifford algebra, together with the Clifford Fourier transform [64], is used to calculate the saliency map.
3. Ship detection by adaptive thresholding
The saliency map obtained above is normalized to [0, 1], and an adaptive threshold related to the mean and standard deviation of the saliency map is defined as

(7.32) T = μSM + β·σSM

where μSM and σSM are the mean and standard deviation of the normalized saliency map, respectively, and β is an adjustable parameter. If there are no ships in the scene, noise and illumination changes are amplified by the normalization. Thus, in general, the deviation of a saliency map with ship signatures is smaller, while the deviation of a saliency map without any ship signature is comparatively large. Given a suitable parameter β (β = 6 in [10]), the value of the ship detection map at each pixel is computed by

(7.33) D(x, y) = 1 if SM(x, y) > T, and D(x, y) = 0 otherwise

where SM(.) is the normalized saliency map, D(x, y) is the resulting binary ship detection map and (x, y) is the coordinate of the pixel (see the code sketch below).
The experimental results in [10] show the efficiency of the PBFT for ship target detection. Figure 7.4(a)–(f) shows the ocean region images segmented from the six spectral bands of Landsat 7 ETM+ data of the East China Sea collected in April 2008. The intensity of different ships differs within the same band owing to the different reflectance of each ship's materials; in addition, the intensity of the background differs between bands. Some ship targets are brighter than the background and some are darker. There are eleven ship targets in the multispectral imagery of Figure 7.4, but fewer than eleven salient targets in each band, so it is difficult to set a threshold that detects all ship targets in any single band.
The conventional CFAR algorithm and some improved algorithms may fail to detect some of the ship targets. The arrowheads in Figure 7.4(b) and (e) point to the targets missed by the CFAR algorithm [53] and by the maximum likelihood method [57], respectively. The PBFT method does not need to estimate a statistical distribution, which saves detection time compared with the other object detection methods (in [10] its time cost is one fifth of that of [53, 57] and one third of that of [58]). Furthermore, its saliency computation combines the information of all bands, so it detects all eleven ship targets in Figure 7.4 with the minimum false alarm rate [10] compared with CFAR, maximum likelihood and the support vector machine.
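As a concrete illustration of step 3, a minimal NumPy sketch of the adaptive thresholding of Equations 7.32 and 7.33 might look as follows; the saliency map sm is assumed to be a 2-D array produced by the PBFT (or any other saliency model), and β = 6 follows [10].

```python
import numpy as np

def detect_ships(sm, beta=6.0):
    """Adaptive thresholding of a saliency map (Equations 7.32 and 7.33)."""
    # Normalize the saliency map to [0, 1].
    sm = (sm - sm.min()) / (sm.max() - sm.min() + 1e-12)
    # Equation 7.32: threshold from the mean and standard deviation.
    T = sm.mean() + beta * sm.std()
    # Equation 7.33: binary detection map, 1 where saliency exceeds T.
    return (sm > T).astype(np.uint8)
```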

Figure 7.4 Real multispectral remote sensing data in an ocean region. (a)–(f) are the six spectral bands of real Landsat 7 ETM+ data. The arrowhead in (b) marks the ship missed by the CFAR algorithm and the arrowhead in (e) the ship missed by maximum likelihood (the results are excerpted from [10]). Reprinted from Neurocomputing, 76, no. 1, Zhenghu Ding, Ying Yu, Bin Wang, Liming Zhang, ‘An approach for visual attention based on biquaternion and its application for ship detection in multispectral imagery,’ 9–17, 2012, with permission from Elsevier


7.3.1.2 Visual Attention Based Ship Detection in SAR Imagery

Ship signatures in SAR images are determined by many scattering mechanisms, including direct reflection from areas at right angles to the radar beam, corner reflections and multiple reflections between the ships and the sea surface. SAR imagery contains speckle noise resulting from the interference of backscattered signals, as well as heterogeneities caused by backscattering alterations, different wind conditions and so on. Figure 7.5(a) shows an SAR HH image (500 × 256 pixels), acquired by the ALOS satellite, with two ship targets against a background of speckle noise and heterogeneities. With conventional ship detection algorithms based on the statistical distribution of the signal, it is very difficult to find the ships, since the ship signatures have an intensity level similar to their surroundings.

Figure 7.5 (a) original ALOS SAR image; (b) and (c) the detection results by CFAR and by the attention-based method [9]. Reprinted from Neurocomputing, 74, no. 11, Ying Yu, Bin Wang, Liming Zhang, ‘Hebbian-based neural networks for bottom-up visual attention and its applications to ship detection in SAR images,’ 2008–2017, 2011, with permission from Elsevier


In [9], the frequency domain visual attention computational model (PCT) introduced in Section 4.5 is used to create saliency maps of these SAR images, and the adaptive threshold method (Equations 7.32 and 7.33) is then used to complete the ship target detection. Since visual attention captures the most important information regardless of speckle noise or heterogeneous background, the ships in SAR imagery are searched and detected successfully over a database of SAR HH images. In the example shown in Figure 7.5(c), the visual attention-based method produces fewer false alarms than the CFAR algorithm in Figure 7.5(b).
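A sketch of a PCT-style saliency computation is given below, assuming it matches the formulation of Section 4.5 (keep only the signs of the 2-D DCT coefficients, invert, square and smooth); the smoothing width sigma is an illustrative assumption.

```python
import numpy as np
from scipy.fft import dctn, idctn
from scipy.ndimage import gaussian_filter

def pct_saliency(img, sigma=3.0):
    """PCT-style saliency: keep only the signs of the 2-D DCT coefficients,
    invert, square and smooth with a Gaussian low-pass filter."""
    p = np.sign(dctn(img.astype(np.float64), norm='ortho'))
    f = idctn(p, norm='ortho')
    return gaussian_filter(f * f, sigma)
```

Discarding the DCT amplitudes plays the same whitening role as discarding the Fourier amplitude spectrum in the PBFT above, which is why such models are largely insensitive to speckle noise and slow background variations.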

7.3.2 Airport Detection in a Land Region

The panchromatic remote sensing imagery covers all optical spectral bands so that it has the highest spatial resolution, which commonly provides a means of detecting geospatial objects such as bridges, roads and airports in land regions with complex background. For specific object (airport) detection, the top-down knowledge and object classification need to be considered. In general, airports are often built near a city or suburb, so the area around an airport is often very complex and has a multiplicity of objects such as intricate roads, all kinds of buildings, farmland, bridges, mountains, sea and so on. Thus, selecting an airport in a complex background is a challenging issue. Note that each scene has only one panchromatic remote sensing image.

Conventional works on airport detection consider features of an airport such as the runways and the texture difference between the airport and its surroundings. Runway detection uses classical edge detection methods to find the edges in the satellite image, and then utilizes the Hough transform [36, 65] to detect straight lines; a region with straight lines may be the location of an airport. Finally, feature extraction and classification as used in computer vision are applied to find the target. This kind of method [66–68] is simple and fast, but it is difficult to distinguish an airport from interfering objects if those objects also possess straight-line parts. Texture segmentation methods [69, 70] aim to find regions with a high texture difference via other features such as statistical moments or SIFT features. This kind of method has higher precision, but brings more computational complexity because it operates pixel by pixel. Since airports are salient in remote sensing images owing to their differences from the surroundings, in principle the visual attention mechanism could easily replace the image segmentation of the latter approach to obtain candidate airport regions. However, the complex background of land regions in satellite imagery makes most visual attention models fail to find the regions of interest.

An idea proposed in [11] combines an improved visual attention model with the two kinds of conventional methods. An analysis of visual attention computational models shows that most bottom-up models fail to pop out airport areas; only the attentional focus of the GBVS model (mentioned in Section 3.4) partially covers the airport area. Since an airport includes runways with straight lines, the idea in [11] is to use the result of the Hough transform of the input image in place of the orientation channel of the GBVS model. An example comparing the improved saliency map with others is shown in Figure 7.6. We can see that the improved GBVS model pops out the airport target in Figure 7.6 and is faster than the original GBVS model, although it is slower than the PFT and BS models (Sections 4.3 and 3.1); in increasing order of time cost the models rank PFT, BS, improved GBVS and GBVS. The improved GBVS model is therefore employed in [11] as a tradeoff among the above computational models.

Figure 7.6 An example of saliency maps resulting from different models: the object is the airport in Dalian, China: (a) Original image; (b) NVT; (c) PFT; (d) GBVS; (e) improved GBVS


Before using an attention model, preprocessing is necessary to select useful regions in the method of [11]. First, the broad-area satellite image (20 000 × 20 000 pixels) is partitioned into many mid-size image chips of 400 × 400 pixels, and the conventional edge-detection-based method is applied to these chips. In most cases the possible extent of runway lengths in an airport is known, because the resolution of an existing satellite's images is given. The Hough transform applied to the extracted edges can find the straight lines in each image chip; if a chip contains no straight lines, or only straight lines outside the expected length range, it is discarded. Since edge extraction and the Hough transform are fast operations, many image chips without airports can be eliminated, which saves computational time (a code sketch of this prefilter is given after this paragraph). Secondly, in the improved GBVS model, two kinds of prior knowledge are used: (1) the result of the Hough transform of the chip under consideration replaces the orientation channel of the original GBVS model, as mentioned above, so the improved GBVS has two feature channels (intensity and the Hough transform result); (2) the surface material of airport runways is generally concrete or bitumen, which appears bright in most remote sensing images owing to its high reflectivity, so this prior knowledge is added by multiplying the filtered original image chip with the saliency map (pixel by pixel).
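A minimal sketch of this preprocessing prefilter with OpenCV follows; the Canny and Hough parameters and the admissible line-length range are illustrative assumptions that would in practice be set from the known image resolution and typical runway lengths.

```python
import cv2
import numpy as np

def chip_has_runway_lines(chip, min_len=100, max_len=380):
    """Prefilter a 400 x 400 chip (8-bit grayscale): keep it only if it
    contains straight lines of a length plausible for a runway."""
    edges = cv2.Canny(chip, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=80,
                            minLineLength=min_len, maxLineGap=5)
    if lines is None:
        return False                      # no straight lines: discard chip
    lengths = (np.hypot(x2 - x1, y2 - y1) for x1, y1, x2, y2 in lines[:, 0])
    return any(min_len <= L <= max_len for L in lengths)
```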

The saliency map only gives candidate regions for the airport; procedures from computer vision, such as feature extraction, training and classification, then follow. In [11], SIFT key-points with their respective feature descriptors are used; the difference from [70] is that only the key-points in salient regions are calculated. The classifier in [11] employs a hierarchical discriminant regression (HDR) tree [71] (a decision tree) with a fast search speed, as mentioned in Section 5.4.

In summary, the steps of airport detection based on visual attention modelling in [11] are as follows.

1. Image partitioning
A panchromatic remote sensing image of large size is partitioned into many mid-size image chips (400 × 400 pixels).
2. Training
With a training set, SIFT features are computed on each training sample. If a key-point falls on an airport area, it is labelled ‘1’; otherwise, it is labelled ‘0’. All the labelled samples are input into the HDR tree for training, and then a decision tree is built for testing.
3. Preprocessing
Edge detection and the Hough transform are used to judge whether the partitioned image chip contains an airport. The unimportant image chips are discarded. The results of the Hough transform for the image chip with candidate airport are retained for the next step.
4. Computation of saliency map
The selected image chip and its Hough transform are integrated into the improved GBVS model, and then the saliency map is obtained.
5. Computing the area of the airport
Any of the area growing methods mentioned in Section 7.2.1 can then be used to obtain the candidate airport areas, in order of decreasing saliency on the saliency map. Each grown area is enclosed by a bounding rectangle; thus, for each image there may be several rectangular candidate airport areas.
6. Calculating the feature ratio
For each candidate airport area, the SIFT key-points with feature descriptors are extracted and classified into ‘1’ and ‘0’ by the HDR tree. The feature ratio of an airport area is defined as the percentage of label ‘1’ among all the SIFT key-points of the area.
7. Airport recognition
Two criteria are considered in airport recognition. One is the feature ratio of the area: when the feature ratio is higher than a threshold, the area is recognized as an airport. The other is the saliency order of the area: an area with a higher saliency order and at least one SIFT key-point labelled ‘1’ is also classed as an airport. Some results of airport detection against complex backgrounds by the method proposed in [11] are shown in Figure 7.7.
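A minimal sketch of the two recognition criteria (steps 6 and 7) might look as follows, where labels holds the HDR-tree outputs for the key-points of one candidate area and saliency_rank is the area's rank in decreasing saliency; the threshold values are illustrative assumptions.

```python
def is_airport(labels, saliency_rank, ratio_thresh=0.5, top_rank=3):
    """Recognition criteria of steps 6 and 7. labels: HDR-tree outputs
    (1 = airport key-point, 0 = background) for one candidate area;
    saliency_rank: 1 for the most salient candidate area, 2 for the next."""
    if not labels:
        return False
    feature_ratio = labels.count(1) / len(labels)
    # Criterion 1: the feature ratio of the area exceeds a threshold.
    # Criterion 2: a highly salient area with at least one positive key-point.
    return (feature_ratio >= ratio_thresh
            or (saliency_rank <= top_rank and 1 in labels))
```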

Figure 7.7 Some recognition results from attention-based airport detection


In [11], experimental results (ROC curves and detection speed) show that this attention-based airport detection method achieves higher recognition ratios, a lower false alarm rate and a higher search speed than the conventional methods [67, 70]. This is an example of the application of visual attention to complex satellite images.

7.3.3 Saliency and Gist Feature for Target Detection

In the airport detection introduced above, prior information about the airport is incorporated into the original GBVS model. The method thus aims at a specific object (the airport) and does not work well for other kinds of objects.

Recently, an application of visual attention suited to the detection of different objects in high-resolution broad-area satellite images was proposed in [8]. The difference from the airport detection method is that more general features are extracted from the feature channels of a pure bottom-up computational model; after training on a corresponding database, these features can detect various kinds of objects.

First, the broad-area satellite image is cut into small image chips as above. The goal of target detection is to determine whether a chip includes the required object or not. The feature extraction for each image chip is based on saliency maps of low-level feature channels and on gist analysis. A 238-dimensional feature vector for arbitrary object detection is extracted, consisting of statistics of both the saliency and the gist of these low-level features. The detector or classifier employs a support vector machine (SVM). For each required object, a training database of chips with and without the object is built. The feature vectors of the chips in the training database are input to the SVM to learn the discriminant function (optimal hyperplane). In the testing stage, test image chips are identified by the trained SVM, as in other object detection or recognition methods in computer vision. Since the features of the proposed method are based on the saliency of low-level feature channels, the feature extraction of an image chip is described in detail below.
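The detection stage can be sketched with scikit-learn as below; the RBF kernel and its parameters are illustrative assumptions (not fixed by the description of [8] here), and the random arrays merely stand in for real 238-dimensional feature vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 238))    # stand-in 238-dim feature vectors
y_train = rng.integers(0, 2, size=200)   # 1 = chip contains the object

clf = SVC(kernel='rbf', C=1.0, gamma='scale')  # kernel choice is illustrative
clf.fit(X_train, y_train)
print(clf.predict(X_train[:5]))          # 1 = detected, 0 = not detected
```

In practice the trained classifier would of course be applied to the feature vectors of unseen test chips rather than to the training data.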

7.3.3.1 Saliency Feature Calculation

As mentioned in Chapter 3, the low-level features of an image chip are obtained by decomposing the image into multiple feature channels at multiple scales. There are ten feature channels: intensity, orientation (where the four orientations, 0°, 45°, 90° and 135°, are combined into one channel), local variance, entropy, spatial correlation, surprise, T-junction, L-junction, X-junction and endpoint. The local variance, entropy and spatial correlation channels are analysed within 16 × 16 image patches; the other channels are analysed by using an image pyramid with nine scales and centre–surround differences between scale images. Competition within each feature map and across scales is computed in these channels, which suppresses strong noise and makes sparse peaks stand out in the channel saliency maps.

The intensity channel is like that in the BS model introduced in Chapter 3: the input image chip is progressively low-pass filtered and subsampled to create nine scale maps (an image pyramid), from which six centre–surround difference maps are computed (the centre scales are selected as c ∈ {2, 3, 4}, and the surround scales are defined as s = c + δ, where δ ∈ {3, 4}). The six difference maps are normalized respectively. The orientation feature channel is generated by Gabor filters with four orientations on the image pyramid; the four orientations are integrated into an orientation map for each scale and normalized in the six centre–surround difference maps. The computation of these two feature channels is the same as in the BS model.
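A minimal sketch of this pyramid and centre–surround computation, using OpenCV, is given below; the per-map normalization step is omitted.

```python
import cv2

def intensity_centre_surround(img, centres=(2, 3, 4), deltas=(3, 4)):
    """Nine-scale pyramid and the six centre-surround difference maps of the
    intensity channel (centres c in {2, 3, 4}, surrounds s = c + delta)."""
    pyr = [img.astype('float32')]
    for _ in range(8):                    # scales 1..8 of the pyramid
        pyr.append(cv2.pyrDown(pyr[-1]))
    maps = []
    for c in centres:
        for d in deltas:
            s = c + d
            # Upsample the surround scale to the centre scale and subtract.
            up = cv2.resize(pyr[s], (pyr[c].shape[1], pyr[c].shape[0]),
                            interpolation=cv2.INTER_LINEAR)
            maps.append(cv2.absdiff(pyr[c], up))
    return maps                           # six maps, still to be normalized
```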

The feature channels of local variance, entropy and spatial correlation are computed over 16 × 16 pixel patches of the image chip, so the resulting map is 16 times smaller than the original image in both width and length and directly represents the feature's saliency map. The local variance feature is obtained by computing the variance of each image patch, represented as

(7.34) V = sqrt( (1/Npz) Σ(i, j) (I(i, j) − Im)² )

where Im is the mean intensity of the patch under consideration, Npz denotes the number of pixels in the patch and I(i, j) is the intensity value at location (i, j) of the patch; the symbol ‘sqrt(.)’ denotes the square root function. The entropy value of an image patch is related to the probability density of intensity within the 16 × 16 patch, described as

(7.35) H = −Σn pn · log pn, n ∈ Nei

where pn is the probability of a possible intensity value n within the neighbourhood Nei (the 16 × 16 patch). The computation of spatial correlation is based on the patch under consideration and other patches at a given radius from it. The correlation of two random variables a and b can be written as

(7.36) ρ(a, b) = cov(a, b) / (σa · σb) = E[(a − E(a)) (b − E(b))] / (σa · σb)

where E(.) is the expectation, σa and σb are the standard deviations of the variables a and b, and cov(.) is the covariance. The spatial correlation feature of an image patch is computed by Equation 7.36, with the values of the 16 × 16 patch under consideration and the values of its surrounding patches taking the roles of the variables a and b. It is worth noting that a lower spatial correlation at a location means higher salience at that position.
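The local variance and entropy channels (Equations 7.34 and 7.35) can be sketched as follows, assuming an 8-bit intensity image so that the entropy histogram has 256 bins; the spatial correlation channel would apply Equation 7.36 in the same patchwise loop.

```python
import numpy as np

def patch_features(img, p=16):
    """Local variance (Eq. 7.34) and entropy (Eq. 7.35) over non-overlapping
    p x p patches; the output maps are p times smaller than the input."""
    h, w = img.shape[0] // p, img.shape[1] // p
    var_map, ent_map = np.zeros((h, w)), np.zeros((h, w))
    for r in range(h):
        for c in range(w):
            patch = img[r * p:(r + 1) * p, c * p:(c + 1) * p]
            # Eq. 7.34: square root of the mean squared deviation.
            var_map[r, c] = np.sqrt(np.mean((patch - patch.mean()) ** 2))
            # Eq. 7.35: entropy of the intensity histogram of the patch.
            prob = np.bincount(patch.astype(np.uint8).ravel(),
                               minlength=256) / patch.size
            prob = prob[prob > 0]
            ent_map[r, c] = -np.sum(prob * np.log2(prob))
    return var_map, ent_map
```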

Bayesian surprise is defined as the difference between prior beliefs and posterior beliefs after a new signal is observed; it is often used to compute the saliency map of a dynamic image, as mentioned in Section 3.8. For a static image chip, the prior beliefs are first computed from a large neighbourhood (the 16 × 16 patches near the patch under consideration), and the current patch is treated as the newly observed signal for computing the adjusted posterior beliefs; the surprise feature is thereby attained.
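As a toy illustration only (the actual surprise computation follows the formulation referenced in Section 3.8, not this simplification), the prior-versus-posterior divergence can be caricatured with univariate Gaussians:

```python
import numpy as np

def gaussian_surprise(neighbourhood, patch):
    """Toy surprise: KL divergence between a Gaussian prior fitted to the
    neighbouring patches and the posterior refitted with the current patch."""
    mu0, s0 = neighbourhood.mean(), neighbourhood.std() + 1e-6
    both = np.concatenate([neighbourhood.ravel(), patch.ravel()])
    mu1, s1 = both.mean(), both.std() + 1e-6
    # KL( N(mu1, s1) || N(mu0, s0) ) for univariate Gaussians.
    return np.log(s0 / s1) + (s1**2 + (mu1 - mu0)**2) / (2 * s0**2) - 0.5
```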

The junction feature channels are derived from the four orientation feature maps at different scales. The L-junction responds at locations where two edges meet perpendicularly and both end at the intersection point; the T-junction responds to two orthogonal edges where only one edge ends at the intersection point; in the X-junction, no edge ends at the intersection point. The endpoint responds where an extended edge terminates. Each edge in the orientation features is therefore checked against its eight neighbours. The four junction feature channels are calculated at different scales.

For the nine feature channels mentioned above, excepting the surprise feature, competition is implemented within each feature map and across scales, and the results are finally resized to the same size as the local variance feature channel (16 times smaller than the original image in width and length). We now have ten saliency maps for the ten feature channels. For each saliency map, four statistical features are extracted: the mean, the variance, the number of local maxima and the average distance between the locations of the local maxima. The dimension of the saliency feature vector is thus 40 (10 × 4 = 40).
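A sketch of the four statistics extracted from one channel saliency map follows; the local-maximum criterion (a 3 × 3 neighbourhood maximum that also exceeds the map mean) is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def saliency_statistics(sm):
    """Mean, variance, number of local maxima and mean distance between
    local maxima of one channel saliency map (4 of the 40 dimensions)."""
    peaks = (sm == maximum_filter(sm, size=3)) & (sm > sm.mean())
    ys, xs = np.nonzero(peaks)
    n = len(xs)
    if n > 1:
        d = np.hypot(xs[:, None] - xs[None, :], ys[:, None] - ys[None, :])
        mean_dist = d[np.triu_indices(n, 1)].mean()   # mean pairwise distance
    else:
        mean_dist = 0.0
    return np.array([sm.mean(), sm.var(), float(n), mean_dist])
```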

7.3.3.2 Gist Feature Computation

The gist of a scene is often captured when a viewer merely glances at the scene. This cursory estimate is commonly related to statistical features such as the mean or variance of some features. The studies of scene perception in [72, 73] inspired scientists and engineers to design new features for scene recognition [74]. The gist feature in [8] is designed to capture and summarize the overall statistics of the entire image chip in the low-level feature channels.

In the intensity channel, the statistics of the five raw scale maps of the pyramid (scale range 0–4) and of the six centre–surround difference maps are taken as gist features; that is, the mean values of these 11 maps are calculated. The mean values of the maps for the four orientations (0°, 45°, 90°, 135°), four L-junctions, four T-junctions, four endpoints and one X-junction (a total of 17 further channels, each with 11 maps) are also regarded as gist features. Therefore the dimension of the gist feature vector is 198 (11 + 17 × 11). A simple concatenation of the saliency features and gist features generates a final feature vector with 40 + 198 = 238 dimensions for object detection. As mentioned above, this 238-dimensional feature vector is input to a classifier (SVM) to detect whether the required object exists or not, assuming the parameters of the SVM have been obtained in the training stage.
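The gist computation then reduces to taking the mean of each map, as sketched below; channel_maps is assumed to hold the 11 maps (five raw scales plus six centre–surround maps) for each of the 18 channels, giving a 198-dimensional vector.

```python
import numpy as np

def gist_vector(channel_maps):
    """Mean of each of the 11 maps (5 raw scales + 6 centre-surround maps)
    in each of the 18 channels -> a 198-dimensional gist vector."""
    # channel_maps: {channel_name: list of 11 2-D arrays}
    return np.array([m.mean() for maps in channel_maps.values() for m in maps])
```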

Several very large databases, each with more than 10 000 image chips cut out from high-resolution broad-area satellite images, were built to test the algorithm of [8] on different object detection tasks. The ROC curve mentioned in Chapter 6, which plots TPR against FPR as the threshold changes, was adopted in their experiments. The detection results shown in [8] for several different objects (boats, airports and buildings) demonstrate that the proposed method outperforms other object detection methods such as HMAX [50, 75], SIFT [39] and the hidden-scale salient object detection algorithm [76]. The use of statistical features extracted from saliency maps as the input of a classifier is a meaningful application for computer vision.
