Remote sensing is a technique that observes the earth and its surface from platforms far above it, such as satellites or space shuttles. Man-made satellites are one kind of platform, acquiring digital images of the earth's surface through cameras, video cameras, multispectral scanners, hyperspectral sensors, radar and so on. Multispectral scanners simultaneously record the reflected spectral signals from the same scene and form several images in different spectral bands (400–1000 nm), from visible light to invisible (near infrared) light, and the technique has been extended to hyperspectral imagery with higher spectral resolution for a scene. Optical spectral imagery contains rich geological information in both the spectral and spatial domains, so it can be used to study earth materials and detect hidden objects. However, optical spectral imagery is often disturbed by cloud cover or by changes of sunlight, and thus it cannot work in cloudy or rainy conditions or on a dark night. Electromagnetic wave signals in a lower frequency range can pass through the cloud layer independently of changes in visible light intensity (working in almost all weather conditions, day or night), so satellites carrying synthetic aperture radar (SAR for short), which transmit and receive such electromagnetic wave signals, have been developed. SAR can directly generate two-dimensional imagery with orthogonal dimensions (range and azimuth), but it is often disturbed by speckle noise due to the interference of scattering signals. The two kinds of satellite imagery (hyperspectral and SAR images) are mutually complementary for monitoring a wide extent of the earth and its surface across many spectral bands. With the technical development of satellite remote sensing, a great deal of data is sampled every second as a satellite moves along its orbit around the earth.
The huge amount of imagery makes classical image analysis techniques impractical, so exploring automatic object search methods for raw imagery is becoming increasingly important. Visual attention can rapidly select interesting locations in images, which makes it a helpful tool for sifting through this information and looking for hidden targets in satellite imagery.
However, the appearance of satellite imagery is totally different from that of natural images. Each pixel in a remote sensing image represents an area of the earth's surface (e.g., a pixel covers a 30 m × 30 m area for mid-resolution multispectral imagery). It is known that oceans cover more than 70% of the earth's area. In ocean regions a target (a commercial vessel or a warship) may be detected easily, but on land the complicated road networks, farmland, city buildings, rivers, mountains, vegetation and so on result in a complex background, and it is difficult to find a hidden target. Figure 7.3(a) and (b) show two images from Google Earth that combine aerial photographs and satellite imagery collected by Landsat-7 (30 m × 30 m resolution) for land regions. Thus, methods for analysing satellite imagery should differ from those for natural images or images of man-made objects. This section mainly introduces the applications of visual attention to object search and recognition in satellite imagery.
First, we present ship target detection in ocean regions for multispectral imagery and for SAR imagery by using visual attention; then we introduce airport detection in land regions with the help of saliency maps, prior knowledge and decision trees. Finally, an object recognition strategy for satellite imagery is presented, which employs the saliency map together with other features and combines computer vision methods to identify arbitrary object patches in satellite imagery.
Ship detection in satellite imagery (multispectral imagery and SAR imagery) is becoming more and more important because of its potential applications in searching for missing fishing vessels, region estimation for oil pollution, vessel traffic services, military target detection and so on. However, satellite imagery is often corrupted by noise and light changes, which cause detection errors in most conventional methods.
Conventional ship target detection in multispectral or SAR imagery mainly includes the constant false alarm rate (CFAR) algorithm and its improvements [52–56], the maximum likelihood ratio test [57], the support vector machine (SVM) [58], the cross-correlation-based method [59] and the polarization cross-entropy approach [60]. Among them, the CFAR algorithm is the most commonly used method for both multispectral and SAR imagery; it utilizes the statistical information of target and background to find a detection threshold that keeps the probability of false alarm invariant as light intensity changes or strong noise appears. However, estimating this statistical information has high computational complexity. With an attention computational model, especially a frequency domain model, salient regions on a relatively simple background can be located very quickly, which is suitable for ship detection in ocean regions. The following two subsections introduce ship target detection based on frequency domain bottom-up attention models for multispectral imagery and SAR imagery, respectively.
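The CFAR idea can be illustrated with a minimal one-dimensional cell-averaging (CA-CFAR) sketch. The window sizes, the desired false alarm probability and the exponential clutter assumption below are illustrative choices, not those of [52–56]:

```python
import numpy as np

def ca_cfar(signal, guard=2, train=8, pfa=1e-3):
    """One-dimensional cell-averaging CFAR: for each cell under test,
    estimate the clutter level from the training cells on either side
    (skipping the guard cells), and compare the cell against a threshold
    scaled so that the false alarm probability stays at pfa for
    exponentially distributed clutter."""
    n = 2 * train                          # total number of training cells
    alpha = n * (pfa ** (-1.0 / n) - 1.0)  # CA-CFAR scaling factor
    detections = np.zeros(len(signal), dtype=bool)
    for i in range(train + guard, len(signal) - train - guard):
        lead = signal[i - train - guard : i - guard]
        lag = signal[i + guard + 1 : i + guard + train + 1]
        clutter = (lead.sum() + lag.sum()) / n
        detections[i] = signal[i] > alpha * clutter
    return detections
```

Because the threshold is a fixed multiple of the locally estimated clutter power, it rises and falls with the background, which is exactly what keeps the false alarm rate constant.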
Multispectral remote sensing imagery includes several spectral images across different spectral bands (6–8 bands); that is, there are 6–8 images for a given scene. Different ships display different intensities in different spectral bands: a ship may pop out in some band images but disappear in others, and for other ships the reverse may hold, so considering the image in only one spectral band cannot detect all ships, owing to the missing information. As we saw in Chapter 4, a quaternion can represent four values (intensity, colour opponents and motion) at one image pixel, and the phase spectrum of the quaternion Fourier transform (PQFT) can be used to compute the saliency map of the image. In multispectral remote sensing imagery there are several values in different bands at each pixel, so, extending the idea of PQFT, the several values at one pixel can be represented as a biquaternion [10, 61]. Using the phase spectrum of the biquaternion Fourier transform (PBFT) we can also obtain the saliency map of the multispectral remote sensing imagery. According to the saliency map, an adaptive detection threshold is set to detect the locations of ships in ocean regions. The steps in ship detection [10] are as follows.
q(n, m) = f1(n, m) + f2(n, m)μ1 + f3(n, m)μ2 + f4(n, m)μ3 + i[f5(n, m) + f6(n, m)μ1 + f7(n, m)μ2 + f8(n, m)μ3]   (7.31)

where fk(n, m) is the value of the kth spectral band at pixel (n, m), μ1, μ2 and μ3 are the quaternion imaginary units and i is the complex imaginary unit, which commutes with μ1, μ2 and μ3.
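As a rough illustration of the phase-spectrum idea (not the biquaternion transform of [10]), the sketch below applies a per-band phase-only Fourier reconstruction and pools the energy across bands; the pooling and the smoothing kernel size are simplifying assumptions:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def phase_saliency(bands, k=3):
    """Phase-spectrum saliency pooled over spectral bands.  Each band is
    reconstructed from its Fourier phase alone (in the spirit of PFT) and
    the squared magnitudes are summed; a k x k box filter then smooths the
    pooled map."""
    sal = np.zeros(bands[0].shape)
    for band in bands:
        F = np.fft.fft2(band)
        recon = np.fft.ifft2(np.exp(1j * np.angle(F)))  # keep phase only
        sal += np.abs(recon) ** 2
    pad = k // 2
    padded = np.pad(sal, pad, mode="edge")
    windows = sliding_window_view(padded, (k, k))
    return windows.mean(axis=(-1, -2))   # box-filter smoothing
```

Pooling across bands mirrors the motivation in the text: a ship that pops out in only one band still contributes energy to the combined saliency map.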
Ship signatures in SAR images are determined by many scattering mechanisms, including direct reflection from areas at right angles to the radar beam, corner reflections and multiple reflections between the ship and the sea surface. SAR imagery contains speckle noise resulting from backscattered signals, and heterogeneities caused by backscattering alteration, different wind conditions and so on. Figure 7.5(a) shows an SAR HH image (500 × 256 pixels) acquired by the ALOS satellite, with two ship targets against a background of speckle noise and heterogeneities. With conventional ship detection algorithms based on the statistical distribution of the signal it is very difficult to find the ships, since their signatures have an intensity level similar to that of their surroundings.
In [9], a frequency domain visual attention model (PCT), introduced in Section 4.5, is used to create saliency maps of these SAR images, and the adaptive threshold method (Equations 7.32 and 7.33) then completes ship target detection. Since visual attention captures the most important information regardless of speckle noise or the heterogeneous background, the ships in SAR imagery are detected successfully in a database of SAR HH images. In the example shown in Figure 7.5(c), compared with Figure 7.5(b) (the CFAR algorithm), the visual attention-based method produces fewer false alarms.
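The thresholding step can be sketched as follows. Equations 7.32 and 7.33 are not reproduced here; this sketch assumes the common mean-plus-k-standard-deviations rule, which likewise adapts the threshold to the global statistics of each saliency map:

```python
import numpy as np

def adaptive_detect(saliency, k=3.0):
    """Adaptive thresholding of a saliency map: pixels that exceed the map
    mean by more than k standard deviations are marked as targets.  The
    rule is an assumed stand-in for Equations 7.32 and 7.33."""
    threshold = saliency.mean() + k * saliency.std()
    return saliency > threshold, threshold
```

Because the threshold is derived from the map itself, it rises on noisy or heterogeneous scenes and falls on quiet ones, so no fixed cut-off needs to be tuned per image.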
Panchromatic remote sensing imagery integrates all optical spectral bands into a single channel and therefore has the highest spatial resolution, which commonly provides a means of detecting geospatial objects such as bridges, roads and airports in land regions with complex backgrounds. For detection of a specific object (an airport), top-down knowledge and object classification need to be considered. In general, airports are built near a city or suburb, so the area around an airport is often very complex, with a multiplicity of objects such as intricate roads, all kinds of buildings, farmland, bridges, mountains, sea and so on. Thus, selecting an airport from a complex background is a challenging issue. Note that each scene has only one panchromatic remote sensing image.
Conventional work on airport detection considers features of an airport such as the runways and the texture difference between the airport and its surroundings. Runway detection uses classical edge detection methods to find edges in the satellite image and then utilizes the Hough transform [36, 65] to detect straight lines; a region with straight lines may be the location of an airport. Finally, feature extraction and classification as used in computer vision are applied to find the target. This kind of method [66–68] is simple and fast, but it is difficult to distinguish an airport from interfering objects if those objects also possess straight-line parts. Texture segmentation methods [69, 70] aim to find regions with high texture difference via other features such as statistical moments or SIFT features. This kind of method has higher precision but brings more computational complexity, because it operates pixel by pixel. Since airports are salient in remote sensing images owing to their differences from the surroundings, in principle the visual attention mechanism could simply replace image segmentation in the latter approach to obtain candidate airport regions. However, the complex background in land regions of satellite imagery makes most visual attention models fail to find the regions of interest.
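The Hough voting step works as follows: every edge pixel votes for all lines x·cos θ + y·sin θ = ρ that pass through it, and long straight edges such as runways show up as accumulator peaks. A minimal sketch (the angular resolution and vote threshold are illustrative):

```python
import numpy as np

def hough_lines(edges, n_theta=180, min_votes=20):
    """Minimal Hough transform over a binary edge mask.  Each edge pixel
    votes in an accumulator indexed by (rho, theta); bins with at least
    min_votes votes are returned as detected lines."""
    ys, xs = np.nonzero(edges)
    h, w = edges.shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    acc = np.zeros((2 * diag, n_theta), dtype=int)
    for x, y in zip(xs, ys):
        rhos = np.round(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, np.arange(n_theta)] += 1
    return [(int(r) - diag, float(thetas[t]))
            for r, t in np.argwhere(acc >= min_votes)]
```

In the airport setting, the vote threshold plays the role of the runway-length constraint: a chip whose strongest bin falls below it contains no line long enough to be a runway.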
An idea proposed in [11] combines an improved visual attention model with the two kinds of conventional methods. An analysis of visual attention computational models shows that most bottom-up models fail to pop out airport areas, and only the attentional focus of the GBVS model (mentioned in Section 3.4) partially contains the airport area. Since an airport includes runways with straight edges, the idea in [11] is to use the result of the Hough transform of the input image instead of the orientation channel in the GBVS model. An example comparing the improved saliency map with others is shown in Figure 7.6. We can see that the improved GBVS model pops out the airport target shown in Figure 7.6 and is quicker than the original GBVS model, although it is slower than the PFT and BS models (Sections 4.3 and 3.1); in order of increasing time cost: PFT, BS, improved GBVS, GBVS. The improved GBVS model is therefore employed in [11] as a tradeoff among the above computational models.
Before the attention model is applied, preprocessing is necessary to select useful regions in the method introduced in [11]. First, the broad-area satellite image (20 000 × 20 000 pixels) is partitioned into many mid-size image chips of 400 × 400 pixels. The conventional method based on edge detection is applied to these partitioned image chips. In most cases the extent of runway lengths in an airport is known, because the resolution of a satellite image is given for an existing satellite. Applying the Hough transform to the extracted edges detects straight lines in the partitioned image chips; chips with no straight lines, or with lines outside the expected length range, are discarded. Since edge extraction and the Hough transform are fast operations, many image chips without airports can be omitted, which saves computational time. Secondly, in the improved GBVS model, two kinds of prior knowledge are used: (1) the result of the Hough transform replaces the orientation channel of the original GBVS model as a feature channel of the image chip under consideration, so GBVS has two feature channels (intensity and the result of the Hough transform); (2) the surface material of airport runways is generally concrete or bitumen, and in most remote sensing images these materials appear brighter owing to their high reflectivity. This prior knowledge is added by multiplying the filtered original image chip with the saliency map (pixel by pixel).
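The second kind of prior knowledge amounts to a pixel-wise reweighting of the saliency map by a low-pass-filtered version of the chip. A minimal sketch, where the 3 × 3 box filter stands in for the unspecified filter of [11]:

```python
import numpy as np

def apply_brightness_prior(saliency, chip):
    """Fold the 'runways are bright' prior into the saliency map: low-pass
    filter the image chip with a 3x3 box kernel and multiply the smoothed
    brightness into the saliency map pixel by pixel."""
    padded = np.pad(chip.astype(float), 1, mode="edge")
    h, w = chip.shape
    smooth = sum(padded[i:i + h, j:j + w]
                 for i in range(3) for j in range(3)) / 9.0
    return saliency * smooth
```

Dark regions (water, vegetation, shadow) are thereby suppressed even when they are salient in the bottom-up channels, while bright concrete surfaces keep their saliency.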
The saliency map only gives candidate regions for the airport, and procedures from computer vision, such as feature extraction, training and classification, then follow. In [11], SIFT key-points with their respective feature descriptors are used; the difference from [70] is that only the key-points in salient regions are calculated. The classifier in [11] employs a hierarchical discriminant regression (HDR) tree [71] (a decision tree) with fast search speed, as mentioned in Section 5.4.
In summary, the steps of airport detection based on visual attention modelling in [11] are as follows.
In [11], experimental results (the ROC curve and detection speed) show that this attention-based airport detection method achieves a higher recognition ratio, a lower false alarm rate and a higher search speed than conventional methods [67, 70]. This is an example of the application of visual attention to complex satellite images.
In the airport detection introduced above, information about the airport is incorporated into the original GBVS model, so the method aims at one specific object (an airport); it does not work well for other kinds of objects.
Recently, an application of visual attention suited to the detection of different objects in high-resolution broad-area satellite images was proposed in [8]. The difference from the airport detection method is that more general features are extracted from the feature channels of a pure bottom-up computational model; after training on a corresponding database, these features can be used to detect different kinds of objects.
First, the broad-area satellite image is cut into small image chips as above. The goal of target detection is to decide whether a chip includes a required object or not. The feature extraction for each image chip is based on saliency maps of low-level feature channels and gist analysis. A feature vector with 238 dimensions for arbitrary object detection is extracted, consisting of statistical values for both the saliency and the gist of these low-level features. The detector or classifier employs a support vector machine (SVM). For each required object, a training database of chips with and without that object is built. The feature vectors of the chips in the training database are input to the SVM to learn the discriminant function (the optimal hyperplane). In the testing stage, test image chips are identified by the trained SVM, similar to other object detection or recognition methods in computer vision. Since the features of the proposed method are based on the saliency of low-level feature channels, feature extraction from the image chip is emphasized below.
As mentioned in Chapter 3, the low-level features of an image chip are obtained by decomposing the image into multiple feature channels at multiple scales. There are ten feature channels: intensity, orientation (the four orientations 0°, 45°, 90° and 135° are combined into one channel), local variance, entropy, spatial correlation, surprise, T-junction, L-junction, X-junction and endpoint. The channels for local variance, entropy and spatial correlation are analysed within 16 × 16 image patches; the other channels are analysed using an image pyramid with nine scales and centre–surround differences between scale images. Competition within each feature and across scales is computed in these channels, which suppresses strong noise and makes sparse peaks stand out in the channel saliency maps.
The intensity channel is like that in the BS model introduced in Chapter 3: the input image chip is progressively low-pass filtered and subsampled to create nine scale maps (the image pyramid), from which six centre–surround difference maps are computed (the centre scales are selected as c ∈ {2, 3, 4}, and the surround scales are defined as s = c + δ, where δ ∈ {3, 4}). The six difference maps are normalized separately. The orientation feature channel is generated by Gabor filters with four orientations in the image pyramid; the four orientations are then integrated into an orientation map for each scale and normalized in the six centre–surround difference maps. The computation of these two feature channels is the same as in the BS model.
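A sketch of the pyramid and centre–surround computation for the intensity channel, with a box filter standing in for the Gaussian low-pass and nearest-neighbour repetition standing in for interpolation (both are assumptions, not the exact filters of the BS model):

```python
import numpy as np

def downsample(img):
    """Halve resolution: blur with a 3x3 box kernel, then keep every
    second pixel (a stand-in for Gaussian low-pass filtering)."""
    pad = np.pad(img, 1, mode="edge")
    blur = sum(pad[i:i + img.shape[0], j:j + img.shape[1]]
               for i in range(3) for j in range(3)) / 9.0
    return blur[::2, ::2]

def center_surround_maps(img, centers=(2, 3, 4), deltas=(3, 4)):
    """Build a 9-level pyramid and the six centre-surround maps
    |pyr[c] - upsample(pyr[c + delta])| of the intensity channel."""
    pyr = [img.astype(float)]
    for _ in range(8):
        pyr.append(downsample(pyr[-1]))
    maps = []
    for c in centers:
        for d in deltas:
            up = pyr[c + d]
            for _ in range(d):           # bring surround back to centre scale
                up = np.repeat(np.repeat(up, 2, axis=0), 2, axis=1)
            up = up[: pyr[c].shape[0], : pyr[c].shape[1]]
            maps.append(np.abs(pyr[c] - up))
    return maps
```

The three centre scales paired with the two offsets give exactly the six difference maps mentioned in the text.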
The feature channels of local variance, entropy and spatial correlation are computed over 16 × 16 pixel patches of the image chip, so the resulting image is 16 times smaller than the original in both width and length and directly represents the feature's saliency map. The local variance feature is obtained by computing the variance of each image patch, which is represented as
V = sqrt( Σ_(i,j) (I(i, j) − Im)² / Npz )   (7.34)
where Im is the mean intensity of the image patch under consideration, Npz denotes the number of pixels in the patch and I(i, j) is the intensity value at location (i, j) of the patch. The symbol ‘sqrt(.)’ is the square root function. The entropy value of one image patch is related to the probability density of intensity in the 16 × 16 image patch, which is described as
E = −Σ_i pNei(i) log pNei(i)   (7.35)
where pNei(i) is the probability of possible intensity i in the neighbourhood Nei. The computation of spatial correlation is based on the patch under consideration and other patches at a given radius from it. The correlation of two random variables a and b can be rewritten as

corr(a, b) = E[(a − E(a))(b − E(b))]/(σa σb) = cov(a, b)/(σa σb)   (7.36)
where E(.) is the expectation, σa and σb are the standard deviations of a and b, and cov(.) is the covariance. The spatial correlation feature of an image patch is computed by Equation 7.36, where the values of the 16 × 16 patch under consideration and the values of its surrounding patches play the roles of the variables a and b. It is worth noting that a location with lower spatial correlation has higher salience.
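The three patch-based channels can be sketched together. The neighbourhood used for the correlation term (the four adjacent patches, i.e. a radius of one patch) is an assumption, since [8] only specifies "a given radius":

```python
import numpy as np

def patch_features(img, p=16):
    """Per-patch local variance (Eq. 7.34), entropy (Eq. 7.35) and spatial
    correlation (Eq. 7.36) over non-overlapping p x p patches, producing
    maps p times smaller than the chip in each dimension."""
    h, w = img.shape
    gh, gw = h // p, w // p
    patches = img[:gh * p, :gw * p].reshape(gh, p, gw, p).swapaxes(1, 2)
    # Eq. 7.34: square root of the mean squared deviation in each patch
    var_map = np.sqrt(((patches - patches.mean(axis=(2, 3), keepdims=True)) ** 2)
                      .mean(axis=(2, 3)))
    ent_map = np.zeros((gh, gw))
    corr_map = np.zeros((gh, gw))
    for i in range(gh):
        for j in range(gw):
            # Eq. 7.35: entropy of the intensity histogram of the patch
            vals, counts = np.unique(patches[i, j], return_counts=True)
            prob = counts / counts.sum()
            ent_map[i, j] = -(prob * np.log2(prob)).sum()
            # Eq. 7.36: mean correlation with the 4-connected neighbours
            cs = []
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < gh and 0 <= nj < gw:
                    a, b = patches[i, j].ravel(), patches[ni, nj].ravel()
                    if a.std() > 0 and b.std() > 0:
                        cs.append(np.corrcoef(a, b)[0, 1])
            corr_map[i, j] = np.mean(cs) if cs else 0.0
    return var_map, ent_map, corr_map
```

All three maps come out at one value per patch, i.e. 16 times smaller than the chip, matching the size stated for these channels in the text.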
Bayesian surprise is defined as the difference between prior beliefs and posterior beliefs after a new signal is observed; it is often used to compute the saliency map of a dynamic image, as mentioned in Section 3.8. For a static image chip, the prior beliefs are first computed from a large neighbourhood (the 16 × 16 image patches near the patch under consideration), and the current image patch is treated as the newly observed signal for computing the adjusted posterior beliefs; the surprise feature is thereby obtained.
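A toy version of this idea: fit a Gaussian prior to the surrounding patches, blend in the current patch to obtain a posterior, and measure the Kullback–Leibler divergence between the two. The Gaussian model and the blending weight are simplifying assumptions, not the exact belief update used in [8]:

```python
import numpy as np

def gaussian_kl(mu0, var0, mu1, var1):
    """KL divergence KL(N(mu1, var1) || N(mu0, var0)) between two
    univariate Gaussians."""
    return 0.5 * (np.log(var0 / var1) + (var1 + (mu1 - mu0) ** 2) / var0 - 1.0)

def surprise(neigh_pixels, patch_pixels, weight=0.5):
    """Toy Bayesian-surprise score: prior beliefs come from the
    neighbourhood statistics, the posterior blends in the observed patch,
    and the surprise is the KL divergence between posterior and prior."""
    mu0, var0 = neigh_pixels.mean(), neigh_pixels.var() + 1e-6
    mu_obs, var_obs = patch_pixels.mean(), patch_pixels.var() + 1e-6
    mu1 = (1 - weight) * mu0 + weight * mu_obs
    var1 = (1 - weight) * var0 + weight * var_obs
    return gaussian_kl(mu0, var0, mu1, var1)
```

A patch that looks like its surroundings moves the posterior very little and scores near zero, while a patch with very different statistics scores highly, which is the behaviour the surprise channel is meant to capture.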
The junction feature channels are derived from the four orientation feature maps at different scales. The L-junction detector responds where two edges meet perpendicularly and both end at the intersection point; the T-junction detector responds to two orthogonal edges of which only one ends at the intersection point; in an X-junction, neither edge ends at the intersection point. The endpoint detector responds where an extended edge terminates. Each edge in the orientation feature is therefore checked against its eight neighbours. The four junction feature channels are calculated at different scales.
For the nine feature channels mentioned above (i.e., all except the surprise channel), competition is implemented within each feature map and across scales, and the maps are finally resized to the same size as the local variance channel (16 times smaller than the original image in width and length). Now we have ten saliency maps for the ten feature channels. For each saliency map, four statistical features are extracted: the mean, the variance, the number of local maxima and the average distance between the locations of local maxima. The dimension of the saliency feature vector is therefore 40 (10 × 4 = 40).
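The four statistics per map can be sketched as follows; the definition of a local maximum (strictly greater than its four axis neighbours) is an illustrative choice, since [8] does not specify the neighbourhood:

```python
import numpy as np

def saliency_statistics(sal_maps):
    """Turn each channel saliency map into 4 numbers: mean, variance,
    number of local maxima and average pairwise distance between local
    maxima.  Ten maps then give the 40-dimensional saliency feature."""
    feats = []
    for s in sal_maps:
        pad = np.pad(s, 1, mode="constant", constant_values=-np.inf)
        core = pad[1:-1, 1:-1]
        is_max = ((core > pad[:-2, 1:-1]) & (core > pad[2:, 1:-1]) &
                  (core > pad[1:-1, :-2]) & (core > pad[1:-1, 2:]))
        pts = np.argwhere(is_max)
        if len(pts) > 1:
            d = [np.hypot(*(p - q)) for i, p in enumerate(pts) for q in pts[i + 1:]]
            avg_d = float(np.mean(d))
        else:
            avg_d = 0.0
        feats += [s.mean(), s.var(), float(len(pts)), avg_d]
    return np.array(feats)
```

The maxima count and spread capture whether a map contains one compact salient object or many scattered responses, which raw means and variances alone would miss.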
The gist of a scene is often captured when a viewer merely glances at the scene. This cursory estimate is commonly related to statistical features such as the mean or variance of some features. Studies of scene perception [72, 73] have inspired scientists and engineers to design new features for scene recognition [74]. The gist feature in [8] is designed to capture and summarize the overall statistics of the entire image chip across the low-level feature channels.
In the intensity channel, statistical values of the five raw scale maps in the pyramid (scale range 0–4) and the six centre–surround difference maps are taken as gist features; that is, the mean values of these 11 maps are calculated. The mean values of the maps for the four orientations (0°, 45°, 90°, 135°), four L-junctions, four T-junctions, four endpoints and one X-junction – a total of 17 channels, each with 11 maps – are also taken as gist features. Therefore the dimension of the gist feature vector is 198 (11 + 17 × 11). A simple concatenation of saliency features and gist features generates a final feature vector with 40 + 198 = 238 dimensions for object detection. As mentioned above, this 238-dimensional feature vector is input to a classifier (SVM) to decide whether a required object exists, assuming the parameters of the SVM were obtained in the training stage.
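The gist assembly is just a concatenation of per-map means, which the following sketch makes explicit (the dictionary layout and channel names are illustrative, not the data structures of [8]):

```python
import numpy as np

def gist_vector(channel_maps):
    """Gist feature: the mean of each of the 11 maps (5 raw pyramid scales
    plus 6 centre-surround maps) in each channel, concatenated.  With the
    18 channels listed in the text (intensity, 4 orientations, 4 L-, 4 T-,
    4 endpoint and 1 X-junction) this gives 18 x 11 = 198 values."""
    return np.array([m.mean()
                     for name in sorted(channel_maps)
                     for m in channel_maps[name]])
```

Concatenating this 198-dimensional gist vector with the 40-dimensional saliency statistics reproduces the 238-dimensional feature vector fed to the SVM.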
Several very large databases, each with more than 10 000 image chips cut from high-resolution broad-area satellite images, were built to test the algorithm in [8] on different object detection tasks. The ROC curve mentioned in Chapter 6, which plots TPR against FPR as the detection threshold changes, was adopted in their experiments. The detection results shown in [8] for several different objects (boats, airports and buildings) demonstrate that the method proposed in [8] outperforms other object detection methods such as HMAX [50, 75], SIFT [39] and the hidden-scale salient object detection algorithm [76]. The use of statistical features extracted from saliency maps as the input to a classifier is a meaningful application for computer vision.