7.1 Object Detection and Recognition in Computer Vision

Object detection is a kind of image segmentation based on the geometrical and statistical features of objects, while object recognition refers to the identification of a specific object among other, possibly similar, objects. Recognition is a further step that takes place after detection; in other words, detection is the essential first stage of recognition. For example, face detection in a scene finds human facial regions that have symmetrical geometrical features and a specific skin colour (or other features), while face recognition distinguishes a specific person from many other people.

7.1.1 Basic Concepts

In general, a raw input image needs to be preprocessed, for example by denoising and segmentation. Denoising is mainly accomplished by low-pass filtering in a spatial or transform domain. Conventional segmentation methods are of three types: region based, edge based and motion based. Region-based segmentation includes threshold segmentation, which sets one or several intensity values as thresholds in the grey or colour histogram to partition the image, as well as segmentation with the aid of entropy, region growing from seed pixels, clustering, graph segmentation, region marking, region splitting and merging and so on. Edge-based methods include edge extraction, active contour (snake) models and so on. Motion-based methods mainly use the difference between two adjacent frames and optical flow. In the preprocessing stage, prior knowledge is not considered. These preprocessing methods are not presented in detail here; some of them (e.g., segmentation) that are related to visual attention [13–20] are presented in Section 7.2.
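As a small illustration of this preprocessing chain, the following sketch (assuming the OpenCV library and a hypothetical input file scene.png) combines spatial-domain low-pass filtering with threshold segmentation, where the threshold is chosen automatically from the grey histogram by Otsu's method and the resulting regions are marked by connected-component labelling.

```python
# Sketch only: Gaussian denoising + Otsu threshold segmentation + region marking.
# "scene.png" is an illustrative file name, not one used in this chapter.
import cv2

image = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

# Low-pass filtering (denoising) in the spatial domain with a 5 x 5 Gaussian kernel.
denoised = cv2.GaussianBlur(image, (5, 5), sigmaX=1.0)

# Threshold segmentation: Otsu's method selects the grey-level threshold that best
# separates the histogram into foreground and background.
threshold, binary = cv2.threshold(denoised, 0, 255,
                                  cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Region marking: label the connected components of the binary image.
num_labels, labels = cv2.connectedComponents(binary)
print(f"Otsu threshold = {threshold}, regions found = {num_labels - 1}")
```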

Conventional object detection and recognition learn visual categories from a training dataset and then identify new instances of those categories, so they require supervised learning on a labelled training dataset. Two stages are necessary for object detection and recognition: one is feature extraction for each window in an image (each key-point, or the whole image), and the other is classification, which compares the new instance with the labelled samples in the training dataset and decides which class it belongs to. In this sense, object recognition involves more than two categories to be identified; if there are only two categories (object/non-object, or foreground/background) to be recognized, the task becomes object detection.

7.1.2 Feature Extraction

Features extracted from an object or image region should succinctly represent the object and distinguish its category from other categories, and this is a crucial issue in object detection and recognition. In general, the extracted features include both global and local features. Global features are obtained by an image transform such as principal components analysis (PCA), independent components analysis (ICA), the Fourier transform, the wavelet transform, intensity or colour histograms and so on. Among these global features, PCA provides an optimal representation from a high-dimensional space to a low-dimensional one while keeping the main information in the image or the image region. Local features include edges, intensity and colour obtained by filtering, as mentioned in the BS model of Chapter 3, geometric features (line, corner, angle, gradient and symmetry) obtained by the early Hough transform [36] and corner-detecting operators [37], and statistical features computed over a local region or pixel-wise (first- and second-order moments, higher-order moments and entropy).
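To make the PCA-based global feature concrete, the sketch below (a minimal illustration assuming NumPy; the patch size and number of components are arbitrary choices, not values from the text) projects flattened image patches onto the leading principal components, giving a low-dimensional representation that preserves the main information.

```python
# Sketch only: PCA as a global feature, reducing flattened image patches from a
# high-dimensional pixel space to a few principal components.
import numpy as np

def pca_features(patches, n_components=16):
    """patches: (N, d) array, one flattened image patch per row."""
    mean = patches.mean(axis=0)
    centred = patches - mean
    # The right singular vectors of the centred data are the principal directions.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components]                # (n_components, d)
    return centred @ basis.T, mean, basis    # low-dimensional features

# Illustrative use: 200 random 32 x 32 patches reduced to 16-dimensional vectors.
patches = np.random.rand(200, 32 * 32)
features, mean, basis = pca_features(patches)
print(features.shape)  # (200, 16)
```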

A challenge for global and local feature extraction is to keep invariance: shift invariance, rotation invariance, affine invariance, luminance invariance and size invariance. When a new instance (object) differs from the same object in the training set in position, orientation, viewing angle or size, it is difficult to detect and recognize, because the global or local features mentioned above are not robust to such wide variation, although some features are invariant in certain respects. For instance, moment features are rotation invariant but not affine invariant, and coefficients of the polar Fourier transform are rotation invariant but have no other invariance.

Recently, local invariant feature detectors and descriptors have been developed [38–40] that can operate under different viewing conditions, even with partial occlusion. A local feature extraction method called the scale invariant feature transform (SIFT) was proposed in [39], and an improved method called speeded up robust features (SURF) was later proposed in [40] to reduce the computation time of SIFT. The underlying ideas of SIFT and SURF are similar. In some applications of attention-based object recognition, the local key-points extracted by SIFT for a new object are matched against the key-points of labelled objects in the training database. For the applications in Sections 7.2.4, 7.3 and 7.5, a brief introduction to SIFT is given as follows.

Feature extraction by SIFT includes four steps: (1) scale-space extrema detection; (2) key-point localization; (3) orientation assignment; (4) construction of the key-point descriptor.

1. Scale-space extrema detection
An original image I, composed of the array I(x, y), is convolved with Gaussian kernel functions of different standard deviations to build groups of blurred images at different scales. For one scale, the blurred image can be represented as

(7.1) $L_G(\sigma, x, y) = G(\sigma, x, y) * I(x, y)$

where $G(\sigma, x, y)$ is the Gaussian function with standard deviation $\sigma$, $*$ denotes convolution and $L_G(\cdot)$ is the blurred image. The larger the standard deviation, the more blurred $L_G(\cdot)$ is. When the standard deviation increases as $\sigma, k\sigma, k^2\sigma, \ldots$ (where $k > 1$; in [39], $k = 2^{1/s}$ with $s$ the number of intervals per octave), a group of blurred images ranging from the original image to a very blurred image is generated, which forms the first level of a pyramid, called the first octave. Notice that in the first octave all the blurred images have the same size as the original image (no down-sampling). The second octave is based on a down-sampled image from the first octave: the blurred image convolved with a Gaussian function of standard deviation $k^{n-1}\sigma$ (if the most blurred image is convolved with standard deviation $k^{n}\sigma$) in the first octave is down-sampled to generate a smaller image (half the original in both length and width), which serves as the new initial image. The same computation as in the first octave is then applied to this new initial image, using Equation 7.1, to create a group of blurred images in the second level of the pyramid (the second octave). The computation continues up to the last octave of the pyramid. Afterwards, the difference-of-Gaussian (DoG) images are generated from the differences of adjacent blurred images, which simulates the DoG function in the primary cortex mentioned in Chapter 2 (Equation 2.4). The difference image between scales $k\sigma$ and $\sigma$ is denoted as

(7.2) $D(\sigma, x, y) = L_G(k\sigma, x, y) - L_G(\sigma, x, y)$

It is worth noting that the number of DoG images is one less than the number of blurred images in each octave. The local extrema of $D(\sigma, x, y)$ are found by comparing each sample point with its 26 surrounding points: eight neighbours in the current DoG image and nine neighbours in each of the scales above and below (8 + 9 + 9 = 26). If the value of the point under consideration is the maximum or minimum among its surroundings, the point is regarded as a candidate key-point (extreme point).
In this first step, the detection of extrema is accomplished over multiresolution and multiscale images, and the DoG processing makes these extreme points insensitive to illumination change; therefore, extrema detection has scale invariance and robustness to illumination variation.
2. Key-point localization
For each extreme point (candidate key-point), interpolation of nearby data is used to determine its accurate position, and extreme points with low contrast or poorly localized along an edge are removed to obtain the true key-points. This stage yields stable key-points and resists noise; as a result, the number of key-points is reduced.
3. Orientation assignment of each key-point
To determine the key-point orientation, a histogram of gradient orientations is computed within a region around the key-point location (the size of the region is 8 × 8 or 16 × 16 pixels). The blurred image $L_G$ whose scale is closest to the scale of the key-point is selected, and the gradient magnitude $m(x, y)$ and orientation $\theta(x, y)$ of all pixels in the region are calculated as

(7.3) $m(x, y) = \sqrt{\left(L_G(x+1, y) - L_G(x-1, y)\right)^2 + \left(L_G(x, y+1) - L_G(x, y-1)\right)^2}$, $\quad \theta(x, y) = \tan^{-1}\dfrac{L_G(x, y+1) - L_G(x, y-1)}{L_G(x+1, y) - L_G(x-1, y)}$

The gradient orientations of all pixels in the surrounding region form an orientation histogram; the contribution of each neighbouring pixel is weighted by its gradient magnitude and by a Gaussian circular window centred at the key-point. The final orientation histogram has 36 bins covering 0° to 360°. The highest peak in the orientation histogram gives the dominant direction of the local gradient, and the second highest peak, if it lies within 80% of the highest peak, gives a secondary direction; if more directions lie within 80% of the highest peak, they are all kept. The local gradient direction of each key-point can then be used to rotate the local feature descriptor, towards rotation invariance. Thus, each key-point in an image has three specific parameters: location, scale (the scale at which the key-point is located) and orientation.
4. Key-point feature descriptor
Once a key-point orientation has been selected, the coordinates of the descriptor and the gradient orientations of the pixels in the region around the key-point are rotated relative to the dominant key-point direction. The feature descriptor is then computed as a set of orientation histograms over 4 × 4 pixel neighbourhoods. If the size of the whole region is 16 × 16 pixels, there are four orientation histograms in each quadrant (up-left, up-right, down-left and down-right) around the key-point. Just as before, the contribution of the pixels in each 4 × 4 area is weighted by the gradient magnitude and a Gaussian circular window. Each key-point descriptor therefore contains an array of 16 orientation histograms around the key-point (each orientation histogram computed over 4 × 4 pixels), and each orientation histogram has 8 orientation bins, so the key-point feature descriptor is a vector with 128 elements (16 × 8 = 128). The vector is normalized to enhance robustness against illumination changes.
After the four steps above, there is a large number of key-points, each with its feature descriptor (128 dimensions), across all scales and locations in the image under consideration. These features are called SIFT features and are strongly robust to variations in scale, location, rotation and illumination. Many object recognition experiments have shown that SIFT features can not only recognize objects with position shifts and orientation changes under different viewing angles, but can also successfully recognize partially occluded objects.
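For readers who wish to experiment, the following sketch (assuming the OpenCV implementation of SIFT and illustrative file names) extracts key-points and 128-dimensional descriptors from a labelled object image and a new scene, and matches them with a distance-ratio test, in the spirit of the key-point matching described above.

```python
# Sketch only: SIFT key-point extraction and matching with OpenCV.
# The file names are illustrative; cv2.SIFT_create requires OpenCV >= 4.4.
import cv2

img_labelled = cv2.imread("labelled_object.png", cv2.IMREAD_GRAYSCALE)
img_scene = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
# Each key-point carries location, scale and orientation; each descriptor is a
# 128-element vector (16 histograms x 8 orientation bins).
kp1, des1 = sift.detectAndCompute(img_labelled, None)
kp2, des2 = sift.detectAndCompute(img_scene, None)

# Match descriptors by Euclidean distance and keep a match only when the nearest
# neighbour is clearly closer than the second nearest (distance-ratio test).
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(f"{len(good)} reliable key-point matches")
```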
Other invariant features can be obtained by the SURF approach [40]. The idea of SURF features is similar to that of SIFT. In SURF, the interest-point detector is based on the Hessian matrix, and the descriptor is a distribution of Haar wavelet responses within the neighbourhood of the interest point. Moreover, the descriptor of an interest point is a 64-dimensional vector, so the SURF approach is more computationally efficient than SIFT.

7.1.3 Object Detection and Classification

Suppose the features of an object (an image patch) or of a key-point in an image are extracted by the approaches mentioned above; each of them can be represented as a feature vector $\mathbf{f} = (f_1, f_2, \ldots, f_d)^T$, referred to as a sample, where $f_i$ is the value of the ith feature, $i = 1, 2, \ldots, d$, and d is the dimension of the feature vector. For object detection, we need to find the location or image patch in a scene where a desired object is located. Object recognition needs to determine the object's label in the scene when there are multiple object classes to be distinguished.

Both detection and recognition belong to the classification problem. Object detection separates the scene into two classes, the object region and the non-object region, as in object segmentation, while object recognition classifies the object into its category as a multiclass problem. A classification program operates on a set of typical objects called the training set, in which each sample is labelled in advance. Suppose that in the training set the feature vector and label of each sample are known, represented as $\mathbf{f}_j^{(l)}$, where the superscript $l \in \{1, 2, \ldots, c\}$ is the label of the jth sample $\mathbf{f}_j$ and c is the number of classes in the training set. Assume class l includes $N_l$ samples in the training set; then the total number of samples is $N = N_1 + N_2 + \cdots + N_c$. Evidently, c = 2 for object detection and c > 2 for object recognition.

Suppose that each sample with d features is one point in a d-dimensional feature space, and the classification aims at finding a discriminant function by means of learning the samples in the training set, such that all samples in the training set can be correctly classified. The discriminant function may be linear, piecewise or non-linear. Figure 7.1 shows the three cases in two-dimensional (f1 and f2) feature space.

Figure 7.1 Discriminant function for two classes and multiple classes: (a) linear discriminant function for A and B classes; (b) non-linear discriminant function for A and B classes; (c) piecewise discriminant function for four classes (A, B, C and D)


Figure 7.1(a) shows samples of two classes that are linearly separable, so the discriminant function is a straight line. In Figure 7.1(b), the two categories are non-linearly separable, and a curved discriminant function can separate the two classes. A multiclass discriminant function is shown in Figure 7.1(c). In the real world, samples of different classes may overlap, and statistical estimation criteria then become necessary, such as minimum error rate estimation and minimum risk estimation based on Bayesian decision theory. Although there are many classification methods in engineering applications, here we only introduce those commonly used in the application examples that follow.

Suppose that all samples in the training set have been correctly labelled. Given a testing sample f with an unknown label, the classification algorithms used to label it are presented as follows.

1. Nearest neighbour method (NN)
This very simple matching method compares the tested sample $\mathbf{f}$ with all samples in the training set by Euclidean distance, and the label of the nearest sample in the training set is assigned to the tested sample $\mathbf{f}$. Mathematically, define the discriminant function $g_l$ for class l as

(7.4) $g_l(\mathbf{f}) = \min_{j = 1, \ldots, N_l} \left\| \mathbf{f} - \mathbf{f}_j^{(l)} \right\|$, where $\mathbf{f}_j^{(l)}$ denotes the jth training sample of class l.

If $g_k(\mathbf{f}) = \min_{l = 1, \ldots, c} g_l(\mathbf{f})$, then $\mathbf{f} \in \omega_k$, where $\omega_k$ is the kth category.
This method can be extended to the k-nearest neighbour (k-NN) method, which finds the tested sample's k nearest samples in the training set and decides the label of the tested sample by voting among those k nearest samples.
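A minimal NumPy sketch of the NN and k-NN rules follows (the tiny two-class training set is purely illustrative):

```python
# Sketch only: nearest neighbour / k-NN classification with Euclidean distance.
import numpy as np
from collections import Counter

def knn_label(f, train_features, train_labels, k=1):
    """Label the test sample f by voting among its k nearest training samples."""
    distances = np.linalg.norm(train_features - f, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative two-class training set in a two-dimensional feature space.
train_features = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
train_labels = np.array([0, 0, 1, 1])
print(knn_label(np.array([0.15, 0.15]), train_features, train_labels, k=3))  # -> 0
```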
2. Decision tree method
This method searches for a feasible match of the tested sample in the training set through a decision tree that consists of a root node, internal nodes and leaf nodes. Each node in the decision tree represents a grouping of samples in the training set. The search method was introduced in Sections 5.3 and 5.4. The decision tree is faster than the NN and k-NN methods because it does not need to match against all the samples of the training set.
3. Neural network method
The feed-forward artificial neural network with a multilayer structure has been used successfully in classification, because there is no need to estimate the samples' statistical properties; only the weights between neurons need to be learned, with the help of the back-propagation (BP) algorithm (a gradient algorithm) in the training stage. In a feed-forward neural network, each neuron in the middle and output layers has a non-linear transfer function; that is, the relation between the output and input of a neuron is a monotonically increasing sigmoid function or a radial basis function. In general, the neural network has three layers: an input layer, a middle (hidden) layer and an output layer. The number of neurons in the input layer equals the dimension of the feature vector (sample), and the number of neurons in the output layer is, in general, the number of classes in a multiclass problem.
For a two-class problem, a single output neuron representing yes/no is sufficient. For the linearly separable case, the neural network only needs two layers (an input layer and an output layer with one neuron). The weights from the input neurons to the output neuron are obtained in the training stage. When a tested sample is input to the learned network with one output neuron, if the output neuron fires (+1), the input sample is an object or an object region; conversely, if the output neuron does not fire (−1), the input sample is not an object or an object region.
For the non-linearly separable or more complex case, since the transfer functions of the neurons in the middle and output layers are non-linear, the neural network can handle any non-linear or complex discrimination provided the number of neurons in the middle layer is sufficient. However, a multilayer network with non-linear neurons often suffers from local minima, and the learning period can be very long. In addition, noise in the tested samples often results in erroneous classification.
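As a hedged illustration of such a feed-forward classifier (using scikit-learn's MLPClassifier as a stand-in for a hand-built BP network; the hidden-layer size and the synthetic data are assumptions, not values from the text):

```python
# Sketch only: a three-layer feed-forward network (input, hidden "mid" layer,
# output) trained by a gradient-based solver, in the spirit of BP learning.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
# Synthetic two-class training data in a d = 4 dimensional feature space.
X = np.vstack([rng.normal(0.0, 1.0, (50, 4)), rng.normal(3.0, 1.0, (50, 4))])
y = np.array([0] * 50 + [1] * 50)

# One hidden layer of 10 sigmoid ("logistic") neurons.
net = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=2000, random_state=0)
net.fit(X, y)
print(net.predict(rng.normal(3.0, 1.0, (1, 4))))  # expected label: 1
```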
4. Support vector machine (SVM) [41]
A linear discriminant function for a linearly separable problem is more convenient for learning the weights of a neural network and avoids local minima. One method is therefore to map the samples from the low-dimensional feature space to a high-dimensional space in which the distribution of the samples becomes linearly separable. This mapping is realized through a kernel function, such as a Gaussian kernel or a polynomial function:

(7.5) $K(\mathbf{f}_i, \mathbf{f}_j) = \Phi(\mathbf{f}_i)^T \Phi(\mathbf{f}_j)$

where K is a kernel function and $\Phi(\cdot)$ denotes the mapping into the high-dimensional space. The optimal hyperplane in the high-dimensional space is designed to separate a two-class problem. For ease of explanation, consider the case in Figure 7.2, with separable two-class samples shown as solid and hollow circles in a two-dimensional space; here the hyperplane is a line that can separate the two classes.

Figure 7.2 Optimal hyperplane in a high-dimensional space


The discriminant function for a two-class issue is

(7.6) $g(\mathbf{f}) = \mathbf{w}^T \Phi(\mathbf{f}) + b$

where $\mathbf{w}$ is the weight vector of a neural network with a single neuron whose input is $\Phi(\mathbf{f})$ and whose output is the label $y \in \{+1, -1\}$ of $\mathbf{f}$, and b is the threshold of the neuron. H is the optimal discriminant hyperplane, which satisfies

(7.7) $H:\; \mathbf{w}^T \Phi(\mathbf{f}) + b = 0$

Two parallel lines H1 and H2 pass through the samples of the two classes that are nearest to the optimal hyperplane (two hollow circles marked with a star and one solid circle marked with a white point in Figure 7.2). Both H1 and H2 are parallel to the optimal hyperplane H, and the distance between H1 and H2 is the margin between the two classes. The optimal separating hyperplane should separate the two classes correctly and make the margin maximum. The samples that H1 and H2 pass through are referred to as the support vectors of the two-class problem. The following procedure shows how to find the optimal hyperplane H.

Let us normalize the discriminant function so that the samples of the two classes satisfy $y_i\left(\mathbf{w}^T \Phi(\mathbf{f}_i) + b\right) \ge 1$. If the support vectors of the two classes are denoted by $\mathbf{f}_{s+}$ and $\mathbf{f}_{s-}$ and substituted into Equation 7.6, then, considering that the support vectors satisfy $\mathbf{w}^T \Phi(\mathbf{f}_{s+}) + b = +1$ and $\mathbf{w}^T \Phi(\mathbf{f}_{s-}) + b = -1$, we can obtain that the margin between hyperplanes H1 and H2 equals $2/\|\mathbf{w}\|$, according to

$\mathbf{w}^T \left( \Phi(\mathbf{f}_{s+}) - \Phi(\mathbf{f}_{s-}) \right) = 2$

$\text{margin} = \dfrac{\mathbf{w}^T}{\|\mathbf{w}\|} \left( \Phi(\mathbf{f}_{s+}) - \Phi(\mathbf{f}_{s-}) \right) = \dfrac{2}{\|\mathbf{w}\|}$

Evidently, maximizing the margin $2/\|\mathbf{w}\|$ means minimizing the norm of the weight vector $\|\mathbf{w}\|$. Thus the classification problem becomes an optimization problem: minimize $\|\mathbf{w}\|^2/2$ under the constraint $y_i\left(\mathbf{w}^T \Phi(\mathbf{f}_i) + b\right) \ge 1$. A cost function with Lagrange multipliers is built to find the optimal hyperplane H, which is defined as

(7.8) $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \dfrac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i \left( \mathbf{w}^T \Phi(\mathbf{f}_i) + b \right) - 1 \right]$

where $\alpha_i \ge 0$ is the Lagrange multiplier for the ith sample, N is the number of samples in the training set, and $y_i$ is the label of sample $\mathbf{f}_i$, which is known in the training set ($y_i = +1$ for samples of one class and $y_i = -1$ for samples of the other). Minimizing the function of Equation 7.8 by setting its partial derivatives with respect to the vector $\mathbf{w}$ and the variable b to zero, we get

$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \Phi(\mathbf{f}_i), \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$

Substituting the above equations into Equation 7.8, we convert the problem into its dual problem, that is, to find the maximum value of

(7.9) $Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \Phi(\mathbf{f}_i)^T \Phi(\mathbf{f}_j)$

subject to the constraints $\sum_{i=1}^{N} \alpha_i y_i = 0$ and $\alpha_i \ge 0$, $i = 1, 2, \ldots, N$.

Maximizing Equation 7.9 by setting the partial derivative with respect to each Lagrange multiplier to zero, we can obtain all the optimal Lagrange multipliers $\alpha_i^*$, $i = 1, 2, \ldots, N$, and the optimal weight vector is

(7.10) $\mathbf{w}^* = \sum_{i=1}^{N} \alpha_i^* y_i \Phi(\mathbf{f}_i)$

It is worth noting that most of the optimal Lagrange multipliers are zero, except those of the support vectors, and the optimal variable $b^*$ can be obtained by setting the second term of Equation 7.8 to zero for the given $\mathbf{w}^*$ and a support vector. The optimal parameters ($\mathbf{w}^*$ and $b^*$) can then be used to label a new sample $\mathbf{f}$ according to the following discriminant function

(7.11) $g(\mathbf{f}) = \operatorname{sgn}\left( \mathbf{w}^{*T} \Phi(\mathbf{f}) + b^* \right)$

In fact, the dimension of $\Phi(\mathbf{f})$ may be very high (even infinite), so the inner-product terms $\Phi(\mathbf{f}_i)^T \Phi(\mathbf{f}_j)$ in Equation 7.9 and $\mathbf{w}^{*T} \Phi(\mathbf{f})$ in Equation 7.11 cannot be computed directly. If we use a symmetric function as the kernel, the inner product of $\Phi(\mathbf{f}_i)$ and $\Phi(\mathbf{f}_j)$ can be expressed through the samples in the original feature space, as in Equation 7.5; then we can rewrite Equations 7.9 and 7.11 as

(7.12) $Q(\boldsymbol{\alpha}) = \sum_{i=1}^{N} \alpha_i - \dfrac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{f}_i, \mathbf{f}_j)$

(7.13) $g(\mathbf{f}) = \operatorname{sgn}\left( \sum_{i=1}^{N} \alpha_i^* y_i K(\mathbf{f}_i, \mathbf{f}) + b^* \right)$

Thus, for the given N training samples $(\mathbf{f}_i, y_i)$, the steps of the SVM are: (1) select a kernel function K; (2) maximize $Q(\boldsymbol{\alpha})$ in Equation 7.12 to obtain the optimal Lagrange multipliers $\alpha_i^*$; (3) compute the optimal weight vector $\mathbf{w}^*$ by using Equation 7.10; (4) find the support vectors (the samples with $\alpha_i^* > 0$) in the training set and compute the variable $b^*$ by setting the second term of Equation 7.8 to zero. In the testing stage, a new sample can be classified by Equation 7.13.

With the SVM method it is very simple to classify new samples; the approach is robust to noise and has been used in many applications of object detection and recognition. Notice that: (1) although the original SVM operates in the case where the two classes are separable, it can still handle the inseparable case by adding a slack term to the constraint condition of Equation 7.8; (2) a multiclass problem can be decomposed into a number of two-class problems (each class versus the remaining classes), so the SVM can also solve multiclass problems.
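The whole procedure of Equations 7.12 and 7.13 is implemented in standard libraries; the sketch below (assuming scikit-learn's SVC with a Gaussian/RBF kernel and a synthetic, non-linearly separable dataset) shows the typical usage rather than this chapter's specific experiments.

```python
# Sketch only: kernel SVM classification in the sense of Equations 7.12-7.13.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Two classes that are not linearly separable in the original feature space:
# one class on a small circle, the other on a larger circle around it.
angles = rng.uniform(0.0, 2.0 * np.pi, 100)
inner = np.c_[0.3 * np.cos(angles[:50]), 0.3 * np.sin(angles[:50])]
outer = np.c_[1.0 * np.cos(angles[50:]), 1.0 * np.sin(angles[50:])]
X = np.vstack([inner, outer])
y = np.array([-1] * 50 + [+1] * 50)

# The RBF kernel plays the role of K; the solver finds the support vectors and
# the optimal Lagrange multipliers internally.
svm = SVC(kernel="rbf", C=10.0)
svm.fit(X, y)
print("support vectors per class:", svm.n_support_)
print(svm.predict([[0.0, 0.1], [0.9, 0.0]]))  # expected: [-1, +1]
```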

All the classification algorithms mentioned above need to know the clusters of all labelled samples in a training set (NN, k-NN, decision tree) or to learn a discriminant function (neural network, SVM) in advance, and this requires prior knowledge (experience). In addition, a full search of the scene is often required. For example, to find an object patch (or object feature vector) in a scene, all possible patches (or feature vectors), obtained by shifting the patch window across the scene, are compared with the object patch in the training set, or fed directly into the learned neural network. This full-search strategy is time consuming and does not meet the requirements of real-world applications. A pure visual attention method, or visual attention combined with the engineering methods introduced above, can greatly improve both effectiveness and efficiency.
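The cost of such a full search can be seen in the following sketch of a sliding-window scan (the window size, stride and scoring function are illustrative assumptions); every window must be scored by the classifier, which is exactly the effort that attention mechanisms aim to reduce.

```python
# Sketch only: exhaustive sliding-window search over a scene.
import numpy as np

def sliding_window_search(image, window=(32, 32), stride=8, score_fn=None):
    """Return the window position with the highest classifier score."""
    h, w = image.shape
    wh, ww = window
    best_score, best_pos = -np.inf, None
    for y in range(0, h - wh + 1, stride):
        for x in range(0, w - ww + 1, stride):
            patch = image[y:y + wh, x:x + ww]
            score = score_fn(patch)        # e.g. an SVM decision value for "object"
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_pos, best_score

# Illustrative use with a trivial scoring function (mean intensity).
image = np.random.rand(128, 128)
print(sliding_window_search(image, score_fn=lambda p: p.mean()))
```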
