Chapter 7
Application Challenges of Underwater Vision

Nuno Gracias1, Rafael Garcia1, Ricard Campos1, Natalia Hurtos1, Ricard Prados1, ASM Shihavuddin2, Tudor Nicosevici1, Armagan Elibol3, Laszlo Neumann1 and Javier Escartin4

1Computer Vision and Robotics Institute, University of Girona, Girona, Spain

2École Normale Supérieure, Paris, France

3Department of Mathematical Engineering, Yildiz Technical University, Istanbul, Turkey

4Institute of Physics of Paris Globe, The National Centre for Scientific Research, Paris, France

7.1 Introduction

Underwater vehicles, either remotely operated or autonomous, have enabled a growing range of applications over the last two decades. Imaging data acquired by underwater vehicles have seen multiple applications in the context of archeology (Eustice et al. 2006a), geology (Escartin et al. 2009; Zhu et al. 2005), or biology (Pizarro and Singh 2003) and have become essential in tasks such as shipwreck inspection (Drap et al. 2008), ecological studies (Jerosch et al. 2007; Lirman et al. 2007), environmental damage assessment (Gleason et al. 2007a; Lirman et al. 2010), or detection of temporal changes (Delaunoy et al. 2008), among others. Despite their acquisition constraints, which often require the underwater vehicle to navigate at a close distance to the seafloor or the structure of interest, imaging sensors have the advantage of purveying higher resolution and lower noise when compared with the traditional sensors for seafloor surveying such as multibeam echosounders or side-scan sonars. Such higher resolution naturally promotes easier interpretation of finer-scale benthic features.

One of the most useful tools to carry out the above-mentioned scientific studies is the generation of optical maps. These maps provide the scientists with short-range, high-resolution visual representations of the ocean floor, enabling a detailed analysis of the structures of interest. Additionally, these offline-generated maps can be used by the vehicle to precisely locate themselves in the environment with respect to previously visited areas.

Rather than reviewing particular application cases, this chapter focuses on describing computer vision techniques that are common to most real-world applications. These techniques address the use of the collected data after the mission, without the constraints of real-time operation, and are offline by nature.

We start by addressing the creation of 2D mosaics, which is currently the most widely used method for organizing underwater imagery in mapping and inspection applications. Key challenges here are the efficient co-registration of potentially very large image sets and the adequate blending of the images into seamless visual maps that preserve the important visual content for the applications at hand. The inclusion of 2.5D information is addressed next, in the context of multimodal mapping, using acoustically derived bathymetry and optical images. For fine-scale estimation of 3D models of benthic structures, structure from motion (SfM) and multiview stereo techniques are discussed. A key challenge for underwater applications is the generation of meaningful object surfaces from point clouds that are potentially very noisy and highly corrupted. The interpretation of optical maps is addressed next by reviewing some of the recently proposed techniques for image segmentation. These techniques are motivated by the large volume of image data that modern underwater vehicles can provide, which is beyond what is feasible for human experts to analyze manually. Finally, the topic of mapping with modern high-frequency imaging sonars is addressed, motivated by the fact that they allow for the use of computer vision techniques that have been developed for optical images.

7.2 Offline Computer Vision Techniques for Underwater Mapping and Inspection

7.2.1 2D Mosaicing

Building a 2D mosaic is a task that involves two main steps. From a geometrical point of view, the acquired images should be aligned and warped accordingly into a single common reference frame. From a photometric point of view, the rendering of the mosaic should be performed through blending techniques, which allow dealing with differences in appearance of the acquired stills and reduce the visibility of the registration inaccuracies between them (see Figure 7.1).

Large-scale deep-ocean surveys may be composed of hundreds to hundreds of thousands of images, which are affected by several underwater phenomena, such as scattering and light attenuation. Furthermore, the acquired image sets may present small or nonexistent overlaps between consecutive frames. Navigation data coming from acoustic positioning sensors (ultra short base line (USBL), long base line (LBL)), velocity sensor Doppler velocity log (DVL), inclinometers, or gyroscopes become in that case essential to estimate the vehicle trajectory.

nfgz001

Figure 7.1 Underwater mosaicing pipeline scheme. The Topology Estimation, Image Registration, and Global Alignment steps can be performed iteratively until no new overlapping images are detected

7.2.1.1 Topology Estimation

Unfortunately, if positioning data coming from USBL, LBL, or DVL are not available, registering time-consecutive images, which are assumed to share an overlapping area, becomes the only method to estimate the trajectory of the robot. This dead-reckoning estimate suffers from a rapid accumulation of registration errors, which translates into drifts from the real trajectory followed by the vehicle. However, it does provide valuable information for predicting overlap between non-time-consecutive images. The matching between non-time-consecutive images is fundamental to accurately recover the path followed by the vehicle. This task is performed using global alignment methods (Capel 2004; Elibol et al. 2008 2011a; Ferrer et al. 2007; Gracias et al. 2004; Sawhney et al. 1998; Szeliski and Shum 1997). The refined trajectory can be used to predict additional overlap between nonsequential images, which can then be attempted to match. The iterative process involving the registration of the new image pairs and the subsequent optimization is known as topology estimation (Elibol et al. 2010 2013). Even when navigation data are available, performing a topology estimation step is still required to recover accurate estimates of the vehicle path, and its value becomes even higher when dealing with large-scale surveys involving hundreds of thousands of images (Figure 7.2).

nfgz002

Figure 7.2 Topology estimation scheme. (a) Final trajectory obtained by the scheme proposed in Elibol et al. (2010). The first image frame is chosen as a global frame, and all images are then translated in order to have positive values in the axes. The x and y axes are in pixels, and the scale is approximately 150 pixels per meter. The plot is expressed in pixels instead of meters since the uncertainty of the sensor used to determine the scale (an acoustic altimeter) is not known. The red lines join the time-consecutive images while the black ones connect non-time-consecutive overlapping image pairs. The total number of overlapping pairs is 5412. (b) Uncertainty in the final trajectory. Uncertainty of the image centers is computed from the covariance matrix of the trajectory (Ferrer et al. 2007). The uncertainty ellipses are drawn with a 95% confidence level. (c) Mosaic built from the estimated trajectory

When dealing with data sets of thousands of images, exhaustive all-to-all image pair matching strategies are infeasible for topology estimation. Consequently, the use of more efficient techniques is required. Elibol et al. (2010) proposed an extended Kalman filter (EKF) framework, aimed at minimizing the total number of matching attempts while obtaining the most accurate trajectory. This approach predicts new possible image pairs considering the uncertainty of the recovered path. Another solution to the topology estimation problem, based on a bundle adjustment (BA) framework, was proposed by Elibol et al. (2011b). The approach combines a fast image similarity criterion with a minimum spanning tree (MST) solution to obtain an estimate of the trajectory topology. Then, image matching between pairs with a high probability of overlap is used to improve the accuracy of the estimate.
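To make the MST idea more tangible, the following Python sketch (a rough illustration under assumed inputs, not the actual implementation of Elibol et al. (2011b)) selects an initial set of image pairs to match from a hypothetical all-to-all similarity matrix by computing a minimum spanning tree over the corresponding cost graph:

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def initial_pairs_from_similarity(similarity):
    """Pick an initial set of image pairs to attempt matching, given a
    (hypothetical) all-to-all similarity matrix, by keeping only the edges
    of a minimum spanning tree over the similarity-derived cost graph."""
    cost = 1.0 / (similarity + 1e-6)   # higher similarity -> lower cost
    np.fill_diagonal(cost, 0.0)        # no self-edges
    mst = minimum_spanning_tree(cost).tocoo()
    return list(zip(mst.row.tolist(), mst.col.tolist()))

# Toy usage with a random symmetric similarity matrix of five images
rng = np.random.default_rng(0)
s = rng.random((5, 5))
similarity = (s + s.T) / 2.0
print(initial_pairs_from_similarity(similarity))
```

The resulting tree gives a connected initial topology, which can then be refined by matching additional pairs with a high probability of overlap.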

7.2.1.2 Image Registration

The image registration problem (Brown et al. 2007) consists of finding an appropriate planar transformation that aligns two or more 2D images taken from different viewpoints. The aim is to overlay all of them into a single, common reference frame (see Figure 7.3).

nfgz003

Figure 7.3 Geometric registration of two different views (a and b) of the same underwater scene by means of a planar transformation, rendering the first image on top (c) and the second image on top (d)

There are two main groups of image registration methods: direct methods and feature-based methods. The first group, also known as featureless methods, relies on the maximization of photometric consistency over the overlapping image regions and is known to be appropriate for describing small translations and rotations (Horn and Schunck 1981; Shum and Szeliski 1998; Szeliski 1994). Nevertheless, in the context of underwater imaging using downward-looking cameras attached to an autonomous underwater vehicle (AUV) or remotely operated vehicle (ROV), it is common to acquire stills using stroboscopic lighting. This is due to power consumption restrictions affecting vehicle autonomy and leads to a low-frequency image acquisition. Consequently, the images do not have enough overlap to be registered using direct methods. For that reason, feature-based methods are the most widely used in the literature to register not only underwater but also terrestrial and aerial imagery. Feature-based methods use a sparse set of salient points (Bay et al. 2006; Beaudet 1978; Harris and Stephens 1988; Lindeberg 1998; Lowe 1999) and correspondences between image pairs to estimate the transformation between them.

Image registration using feature-based methods involves two main stages. First, in the feature detection step, some interest or salient points should be located in one or both images of an image pair. Next, in the feature matching step, these salient points should be associated according to a given descriptor. This procedure is also known as the resolution of the correspondence problem. Depending on the strategy used to detect and match the features, two main strategies can be distinguished.

A first feature-based registration strategy relies on detecting salient points in one image using a feature detector algorithm, such as Harris (Harris and Stephens 1988), Laplacian (Beaudet 1978), or Hessian (Lindeberg 1998), and recognizing the same features in the other. In this case, the identification is performed using cross-correlation or a sum of squared differences (SSD) measure, involving the pixel values of a given area surrounding the interest point. A second strategy consists of detecting interest points in both images using invariant image detectors/descriptors, such as SIFT (Lowe 1999), its faster variant SURF (Bay et al. 2006), or others, and solving the correspondence problem by comparing their descriptor vectors. The descriptors have been shown to be invariant to a wide range of geometric and photometric transformations between the image pairs (Schmid et al. 1998). This robustness becomes highly relevant in the context of underwater imaging, where viewpoint changes and significant alterations in the illumination conditions are frequent. Furthermore, the turbidity of the medium has been proven to have an impact on the performance of the feature detectors (Garcia and Gracias 2011) (Figure 7.4).
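As a minimal illustration of the second strategy (detection and description in both images followed by descriptor matching), the following Python/OpenCV sketch extracts SIFT features from two hypothetical frames and retains matches that pass Lowe's ratio test; the file names and the 0.75 ratio threshold are placeholder choices:

```python
import cv2

# Load two presumably overlapping frames (placeholder file names)
img1 = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# Detect SIFT keypoints and compute their descriptors in both images
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Solve the correspondence problem by comparing descriptor vectors,
# keeping only matches that pass Lowe's ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
knn = matcher.knnMatch(des1, des2, k=2)
good = [m[0] for m in knn
        if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
print(f"{len(good)} putative correspondences")
```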

nfgz004

Figure 7.4 Main steps involved in the pairwise registration process. The feature extraction step can be performed in both images of the pair, or only in one. In the latter case, the features are identified in the second image after an optional image warping based on an estimated transformation

7.2.1.3 Motion Estimation

Once a set of correspondences has been found for an image pair, they can be used to compute a planar transformation describing the motion of the camera between the two views. This transformation is stored in a homography matrix $H$ (Hartley and Zisserman 2003; Ma et al. 2003), which can describe a motion with up to eight degrees of freedom (DOF).

The homography matrix $H$ encodes information about the camera motion and the scene structure, a fact that facilitates establishing correspondences between both images. In general, $H$ can be computed from a small number of point correspondences.

The homography accuracy (Negahdaripour et al. 2005) is strongly tied to the quality of the correspondences used for its calculation. The homography estimation algorithms assume that the only source of error is the measurement of the locations of the points, but this assumption is not always true inasmuch as mismatched points may also be present. There are several factors that can influence the quality of the detected correspondences. Images can suffer from several artifacts, such as nonuniform illumination, sun flickering (in shallow waters), shadows (especially in the presence of artificial lighting), and digital noise, among others, which can cause the matching to fail. Furthermore, moving objects (including shadows) may induce correspondences which, despite being correct, do not obey the dominant motion between the two images. These correspondences are known as outliers. Consequently, it is necessary to use an algorithm able to discern correct from incorrect correspondences. Two main outlier rejection strategies are widely used in the literature (Huang et al. 2007): RANSAC (Fischler and Bolles 1981) and LMedS (Rousseeuw 1984). The efficiency of LMedS is very low in the presence of Gaussian noise (Li and Hu 2010; Rousseeuw and Leroy 1987). For this reason, RANSAC is the most widely used method in the underwater imaging context.
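Continuing the matching sketch above, a hedged example of outlier rejection during homography estimation with RANSAC (OpenCV); the 3-pixel reprojection threshold is an arbitrary choice:

```python
import numpy as np
import cv2

# Corresponding point coordinates from the feature matching step,
# in the (N, 1, 2) layout expected by OpenCV
pts1 = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
pts2 = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

# Estimate the planar homography while rejecting outliers with RANSAC
H, inlier_mask = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
print("inliers:", int(inlier_mask.sum()), "out of", len(good))
```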

7.2.1.4 Global Alignment

Pairwise registration of images acquired by an underwater vehicle equipped with a downward-looking camera cannot be used on its own as an accurate trajectory estimation strategy. Image noise, illumination issues, and the violation of the planarity assumption unavoidably lead to a cumulative drift. Therefore, detecting correspondences between nonconsecutive frames becomes an important step in order to close a loop and use this information to correct the estimated trajectory.

The homography matrix ${}^{1}H_{k}$ represents the transformation of the $k$th image with respect to the global frame (assuming the first image frame as the global frame) and is known as the absolute homography. This matrix is obtained by concatenating the relative homographies ${}^{k-1}H_{k}$ between the $(k-1)$th and $k$th images of a given time-consecutive sequence. As mentioned earlier, relative homographies have limited accuracy, and computing absolute homographies by cascading them results in cumulative error. In the case of long sequences, this drift will cause misalignments between neighboring images belonging to different transects (see Figure 7.5).
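To make the concatenation explicit, a minimal numpy sketch (assuming a list of 3×3 relative homographies has already been estimated pairwise) that cascades them into absolute homographies referred to the first frame; in practice each relative estimate carries some error, so the product drifts as described above:

```python
import numpy as np

def absolute_homographies(relative_Hs):
    """Cascade relative homographies (each relating image k to image k-1)
    into absolute homographies relating each image to the first frame.
    Returns [I, 1H2, 1H3, ...]."""
    absolute = [np.eye(3)]
    for H_rel in relative_Hs:
        H_abs = absolute[-1] @ H_rel           # registration error accumulates here
        absolute.append(H_abs / H_abs[2, 2])   # keep a normalized scale
    return absolute
```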

nfgz005

Figure 7.5 Example of error accumulation from registration of sequential images. The same benthic structures appear in different locations of the mosaic due to error accumulation (trajectory drift)

The main benefit of global alignment techniques is the use of the closing-loop information to correct the pairwise trajectory estimation by reducing the accumulated drift.

There are several methods in the literature intended to solve the global alignment problem (Szeliski 2006). Global alignment methods usually require the minimization of an error term based on the location of the image correspondences. These methods can be classified according to the domain where this error is defined, leading to two main groups: image frame methods (Capel 2001; Ferrer et al. 2007; Marzotto et al. 2004; Szeliski and Shum 1997) and mosaic frame methods (Can et al. 2002; Davis 1998; Gracias et al. 2004; Kang et al. 2000; Pizarro and Singh, 2003; Sawhney et al. 1998).

7.2.1.5 Image Blending

Once the geometric registration of all the mosaic images has been carried out during the global alignment step, the photomosaic can be rendered. In order to produce an informative representation of the seafloor that can be used by scientists to perform their benthic studies, blending techniques are required to obtain a seamless and visually pleasant mosaic (see Figure 7.6).

nfgz006

Figure 7.6 Photomosaic built from six two-megapixel images. The mosaic shows noticeable seams in (a), where the images have only been geometrically transformed and sequentially rendered on the final mosaic canvas, each image on top of the previous one. After applying a blending algorithm, the artifacts (image edges) disappear from the resulting mosaic (b).

Images courtesy of Dan Fornari (Woods-Hole Oceanographic Institution)

On the one hand, the geometric warping of the images forming the mosaic may lead to inconsistencies along their boundaries due to registration errors, moving objects, or the presence of 3D structures in the scene violating the planarity assumption on which the 2D registration relies. On the other hand, differences in image appearance due to changes in the illumination conditions, oscillations in the distance to the seafloor, or the turbidity of the underwater medium cause the image boundaries to be easily noticeable (Capel 2004). Consequently, the consistency of the global appearance of the mosaic can be highly compromised. The main goal of image blending techniques is to obtain a homogeneous appearance of the generated mosaic, not only from the esthetic but also from the informative point of view, enhancing the visual data when needed, to obtain a continuous and consistent representation of the seafloor.

There are two main groups of blending algorithms depending on their main working principle (Levin et al. 2004): transition smoothing methods and optimal seam finding methods. The transition smoothing methods (also known as feathering (Uyttendaele et al. 2001) or alpha blending methods (Porter and Duff 1984)) reduce the visibility of the joining areas between images by combining the overlapping information. The optimal seam finding methods rely on searching for the optimal path along which to cut the images within their common overlapping area, where the photometric differences between both are minimal (Davis 1998; Efros and Freeman 2001). Combining the benefits of both groups of techniques leads to a third group (e.g., Agarwala et al. (2004) and Milgram (1975)), which can be called hybrid methods. These methods reduce the visibility of the seams by smoothing the common overlapping information around a photometrically optimal seam.
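As a toy illustration of transition smoothing, the sketch below feathers two already warped and registered contributions using weights derived from the distance to each image's boundary (numpy/OpenCV); real pipelines are considerably more elaborate, and the variable names are placeholders:

```python
import numpy as np
import cv2

def feather_blend(img1, img2, mask1, mask2):
    """Blend two warped, registered images in the mosaic frame with weights
    proportional to the distance from each image's boundary (feathering).
    img1, img2: float32 HxWx3 images; mask1, mask2: uint8 validity masks."""
    w1 = cv2.distanceTransform(mask1, cv2.DIST_L2, 3)
    w2 = cv2.distanceTransform(mask2, cv2.DIST_L2, 3)
    total = w1 + w2
    total[total == 0] = 1.0            # avoid division by zero outside both masks
    alpha = (w1 / total)[..., None]    # per-pixel weight of the first image
    return alpha * img1 + (1.0 - alpha) * img2
```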

Image blending methods can be classified according to their capabilities and weaknesses, which may make them appropriate for certain applications but unfeasible for others. Considering the main principle of the techniques, the combination of transition smoothing around an optimally found boundary (i.e., the use of hybrid methods) seems to be the most adopted approach in the recent literature. The tolerance of the techniques to moving objects is strongly tied to this principle, with all the optimal seam finding methods able to deal with this issue up to a certain degree. This is due to the fact that optimal seam finding methods place the cut path in areas where photometric differences are small; as a consequence, the cut path avoids overlapping areas with moving objects. Concerning the domain in which the blending is performed, both luminance and gradient domains are widely used, the latter having gained special importance in recent years (Mills and Dudek 2009; Szeliski et al. 2008; Xiong and Pulli 2009). One of the benefits of the gradient domain is the easy reduction of exposure differences between neighboring images, which does not require additional pre-processing, since image gradients are not sensitive to differences in exposure. Ghosting and double contouring effects arise when fusing information between images affected by registration errors or a strong violation of the planarity assumption. The ghosting effect occurs when wrongly registered low-frequency information is fused, while double contouring affects the high-frequency information. Given that all transition smoothing methods are affected by this artifact, restricting the fusion to a limited-width area is required to reduce its visibility. The color treatment is similar in most of the techniques in the literature: the blending is performed in a channel-wise manner, independent of the number of channels of the source images (i.e., color or gray-scale images). There are few methods in the literature conceived to work in real time (Zhao 2006), which requires sequential processing. However, methods working in a global manner are better conditioned to deal with problems such as exposure compensation, so as to ensure global appearance consistency; sequential processing tends to accumulate drift in the image corrections, which may strongly depend on the first processed and stitched image. Few blending methods claim to work with high dynamic range images, although gradient-based blending methods are able to deal with them intrinsically due to the nature of the domain. Methods able to process high dynamic range images require the application of tone mapping algorithms (Neumann et al. 1998) in order to appropriately display the results: the high dynamic range must be compressed to allow visualization on low dynamic range devices, such as screen monitors or printers.

Large-scale underwater surveys often lead to extensive image data sets composed of thousands to hundreds of thousands of stills. Given the significant computational resources that processing this amount of data may require, the use of specific optimized techniques such as that proposed by Prados et al. (2014) is needed. The first stages of the pipeline involve pre-processing of the input sequence, required to reduce artifacts such as the inhomogeneous lighting of the images, mainly due to the use of limited-power artificial light sources and the phenomena of light attenuation and scattering. In this step, a depth-based non-uniform-illumination correction is applied, which dynamically computes an adequate illumination compensation function depending on the distance of the vehicle to the seafloor. Next, a context-dependent gradient-based image enhancement is used, which equalizes the appearance of neighboring images when they have been acquired at different depths or with different exposure times. The pipeline follows with the selection of each image's contribution to the final mosaic based on several criteria, such as image quality (i.e., image sharpness and level of noise) and acquisition distance. This step discards redundant and low-quality image information, which may otherwise degrade the surrounding contributions. Next, the optimal seam placement between all the images is found, minimizing the photometric differences around the cut path and discarding moving objects. A gradient blending in a narrow region around the optimally computed seams is then applied, in order to minimize the visibility of the joining regions as well as to refine the appearance equalization across all the involved images. The use of a narrow fusion region reduces the appearance of artifacts, such as ghosting or double contouring, due to registration inaccuracies. Finally, a strategy that allows processing gigamosaics composed of tens of thousands of images on conventional hardware is applied. The technique divides the whole mosaic into tiles, processes them individually, and seamlessly blends them together again using another method that requires low computational resources.

7.2.2 2.5D Mapping

Apart from the optical images acquired at a close distance to the seafloor, AUVs and ROVs can acquire bathymetric information when traveling further from the scene. Acoustic information has a lower resolution than optical imagery, but provides a rough approximation of the seabed relief, which may become highly informative during scene interpretation. Both optical and acoustic data can be combined by projecting the detailed 2D high-resolution photomosaics onto a low-resolution triangle mesh generated from the bathymetry, leading to what can be termed 2.5D mosaics (see Figure 7.7). Formally, these mosaics cannot be considered 3D inasmuch as acoustic data can only provide elevation information.

nfgz007

Figure 7.7 2.5D map of a Mid-Atlantic Ridge area of approximately c07-math-012 resulting from the combination of a bathymetry and a blended photomosaic of the generated high-resolution images. The obtained scene representation provides scientists with a global view of the interest area as well as with detailed optical information acquired at a close distance to the seafloor. Data courtesy of Javier Escartin (CNRS/IPGP, France)

There is a great deal of literature on the different uses of this scanning configuration for sonar mapping, and automatic methods to build these maps out of raw sonar readings have been a hot research topic in recent years (Barkby et al. 2012; Roman and Singh 2007). The application of these techniques has proven to provide key benefits for other research areas such as archeology (Bingham et al. 2010) or geology (Yoerger et al. 2000). Given the depth readings, retrieving a surface representation is straightforward. By defining a common plane, all the measurements can be projected onto it in 2D. Then, an irregular triangulation such as Delaunay or a gridding technique can be applied to the projections on the plane, and the third coordinate is used to lift the surface into a 2.5D representation. These maps can be enhanced by using photomosaics for texture, thus producing a multimodal representation of the area (Campos et al. 2013b; Johnson-Roberson et al. 2009). As a result, the triangulated depth map provides scientists with a general view of the studied area, while the textured photomosaics allow reaching a high level of detail in the areas of special interest.
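A minimal Python sketch of this 2.5D lifting, assuming the bathymetric soundings are already expressed as (x, y, depth) triplets on a common reference plane (scipy):

```python
import numpy as np
from scipy.spatial import Delaunay

def terrain_mesh_from_soundings(points_xyz):
    """Build a 2.5D triangle mesh from bathymetric soundings.
    points_xyz: (N, 3) array of (x, y, depth) measurements.
    The x-y coordinates are triangulated on the reference plane and the
    depth is used to lift the surface. Returns (vertices, triangles)."""
    tri = Delaunay(points_xyz[:, :2])    # irregular triangulation on the plane
    return points_xyz, tri.simplices     # the third coordinate lifts the mesh
```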

Optical and acoustic data, namely maps, should be geometrically registered to align their position and scale. Given that both types of information are difficult to correlate when interpreted as imagery, the registration mainly relies on the vehicle positioning data. Additionally, manually selected fiducial points can be used to refine the alignment of the two maps. Since the resolutions of the two differ significantly (by some orders of magnitude), small inaccuracies in the registration might be negligible.

On the other hand, although less extensively, cameras are also used for height-map reconstruction. Regarding full 3D reconstruction, the methods in the literature are more concerned with recovering the point set, leaving aside the surface reconstruction part. Thus, the focus is usually put on the registration of multiple camera positions, using either SfM systems (Bryson et al. 2012; Nicosevici et al. n.d.) or SLAM approaches (Johnson-Roberson et al. 2010) as well as monocular or stereo cameras (Mahon et al. 2011). The application of these methods in the underwater domain has to face the added complexity of dealing with noisy images, owing to the previously mentioned aberrations introduced by the rapid attenuation of light in the water medium, the nonuniform lighting of the image, and forward/backward scattering. This translates into the methods requiring further pre-processing of the images to alleviate these issues, and having to deal with larger errors in the reconstructions than their less noisy on-land counterparts. Furthermore, as previously stated, the common configuration of downward-looking cameras leads methods to represent the shape underlying these points as a height-map over a common reference plane, which may be easily extracted using principal component analysis (PCA) (Nicosevici et al. n.d.; Singh et al. 2007).
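For completeness, a small numpy sketch of how such a reference plane can be extracted with PCA from a near-planar point cloud (centroid plus least-variance direction as the plane normal); this is an illustration, not the implementation of the cited works:

```python
import numpy as np

def reference_plane_pca(points):
    """Fit a reference plane to a near-planar point cloud with PCA.
    points: (N, 3) array. Returns (centroid, normal), where the normal is
    the direction of least variance (third principal component)."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid, full_matrices=False)
    normal = vt[2]                      # smallest-variance direction
    return centroid, normal
```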

7.2.3 3D Mapping

In this section, we deal with the problem of 3D mapping. Recall that in both the 2D and 2.5D cases, we assume that there exists a base plane onto which the images can be projected, that is, once the motion is extracted, building the map mainly consists of warping and deforming the images onto this common plane. However, note that in the 3D case the problem becomes more complex, as this base representation is missing.

This previously mentioned assumption on the area of interest being close to planar has motivated the use of scanning sensors located on the bottom of underwater vehicles, with their optical axis orthogonal to the seafloor. In fact, this configuration is also helpful for the problem of large-area mapping: by using downward-looking cameras/sensors, we attain overview capabilities able to provide an overall notion of the shape of the scene.

As a consequence, very few proposals on underwater mapping deal with the 3D reconstruction problem in its more general terms. If we drop both the assumption of the scene being 2D/2.5D and the downward-looking configuration of the sensor, we allow the observation of arbitrary 3D shapes. In both acoustic and optical cases, the trajectory of the vehicle is required in order to be able to reconstruct the shape of the object. On the one hand, for the acoustic case, the individual 2D swaths provided by a multibeam sensor can be composed with the motion of the vehicle to obtain a 3D point cloud. On the other hand, for the optical case, once the trajectory of the camera is available, the 3D positions of the points can be triangulated using the line-of-sight rays emerging from each camera–2D feature pair.

Thus, the mapping problem is closely related to that of trajectory estimation. In the optical domain, those two problems are often formulated as SfM methods, when dealing with pure optical data (Bryson et al. 2012; Nicosevici et al. n.d.), and more generic SLAM approaches, when there is additional information available other than optical (Johnson-Roberson et al. 2010). In both cases, the optical constraint guiding the optimization of the vehicle trajectory and the scene structure is the reprojection error. For a given 3D point, this error is defined as the difference between its corresponding 2D feature in the image, and the back-projection of the 3D point using both the camera pose and its internal geometry. Note that only relevant feature points in the images will be used during the process, which leads to the reconstructed scene being described as a sparse point cloud. In order to get a more faithful representation of the object, dense point set reconstruction methods are commonly used with the estimated trajectory (Furukawa and Ponce 2010; Yang and Pollefeys 2003).
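A hedged numpy sketch of the reprojection error for a single 3D point observed by a pinhole camera with known intrinsics and pose (all symbols here are illustrative, not tied to any particular cited system):

```python
import numpy as np

def reprojection_error(X, x_obs, K, R, t):
    """Reprojection error of one 3D point for a pinhole camera.
    X: (3,) 3D point in world coordinates; x_obs: (2,) observed 2D feature;
    K: (3, 3) intrinsics; R, t: world-to-camera rotation and translation.
    Returns the pixel distance between the observation and the projection."""
    x_cam = R @ X + t                  # point expressed in the camera frame
    x_hom = K @ x_cam                  # homogeneous image coordinates
    x_proj = x_hom[:2] / x_hom[2]      # perspective division
    return np.linalg.norm(x_proj - x_obs)
```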

Once the trajectory of the vehicle has been recovered, and regardless of whether the scanning technology used is acoustic or optical, the scene is always retrieved as a point cloud. Note that this also applies to other 3D recovery techniques in land and/or aerial applications (e.g., lidar scans). Regardless of the technology used, the scanning of an object is always in the form of discrete measurements (i.e., points) that are assumed to lie on the surface of the object. Moreover, the resulting point cloud is unstructured, and no assumption on the shape of the object can be made.

It is clear that having the scene represented by a point cloud complicates further interpretation and/or processing of these data. First, it is obvious that these point sets are difficult to visualize. Since the points are infinitesimally small, from any given viewpoint a user cannot tell which part of the object should be visible and which should be occluded, and only expert users are able to interpret the data by moving around the point set with the help of viewing software. Additionally, working with points alone makes further computations difficult to apply directly to the point set. Basically, since the connectivity between those points is unknown, we are not able to compute simple measures on the object (e.g., areas, intersections). Even if some approximations can be extracted directly from the point set, the computations are far simpler when the continuous surface of the object is known.

We can then conclude that a surface representation of the object, derived from the discrete measurements in the point cloud, is needed to describe a 3D object. This process is known in the literature as the surface reconstruction problem from a set of unorganized points (in the following referred to as the surface reconstruction problem). A triangle mesh is commonly used as the representation of the surface we aim to obtain, given the ability of modern hardware graphics architectures to efficiently display triangle primitives. In this way, we have a piecewise linear approximation of the surface of the scene that can be used to aid visualization and further computation. Note that the surface reconstruction problem is inherently ill-posed, since given a set of points one may define several surfaces agreeing with the samples. Additionally, we have the added problem of the points in this set not being perfectly exact measurements of the object. In real-world data sets, and regardless of the methodology used to retrieve the point set, these are affected by two main problems: noise and outliers. Noise refers to the repeatability (i.e., the variance) of the measuring sensor, whereas outliers refer to badly measured points produced by errors during the point retrieval process. Thus, we need to reconstruct the surface while attenuating the noise in the input and disregarding outliers.

In the following sections, we overview the state of the art in surface reconstruction, both generic approaches and how the underwater community is starting to gain interest in the topic. Then, we show some additional applications and further processing that can be applied to the resulting triangle meshes.

7.2.3.1 Surface Reconstruction

Due to the above-mentioned common downward-looking configuration of the sensor, there are few approaches tackling the problem of 3D mapping in the underwater community. One of the few examples in this direction can be found in Johnson-Roberson et al. (2010), where they use the surface reconstruction method of Curless and Levoy (1996), originally devised for unrestricted 3D reconstruction. Nevertheless, the added value of applying this method instead of a 2.5D approximation is not clear in this case, since the camera still observes the scene in a downward-looking configuration. Another proposal, this time using a forward-looking camera and working on a more complex structure, is that presented in Garcia et al. (2011), where an underwater hydrothermal vent is reconstructed using dense 3D point cloud retrieval techniques and the Poisson surface reconstruction method (Kazhdan et al. 2006).

For downward-looking configurations, the straightforwardness of changing from depth readings to 2.5D representations makes the reconstruction of the scene as a triangulated terrain model just a side result. However, we are now concerned with a more general scenario, where the sensor can be mounted in a less restrictive configuration, that is, located anywhere on the robot and with any orientation. With this new arrangement, objects can be observed from arbitrary viewpoints, so that the retrieved measurements are no longer suitable for projection onto a plane (and hence, a 2.5D map cannot be built). Viewing the object from a wider range of positions allows a better understanding of its global shape, since its features can be observed from angles that are more suitable to their exploration. An example in this direction is depicted in Figure 7.8, which shows a survey of an underwater hydrothermal vent. It is clear in Figure 7.8b that it is not possible to recover the many details and concavities of this chimney using just a 2.5D representation. The more general configuration of the camera, in this case mounted at the front of the robot and oriented at an angle of approximately c07-math-013 with respect to the gravity vector, allowed a more detailed observation of the area, while also attaining higher resolution. Consequently, it is obvious that for these new exploration approaches, the problem of surface reconstruction is of utmost importance to complete the 3D mapping pipeline.

nfgz008

Figure 7.8 (a) Trajectory used for mapping an underwater chimney at a depth of about 1700 m in the Mid-Atlantic ridge (pose frames in red/green/blue corresponding to the x/y/z axes). We can see the camera pointing always toward the object in a forward-looking configuration. The shape of the object shown was recovered using our approach presented in Campos et al. (2015). Note the difference in the level of detail when compared with a 2.5D representation of the same area obtained using a multibeam sensor in (b). The trajectory followed in (b) was downward-looking, hovering over the object, but for the sake of comparison we show the same trajectory as in (a). Finally, (c) shows the original point cloud, retrieved through optical-based techniques, that was used to generate the surface in (a). Note the large levels of both noise and outliers that this data set contains.

Data courtesy of Javier Escartin (CNRS/IPGP, France)

Another issue that burdens the development of surface reconstruction techniques is the defect-ridden nature of point sets retrieved in real-world operations. The large levels of noise and the huge number of outliers present in the data clearly complicate the processing. Thus, the surface reconstruction method to use mainly depends on the quality of the data we are dealing with and the ability of the method to recover faithful surfaces from corrupted input.

Given the ubiquity of point set representations and the generality of the problem itself, surface reconstruction has attracted the attention of many research communities, such as those of computer vision, computer graphics, or computational geometry. In all of them, the different approaches can be mainly classified into two types: those based on point set interpolation, and those approximating the surface with respect to the points. Usually, these methods are applied to range scan data sets because of the widespread use of these sensors nowadays. Nevertheless, they tend to be generic enough to be applicable to any point-based data regardless of their origin, consequently including our present case of optical and acoustic point sets retrieved from an underwater scenario.

Another relevant issue is the requirement of some methods to have some additional information associated with the point set. Many methods in the state of the art require available per-point normals to help in disambiguating this ill-posed problem. Moreover, some methods assume even higher level information, such as the pose of the scanning device at the time of capturing the scene, to further restrict the search for a surface that corresponds to that of the real scanned object. The availability of these properties will also guide our selection of the proper algorithm.

In the following sections, we overview both interpolation and approximation-based approaches in the state of the art.

Interpolation-Based Approaches

Mainly studied by the computational geometry community, methods based on interpolating the points commonly rely on space partition techniques such as the Delaunay triangulation, or its dual, the Voronoi diagram. Algorithms tackling the problem with a volumetric view try to partition the cells of these structures into those belonging to the inside of the object and those belonging to the outside. Then, the surface is basically located at the interface between these two volumes. Widely known methods working with this idea are the Cocone (Amenta et al. 2000) and the Power Crust (Amenta et al. 2001). These algorithms usually rely on theoretical guarantees derived from specific sampling conditions assumed on the point sets. However, fairly often these conditions are not fulfilled in real-world data, and some heuristics need to be derived to render the methods useful in these cases. We can also distinguish greedy approaches, where the surface grows incrementally, one triangle at a time, by taking local decisions on which points/triangles to insert next at each iteration (Bernardini et al. 1999; Cohen-Steiner and Da, 2004).

Since, for methods that interpolate the points, part of the input points also become vertices of the output surface, the noise is transferred to the resulting mesh. Consequently, these kinds of methods cannot deal with noisy inputs. A way to overcome this limitation is to apply a prior noise smoothing step on the point set, which provides considerably better results, as shown in Digne et al. (2011). Nevertheless, some methods have addressed the problem of disregarding outliers during reconstruction. The spectral partitioning of Kolluri et al. (2004) and the optical-based graph-cut approach of Labatut et al. (2007) are clear examples in this direction.

Approximation-Based Approaches

We can find a large variety of procedures based on approximation, which makes them more difficult to classify. Nevertheless, the common feature of most of the methods is to define the surface in implicit form and evaluate it on a discretization of the working volume containing the object. After approximating the surface in this form, the surface triangle mesh can be extracted by means of a surface meshing step (Boissonnat and Oudot 2005; Lorensen and Cline 1987). Thus, we can mainly classify these methods depending on the implicit definition they provide.

One of the most popular approaches is to derive a signed distance function (SDF) from the input points. This SDF can be defined as the distance from a given point to a set of local primitives, such as tangent planes (derived from known normals at input points) (Hoppe et al. 1992; Paulsen et al. 2010), or moving least squares (MLS) approximations (Alexa et al. 2004; Guennebaud and Gross 2007; Kolluri 2008). Alternatively, some methods work with the radial basis function (RBF) interpolation of the SDF. The RBF method, usually used to extrapolate data from samples, is used to derive the SDF from point sets with the associated normals (Carr et al. 2001; Ohtake et al. 2003 2004).
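A rough sketch of the tangent-plane flavor of SDF evaluation (in the spirit of Hoppe et al. (1992), but greatly simplified): the signed distance of a query point is taken to the tangent plane of its nearest oriented sample. The zero-level set of such a function can then be extracted with a surface mesher such as marching cubes (Lorensen and Cline 1987).

```python
import numpy as np
from scipy.spatial import cKDTree

def signed_distance(query, points, normals):
    """Evaluate a simple tangent-plane signed distance function.
    query: (M, 3) evaluation points; points: (N, 3) samples on the surface;
    normals: (N, 3) unit, consistently oriented per-point normals.
    Returns (M,) signed distances (negative on the inner side)."""
    tree = cKDTree(points)
    _, idx = tree.query(query)                         # nearest sample per query
    diff = query - points[idx]
    return np.einsum('ij,ij->i', diff, normals[idx])   # projection onto the normal
```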

On the other hand, there are some approximation approaches where the implicit function sought is not a distance function but an indicator function. This simpler definition denotes for a given point whether it is part of the interior of the object or the outside (i.e., it is a pure binary labeling) (Kazhdan et al. 2006; Kazhdan and Hoppe 2013; Manson et al. 2008). Usually, this inside/outside information is derived from the information provided by known per-point normals.

Note that, up to this point, all the reviewed methods require that the surface normal vector is known for each input point. It is clear that the requirement of per-point normals may be a burden in the cases where the scanning methodology does not provide this information. Moreover, estimating the normals at input points to reconstruct a surface is a chicken and egg problem: to compute the normals, we have to infer somehow the surface around each point. Nevertheless, even when working with raw point sets without normals, we can derive some distance function. The only drawback is that the resulting function is bound to be unsigned. When working with unsigned distance functions (UDFs), the main problem resides in the fact that we cannot extract the surface from the zero-level set as is done for SDF or indicator function-based approaches. Thus, methods try to recover the sign of the function using some heuristics (Giraudot et al. 2013; Hornung and Kobbelt 2006; Mullen et al. 2010; Zhao et al. 2001).

In all the above-mentioned cases, both noise attenuation and outlier rejection are problematic, and most of them require per-point normals. Still, there exist other methods proposing unconventional procedures but not requiring any additional information, such as in Campos et al. (2013a 2015), where they merge a set of local shapes derived from the points into a global surface without resorting to an intermediate implicit formulation. Instead, they modify the surface meshing method in Boissonnat and Oudot (2005) to be able to deal with these local surfaces directly. Thus, in these cases, meshing and surface reconstruction problems are intrinsically related, meaning that the quality of the resulting surface is also a user parameter. Moreover, both methods work with the idea of dealing with noise and outliers in the data, by means of using robust statistics techniques within the different steps of the method, and allow the reconstruction of bounded surfaces, usually appearing when surveying a delimited area of the seafloor. These last two methods are applied specifically to underwater optical mapping data sets, and a sample of the behavior of Campos et al. (2015) can be seen in Figure 7.8a.

nfgz009

Figure 7.9 A sample of surface processing techniques that can be applied to the reconstructed surface. (a) Original; (b) remeshed; (c) simplified

7.2.3.2 Further Applications

The ability of modern graphics hardware (i.e., graphics processing units (GPUs)) to display triangle primitives in real time eases the visualization of the reconstructed scenes. Moreover, the widespread use of triangle meshes has motivated the rapid development of various mesh processing techniques in recent years that may be helpful in further processing (see Botsch et al. (2010) for a broader overview of these methods). Some of these techniques include the following:

  • (Re)Meshing: Changes the quality/shape of triangles (see Figure 7.9b). Some computations, such as the finite element method (FEM), require the shape of the triangles to be close to regular, or adaptive to the complexity or curvature of the object. Meshing (or remeshing) methods allow tuning the quality of the triangulation according to some user-defined parametrization of the shape of the triangles.
  • Simplification: Related to remeshing, we can change the complexity of the mesh (see Figure 7.9c) to attain real-time visualization, or to simplify further computations on the data.
  • Smoothing: Smooths the appearance of the resulting surface. Note that this technique may be especially useful when using an interpolation-based surface reconstruction technique, since in this case a noisy point cloud will result in a rough and spiky approximation of the surface. Smoothing methods try to attenuate small high-frequency components in the mesh, resulting in a more visually pleasant surface.

In contrast to the pure mesh-based approaches, when using optical-based reconstruction we can also benefit from texture mapping. Texture mapping is a post-processing step where the texture of the original images used to reconstruct the scene can also be used to colorize and basically give a texture to each of the triangles of the resulting mesh. Since we have reconstructed the surface from a set of views, both the cameras and the surface are in the same reference frame. Thus, as shown in Figure 7.10, the texture mapping is quite straightforward: we can back-project each triangle to one of its compatible views and use the enclosed texture. The problem then reduces to that of blending, discussed in Section 7.2.1, but in 3D. The different available variants of these methods are mainly concerned with the selection of the best view to extract the texture from a given triangle, and also with alleviating the differences in illumination that may appear when composing the textures obtained from different views. Two representative methods in the literature can be found in Lempitsky and Ivanov (2007) and Gal et al. (2010).
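A minimal numpy sketch of the geometric part of this back-projection: the vertices of one triangle are projected into its selected view to obtain texture coordinates (the view-selection and seam-blending parts of the cited methods are omitted):

```python
import numpy as np

def triangle_uv(vertices, K, R, t, img_w, img_h):
    """Texture coordinates of one triangle by back-projection into a view.
    vertices: (3, 3) triangle vertices in the world frame; K, R, t: the
    selected camera's intrinsics and world-to-camera pose.
    Returns (3, 2) normalized (u, v) coordinates into that view's image."""
    cam = R @ vertices.T + t.reshape(3, 1)   # vertices in the camera frame
    hom = K @ cam                            # homogeneous image coordinates
    px = (hom[:2] / hom[2]).T                # pixel coordinates per vertex
    return px / np.array([img_w, img_h])     # normalize to [0, 1] UVs
```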

nfgz010

Figure 7.10 Texture mapping process, where the texture filling a triangle in the 3D model is extracted from the original images. Data courtesy of Javier Escartin (CNRS/IPGP, France)

7.2.4 Machine Learning for Seafloor Classification

Underwater image classification is still a relatively new research area compared with the existing large body of work on terrestrial image classification. Apart from the inherent challenges of underwater imagery, there are specific challenges related to image classification in this environment. Significant intraclass and intersite variability in the morphology of seabed organisms or structures of interest, complex spatial borders between classes, variations in viewpoint, distance, and image quality, limits to spatial and spectral resolution, partial occlusion of objects due to the three-dimensional structure of the seabed, gradual changes in the structures of the classes, lighting artifacts due to wave focusing, and variable optical properties of the water column can be considered some of the major challenges in this field.

Seafloor imagery collected by AUVs and ROVs is usually manually classified by marine biologists and geologists. Due to the hard manual labor involved, several methods have been developed toward the automatic segmentation and classification of benthic structures and elements. Although the results of these methods have proved to be less accurate than manual classification, this automated classification is nonetheless very valuable as a starting point for further analysis.

A general supervised approach for object classification in the computer vision community contains several standard steps such as image collection, pre-processing, invariant feature extraction (texture, color, shape), feature modification (kernel mapping, dimension reduction, normalization), classifier training, and, finally, accuracy testing. There are many different computer vision techniques available for each of these steps in the framework to be used in the case of seafloor object classification.
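As a hedged sketch of such a pipeline (not any particular published method), the following Python example extracts a local binary pattern histogram per image patch and trains an SVM with an RBF kernel; `patches` and `labels` stand for a hand-annotated training set and are assumptions:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

def lbp_histogram(patch, P=8, R=1.0):
    """Texture descriptor: normalized histogram of uniform local binary patterns."""
    lbp = local_binary_pattern(patch, P, R, method="uniform")
    hist, _ = np.histogram(lbp, bins=P + 2, range=(0, P + 2), density=True)
    return hist

# patches: list of gray-scale image patches; labels: expert annotations
X = np.array([lbp_histogram(p) for p in patches])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3)
clf.fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```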

In one of the initial efforts in automated seabed classification using optical imagery, Pican et al. (1998) used gray-level co-occurrence matrices (GLCM) (Haralick et al. 1973) and Kohonen maps (Heskes 1999) as texture descriptors. In Shiela et al. (2008) and Soriano et al. (2001), the authors use a feed-forward back-propagation neural network to classify underwater images. They use local binary patterns (LBP) (Ojala et al. 1996) as texture descriptors and normalized chromaticity coordinates (NCC) and mean hue saturation value (HSV) as color descriptors. The works by Johnson-Roberson et al. (2006a, b) employ both acoustic and optical imagery for benthic classification. Acoustic and visual features are classified separately using a support vector machine (SVM) (Cortes and Vapnik 1995), with weights that are determined empirically. A similar approach is proposed in Mehta et al. (2007), where SVMs are used to classify each pixel in the images.

Alternatively, one of the most common strategies for general object classification is the use of Bag of Words image characterization (Csurka et al. 2004), such as in the work of Pizarro et al. (2008). This method yields a good level of accuracy and can be considered one of the main references in the state of the art.

Color and texture features have been used by Gleason et al. (2007b) in a two-step algorithm to classify three broad cover types. This system requires expensive acquisition hardware, capable of acquiring narrow spectral band images. The work by Marcos et al. (2005) uses LBP and NCC histograms as feature descriptors and a linear discriminant analysis (LDA) (Mika et al. 1999) as the classifier. The work of Stokes and Deane (2009) uses normalized color space and discrete cosine transforms (DCT) to classify benthic images. The final classification is done using their proposed probability density weighted mean distance (PDWMD) classifier from the tail of the distribution. This method is time efficient with good accuracy but requires accurate color correction, which may be difficult to achieve on underwater images without controlled lighting.

The work by Diaz and Torres (2006) uses the local homogeneity coefficient (LHC) by Francisco et al. (2003) for segmentation, and pixel-by-pixel distances of texture features such as energy, entropy, and homogeneity for classification. This method can only deal with classes that are highly discriminative by nature and, therefore, has limited underwater applications. Beijbom et al. (2012) proposed a novel framework for seabed classification, which consists of feature vector generation using a maximum response (MR) filter bank (Varma and Zisserman 2005) and an SVM classifier with an RBF kernel. In this method, multiple patch sizes were used, providing a significant improvement in classification accuracy. Bender et al. (2012), in their recent work, used a novel least squares classifier with probabilistic targets to cluster similar types of areas on the seabed. This method shows promising results and is likely to evolve in future research.

For the cases where the survey images contain enough overlap to allow the extraction of depth information, 2.5D- or even 3D-based features can provide significant additional information. The work by Friedman et al. (2012) presented a new method for calculating the rugosity, slope, and aspect features of the Delaunay triangulated surface mesh of the seabed terrain by projecting areas onto a plane using PCA. They used these features to characterize the seabed terrain for scientific communities. Geographic information system (GIS) tools such as St John BIOMapper use statistics such as curvature, plan curvature, profile curvature, mean depth, variance of depth, surface rugosity, and steepness and direction of slope to characterize the complexity of the seafloor. Some of these features can be considered as potential 3D or 2.5D features for underwater object description.

Shihavuddin et al. (2013) presented an adaptive scheme for seafloor classification, which uses a novel image classification framework that is applicable to both single images and composite mosaic data sets, as illustrated in Figure 7.11. This method can be configured to the characteristics of individual data sets, such as the size, number of classes, resolution of the samples, color information availability, and class types. In another work (Shihavuddin et al. 2014), 2D and 2.5D features were fused to obtain a better classification accuracy, focusing on munition detection. Several 2.5D features, such as symmetry, rugosity, and curvature, were used on the highest resolution map of the terrain together with the 2D features.

nfgz011

Figure 7.11 Seafloor classification example on a mosaic image of a reef patch in the Red Sea, near Eilat, covering approximately 3 × 6 m. (a) Original mosaic. (b) Classification image using five classes: Brain Coral (green), Favid Coral (purple), Branching Coral (yellow), Sea Urchin (pink), and Sand (gray).

Data courtesy of Assaf Zvuloni and Yossi Loya (Tel Aviv University)

Schoening (2015) proposed a combined feature method for the automated detection and classification of benthic megafauna and, finally, the quantity estimation of benthic mineral resources for deep-sea mining. This method was designed for specific types of seafloor object detection.

Regardless of the particularities of the employed methods, automated classification of the seafloor will only achieve accurate results when the imagery is acquired under adequately good conditions. In this aspect, autonomous vehicles play an important, if not crucial, role as optimum classification results are obtained using visual data acquired at high resolution under clear visibility conditions, uniform illumination, consistent view angles and altitude, and sufficient image overlap.

7.3 Acoustic Mapping Techniques

Several researchers have drawn attention to the use of forward-looking sonar (FLS) either as a substitute for or as a complement to optical cameras for mapping purposes. The sensor parallelism is straightforward: FLS can be exploited to mosaic the seafloor through the registration of FLS images, following the same concept as 2D photomosaicing. Even though the range of FLS is greater than that of optical cameras, their field of view is also limited. Thus, it is often not possible to image a given area within a single frame, or at least to do so without sacrificing a great deal of resolution by pushing the device's range to the limit. In such circumstances, mosaicing of FLS images allows obtaining an extended overview of an area of interest regardless of the visibility conditions and without compromising the resolution.

Similar to the optical mosaicing, the workflow to create an acoustic image mosaic follows three main steps:

  1. Image registration: First, frame-to-frame transformations are computed using an image registration method. The particularities of FLS imagery pose a significant challenge to the registration techniques typically used in photomosaicing. In this sense, area-based approaches (Hurtós et al. 2014b) that use all the image content become more suitable than feature-based approaches (Kim et al. 2005, 2006; Negahdaripour et al. 2005), which behave less stably on low signal-to-noise ratio (SNR) images. In addition, by avoiding the extraction of explicit features, the registration technique remains independent of the type and number of features present in the environment, and it can be robustly applied to a wide variety of environments ranging from featureless natural terrains to man-made scenarios. A minimal sketch of such an area-based registration is given after this list.
  2. Global alignment: Consecutive images are aligned by transforming them to a common reference frame through compounding of the different transformations. Errors that accumulate along the trajectory can be corrected by means of global optimization techniques that make use of the transformations between nonconsecutive images. In Hurtós et al. (2014b), the problem of obtaining a globally consistent acoustic mosaic is formulated as a pose-based graph optimization: a least-squares minimization estimates the maximum-likelihood configuration of the sonar images from the pairwise constraints between consecutive and nonconsecutive frame registrations (a toy version of this optimization is also sketched after this list). In order to integrate the sonar constraints into the optimization framework, a method is established to quantify the uncertainty of the registration results. Apart from the sonar motion constraints, the same framework can integrate constraints coming from dead-reckoning navigation sensors or absolute positioning sensors. In addition, a strategy needs to be established to identify putative loop closures according to the spatial arrangement of the image positions, so that registration is attempted only between those pairs of frames that overlap (Hurtós et al. 2014a). Once the graph is constructed, different back-ends developed to efficiently optimize pose graphs (e.g., g2o (Kummerle et al. 2011), iSAM (Kaess et al. 2012)) can be used to obtain the final set of absolute positions in which to render the individual images.
  3. Mosaic rendering: Finally, in order to achieve an informative and smooth mosaic, the individual sonar frames are fused. Unlike the blending of optical mosaics, this implies dealing with a high number of overlapping images as well as with sonar-specific artifacts arising from the image formation geometry. Different blending strategies can be applied according to the photometric irregularities present in the data, both at the frame level (e.g., inhomogeneous insonification patterns due to the different sensitivity of the sonar transducer elements, nonuniform illumination across frames, blind areas due to improper imaging configuration) and at the mosaic level (e.g., seams along different tracklines due to a different number of overlapping images and different resolution) (Hurtós et al. 2013a).
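
To make the registration step concrete, the sketch below estimates the translational offset between two overlapping FLS frames by phase correlation, a standard area-based (Fourier-domain) technique in the spirit of the cited work. It assumes translation-only motion between equally sized, already Cartesian-projected frames and is not the authors' implementation.

```python
import numpy as np

def phase_correlation(img_ref, img_new):
    """Estimate the relative (row, col) shift between two equal-sized images;
    returns the shift and the correlation peak value (a rough quality proxy)."""
    F_ref = np.fft.fft2(img_ref)
    F_new = np.fft.fft2(img_new)
    cross_power = F_ref * np.conj(F_new)
    cross_power /= np.abs(cross_power) + 1e-12         # keep only phase information
    corr = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    shift = np.array(peak, dtype=float)
    dims = np.array(corr.shape, dtype=float)
    shift[shift > dims / 2] -= dims[shift > dims / 2]  # wrap to signed offsets
    return shift, corr.max()
```

The height of the correlation peak could serve as a rough proxy for registration quality, which is precisely the kind of uncertainty measure the global alignment step needs when weighting pairwise constraints.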

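The global alignment step can likewise be illustrated with a toy 2D pose-graph optimization over (x, y, heading) frame poses, solved here with a generic least-squares routine instead of a dedicated back-end such as g2o or iSAM. The constraint values and weights below are invented purely for the example.

```python
import numpy as np
from scipy.optimize import least_squares

def relative_pose(pi, pj):
    """Pose of frame j expressed in the frame of pose i (x, y, theta)."""
    dx, dy = pj[0] - pi[0], pj[1] - pi[1]
    c, s = np.cos(pi[2]), np.sin(pi[2])
    dth = (pj[2] - pi[2] + np.pi) % (2 * np.pi) - np.pi
    return np.array([c * dx + s * dy, -s * dx + c * dy, dth])

def residuals(flat_poses, constraints):
    poses = flat_poses.reshape(-1, 3)
    res = [10.0 * poses[0]]                        # weak prior fixing frame 0 at the origin
    for i, j, z, w in constraints:
        r = relative_pose(poses[i], poses[j]) - z  # predicted minus measured relative pose
        r[2] = (r[2] + np.pi) % (2 * np.pi) - np.pi
        res.append(w * r)
    return np.concatenate(res)

# Invented constraints: three sequential registrations plus one loop closure (0 -> 3).
constraints = [
    (0, 1, np.array([1.0, 0.0, 0.0]), 1.0),
    (1, 2, np.array([1.0, 0.0, 0.0]), 1.0),
    (2, 3, np.array([1.0, 0.1, 0.0]), 1.0),        # slightly biased pairwise registration
    (0, 3, np.array([3.0, 0.0, 0.0]), 2.0),        # loop closure, weighted by lower uncertainty
]
initial = np.zeros((4, 3))
initial[:, 0] = [0.0, 1.0, 2.0, 3.1]               # dead-reckoning-like initial guess
sol = least_squares(residuals, initial.ravel(), args=(constraints,))
optimized_poses = sol.x.reshape(-1, 3)             # globally consistent frame positions
```
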
Such a mosaicing system is of great interest for the many mapping tasks that are carried out in turbid waters and murky environments. A clear example is the mapping of ship hulls, which are routinely inspected by divers for security reasons, a hazardous and time-consuming task. Given that these inspections are carried out inside harbors, where water visibility is often limited, they are a good example of a target application for the described mapping methodology. Figure 7.12 shows an example of a mosaic obtained from the King Triton vessel in Boston Harbor using the HAUV (Vaganay et al. 2005) equipped with a DIDSON FLS (Sou 2015) (data courtesy of Bluefin Robotics (Blu 2015b)). The mosaic, consisting of 518 frames, presents a consistent overall appearance and allows the identification of various features on the hull.


Figure 7.12 Ship hull inspection mosaic. Data gathered with HAUV using DIDSON FLS.

Data courtesy of Bluefin Robotics

Another significant application is the mapping of harbors, bays, and estuaries, which typically suffer from poor visibility conditions. Figure 7.13 shows an example of a mosaic generated in a marina environment using an Autonomous Surface Craft from the Center for Maritime Research and Experimentation (CMRE). In this case, a BlueView P900-130 FLS was used (Blu 2015a), providing individual images with a wide field of view, long range, and low resolution. The mosaic is consistent thanks to the detection of several loop closures across the different tracks, even though consecutive tracks have reciprocal headings and therefore very different image appearance.


Figure 7.13 Harbor inspection mosaic. Data gathered from an Autonomous Surface Craft with BlueView P900-130 FLS.

Data courtesy of Center for Maritime Research and Experimentation

A final example of acoustic mapping with FLS can be seen in Figure 7.14. A mosaic composed of four different tracks and more than 1500 FLS frames shows the remains of the Cap de Vol Iberian shipwreck. The mosaic was created in real time thanks to a restrictive criterion for selecting the candidate frame pairs on which registration is attempted. A mosaic of the same area built from optical data is shown for comparison purposes.


Figure 7.14 Cap de Vol shipwreck mosaic: (a) acoustic mosaic and (b) optical mosaic

7.4 Concluding Remarks

In this chapter, we have discussed the most relevant underwater mapping and classification techniques. The presented mapping techniques allow large-scale, yet detailed, visual representations of the environment surveyed by the underwater vehicle. The choice of mapping approach depends on several factors: (i) the mapping application, (ii) the type of survey and the acquisition sensors, and (iii) the characteristics of the environment. Specifically, when only visual information (i.e., color, texture, shape) is required for a given study and the surveyed area is relatively flat, 2D mapping techniques are the most suitable. In contrast, when structural information is important or when the surveyed area exhibits significant relief variations, 2.5D and 3D mapping techniques are more appropriate.

Additionally, for applications and studies that require semantic information, we have presented machine learning techniques that enable the automatic classification of visual data. These techniques rely on supervised learning, allowing the transfer of knowledge from an expert to the system. The system then uses this knowledge to classify new visual data and provide the user with meaningful semantic interpretations.
