Chapter 11

Improving Tracking Accuracy Using Illumination Neutralization and High Dynamic Range Imaging

N. Ladas*; Y. Chrysanthou*; C. Loscos†    * University of Cyprus, Nicosia, Cyprus
† University of Reims Champagne-Ardenne, Reims, France

Abstract

Object recognition and tracking is a common task in video processing with multiple applications including surveillance, security, industrial inspection, and medicine. Recognition and tracking accuracy can drop significantly when the scene dynamic range exceeds that of common camera sensors. The problem is further exacerbated when the scene illumination changes abruptly, such as when a light is turned off or when clouds obscure the sun. This chapter investigates high dynamic range (HDR) imaging as a possible solution to these problems. HDR imaging extends the dynamic range of the captured image, which naturally improves the performance of existing algorithms, while further gains are possible when algorithms are designed specifically to take advantage of HDR information. This chapter discusses an algorithm that improves the accuracy of object tracking algorithms by employing a custom HDR capture methodology and an image neutralization algorithm designed to take advantage of HDR information.

Keywords

High dynamic range imaging; Object recognition; Tracking; Illumination capture; Illumination neutralization

1 Introduction

Object recognition and tracking is a class of video processing algorithms that identify targets as they appear in a video (recognition) and then follow their location for as long as they remain in view (tracking).

When the sensor’s dynamic range cannot accommodate the full dynamic range of the scene, parts of the image appear completely dark (undersaturated) or washed out (oversaturated) and important information may be lost. In the case of object detection, information in the saturated regions could be useful in the identification process, as it might contain features that describe the object that we aim to identify. In tracking, saturated regions can cause the tracking algorithm to fail when the target enters these regions.

Robust object recognition and tracking is a challenging task even in well exposed, constant-illumination scenes. Rapid changes in illumination conditions, such as lights turning on/off or clouds obscuring the sun can cause significant changes in the appearance of the captured scene that, in turn, cause significant reduction in tracking accuracy. In addition, illumination effects, such as shadows and specular highlights, can perturb algorithm efficiency.

High dynamic range (HDR) imaging can alleviate these problems by extending the dynamic range that is captured. Simply employing HDR inputs in unmodified conventional techniques can be beneficial, as less information is lost due to saturation, while further performance gains are possible when techniques are designed to take advantage of HDR information.

This chapter presents an in-depth discussion of the technique by Ladas et al. [1, pp. 51–54] that improves the accuracy of object tracking algorithms by employing a custom HDR capture methodology and an image neutralization algorithm designed to take advantage of HDR information. While that work is used as a guiding example, we expand on the relevant related work and give insight into implementation specifics and directions for future work.

2 Overview

The work presented in this chapter aims to improve the accuracy of object tracking algorithms by employing HDR imaging and a method for scene illumination neutralization. The algorithm takes as input an HDR video stream in which the illumination can vary significantly in intensity. Additionally, a set of HDR fisheye images that capture the scene illumination (incoming light) is used. By analyzing the scene illumination, the method is able to reverse the illumination effects that appear in the input images and produce an illumination-neutral image stream. Tracking objects using the illumination-neutral images is shown to be much more robust.

As the HDR video stream of the tracked area is captured, a separate low dynamic range (LDR) camera, mounted with a fisheye lens, takes snapshots of the scene’s incoming illumination, which are assembled into an HDR environment map. This is done at regular time intervals (typically every few seconds). Given an estimate of the scene geometry visible through each pixel, the incoming illumination from the current HDR environment map is used to compute the scene lighting for each point in real time. Using this, lighting disparities, such as shading and shadows, are removed, producing an illumination-neutral video stream that can be used as input by any tracking algorithm.
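
To make the data flow concrete, the sketch below outlines the main loop of such a system in Python. It is only a structural outline under the assumptions above: run_pipeline, capture_hdr_environment_map, relight, and update are hypothetical placeholder interfaces, not APIs from the original implementation, and the ambient level is an arbitrary constant.

# Structural sketch of the pipeline described above. All objects and method
# names (hdr_stream, fisheye_camera, prt_solver, tracker) are hypothetical
# placeholders standing in for the HDR video source, the fisheye capture,
# the PRT lighting solver, and an off-the-shelf tracker such as TLD.
import time

ENV_MAP_INTERVAL = 5.0  # seconds between environment-map captures (assumed)

def run_pipeline(hdr_stream, fisheye_camera, prt_solver, tracker):
    env_map = fisheye_camera.capture_hdr_environment_map()
    illumination = prt_solver.relight(env_map)   # per-pixel I(x) from SH products
    last_capture = time.time()
    for frame in hdr_stream:
        # Refresh the environment map every few seconds to follow light changes.
        if time.time() - last_capture > ENV_MAP_INTERVAL:
            env_map = fisheye_camera.capture_hdr_environment_map()
            illumination = prt_solver.relight(env_map)
            last_capture = time.time()
        # Divide out the computed illumination and relight with constant
        # ambient light to obtain an illumination-neutral frame.
        neutral_frame = (frame / (illumination + 1e-6)) * 0.5
        yield tracker.update(neutral_frame)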

3 Acquisition of the Scene Illumination

3.1 Fisheye Images

To acquire the scene illumination, a standard SLR camera is used, mounted with a fisheye lens pointing upward. In this configuration, one shot captures a hemispherical view of the scene (typically up to 360 × 90 degrees) that includes most of the area contributing to the incoming illumination in typical scenes (see Fig. 1). The standard assumption is made that the main sources of light in the environment are relatively far from the observed part of the scene. If this assumption does not hold, fisheye images must be captured close to the point of interest in the scene and sampled at multiple locations to form a light field [2, pp. 31–42].

Fig. 1 Image captured with a camera equipped with a fisheye lens. The camera points upward to capture the illumination arriving from the light sources of the scene.

Directly photographing light sources is usually not possible, as the regions of the image depicting the light sources will be saturated. Standard LDR-to-HDR assembly [3, pp. 369–378] was used to properly capture the fisheye images. For very bright light sources, such as the sun, it may still not be possible to capture the image directly with conventional cameras. In such cases, specialized capture techniques must be employed.
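
To illustrate the LDR-to-HDR assembly step, the sketch below merges a bracketed exposure stack into an HDR radiance map using a simple hat-shaped weighting, in the spirit of Debevec and Malik [3]. It assumes the input images are already linearized (e.g., from RAW data), so no response-curve recovery is shown; the weighting function and variable names are illustrative assumptions rather than the exact procedure used here.

import numpy as np

def merge_exposures(images, exposure_times):
    """Merge linearized LDR exposures (float arrays in [0, 1]) into an HDR image.
    images         : list of HxWx3 arrays of the same scene at different exposures
    exposure_times : shutter time in seconds for each image
    """
    numerator = np.zeros_like(images[0], dtype=np.float64)
    denominator = np.zeros_like(images[0], dtype=np.float64)
    for img, t in zip(images, exposure_times):
        # Hat weighting: trust mid-range pixels, down-weight near-saturated ones.
        w = 1.0 - np.abs(2.0 * img - 1.0)
        numerator += w * (img / t)       # radiance estimate from this exposure
        denominator += w
    return numerator / np.maximum(denominator, 1e-6)

# Example with synthetic data: three exposures derived from one radiance map.
if __name__ == "__main__":
    radiance = np.random.rand(4, 4, 3)
    times = [1 / 250, 1 / 60, 1 / 15]
    stack = [np.clip(radiance * t * 60.0, 0.0, 1.0) for t in times]
    print(merge_exposures(stack, times).shape)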

One possible approach is to use a light probe to capture the incoming illumination at a specific location. A light probe is a highly reflective metallic sphere that reflects the illumination arriving from nearly every direction. Because we are photographing the probe instead of the light sources directly, the probe reflectance must be determined beforehand using a calibration procedure [4, pp. 189–198]. Because the probe reflects less light than it receives, this indirection is often enough to avoid saturating the image sensor. In very strongly lit scenes, such as when taking a light probe of the sky on a sunny day, the light reflected by the probe will still saturate the camera sensor. An approach for dealing with this problem was described by Debevec and Lundgren [5, pp. 1–11], which involved photographing a separate diffuse (matte) gray ball. The authors calibrated the reflectance of the gray ball against a known illuminant and used it to capture the illumination of the sun. A conventional light probe was then used to capture the illumination from the rest of the sky.

Another approach is to use neutral density (ND) filters to block a portion of the incoming illumination. ND filters are darkened plastic membranes that attach to the camera lens, typically quantified by the amount of illumination they block (expressed in f-stops). For accurate results, the ND filter must be calibrated, as it may not block the exact amount of illumination advertised or may not behave evenly across its surface. The work by Stumpfel [6, pp. 145–150] describes an approach that performs this calibration.

For highly accurate scene illumination capture, trichromatic (e.g., red-green-blue) light sampling may not be sufficient and the whole visible spectrum must be captured [7, pp. 1–12]. This, however, requires specific, high-cost equipment and cumbersome capture procedures. For all cases considered in this chapter, standard LDR to HDR acquisition was sufficient.

3.2 Environment Map Representation

The captured fisheye images must be converted to a known representation. A commonly used representation is equirectangular (latitude-longitude), as shown in Fig. 2.

Fig. 2 A fisheye image converted to the latitude-longitude representation. The fisheye image only captures the appearance of the top hemisphere and as a result, the bottom half of the resulting latitude-longitude image is black.

Equirectangular projection maps the surface of a sphere onto a plane. The horizontal direction represents longitude and the vertical direction latitude. This representation is widely used as it allows for easy, intuitive access to the data, because pixel locations (x, y) directly translate into spherical coordinates (θ, ϕ). To successfully project the captured fisheye image to the latitude-longitude representation, the fisheye lens must be calibrated with a technique such as that of Scaramuzza et al. [8, pp. 5695–5701]. The fisheye lens captures only the upper hemisphere and, as a result, when projected to an equirectangular image, half of the image is black. Equirectangular images representing the scene illumination are often referred to as environment maps (a term that we will use for the remainder of the chapter).
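
As an illustration of this conversion, the sketch below maps an upward-looking fisheye image to the top half of a latitude-longitude environment map, assuming an ideal equidistant fisheye with a 180-degree field of view. A real setup would replace this ideal model with the per-lens calibration obtained from a toolbox such as that of Scaramuzza et al. [8]; the function and its parameters are illustrative only.

import numpy as np

def fisheye_to_latlong(fisheye, out_height=256):
    """Map an upward-looking, equidistant 180-degree fisheye image (HxWx3)
    to an equirectangular (latitude-longitude) environment map."""
    out_width = 2 * out_height
    env = np.zeros((out_height, out_width, 3), dtype=fisheye.dtype)
    h, w = fisheye.shape[:2]
    cx, cy, radius = w / 2.0, h / 2.0, min(w, h) / 2.0
    # Spherical coordinates of every output pixel (longitude across, latitude down).
    lon = (np.arange(out_width) + 0.5) / out_width * 2.0 * np.pi
    lat = (0.5 - (np.arange(out_height) + 0.5) / out_height) * np.pi
    lon, lat = np.meshgrid(lon, lat)
    visible = lat >= 0.0                      # only the upper hemisphere is captured
    theta = np.pi / 2.0 - lat                 # angle from the optical axis (zenith)
    r = theta / (np.pi / 2.0) * radius        # equidistant model: r proportional to theta
    u = np.clip(cx + r * np.cos(lon), 0, w - 1).astype(int)
    v = np.clip(cy + r * np.sin(lon), 0, h - 1).astype(int)
    env[visible] = fisheye[v[visible], u[visible]]
    return env                                # bottom half stays black, as in Fig. 2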

To capture changing light conditions, such as lights turning on and off, the scene illumination must be captured regularly. In the current experiment, a standard SLR camera, controlled remotely by a laptop running freely available software (http://gphoto.sourceforge.net/), was used to capture the fisheye images at regular time intervals. Due to the bracketing, the transfer from camera to computer and the assembly of the images, it was not possible to produce more than one fisheye HDR image every few seconds. For environments where the illumination conditions change more frequently, it would be desirable to use an HDR video camera.

4 Precomputed Radiance Transfer

Typically, to correctly calculate the illumination arriving at each scene location, we must sample the environment map from that location, using a suitable distribution and taking into account obstructions and interreflections. This is a time-consuming task that is not well suited to the real-time nature of object detection/tracking. This is a common problem in real-time graphics, and it is commonly solved by making simplifying assumptions that trade accuracy for efficiency, such as ignoring interreflections and precomputing visibility for static scene components.

A good compromise between accuracy and efficiency is the precomputed radiance transfer (PRT) technique [9, pp. 527–536; 10, pp. 131–160]. In PRT, reflectance and visibility at key scene locations (typically geometry vertices) are calculated using Monte Carlo integration and projected into spherical harmonic (SH) coefficients. The incoming illumination is assumed to come from far enough away that it has the same relative direction when seen from any point in the scene, and it is also encoded as one set of SH coefficients. Encoding the visibility is a slow process and must be done offline (once per scene), while encoding the incoming illumination is done in real time. In our case we assume mostly diffuse objects; therefore the reflectance reduces to a cosine term that can be encoded into SH during preprocessing. At runtime, the illumination arriving at scene point x is calculated as the sum of the products of the SH coefficients of the incoming illumination L and the transfer function (visibility plus cosine term) T:

$$I(x) = \sum_{i=0}^{N} L_i \, T_i(x)$$

where N is the number of SH coefficients used. Increasing the number of coefficients leads to better results at the cost of using more memory. Typically, 16 SH coefficients provide good lighting results in low-frequency illumination scenarios. The light reflected toward the camera is then calculated as the product of the incoming illumination and the scene reflectance:

$$R(x) = \rho(x) \, I(x)$$
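
With the transfer coefficients precomputed and the environment map projected into SH, the runtime shading step reduces to a per-pixel dot product followed by a multiplication with the reflectance. The sketch below assumes both sets of coefficients are already available as plain arrays; the names and array layouts are illustrative assumptions.

import numpy as np

def shade_prt(transfer, light_coeffs, reflectance):
    """Evaluate PRT lighting for every screen pixel.
    transfer     : HxWxN precomputed SH transfer coefficients T_i(x)
                   (visibility times the cosine term)
    light_coeffs : length-N SH projection L_i of the environment map
    reflectance  : HxWx3 diffuse albedo rho(x)
    """
    # I(x) = sum_i L_i * T_i(x): a dot product along the coefficient axis.
    irradiance = np.einsum('hwn,n->hw', transfer, light_coeffs)
    irradiance = np.maximum(irradiance, 0.0)      # clamp negative SH ringing
    # R(x) = rho(x) * I(x)
    return reflectance * irradiance[..., None]

# Example with 16 SH coefficients (4 bands), as suggested in the text.
if __name__ == "__main__":
    H, W, N = 120, 160, 16
    out = shade_prt(np.random.rand(H, W, N), np.random.rand(N),
                    np.full((H, W, 3), 0.5))
    print(out.shape)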

Standard PRT implementations calculate the visibility for each geometry vertex. This approach has two limitations. If the scene is overly complex, the processing time and memory requirements for the visibility computation increase dramatically. On the other hand, if the scene consists of crude geometry with few vertices, visibility will be undersampled and the lighting quality will suffer. In many tracking scenarios, the camera is static (e.g., CCTV cameras). We exploit this fact, and compute the scene visibility for each screen pixel instead of each scene vertex. We perform an initial ray-cast step that traces rays from the camera toward the scene. At each intersection, we trace random rays from the intersection point toward the rest of the scene in order to estimate the visibility. This approach works well in complex scenes, because the number of visibility computations is always constant (equal to the number of pixels). In simple scenes, this approach naturally multisamples the scene and produces better lighting results than the per-vertex computation. In the cases where the camera is not static, the system can revert to per-vertex computation.
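
A minimal sketch of this per-pixel precomputation is given below. It assumes a hypothetical scene.occluded(origin, direction) shadow-ray query (not an API from the original system), uses uniform hemisphere sampling for the Monte Carlo projection, and evaluates only the first two SH bands (4 coefficients) for brevity, rather than the 16 coefficients suggested earlier.

import numpy as np

def sh_basis(d):
    """First two SH bands (4 real coefficients) evaluated at unit direction d."""
    x, y, z = d
    return np.array([0.282095, 0.488603 * y, 0.488603 * z, 0.488603 * x])

def uniform_hemisphere(normal, rng):
    """Draw a direction uniformly from the hemisphere around `normal`."""
    while True:
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)
        if np.dot(v, normal) > 0.0:
            return v

def transfer_for_pixel(scene, point, normal, n_samples=128, rng=None):
    """Monte Carlo projection of (visibility * cosine term) into SH coefficients.
    `scene.occluded(origin, direction)` is a hypothetical ray query returning
    True when a shadow ray from `origin` along `direction` hits geometry.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    coeffs = np.zeros(4)
    for _ in range(n_samples):
        d = uniform_hemisphere(normal, rng)
        if not scene.occluded(point + 1e-4 * normal, d):
            coeffs += max(0.0, float(np.dot(normal, d))) * sh_basis(d)
    # Uniform hemisphere sampling has pdf 1/(2*pi); scale the estimator accordingly.
    return coeffs * (2.0 * np.pi / n_samples)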

A limitation of PRT is that it requires the scene’s geometric information to be known, which may not be the case for real world scenes. When the scene geometry is not available, we work around this limitation by calculating the incoming illumination on a hypothetical flat surface. Because many tracking scenarios consist of tracking objects on flat surfaces (e.g., tracking pedestrians in a plaza or the cars on a road), this assumption works quite well. When, however, the goal is to maximize the accuracy of the algorithm, the scene geometry must be provided. Digitizing a real scene’s static geometric information is a well-studied field and many approaches are available [11, pp. 834–849].

5 Neutralizing Illumination

5.1 Sampling Environment Maps

Sampling and encoding the visibility is a slow process and is done offline, once per scene. Sampling the illumination given by the environment map can be performed in real time, as long as the number of samples used is small. Efficient sampling with relatively few samples can be achieved by employing importance sampling techniques [12, p. 33]. The key idea behind importance sampling is to allocate more samples to brighter regions of the environment map, as these regions are likely to contribute most to the scene illumination.
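
One straightforward way to realize this is to draw environment-map pixels with probability proportional to their luminance (weighted by the solid angle each pixel covers), which concentrates samples on light sources. The sketch below follows this principle; it is a simple luminance-proportional sampler, not the median cut algorithm of [12], and the luminance weights and direction conventions are assumptions.

import numpy as np

def sample_environment(env_map, n_samples=64, rng=None):
    """Draw light directions from a latitude-longitude HDR environment map with
    probability proportional to luminance times the per-row solid-angle term."""
    rng = rng if rng is not None else np.random.default_rng(0)
    h, w = env_map.shape[:2]
    lum = env_map @ np.array([0.2126, 0.7152, 0.0722])     # Rec. 709 luminance
    theta = (np.arange(h) + 0.5) / h * np.pi
    pdf = lum * np.sin(theta)[:, None]                     # account for map distortion
    pdf = pdf.ravel() / pdf.sum()
    idx = rng.choice(h * w, size=n_samples, p=pdf)
    rows, cols = np.divmod(idx, w)
    # Convert the sampled pixels back to unit directions (y-axis pointing up).
    th = (rows + 0.5) / h * np.pi
    ph = (cols + 0.5) / w * 2.0 * np.pi
    dirs = np.stack([np.sin(th) * np.cos(ph), np.cos(th), np.sin(th) * np.sin(ph)], axis=1)
    return dirs, env_map[rows, cols], pdf[idx]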

5.2 Neutralizing

Sampling the illumination in real time is beneficial, as it allows more than one environment map to be used during the tracking operation. This is important in cases where the scene illumination changes abruptly during tracking. Because our system captures both the video stream and a sequence of environment maps concurrently, it is able to react better to illumination changes. Acquiring and sampling the environment maps can take a few seconds, so multiple video frames are associated with the same environment map. Each new environment map is sampled and encoded into SH coefficients, and the scene illumination is recalculated. After the illumination is calculated, we remove its effect from the input video. Because scene materials are assumed to be diffuse, we can estimate the reflectance ρ of the scene’s objects by dividing each pixel in the video frame by the illumination calculated for the scene location associated with that pixel:

$$\rho(x) = \mathrm{pixel}(x) / I(x)$$

After the reflectance is calculated, we obtain the illumination-neutral video by lighting the scene objects with ambient lighting. In order to easily associate video pixels with scene locations, we align the virtual camera used in the PRT system with the real camera.
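
Putting the two steps together, the per-frame neutralization amounts to a per-pixel division by the computed illumination followed by relighting with a constant ambient term, as in the sketch below. The irradiance image is assumed to be the per-pixel output of the PRT evaluation from Section 4; the ambient level is an arbitrary illustrative constant.

import numpy as np

def neutralize_frame(hdr_frame, irradiance, ambient=0.5, eps=1e-6):
    """Remove shading and shadow effects from one HDR video frame.
    hdr_frame  : HxWx3 linear HDR frame, registered with the PRT camera
    irradiance : HxW per-pixel illumination I(x) from the current environment map
    ambient    : constant relighting level (illustrative value)
    """
    # rho(x) = pixel(x) / I(x): estimated diffuse reflectance.
    reflectance = hdr_frame / np.maximum(irradiance, eps)[..., None]
    # Relight with flat ambient illumination to obtain the neutral frame.
    return np.clip(reflectance * ambient, 0.0, None)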

6 Improving Tracking Accuracy

To evaluate how our neutralization scheme affects tracking accuracy, we compared the tracking results of tracking-learning-detection (TLD) [13, pp. 49–56], a high-end real-time tracking algorithm, using standard video versus illumination-neutralized video as input. TLD is resilient against changes in global illumination and can adapt, given enough training input, to strong changes in local illumination. In TLD, the user is tasked with specifying the initial position of the target object and no object detection is performed. Alternatively, the initial position of the target could be determined using an object detection method such as [14, pp. 1627–1645].

The test scenario consisted of tracking a small object on a flat surface. As the object moved, we altered the illumination conditions by switching lights on/off and obscuring the light coming from the windows. The goal was to capture different illumination conditions (well-lit, under-lit, directional lighting) and also to have rapid changes in illumination. At each illumination change, we captured a new environment map. Fig. 3 shows, in columns 1 and 2, the results when using unprocessed video and, in columns 3 and 4, the results when using the illumination-neutral video produced by our algorithm.

Fig. 3 The first column shows the original input stream while the second column shows the tracker’s output for this stream. The third column shows the illumination neutral stream produced by the algorithm. The fourth column shows the tracking result for the illumination neutral input. Tracking accuracy is substantially improved for illumination neutral inputs.

Successful tracking is shown in each image as a circle over the target. When sharp illumination changes occur, tracking fails for the unprocessed video (frames 4, 6–10). The illumination-neutral video maintains a constant scene appearance by removing illumination effects and, as a result, tracking continues unhindered.

Because the TLD tracker operates on LDR input, we had to convert the HDR input frames into LDR images. To do this, we selected a fixed exposure setting from the LDR stack used to create the HDR frames. Another approach would be to tone map the HDR frames and use the result as input for the tracker. We tested our data using the tone-mapping procedure described by Reinhard and Devlin [15, pp. 13–24], which gave results similar to using a fixed exposure. Evaluating the effect of tone mapping on tracking and other computer vision methods is an open direction for further work.
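
For reference, producing a fixed-exposure LDR frame from a linear HDR frame can be as simple as scaling, clipping, and gamma encoding, as in the sketch below; the exposure and gamma values are illustrative choices, not those used in the experiments.

import numpy as np

def hdr_to_ldr_fixed_exposure(hdr_frame, exposure=1.0, gamma=2.2):
    """Simulate a fixed-exposure LDR capture from a linear HDR frame."""
    scaled = hdr_frame * exposure              # one exposure chosen for the whole sequence
    clipped = np.clip(scaled, 0.0, 1.0)        # saturate as a real sensor would
    return (clipped ** (1.0 / gamma) * 255.0).astype(np.uint8)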

7 Conclusions

This chapter presented a method for enhancing the accuracy of a high-end object tracker. The method used HDR inputs instead of conventional video, and additionally captured the scene illumination in the form of environment maps. Using this information, the appearance of the inputs was normalized and the effects of sudden light changes that dramatically darkened or brightened the scene were removed. This translated to concrete improvements in tracking performance.

The algorithm presented here is relatively simple, but it still manages to improve tracking performance significantly. We believe that the results presented here are indicative of the possible performance gains of HDR-enabled object trackers, although more thorough testing is required to solidify this claim. At the same time, there is very little literature involving HDR imaging and tracking, object recognition, or similar disciplines, which leads us to believe that the area is a good target for future work.

References

[1] Ladas N., Chrysanthou Y., Loscos C. Improving tracking accuracy using illumination neutralization and high dynamic range imaging. In: HDRi2013: First International Conference and SME Workshop on HDR Imaging; 2013:2–6.

[2] Levoy M., Hanrahan P. Light field rendering. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’96). New York, NY: ACM Press; 1996:31–42. http://portal.acm.org/citation.cfm?doid=237170.237199 (accessed December 12, 2013).

[3] Debevec P.E., Malik J. Recovering high dynamic range radiance maps from photographs. In: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’97). New York, NY: ACM Press/Addison-Wesley Publishing Co.; 1997:369–378.

[4] Debevec P. Rendering synthetic objects into real scenes: bridging traditional and image-based graphics with global illumination and high dynamic range photography. In: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’98). New York, NY: ACM; 1998:189–198.

[5] Debevec P., Lundgren T. Estimating surface reflectance properties of a complex scene under captured natural illumination. ACM Trans. Graph. 2004;1–11.

[6] Stumpfel J. Direct HDR capture of the sun and sky. In: Proceedings of the 3rd International Conference on Computer Graphics, Virtual Reality, Visualisation and Interaction in Africa (AFRIGRAPH ’04). ACM Press; 2004:145. http://portal.acm.org/citation.cfm?doid=1029949.1029977.

[7] Kider J.T. A framework for the experimental comparison of solar and skydome illumination. ACM Trans. Graph. 2014;33(6):1–12.

[8] Scaramuzza D., Martinelli A., Siegwart R. A toolbox for easily calibrating omnidirectional cameras. In: 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems. 2006:5695–5701.

[9] Sloan P.-P., Kautz J., Snyder J. Precomputed radiance transfer for real-time rendering in dynamic, low-frequency lighting environments. ACM Trans. Graph. 2002;21(3):527–536.

[10] Slomp M.P.B., Oliveira M.M., Patricio D.I. A gentle introduction to precomputed radiance transfer. Revista De Informática Teórica E Aplicada. 2006;13(2):131–160.

[11] Engel J., Schöps T., Cremers D. LSD-SLAM: large-scale direct monocular SLAM. In: Lecture Notes in Computer Science, vol. 8690 (Part 2); 2014:834–849.

[12] Debevec P. A median cut algorithm for light probe sampling. In: ACM SIGGRAPH 2008 Classes. New York, NY: ACM; 2008. http://dl.acm.org/citation.cfm?id=1401176 (accessed November 19, 2014).

[13] Kalal Z., Matas J., Mikolajczyk K. P-N learning: bootstrapping binary classifiers by structural constraints. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. 2010:49–56.

[14] Felzenszwalb P.F. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 2010;32(9):1627–1645.

[15] Reinhard E., Devlin K. Dynamic range reduction inspired by photoreceptor physiology. IEEE Trans. Visual. Comput. Graph. 2005;11(1):13–24.
