5

Perceptually Based Image Processing Algorithm Design

James E. Adams, Jr., Aaron T. Deever, Efraín O. Morales, and Bruce H. Pillman

5.1 Introduction

5.2 Digital Camera Image Processing Chain

5.3 Automatic Camera Control

5.3.1 Automatic Exposure

5.3.1.1 Exposure Analysis

5.3.1.2 Exposure Control

5.3.2 Automatic Focus

5.3.2.1 Focus Analysis

5.3.2.2 Focus Control

5.3.3 Automatic White Balancing

5.3.3.1 White Balance Analysis

5.3.3.2 White Balance Control

5.4 Color Filter Array Interpolation (Demosaicking)

5.4.1 Color Filter Array

5.4.2 Luminance Interpolation

5.4.3 Chrominance Interpolation

5.5 Noise Reduction

5.5.1 Contrast Sensitivity Function and Noise Reduction

5.5.2 Color Opponent Image Processing

5.5.2.1 Noise Probability Distributions

5.6 Color Rendering

5.6.1 Color Perception

5.6.2 Colorimetry

5.6.3 Color Rendering Using Relative Colorimetric Matching

5.6.4 Color Rendering with Adaptation and Viewing Flare Considerations

5.7 Edge Enhancement

5.7.1 Human Visual Sharpness Response

5.7.2 Linear Edge Enhancement

5.7.3 Nonlinear Edge Enhancement

5.8 Compression

5.9 Conclusion

Appendix

References

5.1 Introduction

The human visual system plays a pivotal role in the creation of images in a digital camera. Perceptually based image processing helps ensure that an image generated by a digital camera “looks right” from a human visual system perspective. In achieving this goal, perceptual considerations are present in all aspects of a digital image processing chain.

In this chapter, the components of a digital image processing chain are described, with particular emphasis on how perceptual considerations influence algorithm design at each stage in the processing chain. There are a variety of ways in which the human visual system impacts processing decisions. Even before the capture button is pressed, perceptual considerations affect the decisions made by the automatic camera control algorithms that adjust the exposure, focus, and white balance settings of the camera. The design of the color filter array on the image sensor, as well as the associated demosaicking algorithm, are influenced by the human visual system. Computational efficiency is also important in a digital image processing chain, and algorithms are designed to execute quickly without sacrificing perceptual quality. Even image storage is influenced by perceptual considerations, as image compression algorithms exploit the characteristics of the human visual system to discard perceptually insignificant data and reduce the size of the image file.

The remainder of this chapter is organized as follows. In Section 5.2, a canonical digital camera image processing chain is presented. This section introduces the processing steps that are described in greater detail throughout the remainder of the chapter. Namely, Section 5.3 discusses automatic camera control, Section 5.4 focuses on demosaicking, Section 5.5 discusses noise reduction, and Section 5.6 presents color rendering. Edge enhancement is detailed in Section 5.7 and compression is discussed in Section 5.8. Concluding remarks are provided in Section 5.9.

5.2 Digital Camera Image Processing Chain

A digital camera image processing chain is constructed to simulate the most important functions of the human visual system (HVS). What follows is an overview of a representative chain [1], with detailed discussions of the components occurring in subsequent sections.

Figure 5.1 is a diagram of an example chain. The input image of this chain is a “raw” color filter array (CFA) image produced by the camera sensor. From an HVS perspective, much may have already occurred prior to the CFA image being produced. If the camera is in an automatic setting, autofocus, autoexposure, and automatic white balancing algorithms (collectively referred to as “A*”) will have sensed the scene edge content, brightness level, and illuminant color and have adjusted the corresponding camera capture parameters before the shutter button has been pressed. This can be thought of as simulating the automatic adjustment of HVS parameters, such as the eye’s pupil diameter and lens thickness, as well as the gains within the retina.

FIGURE 5.1
Digital camera image processing chain.

Once the CFA image has been read from the camera sensor, the first of possibly many noise reduction operations is performed. While the name of this operation carries an obvious meaning and intent, noise reduction is a far more subtle and sophisticated task than one might presume. From the HVS perspective, not all kinds of noise are equally important; likewise, not all kinds of scene detail are equally important. The noise reduction task therefore becomes one of modulating the aggressiveness of the cleaning process to respect the integrity of the underlying scene detail that is visually most important, while still ensuring that the noise reduction remains effective.

Following noise reduction, white-point correction, also known as white balancing, is performed. This simulates the chromatic adaptation of the HVS. Regardless of the color of the scene illuminant, be it bluish daylight or the yellowish glow of an incandescent lightbulb, colors must appear as expected. This means that a piece of white paper needs to appear white regardless of the color of the illuminating light. While the HVS performs this task automatically, it is the role of the white balancing operation to mimic this capability.

CFA interpolation, or demosaicking, is perhaps the image processing operation most unique to digital cameras. To reduce cost and size, most cameras employ a single sensor covered with a CFA. The CFA restricts each pixel in the sensor to a single color from a small set of colors, typically red, green, and blue. Since each pixel records only one color channel value, and three color channel values are needed to describe a color, it is the job of the demosaicking operation to restore the missing two color channel values at each pixel. This is accomplished by taking advantage of the inherent color channel correlation between neighboring pixels and the luminance-chrominance signal processing of the HVS.

The color values produced by the demosaicking operation are generally in a device-specific, nonstandard color space. The role of color correction is to transform the nonstandard color data into a standard color space appropriate for the final use or uses of the image. These uses could include display on a computer monitor, hardcopy printing, or storage and archiving. When the image is rendered to a device, the limitations of the destination equipment usually mean not all colors can be produced faithfully. Depending on the sophistication of the color correction process, colors that are out of gamut of the destination device can end up producing rather significant artifacts. Additionally, as with most signal amplifying operations, precautions must be taken against undue noise amplification. Therefore, the color correction operation usually compromises among color accuracy, color gamut limitations, and noise amplification. The color sensitivity of the HVS, both objectively and subjectively, becomes the primary tool for guiding such color correction compromises.

Tone scaling is the means by which the image processing chain simulates the brightness adaptation mechanism of the HVS. The HVS is capable of sensing scene brightnesses over a range of ten orders of magnitude! However, it cannot sense that entire range simultaneously. Instead, it uses brightness adaptation to properly sense the portion of the brightness range that is currently being observed, be it a brightly lit beach or a candle-lit birthday cake. In each case, the contrast of the scene is adjusted to appear similar. Tone scaling generally addresses the situation of rendering a scene captured under outdoor lighting so that it looks correct when shown on a display that is dimmer by several orders of magnitude. This is achieved by applying an appropriate nonlinear transformation (a tone scale) to the image data. The design of the tone scale naturally follows from the characteristics of the HVS.
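To make the idea concrete, the short sketch below applies a simple power-law tone scale to linear image data. The exponent and the data range are illustrative assumptions; a real camera tone scale is a more elaborate, often scene-adaptive, curve derived from HVS considerations.

```python
import numpy as np

def apply_tone_scale(linear, gamma=1.0 / 2.2):
    """Map linear relative exposure values (0..1) through a simple
    power-law tone scale, a stand-in for the nonlinear rendering
    curve a camera derives from HVS brightness-adaptation behavior."""
    return np.power(np.clip(linear, 0.0, 1.0), gamma)

# An 18% (midtone) scene gray renders to roughly 46% of the display range.
print(apply_tone_scale(np.array([0.18])))
```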

There are a number of elements of the digital camera imaging chain, both hardware and software, that act as lowpass (blurring) filters. The genuine high-frequency spatial components of the image can be significantly attenuated, if not outright eliminated, by factors such as lens aberrations and aggressive noise reduction. Edge enhancement (sharpening) attempts to restore at least some of the fidelity of this high-frequency spatial image content. Since this sharpening operation is, by its nature, a signal amplifying step, it is important that some effort is made to amplify the real image detail and not the image noise. Again, the HVS can be referenced so as to note which spatial frequencies in an image are important to sharpen, and which can be left unaltered in the name of avoiding noise amplification and image processing artifacts.

From the perspective of the HVS, there is a lot of redundancy in the image produced by a digital camera and its image processing chain. This redundancy takes the form of spatial frequency detail that is visually unimportant, if not outright undetectable. Additionally, what is important spatial frequency detail in the luminance portion of the image is not necessarily as important in the chrominance portions of the image. Lossless compression of a digital camera image looks for mathematical redundancies in the image data and removes them in order to produce a smaller image file for reduced storage requirements. Since lossless compression is exactly reversible, no information is lost, although this also means the amount of compression achieved is generally modest. Lossy compression uses knowledge of the HVS to discard data that are perceptually unimportant in order to achieve a much higher compression ratio. While the result is no longer mathematically reversible, an image with little to no visual degradation can be reconstructed from a much smaller image file.

Taken together, the image processing chain of a digital camera simulates the major operations of the HVS. The exact composition of the chain is not fixed, and Figure 5.1 must be taken as just a starting place. Some operations such as noise reduction can easily occur at different locations and more than once within the chain. Additional operations can be added, such as motion deblurring. Still, the image processing chain and its components discussed here provide a foundation from which to discuss subsequent variations.

FIGURE 5.2
High-level flow for processing in a digital camera.

5.3 Automatic Camera Control

The HVS adapts to an extremely wide range of circumstances, placing similar demands on a camera for general consumer use. This section discusses automatic camera control for a general purpose compact digital camera. The primary algorithms are autoexposure, autofocus, and automatic white balancing, sometimes collectively referred to as “A*.” Because the goal is reproduction of a scene for human viewing, perceptual considerations are key in developing the algorithms. This has been the case for decades in photography. For example, an early approach to standardization of film “speed” (measurement of film sensitivity) focused precisely on the photographic exposure required to make an excellent print [2]. On the other hand, the development of the zone system was based on exposing, processing, and printing a negative to achieve specific aims for the reproduced image [3].

While the goal is framed around the perceived quality of the reproduction, achievement of the goal begins with capturing (at least) the scene information a human sees. Maximizing the information captured is important, but a related goal is to minimize perceptually significant degradations in reproduced images. There are limits to the scene information that can be captured with conventional camera technology, and trade-offs must be made. For example, most exposure control algorithms maximize the exposure (light captured) to improve the signal-to-noise ratio in the reproduced image. However, still cameras also limit the exposure time to avoid motion blur, because the quality degradation caused by significant motion blur usually outweighs the quality improvement from increased exposure. These trade-offs vary with the scene and with viewing conditions, such as the size of the reproduced image, the image processing, and the viewing distance. For example, because perception of a stream of images is different from perception of a still image, video cameras usually control exposure time primarily to fit the frame rate of the capture and limit motion artifacts such as judder, rather than simply minimize motion blur. These trade-offs and others will be discussed in more detail later.

To discuss these decisions in context, a brief tutorial on the operation of compact digital cameras is presented. Figure 5.2 shows a simplified diagram of processing in a digital still camera. The central column of the figure illustrates the preview flow—the capture and presentation of images to display on a panel or in an electronic view finder. The right-hand column of the figure illustrates the video flow—the capture, display, and storage of a sequence of images in a video. The video and preview flows are largely the same; the primary difference is whether images are compressed and stored, or simply displayed. The block at the left of the figure represents all of the operations for capture and processing of a still image once a still capture is initiated.

In this diagram, camera operation begins in the preview oval at the top of the diagram. Normal operation continues through capture, processing, and display of a preview image, a test of the shutter trigger, and without shutter activation a return to the top of the preview column for another iteration. The diamond decision blocks in the figure show responses to shutter operation. If the camera is set in still capture mode, the preview flow is active until the shutter is actuated. When the shutter is triggered, the still capture process is executed, then processing returns to the preview flow. If the camera is set in video capture mode, triggering the shutter moves control to the video flow, which includes capture, processing, display, compression, and storage. Once in the video flow, the camera stays in video operation until the shutter is triggered again, finishing storage of the video sequence and returning the camera to preview mode.

From a camera control perspective, the preview and video flows are essentially identical. In both cases, there is a stream of images being captured, processed, and displayed. Both flows have the same constraints: operation at video rates, such as 30 frames per second, with images of low to moderate resolution (for example, 640 × 480 up to 1920 × 1080). Still capture mode entails capture of a high-resolution image; however, it does not need to fit the same constraints. The exposure time for a still image may be different from the exposure time for a video frame, and flash or other active illumination may be used. However, exposure and focus decisions about capture of the still image are based primarily on analysis of the preview stream. The following discussion will describe the preview stream, and the reader should understand that the comments generally apply to video capture as well.

Figure 5.3 shows a slightly more detailed diagram of processing in a typical compact digital camera while a scene is being composed for capture. The first step, in the upper left corner of Figure 5.3, is capture of a raw preview image, stored in a buffer denoted by the oval at the top of the figure. This image is fed to three analysis processes (exposure, focus, white balance) as well as to an image processing chain to render the preview image for display. The three analysis processes may interact with each other, but are shown this way because the parameters for controlling exposure, focus, and white balance are largely independent and will be discussed in turn. This largely independent processing for exposure and focus is also practiced in digital single-lens reflex (DSLR) cameras, which often have separate, dedicated subsystems for focus and exposure control. The perceptual issues with DSLRs are similar to those with compact digital cameras, although there is greater use of dedicated hardware in DSLRs. The current discussion will use the compact digital camera as an example system, to highlight algorithm development rather than hardware design.

FIGURE 5.3
Nominal flow for preview analysis and display.

After a preview image is analyzed, the capture parameters are updated, and the image is displayed, the process is repeated. The processing chain used to display the preview image normally uses parameters already determined from previous preview images, rather than holding up the display process while waiting for analysis of the current image to complete.

Other processing steps, such as face detection, face recognition, and automatic scene detection, not shown in the figure, are often used in digital cameras. These processes can provide important information for camera control, but are not tied to one specific camera control decision. This discussion will focus on the simpler algorithms tied more directly to control of a camera.

5.3.1 Automatic Exposure

As shown in Figure 5.3, exposure control is essentially a feedback control loop, repeatedly analyzing a captured image and updating the exposure parameters. The degree to which the preview image is underexposed or overexposed is considered the exposure error, the feedback signal for this control loop. The discussion here separately addresses the analysis of exposure error in a preview image and the use of different parameters to control exposure.

As with any control loop, stability is important, but for this loop, the criteria for closed loop behavior are based on human perception, since the goal is to produce a stream of rendered images that are all exposed well. Rapid adjustment to a new exposure when the scene changes significantly must be balanced against the desire for stable exposure in the face of minor scene changes.

When capturing video, this exposure control loop is all that is needed, since the video sequence is the captured product. In a digital still camera, the analysis performed on the preview stream can be a precursor to further exposure analysis for a still capture. Because the processing path for the still capture is different from the video path, the optimal exposure can be different, so exposure is often re-analyzed just before a still capture. In addition, still capture normally has different limitations on exposure parameters and active illumination than video capture. This will be discussed further under exposure control.

5.3.1.1 Exposure Analysis

In simplified form, the exposure analysis module considers whether the latest preview image is well exposed, overexposed, or underexposed. This can be quantified as a relative exposure error, and exposure parameters (discussed later) are adjusted to minimize the exposure error. To begin, the normal definition of proper exposure is simply that exposure which provides a suitable image when viewed in print or on a display. This means the proper exposure depends on the image processing chain, especially the color and tone rendering steps.

Some exposure analysis, such as detection of faces, other specific content, or a pyramid decomposition, may be too computationally demanding to run at preview frame rates. In this case, the preview exposure analysis is often simplified to fit the available resources. Sometimes, the complex processes are run less often, with the results fed into the exposure control loop when available.

For decades, with relatively static rendering systems (film and paper tone scales), automatic exposure analysis focused on aligning the exposure of a midtone gray with the proper location on a fixed tone curve. Development of digital adaptive tone scale technology has made the exposure problem more complex as the rendering process becomes more complex, but most camera autoexposure algorithms still operate as extensions of the traditional process.

Human perception is not linear with exposure; a midtone gray is not 50% reflectance. This is illustrated in Figure 5.4, showing a plot of the conversion from linear relative exposure to CIE L*, a perceptually even measure of lightness [4]. As shown in Figure 5.4, a midtone gray near 50 L* corresponds to a relative exposure of 18%, highlighted by the small circle on the curve at L* 50. Most automatic exposure algorithms are based on some extension of the observation that images of many different scenes will average out to this 18% gray, often known as the gray world model. Many scenes taken together do average near 18% gray, but individual scenes vary widely from the average.

FIGURE 5.4
CIE L* vs. relative linear exposure.
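The conversion plotted in Figure 5.4 can be reproduced with the standard CIE 1976 lightness formula. The minimal sketch below confirms that an 18% relative exposure lands near L* = 50; the function itself is the standard definition, while the example values are illustrative.

```python
def relative_exposure_to_lstar(y):
    """CIE 1976 lightness L* as a function of relative linear exposure
    (y = 1.0 corresponds to diffuse white)."""
    if y > 0.008856:
        return 116.0 * y ** (1.0 / 3.0) - 16.0
    return 903.3 * y

# A midtone gray of 18% relative exposure gives an L* of about 49.5.
print(round(relative_exposure_to_lstar(0.18), 1))
```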

Because the gray world model is so weak when applied to an individual scene, greater sophistication is needed. The approaches involve various schemes for weighting regions of the image differently, trying to improve the correlation between an estimated gray and a perceptual gray. In effect, the goal is to roughly discount what is unimportant and emphasize what is most important. Fully automatic metering systems use a combination of the following:

  • Area weighting uses the spatial location of pixels within the image to determine significance. Spot metering and center-weighted metering are the simplest examples of area weighting.

  • Tone weighting uses the tone value to determine significance. The simpler methods weight the central portion of the tone scale more heavily, although usually with less emphasis on shadows than lighter tones.

  • Modulation weighting uses the local modulation (texture) in the scene to control weighting, to avoid the average being overly affected by large, flat areas in the scene.

These weighting methods are combined in practice, and even early metering systems, such as one presented in Reference [5], describe combinations of area weighting and modulation weighting, especially considering scene types with substantial differences between center areas and surround areas. Because the different weighting methods are motivated by different perceptual scenarios, they are discussed separately below.
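The sketch below shows one way the three weighting schemes might be combined into a single weighted lightness estimate. The specific weight functions, centers, and widths are illustrative assumptions, not a published metering design.

```python
import numpy as np

def metering_weights(lstar):
    """Combine area, tone, and modulation weights for exposure metering.

    lstar : 2D array of preview-image lightness (CIE L*) values.
    All weighting functions below are illustrative placeholders.
    """
    rows, cols = lstar.shape

    # Area weighting: favor the central portion of the frame.
    y, x = np.mgrid[0:rows, 0:cols]
    r2 = ((y - rows / 2.0) / rows) ** 2 + ((x - cols / 2.0) / cols) ** 2
    area_w = np.exp(-4.0 * r2)

    # Tone weighting: favor midtones, with less emphasis on deep shadows.
    tone_w = np.exp(-(((lstar - 55.0) / 35.0) ** 2))
    tone_w = np.where(lstar < 20.0, 0.5 * tone_w, tone_w)

    # Modulation weighting: favor locally textured regions, using the
    # gradient magnitude as a low-cost texture measure.
    gy, gx = np.gradient(lstar)
    mod_w = np.clip(np.hypot(gx, gy) / 10.0, 0.1, 1.0)

    return area_w * tone_w * mod_w

def metered_lightness(lstar):
    """Weighted mean lightness used as the metered value for the scene."""
    w = metering_weights(lstar)
    return float(np.sum(w * lstar) / np.sum(w))
```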

Area weighting is motivated by the realization that the subject of a scene is more likely to be in the central portion of the image, the upper portion of the scene is more likely to contain sky, and so on. Other models of the probability of subject locations within images can also be used. An early example of area weighting can be found in Reference [5], with more recent implementations presented in References [6] and [7]. The use of multiple metering zones continues to develop, especially combined with modulation weighting and recognition of classified scene types.

Tone weighting is motivated by the fact that different portions of the tone scale in a rendered image have different importance. Exposure analysis from a preview image is complicated by the fact that many scenes have a greater observed dynamic range than can be captured in a single exposure, and also greater dynamic range than can be displayed in the reproduced image. When scene content has an exposure beyond the captured or rendered range of an image, it is clipped, such as a white wedding dress being rendered without highlight detail. In practice, the problem of clipping tends to be one-sided, since only highlights are clipped at capture, while shadows recede into the dark noise. On the other hand, the large dynamic range of many scenes means a simple approach, such as calculating exposure to capture the brightest portion of the scene without clipping, is likely to fail. Images containing candles, the sun, and other light sources would be exposed “for the highlights,” resulting in a much lower signal-to-noise ratio than desired for the rest of the scene.

As with all averaging processes, a small variation in the distribution of extreme scene exposure values affects the mean much more than similar variations closer to the mean. For example, exposure of an evening street scene would vary with the number of street lights included in the field of view. The effect is partially addressed by transforming data to a more visually even metric, such as CIE L*, before computing an average. Considering the shape of Figure 5.4, the compression of highlight information (decreasing slope with increased L*) mitigates the impact of highlight data on an average. The limited exposure range for many consumer cameras also tends to limit the problem, since these highlight values are often clipped in capture of a preview image. A clear example emphasizing the use of tone weighting is presented in Reference [8].

Modulation weighting is driven by the realization that people pay more attention to scene regions with modulation than to flat areas. Averaging image areas near edges also tends to make the results more robust, since scene tones on both sides of an edge are included. The challenge is to avoid confusing modulation caused by noise with modulation caused by the scene. This can be done by using a bandpass analysis that is sensitive to relatively low frequencies. More sophisticated use of modulation is described in Reference [9].

Modeling of the human visual response to scenes with high dynamic range has expanded with developments in high dynamic range capture and rendering technology. One example is a multi-scale pattern, luminance, and color processing model for the HVS [10]. A somewhat different approach is the development of a tone reproduction operator to generate suitable reproduction images from a high dynamic range scene [11]. Because of their complexity, these analyses are mostly used in development of adaptive renderings, rather than in precise exposure control of the initial captured image(s).

Many refinements of automatic exposure analysis use classification approaches, such as identifying scenes with faces, blue skies, or snow, and handling them differently. These approaches tend to operate heuristically, but they are driven more by perception than optimization of a single objective metric. Development of theory-based objective metrics that correlate with subjective decisions is an open area of research.

Figure 5.5 shows an example of the difference between naive gray world exposure and perceptually driven exposure. Two different portrait scenes are shown, one with a dark background and one with a light background. Figures 5.5a and 5.5b were captured with perceptually based exposure, resulting in fairly consistent exposure for the main subject. Figures 5.5c and 5.5d illustrate naive exposure control. Both of these images average to a mid-tone gray, but the overall captures are not as good as with perceptually based exposure.

FIGURE 5.5 (See color insert.)
An illustration of exposure: (a) portrait with a dark background and perceptual exposure, (b) portrait with a light background and perceptual exposure, (c) portrait with a dark background and naive mean value exposure, and (d) portrait with a light background and naive mean value exposure.

The result of exposure analysis can be quantified as an estimate of how far the preview image is from optimal exposure, providing a natural error signal for the preview exposure control loop.

5.3.1.2 Exposure Control

The parameters used for capture of the preview image and the estimated exposure error can be combined to compute the scene brightness. Once the scene brightness is estimated, the next challenge is to determine how to best utilize the camera exposure controls to capture the given scene. Physical camera exposure is controlled through aperture (f/number), gain, exposure time, and flash, or other active illumination. The parameters that control exposure have simple physical models, but they have multiple, often conflicting, perceptual effects in the final image, depending on the scene captured. For example, increasing gain allows rendering of images captured with less exposure, at the expense of increasing the visibility of noise in the image. Increasing the exposure time used to capture an image increases the amount of light captured, but also increases the potential for motion blur in the captured image. Active illumination, such as flash or light-emitting diodes, adds light to the scene being captured, increasing the signal and improving the signal-to-noise ratio, and also filling in some shadows cast by harsh ambient illumination. The exposure added falls off with the square of the distance to the scene, so active illumination has a very different impact on foreground and background scene content.

Exposure calculations for cameras are commonly based on the Additive System of Photographic Exposure (APEX) [12]. In computing exposure for digital cameras, it is common to refer to changing the camera’s ISO setting, but this is more accurately changing an exposure index (EI) setting. The notions of EI and ISO value have become confused in common usage associated with digital photography, so a brief tutorial is presented here.

ISO speed ratings were developed to quantify photographic film speed or sensitivity. The ISO value quantifies the exposure required at the film surface to obtain standard image quality. A film ISO rating can also be entered into a light meter and used to control exposure metering to obtain the desired exposure on the film. When the ISO value is used in metering, the precise descriptor for this value is exposure index (EI). While ISO represents the exposure a particular film needs for standard exposure, EI is a way to specify how much exposure is actually used for a single capture of a scene.

The standard for digital camera sensitivity [13] defines at least two pertinent ISO values. One is the saturation limited ISO, typically between 40 and 100 for consumer cameras, quantifying the lowest EI value at which the camera can operate. The other is the noise limited ISO, usually ranging up to 1600 or so for compact cameras, quantifying the maximum EI value at which the camera can operate and still achieve the signal-to-noise ratio stated in the sensitivity standard [13]. These two ISO values quantify the range of exposures within which the camera can produce normally acceptable images, although it is common for a camera to allow operation over a different range of EI values. The user interface for most digital cameras uses the term ISO, although as explained previously, EI is a more precise term.

Adjustment of EI in a digital camera has a dual role in exposure control. In the first role, selection of a specific EI for capture of a scene determines how much light is accumulated on the sensor for that capture, although several other parameters, such as exposure time and f/number, are used to control the exposure. In the other role, image processing parameters, such as gain, noise reduction, and sharpening, are usually adjusted in a camera as a function of EI. This allows a camera to provide its best image rendering for a given EI.

ALGORITHM 5.1

A simplified automatic exposure control algorithm.

  1. Set f/number to the lowest available value and set EI to the camera’s saturation-limited ISO (lowest EI value).

  2. Estimate scene brightness from the preview stream.

  3. Calculate exposure time t from scene brightness, f/number, and EI.

  4. If t > tmax, set t = tmax and obtain EI from scene brightness, f/number, and t.

  5. If t < tmin, set t = tmin and obtain f/number from scene brightness, EI, and t.

Returning to exposure control, the optimum trade-offs between various capture parameters (for instance, f/number, exposure time, EI, flash) vary depending on the scene conditions, the design limitations of the camera, and the viewing conditions. Comprehensive understanding of the scene is an open research question, so most approaches for choosing a set of exposure control parameters have been based on very simple heuristics, such as the one depicted in Algorithm 5.1. In this example, tmax is a threshold for maximum exposure time, normally calculated based on the focal length of the camera to limit the risk of motion blur. The term tmin denotes a minimum exposure time, based on the shutter limitations of the camera. This very simple model works fairly well for many scenes in the context of a compact digital camera.
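A minimal Python rendering of Algorithm 5.1 follows. The brightness-to-exposure relation uses the standard reflected-light metering equation t = K·N²/(L·S) with K ≈ 12.5, and the specific camera limits (lens aperture range, EI range, shutter range) are illustrative assumptions rather than values from the text.

```python
import math

K = 12.5  # reflected-light meter calibration constant

def auto_expose(scene_luminance, f_min=2.8, f_max=8.0,
                ei_min=100, ei_max=1600, t_min=1 / 2000, t_max=1 / 30):
    """Simplified exposure control in the spirit of Algorithm 5.1.

    scene_luminance : estimated scene brightness in cd/m^2.
    Returns (f_number, exposure_time, EI). Camera limits are assumptions.
    """
    n, ei = f_min, ei_min                       # step 1: widest aperture, lowest EI
    t = K * n ** 2 / (scene_luminance * ei)     # step 3: solve for exposure time

    if t > t_max:                               # step 4: dim scene, cap t and raise EI
        t = t_max
        ei = min(K * n ** 2 / (scene_luminance * t), ei_max)
    elif t < t_min:                             # step 5: bright scene, cap t and stop down
        t = t_min
        n = min(math.sqrt(scene_luminance * ei * t / K), f_max)

    return n, t, ei

# A dim indoor scene (~25 cd/m^2) hits the exposure-time limit, so EI is raised.
print(auto_expose(25.0))
```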

The relatively large depth of field for a compact digital camera means there is usually not a significant advantage from stopping down the lens. With the small pixels often used in compact digital cameras, decreasing the lens aperture (running at higher f/number) usually degrades resolution and sharpness, as well as decreasing the light available for capture with the sensor. The optical reasons for this are discussed in more detail in the Appendix.

These considerations lead to an exposure control algorithm that keeps the f/number at the lowest value possible. This leaves exposure time, EI, and active illumination available for exposure control. Of these parameters, EI is usually simply adjusted to provide an effective change in sensitivity. In the end, there are practically two variables to change: active illumination and exposure time. Before the development of practical motion estimation technology, the process of balancing the risk of motion blur associated with longer exposure time against signal-to-noise ratio was done heuristically based on the focal length of the camera. Some recent work explicitly models the quality impact of motion blur versus the quality impact of operation with higher exposure index [14], providing a framework for a more precise trade-off. This framework has been extended to include use of electronic flash along with ambient illumination [15].

As mentioned previously, video and still capture have different exposure constraints. While exposure time for a still capture can easily range from 1/1000 second to 1/15 second or longer, exposure time for video is usually limited. The exposure time for a video frame can approach the frame time, but it cannot be longer than the frame time (for instance, 1/30 second for video at 30 frames per second). There are also two perceptual artifacts that further constrain exposure time in video. When motion is captured in a video, longer exposure times increase the motion blur captured in each frame, which can lead to a soft, blurry video of low quality. Conversely, exposure times much shorter than the frame time can lead to frames with very little motion blur, but large motion displacement between frames. This presents the visual system with an artificial stop-action kind of motion. In practice, the exposure time for video capture is usually selected to minimize or balance both of these motion artifacts.

5.3.2 Automatic Focus

As with exposure, the goal for automatic focus is to adjust the camera to capture the scene at hand, with the most important portion of the scene in best focus. The depth of field depends on the focal length of the lens, the aperture (f/number), and the focus setting. Some cameras, particularly inexpensive ones with fixed wide angle lenses, are fixed focus, obviating automatic focus completely. Focus control is more critical at longer focal lengths, because depth of field is decreased. As a result, cameras with zoom lenses have focus requirements that vary with focal length. Finally, focus for a high-resolution capture is more critical than for a low-resolution capture. This is important for cameras that provide both still and video functions, since stills are usually much higher resolution than videos.

5.3.2.1 Focus Analysis

The HVS has greater sensitivity to luminance detail than chrominance detail, and the green channel has a greater impact on luminance than the red or blue channels. Focus analysis is simplified by choosing a single color band for which to optimize focus, normally the green channel.

Focus can be analyzed in a number of ways, but the most common approach in compact digital cameras is contrast maximization. A series of images are captured and analyzed with the lens at different focus settings. Local contrast values at the different focus settings are compared, and the focus setting that provides maximum local neighborhood contrast, or best sharpness, is chosen as being in best focus. A key step in the contrast maximization process is filtering to determine local contrast. The usual approach is to process a preview image with one or more bandpass filters, designed to maximize the ability to determine changes in sharpness. Focus changes affect higher more than lower spatial frequencies, but scene modulation is also reduced at higher frequencies, especially considering the modulation transfer function of the camera lens, reducing the signal available for focus detection. In addition, human perception of sharpness tends to be weighted toward lower frequencies, as discussed later in Section 5.7.
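A hedged sketch of contrast-maximization autofocus is given below: a simple high-pass kernel stands in for the tuned bandpass focus filter, and the focus setting with the largest filtered-energy figure of merit is selected. Both the kernel and the figure of merit are illustrative choices.

```python
import numpy as np

def focus_value(green):
    """Figure of merit for focus: summed absolute response of a simple
    [-1, 2, -1] horizontal kernel over the green channel of a focus window.
    A real camera would use one or more carefully tuned bandpass filters."""
    kernel = np.array([-1.0, 2.0, -1.0])
    response = np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="valid"), axis=1, arr=green)
    return float(np.sum(np.abs(response)))

def best_focus(green_by_setting):
    """Choose the lens focus setting whose image maximizes local contrast.

    green_by_setting : dict mapping focus setting -> 2D green-channel array.
    """
    return max(green_by_setting, key=lambda s: focus_value(green_by_setting[s]))
```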

Human perception also influences focus analysis in the selection of a focus region within the image. When a scene has significant depth, it is crucial that the focus analysis emphasizes the scene content most likely to be the subject. Preferably, focus analysis is also performed on image data in a metric with a visually even tone scale, so that edges at different tone values produce similar contrast values. Calculating local contrast in a non-visual metric, such as relative linear exposure, will make brighter areas of the scene appear to have much greater contrast than midtones.

Selection of focus areas also considers analysis results from multiple focus settings. Regions that show a sharper peak in focus values are usually weighted more heavily than regions that show less of a focus peak, which usually have less edge content. Regions with peak contrast at a closer focus setting are also more likely to be the main subject.

Viewing a video raises temporal concerns not present with a still capture. In particular, stability of the focus setting is important. As described earlier, contrast maximization requires images collected at different focus settings, some of which must be measurably out of focus in order to ensure the best focus has been found. The challenge is to measure local sharpness changes to confirm focus, while minimizing the visibility of focus changes in the stored video stream. Analysis of frame or scene differences can be used to help the camera determine when the scene has shifted enough to warrant refocusing.

Most other interesting features of focus analysis algorithms are driven by optical and engineering concerns rather than perceptual ones. Reference [16] describes an autofocus system in a digital camera, including discussion of system hardware, focus evaluation, and logic for focus region weighting. Reference [17] describes the use of noise reduction and the use of filter banks for improved autofocus analysis. The temporal problems of maintaining focus in video capture are discussed in Reference [18].

5.3.2.2 Focus Control

Focus control is essentially the movement of one or more elements in the camera or lens to bring the image into best focus. The perceptual concerns here are minor; the major concerns are mechanical and optical. The only impact human perception has on focus control is in tolerance analysis.

5.3.3 Automatic White Balancing

Automatic white balancing has been an active area of research since the development of color photography. It is the technology that allows images to be captured under a wide variety of illuminants and then reproduced in a pleasing way. Because the HVS has great flexibility in adaptation to different illuminants, cameras must provide roughly similar behavior. As with automatic exposure and focus, white balance correction must have the desired speed of response to scene and illuminant changes, while offering stability in the face of insignificant changes.

5.3.3.1 White Balance Analysis

Much of the early work on white balance analysis focused on automatic creation of prints from color negatives. The gray world hypothesis (the average of many scenes will integrate to 18% gray) was first mentioned in Reference [19]. Models of human adaptation to different illuminants are considered in color science, although the models tend to be more complex than those commonly used in cameras. The problem usually addressed in a camera is more precisely illuminant estimation, rather than complete modeling of human adaptation to it.

White balance analysis during preview or video capture must be computationally lightweight, usually using dedicated hardware and very simple processing based on extensions of the original gray world model. The goal of the analysis is the estimation of a gain triplet, with an element for each of three color channels, which scales the native sensor response so neutral scene content will have equal mean values in all channels. Having approximately equal channel responses for neutral scene content improves robustness in the later color correction operation, allowing colors to move around the neutral axis. Tone correction can then render the tone scale for proper reproduction. Some camera sensors have more than three channels [20], but the automatic white balance analysis is quite similar.

Most white balance analysis is done in a luminance-chrominance color space in order to better separate exposure effects from illuminant balance effects, somewhat akin to the human perception of color [21]. Because this analysis is done before full color correction, it is not in a calibrated color space, but even an approximate separation of luminance from chrominance helps decorrelate the channels for analysis. A common color space used is the ratio of red to blue and green to magenta (combined red and blue). If the linear red, green, and blue values are converted to a log space, this is easily accomplished with a matrix transform similar to one proposed in Reference [22], shown below:

$$\begin{bmatrix} Y \\ C_1 \\ C_2 \end{bmatrix} = \frac{1}{4}\begin{bmatrix} 1 & 2 & 1 \\ -1 & 2 & -1 \\ 2 & 0 & -2 \end{bmatrix}\begin{bmatrix} R \\ G \\ B \end{bmatrix} \tag{5.1}$$
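A small sketch applying Equation 5.1 to log-exposure data is shown below; the sign conventions follow the matrix as reconstructed above, and the choice of log base and clamping are illustrative.

```python
import numpy as np

# Matrix of Equation 5.1: Y is a luminance average, C1 contrasts green
# against magenta (R + B), and C2 contrasts red against blue.
M = 0.25 * np.array([[ 1.0, 2.0,  1.0],
                     [-1.0, 2.0, -1.0],
                     [ 2.0, 0.0, -2.0]])

def rgb_to_yc1c2(rgb_linear):
    """Convert a linear RGB triplet to a log-space luminance-chrominance
    triplet for white balance analysis."""
    log_rgb = np.log2(np.maximum(np.asarray(rgb_linear, dtype=float), 1e-6))
    return M @ log_rgb

# A neutral (equal-channel) patch maps to zero chrominance.
print(rgb_to_yc1c2([0.25, 0.25, 0.25]))  # [-2.  0.  0.]
```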

Algorithms for white balance analysis usually employ averaging schemes with variable weighting of different content in the scene. The variable weighting schemes generally emphasize regions based on five perceptual phenomena:

  • local modulation (texture),

  • distance from the neutral axis,

  • location in the tone scale, particularly highlights,

  • recognizable scene content, such as faces or memory colors, and

  • human sensitivity to different white balance errors.

The motivation for emphasizing areas with modulation or texture is partly statistical. It avoids having the balance driven one way or another by large areas of a single color, such as red buildings, blue skies, or green foliage. Further, as with exposure analysis, there is greater likelihood that the main subject is in a region with modulation.

Colors closer to the neutral axis are weighted more heavily than saturated colors, because people are more sensitive to white balance errors in neutrals. It also helps stabilize the balance estimate a bit, as the extreme colors are weighted less. This concept complicates the algorithm, since the notion of neutral is not well defined before performing illuminant estimation. This can be approached by applying a calibrated daylight balance, then analyzing colors in the resulting image [23]. Another option is to estimate the probability of each of several illuminants being the actual scene illuminant [24], [25].

Highlight colors can be given special consideration, based on the theory that highlights are specular reflections that are the color of the illuminant [26], [27], [28]. This theory is not applicable for scenes that have no truly specular highlights. However, human sensitivity to balance errors tends to diminish in shadows, justifying a tonally based weighting even in the absence of specular highlights.

Some white balance approaches also consider the color of particular scene content. The most common of these is using face detection to guide balance adjustments, providing a reasonable balance for the face(s). This has other challenges, because faces themselves vary in color. Memory colors, such as blue skies and green foliage, may also help, but the complexity of reliably detecting specific scene content, especially before proper balance and color correction, makes this challenging.

Finally, because human sensitivity to balance errors is not symmetric, white balance algorithms are adjusted to avoid the worst quality losses. For example, balance errors in the green-magenta direction are more perceptually objectionable than balance errors in the red-blue direction. Preference can also enter into the tuning. For example, many people prefer a slightly warm (yellowish) reproduction of faces to a slightly cool (bluish) reproduction.

When a camera is used in a still capture mode, the white balance analysis for the full resolution still image can be more complex than the analysis run at video rates, since the regular preview stream is interrupted. If the camera uses a flash or other active illumination, the white balance analysis will yield very different results from analysis of preview images, captured without active illumination. Further, since flash is often a somewhat different color balance than the ambient illumination, still captures with flash are usually mixed illuminant captures, making the estimation problem more complex, although use of distance or range information can help.

5.3.3.2 White Balance Control

Control of white balance is almost always the application of a global adjustment to each color channel. If the balance is applied to linear data, the white balance correction is a triplet of gain factors. Because the white balance operation is usually approached independently of the exposure operation, the triplet of channel gains has two degrees of freedom. Multiplying all channels by a constant gain is in effect an exposure adjustment; therefore, the gain triplet is usually normalized so the minimum gain in the triplet is one, leaving only two values greater than one.
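A minimal sketch of the gain computation described above: given estimated mean channel responses for neutral scene content, the gains equalize the channels and are normalized so the smallest gain is one. The example input values are illustrative.

```python
def white_balance_gains(neutral_means):
    """Return (r_gain, g_gain, b_gain) that equalize the channel means for
    neutral content, normalized so the minimum gain is exactly 1.0 and
    overall exposure is left to the exposure control loop."""
    m = max(neutral_means)
    return tuple(m / c for c in neutral_means)

# Example: tungsten-like responses with a strong red and weak blue channel.
print(white_balance_gains((0.8, 0.5, 0.3)))  # approximately (1.0, 1.6, 2.67)
```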

Some algorithms deal with flash as a special case of mixed illuminant capture by using a spatially varying balance correction. This is perceptually helpful, since a human observer rarely ever “sees” the scene with the flash illumination, so it is especially desirable to minimize the appearance of mixed illuminant white balance problems in the reproduced image.

5.4 Color Filter Array Interpolation (Demosaicking)

The process of designing CFAs and the related task of CFA interpolation (or demosaicking) can derive significant inspiration from the HVS. Due to differences in how the HVS perceives edges based on their orientation and colors based on their luminance-chrominance composition, significant economies can be made in the CFA process without sacrificing high image quality in the final results. The details of HVS-driven array and algorithm designs are discussed below.

5.4.1 Color Filter Array

As stated in Section 5.2, CFA interpolation attempts to reconstruct the missing color values at each pixel that were not directly captured [29]. By far the most ubiquitous CFA is the Bayer pattern [30]. Referring to Fig. 5.6a, this pattern was initially introduced as a luminance-chrominance sampling pattern, rather than a red-green-blue system. It used the HVS as the motivation: “In providing for a dominance of luminance sampling, recognition is taken of the human visual system’s relatively greater ability to discern luminance detail (than chrominance detail)” [30]. This reasoning has set the tone for most subsequent efforts in CFA design. When designing a new CFA pattern, the first step is to establish a strong luminance sampling backbone. This frequently takes the form of a checkerboard of luminance pixels as in Figure 5.6a. This is attractive from an HVS perspective due to the oblique effect [31], [32]. The oblique effect is the observed reduction of the visibility of diagonal edges compared to horizontal and vertical edges. The relationship between the oblique effect and CFA sampling with a checkerboard array can be developed from the frequency response of this sampling pattern as follows [33]:

FIGURE 5.6
Bayer CFA pattern: (a) luminance-chrominance version, (b) red-green-blue version.

FIGURE 5.7
Checkerboard sampling Nyquist diagram.

$$S(\xi, \eta) = \frac{1}{2}\,\mathrm{comb}\!\left(\frac{\xi - \eta}{2}, \frac{\xi + \eta}{2}\right), \tag{5.2}$$

with the comb function defined as

$$\mathrm{comb}\!\left(\frac{x}{b}, \frac{y}{d}\right) = |bd| \sum_{m=-\infty}^{\infty} \sum_{n=-\infty}^{\infty} \delta(x - mb,\, y - nd), \tag{5.3}$$

where δ (x, y) is the two-dimensional delta function [34]. The corresponding Nyquist diagram of Equation 5.2 is shown in Figure 5.7. The dots mark the centers of the fundamental frequency component (within the dashed diamond) and some of the sideband frequency components of the frequency response. The dashed lines indicate the Nyquist sampling frequency of the fundamental component, that is, the maximum detectable spatial frequency of the sampling array. It can be seen that the Nyquist frequency reaches its peak along the horizontal and vertical axes and is lowest in the diagonal directions. Therefore, in accordance with the oblique effect, a checkerboard array has its maximum frequency response when sampling horizontal and vertical edges, and its minimum frequency response when sampling diagonal edges.

Once the positions of the luminance pixels are fixed in the CFA pattern, the remaining locations are usually evenly divided among the chrominance channels. In the case of a 2 × 2 pattern such as Figure 5.6a, the locations of the chrominance pixels are essentially forced. However, with larger CFA patterns, additional degrees of freedom become available and multiple chrominance patterns are then possible for a given luminance pattern.

FIGURE 5.8
Bayer green interpolation neighborhood.

5.4.2 Luminance Interpolation

It is natural to approach the task of interpolating the CFA luminance values by using neighboring luminance values within the CFA pattern, such as in Figure 5.6a. In Figure 5.6b, this would be equated to interpolating missing green values from neighboring green values. This can be posed as a linear or nonlinear interpolation problem. Figure 5.8 depicts a typical green pixel interpolation neighborhood. Linearly, one would simply compute G0 = (G1 + G2 + G3 + G4)/4. Nonlinearly, the approach would be to perform directional edge detection and interpolate in the direction of least edge activity [35]. For example, if |G2 − G3| ≤ |G1 − G4|, then G0 = (G2 + G3)/2. Otherwise, G0 = (G1 + G4)/2. Any number of more sophisticated decision-making processes can be described, but the essence would be the same.

All of the preceding misses a point, however, from the HVS perspective. The red and blue pixels do not produce true chrominance values. Instead, they produce color values that can be thought of as consisting of a luminance component and a chrominance component. For example, in Figure 5.9, a full-color image (Figure 5.9a) is shown with its corresponding green (Figure 5.9b) and red (Figure 5.9c) color channels. Both color channels have strong luminance components. Therefore, it is possible for the red and blue pixel values to contribute to the interpolation of green pixel values. A simple, yet very useful, model is given below:

$$R = Y + C_R, \quad G = Y, \quad B = Y + C_B. \tag{5.4}$$

This model reflects the color channel correlation present in the HVS [1]. Over small distances in the image, CR and CB can be considered to be approximately constant, as shown in Figure 5.9d. Under this model, spatial variation is largely confined to the luminance channel with chrominance information changing slowly across the image. Using Equation 5.4 and Figure 5.10, an improved luminance interpolation calculation based on HVS considerations can be developed [36]. Assuming a symmetric interpolation kernel, the horizontally interpolated green value G5 can be computed as follows:

$$G_5 = a_1(R_3 - C_R) + a_2 G_4 + a_3(R_5 - C_R) + a_2 G_6 + a_1(R_7 - C_R). \tag{5.5}$$

In this equation it is assumed that the chrominance term CR is constant over the depicted pixel neighborhood. Three constraints are required to determine the coefficients a1 through a3. Since CR is generally unknown, the first constraint is to eliminate it from the equation.

FIGURE 5.9
Full-color image and its components: (a) original image, (b) green channel, (c) red channel, and (d) CR channel.

FIGURE 5.10
Bayer green-red interpolation neighborhood.

This is done by computing its coefficient and setting it to zero:

$$G_5 = a_1 R_3 + a_2 G_4 + a_3 R_5 + a_2 G_6 + a_1 R_7 - (2a_1 + a_3)\,C_R \;\;\Rightarrow\;\; a_3 = -2a_1, \tag{5.6}$$

resulting in

$$G_5 = a_2 (G_4 + G_6) + a_1 (R_3 - 2R_5 + R_7). \tag{5.7}$$

FIGURE 5.11
Triangular edge profile with collocated sinusoid.

The next constraint to add is that a neutral flat field input produces a neutral flat field output. In other words, if G4 = G6 = R3 = R5 = R7 = k, then G5 = k. This results in a2 = 1/2. Finally, there is no one ideal constraint for determining a1, which is the last coefficient. One reasonable constraint is the exact reconstruction of a neutral edge with a triangular profile, as shown in Figure 5.11. As indicated, this edge profile could also be a sampled version of 1 + cos (2πx/4). Setting G4 = G6 = 1, R3 = R7 = 0, and R5 = G5 = 2 produces a1 = −1/4. The final expression is given below [37]:

$$G_5 = \frac{G_4 + G_6}{2} - \frac{R_3 - 2R_5 + R_7}{4}. \tag{5.8}$$

The set of constraints used could also be interpreted as imposing exact reconstruction at spatial frequencies of 0 and 1/4 cycles/sample. While the constraint at 0 cycles/sample is probably best left intact, the second constraint could easily be changed to a different spatial frequency with a differing corresponding value for a1. A vertical interpolation expression would be generated in the same way. For a linear interpolation algorithm, both horizontal and vertical expressions can be averaged to produce a two-dimensional interpolation kernel as follows:

$$G_5 = \frac{G_2 + G_4 + G_6 + G_8}{4} - \frac{R_1 + R_3 - 4R_5 + R_7 + R_9}{8}. \tag{5.9}$$

The color channel correlation of the HVS can also be used to improve the edge detectors used in a nonlinear, adaptive interpolation algorithm [38]. Using reasoning similar to that used above, the horizontal edge detector can be augmented from |G4 − G6| to |G4 − G6| + |R3 − 2R5 + R7| and the vertical edge detector from |G2 − G8| to |G2 − G8| + |R1 − 2R5 + R9|. Again, the direction of least edge activity would be interpolated with the corresponding horizontal or vertical interpolation expression.
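The adaptive green interpolation at a red Bayer site can be summarized in a few lines. The sketch below combines the augmented edge detectors with Equation 5.8 and its vertical counterpart; the assumption about how the CFA is stored as a single 2D array is an illustrative layout choice.

```python
def interpolate_green_at_red(cfa, row, col):
    """Adaptive green estimate at a red Bayer location (row, col) of a
    2D CFA array, using the augmented horizontal and vertical edge
    detectors and the color-correlation-corrected averages of
    Equation 5.8 (and its vertical counterpart)."""
    g_left, g_right = cfa[row, col - 1], cfa[row, col + 1]
    g_up, g_down = cfa[row - 1, col], cfa[row + 1, col]
    r_center = cfa[row, col]
    r_left, r_right = cfa[row, col - 2], cfa[row, col + 2]
    r_up, r_down = cfa[row - 2, col], cfa[row + 2, col]

    # Augmented edge detectors: green difference plus red Laplacian.
    h_activity = abs(g_left - g_right) + abs(r_left - 2 * r_center + r_right)
    v_activity = abs(g_up - g_down) + abs(r_up - 2 * r_center + r_down)

    # Interpolate along the direction of least edge activity.
    if h_activity <= v_activity:
        return (g_left + g_right) / 2 - (r_left - 2 * r_center + r_right) / 4
    return (g_up + g_down) / 2 - (r_up - 2 * r_center + r_down) / 4
```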

5.4.3 Chrominance Interpolation

Since the HVS is most sensitive to changes in chrominance at the lower spatial frequencies, linear interpolation of chrominance values should be sufficient under most circumstances. As stated in Section 5.4.2, it is important to interpolate chrominances and not simply colors [39]. This can be accomplished by returning to Equation 5.4 and rewriting the model in terms of chrominances as follows:

$$C_R = R - Y, \quad Y = G, \quad C_B = B - Y. \tag{5.10}$$

FIGURE 5.12
Chrominance interpolation neighborhood: (a) colors, (b) chrominances.

Assuming luminance interpolation has been previously performed, each pixel will have an associated luminance (green) value. Therefore, at each red pixel location a chrominance value CR = RG can be computed. Similarly, at each blue pixel location CB = BG can be computed. The resulting pixel neighborhood is given in Figure 5.12b. Missing CR chrominance values are computed using averages of nearest neighbors, for example, C′R2 = (CR1 + CR3) /2, C′R4 = (CR1 + CR7)/2, and C′R5 = (CR1 + CR3 + CR7 + CR9)/4. Missing values for CB are computed in a similar manner. After chrominance interpolation, the chrominance values can be converted back to colors using Equation 5.4. It is possible to combine the color to chrominance and back to color transformations with the chrominance interpolation expressions to produce equivalent chrominance interpolation expressions using just color values. These resulting expressions are given below:

$$R_2 = \frac{R_1 + R_3}{2} - \frac{G_1 - 2G_2 + G_3}{2}, \tag{5.11}$$

$$R_4 = \frac{R_1 + R_7}{2} - \frac{G_1 - 2G_4 + G_7}{2}, \tag{5.12}$$

$$R_5 = \frac{R_1 + R_3 + R_7 + R_9}{4} - \frac{G_1 + G_3 - 4G_5 + G_7 + G_9}{4}. \tag{5.13}$$

These expressions are written in a way that illustrates their differences from simply averaging color values of the same color, for example, averaging only neighboring red values to produce interpolated red values.
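A sketch of the chrominance interpolation for the red channel is shown below: chrominance differences C_R = R − G are formed at red sites, averaged over nearby measured neighbors (mirroring Equations 5.11 through 5.13), and converted back to red with Equation 5.4. The 3 × 3 box average and the scipy dependency are illustrative shortcuts rather than the exact kernels above.

```python
import numpy as np
from scipy.ndimage import convolve

def interpolate_red_plane(red_sparse, green_full, red_mask):
    """Chrominance-based red interpolation.

    red_sparse : 2D array with measured red values at red CFA sites, zero elsewhere.
    green_full : fully interpolated green (luminance) plane.
    red_mask   : boolean array marking red CFA sites.
    """
    # Chrominance differences at measured red sites only.
    cr = np.where(red_mask, red_sparse - green_full, 0.0)
    count = red_mask.astype(float)

    # Average C_R over each 3x3 neighborhood, counting only measured sites.
    box = np.ones((3, 3))
    cr_avg = convolve(cr, box, mode="mirror") / np.maximum(
        convolve(count, box, mode="mirror"), 1.0)

    # Keep measured chrominance where available, then convert back: R = G + C_R.
    cr_full = np.where(red_mask, cr, cr_avg)
    return green_full + cr_full
```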

5.5 Noise Reduction

Noise reduction lessens random signal variation in an image while preserving scene content. The random variation, commonly called noise, diminishes the quality of a reproduced image viewed by a human observer [40]. Since noise reduction usually introduces undesirable side effects, the challenge in designing the noise reduction algorithm is to ensure that the perceptual quality improvement exceeds the quality degradation from side effects.

The primary side effects from noise reduction are perceived as loss of sharpness or loss of texture. Adaptive noise reduction normally preserves high-contrast edges but smooths lower-contrast edges, leading to texture loss. The quality degradation caused by loss of texture is a topic of current research, especially for development of ways to objectively measure image quality from digital cameras [41].

Other side effects can be caused by abrupt transitions in smoothing behavior. Usually, small modulations (small in spatial extent, contrast, or both) are smoothed, while larger modulations are preserved. When a single textured region (such as a textured wall, grass, or foliage) contains both small and large modulations, noise reduction will change the texture greatly if the transition from smoothing to preservation is too abrupt, thus causing switching artifacts. Another potential side effect is contouring, in which a smooth gradation is broken into multiple flat regions with sharp transitions, as if the image had been heavily quantized. This can occur with noise reduction based on averaging values near a neighborhood mean, such as a sigma [42] or bilateral [43] filter, especially with iterative approaches.

In addition to the perceptual complexity of side effects, noise reduction must operate over a large range of conditions in a compact digital camera. When a camera operates at its lowest EI, the signal-to-noise ratio in the raw capture is often four to eight times the signal-to-noise ratio when operating at the maximum EI. Further, noise in a raw image and the visibility of noise in the rendered image vary with signal level. The wide range of signal-to-noise ratios and noise visibility requires noise reduction to adapt over a broad set of conditions to provide a pleasing reproduced image. For example, if a camera operates at a very low EI and the raw image has a good signal-to-noise ratio to begin with, the noise may degrade image quality only slightly. In this case, noise reduction should be tuned to minimize its side effects and ensure retention of maximum scene content.

Overall image quality has been modeled by considering multiple distinct degradations to image quality as perceptually independent visual attributes, such as sharpness, noise, contouring, and others [40]. Because noise reduction produces multiple effects that are not necessarily perceptually independent, this modeling approach will vary in effectiveness, depending on the precise nature of the noise reduction effects. However, the multivariate quality model of [40] has been applied with some success to modeling the overall quality of images from a camera operating over a range of EI values [14].

5.5.1 Contrast Sensitivity Function and Noise Reduction

Since human contrast sensitivity diminishes at higher spatial frequencies, noise reduction primarily reduces higher-frequency modulation and preserves lower-frequency modulation as much as possible. However, the relationship between a contrast sensitivity function (CSF) and optimal noise reduction is indirect for two reasons. First, the human CSF is defined in terms of a contrast sensitivity threshold as a function of angular frequency at the eye. The relationship between spatial frequency in a camera image (cycles/pixel) and angular frequency at the eye depends on the full image processing chain, including resizing, output or display, and viewing conditions. Second, contrast sensitivity functions are usually based on measurement of minimum detectable contrast in a uniform field. The modulation that is due to noise in a compact digital camera is often above the threshold of visibility, and the quality degradation that arises from different levels of supra-threshold noise is more complex than expressed in a simple CSF.
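As a small worked example of the first point, the mapping from image spatial frequency (cycles/pixel) to angular frequency at the eye (cycles/degree) is a one-line calculation once the display pixel pitch and viewing distance are fixed. The sketch below uses the approximation that one degree subtends roughly d·tan(1°) millimeters on the display at a viewing distance d.

```python
import math

def cycles_per_degree(cycles_per_pixel, pixel_pitch_mm, viewing_distance_mm):
    """Convert image spatial frequency (cycles/pixel) to angular frequency
    at the eye (cycles/degree) for a given display pixel pitch and viewing
    distance. One degree at distance d subtends about d*tan(1 deg) mm.
    """
    mm_per_degree = viewing_distance_mm * math.tan(math.radians(1.0))
    cycles_per_mm = cycles_per_pixel / pixel_pitch_mm
    return cycles_per_mm * mm_per_degree

# Example: 0.11 cycles/pixel viewed at 400 mm on a 0.254 mm pitch display
# maps to roughly 3 cycles/degree, near the peak of the luminance CSF.
print(cycles_per_degree(0.11, 0.254, 400.0))
```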

5.5.2 Color Opponent Image Processing

The human visual system sensitivity to luminance modulation is different from its sensitivity to chrominance modulation. The luminance CSF has a bandpass character with peak sensitivity between one and ten cycles/degree. The chrominance contrast sensitivity is also reduced as frequency increases, but the reduction begins at a lower frequency, roughly one third to one half the spatial frequency of the peak of the luminance CSF. Because of this, noise reduction is most effective when operating in a luminance-chrominance color space, treating the luminance and two chrominance channels differently. While a colorimetric uniform color space such as CIELAB is attractive for separating luminance from chrominance, the transformation is somewhat complex. Because noise reduction makes relatively small changes to mean signal values, an approximate transformation is usually adequate. A common approach is a matrix transform similar to one proposed in Reference [22], shown earlier in Equation 5.1. Another very common space is the luma-chroma YCBCR space used for JPEG compression. The primary advantage of using the JPEG YCBCR space for noise reduction is that the data are usually converted to that space for compression. Converting to that color space earlier in the chain and applying noise reduction there saves one or two color conversions in the processing chain.

After converting to a luma-chroma color space, it is normal to apply noise reduction to each of the resulting channels with different techniques. Noise reduction for the luminance channel seeks to attenuate high-frequency noise, rather than eliminate it completely, and seeks to preserve as much sharpness as possible within the cleaning requirements. Noise reduction for the chrominance channels is generally much more aggressive, and the associated degradation of sharpness and spatial resolution is quite acceptable. In practice, since chroma is often subsampled for compression, noise reduction is often applied to the subsampled chroma channels. This saves processing and makes control of lower spatial frequencies easier because the chroma channels are sampled at lower frequency.
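A minimal sketch of this differential treatment is given below. It uses a BT.601-style luma-chroma matrix (an assumption; a camera chain might instead use the transform of Equation 5.1 or the JPEG YCBCR conversion) and plain Gaussian filters standing in for whatever adaptive filters a real chain would employ, with light smoothing of luma and much heavier smoothing of the two chroma channels.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# BT.601-style full-range luma-chroma matrix (offsets omitted for simplicity);
# the exact transform used in a camera chain is an assumption here.
RGB_TO_YCC = np.array([[ 0.299,  0.587,  0.114],
                       [-0.169, -0.331,  0.500],
                       [ 0.500, -0.419, -0.081]])
YCC_TO_RGB = np.linalg.inv(RGB_TO_YCC)

def opponent_noise_reduction(rgb, sigma_luma=0.7, sigma_chroma=3.0):
    """Minimal sketch of color opponent noise reduction: convert an
    (H, W, 3) float RGB image to a luma-chroma space, smooth the luma
    channel lightly and the chroma channels aggressively, and convert back.
    """
    ycc = rgb @ RGB_TO_YCC.T
    ycc[..., 0] = gaussian_filter(ycc[..., 0], sigma_luma)    # gentle on luma
    ycc[..., 1] = gaussian_filter(ycc[..., 1], sigma_chroma)  # aggressive on CB
    ycc[..., 2] = gaussian_filter(ycc[..., 2], sigma_chroma)  # aggressive on CR
    return ycc @ YCC_TO_RGB.T
```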

To illustrate the effect of noise reduction in a digital camera, Figure 5.13 shows example hypothetical noise power spectra (NPS) for rendered images of flat patches with noise reduction applied, along with an example CSF. Noise in a raw digital camera image generally has a relatively flat (white) frequency response, so a non-flat shape for the final NPS is the result of noise reduction (and other operations) in the processing chain. The first curve in Figure 5.13 is a luminance CSF using the model from Reference [44]. For this figure, the CSF has been computed for a viewing distance of 400 mm and a pixel pitch of 0.254 mm, corresponding to viewing an image at full resolution on a typical LCD display, and normalized so the CSF response fills the range of the plot to overlay the NPS curves. The peak CSF response is at 0.11 cycles/pixel for this situation. Increasing the viewing distance would move the peak toward lower frequencies, as would using a higher resolution display.

The application of noise reduction in a luma-chroma color space causes the resulting image’s NPS for the luma channel to be very different from those for the chroma channels. Figure 5.13 shows four sample NPS curves: a sample luma NPS at a low EI (Luma 100), a luma NPS at a high EI (Luma 1600), and corresponding chroma curves (Chroma 100 and Chroma 1600). The CB and CR channels normally have similar NPS characteristics, so this example shows a single chroma NPS curve. All of the NPS plots show a lowpass characteristic, although of variable shape. At low EI, the luma NPS has relatively little reduction and a fairly wide passband. Since small amounts of noise have little impact on overall quality, there is relatively little need to reduce the luma noise lower than shown for EI 100. The chroma NPS shows no significant response past 0.25 cycles/pixel, consistent with subsampled chroma channels. Because of the perceptual degradation from low-frequency chroma noise, the preference is to make the chroma NPS similar to or lower than the luma NPS.

FIGURE 5.13
Example noise power spectrum for a still image.

At high EI, the luma NPS is much more aggressively filtered. The noise in the raw image at EI 1600 is usually roughly four times the noise at EI 100, yet after noise reduction, it is less than three times the noise at EI 100. Further, the relatively flat NPS from white noise in the raw image has been modified much more than for EI 100. While attenuation is still fairly linear with frequency, there is more attenuation of higher frequencies and a tendency to be concave upward. The chroma NPS is much more aggressively filtered than for EI 100, deliberately sacrificing chrominance detail in order to reduce chrominance noise. At the higher EI, the quality degradation that is due to noise is substantial, so trading off sharpness and texture to limit the noise provides better overall quality. Modeling of the relationship between NPS and human perception is a topic of current research [45].

NPS analysis for a frame of a video shows some similar characteristics; noise reduction is more aggressive at higher frequencies and at a higher EI. In compact digital cameras, there is typically a greater emphasis on limiting the noise in the video than on preserving sharpness, partially driven by the greater compression applied to videos. As noise is difficult to compress and tends to trigger compression artifacts, there is a greater emphasis on avoiding these noise-induced artifacts than on preserving scene detail.

Another important feature illustrated in Figure 5.13 is the use of masking. Allowing luma noise power to be greater than chroma noise power tends to mask the chroma noise with the luminance modulation, making the chroma noise less visible. In addition, the development of noise reduction algorithms also depends on use of masking from scene content. Noise reduction algorithms tend to smooth more in areas with little scene modulation and preserve more modulation in regions with greater local contrast. The challenge is to discriminate as well as possible between noise and local contrast due to scene content.

FIGURE 5.14 (See color insert.)
Example images with: (a) noise, (b) light RGB smoothing, (c) aggressive RGB smoothing, and (d) simple color opponent smoothing.

An example illustrating some of the power of color opponent processing for noise reduction is shown in Figure 5.14. Namely, Figure 5.14a is an example noisy image, such as might come from a capture at a high EI, processed through demosaicking and white balance. The raw captured image was very noisy with a fairly flat NPS, and the early processing steps have reduced some noise, especially at higher frequencies, while leaving a fair bit of lower-frequency noise. Figure 5.14b illustrates a light smoothing (Gaussian blur) applied in RGB color space. There is modest sharpness loss, but also modest reduction in noise, especially in the lower frequencies. Low-frequency variations in color remain in areas of low scene texture, such as the blue wall, the face, and green background. Figure 5.14c illustrates a more aggressive smoothing (Gaussian blur) applied in RGB. The noise is significantly reduced, but the loss in sharpness degrades the quality substantially. Figure 5.14d illustrates light smoothing applied to a luma channel and more aggressive smoothing applied to both chroma channels. The luma channel smoothing matches that for Figure 5.14b and the chroma channel smoothing matches that for Figure 5.14c. The overall quality is significantly better than the other three figures, because the reduction in chroma noise is achieved without a similar reduction in sharpness.

5.5.2.1 Noise Probability Distributions

The probability distribution for noise has a great impact on noise reduction. The dominant noise in a digital camera raw image is usually shot noise with a Poisson distribution, plus an additive read noise with a Gaussian distribution. The read noise is signal independent, but the shot noise is dependent on signal level. Further, a small number of pixels in the sensor are abnormal, casually referred to as defective pixels. Defective pixels act like impulse noise, but have highly variable responses. Rather than two or three clear peaks (normal, dark defects, bright defects), a histogram of a low-light, flat-field image capture is more like a mixture of Gaussians, with a main peak and an extremely wide tail.
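The raw noise model described above, signal-dependent Poisson shot noise plus signal-independent Gaussian read noise, can be simulated directly, which is often convenient when tuning noise reduction. The following sketch is illustrative only; the default read noise level is an assumed value.

```python
import numpy as np

def simulate_raw_noise(signal_e, read_noise_e=3.0, rng=None):
    """Minimal sketch of the raw noise model: Poisson shot noise that scales
    with signal level plus signal-independent Gaussian read noise.
    `signal_e` is an array of expected signal levels in electrons per pixel;
    the read noise standard deviation is an illustrative assumption.
    """
    rng = np.random.default_rng() if rng is None else rng
    shot = rng.poisson(signal_e).astype(float)              # signal dependent
    read = rng.normal(0.0, read_noise_e, signal_e.shape)    # signal independent
    return shot + read
```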

One treatment for defective pixels and other contributors that spread out the tails of the noise distribution is to apply an impulse filter before other operations, preventing spread of impulse noise into other pixels during smoothing. The challenge in performing impulse filtering is to avoid filtering scene detail unnecessarily. For example, small light sources, and their reflections in eyes, are easily misidentified as impulses. Despite their similarity to impulse noise, they have exceptional perceptual importance.
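A minimal sketch of such a conservative impulse filter is given below: a pixel is replaced by its 3 × 3 median only when it deviates from that median by more than a threshold, so most genuine scene detail is left untouched. The fixed threshold is an illustrative assumption; a practical implementation would scale it with the expected noise at each signal level.

```python
import numpy as np
from scipy.ndimage import median_filter

def impulse_filter(channel, threshold=50.0):
    """Conservative impulse (defective pixel) filter sketch: replace a pixel
    with its 3x3 median only when it differs from that median by more than
    `threshold`, leaving small genuine features such as point light sources
    largely untouched. The threshold value is an illustrative assumption.
    """
    med = median_filter(channel, size=3)
    outliers = np.abs(channel - med) > threshold
    return np.where(outliers, med, channel)
```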

5.6 Color Rendering

Nearly all practical imaging systems have a small number of band-limited light detectors and emitters. The shape of the detectors’ spectral efficiencies and the emitters’ radiance characteristics determine the intrinsic color reproduction of the imaging system, and this reproduction is almost never accurate or visually pleasing without signal processing. Additionally, there may be factors present when viewing a reproduction that alter its appearance and that may not have been present when capturing the corresponding original scene. The signal processing that modifies an image to account for the differences between the HVS and the imaging system and to account for the viewing environment is called color rendering. Knowledge of HVS color perception and a mathematical specification of color stimuli help to define the signal processing needed for rendering. These two topics are described in the following two subsections, prior to a discussion of the specific steps to render colors of an image captured with a camera and displayed on a monitor.

5.6.1 Color Perception

In the seventeenth century, it was discovered that a color is a percept induced by light composed of one or more single-wavelength components [46]. In the early nineteenth century, the hypothesis was that there are three retinal light sensors in the human eye [47], even though the sensing mechanism was unknown. Later in the nineteenth century, it was determined that wavelengths that are usually associated with red, green, and blue compose a minimal set of primary wavelengths that can be mixed to induce the colors associated with the rest of the wavelength spectrum [48], [49]. It is now known that there are two types of light-sensitive cells in the human eye, called rods and cones because of their shapes. Rods and cones derive their light sensitivity from the bleaching of chromoproteins by light, and the rods have higher sensitivity than the cones [50], [51], [52]. Signals from the rods significantly contribute to human vision only under low illumination levels and do not contribute to color vision. The signals that contribute to color vision are produced by three types of cones, which are called rho, gamma, and beta cones. Each type of cone has a different chromoprotein, giving the rho, gamma, and beta cones spectral passbands with maximum sensitivities near 580 nm, 550 nm, and 445 nm, respectively. The three passbands correspond to light wavelength bands that are usually associated with the colors red, green, and blue, and therefore the light-sensitive cone cells are the biological cause of the early findings presented in References [48] and [49].

The signals created by the rods and cones during photo-stimulation are electrical nerve impulse trains, and the four impulse trains (one from the rods and three from the cones) are transformed into one achromatic signal and two chromatic signals. In the late nineteenth century, it was proposed that the chromatic signals represent a color-opponent space [21]. In the modern view, the achromatic signal is also included in the color-opponent space, making it three dimensional, and the most basic model specifies that each axis of the space represents a measure of two opponent, mutually exclusive sensations: redness and greenness, yellowness and blueness, and lightness and darkness. The basic model explains why humans do not perceive any color as being, for example, simultaneously greenish and reddish but may perceive a color as being, say, simultaneously greenish and bluish. The visual signals are transmitted through the optic nerve to the brain, and several effects originate during the perception of color. The effects are psychological and psychophysical in nature and three such effects that are relevant to color rendering are general brightness adaptation, lateral brightness adaptation, and chromatic adaptation.

General brightness adaptation of the visual system is an adjustment that occurs in response to the overall luminance of the observed scene. The overall sensitivity of the visual system increases or decreases as the overall luminance of the observed scene decreases or increases, respectively. An emergent effect of this type of adaptation is that a scene is perceived to have lower contrast and to be less colorful when viewed under a lower overall luminance than when viewed under a higher overall luminance. Another effect that arises is related to changes in instantaneous dynamic range. Even though the total dynamic range of the visual system is extremely high (on the order of 10¹⁰), at any given general brightness adaptation level the instantaneous dynamic range is much smaller (and varies, on the order of 10 for a very dark level and up to 10³ under full daylight) [53].

Lateral brightness adaptation is a change in the sensitivity of a retinal area as influenced by the signals from neighboring retinal areas. This type of adaptation is very complex and yields several visual effects, including Mach bands [54] and the Bezold-Brücke effect [55]. Another lateral brightness adaptation effect is simultaneous contrast, whereby the lightness, hue, and colorfulness of a region [56] depend on the color of its surround. Simultaneous contrast is the origin of the dark-surround effect, which occurs when an image is perceived as having lower contrast if viewed with a dark surround than if viewed with a light surround [57].

Chromatic adaptation is a reaction of the visual response to the overall chromaticity of the observed scene and the chromaticity of a stimulus that is perceived as neutral is called the chromatic adaptation point. This type of adaptation allows the visual system to recognize an object under a range of illumination conditions. For example, snow is perceived as white under daylight and incandescent lamps even though the latter source lacks power in the shorter wavelengths and therefore yields a radically different visual spectral stimulus than daylight.

For a more detailed description of the physiology, properties, and modeling of human vision, see References [32], [56], and [58].

5.6.2 Colorimetry

Human vision is an astonishingly complex mechanism of perception and a complete theoretical description has not yet been developed. However, the science of colorimetry can be used to predict whether two color stimuli will perceptually match [59], [60], [61]. Colorimetry is based on color-matching experiments, where an observer views test and reference stimuli and the reference stimulus is adjusted until the observer perceptually matches the two stimuli. The reference stimulus is composed of three independent light sources, which are called the color primaries. The observers adjust the intensity of each color primary to achieve the perceptual match. When a perceptual match to the test stimulus is achieved, the primary color intensities are recorded and are called the tristimulus values for the test color. If a color-matching experiment is carried out for all test colors composed of a single wavelength, the set of tristimulus values forms a set of three curves that are called color-matching functions (CMFs). Each set of color primaries will have an associated CMF set, and any CMF set may be converted with a linear transform into another CMF set having different color primaries [62].

Besides depending on the color primaries, CMFs can also vary because of differences in the eye responses among individuals, so the Commission Internationale de l’Eclairage (CIE) defined standard observers for unambiguous tristimulus specification [63]. The CIE XYZ tristimulus values for any arbitrary stimulus can be calculated as

$$X = k \int_{\lambda_1}^{\lambda_2} S(\lambda)\,\bar{x}(\lambda)\,d\lambda, \tag{5.14}$$

$$Y = k \int_{\lambda_1}^{\lambda_2} S(\lambda)\,\bar{y}(\lambda)\,d\lambda, \tag{5.15}$$

$$Z = k \int_{\lambda_1}^{\lambda_2} S(\lambda)\,\bar{z}(\lambda)\,d\lambda, \tag{5.16}$$

where λ is the wavelength variable, the interval [λ1, λ2] defines the visible wavelength range, S(λ) is the spectral power distribution of the stimulus in units of watts per nanometer, and x̄, ȳ, and z̄ are the CIE standard observer CMFs. The normalizing constant k is sometimes chosen such that Y is equal to 1 when S(λ) is equal to the spectral power distribution of the illuminant, yielding luminance factor values. When this type of normalization is used, the tristimulus values specify relative colorimetry. The normalizing constant k can also be chosen such that Y yields luminance values in units of candelas per square meter (cd/m²), in which case the tristimulus values specify absolute colorimetry [58].

The CIE Y value, or luminance, correlates well with the perception of brightness. The magnitudes of the CIE X and Z values do not correlate well with any perceptual attribute, but the relative magnitudes of the CIE XYZ tristimulus values are correlated with hue and colorfulness [56]. For these reasons, the CIE defined relative tristimulus values, called xyz chromaticities, as

$$x = \frac{X}{X+Y+Z}, \qquad y = \frac{Y}{X+Y+Z}, \qquad z = \frac{Z}{X+Y+Z}. \tag{5.17}$$
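When the spectra and color-matching functions are tabulated on a common wavelength grid, Equations 5.14 through 5.17 reduce to simple numerical sums, as the following sketch illustrates. The normalization shown sets Y = 1 for the illuminant, giving relative colorimetry; a uniform wavelength spacing is assumed.

```python
import numpy as np

def xyz_and_chromaticity(wavelengths, stimulus, illuminant, xbar, ybar, zbar):
    """Minimal sketch of Equations 5.14-5.17 via numerical integration.
    All inputs are 1D arrays sampled on the same, uniformly spaced
    wavelength grid (nm). The constant k is chosen so that Y = 1 for the
    illuminant, yielding luminance factor values (relative colorimetry).
    """
    dlam = wavelengths[1] - wavelengths[0]          # uniform grid assumed
    k = 1.0 / np.sum(illuminant * ybar * dlam)
    X = k * np.sum(stimulus * xbar * dlam)
    Y = k * np.sum(stimulus * ybar * dlam)
    Z = k * np.sum(stimulus * zbar * dlam)
    s = X + Y + Z
    x, y, z = X / s, Y / s, Z / s                   # Equation 5.17
    return (X, Y, Z), (x, y, z)
```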

5.6.3 Color Rendering Using Relative Colorimetric Matching

One common imaging task is to display on a monitor an image that was captured with a digital camera. It is usually required to display a realistic, visually pleasing reproduction of the original scene. A reasonable strategy, but one that is not preferred as will be discussed later, is to apply a color rendering that achieves a relative colorimetric match between the displayed reproduction and the original scene. Relative rather than absolute colorimetry is used because the dynamic range and the gamut of many original scenes are greater than the monitor’s dynamic range and gamut, and in those cases an absolute match cannot be achieved.

A monitor display is an additive color reproduction device that typically has three color primaries. Three control code values are input to the device to produce a color stimulus composed of the mixture of the three color primary intensities. The dynamic range of a cathode-ray tube (CRT) monitor in a darkened viewing environment is typically on the order of 10³ and that of a liquid-crystal display (LCD) is on the order of 10², although when viewed with environmental flare an LCD monitor may actually have a larger dynamic range than a CRT monitor [64]. CRTs were the most popular monitor devices until recently but now LCDs dominate the monitor market. The neutral tone scale characteristic of many CRTs can be approximated by a simple power-law relationship:

$$Y_m = k_m \left(\frac{V}{V_{\max}}\right)^{\gamma} + Y_{m0}, \tag{5.18}$$

where Ym is the CIE luminance factor produced by the monitor, V is the input control code value, Vmax is the maximum input code value, Ym0 is the CIE luminance factor when V = 0, and km is a normalizing constant chosen such that Ym = 1 when V = Vmax. The relation between control code value and CIE luminance factor for LCD monitors is more complicated than for CRTs, but it can be made to follow the same power-law relationship as shown above with some signal processing between the input and the output [64].

A digital color camera is equipped with a light sensor and, most often, with an array of three types of light receptors (RGB or CMY) whose responses are linear with respect to irradiance. The dynamic range of a sensor depends on the receptors’ electron-gathering capacity [65], but it is common for a sensor to have a dynamic range of at least 10³, which is sufficient to cover the instantaneous dynamic range of the human eye. Sensor spectral responsivities typically form approximations to a set of color-matching functions. If the approximation is good, a sensor’s responses can be multiplied by a 3 × 3 matrix to obtain good approximations to CIE XYZ tristimulus values.

Suppose that an image is captured with a digital color camera that has spectral responsivities that are good approximations to color-matching functions and that the camera responses have been transformed to CIE XYZ tristimulus values. For neutral stimuli (that is, all stimuli that have the chromaticity of the chromatic adaptation point), the chromaticity ratios x/y and z/y are constant and therefore the tristimulus ratios X/Y and Z/Y are also constant. Thus, it is possible to gain the camera channels such that neutral stimuli produce the same response in all three channels as described in Section 5.3.3. Such channel balancing is assumed, as well as that the display monitor is calibrated such that the chromaticity of its output is the same for any triplet of equal input code values. Further suppose that the goal is to display the image such that a colorimetric, rather than a perceptual, match is achieved between the original scene and the monitor reproduction. For simplicity, it is assumed that the camera signals do not saturate, that all of the camera responses are within the monitor’s gamut, that the observer is chromatically adapted to CIE illuminant D65 when viewing the original scene and the monitor reproduction, and that the monitor white point has the chromaticity of CIE illuminant D65.

The camera’s output CIE luminance factor can be transformed with the inverse of the monitor neutral tone scale characteristic:

$$V = V_{\max} \left(\frac{Y_c - Y_{m0}}{k_m}\right)^{1/\gamma}, \tag{5.19}$$

where Yc is the CIE luminance factor measured by the camera. This type of transformation, which accounts for the display nonlinearity, is usually called gamma correction. The code value V is used as input to the monitor’s three channels so that the output luminance factor of the monitor is Ym = Yc (and, by monitor calibration, the output chromaticity is the same as that of CIE illuminant D65). In short, the relative colorimetric matching for the neutral tone scale is achieved with the transformation of camera responses to CIE XYZ approximations, camera channel balancing, monitor calibration, and gamma correction. Assuming a colorimetrically matched neutral tone scale, the relative colorimetric matching of non-neutral stimuli is described next.

A set of CMFs for the monitor may be derived by performing a color-matching experiment using the monitor to produce the reference stimuli. Once the display CMFs are obtained, relative colorimetric matching of non-neutral stimuli would be achieved if the camera responsivities were proportional to the display CMFs because the camera would directly measure how much of each monitor primary is present in a given color stimulus of an original scene. Such proportionality is impossible for physically realizable cameras and monitors, but the matching can be achieved with additional signal processing [66]. The balanced camera responses may be transformed with a 3 × 3 matrix to tristimulus values associated with the monitor color primaries [62]. This transformation is usually called color correction. Once color correction is accomplished, the new tristimulus values are gamma-corrected and applied to the monitor to achieve relative colorimetric matching between all of the stimuli in the original scene and the monitor reproduction.
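A minimal sketch of this rendering path, a 3 × 3 color correction followed by gamma correction with the inverse monitor tone scale of Equation 5.19, is given below. The color matrix, gamma, and tone scale parameters are placeholders that would come from camera and display characterization, and the maximum code value is normalized to one.

```python
import numpy as np

def render_for_display(balanced_rgb, color_matrix, gamma=2.2, y_m0=0.0, k_m=1.0):
    """Sketch of relative colorimetric rendering for a monitor: apply a 3x3
    color correction matrix to the white-balanced camera responses, then
    gamma-correct each channel with the inverse monitor tone scale
    (Equation 5.19, with Vmax normalized to 1). `balanced_rgb` is an
    (H, W, 3) array scaled to [0, 1]; all parameters are placeholders.
    """
    corrected = balanced_rgb @ color_matrix.T              # color correction
    corrected = np.clip((corrected - y_m0) / k_m, 0.0, 1.0)
    code_values = corrected ** (1.0 / gamma)               # gamma correction
    return code_values
```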

Figure 5.15a shows how an image as captured with a camera with no signal processing approximately appears when displayed on a monitor. The camera signals are linear with sensor irradiance, no gamma correction is applied, and there is a discrepancy between the camera responsivities and the monitor CMFs. Under these circumstances most colors are reproduced too dark with incorrect chromaticities. When that camera image is gamma-corrected before being displayed, the colors will have approximately the correct luminance factors but the chromaticities will remain incorrect as illustrated in Figure 5.15b. If color correction is performed before gamma correction, then the chromaticities will also be correct. Figure 5.15c shows an illustration of an image that has been processed with color and gamma corrections and thus simulates a perfect relative colorimetric match to the original scene. Yet, this image still looks like it could be improved – its luminance contrast and colorfulness seem low. These two attributes are perceived to be low because the observer’s general brightness adaptation level corresponds to the display’s overall luminance level, which is much lower than the level of the original scene. Colorimetric matching is evidently not sufficient to achieve perceptual matching. While the system has accounted for chromatic adaptation (by balancing the camera channels for neutral stimuli and setting the monitor white point to have the chromaticity of CIE illuminant D65), no other adaptation effects have been considered in the system setup or the signal processing. Moreover, up to this point the viewing environment has been assumed to be flareless. If the colorimetrically matched image of Figure 5.15c is displayed in a viewing environment with flare, then the darker colors will look less dark as illustrated in Figure 5.15d. Viewing flare can have many forms, but here it is assumed that it is light with the same chromaticity as the monitor white point and that it has a uniformly distributed intensity over the monitor face.

FIGURE 5.15 (See color insert.)
A series of images illustrating how an image approximately appears when displayed on a monitor at different steps in the color rendering, where (a–c) simulate images viewed within a flareless viewing environment and images (d–f) simulate images viewed with flare. (a) The original linear camera image, (b) after gamma correction but not color corrected, (c) after color correction and gamma correction, (d) after color correction and gamma correction (with flare), (e) with boosted RGB scales, and (f) after flare correction.

5.6.4 Color Rendering with Adaptation and Viewing Flare Considerations

A way to simultaneously account for the low luminance contrast and low colorfulness that is perceived as a result of an observer’s low general brightness adaptation level is to boost all three image channels. The boost can be achieved by gaining the color-corrected tristimulus values just before sending the signal to the display. The same gain is applied to all three channels to maintain the chromaticities of all colors. Figure 5.15e shows how the image in Figure 5.15d would approximately appear if the three channels are gained. The overall luminance contrast and the colorfulness have been increased, but the darker colors appear to not be dark enough because of the viewing flare. Since it is assumed that the viewing flare is uniform across the monitor, to correct for the flare it is sufficient to subtract the boosted tristimulus values of the flare from the boosted tristimulus values of the scene. The channels are gained again such that the white point remains the same before and after the flare tristimulus subtractions. Figure 5.15f illustrates the results of applying the flare correction to the image from Figure 5.15e, and it can be seen that while lighter colors appear about the same in both figures, the darker colors appear even darker after the correction. It is this last image that appears to be a perceptual match to the original scene.
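The boost and flare correction just described can be sketched as follows, applied to color-corrected tristimulus values scaled so that the scene white is 1.0. The boost gain and flare level are illustrative assumptions; in practice they would be derived from the viewing conditions and preference studies.

```python
import numpy as np

def adapt_and_deflare(xyz, boost=1.15, flare=0.01):
    """Sketch of the adaptation boost and flare correction: gain all three
    channels equally to restore perceived contrast and colorfulness, subtract
    the boosted flare (assumed uniform with the chromaticity of the display
    white), then re-gain so the white point is unchanged by the subtraction.
    `boost` and `flare` (as a fraction of white) are illustrative values.
    """
    boosted = boost * xyz                  # raise contrast and colorfulness
    deflared = boosted - boost * flare     # subtract the boosted flare
    deflared /= (1.0 - flare)              # re-gain so white stays at `boost`
    return np.clip(deflared, 0.0, None)
```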

5.7 Edge Enhancement

During image capture, edge detail is lost through optical and other effects that reduce the transfer of modulation from the scene to the reproduced image. Edge enhancement, also known as sharpening, is used to boost edge contrast, making the image appear sharper and partially compensating for the losses in the modulation transfer function (MTF). There are several main perceptual effects from edge enhancement: increased sharpness, increased noise, and spatial artifacts such as over-sharpening or ringing, visible as halos at high-contrast edges. Just as noise reduction is intended to reduce noise while preserving sharpness, edge enhancement is intended to increase sharpness while avoiding noise amplification and other artifacts. When applying edge enhancement, it is important that the quality improvement from increased sharpness is greater than the quality loss from increased noise or other artifacts.

5.7.1 Human Visual Sharpness Response

The human visual sharpness response depends more strongly on low and intermediate spatial frequencies in a viewed image. An early publication investigated sharpness in photographic images and concluded that both acutance and resolving power contributed to a prediction of sharpness [67]. Since then, further work relating system MTF and sharpness-related image quality included the development of an optical merit function based on camera or system MTF and the contrast sensitivity of the eye [68]. A more recent publication describing models of visual response for analysis of image processing algorithms included a conveniently parameterized model for a two-dimensional CSF [44].

Perception of sharpness in a reproduced image depends on viewing distance, the resolution of the reproduced image, and the MTF of the entire reproduction chain from camera lens through print or display. Figure 5.16 shows an example camera MTF (including lens, sensor, and processing effects) along with a CSF from Reference [44].

The frequency axis for the CSF is scaled for viewing the image at 100 pixels per inch on a display at a viewing distance of 20 inches, a common viewing distance for a computer display. The peak in the human CSF aligns fairly well with the peak of the camera MTF, which is desirable for a well-tuned image chain. The perceived improvement from edge enhancement is maximized if the edge enhancement has the greatest effect on frequencies with the greatest response in the human CSF.

During capture, viewing conditions for the reproduced image are generally unknown and variable, since users view digital images in a variety of ways on a variety of devices. Because the viewing conditions are variable, digital camera edge enhancement is usually optimized for a selected, somewhat conservative, viewing condition. One common condition is viewing of the image on a high-resolution display at one camera pixel per display pixel and a viewing distance of twice the image height, cropping the image to fit the display. Another viewing condition, with images resized to fill a large display and viewing at a distance of 34 inches, was used in development of a quality ruler for subjective image evaluation [69], [70], while other options include viewing prints of various sizes.

FIGURE 5.16
Example spatial frequency response with CSF overlay.

5.7.2 Linear Edge Enhancement

Routine edge enhancement is based on a convolution operation to obtain an edge map, which is scaled and added to the original image, as follows:

$$A' = A + kE, \tag{5.20}$$

where A′ is the enhanced image, A is the image from the previous processing stage, k is a scalar gain, and E is the edge map. There are several possible ways to create E, one of which is an unsharp mask:

$$E = A - A * b, \tag{5.21}$$

where b is a lowpass convolution kernel and * refers to a convolution. A second method is direct convolution:

$$E = A * h, \tag{5.22}$$

where h is a highpass or bandpass convolution kernel. The design of the convolution kernel, either b or h, and the choice of k, are the main tuning parameters in linear edge enhancement. The kernel controls which spatial frequencies to enhance, while k controls the magnitude of the enhancement. Often, the kernel is designed to produce a bandpass edge map, providing limited gain or even zero gain at the highest spatial frequencies.

Consistency of edge response in the enhanced image is an important consideration in kernel design. If the camera has an anisotropic optical transfer function (OTF), such as caused by an anti-aliasing filter that blurs only in one direction, the kernel may be designed to partially compensate for the anisotropy. If the camera OTF is isotropic, a rotationally symmetric kernel provides equal enhancement for edges of all orientations, preserving that consistency.
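A minimal sketch of unsharp-mask edge enhancement (Equations 5.20 and 5.21) applied to a luma channel is given below, with a Gaussian standing in for the lowpass kernel b. The gain and blur radius are illustrative tuning values.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def unsharp_mask_enhance(luma, k=1.0, sigma=1.5):
    """Linear edge enhancement via an unsharp mask: the edge map is the
    difference between the image and a lowpass-blurred copy, scaled by a
    gain k and added back. The gain and blur radius are illustrative and
    are the main tuning parameters.
    """
    lowpass = gaussian_filter(luma, sigma)   # A * b
    edge_map = luma - lowpass                # E = A - A * b
    return luma + k * edge_map               # A' = A + kE
```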

FIGURE 5.17
Soft thresholding and edge limiting edge enhancement nonlinearity.

Equations 5.21 and 5.22 do not specify the color metric for A, such as linear camera RGB, rendered image RGB, or another color space. If the loss in system MTF is primarily in the camera optical system, then performing enhancement in a metric linear with relative exposure will allow for the most precise correction. However, MTF losses occur in multiple steps in the image processing chain, several of them not linear with the original exposure. Experience has shown that edge enhancement in a nonlinear space (for example, after tone rendering and gamma correction) provides a more perceptually even enhancement than enhancement applied in a metric linear with exposure.

5.7.3 Nonlinear Edge Enhancement

Even with a well-designed bandpass kernel, the formulation in Section 5.7.2 amplifies noise and can produce halo artifacts at high-contrast edges. While limiting the edge enhancement is possible, better perceptual results can be achieved by applying a nonlinearity to the edge map before scaling and adding it to the original image.

Figure 5.17 shows an example of a nonlinearity to be applied to values in the edge map E with a simple lookup table operation, E′ = f(E). Small edge map values are most likely to be the result of noise, while larger edge map values are likely to come from scene edges. The soft thresholding function shown in Figure 5.17 reduces noise amplification by reducing the magnitude of all edge values by a constant, and is widely used for noise reduction, such as in Reference [71]. Soft thresholding eliminates edge enhancement for small modulations, while continuing to enhance larger modulations. The nonlinearity shown in Figure 5.17 also reduces halo artifacts by limiting the largest edge values, since high-contrast edges are the most likely to exhibit halo artifacts after edge enhancement.
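The soft thresholding and edge limiting behavior of Figure 5.17 can be written compactly as shown below. The threshold and limit values are illustrative assumptions; in practice they would be tuned against the expected noise level and the visibility of halo artifacts.

```python
import numpy as np

def edge_nonlinearity(edge_map, threshold=4.0, limit=32.0):
    """Soft thresholding plus edge limiting, applied to the edge map before
    it is scaled and added back: shrink every edge value toward zero by a
    constant (suppressing noise-sized modulation) and clip the largest
    values (limiting halo artifacts). The threshold and limit are
    illustrative values expressed in code values.
    """
    shrunk = np.sign(edge_map) * np.maximum(np.abs(edge_map) - threshold, 0.0)
    return np.clip(shrunk, -limit, limit)
```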

Figure 5.18 illustrates some of the effects discussed here. Namely, Figure 5.18a shows an image without edge enhancement applied. There is measurable noise in the image, but it does not detract substantially from the overall quality of the image under the intended viewing conditions. Figure 5.18b shows the same image with a simple linear convolution edge enhancement, as described in Section 5.7.2. The image is much sharper, with higher edge contrast, but also much higher noise and halo artifacts, particularly obvious in the text in the cookbook. Figure 5.18c shows the same image with an edge-limiting soft thresholding step used in edge enhancement. The sharpness of the image is only slightly less than the sharpness of Figure 5.18b, but the noise and halo artifacts are substantially reduced.

FIGURE 5.18 (See color insert.)
Example images with: (a) no edge enhancement, (b) linear edge enhancement, and (c) soft thresholding and edge-limited enhancement.

Application of edge enhancement in an RGB color space tends to amplify colored edges caused by chromatic aberration or other artifacts earlier in the capture chain, as well as colored noise. Because of this, edge enhancement is often applied to the luma channel of the image after converting to a luma-chroma color space. The specific luma-chroma color space is often the JPEG YCBCR space, chosen for the same reasons as for noise reduction.

5.8 Compression

A digital image that has been captured and processed as described in the previous sections of this chapter contains both mathematically redundant and perceptually insignificant data. An image compression step can be included in the overall image processing chain to produce a more efficient representation of the image, in which the redundant and perceptually insignificant data have been removed. This step of image compression reduces the amount of physical storage space required to represent the image. This also increases the number of images that can be stored in a fixed-size storage space and allows faster transmission of individual images from a digital camera to other devices.

Image compression can be divided into two categories: lossy and lossless. Lossless image compression is reversible, meaning that the exact original image data can be recovered from the compressed image data. This characteristic limits the amount of compression that is possible. Lossless image compression focuses on removing redundancy in an image representation. Images commonly have both spatial redundancy and spectral redundancy.

In addition to removing redundancy, lossy image compression uses knowledge of the HVS to discard data that are perceptually insignificant. This typically allows a much greater compression ratio to be achieved than with lossless image compression, while still allowing an image to be reconstructed from the compressed data with little or no visual degradation. When no visual degradation occurs, the compression is referred to as visually lossless.

During (or even before) compression, images are often converted from an RGB color space to a luminance-chrominance color space, such as YCBCR [72]. This conversion serves multiple purposes. First, it reduces the spectral redundancy among the image channels, allowing them to be independently encoded more efficiently [73]. Second, it transforms the image data into a space in which the characteristics of the HVS can be more easily exploited. The HVS is generally more sensitive to luminance information than chrominance information, in particular at higher spatial frequencies [74]. The chrominance channels can be compressed more aggressively than can the luminance channel without visual degradation. One way this is achieved is by spatially subsampling the chrominance channels prior to encoding.

Many lossy compression schemes transform the pixel data into the frequency domain. Digital cameras predominantly use the JPEG lossy image compression standard [72]. JPEG employs a two-dimensional discrete cosine transform (DCT) to convert pixel values into spatial frequency coefficients. For natural imagery containing mostly low-frequency information, this transformation compacts most of the signal energy into a small fraction of the transform coefficients, which is advantageous for compression. Representing images in the frequency domain is also beneficial for optimizing compression with respect to the HVS.

It is known that the HVS has varying sensitivity to different spatial frequencies [74]. In particular, the HVS has decreasing sensitivity at higher spatial frequencies. The HVS also has varying sensitivity according to the orientation of the spatial frequency information, being less sensitive at a diagonal orientation than at horizontal or vertical orientations.

In JPEG image compression, the DCT is performed on 8 × 8 blocks of image data. This produces a collection of 64 frequency coefficients, varying in horizontal and vertical spatial frequency. Each of these coefficients is quantized. This is the lossy step within JPEG compression, in which information is discarded. Quantization involves dividing each frequency coefficient by an integer quantization value, and rounding the result to the nearest integer. Reducing the accuracy by which the frequency coefficients are represented allows them to be more efficiently compressed. Increased quantization generally results in increased compression. On the other hand, the greater the quantization value applied to a given frequency coefficient, the more uncertainty and hence greater expected error there is when recovering the frequency coefficient from its quantized representation during decoding.
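The transform and quantization steps just described can be sketched in a few lines. The code below builds an orthonormal DCT-II matrix, applies the separable two-dimensional DCT to an 8 × 8 block of level-shifted pixel values, and quantizes the coefficients with a table such as the default luminance table of Table 5.1. It illustrates the principle only and is not a standard-compliant JPEG encoder.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix."""
    k = np.arange(n).reshape(-1, 1)
    m = np.arange(n).reshape(1, -1)
    c = np.cos(np.pi * (2 * m + 1) * k / (2 * n)) * np.sqrt(2.0 / n)
    c[0, :] = np.sqrt(1.0 / n)
    return c

def jpeg_quantize_block(block, q_table):
    """Lossy step sketch: 2D DCT of an 8x8 block of level-shifted pixel
    values, then divide each frequency coefficient by the corresponding
    quantization table entry and round to the nearest integer.
    """
    C = dct_matrix(8)
    coeffs = C @ block @ C.T               # separable 2D DCT
    return np.round(coeffs / q_table)      # quantization (information lost here)

def jpeg_dequantize_block(q_coeffs, q_table):
    """Approximate reconstruction: rescale the quantized coefficients and
    apply the inverse 2D DCT."""
    C = dct_matrix(8)
    return C.T @ (q_coeffs * q_table) @ C
```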

TABLE 5.1
Default JPEG luminance quantization table.

The characteristics of the HVS can be applied to this quantization process, exploiting the varying sensitivity of the HVS to different spatial frequencies and orientations, as well as to luminance and chrominance information. The contrast sensitivity function is used to indicate the base detection threshold at which a spatial frequency just becomes visible under certain viewing conditions [74]. These detection thresholds can be used to design a quantization table (also referred to as a quantization matrix) that allows maximum quantization of frequency coefficients while retaining a visually lossless representation of the information. Table 5.1 lists the default luminance quantization table associated with the JPEG image compression standard, designed to be at the threshold of visibility under specific viewing conditions [75]. The quantization values generally increase as spatial frequency increases (from left to right, and from top to bottom), corresponding to the decreased sensitivity of the HVS at high spatial frequencies. Also, there are relatively larger quantization values for spatial frequencies at 45° orientations (on the diagonal of the matrix), indicative of the decreased sensitivity of the HVS to diagonal frequency content.

In the JPEG compression standard, the quantization process is content independent. For a given color channel, a single quantization table is used throughout the entire image. Because of this, JPEG-based compression is unable to directly exploit the HVS characteristic of contrast masking. Contrast masking refers to the change in visibility of a target signal due to the presence of a background signal [74]. As the contrast of the masking background signal increases, it can be more difficult to detect the target signal. Contrast masking can be observed, for example, when random noise is added to an image. The noise is more visible in smooth regions with low contrast than in regions of high contrast and texture.

Contrast masking applies to compression. Studies have suggested three perceptually significant activity regions in an image: smooth, edge, and texture [76]. Because of the masking properties of the HVS, high-contrast textured image regions can tolerate more error without visual degradation than smooth regions of low contrast or isolated sharp edges in an image. In JPEG, high-contrast textured image regions can tolerate greater quantization of frequency coefficients than other regions. However, since the quantization tables used in JPEG are constant throughout the image, a compromise must be made. A conservative level of quantization can be applied throughout the image to avoid visual degradation in the most sensitive areas. This comes at a cost of poor compression efficiency, particularly in regions of texture where extra bits are wasted encoding perceptually insignificant information.

One method to exploit the masking properties of the HVS to increase compression efficiency while remaining compliant with the JPEG image compression standard is discussed in Reference [77]. In this approach, local texture measurements are used to compute visibility thresholds for the spatial frequency coefficients in each block. Coefficients that are smaller than their corresponding visibility threshold can be prequantized to zero without introducing any visual degradation. This prequantization allows some coefficients to be quantized to zero that otherwise would not be when the standard quantization matrix is applied. Zero values after quantization are more efficiently encoded than nonzero values, resulting in improved compression efficiency.

An alternative method to exploit the masking properties of the HVS within the framework of JPEG is discussed in Reference [78]. In this approach, the quantization matrix value applied to a particular frequency coefficient can be normalized based on previously encoded quantized frequency coefficients. The previously encoded information is used to estimate the amount of contrast masking present, and correspondingly the normalization factors that can be applied to subsequent quantization matrix values. In particular, when significant contrast masking is present, normalization values greater than one are used to increase the quantization applied to an individual frequency coefficient. The increased quantization results in increased compression efficiency, without introducing any visual degradation. Although this can be done using a single quantization matrix (for a given color channel) throughout the image as required by the JPEG standard, it does require a smart decoder to properly interpret the compressed image data. The smart decoder must duplicate the calculation of the normalization terms used during quantization in order to allow correct recovery of the frequency coefficients from their quantized representations.

5.9 Conclusion

The human visual system influences the design of image processing algorithms in a digital camera. This chapter illustrates the perceptual considerations involved in the many different steps of a digital camera image processing chain. Most tangibly, perceptual considerations affect processing algorithms to help ensure that the final image “looks right.” Noise reduction, color rendering, and edge enhancement are all performed with the human visual system in mind. Other aspects of the processing chain are affected by the human visual system as well. The design of the image sensor color filter array and the corresponding demosaicking algorithm exploit human visual system characteristics. Even computational and compression efficiency are influenced by perceptual considerations. Throughout the processing chain, perceptual considerations affect the design of image processing algorithms in a digital camera.

Appendix

Compact digital cameras usually have a pixel pitch between 1 and 2 μm, so the sharpness of the captured image is often limited by the lens even when the scene is at best focus. The MTF for a diffraction-limited lens with a circular aperture for a perfectly focused image can be expressed as follows [79]:

$$\mathrm{MTF}(\nu) = \begin{cases} \dfrac{2}{\pi}\left[\arccos\!\left(\dfrac{\nu}{\nu_C}\right) - \dfrac{\nu}{\nu_C}\sqrt{1 - \left(\dfrac{\nu}{\nu_C}\right)^{2}}\,\right] & \text{for } \nu < \nu_C, \\ 0 & \text{otherwise,} \end{cases} \tag{5.23}$$

where ν is spatial frequency and νC = 1/(λN) denotes the cutoff frequency, which depends on wavelength λ and the f/number N. This function is plotted for green light (550 nm) and a wide range of f/numbers in Figure 5.19.
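Equation 5.23 is straightforward to evaluate numerically; the following sketch can be used to generate curves like those in Figure 5.19 for any wavelength and f/number.

```python
import numpy as np

def diffraction_mtf(nu, wavelength_mm, f_number):
    """Diffraction-limited lens MTF for a circular aperture (Equation 5.23).
    `nu` is spatial frequency in cycles/mm, `wavelength_mm` the wavelength
    in millimeters (e.g., 550e-6 for green light), and `f_number` the lens
    f/number. Frequencies at or above the cutoff return zero.
    """
    nu = np.asarray(nu, dtype=float)
    nu_c = 1.0 / (wavelength_mm * f_number)            # cutoff frequency
    s = np.clip(nu / nu_c, 0.0, 1.0)
    mtf = (2.0 / np.pi) * (np.arccos(s) - s * np.sqrt(1.0 - s ** 2))
    return np.where(nu < nu_c, mtf, 0.0)

# Example: green light (550 nm) at f/2.8, evaluated at 100 cycles/mm.
print(diffraction_mtf(100.0, 550e-6, 2.8))
```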

FIGURE 5.19
Diffraction-limited lens MTF.

Each MTF curve in the figure is labeled with the f/number used to generate it. Inexpensive lenses rarely have an f/number below 2.5, even for fairly short focal lengths. Inexpensive lenses with a longer focal length have higher minimum f/numbers, often at least 5.6. The half-sampling frequency is also shown for several different pixel pitches, each labeled with the appropriate pixel pitch. Finally, small circles are shown for f/numbers equal to four and above where MTF goes through the frequency corresponding to the Rayleigh resolution limit (0.82νC). This figure shows that as the f/number is increased, the resolution limit is decreased and the MTF drops at all frequencies. While lenses are usually not diffraction-limited at their widest apertures, the MTF still tends to drop further as the f/number is increased.

References

[1] J.E. Adams and J.F. Hamilton, “Digital camera image processing chain design,” in Single-Sensor Imaging: Methods and Applications for Digital Cameras, R. Lukac (ed.), Boca Raton, FL, USA: CRC Press / Taylor & Francis, September 2008, pp. 67–103.

[2] L. Jones and C. Nelson, “Study of various sensitometric criteria of negative film speeds,” Journal of the Optical Society of America, vol. 30, no. 3, pp. 93–109, March 1940.

[3] A. Adams, The Negative. New York, USA: Little, Brown and Company, 1995.

[4] Commission Internationale de l’Eclairage, “CIE publication no. 15.2, Colorimetry,” Technical report, CIE, 1986.

[5] A. Stimson, “Photographic exposure measuring device,” U.S. Patent 3 232 192, February 1966.

[6] B. Johnson, “Photographic exposure control system and method,” U.S. Patent 4 423 936, January 1984.

[7] J.S. Lee, Y.Y. Jung, B.S. Kim, and S.J. Ko, “An advanced video camera system with robust AF, AE, and AWB control,” IEEE Transactions on Consumer Electronics, vol. 47, no. 3, pp. 694–699, August 2001.

[8] F. Arai, “Video camera exposure control method and apparatus for preventing improper exposure due to changing object size or displacement and luminance difference between the object and background,” U.S. Patent 5 049 997, September 1991.

[9] E. Gindele, “Digital image processing method and apparatus for brightness adjustment of digital images,” U.S. Patent 7 289 154, October 2007.

[10] S. Pattanaik, J. Ferwerda, M. Fairchild, and D. Greenberg, “A multiscale model of adaptation and spatial vision for realistic image display,” in Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, Orlando, FL, USA, July 1998, pp. 287–298.

[11] E. Reinhard, M. Stark, P. Shirley, and J. Ferwerda, “Photographic tone reproduction for digital images,” ACM Transactions on Graphics, vol. 21, no. 3, pp. 267–276, July 2002.

[12] D. Kerr, “APEX – the additive system of photographic exposure.” Available online, http://dougkerr.net/Pumpkin/articles/APEX.pdf, 2007.

[13] International Organization for Standardization, “Photography – digital still cameras – determination of exposure index, ISO speed ratings, standard output sensitivity, and recommended exposure index,” Technical report, ISO 12232, ISO TC42/WG 18, 2006.

[14] B. Pillman, “Camera exposure determination based on a psychometric quality model,” in Proceedings of IEEE Workshop on Signal Processing Systems, San Francisco, CA, USA, October 2010, pp. 339–344.

[15] B. Pillman and D. Jasinski, “Camera exposure determination based on a psychometric quality model,” Journal of Signal Processing Systems, submitted, 2011.

[16] K. Omata, T. Miyano, and M. Kiri, “Focusing device,” U.S. Patent 6 441 855, August 2002.

[17] M. Gamadia and N. Kehtarnavaz, “Enhanced low-light auto-focus system model in digital still and cell-phone cameras,” in Proceedings of IEEE International Conference on Image Processing, Cairo, Egypt, November 2009, pp. 2677–2680.

[18] M. Gamadia and N. Kehtarnavaz, “A real-time continuous automatic focus algorithm for digital cameras,” in Proceedings of IEEE Southwest Symposium on Image Analysis and Interpretation, Denver, CO, USA, March 2006, pp. 163–167.

[19] R. Evans, “Method for correcting photographic color prints,” U.S. Patent 2 571 697, October 1951.

[20] M. Kumar, E. Morales, J. Adams, and W. Hao, “New digital camera sensor architecture for low light imaging,” in Proceedings of IEEE International Conference on Image Processing, Cairo, Egypt, November 2009, pp. 2681–2684.

[21] L.M. Hurvich and D. Jameson, “An opponent-process theory of color vision,” Psychological Review, vol. 64, no. 6, pp. 384–404, November 1957.

[22] Y.I. Ohta, T. Kanade, and T. Sakai, “Color information for region segmentation,” Computer Graphics and Image Processing, vol. 13, no. 3, pp. 222–241, July 1980.

[23] E. Gindele, J. Adams, Jr., J. Hamilton, Jr., and B. Pillman, “Method for automatic white balance of digital images,” U.S. Patent 7 158 174, January 2007.

[24] T. Miyano and E. Shimizu, “Automatic white balance adjusting device,” U.S. Patent 5 644 358, July 1997.

[25] G. Finlayson, S. Hordley, and P. Hubel, “Color by correlation: A simple, unifying framework for color constancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 11, pp. 1209–1221, November 2001.

[26] H. Lee, “Method for computing the scene-illuminant chromaticity from specular highlights,” Journal of the Optical Society of America A, vol. 3, no. 10, pp. 1694–1699, October 1986.

[27] J.J. McCann, S.P. McKee, and T.H. Taylor, “Quantitative studies in retinex theory,” Vision Research, vol. 16, no. 5, pp. 445–458, May 1976.

[28] T. Miyano, “Auto white adjusting device,” U.S. Patent 5 659 357, August 1997.

[29] J.E. Adams, “Interaction between color plane interpolation and other image processing functions in electronic photography,” Proceedings of SPIE, vol. 2416, pp. 144–151, February 1995.

[30] B.E. Bayer, “Color imaging array,” U.S. Patent 3 971 065, July 1976.

[31] M.J. McMahon and D.I.A. MacLeod, “The origin of the oblique effect examined with pattern adaptation and masking,” Journal of Vision, vol. 3, no. 3, pp. 230–239, April 2003.

[32] K.R. Boff and J.E. Lincoln, Engineering Data Compendium: Human Perception and Performance. Wright-Patterson AFB, Armstrong Aerospace Medical Research Laboratory, OH, USA, 1988.

[33] J.E. Adams, J.F. Hamilton, M. Kumar, E.O. Morales, R. Palum, and B.H. Pillman, “Single capture image fusion,” in Computational Photography: Methods and Applications, R. Lukac (ed.), Boca Raton, FL, USA: CRC Press / Taylor & Francis, October 2010, pp. 1–62.

[34] J.D. Gaskill, Linear Systems, Fourier Transforms, and Optics. New York, USA: John Wiley & Sons, 1978.

[35] R.H. Hibbard, “Apparatus and method for adaptively interpolating a full color image utilizing luminance gradients,” U.S. Patent 5 382 976, January 1995.

[36] J.E. Adams, “Design of practical color filter array interpolation algorithms for digital cameras,” Proceedings of SPIE, vol. 3028, pp. 117–125, February 1997.

[37] J.E. Adams and J.F. Hamilton, “Adaptive color plan interpolation in single sensor color electronic camera,” U.S. Patent 5 506 619, April 1996.

[38] J.F. Hamilton and J.E. Adams, “Adaptive color plan interpolation in single sensor color electronic camera,” U.S. Patent 5 629 734, May 1997.

[39] D.R. Cok, “Signal processing method and apparatus for producing interpolated chrominance values in a sampled color image signal,” U.S. Patent 4 642 678, February 1987.

[40] B.W. Keelan, Handbook of Image Quality: Characterization and Prediction. New York, USA: Marcel Dekker, 2002.

[41] J. McElvain, S. Campbell, J. Miller, and E. Jin, “Texture-based measurement of spatial frequency response using the dead leaves target: Extensions, and application to real camera systems,” Proceedings of SPIE, vol. 7537, p. 75370D, January 2010.

[42] J.S. Lee, “Digital image smoothing and the sigma filter,” Computer Vision, Graphics, and Image Processing, vol. 24, no. 2, pp. 255–269, November 1983.

[43] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proceedings of Sixth International Conference on Computer Vision, Bombay, India, January 1998, pp. 839–846.

[44] S. Daly, “A visual model for optimizing the design of image processing algorithms,” in Proceedings of IEEE International Conference on Image Processing, Austin, TX, USA, November 1994, vol. 2, pp. 16–20.

[45] R. Jenkin and B. Keelan, “Perceptually relevant evaluation of noise power spectra in adaptive pictorial systems,” Proceedings of SPIE, vol. 7867, pp. 786708:1–12, January 2011.

[46] I. Newton, Opticks. London, UK: Smith & Walford, 1704.

[47] T. Young, “On the theory of light and colours,” Philosophical Transactions of the Royal Society of London, vol. 92, pp. 12–48, 1802.

[48] H. von Helmholtz, Helmholtz’s Treatise on Physiological Optics, vol. 2. Rochester, NY, USA: The Optical Society of America, 1924.

[49] J.C. Maxwell, “On the theory of compound colours, and the relations of the colours of the spectrum,” Philosophical Transactions of the Royal Society of London, vol. 150, pp. 57–84, 1860.

[50] D. Baylor, T. Lamb, and K.W. Yau, “Responses of retinal rods to single photons,” Journal of Physiology, vol. 288, pp. 613–634, March 1979.

[51] J. Nathans, “The evolution and physiology of human color vision: Insights from molecular genetic studies of visual pigments,” Neuron, vol. 24, no. 2, pp. 299–312, October 1999.

[52] H. Okawa and A.P. Sampath, “Optimization of single-photon response transmission at the rod-to-rod bipolar synapse,” Physiology, vol. 22, no. 4, pp. 279–286, August 2007.

[53] J.J. Sheppard, Human Color Perception. New York, USA: Elsevier, 1968.

[54] R.B. Lotto, S.M. Williams, and D. Purves, “Mach bands as empirically derived associations,” Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 9, pp. 5245–5250, April 1999.

[55] R.W. Pridmore, “Bezold-Brücke hue-shift as functions of luminance level, luminance ratio, interstimulus interval and adapting white for aperture and object colors,” Vision Research, vol. 39, no. 23, pp. 3873–3891, November 1999.

[56] R.G.W. Hunt, Measuring Color. Surrey, UK: Fountain Press, 3rd edition, 1998.

[57] M.D. Fairchild, “Considering the surround in device-independent color imaging,” Color Research and Application, vol. 20, no. 6, pp. 352–363, December 1995.

[58] G. Wyszecki and W.S. Stiles, Color Science: Concepts and Methods, Quantitative Data and Formulas. New York, USA: John Wiley & Sons, 2nd edition, 2000.

[59] International Organization for Standardization, “Colorimetry – Part 4: CIE L*a*b* colour space,” ISO 11664-4, 2008.

[60] Commission Internationale de l’Eclairage, “Industrial colour-difference evaluation,” CIE 116, 1995.

[61] Commission Internationale de l’Eclairage, “Improvement to industrial colour-difference evaluation,” CIE 142, 2001.

[62] R.S. Berns, Billmeyer and Saltzman’s Principles of Color Technology. New York, USA: John Wiley & Sons, 3rd edition, 2000.

[63] International Organization for Standardization, “Colorimetry – Part 1: Standard colorimetric observers,” ISO 11664-1, 2007.

[64] G. Sharma, “LCDs versus CRTs – color-calibration and gamut considerations,” Proceedings of the IEEE, vol. 90, no. 4, pp. 605–622, April 2002.

[65] R. Palum, “How many photons are there?,” in Proceedings of the Image Processing, Image Quality, Image Capture Systems Conference, Portland, OR, USA, April 2002, pp. 203–206.

[66] E.J. Giorgianni and T.E. Madden, Digital Color Management. Reading, MA, USA: Addison-Wesley, 1998.

[67] G.C. Higgins and R.N. Wolfe, “The relation of definition to sharpness and resolving power in a photographic system,” Journal of the Optical Society of America, vol. 45, no. 2, pp. 121–125, February 1955.

[68] E. Granger and K. Cupery, “An optical merit function (SQF), which correlates with subjective image judgments,” Photographic Science and Engineering, vol. 16, no. 3, pp. 221–230, May/June 1972.

[69] E. Jin, B. Keelan, J. Chen, J. Phillips, and Y. Chen, “Softcopy quality ruler method: Implementation and validation,” Proceedings of SPIE, vol. 7242, pp. 724206:1–14, January 2009.

[70] E. Jin and B. Keelan, “Slider-adjusted softcopy ruler for calibrated image quality assessment,” Journal of Electronic Imaging, vol. 19, no. 1, p. 011009, January 2010.

[71] S. Chang, B. Yu, and M. Vetterli, “Adaptive wavelet thresholding for image denoising and compression,” IEEE Transactions on Image Processing, vol. 9, no. 9, pp. 1532–1546, September 2000.

[72] K.A. Parulski and R. Reisch, “Digital camera image storage formats,” in Single-Sensor Imaging: Methods and Applications for Digital Cameras, R. Lukac (ed.), Boca Raton, FL, USA: CRC Press / Taylor & Francis, September 2008, pp. 351–379.

[73] G. Sharma and J. Trussell, “Digital color imaging,” IEEE Transactions on Image Processing, vol. 6, no. 7, pp. 901–932, July 1997.

[74] A.B. Watson (ed.), Digital Images and Human Vision. Cambridge, MA, USA: MIT Press, 1993.

[75] W.B. Pennebaker and J.L. Mitchell, JPEG Still Image Data Compression Standard. New York, USA: Van Nostrand Reinhold, 1993.

[76] M.G. Ramos and S.S. Hemami, “Perceptually based scalable image coding for packet networks,” Journal of Electronic Imaging, vol. 7, no. 3, pp. 453–463, July 1998.

[77] N. Jayant, J. Johnston, and R. Safranek, “Image compression based on models of human vision,” in Handbook of Visual Communications, K. Smith, S. Moriarty, G. Barbatsis, and K. Kenney (eds.), San Diego, CA, USA: Academic Press, 1995, pp. 73–125.

[78] S.J. Daly, C.T. Chen, and M. Rabbani, “Adaptive block transform image coding method and apparatus,” U.S. Patent 4 774 574, September 1988.

[79] G. Boreman, Modulation Transfer Function in Optical and Electro-Optical Systems. Bellingham, WA, USA: SPIE Press, vol. TT52, 2001.
