14

Perceptual Encryption of Digital Images and Videos

Shujun Li

14.1 Introduction

14.2 Fundamentals of Multimedia Coding and Cryptography

14.2.1 Image and Video Coding

14.2.2 Modern Cryptography and Cryptanalysis

14.3 Multimedia Encryption in General

14.3.1 Threat Model

14.3.2 Three Approaches to Multimedia Encryption

14.3.2.1 Encryption before Encoding

14.3.2.2 Encryption after Encoding

14.3.2.3 Joint Encryption and Encoding

14.4 Perceptual Multimedia Encryption

14.4.1 Definition and Performance Evaluation

14.4.2 Perceptual Encryption Schemes

14.4.2.1 Scalability-Based Encryption

14.4.2.2 DCT-Based Image and Video Encryption

14.4.2.3 DWT-Based Image and Video Encryption

14.4.2.4 Motion Vector Scrambling

14.4.2.5 Fixed-Length Codeword Encryption

14.4.2.6 Fractal Coding-Based Encryption

14.4.3 Cryptographical Security

14.4.3.1 Brute-Force Attacks

14.4.3.2 Ciphertext-Only Attacks

14.4.3.3 Known/Chosen-Plaintext Attacks and Chosen-Ciphertext Attacks

14.4.4 Perceptual Security

14.5 Error-Concealment Attacks

14.5.1 Naive Attacks

14.5.2 Local Constraints-Based Attacks

14.5.3 Prior Statistics-Based Attacks

14.5.4 DC Encryption Attacks Using Coefficient Recovery

14.5.5 DC Encryption Attacks via Underflow/Overflow Rate Minimization

14.5.6 Optimization-Based Approach to DCT Encryption Attacks

14.6 Conclusion

Acknowledgment

References

14.1 Introduction

Digital images and videos are so ubiquitous nowadays that it would be difficult to live without them in the heavily digitized and networked world. Various digital devices, such as digital cameras, camcorders, scanners, digital televisions, personal computers, mobile phones, and printers, rely on digital image and video processing and coding technologies. Despite the extreme usefulness of digital images and videos, more and more security and privacy concerns arise from a heavy dependence on them: how can sensitive information recorded as digital images and videos be protected from unauthorized access and misuse? Different people may have completely different reasons for such a security concern. For instance, digital content providers worry about pirated copies of their products, end users want to keep the private images and videos on their mobile phones and in their online web spaces safe from unwanted hands, and governmental/military/diplomatic bodies need to carefully protect classified documents (many of which are recorded in the form of digital images and videos). All of these demands require content protection and access control, which can be fulfilled by applying cryptography to multimedia data, that is, by encrypting digital images and videos and by defining who can access the encrypted contents.

A unique feature of digital images and videos is their bulky size. Despite the increasing power of computers and image/video coding standards, the demand for high-resolution and high-quality multimedia contents, such as high-definition television (HDTV) and stereoscopic television, keeps requiring more storage space for digital multimedia data. For example, in the era of the video compact disc (VCD), a typical movie was around 700 MB in size. Today, a high-definition movie on a Blu-ray disc can require more than 20 GB of space. A natural consequence of the enormous size of multimedia data is that full encryption of digital images and videos may not be necessary or economical in some applications. Because of the entertaining nature of much commercial multimedia content, selectively encrypting parts of the whole content to downgrade the visual quality to a predefined degree becomes a desirable solution in applications like pay-per-view television (TV) services. This leads to a new concept called perceptual encryption (or transparent encryption), whose goal is not to conceal all perceptual information but only a specific amount of it. Here, the leakage of partial perceptual information is intentional because the low-quality contents can be used to attract potential buyers and to allow further postprocessing (such as digital watermarking) of encrypted multimedia contents by an agent without access to the decryption key.

While perceptual encryption has to be implemented via selective encryption, the former has higher technical requirements than the latter, thus leading to many more technical challenges. Three main issues make perceptual encryption more complicated. First, while selective encryption can be achieved, for example, by randomly encrypting one bit out of every n bits of the image/video bitstream, perceptual encryption cannot be done this way because the bitstream has a strict syntax format that has to be respected so that the unencrypted part of the perceptual information can still be decoded. This calls for joint multimedia encryption-encoding, which is not trivial because most image and video coding standards are not designed to be encryption-friendly, and recent research has revealed that there exist trade-offs among the different requirements of compression and encryption. Second, perceptual encryption naturally asks for a definition of perceptual information, which has been a long-standing open problem in the field because of the very limited understanding of the human visual system. Last, dividing multimedia data into two parts implies that there are two ways of attacking the system: breaking the encrypted part directly and recovering the encrypted part from the unencrypted part. The second approach is possible in practice because existing compression techniques cannot completely remove correlative information among different parts of digital images and videos. Very recent research has demonstrated that attacks based on such correlative information are far more effective than originally expected. These attacks also challenge experts to rethink how perceptual information should be encoded more effectively and how encryption should be combined with compression to ensure a desired level of security.

This chapter gives an overview of perceptual encryption of digital images and videos, paying special attention to very recent research on advanced attacks on perceptual encryption schemes for images and videos coded using the discrete cosine transform (which covers most mainstream image and video coding standards like JPEG and MPEG-1/2/4). Specifically, Section 14.2 introduces some fundamental background on multimedia coding and cryptography. Section 14.3 surveys multimedia encryption to prepare some basic facts about joint multimedia encryption-encoding (JMEE) as the foundation of perceptual encryption. Section 14.4 focuses on different aspects of perceptual encryption, including a brief survey of various encryption schemes and classifications of different attacks on these schemes. Section 14.5 introduces a special form of attack, the error-concealment attack (ECA), which represents the latest and most interesting advances in the field. Finally, this chapter concludes with Section 14.6 by pointing out some future research trends.

14.2 Fundamentals of Multimedia Coding and Cryptography

14.2.1 Image and Video Coding

Multimedia technology has developed rapidly over the past several decades. Multimedia coding – compression and binary representation of multimedia data – plays a key role in multimedia applications [1]. Among different forms of multimedia data, digital images and videos are widely used. A digital image is a two-dimensional matrix of pixels whose values represent the brightness or the color of discrete points on the image plane. Depending on the nature of an image, a pixel value can be either a scalar, which is the case of binary and grayscale images, or a triplet representing some color, which is the case of red-green-blue (RGB) color images, for instance. A pixel value can also contain more than three color components, which is the case for multispectral images, but this chapter will not discuss such cases. However, note that all facts about RGB images discussed in this chapter can be generalized to multispectral images because they are largely independent of the dimension of pixel values. Similarly, a digital video is a three-dimensional matrix of pixels where the additional dimension is time. In other words, a digital video is simply a collection of digital images with a sequential order in temporal domain. Each image contained in a digital video is called a frame, and the number of video frames per second is called the frame rate of the video. In some video coding systems, a video frame is divided into a top field composed of odd rows and a bottom field composed of even rows; the two fields are coded separately, which is called interlaced video coding. This chapter does not differentiate video frames from fields to simplify discussion.

FIGURE 14.1
The general structure of an image or video codec, where the dotted parts represent optional components: (a) encoder, (b) decoder.

Whenever coding is discussed here, both encoding and decoding are considered. The encoding process (or an encoder) transforms an image or a video into a bitstream consuming less space, and the decoding process (or a decoder) recovers the encoded bitstream back to the original uncompressed image/video. An encoder and a decoder are often collectively called a codec.

The encoding and decoding processes are basically symmetric in structure, but they can have completely different computational complexities. Normally, the decoder is much less computationally costly than the encoder. The asymmetry is mainly due to different goals of the two processes: encoding tries to find the compression transform with the best overall encoding performance (compression efficiency as one major but not the only factor), but decoding merely translates what has been encoded in a bitstream into perceptual information following the transform selected by the encoding process. In some cases, however, there is a different scenario: the encoding process should be computationally light but the decoding process can be more computationally expensive. A typical application is wireless multimedia sensor networks (WMSNs) [2], in which encoding is done by resource-constrained sensor nodes (often powered by battery) and decoding by computationally more powerful devices like multicore computers. A special multimedia coding architecture called distributed coding has been developed to handle such asymmetric coding [3].

Although there are many different image and video coding schemes, almost all of them follow the basic structure shown in Figure 14.1. Because of the similarity between the encoder and the decoder, their functionality is briefly explained on the example of an encoder. In the first step, the input image or video passes a preprocessing step, which can include several different processes, such as color space conversion, signal partitioning, and prefiltering (for example, subsampling and noise removal). The main goal of this step is to prepare data for future steps and remove some redundant information that does not need to be processed by later steps. For instance, encoding of color images and videos normally involves a conversion from RGB space to another color space, such as YCbCr or YUV, to separate the luminance and chrominance components so that the lower sensitivity of the human visual system to chrominance components can be exploited to reduce the information needed for encoding. The reduction of chrominance information is done by subsampling the chrominance components, that is, sampling them at a lower (normally halved) sampling rate than the luminance component. While this immediately leads to a 2:1 compression ratio, the visual quality of the image/video reconstructed from the halved data remains almost as good as that of the original image/video. Partitioning is another widely used preprocessing technique, which is often useful for speeding up the encoding process.
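To make the color conversion and chrominance subsampling concrete, the following sketch (an illustrative simplification, with function names invented for this chapter, not taken from any codec) converts an RGB image to YCbCr using the BT.601 matrix commonly assumed by JPEG, and halves a chroma plane in both directions by 2 × 2 averaging (4:2:0):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """Convert an H x W x 3 RGB array (0-255) to full-range YCbCr.

    Uses the ITU-R BT.601 conversion matrix (the one assumed by JPEG).
    """
    m = np.array([[ 0.299,     0.587,     0.114],
                  [-0.168736, -0.331264,  0.5],
                  [ 0.5,      -0.418688, -0.081312]])
    ycbcr = rgb.astype(np.float64) @ m.T
    ycbcr[..., 1:] += 128.0          # center chroma components around 128
    return ycbcr

def subsample_420(chroma):
    """Halve a chroma plane in both directions by 2x2 averaging (4:2:0)."""
    h, w = chroma.shape
    return chroma[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

# A 4x4 mid-gray image: luminance stays at 128, chroma is neutral (128).
img = np.full((4, 4, 3), 128, dtype=np.uint8)
ycc = rgb_to_ycbcr(img)
cb_small = subsample_420(ycc[..., 1])   # 2x2 plane: a quarter of the samples
```

Each chroma plane shrinks to a quarter of its original size, which is where the 2:1 overall compression ratio mentioned above comes from (full-size Y plus two quarter-size chroma planes versus three full-size planes).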

The second step is lossy coding. It aims at removing psycho-visually redundant information, thus creating an image that is perceptually nearly indistinguishable from the original. The lossy coding part is usually realized by converting the data from spatial domain to frequency domain via a transform, such as the discrete cosine transform (DCT) [4] and the discrete wavelet transform (DWT) [5], and then quantizing the resultant transform coefficients to be integers. The main purpose of using a transform is to decorrelate the original image/video so that less correlative information exists in the new transform domain than in spatial domain. Performing a transform on the whole image/video is computationally expensive, so the common practice is to first partition the image and each video frame into n × n blocks and apply the transform locally on each block. The lossy coding step is optional because in some applications, such as medical imaging and forensics, it may not be desired to lose any information (even when it is invisible to human eyes).
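The block-transform-plus-quantization pipeline described above can be sketched in a few lines; the orthonormal DCT matrix below is standard, but the uniform quantization step q = 16 is an arbitrary illustrative choice rather than a value from any coding standard:

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix, as used in 8x8 block coding."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def encode_block(block, q=16):
    """Transform an 8x8 pixel block and uniformly quantize the coefficients."""
    d = dct_matrix()
    coeffs = d @ (block - 128.0) @ d.T       # level-shift, then 2-D DCT
    return np.round(coeffs / q).astype(int)  # the lossy step: quantization

def decode_block(qcoeffs, q=16):
    d = dct_matrix()
    return d.T @ (qcoeffs * q) @ d + 128.0

block = np.full((8, 8), 130.0)   # a flat block: all energy goes into the DC term
qc = encode_block(block)
rec = decode_block(qc)
```

For this flat block only one coefficient (the DC term) survives quantization, illustrating how the transform decorrelates the data and concentrates energy into few coefficients.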

The lossy coding step is followed by lossless coding, which aims at encoding the data in a more compact form so that fewer bits are needed to represent the same amount of information. Since no information loss is involved in lossless coding, all information can later be exactly recovered by the decoder from the encoded bitstream. Lossless coding relies on information theory [6], which introduced a concept called information entropy to measure the amount of information contained in a random source of messages and offered a number of effective ways of representing messages more compactly while still losslessly. All lossless coding methods based on information entropy are collectively called entropy coding. The two commonly used entropy coding algorithms are Huffman coding and arithmetic coding; both algorithms can be made context adaptive to improve their encoding performance. Besides entropy coding, some other lossless coding algorithms have also been developed. The two most important ones are dictionary coding and run-length coding. Dictionary coding compresses a message by building a static or dynamic dictionary of strings and replacing an input message by a sequence of dictionary entries. Run-length coding represents a message as a sequence of (run, level) pairs, which can effectively compress the message if the distribution of “run” has a long tail. Both methods have their roots in information theory, but normally they are not classified as entropy coding. Many lossless coding algorithms, including Huffman coding and arithmetic coding, belong to variable-length coding (VLC) because they encode an input symbol into a variable number of bits.
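The (run, level) representation can be sketched as follows; the trailing-zeros convention (a final pair with level 0) is one hypothetical design choice among several used in practice, and the function names are the author's own:

```python
def rle_encode(symbols):
    """Encode a list as (run, level) pairs: `run` zeros followed by `level`.

    This style suits zig-zag scanned AC coefficients, where long runs of
    zeros are common after quantization.
    """
    pairs, run = [], 0
    for s in symbols:
        if s == 0:
            run += 1
        else:
            pairs.append((run, s))
            run = 0
    if run:
        pairs.append((run, 0))  # trailing zeros, marked by a level of 0
    return pairs

def rle_decode(pairs):
    out = []
    for run, level in pairs:
        out.extend([0] * run)
        if level != 0:
            out.append(level)
    return out

coeffs = [5, 0, 0, 0, -2, 1, 0, 0, 0, 0]
pairs = rle_encode(coeffs)   # [(0, 5), (3, -2), (0, 1), (4, 0)]
assert rle_decode(pairs) == coeffs
```

Ten symbols shrink to four pairs here; the pairs themselves would then be entropy coded (for example, with Huffman codes) in a real codec.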

The last step of the encoding process is postprocessing. It mainly aims at collecting the output from the previous step and assembling the final bitstream. In some applications, the final bitstream may contain several quality layers that should be encoded in such a way that separate decoding is possible. When there are multiple streams involved, for example, the audio and video streams of a TV program or the left- and right-eye video streams of a stereoscopic video, a container is needed to hold the multiple streams and some mechanism is needed to guarantee the synchronization among them. Unlike the previous steps, this last step does not involve any compression technique.

In addition to the above four steps, there is another optional step called predictive coding. It can be done together with the first two steps, which explains its multiple connections with other parts of the encoding process. As its name implies, the goal of predictive coding is to predict the not-yet-encoded part from already-encoded data in order to reduce the amount of information for further encoding. If the prediction error (also called residual data) has a more skewed distribution than the original data, which is normally the case, information theory guarantees that the compression efficiency will be higher. For instance, in image and video coding based on the DCT, the first transform coefficient (called the DC coefficient) of each block can be predicted from the previous block’s DC coefficient with fairly high accuracy. Prediction can also happen in the temporal domain, where the motion of objects in the input video is estimated and the obtained motion vectors are encoded as part of the final bitstream. Predictive coding is particularly important for video coding because consecutive frames are often highly correlated in both the spatial and temporal domains.
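DC coefficient prediction is a simple form of differential coding (DPCM), which can be sketched as follows (function names invented for illustration):

```python
def dpcm_encode(dc_values):
    """Differential coding of per-block DC coefficients: each DC value is
    predicted by the previous block's DC value, and only the prediction
    error (residual) is passed on to entropy coding."""
    prev, residuals = 0, []
    for dc in dc_values:
        residuals.append(dc - prev)
        prev = dc
    return residuals

def dpcm_decode(residuals):
    prev, out = 0, []
    for r in residuals:
        prev += r
        out.append(prev)
    return out

dcs = [120, 122, 121, 121, 125]
res = dpcm_encode(dcs)   # [120, 2, -1, 0, 4]: small residuals dominate
assert dpcm_decode(res) == dcs
```

Because neighboring blocks have similar brightness, the residuals cluster near zero, giving them the skewed distribution that makes subsequent entropy coding more effective.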

Different image and video coding schemes can be developed by combining different techniques for different steps of the whole encoding/decoding process. Various image and video coding standards have been established, among which JPEG and MPEG-1/2/4 are the most successful. The development of image and video standards on one hand takes advantage of existing technologies, and on the other hand promotes new research by offering a platform for testing different kinds of new technologies. As a matter of fact, the latest image and video coding standards, such as JPEG2000 [7], JPEG-XR [8], MPEG-4 AVC/H.264 [9], and SMPTE VC-1 [10], represent the state of the art of the field.

Generally speaking, a multimedia coding standard defines a syntax format of the encoded bitstream and a method of decoding an encoded bitstream to reconstruct the original multimedia data (that is, the input of the encoder) with an acceptable perceptual quality. Multimedia coding standards are designed to offer balanced solutions to the following goals simultaneously: good perceptual quality, high compression efficiency, and low computational complexity of the codec. In addition, some other factors are often taken into account to enhance the applicability of the designed standard, which include flexibility to support diverse applications, backward compatibility with old standards, scalability, resilience against transmission errors, and so forth. Because of the limited space, existing image and video coding standards are not discussed in detail; this chapter aims to describe perceptual encryption schemes working with those standards in a more general sense.

14.2.2 Modern Cryptography and Cryptanalysis

Modern cryptography is the science of securing digital systems. It is often considered a subfield of computer science, although it has strong links to information theory (and thus communications in general) and mathematics (especially number theory, finite fields, and combinatorics) [11], [12]. Traditionally, cryptography was mainly about data encryption, but nowadays more complicated cryptographic systems, such as cryptographic hash functions, digital signatures, and security protocols, have been developed. This chapter is mainly focused on data encryption. The nature of modern cryptography implies that cryptanalysis, the study of how to break existing cryptographic systems, is extremely important and plays a key role in the development of any secure system. In some sense, it can be said that modern cryptography is built on top of cryptanalysis; or that cryptology is composed of two subfields, cryptography and cryptanalysis, one about designing secure systems and the other about breaking them. This subsection introduces only some very basic concepts and terms of modern cryptography and avoids discussing any ad hoc designs and attacks, as these topics are widely discussed in References [11] and [12].

FIGURE 14.2
The encryption and decryption procedures of an encipher/decipher pair.

An encryption system is also called an encipher, and a decryption system is called a decipher. When there is no need to differentiate an encipher and a decipher or the same system can be used for both encryption and decryption, such a system can be simply called a cipher or a cryptosystem.1 The plain message taken as the input of an encipher (and the output of a decipher) is called a plaintext. Similarly, the encrypted message as the output of an encipher (and the input of a decipher) is called a ciphertext. Denoting the plaintext and the ciphertext by P and C, respectively, the encryption procedure of an encipher can be described by

C = E_Ke(P),  (14.1)

where Ke is the encryption key and E (·) is the encryption function implemented by the encipher. Similarly, the decryption procedure of a decipher can be described by

P = D_Kd(C),  (14.2)

where Kd is the decryption key and D (·) is the decryption function implemented by the decipher. If an encipher and a decipher are concatenated by feeding the ciphertext C to the decipher, the following is obtained:

P = D_Kd(E_Ke(P)).  (14.3)

Basically, the decryption function D (·) can be considered as the inverse of the encryption function E (·). Figure 14.2 illustrates how an encipher and a decipher can work together to allow secure transmission of data over a public insecure channel.
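Equations (14.1) through (14.3) can be made concrete with a toy stream cipher; the SHA-256-based keystream below is purely illustrative and NOT a secure cipher, and all names are the author's own:

```python
import hashlib

def keystream(key, n):
    """Toy keystream: iterated SHA-256 of the key. NOT a real cipher;
    it only serves to make Equations (14.1)-(14.3) concrete."""
    out, state = b"", key
    while len(out) < n:
        state = hashlib.sha256(state).digest()
        out += state
    return out[:n]

def E(key, plaintext):                     # C = E_Ke(P), Equation (14.1)
    ks = keystream(key, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, ks))

D = E  # XOR is self-inverse, so here D_Kd(.) equals E_Ke(.) with Kd = Ke

P = b"pixel data"
C = E(b"secret key", P)
assert D(b"secret key", C) == P            # P = D_Kd(E_Ke(P)), Equation (14.3)
assert C != P
```

The last assertion illustrates the round-trip of Equation (14.3): decryption with the matching key exactly inverts encryption.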

1The term cryptosystem actually can be used to represent more cryptographic systems beyond ciphers, but in many application scenarios it is a synonym of cipher.

Depending on whether the encryption and decryption keys can be made public, there are two different kinds of ciphers. When both keys have to be kept secret, the cipher is called a private-key cipher or a symmetric cipher. The second name comes from the fact that for most private-key ciphers the encryption key and the decryption key are identical, or the decryption key can be derived from the encryption key so that the two can be considered identical. The main problem with private-key ciphers is key management. Since the keys have to be kept secret, they have to be transmitted to both the sender and the receiver in advance via a secret channel. For n users in a community, communications between any two of these users will require secure transmission of n(n − 1)/2 keys, which can become difficult or impossible to manage when n becomes large. This problem can be solved if only the decryption key needs to be kept secret while the encryption key can be made public, thus leading to the so-called public-key cipher or asymmetric cipher. Clearly, for an n-user community only n encryption keys need to be published to support secure communications between any two users. Normally, public-key ciphers are much slower than private-key ciphers due to their special properties, but the two kinds can be combined in such a way that a public-key cipher is used to implement the key management process of a private-key cipher. This chapter is more concerned with private-key ciphers because encryption of multimedia data is always performed by a private-key cipher.
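The quadratic growth of the key management burden can be checked with a short calculation (the function name is invented for illustration):

```python
def private_keys_needed(n):
    """Pairwise secret keys for an n-user community: one per pair of users."""
    return n * (n - 1) // 2

# 10 users already need 45 pairwise keys; 1000 users need nearly half a
# million, whereas a public-key scheme would publish only 1000 keys.
assert private_keys_needed(10) == 45
assert private_keys_needed(1000) == 499500
```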

Private-key ciphers can be further divided according to their encryption structure into block ciphers and stream ciphers. Block ciphers encrypt the plaintext block by block in the same manner, meaning that the same plaintext block will lead to the same ciphertext block independent of the position of the block in the whole plaintext/ciphertext. In contrast, stream ciphers encrypt the plaintext based on a pseudo-random sequence (called a keystream) generated under the control of the encryption key, thus producing different ciphertext blocks for two identical plaintext blocks at different positions. Over the years, many different block ciphers and stream ciphers have been proposed, among which the Advanced Encryption Standard (AES) [13] was the most widely used block cipher and the (Alleged) Rivest Cipher 4 (RC4) the most widely used stream cipher [14] at the time this chapter was written.

While block ciphers are memoryless compared with stream ciphers, they can be made to behave similarly to stream ciphers by running them under different modes of operation. There are many different modes of operation, among which the naive way of running a block cipher is called the electronic code book (ECB) mode. By introducing ciphertext feedback, that is, revising the encryption procedure as follows:

C_i = E(P_i ⊕ C_{i-1}),  (14.4)

the so-called cipher block chaining (CBC) mode is obtained, where C_0 is a given initial vector (IV) and ⊕ denotes the XOR operation. There are also other modes that can transform a block cipher into a stream cipher, thus making block ciphers a foundation for building stream ciphers. This can be done, for example, by repeatedly encrypting an initial vector, a counter, or a combination of both. One of the reasons why more advanced modes of operation are needed is a security problem with the ECB mode, as shown in Figure 14.3.
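The chaining structure of Equation (14.4) can be sketched as follows; the "block cipher" here is a trivial XOR stand-in used only to expose the chaining, not a real cipher like AES:

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def toy_block_encrypt(key, block):
    """Stand-in for a real block cipher E: XOR with the key. This is only
    enough to demonstrate the chaining structure, nothing more."""
    return xor_bytes(block, key)

def cbc_encrypt(key, iv, blocks):
    """C_i = E(P_i XOR C_{i-1}), with C_0 = IV, per Equation (14.4)."""
    prev, out = iv, []
    for p in blocks:
        c = toy_block_encrypt(key, xor_bytes(p, prev))
        out.append(c)
        prev = c
    return out

key, iv = b"\x5a" * 4, b"\x10" * 4
blocks = [b"\x00\x01\x02\x03", b"\x00\x01\x02\x03"]  # two identical blocks
c1, c2 = cbc_encrypt(key, iv, blocks)
assert c1 != c2   # unlike ECB, equal plaintext blocks encrypt differently
```

The final assertion captures exactly the weakness of the ECB mode illustrated in Figure 14.3: under ECB the two identical plaintext blocks would yield identical ciphertext blocks, leaking the structure of the image.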

Following a well-known principle proposed in References [15] and [16], the security of a cipher should rely solely on the decryption key Kd, not on the secrecy of the algorithm itself. This principle is widely accepted in modern cryptography because an obscured algorithm can be easily reverse engineered by crackers or exposed by insiders. Keeping this principle in mind, an upper bound on the security of a cipher is given by the size of the key space, that is, the number of all possible keys. This is because the simplest attack on a cipher is to exhaustively try every possible key and see which one gives a meaningful output. Such a naive attack is called a brute-force attack in cryptanalysis. Considering the computational power of today’s computers and distributed computing, the key space is required to be as large as 2^128 or even 2^256. In addition to the brute-force attack, there are also attacks defined according to the capability of the attacker to access/choose plaintexts and/or ciphertexts:

FIGURE 14.3
The encryption results of an image by AES running in ECB and CBC modes: (a) plaintext, (b) ciphertext of ECB mode, and (c) ciphertext of CBC mode.

  • ciphertext-only attack: the attacker can access the ciphertexts only;

  • known-plaintext attack: the attacker has access to some plaintexts and the corresponding ciphertexts;

  • chosen-plaintext attack: the attacker can choose plaintexts and get the corresponding ciphertexts; and

  • chosen-ciphertext attack: the attacker can choose ciphertexts and get the corresponding plaintexts.

The ciphertext-only attack represents the most common attacking scenario because the ciphertext is supposed to be transmitted through a public channel (otherwise, data encryption is not needed at all). The known-plaintext attack is often possible in practice because many files have fixed headers and the contents of encrypted messages may be guessed in some cases (or simply exposed by the sender or receiver intentionally or unintentionally2). Chosen-plaintext and chosen-ciphertext attacks become possible when the attacker gets temporary access to the encipher and the decipher (but not the key), respectively. Note that the goal of a ciphertext-only attack is usually the plaintext, but the other attacks normally target the decryption key.

2For instance, a receiver may decide to reveal a secret message once he/she believes that the value of the message itself has expired. But the exposure of this message may be useful for an attacker to accumulate information for breaking future messages encrypted with the same key.
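As a back-of-the-envelope illustration of why a 2^128 key space is considered beyond brute force, assume (hypothetically) a million machines each testing 10^12 keys per second; on average half the key space must be searched:

```python
# Hypothetical brute-force estimate: 10^6 machines, 10^12 key trials per
# second each, searching half of a 128-bit key space on average.
trials_per_second = 10**12 * 10**6
seconds = (2**128 / 2) / trials_per_second
years = seconds / (365 * 24 * 3600)
assert years > 10**12   # still over a trillion years on average
```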

FIGURE 14.4
The threat model of a typical multimedia encryption system.

Besides the above attacks, there are also many other attacks defined in different contexts. For instance, when a cipher is implemented in a real product, the energy and/or the time consumed on encrypting different plaintexts with different encryption keys may differ. Such differences may be captured by an eavesdropper to derive useful information about the plaintext and the key, if one has access to the environment of the encryption machine. Such implementation-oriented attacks are called side channel attacks and they have been proved very effective in breaking many real implementations of theoretically secure ciphers.

14.3 Multimedia Encryption in General

As pointed out in the introductory section, depending heavily on multimedia creates the need to secure the data for both security and privacy purposes. As a consequence, multimedia encryption has attracted attention from all parties in the multimedia production and consumption chain, including content creators, producers, distributors, and consumers. Some standardization efforts have been made to facilitate the incorporation of security tools (including ciphers) into multimedia codecs to build digital rights management (DRM) systems working with image and video coding standards like JPEG2000 and MPEG-2/4 [17], [18], [19]. Content and copy protection schemes have also been deployed on digital versatile discs (DVDs), advanced video coding high definition (AVCHD) media, and Blu-ray discs to thwart unauthorized access to encrypted multimedia contents [20]. This section discusses the threat model and the main technical challenges related to multimedia encryption in general.

14.3.1 Threat Model

A typical multimedia encryption system consists of an encoder-encipher (equipped with an encryption key Ke) and a decoder-decipher (equipped with a decryption key Kd), as shown in Figure 14.4. In some applications the input data may already be compressed; in such cases, a decoder (or a partial decoder) is needed to decode the data first before encoding and encryption. A decoder without access to the decryption key Kd may try to decode the encoded-encrypted multimedia data. It is also possible that some multimedia processing units are inserted between the encoder-encipher and the decoder-decipher to perform additional manipulations on the encoded-encrypted multimedia data, such as watermark embedding and detection, re-encryption, filtering, bitrate re-control, re-packetization, transcoding, and so on. In most cases, these in-between multimedia processing units have access to neither the encryption key nor the decryption key. They may even be encryption-agnostic, that is, unaware of whether the bitstream has been encrypted. On the other hand, the encoder-encipher often has no clue which multimedia processing operations may happen after the encoded-encrypted data are sent out to the transmission channel.

FIGURE 14.5
Three possible approaches to multimedia encryption at the encoder side.

The threat model of such a multimedia encryption system is as follows. An attacker may appear at any point after the encoder-encipher and may have access to a compliant decoder, but not to the decryption key. The attacker is assumed to know all the details of the encryption algorithm and of the key management process. The attacker may have access to a database of multimedia data of the same type as the plaintext. The main goal of an attacker is to recover the decryption key, or to completely/partially recover the perceptual information in the encrypted multimedia data. In other words, ciphertext-only attacks, known/chosen-plaintext attacks, and chosen-ciphertext attacks are all possible threats. Side channel attacks may also be possible if the attacker has access to the ambience of the encoder-encipher or the decoder-decipher.

14.3.2 Three Approaches to Multimedia Encryption

According to where encryption is added into a multimedia encoder, there are three possible approaches to realize multimedia encryption, as shown in Figure 14.5: encryption before encoding, encryption after encoding, and joint encryption-encoding. The third approach also covers the case when an encoded multimedia file/bitstream is (partially) decoded and a joint process of encryption and (partial) re-encoding is performed to generate the encrypted multimedia file/bitstream. Note that the (partial) decoding and re-encoding processes may not involve decompression and/or re-compression. One typical example is sign bit encryption, in which only partial decompression is involved in the decoder at the encipher side, since the encrypted sign bits can be directly put back into their original places in the input (already encoded) multimedia data.
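Sign bit encryption, mentioned above, can be sketched as follows; the SHA-256-based bit generator and the function name are placeholders invented for illustration, not an actual published scheme:

```python
import hashlib

def sign_bit_encrypt(coeffs, key):
    """Flip the sign of each nonzero coefficient according to a key-driven
    bit. Because only signs change, the encrypted values can be written
    back into their original places in the bitstream without full
    re-encoding. The SHA-256 bit generator is a placeholder, not a cipher."""
    digest = hashlib.sha256(key).digest()
    bits = [(digest[i // 8] >> (i % 8)) & 1 for i in range(len(coeffs))]
    return [-c if (b and c != 0) else c for c, b in zip(coeffs, bits)]

coeffs = [12, -3, 0, 7, 0, -1]
enc = sign_bit_encrypt(coeffs, b"key")
# Decryption is the same operation with the same key: sign flips cancel.
assert sign_bit_encrypt(enc, b"key") == coeffs
```

Note that the coefficient magnitudes (and hence the codeword lengths) are untouched, which is exactly why this kind of encryption fits the joint encryption-encoding approach without breaking the bitstream syntax.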

The first two approaches are much simpler due to their independence of the encoding process. Unfortunately, they suffer from some nontrivial drawbacks that cannot be easily overcome without introducing dependence on the encoding process. The following briefly discusses the three approaches to multimedia encryption and explains why the third approach, that is, joint encryption-encoding, is of greater interest.

14.3.2.1 Encryption before Encoding

This approach has been followed by some multimedia encryption algorithms designed to encrypt raw multimedia data. However, it hardly works with any existing multimedia coding system, because encryption generally produces a random-like output that cannot be compressed effectively. There do exist some special algorithms supporting compression after encryption. One solution based on distributed source coding was reported in Reference [21], which treats the encryption key as useful side information available at the decoder side. For this solution, decoding is generally impossible without knowledge of the key, so decryption and decoding have to be carried out simultaneously, which is undesirable in applications requiring format compliance. In addition, this specific scheme places constraints on the encryption algorithm used, so not all available ciphers can be chosen and deployed.

14.3.2.2 Encryption after Encoding

This is the most direct and simplest approach, and is often called naive encryption in the literature [22]. Compared with the previous approach, this approach has no negative influence on the compression process and can run (much) faster, since the data being encrypted have already been efficiently compressed by the encoder. The main drawback of naive encryption is its inability to maintain format compliance of compressed multimedia data, a natural result of its independence from the encoding process. Here, format compliance is defined as follows: the encoded-encrypted data can still be (perhaps partially) decoded without access to the key. Although format compliance is not important in all applications, it is an essential requirement of many multimedia applications such as the following:

  • in-between multimedia processing of encrypted multimedia data without access to the key;

  • perceptual encryption: encryption of selective perceptual information, which requires that the unencrypted part of the multimedia data can still be decoded without access to the key;

  • scalable encryption: different resolutions and quality levels of the same multimedia bitstream have different encryption configurations; and

  • region-of-interest encryption: encryption is selectively performed on regions of interest (maybe dynamically detected objects) instead of the whole multimedia data.

The following example demonstrates why format compliance is useful in practice. Imagine a web site hosting millions of compressed video files uploaded by a large number of registered users. One user, Alice, wants some of her video files to be publicly accessible to all registered users, but only at a visual quality determined by a threshold q. Interested users have to pay a fee to Alice for the high-resolution videos. The video hosting site automatically embeds a watermark into each video file to track each specific user for anti-piracy purposes. This digital watermarking process has to be done without access to the decryption key, because Alice does not fully trust the server. Clearly, blindly applying a cipher like AES to an encoded video file cannot help Alice. Instead, the encipher has to be aware of the syntax format of the video file, in order to know which syntax elements should be encrypted. That is, Alice needs joint encryption-encoding – the third approach to multimedia encryption.

It deserves mentioning that homomorphic encryption [23] may be used to achieve format compliance for some applications. However, the performance of homomorphic encryption is still very limited because a really practical encryption scheme should be homomorphic to all possible signal processing and compression operations, which include not only addition and multiplication, but also nonlinear operations like non-uniform quantization. Although designing an encryption scheme homomorphic to a single operation like addition or multiplication is easy, it remains an open problem to design a practical encryption scheme homomorphic to various operations simultaneously [24], [25].
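As a minimal illustration of why a single-operation homomorphism is easy to obtain, the following toy sketch (an invented example with hypothetical parameters, not a scheme from the cited references) shows a shift cipher over the integers mod N that is homomorphic to addition. It offers no real security, and the same trick plainly does not extend to nonlinear operations such as non-uniform quantization:

```python
# Toy example: a shift cipher over Z_N is homomorphic to addition.
N = 2 ** 16  # hypothetical working modulus

def enc(m, k):
    """Encrypt message m with key k (both integers mod N)."""
    return (m + k) % N

def dec(c, k):
    """Decrypt ciphertext c with key k."""
    return (c - k) % N

m1, m2, k1, k2 = 1000, 2345, 31337, 54321
c_sum = (enc(m1, k1) + enc(m2, k2)) % N            # add two ciphertexts
assert dec(c_sum, (k1 + k2) % N) == (m1 + m2) % N  # yields the sum of plaintexts
```

The holder of the combined key (k1 + k2) mod N can decrypt the sum without learning either addend, but no such simple structure survives a nonlinear operation, which is the crux of the open problem mentioned above.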

14.3.2.3 Joint Encryption and Encoding

While joint encryption-encoding is the only approach that maintains compression efficiency and format compliance simultaneously, it actually has more advantages over the other two approaches. Essentially, joint encryption-encoding is the problem of how to apply a cipher to multimedia data so that some application-oriented requirements can still be fulfilled. As part of the encoding process, the encipher can adapt itself to achieve an optimal/acceptable solution for the target application. When there are multiple performance factors, which is the normal case in practice, jointly performing encryption and encoding often helps to achieve a better trade-off among all the performance factors.

Since the 1990s, a large number of joint multimedia encryption-encoding (JMEE) algorithms have been proposed. The basic encryption techniques [26] include secret permutations of syntax elements at different levels, fixed-length codeword encryption, variable-length codeword index encryption, secret entropy coding, “virtual full encryption” working with context-adaptive coding, header encryption, and so forth. The three main aspects of joint encryption-encoding are what cipher should be used, what syntax elements should be encrypted, and how the selected syntax elements should be encrypted. Most JMEE algorithms are independent of the choice of the underlying cipher, although some can only work with stream ciphers or newly designed ciphers tailored for multimedia encryption.

For a typical JMEE algorithm, several performance factors need to be considered simultaneously, which mainly include security, format compliance, compression efficiency, computational complexity, energy efficiency, size preservation, and parallelism. The need for format compliance has been clarified in the previous subsection and the security requirement is obvious, so the following discusses other factors.

  • Compression efficiency becomes a concern when the encryption algorithm changes the behavior of the encoder, which often decreases the compression efficiency to some extent. Note that compression efficiency is always calculated and compared with respect to a level of perceptual quality.

  • Computational complexity determines the processing speed and the throughput of the JMEE system. Since encryption will always add more computations on top of the multimedia encoding process, it is beneficial to minimize the additional complexity induced by encryption.

  • Energy efficiency is one of the most crucial factors for resource-constrained devices like wireless multimedia sensor networks (WMSNs) [2]. It has a close link to computational complexity, but cannot be replaced by the latter because different computing components (for example, CPU and the wireless module) often have different energy consumption rates. This implies that the fastest solution is not necessarily the most energy-efficient one.

  • Size preservation can become a desirable feature, when the multimedia data have already been encoded. In this case, one may want that (ideally) each syntax element keeps its original size. Such a feature is useful to avoid bitrate re-control, and to support some interesting functionalities, such as on-the-fly encryption/decryption and concurrent encryption/decryption of different parts of a single file. Size preservation is a non-trivial issue for multimedia encryption due to the use of variable-length coding (VLC) in most multimedia coding systems.

  • Parallelism is possible for some selective encryption schemes of encoded multimedia data. For instance, sign bit encryption can run in parallel for different random access units (for example, frames or slices). The structure of a specific multimedia coding system may allow different kinds of parallelism at different levels, and the added encryption may enhance or compromise the parallelism.

As a whole, an ideal JMEE system should have the following features:

  • high security, meaning that no perceptual information is leaked from the encrypted part;

  • low computational complexity, which corresponds to fast encryption/decryption and high throughput;

  • high energy efficiency or low energy consumption;

  • strict format compliance, meaning that every syntax element can be decoded without access to the key;

  • strict size preservation, that is, every syntax element preserves its original size;

  • high parallelism or high throughput on multicore and distributed platforms; and

  • other desirable features, such as supporting perceptual encryption, working with any cipher, and so forth.

Many multimedia encryption schemes are designed following the idea of selective encryption. Although many modern ciphers (like AES and RC4) are fast enough to handle multimedia data, it is still desirable to avoid processing the whole bulk of data. Even if the reduction in computational complexity is considered insignificant, the saved energy may still be meaningful for power-constrained devices, such as sensor nodes. For instance, a 10% reduction in computational complexity is not much, but the saved energy can extend the lifetime of a power-constrained device from 9 days to 10 days. Selective encryption is especially important when the input plaintext is already compressed. In this case, there is no full encoding process, but only a partial decoding process followed by encryption and possibly by a partial re-encoding process. This partial decoding and encoding process can be significantly simpler than the whole coding process, and thus selective encryption can help reduce the total computational complexity and energy consumption dramatically. For instance, for sign bit encryption of MPEG videos, the partial decoding process does not involve motion compensation, the inverse DCT, and some other time-consuming steps. In addition, the number of sign bits is significantly smaller than the total number of bits. Therefore, selective encryption is much more efficient than naive full encryption in terms of computational complexity and energy efficiency.
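The lifetime figure above follows from simple proportionality, assuming a fixed battery budget and energy consumption proportional to computational load:

```python
# Back-of-envelope check: with a fixed energy budget, lifetime scales
# inversely with consumption, so a 10% reduction stretches 9 days to 10.
baseline_days = 9.0
reduction = 0.10
new_days = baseline_days / (1.0 - reduction)
assert abs(new_days - 10.0) < 1e-9
```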

Since multiple performance factors have to be considered, it is a challenging task to design a multimedia encryption scheme fulfilling all the desired features. As a matter of fact, recent research has shown trade-offs among different performance factors for all the available multimedia encryption methods. Some of these trade-offs are listed below.

  • Encryption of variable-length codewords or their indices can maintain format compliance, but normally cannot preserve the size. Leaving variable-length codewords unencrypted can preserve the size, but may not be able to offer a sufficient level of security.

  • Secret permutations can be easily implemented to ensure format compliance and size preservation, but they are insecure against known/chosen-plaintext attacks [27] and even ciphertext-only attacks [22].

  • FLC encryption can maintain format compliance and size preservation easily, but it is unable to provide a very high security level against error-concealment attacks (ECA) [28].

  • VLC index encryption can ensure format compliance and security when used properly, but it has an uncontrollable influence on compression efficiency and cannot maintain size preservation. By constraining the index encryption to a small number of VLCs, the influence on compression efficiency becomes less problematic, but the security is compromised to some extent [29].

  • Secret entropy coding is unable to keep format compliance and size preservation. In addition, most secret entropy coders are either insecure against chosen-plaintext attacks [30], [31], [32] or unable to provide a better performance than simple naive encryption (that is, encryption after compression) [32].

  • “Virtual full encryption” and header encryption cannot ensure format compliance for some multimedia coding standards, mainly due to the dependence of the encrypted syntax elements on the context.

  • Most format-compliant selective encryption schemes cannot conceal all perceptual information, thus they cannot be used for some applications requiring a high level of perceptual security [33], [34], [35], [36], [37], [38].

  • High parallelism can be achieved by some size-preserving JMEE schemes, which are unable to offer a high level of perceptual security because some syntax elements are left unencrypted.

Most of these trade-offs are known to have their roots in the underlying multimedia coding process: independent coding of different syntax elements, scattering of visual information over different syntax elements, visual information contained in the bit sizes of syntax elements, embedded error-concealment mechanisms, and so on. When the underlying multimedia coding standard cannot be changed, the problem of multimedia encryption becomes how to achieve a better trade-off among all the performance factors for the target application.

14.4 Perceptual Multimedia Encryption

As mentioned above, perceptual encryption is one of the key reasons for joint encryption-encoding. As a major branch of multimedia encryption, perceptual encryption can be useful in many applications, such as pay-per-view and video on demand (VoD) services. When a multimedia application involves multiple parties and not all of them are trusted, perceptual encryption also allows the untrusted parties to perform some postprocessing on encrypted multimedia data. For instance, in the example discussed in the previous section, digital watermarking may be performed by a third party without access to the decryption key, where the watermark can be embedded into the unencrypted part without influencing the encrypted part but still remain detectable at the user end.

The concept of perceptual encryption was introduced in Reference [39], where the authors used the term “transparent encryption” in the context of digital TV broadcasting. Some other researchers later used the term “perceptual encryption” because “transparent encryption” may be confused with another term “format-compliant encryption” (meaning encryption is transparent to encoding and vice versa). This chapter adopts the term “perceptual encryption.” In the literature, there are different perceptual encryption proposals for audio, speech, image, and video. This chapter focuses on perceptual encryption for digital images and videos only.

14.4.1 Definition and Performance Evaluation

Perceptual encryption is selective multimedia encryption in which only a specific amount of perceptual information is encrypted. Equivalently, perceptual encryption can be seen as intentional quality degradation by means of encryption, where the degraded quality can be restored with the decryption key. The nature of perceptual encryption leaves part of the data untouched and allows a decoder to recover the unencrypted perceptual information without access to the decryption key. Ideally, a fully scalable perceptual encryption scheme should allow the perceptual quality to be downgraded under the control of a continuous quality factor q ∈ [0, 1]. When q = 1, the maximum amount of perceptual information is encrypted and the perceptual quality becomes minimal; when q = 0, the minimum amount of perceptual information is encrypted and the perceptual quality is maximal; when 0 < q < 1, part of the perceptual information is encrypted and the perceptual quality is downgraded accordingly. Figure 14.6 shows an illustrative view of a perceptual encryption scheme.
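As a sketch of how a continuous factor q can drive selective encryption (a hypothetical design for illustration, not a published scheme), the encipher can select a key-dependent subset of syntax elements whose size grows monotonically with q:

```python
import math
import random

def select_for_encryption(elements, q, key):
    """Pick a key-dependent subset of syntax-element indices whose size
    grows monotonically with the quality-control factor q in [0, 1].
    random.Random stands in for a keyed stream cipher (an assumption)."""
    assert 0.0 <= q <= 1.0
    n = math.ceil(q * len(elements))
    rng = random.Random(key)
    idx = list(range(len(elements)))
    rng.shuffle(idx)
    return sorted(idx[:n])   # indices of the elements to encrypt

coeffs = list(range(64))     # e.g., one 8x8 block of DCT coefficients
assert select_for_encryption(coeffs, 0.0, key=1) == []       # nothing encrypted
assert len(select_for_encryption(coeffs, 0.5, key=1)) == 32  # half encrypted
assert len(select_for_encryption(coeffs, 1.0, key=1)) == 64  # all encrypted
```

Monotonicity of the encrypted fraction in q is guaranteed by construction; whether the resulting *perceptual* degradation is monotonic depends on which syntax elements are selected, as discussed below.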


FIGURE 14.6
A perceptual encryption scheme.

Given the above definition of perceptual encryption, the following facts about the control factor q should be noted:

  • Despite extensive research on visual quality assessment (VQA), there still does not exist a well-accepted objective measure of visual quality that can replace an average human observer’s subjective quality assessment. Therefore, the control factor q is generally chosen to represent a rough measure of the quality degradation.

  • While the relationship between the control factor q and the quality degradation cannot generally be made linear (due to the lack of a proper VQA metric), monotonicity should be maintained, meaning that a larger q should statistically correspond to a higher level of quality degradation. This implies that a selective format-compliant encryption scheme is not necessarily a perceptual encryption scheme unless the selective encryption can downgrade the perceptual quality in a monotonic manner under the control of a factor q.

  • In most cases, the control factor q is defined as a parameter of the encryption scheme that has a direct relationship with the visual quality of the encrypted image/video, for example, the percentage of syntax elements encrypted.

  • For digital videos, the quality degradations of different frames may be different, so the control factor represents the average quality of all frames.

  • Depending on the application, the control factor may have different levels of granularity, so there may be only a limited number of possible values of q.

  • Multiple control factors may be used to have a finer control of different aspects of visual quality degradation.

As in general multimedia encryption schemes, there are multiple performance factors to be considered when evaluating the performance of a perceptual encryption scheme. Among all the performance factors, the main attention here is on security, since the other factors are not unique to perceptual encryption. Two unique features make the security of perceptual encryption schemes essentially different from that of other kinds of multimedia encryption schemes: i) the existence of two separate sets of data (one encrypted and one unencrypted) containing perceptual information; and ii) the fact that the unencrypted data are decodable by an attacker who does not know the key. The need to maintain some level of perceptual quality immediately leads to two different levels of security that should be considered:

  • cryptographical security – measurement of security against cryptographical attacks that try to recover the plaintext of the encrypted data;

  • perceptual security – measurement of security against perceptual attacks that try to recover as much perceptual information as possible out of the unencrypted data without any attempt of breaking the underlying encryption algorithm.

Although most perceptual encryption schemes are based on cryptographically strong ciphers, cryptographical security may not be guaranteed, because the interface between the cipher and the image/video coding system may create new insecure points at which the cipher fails. Perceptual security is a concept unique to perceptual encryption and covers perceptual attacks that recover perceptual information from the encrypted data by noncryptographical means. Perceptual attacks are by definition indifferent to the details of how the encryption algorithm works, which means that they are more general and can also be generalized to applications where some data are missing, corrupted, or distorted for other reasons (for example, transmission errors and noise, and/or intended or unintended data removal, truncation, and manipulation).

In the following, a brief survey of perceptual encryption schemes for digital images and videos is provided, then the two levels of security are separately discussed with greater detail.

14.4.2 Perceptual Encryption Schemes

This subsection does not attempt to give a complete coverage of all schemes in the literature; it aims at highlighting the main relevant techniques and some representative schemes that best demonstrate interesting features of perceptual encryption schemes in general. Most perceptual encryption schemes are conceptually simple and straightforward, and the main technical question is what data should be selected for encryption given a specific image or video coding standard as the carrier of the multimedia data. Despite the conceptual simplicity of many perceptual encryption schemes, their performance evaluation, especially security analysis, is nontrivial, as will be shown later. Throughout this subsection, it is assumed that perceptual encryption is implemented by adopting a cryptographically strong cipher like AES or RC4 unless otherwise stated.

14.4.2.1 Scalability-Based Encryption

Some image and video coding standards, such as JPEG2000 [7], MPEG-2/4 [40], [41] and MPEG SVC [42], have inherent support for scalable coding, meaning that more than one quality layer can be included in a single image or video bitstream. As a consequence, it is natural to realize perceptual encryption by encrypting selected quality layers while leaving others untouched [43]. Perceptual encryption of this kind is also called scalable encryption or multilevel encryption due to the existence of several quality layers. The main problem with this setting is the mandatory requirement of scalable coding, which may not be supported by some image and video formats. The limited quality granularity caused by the small number of quality layers is another problem.

14.4.2.2 DCT-Based Image and Video Encryption

Discrete cosine transform (DCT) is the most widely used transform in image and video coding standards [9], [10], [40], [41], [44], [45],3 so many researchers have proposed perceptual encryption schemes based on selective encryption of DCT coefficients and their bit planes. There are several approaches of achieving perceptual encryption in DCT domain [28], [46], [47], [48], [49], [50], [51]: DC encryption, AC encryption, sign bit encryption, bit-plane encryption, and hybrid encryption (for example, DC encryption plus sign bit encryption). Note that DCT encryption can implement scalable encryption because different DCT coefficients of a block represent visual details of different frequency components.

In most DCT-based image and video coding standards, DC coefficients and sign bits are coded differently from AC coefficients and other bit planes, so although DC encryption is conceptually no different from AC encryption, the overall performance of the resulting perceptual encryption scheme can be very different. For instance, in the JPEG and MPEG-1/2/4 standards, DC coefficients and sign bits are (partly or completely) coded as fixed-length codewords, which means that DC and sign bit encryption can achieve strict size preservation and maintain compression efficiency. AC encryption and bit-plane encryption are relatively complicated because they involve run-length coding and variable-length coding (VLC). One simple way of circumventing the VLC problem is not to encrypt the value of each AC coefficient or bit plane, but to encrypt their positions by shuffling all encrypted elements. The shuffling process can be driven by a stream cipher. Another solution is to encrypt selected AC coefficients and bit planes before VLC takes place, which unfortunately will influence compression efficiency in an uncontrollable manner.
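A minimal sketch of such keystream-driven shuffling follows, with Python's seeded `random.Random` standing in for a real stream cipher (an assumption for illustration; a deployed scheme would derive the permutation from, e.g., RC4 or AES-CTR output):

```python
import random

def keyed_permutation(n, key):
    """Derive a permutation of n positions from a keyed PRNG
    (Fisher-Yates shuffle under the hood)."""
    rng = random.Random(key)
    perm = list(range(n))
    rng.shuffle(perm)
    return perm

def apply_perm(values, perm):
    """Place values[perm[i]] at output position i."""
    return [values[i] for i in perm]

def invert_perm(perm):
    """Compute the inverse permutation for decryption."""
    inv = [0] * len(perm)
    for pos, src in enumerate(perm):
        inv[src] = pos
    return inv

ac = [31, -5, 7, 0, 2, -1, 0, 4]   # e.g., the first eight AC coefficients of a block
perm = keyed_permutation(len(ac), key=0xC0FFEE)
scrambled = apply_perm(ac, perm)
assert apply_perm(scrambled, invert_perm(perm)) == ac  # the key holder restores the block
```

Note that the coefficient *values* are unchanged, which is exactly what makes this construction vulnerable to the sorting attack discussed in Section 14.4.3.2.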

In DCT-based perceptual encryption, the quality factor q is usually set as the percentage of targeted syntax elements (DC/AC/sign bits/bit planes) encrypted. Perceptual encryption operations can be performed immediately after the DCT, but only some can work at the bitstream level. For the former case, the compression efficiency will normally be compromised, because encryption can increase the information entropy of the encrypted syntax elements. Therefore, the quality control factor also becomes a control factor of the compromised compression efficiency. For the latter case, encryption cannot be directly applied to syntax elements if format compliance is to be maintained, and often the only possible way of performing encryption is to randomly permute the syntax elements concerned. Depending on how the underlying image and video coding standard is designed, only some syntax elements can be permuted at the bitstream level. For instance, for JPEG and MPEG, permutations of AC coefficients are possible by shuffling the VLC codewords, each of which represents a run of zero AC coefficients followed by a nonzero AC coefficient.

14.4.2.3 DWT-Based Image and Video Encryption

As the second most widely used transform in image and video coding [7], [52], [53], discrete wavelet transform (DWT) has also attracted significant attention from researchers working on perceptual encryption [54], [55], [56], [57], [58], [59]. The perceptual encryption settings for DCT can be largely generalized to DWT, but different DWT decomposition levels and high-/low-frequency subbands replace DC and AC coefficients in the DCT case.

3Note that some video coding standards use an integer approximation of the DCT, so strictly speaking they are based not on the DCT but on DCT-like transforms. However, such a subtle difference does not influence how perceptual encryption and cryptanalysis work, and it is therefore ignored in this chapter.

Most DWT-based image and video coding schemes follow an incremental coding principle; that is, the most significant bits are encoded first and less significant bits come later. This means perceptual encryption can be naturally realized by gradually encrypting the encoded bitstream from the beginning of each independently coded unit. In addition, DWT has a pyramid structure that allows performing random permutations at different levels and on different subbands to achieve different degrees of quality degradation.
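The gradual (prefix) encryption idea above can be sketched as follows, assuming each independently coded unit is a byte string and XOR with a keystream stands in for the cipher (a simplification; the names and layout here are illustrative, not from a specific standard):

```python
import os

def encrypt_prefix(unit: bytes, q: float, keystream: bytes) -> bytes:
    """Since incrementally coded bitstreams carry the most significant
    bits first, XOR-encrypting the first fraction q of each independently
    coded unit degrades quality progressively with q."""
    n = int(q * len(unit))
    head = bytes(b ^ k for b, k in zip(unit[:n], keystream))
    return head + unit[n:]

unit = bytes(range(32))        # one hypothetical coded unit
ks = os.urandom(32)            # keystream (from a real stream cipher in practice)
c = encrypt_prefix(unit, 0.5, ks)
assert c[16:] == unit[16:]                 # tail (less significant bits) left in the clear
assert encrypt_prefix(c, 0.5, ks) == unit  # XOR with the same keystream decrypts
```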

For perceptual encryption schemes working at the encoded bitstream level of JPEG2000 images/videos, a special issue has to be carefully handled to ensure format compliance of ciphertexts: the encrypted bitstream must not create fake termination markers, that is, two-byte words in the range 0xff90 to 0xffff [60], [61].

14.4.2.4 Motion Vector Scrambling

For digital videos, scrambling motion vectors can create lightweight perceptual encryption schemes controlling temporal visual quality only [28], [62]. Since intra-coded video frames do not contain any motion vectors, this approach cannot have any quality control over these frames. As a result, motion vector scrambling should be used as a supplemental option for further enhancing the performance of perceptual encryption schemes based on other techniques.

14.4.2.5 Fixed-Length Codeword Encryption

In most image and video coding standards, there are syntax elements coded as fixed-length codewords (FLCs). Taking MPEG-1/2/4 videos as an example, the following syntax elements are FLCs: sign bits of nonzero DCT coefficients, (differential) DC coefficients in intra blocks, ESCAPE DCT coefficients (DCT coefficients out of the scope of Huffman coding), and sign bits and residuals of motion vectors. By selectively encrypting some FLCs in an image/video, one can achieve strict size-preservation and format compliance at the same time. The size-preservation will further allow on-the-fly encryption and concurrent encryption, as pointed out in Section 14.3. A typical perceptual encryption based on FLC encryption is the so-called perceptual video encryption algorithm (PVEA) [28], which is designed for MPEG videos. A unique feature of PVEA is the use of three independent quality-control factors for the low-resolution rough (spatial) view, the high-resolution (spatial) details, and the (temporal) motions.
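A highly simplified sketch of sign bit encryption follows (not PVEA itself: coefficient arrays replace real bitstream parsing, and a seeded PRNG stands in for the keystream). Because only sign bits are touched, codeword lengths, and hence the bitstream size, are preserved:

```python
import random

def encrypt_sign_bits(coeffs, q, key):
    """Flip the sign of a nonzero coefficient wherever the keystream bit
    is 1, for a fraction q of the nonzero coefficients. Magnitudes and
    codeword lengths are untouched, so size is strictly preserved."""
    rng = random.Random(key)   # stand-in for a keyed stream cipher
    out = list(coeffs)
    nonzero = [i for i, c in enumerate(coeffs) if c != 0]
    for i in nonzero[: int(q * len(nonzero))]:
        if rng.getrandbits(1):
            out[i] = -out[i]
    return out

block = [31, -5, 0, 7, 0, -2, 4, 0]
enc = encrypt_sign_bits(block, 1.0, key=42)
assert [abs(x) for x in enc] == [abs(x) for x in block]  # magnitudes preserved
assert encrypt_sign_bits(enc, 1.0, key=42) == block      # flipping again decrypts
```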

14.4.2.6 Fractal Coding-Based Encryption

Fractal coding was developed as an image compression technique [63]. The basic idea is to represent an image as a set of transformations associated with an iterated function system, which converges to a so-called fixed point (normally a fractal set). In the case of fractal image coding, the fixed point is an approximation of the original image. Since fractal coding represents an image as a set of transforms, it is possible to encrypt part of the parameters of the transforms to achieve perceptual encryption, as proposed in Reference [64].

14.4.3 Cryptographical Security

The following discusses cryptographical security from the perspective of different attacks. If one can break the cipher used to encrypt the selected part of the image and video data, then the whole perceptual encryption scheme collapses. While normally a cryptographically strong cipher like AES or RC4 is used to guarantee security, the way such a cipher is applied to image and video data can make a difference. This becomes more obvious when the cipher has to be applied in a way different from its original design. For instance, some perceptual encryption schemes are based on random permutations of selected syntax elements of an image/video bitstream. A stream cipher is usually used to produce a pseudo-random permutation map, which is then applied to achieve the effect of encryption. Here, the use of the stream cipher differs from its normal usage in cryptography, where the keystream is used to mask the plaintext (normally via XOR or modular addition) rather than to shuffle the positions of plaintext symbols.

14.4.3.1 Brute-Force Attacks

The simplest way of breaking a cryptosystem is to exhaustively try all possible plaintexts/keys until the correct one is found. The feasibility of this attack can be estimated from the size of the plaintext/key space.

For the key space, security against brute-force attacks can be ensured by using a cryptographically strong cipher, which always has a sufficiently large key space. The plaintext space is more complicated because it depends on the number of bits in the encrypted part of the image/video bitstream. Here, one has to consider each independently encrypted unit rather than the whole encrypted data, because independent units can be guessed separately. Although it is possible to make all encrypted bits dependent on each other, for example, by using a mode with ciphertext feedback like CBC, those bits may still be guessed separately because of their semantic meanings. A typical example is DC encryption of DCT-transformed images and videos. One can make the encryption results of all DC coefficients dependent on each other; however, the attacker only needs to break a small number of them to reveal part of the image. Since part of an image may already contain sensitive perceptual information (for instance, a person’s face), this partial brute-force attack can be considered successful even though it is not successful in a cryptographical sense. This can be a serious problem when the quality control factor q is small, meaning that the percentage of encrypted syntax elements is low and thus the average distance between consecutively encrypted syntax elements is large. In other words, this fact creates a lower bound on the quality control factor q. For a more detailed discussion of this subtle issue in an ad hoc design of perceptual encryption, readers are referred to Reference [28, Sec. III.B]. Note that this fact also blurs the boundary between cryptographical security and perceptual security, because one now has to ask what the consequence of revealing part of the encrypted data is, regardless of what encryption algorithm is actually used.
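The gap between guessing independently encrypted units jointly and separately can be quantified with a quick calculation (hypothetical numbers):

```python
# Guessing independently encrypted units separately is exponentially
# cheaper than guessing them jointly.
bits_per_unit, units = 8, 100              # hypothetical: 100 units of 8 bits each
joint = 2 ** (bits_per_unit * units)       # brute-forcing all units at once: 2^800
separate = units * 2 ** bits_per_unit      # guessing each unit on its own: 100 * 256
assert separate == 25600
assert separate < joint
```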

Another issue related to brute-force attacks is how to judge whether a guessed key is correct. This should not be done by manually checking the perceptual quality of the recovered plaintext unless only a very small number of manual checks (say, hundreds) is required. To automate the checking process, one needs some criteria. For image and video perceptual encryption, this can naturally be done by using a no-reference image/video quality assessment tool, which estimates the perceptual quality of each candidate plaintext without access to the original. For instance, DC encryption will create discontinuities along block boundaries, so the level of such blockiness can be used as a measure of how good a guessed key is. The key giving the minimum level of blockiness can be selected as the correct key.
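A crude blockiness score of the kind described might be computed as follows (a sketch; a real attack would use a more robust no-reference quality metric):

```python
def blockiness(img, block=8):
    """Average absolute luminance jump across block boundaries of a 2D
    pixel array. A brute-force attacker ranks candidate keys by this
    score and takes the candidate minimizing it as correct."""
    h, w = len(img), len(img[0])
    total = count = 0
    for y in range(h):
        for x in range(block, w, block):   # vertical block boundaries
            total += abs(img[y][x] - img[y][x - 1])
            count += 1
    for y in range(block, h, block):       # horizontal block boundaries
        for x in range(w):
            total += abs(img[y][x] - img[y - 1][x])
            count += 1
    return total / count

smooth = [[10] * 16 for _ in range(16)]            # flat image: no block edges
blocky = [[0] * 8 + [200] * 8 for _ in range(16)]  # sharp jump at x = 8
assert blockiness(smooth) == 0
assert blockiness(blocky) > blockiness(smooth)
```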


FIGURE 14.7
The means and variances of coefficients of 8 × 8 DCT on a 256 × 256 test image Cameraman.

14.4.3.2 Ciphertext-Only Attacks

Since it is assumed here that a cryptographically strong cipher is used for encryption, the target of a ciphertext-only attack in this context is not the key but the plaintext of the encrypted data. Being cryptographically strong, the underlying cipher should not allow an attacker to launch a successful ciphertext-only attack. However, in perceptual encryption there is context that is not encrypted, namely, what the encrypted data represent semantically. This may give the attacker a chance. This is the case when random permutations (driven by a stream cipher) are applied to DCT coefficients. Consider a perceptual encryption scheme that randomly permutes the eight most significant DCT coefficients, that is, the DC coefficient and the first seven AC coefficients. The nature of the DCT as a suboptimal decorrelation transform implies that both the mean and the variance of the DCT coefficients drop from the low-frequency to the high-frequency bands (see Figure 14.7). As a consequence, if the permuted DCT coefficients are sorted according to their amplitudes, the original order of the eight encrypted DCT coefficients will be recovered with a non-negligible probability, thus recovering the image/video with a better overall quality. Figure 14.8 shows the result when such an attack is applied to the test image Cameraman. One can see that the visual quality of Figure 14.8b is obviously much better than that of the ciphertext shown in Figure 14.8a. A similar attack on DCT encryption of MPEG videos can be found in Reference [65, Sec. 3.4]. There are also ciphertext-only attacks that are completely independent of how encryption is implemented, but those attacks concern perceptual security and will be discussed in the next subsection.


FIGURE 14.8
(a) The ciphertext of randomly permuting the first eight DCT coefficients in each block. (b) The result of the ciphertext-only attack by sorting the first eight DCT coefficients according to their amplitudes.
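Assuming the typical decay of coefficient magnitudes with frequency, the sorting attack itself is a one-liner; the decayed coefficient values in the usage note below are made up for illustration:

```python
def amplitude_sort_attack(permuted):
    """Undo an unknown permutation of the eight most significant DCT
    coefficients of a block by assuming their magnitudes decrease
    with frequency. This holds only statistically for natural images,
    so the attack succeeds on average rather than for every block."""
    return sorted(permuted, key=abs, reverse=True)
```

For a block whose true coefficients decay in magnitude, sorting any permutation of them by absolute value restores the original order.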

14.4.3.3 Known/Chosen-Plaintext Attacks and Chosen-Ciphertext Attacks

In these attacking scenarios, the target is the key, since the plaintexts are already known. Under the assumption that the underlying cipher is cryptographically strong, such attacks are largely impossible. However, due to the special structure of stream ciphers, the key cannot be reused; otherwise the keystream could be extracted by comparing a plaintext with its ciphertext. The common practice is to change the key for each image/video frame or for the minimum encryption unit (for example, a slice of an MPEG-1/2/4 video frame). This can be done by making the key dependent on the contents of the plaintext. One approach is to combine a long-term master key with the hash value of the image/video (or the hash value of the concatenation of both). In this case, the hash value should also be transmitted with the image/video, but it does not need to be kept secret, so it can be encoded in the user data field of the image/video. When the employed image and video coding format does not support user data, one can consider embedding the hash value into the image/video as a watermark. Since the watermark does not need to be secret but must be embedded losslessly, it is also acceptable to make the watermark visible at one corner of the encrypted image/video, like a logo in a broadcast TV program.
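One possible sketch of such a content-dependent key derivation, using SHA-256 from Python's standard library; the exact construction here (hash of master key concatenated with the content hash) is an assumption for illustration, not one prescribed by any specific scheme:

```python
import hashlib

def frame_key(master_key: bytes, frame_data: bytes) -> bytes:
    """Derive a per-frame stream cipher key from a long-term master
    key and the frame content. The content hash can travel in the
    clear (e.g., in a user data field), so the receiver, who holds
    the master key, can rebuild the same per-frame key."""
    content_hash = hashlib.sha256(frame_data).digest()
    return hashlib.sha256(master_key + content_hash).digest()
```

Because the derivation is deterministic, both ends compute identical keys, while any change in frame content yields an unrelated key, preventing keystream reuse across frames.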

14.4.4 Perceptual Security

For perceptual security, one does not care about any detail of the encryption algorithm, but about how to recover a better-quality approximation of the plaintext from the unencrypted data. This means that all attacks in this category fit into the class of ciphertext-only attacks.

From an attacker’s point of view, the problem to be solved can be described as follows:

  • What is available – the unencrypted visual data and the known semantics of such data (for example, the encrypted data are all DC coefficients or sign bits).

  • What is a successful attack – if one can recover the plaintext image/video with a quality (statistically) better than that of the ciphertext (that is, the one defined by the quality control factor q).

  • What are the constraints – both the time complexity and the space complexity should be sufficiently low to make the attack practical even for relatively large images and videos.

The above points are self-explanatory, but the last one needs a bit more discussion. How high a time complexity, that is, how slow an attack, can be tolerated depends on the application. If the attacker wants to crack a live broadcast program, the attack has to be fast enough to keep up with the frame rate, meaning that it should finish within O(10) milliseconds. If the attacker just wants to crack what is stored on a hard disk or a DVD disc, then it is possible to run the attack for hours or even days. When an attacker has access to a botnet or a supercomputer, the above requirement can be relaxed by a factor on the order of 10^6. Note that an attack running in polynomial time (in the so-called P complexity class) may not be practical at all if either the hidden constant factor or the fixed exponent is too large. The requirement on the space complexity is normally tighter because the storage available to an attacker is largely fixed and does not grow over time.

All attacks in the perceptual security category follow the same basic approach; namely, replace the encrypted data by an estimate from the available unencrypted data/information and then run the decoder to get an approximate edition of the plaintext image/video. The more accurate the estimate is, the better the recovered quality and the more successful the attack will be. Since this approach is very similar to error concealment in multimedia coding, these attacks are collectively called error-concealment attacks (ECAs) [34]. Some researchers also use the term concealment attacks or replacement attacks [59].

Error-concealment attacks have been so far the most successful attacks on perceptual encryption and some very promising results have been obtained in recent years on DCT-based encryption. It is also likely that these attacks can be easily generalized to other transforms, including DWT. Some error-concealment attacks also have more profound implications on image and video coding (compression) in general. Such attacks are discussed in more detail in the next section.

Since the performance of an error-concealment attack has to be judged against the quality defined by the ciphertext, measuring the perceptual information for comparative purposes becomes an important task. The gold standard of perceptual information measurement is subjective evaluation by human experts. However, subjective quality assessment is very time-consuming and costly, so researchers have proposed various objective visual quality assessment (VQA) metrics for digital images and videos [66], [67]. Traditionally, the image and video coding community has been using a simple metric called the peak signal-to-noise ratio (PSNR). For an M × N image or video frame I* = {I*(i, j)} and its distorted version I = {I(i, j)}, where 1 ≤ i ≤ M and 1 ≤ j ≤ N, PSNR is defined as follows:

PSNR = 10 log10 ( I_max^2 / [ Σ_{i=1}^{M} Σ_{j=1}^{N} (I(i, j) − I*(i, j))^2 / (MN) ] ), (14.5)

where the denominator, called the mean square error (MSE), is also widely used in objective quality assessment. Unfortunately, neither PSNR nor MSE matches subjective quality very well [67, Figure 1.1], so more advanced VQA metrics, such as the structural similarity index (SSIM) [68], visual information fidelity (VIF) [69], and visual signal-to-noise ratio (VSNR) [70], have been proposed. Although VQA metrics are commonly tested on various test databases to demonstrate their good correlation with subjective data collected from human observers, a recent comprehensive comparative study has shown that none of the existing VQA metrics consistently outperforms all others [71].
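Equation 14.5 translates directly into code; a minimal sketch for grayscale images stored as nested lists, with the eight-bit peak value as the default:

```python
import math

def psnr(reference, distorted, peak=255.0):
    """PSNR of Equation 14.5 for two equally sized grayscale images.
    The denominator is the mean square error (MSE); identical images
    give an infinite PSNR."""
    diffs = [(r - d) ** 2
             for ref_row, dis_row in zip(reference, distorted)
             for r, d in zip(ref_row, dis_row)]
    mse = sum(diffs) / len(diffs)
    return float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse)
```

For example, two images that differ by 16 at every pixel have MSE = 256 and thus PSNR = 10 log10(255² / 256) ≈ 24.05 dB.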

Some encryption-oriented VQA metrics have also been proposed to obtain more accurate measurements on encrypted images and videos. Such metrics include the luminance similarity score (LSS) and edge similarity score (ESS) [72], neighborhood similarity [73], similarity of local entropy [74], and a number of metrics based on image histogram and radial Fourier spectrum [75]. Basically speaking, these encryption-oriented VQA metrics focus more on a specific aspect of the overall visual quality that is supposed to be influenced by perceptual encryption in a more scalable way (meaning that no abrupt change will happen as the quality factor q changes slightly). The main drawback of all the existing encryption-oriented VQA metrics is the lack of performance evaluation against subjective quality. This makes it difficult to judge whether there exist the best metric and if so, which one it is.

14.5 Error-Concealment Attacks

This section surveys all error-concealment attacks currently known. First, it discusses the so-called naive ECA, which involves only the replacement of encrypted data by a fixed value. Then, it introduces different kinds of advanced ECAs that try to obtain a better estimate of the encrypted data by looking at the semantics/statistics of the encrypted data and their interplay with the unencrypted data. The section ends with the most promising ECA, reported very recently, which represents a universal approach to breaking many perceptual encryption schemes with globally optimized results.

14.5.1 Naive Attacks

The naive ECA is the simplest ECA in the sense that it just replaces all encrypted syntax elements of the same kind by a fixed value according to their semantic meanings. This can be most easily explained by how such an attack works on DC encryption of DCT-transformed images. While one has no idea what the DC coefficients are, it is known that they have a range determined by the valid range of pixel values. For an N × N DCT and eight-bit grayscale images, this range is [0, 255N]. By setting all DC coefficients to any value in this valid range, for example, 0 or 255N/2, and then scaling the dynamic range of the resultant image to [0, 255], an image with an overall quality better than the ciphertext can be produced. Figure 14.9 shows the result on the test image Cameraman. One can see that the image obtained via the naive ECA obviously has a higher visual quality than the ciphertext image. Using PSNR as the objective VQA metric, the same conclusion can be reached: the ciphertext image has PSNR = 11.5883 dB, whereas the image recovered from this ciphertext by the naive ECA has PSNR = 13.0428 dB. The naive ECA is the most widely acknowledged attack: most perceptual encryption schemes were tested against it to validate their perceptual security.


FIGURE 14.9
(a) The ciphertext image of Figure 14.7a when all DC coefficients are encrypted. (b) The image recovered from the ciphertext image by setting all encrypted DC coefficients to zeros.
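The replace-and-rescale step of the naive ECA can be sketched as follows; the function below shows only the final range-scaling stage and assumes the image has already been decoded with all encrypted DC coefficients set to a fixed value:

```python
def rescale(pixels, lo=0.0, hi=255.0):
    """Linearly map the dynamic range of a decoded DC-free image
    (given as a flat pixel list, possibly with out-of-range values)
    onto the valid range [lo, hi] -- the final step of the naive ECA."""
    pmin, pmax = min(pixels), max(pixels)
    if pmax == pmin:
        return [lo] * len(pixels)
    scale = (hi - lo) / (pmax - pmin)
    return [lo + (p - pmin) * scale for p in pixels]
```

This preserves relative brightness within the image, which is why the result is recognizable even though the absolute DC values remain unknown.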

14.5.2 Local Constraints-Based Attacks

In the naive ECA, all encrypted syntax elements of the same kind are replaced by the same fixed value. Such a global replacement is not necessarily (and generally is not) the best choice. By considering a small local window around each encrypted syntax element, some local constraints with a tighter range than the global range may be found. Once again, this can be best demonstrated by the case of DC encryption. Since the two-dimensional DCT is performed blockwise, the local window can naturally be selected as the block each encrypted DC coefficient lies in. In this case, Property 14.1 gives an interesting local constraint on the unknown DC coefficient defined by the known AC coefficients.

Property 14.1

Given an N × N block B, the pixel values calculated from its DC-free edition B^(0) (that is, the inverse DCT result when the DC coefficient is set to 0) define a constraint on the DC coefficient:

N(t_min − min(B^(0))) ≤ DC(B) ≤ N(t_max − max(B^(0))), (14.6)

where [t_min, t_max] denotes the valid range of pixel values.

The local constraint defines a valid range for the DC coefficient of each block, which with high probability is a proper subset of the global valid range. Therefore, if the midpoint of this tighter range is used to replace the encrypted DC coefficient, the ECA should give results closer to the plaintexts, with a higher visual quality. Figure 14.10 shows the result of such an advanced ECA applied to the ciphertext in Figure 14.9a. Comparing Figure 14.10 with Figure 14.9b reveals that the visual quality is further improved, which is also reflected by a higher PSNR, that is, 13.8837 dB.
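Property 14.1 can be computed directly from the known AC coefficients. The sketch below assumes an orthonormal 2-D DCT-II (so the DC coefficient contributes DC/N to every pixel) and represents the known AC coefficients as a hypothetical dictionary mapping (u, v) to values:

```python
import math

def dc_bounds(ac, N=8, tmin=0.0, tmax=255.0):
    """Valid DC range of Property 14.1 for one N x N block whose AC
    coefficients are known. `ac` maps (u, v) -> coefficient value,
    with (0, 0) excluded. Returns (lower, upper) for DC(B)."""
    def basis(u, x):
        c = math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N)
        return c * math.cos((2 * x + 1) * u * math.pi / (2 * N))
    # DC-free inverse DCT: pixel values of the block with DC set to 0.
    b0 = [sum(val * basis(u, x) * basis(v, y)
              for (u, v), val in ac.items())
          for x in range(N) for y in range(N)]
    return N * (tmin - min(b0)), N * (tmax - max(b0))
```

With no AC energy the bounds degenerate to the global range [0, 255N]; any nonzero AC coefficient tightens them, which is exactly what the midpoint-replacement attack exploits.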


FIGURE 14.10
The image recovered from the ciphertext image shown in Figure 14.9a by setting each encrypted DC coefficient to the midpoint of the valid range of that block. ©2011 IEEE


FIGURE 14.11
The results of attacks on the test image Cameraman. (a) The naive ECA by setting all encrypted DPSS coefficients to zeros, resulting in PSNR = 12.7071 dB. (b) The advanced ECA as reported in Reference [76], resulting in PSNR = 18.2721 dB.

This advanced ECA on DC coefficients can also be generalized to any encrypted DCT coefficients. When more than one DCT coefficient is encrypted in each block, the local constraints have to be calculated jointly, by checking all possible values of the encrypted DCT coefficients and keeping those values that do not produce invalid pixel values.

14.5.3 Prior Statistics-Based Attacks

If the attacker has access to an image database that represents the statistics of the plaintext image of interest, it is more beneficial to replace the encrypted syntax elements by statistically likely values (for example, the means of their prior distributions) rather than by the midpoints of the valid ranges. Such an advanced ECA was demonstrated in Reference [76] on an image encryption scheme based on two-dimensional discrete prolate spheroidal sequences (DPSS) [77]. This image encryption is not a real perceptual encryption scheme, but the basic idea behind the statistics-based ECA demonstrated in Reference [76] can be generalized to any perceptual encryption scheme. To be more exact, the advanced ECA exploits statistics of the ratios between the encrypted DPSS transform coefficients and the most significant unencrypted coefficient. These statistics are then used to derive a better estimate of each encrypted DPSS transform coefficient. Compared with the results of the naive ECA, the performance of the statistics-based ECA is significantly better, as shown in Figure 14.11.

14.5.4 DC Encryption Attacks Using Coefficient Recovery

All previously introduced ECAs are relatively simple, and the recovered image can hardly reach a PSNR value larger than 20 dB. In Reference [38], the authors proposed a method that can boost the performance of an advanced ECA to a level far beyond all previous ECAs. This attack was aimed at DC encryption of DCT-transformed images/videos. To facilitate the following discussion, this method is named here USO, after the initials of its three authors. The USO method is based on Property 14.1 of digital images and Property 14.2 of natural images.

Property 14.2

The difference between two neighboring pixels is a Laplacian variate with zero mean and small variance.

The two properties play different roles and work together in an iterative ECA. Namely, Property 14.2 allows the attacker to estimate the DC coefficient of a block relative to its neighboring blocks by minimizing the pixel differences along the block boundaries, and Property 14.1 imposes a tighter range on the DC coefficient of each block, so that the intersection of the ranges of all the relative DC coefficients gives a rough estimate of the global brightness of the image. Once the attacker has the global brightness and all the relative DC coefficients, approximations of the absolute values of the encrypted DC coefficients can be obtained immediately. Note that the estimation of the relative DC coefficients involves a scanning process, which starts from a corner block of the image and then continues through the whole image toward the opposite corner, deriving all the relative DC coefficients one after the other. It is obvious that such a scanning process suffers from error propagation. To mitigate this problem, Reference [38] proposed to perform four different scans starting from the four different corners and then average the results to obtain a less noisy estimate. After the whole process, the recovered plaintext image may still have pixel values outside the valid range, in which case a scaling or truncation operation is applied to remove the invalid pixel values. Figure 14.12 shows the result of the USO attack on the test image Cameraman. As can be seen, the perceptual quality of the recovered image is boosted considerably. The PSNR value becomes 19.6975 dB, much higher than those of the naive ECA and the simple advanced ECA described in Sections 14.5.1 and 14.5.2. For some other images, such as Lena, the PSNR can go beyond 20 dB.


FIGURE 14.12
The image recovered from the ciphertext image shown in Figure 14.9a by performing the USO attack, resulting in PSNR = 19.6975 dB.
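The scanning step, and the error propagation it suffers from, can be caricatured in one dimension. Everything below (the block size, the `anchor` parameter standing for the unknown global brightness, the ramp test signal) is an illustrative assumption rather than the actual 2-D USO procedure:

```python
def scan_recover(dc_free_blocks, anchor):
    """One-dimensional caricature of the USO scan: each block's
    unknown offset (its 'DC') is chosen so that its first pixel
    matches the last pixel of the previously recovered block,
    i.e., the boundary jump of Property 14.2 is minimized.
    `anchor` is the assumed value of the very first pixel (the
    global brightness unknown). Real USO works on 2-D blocks and
    averages four corner scans to tame error propagation."""
    recovered, prev_last = [], anchor
    for block in dc_free_blocks:
        offset = prev_last - block[0]  # minimizes the boundary jump
        rec = [p + offset for p in block]
        recovered.extend(rec)
        prev_last = rec[-1]
    return recovered
```

Running this on a ramp signal split into mean-removed blocks shows the estimation error growing by one per block boundary, which is exactly the propagation effect that four-corner averaging (and, later, the FRM method) mitigates.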

14.5.5 DC Encryption Attacks via Underflow/Overflow Rate Minimization

While the USO method generally works well, it sometimes suffers from a relatively low accuracy of the DC estimation when the error propagation is not under control, so that many pixel values fall outside the valid range. For the averaged result of the four scans shown in Figure 1 of Reference [78], the pixel values range over the interval [−88.6, 303.0]. After scaling this range to [0, 255], the finally recovered image has a low PSNR value of 14.3 dB.

To further reduce the error propagation effect, Reference [78] proposed to apply Property 14.1 during the relative DC estimation process rather than after it. To be more exact, during the scanning process, the underflowed and overflowed pixel values are checked for each block. Once such underflows and/or overflows happen, the whole block is shifted toward the valid range of pixel values until all underflowed and overflowed pixel values disappear. The modified DC estimation process makes the final result more sensitive to the global brightness, or equivalently the DC coefficient of the first block in each scan. Reference [78] also proposed an iterative process to get a more accurate estimate of the first DC coefficient, which is based on an interesting observation that roughly holds for various natural images: the more accurate the estimate of the first DC coefficient, the less frequently underflowed and overflowed pixel values appear in each block. In other words, the underflow/overflow rate is a unimodal function (with only one global minimum) of the estimate of the first DC coefficient. Based on this observation, the authors proposed to search all possible values of the first DC coefficient and take the one with the minimum underflow/overflow rate as the estimate. Since this rate is a unimodal function, the search can be done efficiently with a time complexity of O(M_B log2(N(t_max − t_min)/Δ)). This method was named underflow/overflow rate minimization (FRM).


FIGURE 14.13
The performance of the FRM method on recovering the test image Cameraman shown in Figure 14.11a, resulting in PSNR = 22.8142 dB.

Because of the iterative process of searching for the optimal value of the first DC coefficient, the FRM method is generally slower than the USO method, but it outperforms the USO method significantly according to experimental results on 200 test images. The authors measured the performance improvement with ten different VQA metrics, and the FRM method proved consistently better than the USO method. Figure 14.13 shows an example of using the FRM and USO methods to recover the test image Cameraman; it can be seen that the FRM method produces a result with an obviously better visual quality. For a detailed comparison between the USO and FRM methods, the readers are referred to Reference [78, Sec. 3.3].
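The logarithmic-time search that unimodality makes possible can be sketched as an integer ternary search; the quadratic toy function in the usage note stands in for the actual underflow/overflow rate, whose shape depends on the image and the scheme:

```python
def argmin_unimodal(f, lo, hi):
    """Find the integer minimizing a strictly unimodal function f on
    [lo, hi] with O(log(hi - lo)) evaluations -- the kind of search
    that makes the FRM estimate of the first DC coefficient cheap."""
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        if f(m1) < f(m2):
            hi = m2 - 1  # minimum lies strictly left of m2
        else:
            lo = m1 + 1  # minimum lies strictly right of m1
    return min(range(lo, hi + 1), key=f)
```

For example, `argmin_unimodal(lambda c: (c - 77) ** 2, 0, 255)` returns 77 without evaluating every candidate.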

14.5.6 Optimization-Based Approach to DCT Encryption Attacks

Both the USO and FRM methods can break DC encryption only. They cannot be easily generalized to more general cases, such as AC encryption or a combination of DC and AC encryption, because Properties 14.1 and 14.2 describe the relationship between one unknown coefficient and all other known coefficients. The question of whether such a generalized attack exists was answered affirmatively in Reference [79].

This time the whole ECA is modeled as a discrete optimization problem and solved as a linear program. For the ECA breaking DC encryption, the discrete optimization problem can be described as follows:

minimize f({x(i, j)}_{0 ≤ i, j ≤ N−1}) (14.7)


FIGURE 14.14
The image recovered from the ciphertext image shown in Figure 14.9a by performing the linear programming-based ECA, resulting in PSNR = 26.4866 dB. ©2011 IEEE

subject to x = A · y, x_min ≤ x(i, j) ≤ x_max, and y(k, l) = y*(k, l) for all AC coefficients. The terms x(i, j) and y(k, l) are variables denoting the value of pixel (i, j) and of the DCT coefficient (k, l), respectively. The matrix A represents the linear relationship between pixel values and DCT coefficients defined by the blockwise two-dimensional DCT, and y*(k, l) represents the ground truth of y(k, l). Note that the last constraint fixes all AC coefficients to their ground truths while the DC coefficient y(0, 0) remains a variable. In principle, the objective function f can be any convex function that allows a polynomial-time optimization algorithm. However, considering Property 14.2, the authors chose the objective function to be Σ_{{(i,j),(i′,j′)}} |x(i, j) − x(i′, j′)|, where the sum ranges over all pairs of neighboring pixels. This choice makes the optimization problem a linear one, which can be solved efficiently using the linear programming (LP) technique [80]. The linearization is done by introducing variables h_{i,j,i′,j′} into the model as follows:

minimize Σ h_{i,j,i′,j′} (14.8)

subject to x(i, j) − x(i′, j′) ≤ h_{i,j,i′,j′}, x(i′, j′) − x(i, j) ≤ h_{i,j,i′,j′}, and the constraints related to Equation 14.7. In the above model, the variable h_{i,j,i′,j′} will be tight at |x(i, j) − x(i′, j′)| at the optimum solution.
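The h-variable linearization can be exercised on a toy one-dimensional instance using `scipy.optimize.linprog`; all numbers below are hypothetical. A three-pixel "block" x_i = d_i + c shares one unknown offset c (playing the role of the encrypted DC), its zero-mean detail d = [10, 0, −10] is known, and its left neighbor pixel is unencrypted and equals 100. Minimizing the absolute boundary jump |100 − (10 + c)| via one auxiliary variable h recovers c:

```python
from scipy.optimize import linprog

# LP variables: z = [c, h], where h >= |100 - (10 + c)|.
c_obj = [0.0, 1.0]                  # minimize h (total boundary jump)
A_ub = [[1.0, -1.0],                # (10 + c) - 100 <= h
        [-1.0, -1.0]]               # 100 - (10 + c) <= h
b_ub = [90.0, -90.0]
bounds = [(10.0, 245.0),            # keeps every d_i + c inside [0, 255]
          (0.0, None)]              # h is nonnegative
res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
c_hat = res.x[0]                    # recovered offset; h is driven to 0
```

At the optimum h = 0 forces c = 90, so the recovered block [100, 90, 80] joins its neighbor smoothly; the full model in Equation 14.8 does the same with one h per neighboring pixel pair.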

With the above optimization model, the generalization from DC encryption to DCT encryption is straightforward: one can simply fix y(k, l) = y*(k, l) for the known DCT coefficients while keeping all unknown variables in their natural ranges. The generalized model maintains linearity, so it can still be solved via linear programming. Given an n × m image and U unknowns in each of B blocks of dimensions N × N, the optimization model contains 2nm − (n + m) variables h and UB variables y, while all other unknowns can be determined from these key variables in a presolve step. The average time complexity and space complexity of the optimization problem were experimentally verified to be O(n^2 m^2 U) and O(nmU), respectively.


FIGURE 14.15
The recovery results of the linear programming-based ECA when sixteen most significant DCT coefficients are encrypted in each block of the image Cameraman shown in Figure 14.13a: (a) the image recovered by setting all missing DCT coefficient to the midpoints of their valid ranges and (b) the image recovered by the linear programming-based ECA.

The linear programming-based ECA works nearly perfectly for breaking DC encryption. For the test image Cameraman, the recovered image reaches a PSNR value of 26.4866 dB (see Figure 14.14). For the DC encryption case, the linear programming-based method statistically outperforms the FRM method by a significant margin (see Figure 2 of Reference [79]). Even more surprisingly, when more than ten most significant DCT coefficients are encrypted, the linear programming-based ECA can still recover the plaintext with relatively high perceptual quality for some images. Figure 14.15 shows an example on the test image Cameraman when the sixteen most significant DCT coefficients are missing from each block.

The big success of the optimization-based ECA is rather unexpected, especially since it turned out that a relatively simple optimization model can work so well. The results not only reveal that DCT encryption has largely failed, but also lead to more profound implications for image and video coding; in particular, they raise the question of how much information actually needs to be encoded in a DCT-based image and video coding scheme. In addition, some digital watermarking and data hiding schemes can also be handled by the same optimization model if they use some AC coefficients to hide the watermarks/hidden information.

14.6 Conclusion

Perceptual encryption of digital images and videos is useful for many applications and can be easily implemented by selectively encrypting part of the whole image/video bitstream. However, recent research on error-concealment attacks has raised an alarm about the security of many perceptual encryption schemes.

Although the optimization-based ECA is very new, it has shown great potential for breaking probably almost all DCT-based perceptual encryption schemes with high probability. Its generalization to DWT and other transforms is still to come, but it is highly likely that the generalized attacks will still work. Therefore, it is believed that further extension of the optimization-based ECA should be the main research direction in the field. A combination of the optimization-based method and the statistics-based approach could be another topic for further enhancing the results, especially when the plaintext image has unique statistics that are known to the attacker (for example, satellite images). In addition, applications of the optimization model to other fields, such as image and video coding, digital watermarking, and steganography, are also interesting directions that deserve further investigation.

It remains unclear whether the optimization-based attacks can also be generalized to non-perceptual multimedia encryption, where the unencrypted data are not decodable. It seems possible that new attacks can be developed by combining the current optimization model with image and video codecs. For instance, although the unencrypted data are not decodable, it is known that they will become decodable if the encrypted data are correctly revealed. So the question becomes whether it is possible to build an optimization model that can find the optimal point maximizing the degree of decodability of the non-decodable data. This is certainly another big challenge to the multimedia encryption community.

Acknowledgment

Shujun Li was supported by a fellowship from the Zukunftskolleg of the University of Konstanz, Germany, which is part of the “Excellence Initiative” Program of the DFG (German Research Foundation).

Figures 14.10 and 14.14 are reprinted from Reference [79] with the permission of IEEE.

References

[1]. Y.Q. Shi and H. Sun, Image and Video Compression for Multimedia Engineering: Fundamentals, Algorithms, and Standards. Boca Raton, FL, USA: CRC Press, 2nd edition, 2008.

[2]. I.F. Akyildiz, T. Melodia, and K.R. Chowdhury, “Wireless multimedia sensor networks: Applications and testbeds,” Proceedings of the IEEE, vol. 96, no. 10, pp. 1588–1605, October 2008.

[3]. F. Dufaux, W. Gao, S. Tubaro, and A. Vetro, “Distributed video coding: Trends and perspectives,” EURASIP Journal on Image and Video Processing, vol. 2009, pp. 508167:1–13, 2009.

[4]. K.R. Rao and P.C. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications. Boston, MA, USA: Academic Press, 1990.

[5]. S. Mallat, A Wavelet Tour of Signal Processing. Boston, MA, USA: Academic Press, 2nd edition, 1999.

[6]. T.M. Cover and J.A. Thomas, Elements of Information Theory. Hoboken, NJ, USA: John Wiley & Sons, 2nd edition, 2005.

[7]. ISO/IEC, “Information technology – JPEG 2000 image coding system: Core coding system.” ISO/IEC 15444–1, 2000.

[8]. ISO/IEC, “Information technology – JPEG XR image coding system – Part 2: Image coding specification.” ISO/IEC 29199–2, 2009.

[9]. ISO/IEC, “Information technology – Coding of audio-visual objects – Part 10: Advanced Video Coding.” ISO/IEC 14496–10, 2004.

[10]. SMPTE (Society of Motion Picture and Television Engineers), “Standard for television – VC-1 compressed video bitstream format and decoding process.” SMPTE 421M, 2006.

[11]. A. Menezes, P. van Oorschot, and S. Vanstone, Handbook of Applied Cryptography. Boca Raton, FL, USA: CRC Press, 1996.

[12]. B. Schneier, Applied Cryptography – Protocols, Algorithms, and Source Code in C. Hoboken, NJ, USA: John Wiley & Sons, 2nd edition, 1996.

[13]. National Institute of Standards and Technology (US), “Specification for the advanced encryption standard (AES).” Federal Information Processing Standards Publication 197 (FIPS PUB 197), November 2001.

[14]. K. Kaukonen and R. Thayer, “A stream cipher encryption algorithm Arcfour.” Available online, http://www.mozilla.org/projects/security/pki/nss/draft-kaukonen-cipher-arcfour-03.txt.

[15]. A. Kerckhoffs, “La cryptographie militaire,” Journal des sciences militaires, vol. 9, pp. 5–38, January 1883.

[16]. A. Kerckhoffs, “La cryptographie militaire,” Journal des sciences militaires, vol. 9, pp. 161–191, February 1883.

[17]. ISO/IEC, “Information technology – JPEG 2000 image coding system: Secure JPEG 2000.” ISO/IEC 15444–8, 2007.

[18]. ISO/IEC, “Information technology – Generic coding of moving pictures and associated audio information – Part 11: IPMP on MPEG-2 systems.” ISO/IEC 13818–11, 2004.

[19]. ISO/IEC, “Information technology – Coding of audio-visual objects – Part 13: Intellectual Property Management and Protection (IPMP) extensions.” ISO/IEC 14496–13, 2004.

[20]. “Advanced Access Content System (aacs) specifications.” Available online, http://www.aacsla.com/specifications.

[21]. M. Johnson, P. Ishwar, V. Prabhakaran, D. Schonberg, and K. Ramchandran, “On compressing encrypted data,” IEEE Transactions on Signal Processing, vol. 52, no. 10, pp. 2992–3006, October 2004.

[22]. L. Qiao and K. Nahrsted, “Comparison of MPEG encryption algorithms,” Computers & Graphics, vol. 22, no. 4, pp. 437–448, August 1998.

[23]. C. Fontaine and F. Galand, “A survey of homomorphic encryption for nonspecialists,” EURASIP Journal on Information Security, vol. 2007, pp. 13801:1–10, 2007.

[24]. C. Gentry, “Fully homomorphic encryption using ideal lattices,” in Proceedings of the 41st Annual ACM Symposium on Theory of Computing, Bethesda, MD, USA, June 2009, pp. 169–178.

[25]. B. Schneier, “Homomorphic encryption breakthrough.” Available online, http://www.schneier.com/blog/archives/2009/07/homomorphic_enc.html.

[26]. S. Li, Z. Li, and W.A. Halang, “Multimedia encryption,” in Encyclopedia of Multimedia Technology and Networking, Hershey, PA, USA: IGI Global, August 2008, pp. 972–977.

[27]. S. Li, C. Li, G. Chen, N.G. Bourbakis, and K.T. Lo, “A general quantitative cryptanalysis of permutation-only multimedia ciphers against plaintext attacks,” Signal Processing: Image Communication, vol. 23, no. 3, pp. 212–223, March 2008.

[28]. S. Li, G. Chen, A. Cheung, B. Bhargava, and K.T. Lo, “On the design of perceptual MPEG-video encryption algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 2, pp. 214–223, February 2007.

[29]. Y. Mao and M. Wu, “A joint signal processing and cryptographic approach to multimedia encryption,” IEEE Transactions on Image Processing, vol. 15, no. 7, pp. 2061–2075, July 2006.

[30]. T. Uehara and R. Safavi-Naini, “Attacking and mending arithmetic coding encryption schemes,” in Proceedings of the 22nd Australasian Computer Science Conference, Auckland, New Zealand, January 1999, pp. 408–419.

[31]. S. Li, G. Chen, A. Cheung, K.T. Lo, and M. Kankanhalli, “On the security of an MPEG-video encryption scheme based on secret Huffman tables,” in Proceedings of the Third Pacific Rim Symposium, Tokyo, Japan, January 2009, pp. 898–909.

[32]. K.P. Subbalakshmi and G. Jakimoski, “Cryptanalysis of some encryption schemes for multimedia,” IEEE Transactions on Multimedia, vol. 10, no. 4, pp. 330–338, April 2008.

[33]. I. Agi and L. Gong, “An empirical study of secure MPEG video transmission,” in Proceedings of ISOC Symposium on Network and Distributed Systems Security, San Diego, CA, USA, February 1996, pp. 137–144.

[34]. J. Wen, M. Severa, W. Zeng, M.H. Luttrell, and W. Jin, “A format-compliant configurable encryption framework for access control of video,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 6, pp. 545–557, June 2002.

[35]. T.D. Lookabaugh, D.C. Sicker, D.M. Keaton, W.Y. Guo, and I. Vedula, “Security analysis of selectively encrypted MPEG-2 streams,” Proceedings of SPIE, vol. 5241, pp. 10–21, September 2003.

[36]. C.P. Wu and C.C.J. Kuo, “Design of integrated multimedia compression and encryption systems,” IEEE Transactions on Multimedia, vol. 7, no. 10, pp. 828–839, October 2005.

[37]. A. Said, “Measuring the strength of partial encryption schemes,” in Proceedings of 2005 IEEE International Conference on Image Processing, Genoa, Italy, September 2005, pp. 1126–1129.

[38]. T. Uehara, R. Safavi-Naini, and P. Ogunbona, “Recovering DC coefficients in block-based DCT,” IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3592–3596, November 2006.

[39]. B.M. Macq and J.J. Quisquater, “Cryptology for digital TV broadcasting,” Proceedings of the IEEE, vol. 83, no. 6, pp. 944–957, June 1995.

[40]. ISO/IEC, “Information technology – Generic coding of moving pictures and associated audio information: Video.” ISO/IEC 13818–2, 1996.

[41]. ISO/IEC, “Information technology – Coding of audio-visual objects – Part 2: Visual.” ISO/IEC 14496–2, 1999.

[42]. H. Schwarz, D. Marpe, and T. Wiegand, “Overview of the scalable video coding extension of the H.264/AVC standard,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 9, pp. 1103–1120, September 2007.

[43]. J. Dittmann and A. Steinmetz, “Enabling technology for the trading of MPEG-encoded video,” in Proceedings of the Second Australasian Conference on Information Security and Privacy (ACISP’97), Sydney, NSW, Australia, July 1997, pp. 314–324.

[44]. ISO/IEC, “Information technology – Digital compression and coding of continuous-tone still images: Requirements and guidelines.” ISO/IEC 10918–1, 1994.

[45]. ISO/IEC, “Information technology – Coding of moving pictures and associated audio for digital storage media at up to about 1,5 Mbit/s – Part 2: Video.” ISO/IEC 11172–2, 1993.

[46]. M.V. Droogenbroeck and R. Benedett, “Techniques for a selective encryption of uncompressed and compressed images,” in Proceedings of Advanced Concepts for Intelligent Vision Systems, Ghent, Belgium, September 2002, pp. 90–97.

[47]. A. Torrubia and F. Mora, “Perceptual cryptography of JPEG compressed images on the JFIF bit-stream domain,” in Proceedings of the IEEE International Conference on Consumer Electronics, Los Angeles, CA, USA, June 2003, pp. 58–59.

[48]. F. Dufaux and T. Ebrahimi, “Scrambling for privacy protection in video surveillance systems,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 8, pp. 1168–1174, August 2008.

[49]. L. Weng and B. Preneel, “On encryption and authentication of the DC DCT coefficient,” in Proceedings of the Second International Conference on Signal Processing and Multimedia Applications, Barcelona, Spain, July 2007, pp. 375–379.

[50]. G. Liu, T. Ikenaga, S. Goto, and T. Baba, “A selective video encryption scheme for MPEG compression standard,” IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, vol. E89-A, no. 1, pp. 194–202, January 2006.

[51]. M.I. Khan, J. Varun, and A.S. Malik, “On perceptual encryption: Variants of DCT block scrambling scheme for JPEG compressed images,” in Proceedings of the Signal Processing and Multimedia: International Conferences, SIP and MulGraB 2010, Jeju Island, Korea, December 2010, pp. 212–223.

[52]. BBC, “Dirac specification,” Version 2.2.3. Available online, http://diracvideo.org/specifications.

[53]. SMPTE (Society of Motion Picture and Television Engineers), “VC-2 video compression.” ST 2042–1–2009, 2009.

[54]. S. Lian, J. Sun, and Z. Wang, “Perceptual cryptography on SPIHT compressed images or videos,” in Proceedings of 2004 IEEE International Conference on Multimedia and Expo, Taipei, Taiwan, July 2004, pp. 2195–2198.

[55]. D. Engel and A. Uhl, “Parameterized biorthogonal wavelet lifting for lightweight JPEG 2000 transparent encryption,” in Proceedings of the 7th Workshop on Multimedia and Security, New York, NY, USA, August 2005, pp. 63–70.

[56]. J. Fang and J. Sun, “Compliant encryption scheme for JPEG 2000 image code streams,” Journal of Electronic Imaging, vol. 15, no. 4, pp. 043013:1–4, November 2006.

[57]. J.L. Liu, “Efficient selective encryption for JPEG 2000 images using private initial table,” Pattern Recognition, vol. 39, no. 8, pp. 1509–1517, August 2006.

[58]. T. Stütz and A. Uhl, “On efficient transparent JPEG2000 encryption,” in Proceedings of the 9th Workshop on Multimedia & Security, Dallas, TX, USA, September 2007, pp. 97–108.

[59]. D. Engel, T. Stütz, and A. Uhl, “A survey on JPEG2000 encryption,” Multimedia Systems, vol. 15, no. 4, pp. 243–270, August 2009.

[60]. H. Kiya, S. Imaizumi, and O. Watanabe, “Partial-scrambling of images encoded using JPEG2000 without generating marker codes,” in Proceedings of 2003 International Conference on Image Processing, Barcelona, Spain, September 2003, pp. 205–208.

[61]. H. Wu and D. Ma, “Efficient and secure encryption schemes for JPEG2000,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, May 2004, pp. 869–872.

[62]. Y. Bodo, N. Laurent, and J.L. Dugelay, “A scrambling method based on disturbance of motion vector,” in Proceedings of the 10th ACM International Conference on Multimedia, Juan les Pins, France, December 2002, pp. 89–90.

[63]. R. Hamzaoui and D. Saupe, “Fractal image compression,” in Document and Image Compression, M. Barni (ed.), Boca Raton, FL, USA: CRC Press, May 2006, pp. 145–175.

[64]. S. Roche, J.L. Dugelay, and R. Molva, “Multi-resolution access control algorithm based on fractal coding,” in Proceedings of the International Conference on Image Processing, Lausanne, Switzerland, September 1996, pp. 235–238.

[65]. L. Qiao, Multimedia Security and Copyright Protection. PhD thesis, University of Illinois at Urbana–Champaign, Urbana, IL, USA, 1998.

[66]. S. Winkler, Digital Video Quality: Vision Models and Metrics. Hoboken, NJ, USA: John Wiley & Sons, 2005.

[67]. Z. Wang and A.C. Bovik, Modern Image Quality Assessment. San Rafael, CA, USA: Morgan & Claypool Publishers, 2006.

[68]. Z. Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli, “Image quality assessment: From error visibility to structural similarity,” IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, April 2004.

[69]. H.R. Sheikh and A.C. Bovik, “Image information and visual quality,” IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, February 2006.

[70]. D.M. Chandler and S.S. Hemami, “VSNR: A wavelet-based visual signal-to-noise ratio for natural images,” IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2284–2298, September 2007.

[71]. W. Lin and C.C.J. Kuo, “Perceptual visual quality metrics: A survey,” Journal of Visual Communication and Image Representation, vol. 22, no. 4, pp. 297–312, May 2011.

[72]. Y. Mao and M. Wu, “Security evaluation for communication-friendly encryption of multimedia,” in Proceedings of the International Conference on Image Processing, Singapore, October 2004, pp. 24–27.

[73]. Y. Yao, Z. Xu, and J. Sun, “Visual security assessment for cipher-images based on neighborhood similarity,” Informatica, vol. 33, pp. 69–76, March 2009.

[74]. J. Sun, Z. Xu, J. Liu, and Y. Yao, “An objective visual security assessment for cipher-images based on local entropy,” Multimedia Tools and Applications, vol. 53, pp. 75–95, May 2011.

[75]. F. Ahmed and C.L. Resch, “Characterizing cryptographic primitives for lightweight digital image encryption,” Proceedings of SPIE, vol. 7351, pp. 73510G:1–11, April 2009.

[76]. S. Li, C. Li, K.T. Lo, and G. Chen, “Cryptanalysis of an image scrambling scheme without bandwidth expansion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 18, no. 3, pp. 338–349, March 2008.

[77]. D.V.D. Ville, W. Philips, R.V. de Walle, and I. Lemahieu, “Image scrambling without bandwidth expansion,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 14, no. 6, pp. 892–897, June 2004.

[78]. S. Li, J.J. Ahmad, D. Saupe, and C.C.J. Kuo, “An improved DC recovery method from AC coefficients of DCT-transformed images,” in Proceedings of the IEEE International Conference on Image Processing, Hong Kong, September 2010, pp. 2085–2088.

[79]. S. Li, A. Karrenbauer, D. Saupe, and C.C.J. Kuo, “Recovering missing coefficients in DCT-transformed images,” in Proceedings of the IEEE International Conference on Image Processing, Brussels, Belgium, September 2011.

[80]. D. Bertsimas and J.N. Tsitsiklis, Introduction to Linear Optimization. Nashua, NH, USA: Athena Scientific, 2nd edition, 1997.
