Performance analysis

A key feature of any big data algorithm is finding a way to apply a divide-and-conquer strategy. This is true of functional programming design as well as imperative design.

We have three options to speed up this processing; they are as follows:

  • We can try to use parallelism to do more of the calculations concurrently. On a four-core processor, the time can be cut to approximately 25 percent of the original; for the Manhattan distance, this means about 8 minutes. A sketch of this option follows the list.
  • We can see if caching intermediate results will reduce the amount of redundant calculation. The question arises of how many colors are the same and how many colors are unique.
  • We can look for a radical change in the algorithm.
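As a minimal sketch of the first option, the standard-library concurrent.futures module can fan the work out across cores. The manhattan() function, the TARGETS list, and the chunking of pixels are assumptions for illustration, not part of the original example:

from concurrent.futures import ProcessPoolExecutor

def manhattan(rgb1, rgb2):
    # Manhattan distance between two RGB three-tuples.
    return sum(abs(a - b) for a, b in zip(rgb1, rgb2))

def match_chunk(pixels):
    # Worker: pair each pixel with its closest target color.
    # TARGETS is assumed to be a module-level list of target colors.
    return [
        (xy, min(TARGETS, key=lambda t: manhattan(rgb, t)))
        for xy, rgb in pixels
    ]

def parallel_match(chunks):
    # Fan chunks of (xy, rgb) pixels out across the available cores.
    with ProcessPoolExecutor() as executor:
        for result in executor.map(match_chunk, chunks):
            yield from result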

We'll combine the last two points by computing all the possible comparisons between source colors and target colors. In this case, as in many other contexts, we can easily enumerate the entire mapping and avoid redundant calculation when done on a pixel-by-pixel basis. We'll also change the algorithm from a series of comparisons to a series of simple lookups in a mapping object.
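A minimal sketch of this precomputation, assuming a hypothetical distance() function and a collection of target colors, might look like the following:

def build_mapping(source_colors, targets, distance):
    # Compute the closest target color once per distinct source color.
    # The resulting dict turns per-pixel matching into a simple lookup.
    return {
        source: min(targets, key=lambda target: distance(source, target))
        for source in source_colors
    }

Once this mapping exists, repainting the image costs one dictionary lookup per pixel rather than 133 distance computations per pixel.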

When looking at this idea of precomputing all transformations from source color to target color, we need some overall statistics for an arbitrary image. The code associated with this book includes IMG_2705.jpg. Here is a basic algorithm to collect all of the distinct color tuples from the specified image:

from collections import defaultdict, Counter

palette = defaultdict(list)
for xy, rgb in pixel_iter(img):
    palette[rgb].append(xy)

w, h = img.size
print("Total pixels", w*h)
print("Total colors", len(palette))

We collected all pixels of a given color into lists organized by color. From this, we learn the following facts:

  • The total number of pixels is 9,980,928. This fits the expectation for a 10 megapixel image.
  • The total number of colors is 210,303. If we try to compute the Euclidean distance between actual colors and the 133 target colors, we would do 27,970,299 calculations, which might take about 76 seconds.
  • Using a 3-bit mask, 0b11100000, the total number of colors actually used is reduced to 214 out of a domain of 2⁹ = 512 possible colors (the sketch after this list shows how these counts can be computed).
  • Using a 4-bit mask, 0b11110000, 1,150 colors are actually used.
  • Using a 5-bit mask, 0b11111000, 5,845 colors are actually used.
  • Using a 6-bit mask, 0b11111100, 27,726 colors are actually used. The domain of possible colors swells to 2¹⁸ = 262,144.
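These counts can be reproduced by building a Counter from masked colors. The following is a minimal sketch that reuses the pixel_iter() function from the earlier example; the loop over mask values is an addition for illustration:

from collections import Counter

def mask_colors(img, mask):
    # Count distinct colors after applying the bit mask
    # to each of the red, green, and blue components.
    return Counter(
        tuple(c & mask for c in rgb)
        for xy, rgb in pixel_iter(img)
    )

for mask in (0b11100000, 0b11110000, 0b11111000, 0b11111100):
    print(bin(mask), len(mask_colors(img, mask)))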

This gives us some insight into how we can rearrange the data structure, calculate the matching colors quickly, and then rebuild the image without doing a billion comparisons.

The core idea behind masking is to preserve the most significant bits of a value and eliminate the least significant bits. Consider a color with a red value of 200. We can use the Python bin() function to see the binary representation of that value:

>>> bin(200)
'0b11001000'
>>> 200 & 0b11100000
192
>>> bin(192)
'0b11000000'

The expression 200 & 0b11100000 applies a mask to conceal the least significant 5 bits and preserve the most significant 3 bits. What remains after the mask is applied is a red value of 192.

We can apply mask values to the RGB three-tuple with the following expression, where c is an original color three-tuple:

masked_color = tuple(map(lambda x: x & 0b11100000, c))

This will pick out the most significant 3 bits of red, green, and blue values by using the & operator to select particular bits from an integer value. If we use this instead of the original color to create a Counter object, we'll see that the image only uses 214 distinct values after the mask is applied. This is fewer than half the theoretical number of colors.
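Putting the mask together with the build_mapping() sketch shown earlier, and assuming the same hypothetical targets and distance() names, the precomputation shrinks dramatically:

mask = 0b11100000
# Collect the distinct masked colors actually present in the image.
masked = {
    tuple(c & mask for c in rgb)
    for xy, rgb in pixel_iter(img)
}
# 214 masked colors x 133 targets = 28,462 distance computations,
# instead of roughly 28 million for the unmasked color set.
mapping = build_mapping(masked, targets, distance)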
