Chapter 7. Rearranging Bits and Bytes

7–1 Reversing Bits and Bytes

By “reversing bits” we mean to reflect the contents of a register about the middle so that, for example,

rev(0x01234567) = 0xE6A2C480.

By “reversing bytes” we mean a similar reflection of the four bytes of a register. Byte reversal is a necessary operation to convert data between the “little-endian” format used by DEC and Intel, and the “big-endian” format used by most other manufacturers.

Bit reversal can be done quite efficiently by interchanging adjacent single bits, then interchanging adjacent 2-bit fields, and so on, as shown below [Aus1]. These five assignment statements can be executed in any order. This is the same algorithm as the first population count algorithm of Section 5–1, but with addition replaced with swapping.

x = (x & 0x55555555)  <<   1 | (x & 0xAAAAAAAA) >>  1;
x = (x & 0x33333333)  <<   2 | (x & 0xCCCCCCCC) >>  2;
x = (x & 0x0F0F0F0F)  <<   4 | (x & 0xF0F0F0F0) >>  4;
x = (x & 0x00FF00FF)  <<   8 | (x & 0xFF00FF00) >>  8;
x = (x & 0x0000FFFF)  <<  16 | (x & 0xFFFF0000) >> 16;

A small improvement may result on some machines by using fewer distinct large constants and doing the last two assignments in a more straightforward way, as shown in Figure 7–1 (30 basic RISC instructions, branch-free).


unsigned rev(unsigned x) {
   x = (x & 0x55555555) << 1 | (x >> 1) & 0x55555555;
   x = (x & 0x33333333) << 2 | (x >> 2) & 0x33333333;
   x = (x & 0x0F0F0F0F) << 4 | (x >> 4) & 0x0F0F0F0F;
   x = (x << 24) | ((x & 0xFF00) << 8) |
       ((x >> 8) & 0xFF00) | (x >> 24);
   return x;
}


FIGURE 7–1. Reversing bits.

The last assignment to x in this code does byte reversal in nine basic RISC instructions. If the machine has rotate shifts, however, this can be done in seven instructions with

x = shlr(x, 8) & 0x00FF00FF | shlr(x, 24) & 0xFF00FF00;

PowerPC can do the byte-reversal operation in only three instructions [Hay1]: a rotate left of 8, which positions two of the bytes, followed by two “rlwimi” (rotate left word immediate then mask insert) instructions.

The next algorithm, by Christopher Strachey [Strach 1961], is old by computer standards, but it is instructive. It reverses the rightmost 16 bits of a word, assuming the leftmost 16 bits are clear at the start, and places the reversed halfword in the left half of the register.

Its operation is based on the number of bit positions that each bit must move. The 16 bits, taken from left to right, must move 1, 3, 5, ..., 31 positions. The bits that must move 16 or more positions are moved first, then those that must move eight or more positions, and so forth. The operation is illustrated below, where each letter denotes a single bit, and a period denotes a “don’t care” bit.

0000 0000 0000 0000 abcd efgh ijkl mnop Given
0000 0000 ijkl mnop abcd efgh .... .... After shl 16
0000 mnop ijkl efgh abcd .... .... .... After shl 8
00op mnkl ijgh efcd ab.. .... .... .... After shl 4
0pon mlkj ihgf edcb a... .... .... .... After shl 2
ponm lkji hgfe dcba .... .... .... .... After shl 1

Straightforward code consists of 16 basic RISC instructions, plus 12 to load the constants:

x = x | ((x & 0x000000FF) << 16);
x = (x & 0xF0F0F0F0) | ((x & 0x0F0F0F0F) << 8);
x = (x & 0xCCCCCCCC) | ((x & 0x33333333) << 4);
x = (x & 0xAAAAAAAA) | ((x & 0x55555555) << 2);
x = x << 1;

Complementation can be used to reduce the number of distinct masks. By using more irregular masks, the rightmost 16 bits can be preserved.

If rotate shifts are available, Strachey’s idea can be used to reverse a 32-bit word. The idea is to consider how many bit positions each bit must move rotationally to the left to get to its final position. Taking the bits from left to right, the shift amounts are 1, 3, 5, ..., 31, 1, 3, 5, ..., 31 (no bit moves an even number of positions). The algorithm first rotate-moves those bits that must move 16 or more positions, then those that must move eight or more positions, and so forth, and finally those that must move one position (which is all of the bits, because all move amounts are odd). This scheme is shown below, for reversing a 32-bit word x. Function shlr(x, y) rotates x left y positions.

x = shlr(x & 0x00FF00FF, 16) | x & ~0x00FF00FF;
x = shlr(x & 0x0F0F0F0F,  8) | x & ~0x0F0F0F0F;
x = shlr(x & 0x33333333,  4) | x & ~0x33333333;
x = shlr(x & 0x55555555,  2) | x & ~0x55555555;
x = shlr(x, 1);

The code uses and with complement to avoid loading some masks. If your machine does not have that instruction, it can be avoided by rewriting the first line of code as

x = shlr(x, 16) & 0x00FF00FF | x & ~0x00FF00FF;

which is a MUX operation, and using the identity

(x & m) | (y & ~m)  =  ((x ^ y) & m) ^ y

to obtain

x = ((shlr(x, 16) ^ x) & 0x00FF00FF) ^ x;

and similarly for the other lines that have and with complement.

A slightly better way for many machines, in that it has a little instruction-level parallelism, is to use the identity [Karv]

x & ~m  =  (x & m) ^ x

and common the and expression. This gives the function shown in Figure 7–2 (17 instructions, plus eight to load constants, or 25 in all).


unsigned rev(unsigned x) {
   unsigned t;
   t = x & 0x00FF00FF; x = shlr(t, 16) | t ^ x;
   t = x & 0x0F0F0F0F; x = shlr(t,  8) | t ^ x;
   t = x & 0x33333333; x = shlr(t,  4) | t ^ x;
   t = x & 0x55555555; x = shlr(t,  2) | t ^ x;
   x = shlr(x, 1);
   return x;
}


FIGURE 7–2. Reversing bits with rotate shifts.

It is perhaps worth noting that the constants 0x00FF00FF, 0x0F0F0F0F, and so on can be generated one from another as shown below. This is not useful for 32-bit machines (it may even be harmful by reducing parallelism), because 32-bit RISC machines generally can load the constants in two instructions. But it might be useful for a 64-bit machine, for which it is illustrated.

c = 0x00000000FFFFFFFF;
c = c ^ (c << 16);        // 0x0000FFFF0000FFFF
c = c ^ (c <<  8);        // 0x00FF00FF00FF00FF
c = c ^ (c <<  4);        // 0x0F0F0F0F0F0F0F0F
c = c ^ (c <<  2);        // 0x3333333333333333
c = c ^ (c <<  1);        // 0x5555555555555555

Another way to reverse bits is to break the word up into three groups of bits, and swap the leftmost and rightmost groups, leaving the center group in place [Baum]. For a 27-bit word, this works as illustrated below.

012345678 9abcdefgh ijklmnopq   The given 27-bit word
ijklmnopq 9abcdefgh 012345678   First ternary swap
opqlmnijk fghcde9ab 678345012   Second ternary swap
qponmlkji hgfedcba9 876543210   Third ternary swap

Straightforward code for this follows. If run on a 32-bit machine, it reverses bits 0 to 26, placing the result in bit positions 0 to 26, and clearing bits 27 to 31.

x = (x & 0x000001FF) << 18 | (x & 0x0003FE00) |
    (x >> 18) & 0x000001FF;
x = (x & 0x001C0E07) <<  6 | (x & 0x00E07038) |
    (x >> 6) & 0x001C0E07;
x = (x & 0x01249249) <<  2 | (x & 0x02492492) |
    (x >> 2) & 0x01249249;

This amounts to 21 basic RISC instructions, plus 10 to load the constants, or 31 in all. In comparison, the code of Figure 7–1 is 24 basic RISC instructions, plus six to load constants, plus a shift right of 5 to right-justify the result, or 31 in all. Thus, the ternary method is equal or superior when there are 27 or fewer bits to be reversed.

The next function, by Donald E. Knuth [Knu8], is interesting because it reverses a 32-bit word with only four stages, and the shifting and masking steps are unexpectedly irregular. It uses one rotate shift and three ternary swaps. It works as follows:

01234567 89abcdef ghijklmn opqrstuv   Given
fghijklm nopqrstu v0123456 789abcde   Rotate left 15
pqrstuvm nofghijk labcde56 78901234   10-swap
tuvspqrm nojklifg hebcda96 78541230   4-swap
vutsrqpo mnlkjihg fedcba98 76543210   2-swap

Straightforward code is shown below.

x = shlr(x, 15);             // Rotate left 15.
x = (x & 0x003F801F) << 10 | (x & 0x01C003E0) |
    (x >> 10) & 0x003F801F;
x = (x & 0x0E038421) <<  4 | (x & 0x11C439CE) |
    (x >>  4) & 0x0E038421;
x = (x & 0x22488842) <<  2 | (x & 0x549556B5) |
    (x >>  2) & 0x22488842;

An improvement in operation count, at the expense of parallelism, results from rewriting

x = (x & M1) << s | (x & M2) | (x >> s) & M1;

where M2 is ~(M1 | (M1 << s)), as:

t = (x ^ (x >> s)) & M1; x = (t | (t << s)) ^ x;

This results in the code in Figure 7–3 (19 full RISC instructions, plus six to load constants, or 25 in all).


unsigned rev(unsigned x) {
   unsigned t;

   x = shlr(x, 15);              // Rotate left 15.
   t = (x ^ (x>>10)) & 0x003F801F;  x = (t | (t<<10)) ^ x;
   t = (x ^ (x>> 4)) & 0x0E038421;  x = (t | (t<< 4)) ^ x;
   t = (x ^ (x>> 2)) & 0x22488842;  x = (t | (t<< 2)) ^ x;
   return x;
}


FIGURE 7–3. Reversing bits, Knuth’s algorithm.

Although Knuth’s algorithm does not beat the algorithm shown in Figure 7–2 for reversing a 32-bit quantity with rotate shifts allowed (17 instructions, plus eight to load constants), Knuth’s code uses only one rotate shift instruction. If it is coded as

x = (x << 15) | (x >> 17);    // Rotate left 15.

then Knuth’s algorithm is 21 instructions, plus six to load constants, which is the best found by these measures for reversing a 32-bit word using only basic RISC instructions. This makes one wonder if there is a simple way to predict the number of shifts and logical operations required to reverse a word of a given length.

Can Knuth’s algorithm be extended to reversing 64 bits on a 64-bit machine? Yes, there is a simple way and a way that is more difficult to work out. The simple way is to first swap the two halves of the 64-bit register, and then apply the 32-bit version of Knuth’s algorithm to both halves, in parallel. The resulting code is shown in Figure 7–4. It is 24 operations, if the swap (rotate 32) counts as one.


unsigned long long rev(unsigned long long x) {
   unsigned long long t;

   x = (x << 32) | (x >> 32);    // Swap register halves.
   x = (x & 0x0001FFFF0001FFFFLL) << 15 | // Rotate left
       (x & 0xFFFE0000FFFE0000LL) >> 17;  // 15.
   t = (x ^ (x >> 10)) & 0x003F801F003F801FLL;
   x = (t | (t << 10)) ^ x;
   t = (x ^ (x >> 4)) & 0x0E0384210E038421LL;
   x = (t | (t << 4)) ^ x;
   t = (x ^ (x >> 2)) & 0x2248884222488842LL;
   x = (t | (t << 2)) ^ x;
   return x;
}


FIGURE 7–4. Knuth’s algorithm applied to 64 bits.

The other way is to find shift amounts and masks analogous to those used in Knuth’s 32-bit reversal algorithm. This is shown below. It is 25 operations, if the rotate left shift of 31 positions counts as one operation.

unsigned long long rev(unsigned long long x) {
   unsigned long long t;

   x = (x << 31) | (x >> 33);   // I.e., shlr(x, 31).
   t = (x ^ (x >> 20)) & 0x00000FFF800007FFLL;
   x = (t | (t << 20)) ^ x;
   t = (x ^ (x >> 8)) & 0x00F8000F80700807LL;
   x = (t | (t << 8)) ^ x;
   t = (x ^ (x >> 4)) & 0x0808708080807008LL;
   x = (t | (t << 4)) ^ x;
   t = (x ^ (x >> 2)) & 0x1111111111111111LL;
   x = (t | (t << 2)) ^ x;
   return x;
}

Bit reversal can be aided by table lookup. The code that follows reverses a byte at a time, using a 256-byte table, and accumulates in reverse order the four bytes selected from the table. If the loop is strung out, this amounts to 13 basic RISC instructions, plus four loads, so it could be a winner on some machines.


unsigned rev(unsigned x) {
   static unsigned char table[256] = {0x00, 0x80, 0x40,
   0xC0, 0x20, 0xA0, 0x60, 0xE0, ..., 0xBF, 0x7F, 0xFF};
   int i;
   unsigned r;

   r = 0;
   for (i = 3; i >= 0; i--) {
      r = (r << 8) + table[x & 0xFF];
      x = x >> 8;
   }
   return r;
}

Generalized Bit Reversal

[GLS1] suggests that the following sort of generalization of bit reversal, which he calls “flip,” is a good candidate to consider for a computer’s instruction set:


if (k &  1) x = (x & 0x55555555) <<  1 | (x & 0xAAAAAAAA) >>  1;
if (k &  2) x = (x & 0x33333333) <<  2 | (x & 0xCCCCCCCC) >>  2;
if (k &  4) x = (x & 0x0F0F0F0F) <<  4 | (x & 0xF0F0F0F0) >>  4;
if (k &  8) x = (x & 0x00FF00FF) <<  8 | (x & 0xFF00FF00) >>  8;
if (k & 16) x = (x & 0x0000FFFF) << 16 | (x & 0xFFFF0000) >> 16;

(The last two and operations can be omitted.) For k = 31, this operation reverses the bits in a word. For k = 24, it reverses the bytes in a word. For k = 7, it reverses the bits in each byte, without changing the positions of the bytes. For k = 16, it swaps the left and right halfwords of a word, and so on. In general, it moves the bit at position m to position m ^ k. It can be implemented in hardware very similarly to the way a rotate shifter is usually implemented (five stages of MUXes, with each stage controlled by a bit of the shift amount k).

Bit-Reversing Novelties

Item 167 in [HAK] contains rather esoteric expressions for reversing 6-, 7-, and 8-bit integers. Although these expressions are designed for a 36-bit machine, the one for reversing a 6-bit integer works on a 32-bit machine, and those for 7- and 8-bit integers work on a 64-bit machine. These expressions are as follows:

remu((x * 0x00082082) & 0x01122408, 255)               (6-bit integer)
remu((x * 0x40100401) & 0x442211008, 255)              (7-bit integer)
remu((x * 0x0202020202) & 0x010884422010, 1023)        (8-bit integer)

The result of all these is a “clean” integer—right-adjusted with no unused high-order bits set.

In all these cases the remu function can instead be rem or mod, because its arguments are positive. The remainder function is simply summing the digits of a base-256 or base-1024 number, much like casting out nines. Hence, it can be replaced with a multiply and a shift right. For example, the 6-bit formula has the following alternative on a 32-bit machine (the multiplication must be modulo 2^32):

x = ((x * 0x00082082) & 0x01122408) * 0x01010101 >> 24;

These formulas are limited in their utility, because they involve a remaindering operation (20 cycles or more) and/or some multiplications, as well as loading of large constants. The formula immediately above requires ten basic RISC instructions, two of which are multiplies, which amounts to about 20 cycles on a present-day RISC. On the other hand, an adaptation of the code of Figure 7–1 to reverse 6-bit integers requires about 15 instructions, and probably about 9 to 15 cycles, depending on the amount of instruction-level parallelism in the machine. These techniques, however, do give compact code. Below are a few more techniques that might possibly be useful, all for a 32-bit machine. They involve a sort of double application of the idea from [HAK], to extend the technique to 8- and 9-bit integers on a 32-bit machine.

The following is a formula for reversing an 8-bit integer:

[Formula missing in this copy of the text.]

Here the remu cannot be changed to a multiply and shift. (You have to work these out, and look at the bit patterns, to see why.)

Here is a similar formula for reversing an 8-bit integer, which is interesting because it can be simplified quite a bit:

[Formula missing in this copy of the text.]

The simplifications are that the second product is just a shift left of the first product, the last mask can be generated from the second with just one instruction (a shift), and the remainder can be replaced by a multiply and shift. It simplifies to 14 basic RISC instructions, two of which are multiplies:

[Code missing in this copy of the text.]

The following is a formula for reversing a 9-bit integer:

[Formula missing in this copy of the text.]

The second multiplication can be avoided, because the product is equal to the first product shifted right six positions. The last mask is equal to the second mask shifted right eight positions. With these simplifications, this requires 12 basic RISC instructions, including one multiply and one remainder. The remainder operation must be unsigned, and it cannot be changed to a multiply and shift.

The reader who studies these marvels will be able to devise similar code for other bit-permuting operations. As a simple (and artificial) example, suppose it is desired to extract every other bit from an 8-bit quantity and compress the four bits to the right. That is, the desired transformation is

0000 0000 0000 0000 0000 0000 abcd efgh ==>
0000 0000 0000 0000 0000 0000 0000 bdfh

This can be computed as follows:

x = remu((x * 0x01010101) & 0x40100401, 511);

On most machines, the most practical way to do all these operations is by indexing into a table of 1-byte (or 9-bit) integers.

Incrementing a Reversed Integer

The Fast Fourier Transform (FFT) algorithm employs an integer i and its bit reversal rev(i) in a loop in which i is incremented by 1 [PuBr]. Straightforward coding would increment i and then compute rev(i) on each loop iteration. For small integers, computing rev(i) by table lookup is fast and practical. For large integers, however, table lookup is not practical and, as we have seen, computing rev(i) requires some 29 instructions.

If table lookup cannot be used, it is more efficient to maintain i in both normal and bit-reversed forms, incrementing them both on each loop iteration. This raises the question of how best to increment an integer that is in a register in reversed form. To illustrate, on a 4-bit machine we wish to successively step through the values (in hexadecimal)

0, 8, 4, C, 2, A, 6, E, 1, 9, 5, D, 3, B, 7, F.

In the FFT algorithm, i and its reversal are both some specific number of bits in length, almost certainly less than 32, and they are both right-justified in the register. However, we assume here that i is a 32-bit integer. After adding 1 to the reversed 32-bit integer, a shift right of the appropriate number of bits will make the result usable by the FFT algorithm (both i and rev(i) are used to index an array in memory).

The straightforward way to increment a reversed integer is to scan from the left for the first 0-bit, set it to 1, and set all bits to the left of it (if any) to 0’s. One way to code this is

unsigned x, m;

m = 0x80000000;
x = x ^ m;
if ((int)x >= 0) {
   do {
      m = m >> 1;
      x = x ^ m;
   } while (x < m);
}

This executes in three basic RISC instructions if x begins with a 0-bit, and four additional instructions for each loop iteration. Because x begins with a 0-bit half the time, with 10 (binary) one-fourth of the time, and so on, the average number of instructions executed is approximately

3(1/2) + 7(1/4) + 11(1/8) + 15(1/16) + ...
   = [4(1/2) + 8(1/4) + 12(1/8) + 16(1/16) + ...] - [1/2 + 1/4 + 1/8 + ...]
   = 8 - 1 = 7.

In the second line we added and subtracted 1, with the first 1 in the form 1/2 + 1/4 + 1/8 + 1/16 + .... This makes the series similar to the one analyzed on page 113. The number of instructions executed in the worst case, however, is quite large (131).

If number of leading zeros is available, adding 1 to a reversed integer can be done as follows:

x = x ^ ((int)0x80000000 >> nlz(~x));     // Signed (arithmetic) shift.
x = x ^ ~(0x7FFFFFFF >> nlz(~x));         // Unsigned (logical) shift.

Either method requires five full RISC instructions and, to properly wrap around from 0xFFFFFFFF to 0, requires that the shifts be modulo 64. (These formulas fail in this respect on the Intel x86 machines, because the shifts are modulo 32.)

The rather puzzling one-liner below [Möbi] increments a reversed integer in six basic RISC instructions. It is free of branches and loads but includes an integer division operation. It works for integers of length up to that of the word size of the machine, less 1.

revi = revi ^ (m - m/2/((i + 1) & -(i + 1)));

To use this, both the non-reversed integer i and its reversal revi must be available. The variable m is the modulus; if we are dealing with n-bit integers, then m = 2^n. Applying the formula gives the next value of the reversed integer. The non-reversed integer i would be incremented separately. The reversed integer is incremented “in place”; that is, it is not shifted to the high-order end of the register, as in the two preceding methods.

A variation is

revi = revi ^ (m - m/2/((i + 1) & ~i));

which executes in five instructions if the machine has and not, and if m is a constant so that the calculation of m / 2 does not count. It works for integers of length up to that of the word size of the machine. (For full word-size integers, use 0 for the first occurrence of m in the formula, and 2^(n-1) for m / 2.)

7–2 Shuffling Bits

Another important permutation of the bits of a word is the “perfect shuffle” operation, which has applications in cryptography. There are two varieties, called the “outer” and “inner” perfect shuffles. They both interleave the bits in the two halves of a word in a manner similar to a perfect shuffle of a deck of 32 cards, but they differ in which card is allowed to fall first. In the outer perfect shuffle, the outer (end) bits remain in the outer positions, and in the inner perfect shuffle, bit 15 moves to the left end of the word (position 31). If the 32-bit word is (where each letter denotes a single bit)

abcd efgh ijkl mnop ABCD EFGH IJKL MNOP,

then after the outer perfect shuffle it is

aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP,

and after the inner perfect shuffle it is

AaBb CcDd EeFf GgHh IiJj KkLl MmNn OoPp.

Assume the word size W is a power of 2. Then the outer perfect shuffle operation can be accomplished with basic RISC instructions in log2(W / 2) steps, where each step swaps the second and third quartiles of successively smaller pieces [GLS1]. That is, a 32-bit word is transformed as follows:

abcd efgh ijkl mnop ABCD EFGH IJKL MNOP
abcd efgh ABCD EFGH ijkl mnop IJKL MNOP
abcd ABCD efgh EFGH ijkl IJKL mnop MNOP
abAB cdCD efEF ghGH ijIJ klKL mnMN opOP
aAbB cCdD eEfF gGhH iIjJ kKlL mMnN oOpP

Straightforward code for this is


x = (x & 0x0000FF00) << 8 | (x >> 8) & 0x0000FF00 | x & 0xFF0000FF;
x = (x & 0x00F000F0) << 4 | (x >> 4) & 0x00F000F0 | x & 0xF00FF00F;
x = (x & 0x0C0C0C0C) << 2 | (x >> 2) & 0x0C0C0C0C | x & 0xC3C3C3C3;
x = (x & 0x22222222) << 1 | (x >> 1) & 0x22222222 | x & 0x99999999;

which requires 42 basic RISC instructions. This can be reduced to 30 instructions, although at an increase from 17 to 21 cycles on a machine with unlimited instruction-level parallelism, by using the exclusive or method of exchanging two fields of a register (described on page 47). All quantities are unsigned:

t = (x ^ (x >> 8)) & 0x0000FF00; x = x ^ t ^ (t << 8);
t = (x ^ (x >> 4)) & 0x00F000F0; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 2)) & 0x0C0C0C0C; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 1)) & 0x22222222; x = x ^ t ^ (t << 1);

The inverse operation, the outer unshuffle, is easily accomplished by performing the swaps in reverse order:


t = (x ^ (x >> 1)) & 0x22222222; x = x ^ t ^ (t << 1);
t = (x ^ (x >> 2)) & 0x0C0C0C0C; x = x ^ t ^ (t << 2);
t = (x ^ (x >> 4)) & 0x00F000F0; x = x ^ t ^ (t << 4);
t = (x ^ (x >> 8)) & 0x0000FF00; x = x ^ t ^ (t << 8);

Using only the last two steps of either of the above two shuffle sequences shuffles the bits of each byte separately. Using only the last three steps shuffles the bits of each halfword separately, and so on. Similar remarks apply to unshuffling, except by using the first two or three steps.

To get the inner perfect shuffle, prepend to these sequences a step to swap the left and right halves of the register:

x = (x >> 16) | (x << 16);

(or use a rotate of 16 bit positions). The unshuffle sequence can be similarly modified by appending this line of code.

Altering the transformation to swap the first and fourth quartiles of successively smaller pieces produces the bit reversal of the inner perfect shuffle of 2^n bits for odd n, and the bit reversal of the outer perfect shuffle for even n.

Perhaps worth mentioning is the special case in which the left half of the word x is all 0. In other words, we want to move the bits in the right half of x to every other bit position—that is, to transform the 32-bit word

0000 0000 0000 0000 ABCD EFGH IJKL MNOP

to

0A0B 0C0D 0E0F 0G0H 0I0J 0K0L 0M0N 0O0P.

The outer perfect shuffle code can be simplified to do this task in 22 basic RISC instructions. The code below, however, does it in only 19, at no cost in execution time on a machine with unlimited instruction-level parallelism (12 cycles with either method). This code does not require that the left half of word x be initially cleared.

x = ((x & 0xFF00) << 8) | (x & 0x00FF);
x = ((x << 4) | x) & 0x0F0F0F0F;
x = ((x << 2) | x) & 0x33333333;
x = ((x << 1) | x) & 0x55555555;

Similarly, for the inverse of this “half shuffle” operation (a special case of compress; see page 150), the outer perfect unshuffle code can be simplified to do the task in 26 or 29 basic RISC instructions, depending on whether or not an initial and operation is required to clear the bits in the odd positions. The code below, however, does it in only 18 or 21 basic RISC instructions, and with less execution time on a machine with unlimited instruction-level parallelism (12 or 15 cycles).

x = x & 0x55555555;          // (If required.)
x = ((x >> 1) | x) & 0x33333333;
x = ((x >> 2) | x) & 0x0F0F0F0F;
x = ((x >> 4) | x) & 0x00FF00FF;
x = ((x >> 8) | x) & 0x0000FFFF;

7–3 Transposing a Bit Matrix

The transpose of a matrix A is a matrix whose columns are the rows of A and whose rows are the columns of A. Here we consider the problem of computing the transpose of a bit matrix whose elements are single bits that are packed eight per byte, with rows and columns beginning on byte boundaries. This seemingly simple transformation is surprisingly costly in instructions executed.

On most machines it would be very slow to load and store individual bits, mainly due to the code that would be required to extract and (worse yet) to store individual bits. A better method is to partition the matrix into 8×8 submatrices, load each 8×8 submatrix into registers, compute the transpose of the submatrix in registers, and then store the 8×8 result in the appropriate place in the target matrix. Figure 7–5 illustrates the transposition of a bit matrix of size 16×24 bits (16 rows of three bytes). A, B, ..., F are submatrices of size 8×8 bits. AT, BT, ... denote the transposes of submatrices A, B, ....

A  B  C          AT  DT
D  E  F   ==>    BT  ET
                 CT  FT

FIGURE 7–5. Transposing a 16×24-bit matrix.

For the purposes of transposing an 8×8 submatrix, it doesn’t matter whether the bit matrix is stored in row-major or column-major order; the operations are the same in either event. Assume for discussion that it’s in row-major order. Then the first byte of the matrix contains the top row of A, the next byte contains the top row of B, and so on. If L denotes the address of the first byte (top row) of a submatrix, then successive rows of that submatrix are at locations L + n, L + 2n, ..., L + 7n, where n is the width of the whole matrix in bytes.

For this problem we will depart from the usual assumption of a 32-bit machine and assume the machine has 64-bit general registers. The algorithms are simpler and more easily understood in this way, and it is not difficult to convert them for execution on a 32-bit machine. In fact, a compiler that supports 64-bit integer operations on a 32-bit machine will do the work for you (although probably not as effectively as you can do by hand).

The overall scheme is to load a submatrix with eight load byte instructions and pack the bytes left-to-right into a 64-bit register. Then the transpose of the register’s contents is computed. Finally, the result is stored in the target area with eight store byte instructions.

The transposition of an 8×8 bit matrix is illustrated here, where each character represents a single bit.

0 1 2 3 4 5 6 7         0 8 g o w E M U
8 9 a b c d e f         1 9 h p x F N V
g h i j k l m n         2 a i q y G O W
o p q r s t u v   ==>   3 b j r z H P X
w x y z A B C D         4 c k s A I Q Y
E F G H I J K L         5 d l t B J R Z
M N O P Q R S T         6 e m u C K S $
U V W X Y Z $ .         7 f n v D L T .

In terms of doublewords, the transformation to be done is to change the first line to the second line below.

01234567 89abcdef ghijklmn opqrstuv wxyzABCD EFGHIJKL MNOPQRST UVWXYZ$.
08gowEMU 19hpxFNV 2aiqyGOW 3bjrzHPX 4cksAIQY 5dltBJRZ 6emuCKS$ 7fnvDLT.

Notice that the bit denoted by 1 moves seven positions to the right, the bit denoted by 2 moves 14 positions to the right, and the bit denoted by 8 moves seven positions to the left. Every bit moves 0, 7, 14, 21, 28, 35, 42, or 49 positions to the left or right. Since there are 56 bits in the doubleword that have to be moved and only 14 different nonzero movement amounts, an average of about four bits can be moved at once, with appropriate masking and shifting. Straightforward code for this follows.

y = x & 0x8040201008040201LL        |
   (x & 0x0080402010080402LL) <<  7 |
   (x & 0x0000804020100804LL) << 14 |
   (x & 0x0000008040201008LL) << 21 |
   (x & 0x0000000080402010LL) << 28 |
   (x & 0x0000000000804020LL) << 35 |
   (x & 0x0000000000008040LL) << 42 |
   (x & 0x0000000000000080LL) << 49 |
   (x >>  7) & 0x0080402010080402LL |
   (x >> 14) & 0x0000804020100804LL |
   (x >> 21) & 0x0000008040201008LL |
   (x >> 28) & 0x0000000080402010LL |
   (x >> 35) & 0x0000000000804020LL |
   (x >> 42) & 0x0000000000008040LL |
   (x >> 49) & 0x0000000000000080LL;

This executes in 43 instructions on the basic RISC, exclusive of mask generation (which is not important in the application of transposing a large bit matrix, because the masks are loop constants). Rotate shifts do not help. Some of the terms are of the form (x & mask)<< s, and some are of the form (x >> s)& mask. This reduces the number of masks required; the last seven are repeats of earlier masks. Notice that each mask after the first can be generated from the first with one shift right instruction. Because of this, it is a simple matter to write a more compact version of the code that uses a for-loop that is executed seven times.

Another variation is to employ Steele’s method of using exclusive or to swap bit fields (described on page 47). That technique does not help much in this application. It results in a function that executes in 42 instructions, exclusive of mask generation. The code starts out

t = (x ^ (x >> 7)) & 0x0080402010080402LL;
x = x ^ t ^ (t << 7);

and there are seven such pairs of lines.

Although there does not seem to be a really great algorithm for this problem, the method to be described beats the straightforward method and its variations described above by approximately a factor of 2 on the basic RISC, for the calculation part (not counting loading and storing the submatrices or generating masks). The method gets its power from its high level of bit-parallelism. It would not be a good method if the matrix elements are words. For that, you can’t do better than loading each word and storing it where it goes.

First, treat the 8×8-bit matrix as 16 2×2-bit matrices and transpose each of the 16 2×2-bit matrices. Then treat the matrix as four 2×2 submatrices whose elements are 2×2-bit matrices and transpose each of the four 2×2 submatrices. Finally, treat the matrix as a 2×2 matrix whose elements are 4×4-bit matrices and transpose the 2×2 matrix. These transformations are illustrated below [Floyd].

[Illustration missing in this copy of the text.]

A complete procedure is shown in Figure 7–6. Parameter A is the address of the first byte of an 8×8 submatrix of the source matrix, and parameter B is the address of the first byte of an 8×8 submatrix in the target matrix. Parameters m and n are the row strides, in bytes, of the source and target matrices, respectively.

The calculation part of this function executes in 21 instructions. Each of the three major steps is swapping bits, so a version can be written that uses the Steele exclusive or bit field swapping device. Using it, the first assignment to x in Figure 7–6 becomes:

t = (x ^ (x >> 7)) & 0x00AA00AA00AA00AALL;
x = x ^ t ^ (t << 7);

The calculation part of the revised function executes in only 18 instructions, but it has little instruction-level parallelism.

The algorithm of Figure 7–6 runs from fine to coarse granularity, based on the lengths of the groups of bits that are swapped. The method can also be run from coarse to fine granularity. To do this, first treat the 8×8-bit matrix as a 2×2 matrix whose elements are 4×4-bit matrices and transpose the 2×2 matrix. Then, treat each of the four 4×4 submatrices as a 2×2 matrix whose elements are 2×2-bit matrices, and transpose each of the four 2×2 submatrices, and so forth. The code for this is the same as that of Figure 7–6 except for the three assignments that do the bit rearranging being run in reverse order.



void transpose8(unsigned char* A, int m, int n,
                unsigned char* B) {
   unsigned long long x;
   int i;

   x = 0;
   for (i = 0; i <= 7; i++)    // Load 8 bytes from the
      x = x << 8 | A[m*i];     // input array and pack
                               // them into x.

   x = x & 0xAA55AA55AA55AA55LL        |
      (x & 0x00AA00AA00AA00AALL) <<  7 |
      (x >> 7) & 0x00AA00AA00AA00AALL;
   x = x & 0xCCCC3333CCCC3333LL        |
      (x & 0x0000CCCC0000CCCCLL) << 14 |
      (x >> 14) & 0x0000CCCC0000CCCCLL;
   x = x & 0xF0F0F0F00F0F0F0FLL        |
      (x & 0x00000000F0F0F0F0LL) << 28 |
      (x >> 28) & 0x00000000F0F0F0F0LL;

   for (i = 7; i >= 0; i--) {     // Store result into
      B[n*i] = x; x = x >> 8;}    // output array B.
}


FIGURE 7–6. Transposing an 8×8-bit matrix.

As was mentioned, these functions can be modified for execution on a 32-bit machine by using two registers for each 64-bit quantity. If this is done and any calculations that would result in zero are used to make obvious simplifications, the results are that a 32-bit version of the straightforward method described on page 143 runs in 74 instructions (compared to 43 on a 64-bit machine), and a 32-bit version of the function of Figure 7–6 runs in 36 instructions (compared to 21 on a 64-bit machine). Using Steele’s bit-swapping technique gives a reduction in instructions executed at the expense of instruction-level parallelism, as in the case of a 64-bit machine.

Transposing a 32×32-Bit Matrix

The same recursive technique that was used for the 8×8-bit matrix can be used for larger matrices. For a 32×32-bit matrix it takes five stages.

The details are quite different from Figure 7–6, because here we assume that the entire 32×32-bit matrix does not fit in the general register space, and we seek a compact procedure that indexes the appropriate words of the bit matrix to do the bit swaps. The algorithm to be described works best if run from coarse to fine granularity.

In the first stage, treat the matrix as four 16×16-bit matrices, and transform it as follows:

|A B|        |A C|
|C D|  ==>   |B D|

A denotes the left half of the first 16 words of the matrix, B denotes the right half of the first 16 words, and so on. It should be clear that the above transformation can be accomplished by the following swaps:

Right half of word 0 with the left half of word 16,
Right half of word 1 with the left half of word 17,
...
Right half of word 15 with the left half of word 31.

To implement this in code, we will have an index k that ranges from 0 to 15. In a loop controlled by k, the right half of word k will be swapped with the left half of word k + 16.

In the second stage, treat the matrix as 16 8×8-bit matrices, and transform it as follows:

|A B C D|        |A E C G|
|E F G H|        |B F D H|
|I J K L|  ==>   |I M K O|
|M N O P|        |J N L P|

This transformation can be accomplished by the following swaps:

Bits 0x00FF00FF of word 0 with bits 0xFF00FF00 of word 8,
Bits 0x00FF00FF of word 1 with bits 0xFF00FF00 of word 9, and so on.

This means that bits 0–7 (the least significant eight bits) of word 0 are swapped with bits 8–15 of word 8, and so on. The indexes of the first word in these swaps are k = 0, 1, 2, 3, 4, 5, 6, 7, 16, 17, 18, 19, 20, 21, 22, 23. A way to step k through these values is

k = (k + 9) & ~8

In the loop controlled by k, bits of word k are swapped with bits of word k + 8.

Similarly, the third stage does the following swaps:

Bits 0x0F0F0F0F of word 0 with bits 0xF0F0F0F0 of word 4,
Bits 0x0F0F0F0F of word 1 with bits 0xF0F0F0F0 of word 5, and so on.

The indexes of the first word in these swaps are k = 0, 1, 2, 3, 8, 9, 10, 11, 16, 17, 18, 19, 24, 25, 26, 27. A way to step k through these values is

k = (k + 5) & ~4

In the loop controlled by k, bits of word k are swapped with bits of word k + 4.

These considerations are coded rather compactly in the C function shown in Figure 7–7 [GLS1]. The outer loop controls the five stages, with j taking on the values 16, 8, 4, 2, and 1. It also steps the mask m through the values 0x0000FFFF, 0x00FF00FF, 0x0F0F0F0F, 0x33333333, and 0x55555555. (The code for this, m = m ^ (m << j), is a nice little trick. It does not have an inverse, which is the main reason this code works best for coarse to fine transformations.) The inner loop steps k through the values described above. The inner loop body swaps the bits of a[k] identified by mask m with the bits of a[k+j] shifted right j and identified by m, which is equivalent to the bits of a[k+j] identified with the complement of m. The code for performing these swaps is an adaptation of the “three exclusive or” technique shown on page 46 column (c).


void transpose32(unsigned A[32]) {
   int j, k;
   unsigned m, t;

   m = 0x0000FFFF;
   for (j = 16; j != 0; j = j >> 1, m = m ^ (m << j)) {
      for (k = 0; k < 32; k = (k + j + 1) & ~j) {
           t = (A[k] ^ (A[k+j] >> j)) & m;
           A[k] = A[k] ^ t;
           A[k+j] = A[k+j] ^ (t << j);
      }
   }
}


FIGURE 7–7. Compact code for transposing a 32×32-bit matrix.
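For a quick functional check, Figure 7–7 can be exercised on simple patterns. The conventions assumed here are that word k holds row k and that column j is bit 31 − j (column 0 at the most significant end):

```c
/* Figure 7-7 reproduced verbatim, so the checks below can be run.
   Word k holds row k; column j is bit 31-j. */
void transpose32(unsigned A[32]) {
   int j, k;
   unsigned m, t;

   m = 0x0000FFFF;
   for (j = 16; j != 0; j = j >> 1, m = m ^ (m << j)) {
      for (k = 0; k < 32; k = (k + j + 1) & ~j) {
           t = (A[k] ^ (A[k+j] >> j)) & m;
           A[k] = A[k] ^ t;
           A[k+j] = A[k+j] ^ (t << j);
      }
   }
}
```

Transposing a matrix whose first row is all 1’s must yield a matrix whose first column is all 1’s, and transposing twice must restore the original.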

When compiled with the GNU C compiler for a machine very similar to the basic RISC, this function amounts to 31 instructions, with 20 in the inner loop and 7 in the outer loop but not in the inner loop. Thus, it executes in 4 + 5(7 + 16 · 20) = 1639 instructions. In contrast, if this function were performed using 16 calls on the 8×8 transpose program of Figure 7–6 (modified to run on a 32-bit machine), then it would take 16(101 + 5) = 1696 instructions, assuming the 16 calls are “strung out.” This includes five instructions for each function call (observed in compiled code). Therefore, the two methods are, on the surface anyway, very nearly equal in execution time.

On the other hand, for a 64-bit machine the code of Figure 7–7 can easily be modified to transpose a 64×64-bit matrix, and it would take about 4 + 6(7 + 32 · 20) = 3886 instructions. Doing the job with 64 executions of the 8×8 transpose method would take about 64(85 + 5) = 5760 instructions.

The algorithm works in place, and thus if it is used to transpose a larger matrix, additional steps are required to move 32×32-bit submatrices. It can be made to put the result matrix in an area distinct from the source matrix by separating out either the first or last execution of the “for j-loop” and having it store the result in the other area.

About half the instructions executed by the function of Figure 7–7 are for loop control, and the function loads and stores the entire matrix five times. Would it be reasonable to reduce this overhead by unrolling the loops? It would, if you are looking for the ultimate in speed, if memory space is not a problem, if your machine’s I-fetching can keep up with a large block of straight-line code, and especially if the branches or loads are costly in execution time. The bulk of the program will be the six instructions that do the bit swaps repeated 80 times (5 · 16). In addition, the program will need 32 load instructions to load the source matrix and 32 store instructions to store the result, for a total of at least 544 instructions.

Figure 7–8 outlines a program in which the unrolling is done by hand. This program is shown as not working in place, but it executes correctly in place, if that is desired, by invoking it with identical arguments. The number of “swap” lines is 80. Our GNU C compiler for the basic RISC machine compiles this into 576 instructions (branch-free, except for the function return), counting prologs and epilogs. This machine does not have the store multiple and load multiple instructions, but it can save and restore registers two at a time with store double and load double instructions.



#define swap(a0, a1, j, m) t = (a0 ^ (a1 >> j)) & m; \
                           a0 = a0 ^ t;              \
                           a1 = a1 ^ (t << j);

void transpose32(unsigned A[32], unsigned B[32]) {
   unsigned m, t;
   unsigned a0, a1, a2, a3, a4, a5, a6, a7,
            a8, a9, a10, a11, a12, a13, a14, a15,
            a16, a17, a18, a19, a20, a21, a22, a23,
            a24, a25, a26, a27, a28, a29, a30, a31;

   a0 = A[ 0]; a1 = A[ 1]; a2 = A[ 2]; a3 = A[ 3];
   a4 = A[ 4]; a5 = A[ 5]; a6 = A[ 6]; a7 = A[ 7];
   . . .
   a28 = A[28]; a29 = A[29]; a30 = A[30]; a31 = A[31];

   m = 0x0000FFFF;
   swap(a0, a16, 16, m)
   swap(a1, a17, 16, m)
   . . .
   swap(a15, a31, 16, m)
   m = 0x00FF00FF;
   swap(a0, a8, 8, m)
   swap(a1, a9, 8, m)
   . . .
   . . .
   swap(a28, a29, 1, m)
   swap(a30, a31, 1, m)

   B[ 0] = a0;   B[ 1] = a1;   B[ 2] = a2;   B[ 3] = a3;
   B[ 4] = a4;   B[ 5] = a5;   B[ 6] = a6;   B[ 7] = a7;
   . . .
   B[28] = a28;  B[29] = a29;  B[30] = a30;  B[31] = a31;
}


FIGURE 7–8. Straight-line code for transposing a 32×32-bit matrix.

There is a way to squeeze a little more performance out of this if your machine has a rotate shift instruction (either left or right). The idea is to replace all the swap operations of Figure 7–8, which take six instructions each, with simpler swaps that do not involve a shift, which take four instructions each (use the swap macro given, with the shifts omitted).
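The shiftless swap can be sketched as a macro in the style of the swap macro of Figure 7–8 (the macro name swap_ns is ours; as with that macro, a scratch variable t must be in scope):

```c
/* Swap the bits of a0 and a1 selected by mask m, with no shifts:
   three instructions once m is loaded (macro name swap_ns is ours). */
#define swap_ns(a0, a1, m) t = (a0 ^ a1) & m; \
                           a0 = a0 ^ t;       \
                           a1 = a1 ^ t;
```

For example, with m = 0x0000FFFF this exchanges the right halves of the two words.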

First, rotate right words A[16..31] (that is, A[k] for 16 ≤ k ≤ 31) by 16 bit positions. Second, swap the right halves of A[0] with A[16], A[1] with A[17], and so on, similarly to the code of Figure 7–8. Third, rotate right words A[8..15] and A[24..31] by eight bit positions, and then swap the bits indicated by a mask of 0x00FF00FF in words A[0] and A[8], A[1] and A[9], and so on, as in the code of Figure 7–8. After five stages of this, you don’t quite have the transpose. Finally, you have to rotate left word A[1] by one bit position, A[2] by two bit positions, and so on (31 instructions). We do not show the code, but the steps are illustrated below for a 4×4-bit matrix.

Image

The bit-rearranging part of the program of Figure 7–8 requires 480 instructions (80 swaps at six instructions each). The revised program, using rotate instructions, requires 80 swaps at four instructions each, plus 80 rotate instructions (16 · 5) for the first five stages, plus a final 31 rotate instructions, for a total of 431 instructions. The prolog and epilog code would be unchanged, so using rotate instructions in this way saves 49 instructions.

There is another quite different method of transposing a bit matrix: apply three shearing transformations [GLS1]. If the matrix is n×n, the steps are (1) rotate row i to the right i bit positions, (2) rotate column j upwards (j + 1) mod n bit positions, (3) rotate row i to the right (i + 1) mod n bit positions, and (4) reflect the matrix about a horizontal axis through the midpoint. To illustrate, for a 4×4-bit matrix:

Image

This method is not quite competitive with the others, because step (2) is costly. (To do it at reasonable cost, rotate upward by n/2 bit positions all columns that rotate by n/2 or more bit positions [these are columns n/2 − 1 through n − 2], then rotate certain columns upward n/4 bit positions, and so on.) Steps 1 and 3 require only n − 1 instructions each, and step 4 requires no instructions at all if the results are simply stored to the appropriate locations.

If an 8×8-bit matrix is stored in a 64-bit word in the obvious way (top row in the most significant eight bits, and so on), then the matrix transpose operation is equivalent to three outer perfect shuffles or unshuffles [GLS1]. This is a very good way to do it if your machine has shuffle or unshuffle as a single instruction, but it is not a good method on a basic RISC machine.

7–4 Compress, or Generalized Extract

The APL language includes an operation called compress, written B/V, where B is a Boolean vector and V is a vector of the same length as B, with arbitrary elements. The result of the operation is a vector consisting of the elements of V for which the corresponding bit in B is 1. The length of the result vector is equal to the number of 1’s in B.

Here we consider a similar operation on the bits of a word. Given a mask m and a word x, the bits of x for which the corresponding mask bit is 1 are selected and moved (“compressed”) to the right. For example, if the word to be compressed is (where each letter denotes a single bit)

abcd efgh ijkl mnop qrst uvwx yzAB CDEF,

and the mask is

0000 1111 0011 0011 1010 1010 0101 0101,

then the result is

0000 0000 0000 0000 efgh klop qsuw zBDF.

This operation might also be called generalized extract, by analogy with the extract instruction found on many computers.

We are interested in code for this operation with minimum worst-case execution time, and offer the simple loop of Figure 7–9 as a straw man to be improved upon. This code has no branches in the loop, and it executes in 260 instructions worst case, including the subroutine prolog and epilog.


unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit.

   r = 0;
   s = 0;
   do {
      b = m & 1;
      r = r | ((x & b) << s);
      s = s + b;
      x = x >> 1;
      m = m >> 1;
   } while (m != 0);
   return r;
}


FIGURE 7–9. A simple loop for the compress operation.
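A few sanity checks of this loop, with the function reproduced from Figure 7–9; the first check uses the mask from the example above, whose population count is 16:

```c
/* The straw-man compress loop of Figure 7-9, reproduced so the
   checks below can be run. */
unsigned compress(unsigned x, unsigned m) {
   unsigned r, s, b;    // Result, shift, mask bit.

   r = 0;
   s = 0;
   do {
      b = m & 1;
      r = r | ((x & b) << s);
      s = s + b;
      x = x >> 1;
      m = m >> 1;
   } while (m != 0);
   return r;
}
```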

It is possible to improve on this by repeatedly using the parallel suffix method (see page 97) with the exclusive or operation [GLS1]. We will denote the parallel suffix operation by PS-XOR. The basic idea is to first identify the bits of argument x that are to be moved right an odd number of bit positions, and move those. (This operation is simplified if x is first anded with the mask, to clear out irrelevant bits.) Mask bits are moved in the same way. Next, we identify the bits of x that are to be moved an odd multiple of 2 positions (2, 6, 10, and so on), and then we move these bits of x and the mask. Next, we identify and move the bits that are to be moved an odd multiple of 4 positions, then those that move an odd multiple of 8, and then those that move 16 bit positions.

Because this algorithm, believed to be original with [GLS1], is a bit difficult to understand, and because it is perhaps surprising that something along these lines can be done at all, we will describe its operation in some detail. Suppose the inputs are


x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF,
m = 1000 1000 1110 0000 0000 1111 0101 0101,
    1    1    111
    9    6    333            4444  3 2  1 0

where each letter in x represents a single bit (with value 0 or 1). The numbers below each 1-bit in the mask m denote how far the corresponding bit of x must move to the right. This is the number of 0’s in m to the right of the bit. As mentioned above, it is convenient to first clear out the irrelevant bits of x, giving

x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F.

The plan is to first determine which bits move an odd number of positions (to the right), and move those one bit position. Recall that the PS-XOR operation results in a 1-bit at each position where the number of 1’s at and to the right of that position is odd. We wish to identify those bits for which the number of 0’s strictly to the right is odd. This can be done by computing mk = ~m << 1 and performing PS-XOR on the result. This gives

mk = 1110 1110 0011 1111 1110 0001 0101 0100,
mp = 1010 0101 1110 1010 1010 0000 1100 1100.

Observe that mk identifies the bits of m that have a 0 immediately to the right, and mp sums these, modulo 2, from the right. Thus, mp identifies the bits of m that have an odd number of 0’s to the right.

The bits that will be moved one position are those that are in positions that have an odd number of 0’s strictly to the right (identified by mp) and that have a 1-bit in the original mask. This is simply mv = mp & m:

mv = 1000 0000 1110 0000 0000 0000 0100 0100.

These bits of m can be moved with the assignment

m = (m ^ mv) | (mv >> 1);

and the same bits of x can be moved with the two assignments

t = x & mv;
x = (x ^ t) | (t >> 1);

(Moving the bits of m is simpler because all the selected bits are 1’s.) Here the exclusive or is turning off bits known to be 1 in m and x, and the or is turning on bits known to be 0 in m and x. The operations could also, alternatively, both be exclusive or, or subtract and add, respectively. The results, after moving the bits selected by mv right one position, are:

m = 0100 1000 0111 0000 0000 1111 0011 0011,
x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF.

Now we must prepare a mask for the second iteration, in which we identify bits that are to move an odd multiple of 2 positions to the right. Notice that the quantity mk & ~mp identifies those bits that have a 0 immediately to the right and an even number of 0’s to the right, in the original mask m. These properties apply jointly, although not individually, to the revised mask m. (That is to say, the quantity assigned to mk below identifies all the positions in the revised mask m that have a 0 to the immediate right and an even number of 0’s to the right.) This is the quantity that, if summed from the right with PS-XOR, identifies those bits that move to the right an odd multiple of 2 positions (2, 6, 10, and so on). Therefore, the procedure is to assign this quantity to mk and perform a second iteration of the above steps. The revised value of mk is

mk = 0100 1010 0001 0101 0100 0001 0001 0000.

A complete C function for this operation is shown in Figure 7–10. It does the job in 127 basic RISC instructions (constant), including the subroutine prolog and epilog. Figure 7–11 shows the sequence of values taken on by certain variables at key points in the computation, with the same inputs that were used in the discussion above. Observe that a by-product of the algorithm, in the last value assigned to m, is the original m with all its 1-bits compressed to the right.



unsigned compress(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;       // Clear irrelevant bits.
   mk = ~m << 1;    // We will count 0's to right.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);            // Parallel suffix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;                    // Bits to move.
      m = m ^ mv | (mv >> (1 << i));  // Compress m.
      t = x & mv;
      x = x ^ t | (t >> (1 << i));    // Compress x.
      mk = mk & ~mp;
   }
   return x;
}


FIGURE 7–10. Parallel suffix method for the compress operation.

We calculate that the algorithm of Figure 7–10 would execute in 169 instructions on a 64-bit basic RISC, as compared to 516 (worst case) for the algorithm of Figure 7–9.

The number of instructions required by the algorithm of Figure 7–10 can be reduced substantially if the mask m is a constant. This can occur in two situations: (1) a call to “compress(x, m)” occurs in a loop, in which the value of m is not known, but it is a loop constant, and (2) the value of m is known, and the code for compress is generated in advance, perhaps by a compiler.

Notice that the value assigned to x in the loop in Figure 7–10 is not used in the loop for anything other than the assignment to x. And x is dependent only on itself and variable mv. Therefore, the subroutine can be coded with all references to x deleted, and the five values computed for mv can be saved in variables mv0, mv1, ..., mv4. Then, in situation (1) the function without references to x can be placed outside the loop in which “compress(x, m)” occurs, and the following statements can be placed in the loop:

x = x & m;
t = x & mv0; x = x ^ t | (t >> 1);
t = x & mv1; x = x ^ t | (t >> 2);
t = x & mv2; x = x ^ t | (t >> 4);
t = x & mv3; x = x ^ t | (t >> 8);
t = x & mv4; x = x ^ t | (t >> 16);

This is only 21 instructions in the loop (the loading of the constants can be placed outside the loop), a considerable improvement over the 127 required by the full subroutine of Figure 7–10.


           x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
           m = 1000 1000 1110 0000 0000 1111 0101 0101
           x = a000 e000 ijk0 0000 0000 uvwx 0z0B 0D0F

i = 0,    mk = 1110 1110 0011 1111 1110 0001 0101 0100
After PS, mp = 1010 0101 1110 1010 1010 0000 1100 1100
          mv = 1000 0000 1110 0000 0000 0000 0100 0100
           m = 0100 1000 0111 0000 0000 1111 0011 0011
           x = 0a00 e000 0ijk 0000 0000 uvwx 00zB 00DF

i = 1,    mk = 0100 1010 0001 0101 0100 0001 0001 0000
After PS, mp = 1100 0110 0000 1100 1100 0000 1111 0000
          mv = 0100 0000 0000 0000 0000 0000 0011 0000
           m = 0001 1000 0111 0000 0000 1111 0000 1111
           x = 000a e000 0ijk 0000 0000 uvwx 0000 zBDF

i = 2,    mk = 0000 1000 0001 0001 0000 0001 0000 0000
After PS, mp = 0000 0111 1111 0000 1111 1111 0000 0000
          mv = 0000 0000 0111 0000 0000 1111 0000 0000
           m = 0001 1000 0000 0111 0000 0000 1111 1111
           x = 000a e000 0000 0ijk 0000 0000 uvwx zBDF

i = 3,    mk = 0000 1000 0000 0001 0000 0000 0000 0000
After PS, mp = 0000 0111 1111 1111 0000 0000 0000 0000
          mv = 0000 0000 0000 0111 0000 0000 0000 0000
           m = 0001 1000 0000 0000 0000 0111 1111 1111
           x = 000a e000 0000 0000 0000 0ijk uvwx zBDF

i = 4,    mk = 0000 1000 0000 0000 0000 0000 0000 0000
After PS, mp = 1111 1000 0000 0000 0000 0000 0000 0000
          mv = 0001 1000 0000 0000 0000 0000 0000 0000
           m = 0000 0000 0000 0000 0001 1111 1111 1111
           x = 0000 0000 0000 0000 000a eijk uvwx zBDF


FIGURE 7–11. Operation of the parallel suffix method for the compress operation.

In situation (2), in which the value of m is known, the same sort of thing can be done, and further optimization may be possible. It might happen that one of the five masks is 0, in which case one of the five lines shown above can be omitted. For example, mask mv0 is 0 if it happens that no bit moves an odd number of positions, and mv4 is 0 if no bit moves more than 15 positions, and so on.

As an example, for

m = 0101 0101 0101 0101 0101 0101 0101 0101,

the calculated masks are

mv0 = 0100 0100 0100 0100 0100 0100 0100 0100
mv1 = 0011 0000 0011 0000 0011 0000 0011 0000
mv2 = 0000 1111 0000 0000 0000 1111 0000 0000
mv3 = 0000 0000 1111 1111 0000 0000 0000 0000
mv4 = 0000 0000 0000 0000 0000 0000 0000 0000

Because the last mask is 0, in the compiled code situation this compression operation is done in 17 instructions (not counting the loading of the masks). This is not quite as good as the code shown for this operation on page 141 (13 instructions, not counting the loading of masks), which takes advantage of the fact that alternate bits are being selected.
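The precomputation for a known or loop-constant mask can be sketched as follows; compress_masks is Figure 7–10 with all references to x deleted, and compress_const is the five-statement sequence shown earlier (both function names and the mv array are ours). For m = 0x55555555 it reproduces the masks listed above:

```c
/* Compute the five move masks for a known or loop-constant mask m.
   This is Figure 7-10 with the x computations deleted; the function
   name and the mv array are ours. */
void compress_masks(unsigned m, unsigned mv[5]) {
   unsigned mk, mp;
   int i;

   mk = ~m << 1;                      // We will count 0's to right.
   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);            // Parallel suffix.
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv[i] = mp & m;                 // Bits to move.
      m = m ^ mv[i] | (mv[i] >> (1 << i));  // Compress m.
      mk = mk & ~mp;
   }
}

/* The five-statement sequence shown earlier, using the saved masks. */
unsigned compress_const(unsigned x, unsigned m, const unsigned mv[5]) {
   unsigned t;

   x = x & m;
   t = x & mv[0];  x = x ^ t | (t >> 1);
   t = x & mv[1];  x = x ^ t | (t >> 2);
   t = x & mv[2];  x = x ^ t | (t >> 4);
   t = x & mv[3];  x = x ^ t | (t >> 8);
   t = x & mv[4];  x = x ^ t | (t >> 16);
   return x;
}
```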

Using Insert and Extract

If your computer has the insert instruction, preferably with immediate values for the operands that identify the bit field in the target register, then in the compiled situation insert can often be used to do the compress operation with fewer instructions than the methods discussed above. Furthermore, it doesn’t tie up registers holding the masks.

The target register is initialized to 0, and then, for each contiguous group of 1’s in the mask m, variable x is shifted right to right-justify the next field, and the insert instruction is used to insert the bits of x in the appropriate place in the target register. This does the operation in 2n + 1 instructions, where n is the number of fields (groups of consecutive 1’s) in the mask. The worst case is 33 instructions, because the maximum number of fields is 16 (which occurs for alternating 1’s and 0’s).

An example in which the insert method uses substantially fewer instructions is m = 0x0010084A. Compressing with this mask requires moving bits 1, 2, 4, 8, and 16 positions. Thus, it takes the full 21 instructions for the parallel suffix method, but only 11 instructions for the insert method (there are five fields). A more extreme case is m = 0x80000000. Here a single bit moves 31 positions, requiring 21 instructions for the parallel suffix method, but only three instructions for the insert method and only one instruction (shift right 31) if you are not constrained to any particular scheme.

You can also use the extract instruction in various simple ways to do the compress operation with a known mask in 3n – 2 instructions, where n is the number of fields in the mask.

Clearly, the problem of compiling optimal code for the compress operation with a known mask is a difficult one.

Compress Left

To compress bits to the left, obviously you can reverse the argument x and the mask, compress right, and reverse the result. Another way is to compress right and then shift left by pop(~m), the number of 0’s in m. These might be satisfactory if your computer has an instruction for bit reversal or population count, but if not, the algorithm of Figure 7–10 is easily adapted: Just reverse the direction of all the shifts except the two in the expressions 1 << i (eight to change).
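That adaptation might look like this (the function name compress_left is ours; exactly eight shifts are reversed relative to Figure 7–10):

```c
/* Compress left: Figure 7-10 with the direction of all shifts
   reversed except the two in the expressions 1 << i. The function
   name is ours. */
unsigned compress_left(unsigned x, unsigned m) {
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;       // Clear irrelevant bits.
   mk = ~m >> 1;    // We will count 0's to left.

   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk >> 1);            // Parallel prefix.
      mp = mp ^ (mp >> 2);
      mp = mp ^ (mp >> 4);
      mp = mp ^ (mp >> 8);
      mp = mp ^ (mp >> 16);
      mv = mp & m;                    // Bits to move.
      m = m ^ mv | (mv << (1 << i));  // Compress m leftward.
      t = x & mv;
      x = x ^ t | (t << (1 << i));    // Compress x leftward.
      mk = mk & ~mp;
   }
   return x;
}
```

The selected bits end up left-justified, in order, with 0’s filling the low-order positions.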

The BESM-6 computer (ca. 1967) had an instruction for the compress left function (“Pack Bits in A Masked by X”) and its inverse (“Unpack ...”), which operated on the machine’s 48-bit registers. These instructions are not easy to implement. It is surmised by cryptography experts that their only use was for breaking US codes [Knu8]. The BESM-6 also had the population count instruction, which, as has been noted, seems to be important to the National Security Agency.

7–5 Expand, or Generalized Insert

The inverse of the compress right function moves bits from the low-order end of a register to positions given by a mask, while keeping the bits in order. For example, expand(0000abcd, 10011010) = a00bc0d0. Thus

compress(expand(x, m), m) = x.

This function has also been called unpack, scatter, and deposit.

It can be obtained by running the code of Figure 7–10 in reverse [Allen]. To avoid overwriting bits in x, it is necessary to move (to the left) the bits that move a large distance first, and to move those that move only one position last. This means that the first five “move” quantities (mv in the code) must be computed, saved, and used in the reverse of the order in which they were computed. For many applications this is not a problem, because these applications apply the same mask m to large amounts of data, and so they would compute the move quantities in advance and reuse them anyway.

The code is shown in Figure 7–12. It executes approximately 168 basic RISC instructions (constant), including five stores and five loads. A 64-bit version for a 64-bit machine would execute approximately 200 instructions.

For a machine that does not have the and not instruction, the MUX operation in the second loop can be coded in one fewer instruction with

x = ((x ^ t) & mv) ^ x;
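The equivalence of the two forms can be checked exhaustively on 4-bit operands, which suffices because the expression operates on each bit position independently:

```c
#include <assert.h>

/* Check that ((x ^ t) & mv) ^ x equals the and-not form
   (x & ~mv) | (t & mv): both select t where mv is 1 and
   x where mv is 0. */
void check_mux_identity(void) {
   unsigned x, t, mv;

   for (x = 0; x < 16; x++)
      for (t = 0; t < 16; t++)
         for (mv = 0; mv < 16; mv++)
            assert((((x ^ t) & mv) ^ x) ==
                   ((x & ~mv) | (t & mv)));
}
```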


unsigned expand(unsigned x, unsigned m) {
  unsigned m0, mk, mp, mv, t;
  unsigned array[5];
  int i;

  m0 = m;        // Save original mask.
  mk = ~m << 1;  // We will count 0's to right.

  for (i = 0; i < 5; i++) {
     mp = mk ^ (mk << 1);             // Parallel suffix.
     mp = mp ^ (mp << 2);
     mp = mp ^ (mp << 4);
     mp = mp ^ (mp << 8);
     mp = mp ^ (mp << 16);
     mv = mp & m;                     // Bits to move.
     array[i] = mv;
     m = (m ^ mv) | (mv >> (1 << i)); // Compress m.
     mk = mk & ~mp;
  }

  for (i = 4; i >= 0; i--) {
     mv = array[i];
     t = x << (1 << i);
     x = (x & ~mv) | (t & mv);
  }
  return x & m0;    // Clear out extraneous bits.
}


FIGURE 7–12. Parallel suffix method for the expand operation.
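The inverse relation compress(expand(x, m), m) = x can be exercised by pairing Figure 7–12 with the compress function of Figure 7–10, both reproduced here:

```c
unsigned compress(unsigned x, unsigned m) {   // Figure 7-10.
   unsigned mk, mp, mv, t;
   int i;

   x = x & m;
   mk = ~m << 1;
   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;
      m = m ^ mv | (mv >> (1 << i));
      t = x & mv;
      x = x ^ t | (t >> (1 << i));
      mk = mk & ~mp;
   }
   return x;
}

unsigned expand(unsigned x, unsigned m) {     // Figure 7-12.
   unsigned m0, mk, mp, mv, t;
   unsigned array[5];
   int i;

   m0 = m;
   mk = ~m << 1;
   for (i = 0; i < 5; i++) {
      mp = mk ^ (mk << 1);
      mp = mp ^ (mp << 2);
      mp = mp ^ (mp << 4);
      mp = mp ^ (mp << 8);
      mp = mp ^ (mp << 16);
      mv = mp & m;
      array[i] = mv;
      m = (m ^ mv) | (mv >> (1 << i));
      mk = mk & ~mp;
   }
   for (i = 4; i >= 0; i--) {
      mv = array[i];
      t = x << (1 << i);
      x = (x & ~mv) | (t & mv);
   }
   return x & m0;
}
```

With abcd = 1100, the 8-bit example at the start of this section gives expand(0x0C, 0x9A) = 0x90.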

7–6 Hardware Algorithms for Compress and Expand

This section gives hardware-oriented algorithms for the compress right function and its inverse [Zadeck]. Like the algorithms of the preceding sections, their execution times are proportional to the log of the computer’s word size. They are suitable for implementation in hardware, but do not yield fast code if implemented in basic RISC instructions. We simply describe how they work without giving C or machine code.

Compress

To illustrate the operation of the algorithm, we represent each bit of x with a letter and consider a specific example mask m, shown below.


Input x =      abcd efgh ijkl mnop qrst uvwx yzAB CDEF
Mask m =       0111 1110 0110 1100 1010 1111 0011 0010

The algorithm works in log2(W) “phases,” where W is the computer’s word size in bits. Each phase operates in parallel on “pockets” of size 2^n bits, for n ranging from 1 to log2(W). At the end of each phase, each pocket of x contains the original pocket of x with the bits selected by that pocket of m compressed to the right. Each pocket of m will contain an integer that is the number of 0-bits in that pocket of the original m. This is equal to the number of bits of x that are not compressed to the right. They are the known leading 0-bits in the pocket of x.

In each phase, the algorithm performs the following steps, in parallel, on each pocket of x and m, where w is the pocket size in bits.

1. Set L = the left half of the pocket of x, extended with w / 2 0-bits on the right.

2. Shift L (all w bits) right by the amount given in the right half of the corresponding pocket of m, inserting 0’s on the left. No 1’s will be shifted out on the right, because the maximum shift amount is w / 2.

3. Set R = w / 2 0-bits followed by the right half of the pocket of x.

4. Replace the entire w-bit pocket of x with the or of R and the shifted L.

5. Add the left and right halves of the pocket of m, and replace the entire pocket with the sum.

To apply these steps to the first phase (w = 2) would require first and’ing x with m, to clear out irrelevant bits of x, and complementing m so that each bit of m is the number of 0-bits in each 1-bit half pocket. It is simpler to make an exception of the first phase, and combine these steps with the first compression operation by applying the logic shown in the table below to each 2-bit pocket of x and m.

    m    x     New x   New m
    00   ab     00      10
    01   ab     0b      01
    10   ab     0a      01
    11   ab     ab      00

The third line, for example, has m = 10 (binary). This means that the left bit of x is selected to be part of the result, but the right bit is not. Thus, the left bit (a) is compressed to the right. The other bit of x is cleared, which ensures that in the final result, all the high-order (not selected) bits will be 0.

Applying this logic to the original x and m gives:

Bit pairs, x = 0bcd ef0g 0j0k mn00 0q0s uvwx 00AB 000E
           m = 0100 0001 0101 0010 0101 0000 1000 1001

In the second phase, consider for example the second nibble above (ef0g). The quantities L = ef00 and R = 000g are formed. L is shifted right by one position (given by the right half of the nibble of m), giving 0ef0. This is or’ed with R, giving 0efg as the new value of the nibble. The left and right halves of m are added, giving 0001 (no change).

Nibbles,   x = 0bcd 0efg 00jk 00mn 00qs uvwx 00AB 000E
           m = 0001 0001 0010 0010 0010 0000 0010 0011

Similarly, for the third, fourth, and fifth phases, each byte, halfword, and word of x are compressed, and m is updated, as follows:

Bytes,     x = 00bc defg 0000 jkmn 00qs uvwx 0000 0ABE
           m = 0000 0010 0000 0100 0000 0010 0000 0101

Halfwords, x = 0000 00bc defg jkmn 0000 000q suvw xABE
           m = 0000 0000 0000 0110 0000 0000 0000 0111

Words,     x = 0000 0000 0000 0bcd efgj kmnq suvw xABE
           m = 0000 0000 0000 0000 0000 0000 0000 1101

Upon completion, m is an integer that gives the number of known leading 0’s in x. Subtracting this from the word size gives the number of compressed bits in x, which equals the number of 1-bits in the original mask m.

The reason this is not a very good algorithm for implementation with basic RISC instructions is that it is hard to shift the half-pockets right by differing amounts. On the other hand, it might possibly be useful on an SIMD machine that has instructions that operate on the pockets of a word in parallel and independently.
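Although the method is not intended for software, a word-at-a-time C simulation makes the phases concrete (all names here are ours; this is illustrative code, not fast code):

```c
/* Simulation of the hardware compress algorithm on 32-bit words,
   one phase at a time. Function and variable names are ours. */
unsigned compress_hw(unsigned x, unsigned m) {
   unsigned xt, mt;
   int p, w;

   /* First phase (w = 2): apply the table logic to each 2-bit
      pocket of x and m. */
   x = x & m;                 // Clear unselected bits.
   xt = 0; mt = 0;
   for (p = 0; p < 32; p += 2) {
      unsigned mb = (m >> p) & 3, xb = (x >> p) & 3;
      unsigned nx = 0, nm = 0;
      switch (mb) {
      case 0: nx = 0;       nm = 2; break;  // Neither bit selected.
      case 1: nx = xb & 1;  nm = 1; break;  // Right bit selected.
      case 2: nx = xb >> 1; nm = 1; break;  // Left bit moves right 1.
      case 3: nx = xb;      nm = 0; break;  // Both bits selected.
      }
      xt |= nx << p;
      mt |= nm << p;
   }
   x = xt; m = mt;

   /* Remaining phases: pocket sizes w = 4, 8, 16, 32. */
   for (w = 4; w <= 32; w = 2*w) {
      unsigned long long pocket = (w == 32) ? 0xFFFFFFFFULL
                                            : (1ULL << w) - 1;
      xt = 0; mt = 0;
      for (p = 0; p < 32; p += w) {
         unsigned long long xp = (x >> p) & pocket;
         unsigned mp = (unsigned)((m >> p) & pocket);
         unsigned rcnt = mp & ((1u << (w/2)) - 1);   // 0's in right half.
         unsigned lcnt = mp >> (w/2);                // 0's in left half.
         unsigned long long L = xp & ~((1ULL << (w/2)) - 1);  // Step 1.
         unsigned long long R = xp &  ((1ULL << (w/2)) - 1);  // Step 3.
         xt |= (unsigned)((R | (L >> rcnt)) << p);   // Steps 2 and 4.
         mt |= (lcnt + rcnt) << p;                   // Step 5.
      }
      x = xt; m = mt;
   }
   return x;
}
```

After the last phase, variable m holds the number of 0-bits in the original mask, as described above.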

Expand

The hardware compression algorithm can be turned into an expansion algorithm by, essentially, running it first forward and then in reverse. As in the algorithms based on the parallel suffix method, the five masks of the hardware compression algorithm are computed, saved, and used in the reverse of the order in which they were computed. Actually, the last mask is not used (nor is it used in the compression algorithm), but an additional one is required (m0) that is simply the complement of the original mask. In the forward pass, only the steps for computing the masks need be done; those involving the data x can be omitted.

To illustrate, suppose we have

Input x = abcd efgh ijkl mnop qrst uvwx yzAB CDEF
Mask  m = 0111 1110 0110 1100 1010 1111 0011 0010

Then the result of the expansion should be

0nop qrs0 0tu0 vw00 x0y0 zABC 00DE 00F0.

The masks are shown below.

m0 = 1000 0001 1001 0011 0101 0000 1100 1101
m1 = 0100 0001 0101 0010 0101 0000 1000 1001
m2 = 0001 0001 0010 0010 0010 0000 0010 0011
m3 = 0000 0010 0000 0100 0000 0010 0000 0101
m4 = 0000 0000 0000 0110 0000 0000 0000 0111

The integer values of each half of m4 give the number of 0-bits in the corresponding half of the original mask m. In particular, the right half of m has seven 0-bits. This means that the seven high-order bits of the right half of x do not belong there—they should be in the left half of x. Thus, bits 9 through 15 of x should be shifted left just enough to put them in the left half of x, and higher-order bits of x should be shifted left to accommodate them. This can be accomplished by shifting left the entire 32-bit word x by seven positions and replacing the left half of x with the left half of the shifted quantity. This gives

x = hijk lmno pqrs tuvw qrst uvwx yzAB CDEF.

In general, the algorithm works with pocket sizes from 32 down to 2, in five phases, using masks m4 down to m0. Each pocket (in parallel) is shifted left, discarding bits that are shifted out on the left, and supplying 0’s to vacated positions on the right, so that the shifted quantity is the same length as the pocket from which it came. Then the left half of the pocket is replaced by the left half of the shifted quantity. This will leave “garbage” bits in both halves of the pocket. They will be zeroed-out after the last phase by and’ing with the original mask.

Continuing, we treat m3 as two 16-bit pockets. The left pocket has the integer 4 in its right half, so the left pocket of x is shifted left four positions (giving lmno pqrs tuvw 0000), and the left half of this replaces the left half of the left pocket in x, making the left pocket of x = lmno pqrs. Performing the same operation on the right 16-bit pocket of x gives

x = lmno pqrs pqrs tuvw vwxy zABC yzAB CDEF.

The next phase uses m2, which consists of four 8-bit pockets. Applying it to x gives

x = mnop pqrs rstu tuvw vwxy zABC BCDE CDEF.

The next phase uses m1, which consists of eight 4-bit pockets. Applying it to x gives

x = mnop qrrs sttu vwvw wxxy zABC BCDE DEEF.

The last phase uses m0, which consists of sixteen 2-bit pockets. Applying it to x gives

x = mnop qrss stuu vwww xxyy zABC CCDE EEFF.

The final step is to and this with the original mask to clear irrelevant bits. This gives

x = 0nop qrs0 0tu0 vw00 x0y0 zABC 00DE 00F0.

The half-pockets of each computed mask contain a count of the number of 0-bits in the corresponding half-pocket of the original mask m. Therefore, as an alternative to computing the masks and saving them, the machine could employ circuits for doing a population count of the 0’s in the half-pockets “on the fly.”

7–7 General Permutations, Sheep and Goats Operation

To do general permutations of the bits in a word, or of anything else, a central problem is how to represent the permutation. It cannot be represented very compactly. Because there are 32! permutations of the bits in a 32-bit word, at least ⌈log2(32!)⌉ = 118 bits, or three words plus 22 bits, are required to designate one permutation out of the 32!.


One interesting way to represent permutations is closely related to the compression operations discussed in Section 7–4 [GLS1]. Start with the direct method of simply listing the bit position to which each bit moves. For example, for the permutation done by a rotate left of four bit positions, the bit at position 0 (the least significant bit) moves to position 4, 1 moves to 5, ..., 31 moves to 3. This permutation can be represented by the vector of 32 5-bit indexes:

00100
00101
...
11111
00000
00001
00010
00011

Treating that as a bit matrix, the representation we have in mind is its transpose, except reflected about the anti-diagonal so that the top row contains the least significant bits and the bits are numbered little-endian. This we store as five 32-bit words in array p:

p[0] = 1010 1010 1010 1010 1010 1010 1010 1010
p[1] = 1100 1100 1100 1100 1100 1100 1100 1100
p[2] = 0000 1111 0000 1111 0000 1111 0000 1111
p[3] = 0000 1111 1111 0000 0000 1111 1111 0000
p[4] = 0000 1111 1111 1111 1111 0000 0000 0000

Each bit of p[0] is the least significant bit of the position to which the corresponding bit of x moves, each bit of p[1] is the next more significant bit, and so on. This is similar to the encoding of the masks denoted by mv in the previous section, except that mv applies to revised masks in the compress algorithm, not to the original mask.

The compression operation we need compresses to the left all bits marked with 1’s in the mask, and compresses to the right all bits marked with 0’s. This is sometimes called the “sheep and goats” operation (SAG), or “generalized unshuffle.” It can be calculated with

SAG(x, m) = compress_left(x, m) | compress(x, ~m).

With SAG as a fundamental operation, and a permutation p as described above, the bits of a word x can be permuted by p in the following 15 steps:

x    = SAG(x,    p[0]);
p[1] = SAG(p[1], p[0]);
p[2] = SAG(p[2], p[0]);
p[3] = SAG(p[3], p[0]);
p[4] = SAG(p[4], p[0]);

x    = SAG(x,    p[1]);
p[2] = SAG(p[2], p[1]);
p[3] = SAG(p[3], p[1]);
p[4] = SAG(p[4], p[1]);

x    = SAG(x,    p[2]);
p[3] = SAG(p[3], p[2]);
p[4] = SAG(p[4], p[2]);

x    = SAG(x,    p[3]);
p[4] = SAG(p[4], p[3]);

x    = SAG(x,    p[4]);

In these steps, SAG is used to perform a stable binary radix sort. Array p is used as 32 5-bit keys to sort the bits of x. In the first step, all bits of x for which p[0] = 1 are moved to the left half of the resulting word, and all those for which p[0] = 0 are moved to the right half. Other than this, the order of the bits is not changed (that is, the sort is “stable”). Then all the keys that will be used for the next round of sorting are similarly sorted. The sixth line is sorting x based on the second least significant bit of the key, and so on.
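These steps can be checked with a short C sketch. The loop-based compress and population count below are slow stand-ins for the fast parallel-suffix code of Figure 7–9, and the function names are for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Loop-based compress: bits of x selected by m move to the right,
   keeping their order (a slow stand-in for Figure 7-9's version). */
static uint32_t compress(uint32_t x, uint32_t m) {
    uint32_t r = 0;
    int s = 0;
    for (int i = 0; i < 32; i++) {
        if ((m >> i) & 1) {
            r |= ((x >> i) & 1) << s;
            s++;
        }
    }
    return r;
}

static int pop(uint32_t x) {             /* population count */
    int n = 0;
    for (; x != 0; x >>= 1) n += x & 1;
    return n;
}

/* Sheep and goats: 1-marked bits compressed to the left,
   0-marked bits compressed to the right. */
static uint32_t SAG(uint32_t x, uint32_t m) {
    int zeros = pop(~m);                 /* width of the "goats" field */
    uint32_t sheep = compress(x, m);
    uint32_t goats = compress(x, ~m);
    return (zeros >= 32 ? 0 : sheep << zeros) | goats;
}

/* Permute x by the p[] representation: a stable binary radix sort
   of the bits of x on the 5-bit "goes to" keys (the 15 steps above,
   folded into loops). Note: p is overwritten with revised masks. */
uint32_t permute(uint32_t x, uint32_t p[5]) {
    for (int i = 0; i < 5; i++) {
        x = SAG(x, p[i]);
        for (int j = i + 1; j < 5; j++)
            p[j] = SAG(p[j], p[i]);
    }
    return x;
}
```

Applying permute to the rotate-left-four array p given earlier reproduces (x << 4) | (x >> 28), as expected.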

Similar to the situation of compressing, if a certain permutation p is to be used on a number of words x, then a considerable savings results by precomputing most of the steps above. The permutation array is revised to

p[1] = SAG(p[1], p[0]);
p[2] = SAG(SAG(p[2], p[0]), p[1]);
p[3] = SAG(SAG(SAG(p[3], p[0]), p[1]), p[2]);
p[4] = SAG(SAG(SAG(SAG(p[4], p[0]), p[1]), p[2]), p[3]);

and then each permutation is done with


x = SAG(x, p[0]);
x = SAG(x, p[1]);
x = SAG(x, p[2]);
x = SAG(x, p[3]);
x = SAG(x, p[4]);

A more direct (but perhaps less interesting) way to do general permutations of the bits in a word is to represent a permutation as a sequence of 32 5-bit indexes. The kth index is the bit number in the source from which the kth bit of the result comes. (This is a “comes from” list, whereas the SAG method uses a “goes to” list.) These could be packed six to a 32-bit word, thus requiring six words to hold all 32 indexes. An instruction to apply them can be implemented in hardware, such as

bitgather Rt,Rx,Ri,

where register Rt is a target register (and also a source), register Rx contains the bits to be permuted, and register Ri contains six 5-bit indexes (and two unused bits). The operation of the instruction is

Image

In words, the contents of the target register are shifted left six bit positions, and six bits are selected from word x and placed in the vacated six positions of t. The bits selected are given by the six 5-bit indexes in word i, taken in left-to-right order. The bit numbering in the indexes could be either little- or big-endian, and the operation would probably be as described for either type of machine.

To permute a word, use a sequence of six such instructions, all with the same Rt and Rx, but different index registers. In the first index register of the sequence, only indexes i4 and i5 are significant, as the bits selected by the other four indexes are shifted out of the left end of Rt.
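A software model of what such an instruction does is sketched below. The index layout is an assumption (the six 5-bit indexes in bits 29–0 of the index word, leftmost index in the most significant field), as the text leaves the packing unspecified:

```c
#include <assert.h>
#include <stdint.h>

/* Model of the hypothetical bitgather instruction. Assumed layout:
   the six 5-bit indexes occupy bits 29-0 of i, leftmost index in
   bits 29-25; the leftmost index selects the bit placed leftmost
   (bit position 5) of the six newly inserted bits. */
uint32_t bitgather(uint32_t t, uint32_t x, uint32_t i) {
    t <<= 6;
    for (int k = 5; k >= 0; k--) {          /* k = 5 is the leftmost index */
        unsigned idx = (i >> (5 * k)) & 31;
        t |= ((x >> idx) & 1) << k;
    }
    return t;
}

/* Pack six 5-bit indexes, given leftmost first, into an index word. */
uint32_t pack(const unsigned idx[6]) {
    uint32_t w = 0;
    for (int k = 0; k < 6; k++)
        w = (w << 5) | (idx[k] & 31);
    return w;
}
```

Six calls with index words packed for the identity permutation return the original x; the first word's four leading indexes are don't-cares, since those bits are shifted out of the left end, as noted above.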

An implementation of this instruction would most likely allow index values to be repeated, so the instruction can be used to do more than permute bits. It can be used to repeat any selected bit any number of times in the target register. The SAG operation lacks this generality.

It is not unduly difficult to implement this as a fast (e.g., one cycle) instruction. The bit selection circuit consists of six 32:1 MUX’s. If these are built from five stages of 2:1 MUX’s in today’s technology (6 · 31 = 186 MUX’s in all), the instruction would be faster than a 32-bit add instruction [MD].

Some of the Intel machines have instructions that work much like the bit permutation operation described, but that permute bytes, “words” (16 bits), and “doublewords” (32 bits). These are PSHUFB, PSHUFW, and PSHUFD (Shuffle Packed Bytes/Words/Doublewords).

Permuting bits has applications in cryptography, and the closely related operation of permuting subwords (e.g., permuting the bytes in a word) has applications in computer graphics. Both of these applications are more likely to deal with 64-bit words, or possibly with 128, than with 32. The SAG and bitgather methods apply with obvious changes to these larger word sizes.

To encrypt or decrypt a message with the Data Encryption Standard (DES) algorithm requires a large number of permutation-like mappings. First, key generation is done, once per session. This involves 17 permutation-like mappings. The first, called “permuted choice 1,” maps from a 64-bit quantity to a 56-bit quantity (it selects the 56 non-parity bits from the key and permutes them). This is followed by 16 permutation-like mappings from 56 bits to 48 bits, all using the same mapping, called “permuted choice 2.”

Following key generation, each block of 64 bits in the message is subjected to 34 permutation-like operations. The first and last operations are 64-bit permutations, one being the inverse of the other. There are 16 permutations with repetitions that map 32-bit quantities to 48 bits, all using the same mapping. Finally, there are 16 32-bit permutations, all using the same permutation. The total number of distinct mappings is six. They are all constants and are given in [DES].

DES is obsolete, as it was proved to be insecure in 1998 by the Electronic Frontier Foundation, using special hardware. The National Institute of Standards and Technology (NIST) has endorsed a temporary replacement called Triple DES, which consists of DES run serially three times on each 64-bit block, each time with a different key (that is, the key length is 192 bits, including 24 parity bits). Hence, it takes three times as many permutation operations as does DES to encrypt or decrypt.

The “permanent” replacement for DES and Triple DES, the Advanced Encryption Standard (previously known as the Rijndael algorithm [AES]), involves no bit-level permutations. The closest it comes to a permutation is a simple rotation of 32-bit words by a multiple of 8-bit positions. Other encryption methods proposed or in use generally involve far fewer bit-level permutations than DES.

To compare the two permutation methods discussed here, the bitgather method has the advantages of (1) simpler preparation of the index words from the raw data describing the permutation, (2) simpler hardware, and (3) more general mappings. The SAG method has the advantages of (1) doing the permutation in five rather than six instructions, (2) having only two source registers in its instruction format (which might fit better in some RISC architectures), (3) scaling better to permute a doubleword quantity, and (4) permuting subwords more efficiently.

Item (3) is discussed in [LSY]. The SAG instruction allows for doing a general permutation of a two-word quantity with two executions of the SAG instruction, a few basic RISC instructions, and two full permutations of single words. The bitgather instruction allows for doing it by executing three full permutations of single words, plus a few basic RISC instructions. This does not count preprocessing of the permutation to produce new quantities that depend only on the permutation. We leave it to the reader to discover these methods.

Regarding item (4), to permute, for example, the four bytes of a word with bitgather requires executing six instructions, the same as for a general bit permutation by bitgather. But with SAG it can be done in only two instructions, rather than the five required for a general bit permutation by SAG. The gain in efficiency applies even when the subwords are not a power of 2 in size; the number of steps required is ⌈log2 n⌉, where n is the number of subwords, not counting a possible non-participating group of bits that stays at one end or the other.

[LSY] discusses the SAG and bitgather instructions (called “GRP” and “PPERM,” respectively), other possible permutation instructions based on networks, and permuting by table lookup.

There is a neat hack to add 1 to the goats—that is, to compute

Image

without using the SAG function or its inverse [Knu8]. Here we assume SAG(x, m) puts the goats on the right, and the addition does not overflow into the “sheep” field. We leave to the reader the pleasure of discovering this trick.

7–8 Rearrangements and Index Transformations

Many simple rearrangements of the bits in a computer word correspond to even simpler transformations of the coordinates, or indexes, of the bits [GLS1]. These correspondences apply to rearrangements of the elements of any one-dimensional array provided the number of array elements is an integral power of 2. For programming purposes, they are useful primarily when the array elements are a computer word or larger in size.

As an example, the outer perfect shuffle of the elements of an array A of size eight, with the result in array B, consists of the following moves:

Image

Each B-index is the corresponding A-index rotated left one position, using a 3-bit rotator. The outer perfect unshuffle is, of course, accomplished by rotating right each index. Some similar correspondences are shown in Table 7–1. Here n is the number of array elements, “lsb” means least significant bit, and the rotations of indexes are done with a log2n-bit rotator.
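This correspondence is easy to check directly in C (the array and function names are illustrative only):

```c
#include <assert.h>

/* Outer perfect shuffle of an 8-element array done by index
   transformation: the element at A-index i goes to B-index
   rotl(i, 1), a rotate left of one position in a 3-bit rotator. */
void shuffle8(const int A[8], int B[8]) {
    for (int i = 0; i < 8; i++)
        B[((i << 1) | (i >> 2)) & 7] = A[i];
}
```

For A = {0, 1, 2, 3, 4, 5, 6, 7}, this produces B = {0, 4, 1, 5, 2, 6, 3, 7}, the outer perfect shuffle of the two halves.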

TABLE 7–1. REARRANGEMENTS AND INDEX TRANSFORMATIONS

Image

7–9 An LRU Algorithm

Ever wonder how your computer keeps track of which cache line is the least recently used? Here we describe one such algorithm, known as the reference matrix method. It is primarily a hardware algorithm, but it might have application in software.

We won’t go into a long discussion of the intriguing world of caches, but only say that we have in mind the high-speed caches that buffer data between a computer’s main memory and the processor. These caches may get a request for a word every computer cycle, and they should usually respond with the data within a cycle or two, so there is not much time for a complicated algorithm.

A cache contains a copy of a subset of the data in main memory, and the problem we are addressing is: when a cache miss occurs (that is, when a word at a certain address is requested and the data at that address are not in the cache), how does the computer decide which block (or line, in cache jargon) to replace with the requested data? Ideally, it should replace the data in the line that will not be referenced for the longest time in the future. But we cannot know the future, so we have to guess. The best guess over a wide variety of application programs seems to be the least recently used (LRU) policy. This policy replaces the line that has not been referenced for the longest time.

Caches come in three varieties: direct-mapped, fully associative, and set-associative. In a direct-mapped cache, certain bits of the address of the load or store instruction directly address a particular cache line. When a miss occurs, there is no question as to what line to replace—it must be the addressed line. There is no need for an LRU or any other guessing policy.

In a fully associative cache, a block from main memory can be placed in any cache line. When a load or store is executed, the address is looked up to see if it is in the cache. If not, it is necessary to replace the contents of some line. The machine has complete flexibility in the choice of line to replace. Several strategies have been used (FIFO, random, and LRU are the most common) and, as mentioned above, LRU seems to be the one that most often results in the lowest miss rate. Unfortunately, LRU is the most expensive to implement when there are many lines to consider for replacement.

Often the set-associative organization is chosen. It is a compromise between direct-mapped and fully associative. The designer decides on the degree of associativity, which is usually 2, 4, 8, or 16. The cache is divided into a number of “sets,” each of which contains 2, 4, 8, or 16 lines (typically). The set is directly addressed, using certain bits of the load or store address, but the line within the set must be looked up. The lookup in the set is done much the same as in the case of a fully associative cache. Now, when it is necessary to replace a line, the LRU algorithm need only determine which of the lines within one set is the least recently used, and replace that.

With this brief background, we can describe the reference matrix method. To illustrate, assume the cache is four-way set-associative. This means that there are four lines for which we wish to keep track of the least recently used (referenced). The cache may be fully associative and consist of only four lines, or it may be set-associative with four lines per set.

The reference matrix method employs a square bit matrix of dimension equal to the degree of associativity (in principle; we will modify this statement later). Each associative set has one such matrix. The essence of the method is that when line i is referenced, row i of the matrix is set to 1’s, and then column i is set to 0’s. Figure 7–13 illustrates the changes in the matrix from an initial state to its configuration after a reference to lines 3, 1, 0, 2, 0, 3, and 2, in that order.

Image

FIGURE 7–13. Illustration of the reference matrix method.

Once all four lines have been referenced, the matrix has one row containing three 1’s, one containing two 1’s, one containing one 1, and one containing no 1’s. The number of the row with no 1’s is the least recently used line. The number of the row with one 1 is the next least recently used line, and so on. When a cache miss occurs, the machine finds the row with all 0’s and replaces the corresponding line. It then records it as the most recently used line by setting its row to all 1’s and its column to all 0’s.

Why does this work? Denoting the matrix by M, the reason it works is that Mij indicates whether or not line i is more recently used than line j. If Mij = 1, line i is more recently used than line j, and if Mij = 0, line i is not more recently used than line j.

Consider an arbitrary 4×4 matrix for which line 2 is referenced. Then the matrix changes as shown in Figure 7–14. Setting row i to 1’s (except for the element on the main diagonal) is recording that line i is more recently used than line j, for all j ≠ i. Setting column i to 0’s is recording that line j is not more recently used than line i, for all j ≠ i. Relations among cache lines other than i are not changed. When all the lines have been referenced, all the “more recently used” relations will be established.

Thus, the reference matrix is antisymmetric and the main diagonal is always all 0’s. Therefore, only part of the matrix, either the elements above the main diagonal or those below the main diagonal, need be stored in the cache. That is what is done in practice. For an n-way associative set, n(n – 1)/2 memory bits are required. For n = 4, this is six; for n = 8, it is 28. Twenty-eight is getting to be a bit large, so the reference matrix method, and in fact the true LRU policy, is not often used for degrees of associativity greater than 8. Instead, there are approximate LRU methods and methods that are not LRU at all.

In software, the LRU policy would probably be implemented with a list of the line numbers (either a simple vector or a linked list). When line i is referenced, the list is searched for i, and then i is moved to the top of the list. The least recently used line number then migrates to the bottom of the list.

That method is relatively slow on references (because of rearranging the list), but fast in deciding which line to replace. Another method, with the opposite speed characteristics, is to have a vector of length equal to the degree of associativity, with position i holding both the address that line i holds and its “age” (actually “newness”) encoded as an integer. When line i is referenced, a single variable that holds the current “age” is incremented, and the resulting value is stored in the vector at position i. To find the least recently used line, the vector is searched for the line with the smallest value of “age.” This method fails if the “age” integer overflows.

Image

FIGURE 7–14. One step of the reference matrix method.

There might be one “age” integer per associative set, or only one for the whole cache, or in hardware a cycle counter could be used.
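A minimal sketch of this age-vector method, fixed at four lines for concreteness (all names are illustrative, and the overflow caveat above applies):

```c
#include <assert.h>
#include <stdint.h>

#define NWAY 4

static uint32_t age[NWAY];   /* "newness" stamp for each line    */
static uint32_t clock_ = 0;  /* single increasing "age" counter  */

/* Record a reference to line i. Fails if clock_ ever overflows. */
void touch(int i) {
    age[i] = ++clock_;
}

/* Least recently used line: the one with the smallest stamp. */
int lru(void) {
    int best = 0;
    for (int i = 1; i < NWAY; i++)
        if (age[i] < age[best]) best = i;
    return best;
}
```

Referencing lines 3, 1, 0, 2, 0, 3, and 2 in that order (the sequence of Figure 7–13) leaves line 1 as the least recently used.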

The reference matrix method might be useful in software when the degree of associativity is small. For example, suppose an application uses eight-way set-associativity and is to run on a 64-bit machine. Then the reference matrix can be stored in a single 64-bit register. Let the low-order eight bits of the register hold row 0 of the matrix, the next eight bits hold row 1, and so forth. Then when line i is referenced, byte i of the register should be set to 1’s, and bits i, i + 8, ..., i + 56 should be cleared. Denoting the register by m, this is accomplished as shown here.

Image

This amounts to five or six instructions, plus a few to load constants. To find the least recently used line, search for an all-zero byte (see Section 6–1). The advantage of this method over the other software methods briefly outlined above is that all the work is done in a register.
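Based on the layout just described, the update might be coded as follows; this is a sketch, with the constants following from putting row i in byte i:

```c
#include <assert.h>
#include <stdint.h>

/* Reference-matrix update for an 8-way set held in a 64-bit word m,
   with byte i holding row i. Referencing line i sets row i to 1's
   and clears column i (bits i, i + 8, ..., i + 56). */
uint64_t lru_ref(uint64_t m, int i) {
    m |= (uint64_t)0xFF << (8 * i);          /* set row i to 1's */
    m &= ~(0x0101010101010101ULL << i);      /* clear column i   */
    return m;
}
```

After referencing lines 0 through 7 in order starting from m = 0, byte i holds (1 << i) – 1, so row 0 is all zero and line 0 is correctly identified as the least recently used.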

Exercises

1. Explain the workings of the second Möbius formula (Equation (1), page 139).

2. The perfect outer shuffle operation and its inverse employ the following masks:

Image

What is a formula for the general case, mk? A formula might be useful in situations in which an upper bound on the length of the integers being shuffled is not known in advance, such as in “bignum” applications.

3. Code a function similar to the compress function of Figure 7–9 that does the expand operation.

4. For an n-way set-associative cache, what is the theoretical minimum number of bits required to implement the LRU policy? Compare that to the number of bits required for the reference matrix method, for a few small values of n.
