26.3SmallBlockAllocator 417
HowFree()Works
Deleting a block of memory is a thornier problem. All the system knows is that,
based on the address falling within an SBA page, the SBA must free this pointer.
To determine this, it must also have found the page it came from. Knowing that
all pages in the SBA are the same size, the
Tail structure is always at the same
offset from the start of a page. However, the start of the bit mask varies depend-
ing on the number of blocks in the page. This can be calculated from the number
of bytes per block, which we stored in the
Tail structure for this purpose.
Based on the block size and the pointer, the index into the free block bit
mask that represents this memory is simply
(ptr – pageStart) / blockSize.
We set the bit to one, indicating it is free for allocation. Notice that there is no
need to explicitly handle adjacent free block coalescing because every block in
the page is the exact size that is allocated.
Finally, the results of a free operation on a page that was completely depleted
of free blocks, is to reinsert the page into the list of available memory pages.
PerformanceAnalysis
The two major operations, allocation and deallocation, are very fast. Deallocation
is a constant-time
1O operation and has no loops once the SBA page has been
located. Locating the page is a separate operation, described in Section 26.6. Al-
location requires scanning for set bits in a bit mask, on average
2n , where a page
is divided into
n blocks. A naive C implementation of scanning for set bits using
bytes can be this slow, however, we can trivially test 32, 64, or even 128 bits at a
time by loading the mask into a register and comparing against zero, generally in
just two instructions. Intel CPUs can use the bit scan forward (
BSF) instruction to
check for the first set bit in either a 32- or 64-bit register in just a few cycles. In
addition, for most practical page and block sizes, the entire
Tail structure and
free block bit mask fits within a single L1 cache line, ensuring the fastest possi-
ble allocation performance.
As a specific example, given a 4-kB page with 32 bytes per block, there are
127 allocatable blocks per page plus one reserved block for the
Tail structure.
This is 127 bits for the bit mask, which fits in only four 32-bit words. Worst case,
obtaining an allocation requires four 4-byte requests from memory, but on aver-
age finds a free block in two requests. The first load causes a delay as a cache
line is flushed and another is pulled into the L2 and L1 caches, but the second
load requires only one or two cycles on typical hardware.