26. A Highly Optimized Portable Memory Manager
Paged Allocation
It is fine to suggest different allocation strategies for different-sized memory re-
quests, but that memory has to come from somewhere. The simplest method is to
preallocate a fixed number of blocks for each allocation size and hope you don’t
exceed that limit. Having done this on shipped titles, we honestly don’t recom-
mend it. However, it has one benefit in that you can determine the size of an allo-
cation based on the address alone (because it is either in the small, medium, or
large region of memory). Still, it is far better to have a flexible solution that
doesn’t require tweaking for every project with slightly different memory allo-
cation characteristics.
The solution is to preallocate in large batches, or pages, and string these pag-
es together to serve as small- and medium-sized memory heaps. Ideally, these are
quite large, granular pages of equal sizes for both SBA and MBA pages, or MBA
pages are at least an even multiple of the page size used for SBA. This is prefera-
ble, as mentioned above, due to the inherent resistance to fragmentation when
allocations are all the same size. By making page sizes large, there are fewer of
them allocated, leading to fewer chances for failure and less overall CPU time
spent managing pages.
Inevitably, a single page does not hold all of the allocations the game needs
for a specific-sized allocation, so pages must be linked together by some data
structure. We selected a linked list for pages to make the traversal cheap and to
make it easy to extract or insert pages with minimal cache line misses. Searching
a linked list is slow, so we come up with ways to prevent the search entirely.
Easing Portability with the OSAPI Class
The exact policy for how and where pages come from is going to vary per plat-
form, so the memory manager template is parameterized by a class called
OSAPI.
OSAPI has functions to allocate and free SBA and MBA pages, allocate and free
large blocks, and identify whether a pointer was allocated by an SBA, MBA, or
LBA call. It also specifies the small and medium page sizes that are created when
new pages are requested.
By making this a separate class, you can experiment with various page man-
agement methods without requiring modifications to the rest of the memory
manager. Considering the
OSAPI implementation is likely to be almost identical
across platforms except for how memory is allocated, the separation may seem
unnecessary, but in the event that one platform needs to pay very special atten-
tion to where memory comes from for certain kinds of data, this class makes just
such an implementation simple. For example, MEM1 is faster than MEM2 on the
Nintendo Wii. It is reasonable to force all small and medium allocations to be in
MEM1 by providing all pages from MEM1, and meanwhile force all large allo-
cations to be in MEM2, which is fine because the CPU is unlikely to need direct
access to large allocations.
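The policy class described above can be sketched as a minimal OSAPI backed by plain malloc. The member names and page sizes here are assumptions for illustration; the chapter does not list the exact interface, and the pointer-identification functions it mentions are platform-specific and omitted.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical OSAPI policy class backed by plain malloc. A real platform
// implementation would draw pages from platform-specific regions (e.g.,
// MEM1 on the Wii) and also identify SBA/MBA/LBA ownership of a pointer.
struct MallocOSAPI {
    // Page sizes handed out when new SBA/MBA pages are requested.
    static const std::size_t kSmallPageSize  = 4 * 1024;
    static const std::size_t kMediumPageSize = 64 * 1024;

    static void* AllocSmallPage()        { return std::malloc(kSmallPageSize); }
    static void  FreeSmallPage(void* p)  { std::free(p); }
    static void* AllocMediumPage()       { return std::malloc(kMediumPageSize); }
    static void  FreeMediumPage(void* p) { std::free(p); }

    // Large allocations bypass paging entirely.
    static void* AllocLargeBlock(std::size_t bytes) { return std::malloc(bytes); }
    static void  FreeLargeBlock(void* p)            { std::free(p); }
};
```

On a platform like the Wii, the small- and medium-page functions could draw from MEM1 while AllocLargeBlock draws from MEM2, without touching the rest of the memory manager.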
26.3 Small Block Allocator
The SBA is implemented as a full page of equally sized blocks, minus a small
section of data at the end of the page called the
Tail, which includes a bit mask
indicating which blocks are free, pointers to the previous and next pages, a count
of free blocks in the page (for verification purposes only), and the number of
bytes in a single block. Since the number of blocks in an SBA page depends on
both the number of bytes in a block and the size of the page, this is a variable-
sized structure and is thus slightly more complicated to deal with than your typi-
cal C structure. Figure 26.2 shows the memory layout for a 4-kB page. The Tail
and bit mask fit entirely within a single 32-byte block, which is also smaller
than the L1 and L2 cache line size of modern processors.
Figure 26.2. This is the memory layout for an example SBA page that totals 4 kB seg-
mented into 32-byte allocations. Notice the tail is the last block and is quite small relative
to the amount of memory managed. [Figure: a 4-kB page divided into 32-byte blocks; the
final block holds the free block bit mask, free block count, block size, and previous/next
page pointers.]
Due to the added complexity of handling variable-sized structures at the
Tail, the code is written in such a way that the fixed-sized portion of the struc-
ture is stored in the
Tail, and the bit mask is arranged such that it ends just be-
fore the start of the
Tail.
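The arrangement just described can be sketched as follows. The field names are assumptions based on the description above; one plausible packing that yields the 12-byte fixed portion mentioned later in this section uses 16-bit fields for the count and block size on 32-bit targets (on a 64-bit build the pointers make the structure larger).

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of the fixed-size portion of the Tail, stored at the very end of
// an SBA page; the free-block bit mask ends just before it. Field names
// and widths are assumptions.
struct Tail {
    Tail*         prevPage;        // linked list of pages of this block size
    Tail*         nextPage;
    std::uint16_t freeBlockCount;  // for verification purposes only
    std::uint16_t blockSize;       // bytes per block in this page
};

// All SBA pages are the same size, so the Tail is at a fixed offset.
inline Tail* GetTail(void* pageStart, std::size_t pageSize) {
    return reinterpret_cast<Tail*>(
        static_cast<char*>(pageStart) + pageSize - sizeof(Tail));
}

// The bit mask is variable-sized (one bit per block, rounded up to whole
// 32-bit words) and ends immediately before the Tail.
inline std::uint32_t* GetBitMask(Tail* tail, std::size_t pageSize) {
    std::size_t blocks = pageSize / tail->blockSize;
    std::size_t words  = (blocks + 31) / 32;
    return reinterpret_cast<std::uint32_t*>(tail) - words;
}
```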
Why put the data at the end of the page rather than at the start? It’s done part-
ly out of habit and partly because placing a variable structure at the start of the
page would most likely require some decision making about whether the first
block should occur on the next byte, or on the next 32-byte address, or aligned
based on the last block in the page. The choices are likely to be platform-specific,
too. Since this is meant to be a simple, portable system that operates identically
as much as possible on all platforms, the decision to move the metadata to the
end of the page removes all doubt and simplifies the code somewhat. The main
risk is that a very misbehaving program could smash the entire page by overrun-
ning its memory and destroying the tracking information at the end of the page.
True, but this is one reason why the free block count exists in the
Tail struc-
ture—to sanity-check the validity of the block whenever problems arise.
How Alloc() Works
Once a page is found that provides memory allocations of the appropriate size,
the job of the SBA is straightforward. First, we use the page size to determine the
location of the
Tail structure and use the block size to locate the start of the free
block bit mask. Then, we scan through the entire bit mask looking for a bit set to
one. A bit set to one indicates a free block at that index. Once found, the bit is set
to zero, the free count in the
Tail structure is decremented for accounting pur-
poses, and a pointer is returned to the caller that points to the newly allocated
block of memory.
Notice that we do not describe a failure case for allocation. Of course, we
could devise a system that allows for depleted pages to coexist in the same list as
pages with blocks remaining, but this would be wasteful. Completely depleted
pages mixed in with those that have allocations available would always degener-
ate into a linked-list search through pages, ending either in an allocation or a new
page being created in the list. A little experimentation with a profiler showed 75
percent of all CPU time was spent waiting for memory cache misses on each
page. So, the memory manager must also handle segregating depleted pages from
those that have blocks remaining. Thus, every time an allocation is requested, it
is always found on the very first page it checks. The result of an allocation can
cause that page to be put in the depleted page list or a new page to be created
should the free block page list be empty.
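The in-page allocation path described above can be sketched like this. The names are illustrative, not the book's actual code, and the depleted-page bookkeeping is left to the caller as the text describes.

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of Alloc() within one SBA page: scan the free-block bit mask for
// a set bit (1 = free), clear it, decrement the free count, and return the
// block's address. Returns null when the page is depleted, at which point
// the caller moves the page to the depleted list or creates a fresh page.
void* AllocFromPage(char* pageStart, std::uint32_t* bitMask,
                    std::size_t maskWords, std::uint32_t blockSize,
                    std::uint32_t& freeCount) {
    for (std::size_t w = 0; w < maskWords; ++w) {
        if (bitMask[w] == 0)
            continue;                        // 32 blocks rejected in one compare
        std::uint32_t bit = 0;
        while (!((bitMask[w] >> bit) & 1u))  // find the lowest set bit
            ++bit;
        bitMask[w] &= ~(1u << bit);          // mark the block allocated
        --freeCount;                         // accounting only
        return pageStart + (w * 32 + bit) * blockSize;
    }
    return nullptr;                          // page depleted
}
```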
How Free() Works
Deleting a block of memory is a thornier problem. All the system knows is that,
based on the address falling within an SBA page, the SBA must free this pointer.
To determine this, it must also have found the page it came from. Knowing that
all pages in the SBA are the same size, the
Tail structure is always at the same
offset from the start of a page. However, the start of the bit mask varies depend-
ing on the number of blocks in the page. This can be calculated from the number
of bytes per block, which we stored in the
Tail structure for this purpose.
Based on the block size and the pointer, the index into the free block bit
mask that represents this memory is simply
(ptr – pageStart) / blockSize.
We set the bit to one, indicating it is free for allocation. Notice that there is no
need to explicitly handle adjacent free block coalescing because every block in
the page is the exact size that is allocated.
Finally, the result of a free operation on a page that was completely depleted
of free blocks is to reinsert the page into the list of available memory pages.
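The steps above can be sketched in a few lines, using the index formula from the text; the names are illustrative.

```cpp
#include <cstdint>
#include <cstddef>

// Sketch of Free() within an SBA page. The block index is simply
// (ptr - pageStart) / blockSize; setting its bit back to one frees the
// block. No coalescing is needed, since every block is the same size.
void FreeInPage(void* ptr, char* pageStart, std::uint32_t* bitMask,
                std::uint32_t blockSize, std::uint32_t& freeCount) {
    std::size_t index = (static_cast<char*>(ptr) - pageStart) / blockSize;
    bitMask[index / 32] |= 1u << (index % 32);  // 1 = free again
    ++freeCount;  // if the page had been fully depleted, the caller
                  // reinserts it into the available-page list
}
```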
Performance Analysis
The two major operations, allocation and deallocation, are very fast. Deallocation
is a constant-time O(1) operation and has no loops once the SBA page has been
located. Locating the page is a separate operation, described in Section 26.6. Al-
location requires scanning for set bits in a bit mask, which takes n/2 tests on
average, where a page is divided into n blocks. A naive C implementation that
scans byte by byte can be this slow; however, we can trivially test 32, 64, or even
128 bits at a time by loading the mask into a register and comparing against zero, generally in
just two instructions. Intel CPUs can use the bit scan forward (
BSF) instruction to
check for the first set bit in either a 32- or 64-bit register in just a few cycles. In
addition, for most practical page and block sizes, the entire
Tail structure and
free block bit mask fits within a single L1 cache line, ensuring the fastest possi-
ble allocation performance.
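The register-wide test and bit scan can be expressed with compiler intrinsics. This sketch assumes GCC/Clang's __builtin_ctz, which compiles to BSF (or TZCNT) on x86; MSVC's _BitScanForward is the equivalent there.

```cpp
#include <cstdint>

// Scan a free-block bit mask 32 bits at a time, then use a count-trailing-
// zeros intrinsic to locate the lowest set bit in the first nonzero word.
int FindFirstFreeBlock(const std::uint32_t* mask, int words) {
    for (int w = 0; w < words; ++w)
        if (mask[w] != 0)                         // 32 blocks tested at once
            return w * 32 + __builtin_ctz(mask[w]);
    return -1;                                    // no free block in this page
}
```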
As a specific example, given a 4-kB page with 32 bytes per block, there are
127 allocatable blocks per page plus one reserved block for the
Tail structure.
This is 127 bits for the bit mask, which fits in only four 32-bit words. Worst case,
obtaining an allocation requires four 4-byte requests from memory, but on aver-
age finds a free block in two requests. The first load causes a delay as a cache
line is flushed and another is pulled into the L2 and L1 caches, but the second
load requires only one or two cycles on typical hardware.
Memory Overhead
The SBA scheme described is very memory-efficient. For a given page, there is a
fixed amount of overhead due to the
Tail structure of exactly 12 bytes. There is
also a variable amount of overhead for the bit mask, which is always rounded up
to the next four-byte boundary, and occupies one bit per block in the page. For
the 4-kB page with 32 bytes per block above, this works out to 16 bytes for the
bit mask. So the total amount of management overhead within a single SBA page
is less than two bits per allocation. In practical settings, pages are probably larger
than 4 kB, resulting in overhead that rapidly approaches one bit per allocation.
Contrast this with dlmalloc, which has a relatively efficient 16 bytes of overhead
per allocation.
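The arithmetic above can be checked directly: for a 4-kB page of 32-byte blocks, the overhead is 12 bytes of Tail plus 16 bytes of bit mask amortized over 127 allocatable blocks. The helper below is a sketch assuming the 12-byte Tail of the book's 32-bit targets.

```cpp
// Compute the per-allocation overhead, in bits, for an SBA page with the
// given page size, block size, and fixed Tail size (all in bytes).
double OverheadBitsPerAllocation(int pageSize, int blockSize, int tailBytes) {
    int blocks      = pageSize / blockSize;           // 128 for 4 kB / 32 B
    int allocatable = blocks - 1;                     // one block holds the Tail
    int maskBytes   = ((allocatable + 31) / 32) * 4;  // 127 bits -> 16 bytes
    return (tailBytes + maskBytes) * 8.0 / allocatable;
}
```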
26.4 Medium Block Allocator
The MBA serves a slightly different purpose than the SBA, but does so through
similar techniques. The MBA works by always rounding up allocations to a fixed
block size so that fragmentation is more manageable. Similar to the SBA, the
MBA expects to be handed pages of some fixed size, requiring a
Tail structure
at the end of each page to manage multiple memory pages. Also, a bit mask is
used to mark individual blocks as free or available. The similarities end there.
Since the MBA must service many different allocation lengths, it is infeasi-
ble to create separate large pages just for each granular jump in size. This is par-
ticularly true since the page size is fixed—as the medium allocations increase in
size, a single page might not hold more than one or two allocations of maximum
length, and any remainder is wasted. Thus, a new technique is required that al-
lows many allocations of varying length to share a single page. Further, all pages
need to have the same block size, or allocation granularity, as a method of im-
proving performance as well as simplifying the code that handles allocation and
deallocation. It would be difficult, if not impossible, to predict the ideal block
size for a page since allocation sizes vary, so it works out better to have the user
just supply one that is globally applied for MBA pages.
The logical first step is to allow sequential blocks to be considered as a single
allocation. The allocation function can be modified to scan the free block bit
mask to find a sequence of set bits long enough to enclose the allocation or step
to the next page. Should no page have enough blocks, a new page can be allocat-
ed. However, this is a slow operation due to the number of bit scans and potential
number of page traversals, sometimes resulting in a long linked list traversal that
ultimately fails.
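The naive sequential scan just described can be sketched as follows; this is deliberately the slow approach the text is pointing out, and the names are illustrative.

```cpp
#include <cstddef>

// Find a run of `need` consecutive set (free) bits in a page's mask and
// return the starting block index, or -1 if the page has no run long
// enough, in which case the caller steps to the next page.
int FindFreeRun(const unsigned* mask, std::size_t totalBits, std::size_t need) {
    std::size_t run = 0;
    for (std::size_t i = 0; i < totalBits; ++i) {
        if (mask[i / 32] & (1u << (i % 32))) {
            if (++run == need)
                return static_cast<int>(i - need + 1);  // start of the run
        } else {
            run = 0;                                    // run broken; restart
        }
    }
    return -1;  // not enough contiguous blocks in this page
}
```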