26. A Highly Optimized Portable Memory Manager (2/4)

Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

414 26.AHighlyOptimizedPortableMemoryManager

PagedAllocation

It is fine to suggest different allocation strategies for different-sized memory re-

quests, but that memory has to come from somewhere. The simplest method is to

preallocate a fixed number of blocks for each allocation size and hope you don’t

exceed that limit. Having done this on shipped titles, we honestly don’t recom-

mend it. However, it has one benefit in that you can determine the size of an allo-

cation based on the address alone (because it is either in the small, medium, or

large region of memory). Still, it is far better to have a flexible solution that

doesn’t require tweaking for every project having slightly different memory allo-

cation characteristics.

The solution is to preallocate in large batches, or pages, and string these pag-

es together to serve as small- and medium-sized memory heaps. Ideally, these are

quite large, granular pages of equal sizes for both SBA and MBA pages, or MBA

pages are at least an even multiple of the page size used for SBA. This is prefera-

ble, as mentioned above, due to the inherent resistance to fragmentation when

allocations are all the same size. By making page sizes large, there are fewer of

them allocated, leading to fewer chances for failure and less overall CPU time

spent managing pages.

Inevitably, a single page does not hold all of the allocations the game needs

for a specific-sized allocation, so pages must be linked together by some data

structure. We selected a linked list for pages to make the traversal cheap and to

make it easy to extract or insert pages with minimal cache line misses. Searching

a linked list is slow, so we come up with ways to prevent the search entirely.

EasingPortabilitywiththeOSAPIClass

The exact policy for how and where pages come from is going to vary per plat-

form, so the memory manager template is parameterized by a class called

OSAPI.

OSAPI has functions to allocate and free SBA and MBA pages, allocate and free

large blocks, and identify whether a pointer was allocated by an SBA, MBA, or

LBA call. It also specifies the small and medium page sizes that are created when

new pages are requested.

By making this a separate class, you can experiment with various page man-

agement methods without requiring modifications to the rest of the memory

manager. Considering the

OSAPI implementation is likely to be almost identical

across platforms except for how memory is allocated, the separation may seem

unnecessary, but in the event that one platform needs to pay very special atten-

tion to where memory comes from for certain kinds of data, this class makes just

such an implementation simple. For example, MEM1 is faster than MEM2 on the

26.3SmallBlockAllocator 415

Nintendo Wii. It is reasonable to force all small and medium allocations to be in

MEM1 by providing all pages from MEM1, and meanwhile force all large allo-

cations to be in MEM2, which is fine because the CPU is unlikely to need direct

access to large allocations.

26.3SmallBlockAllocator

The SBA is implemented as a full page of equally sized blocks, minus a small

section of data at the end of the page called the

Tail, which includes a bit mask

indicating which blocks are free, pointers to the previous and next pages, a count

of free blocks in the page (for verification purposes only), and the number of

bytes in a single block. Since the number of blocks in an SBA page depends on

both the number of bytes in a block and the size of the page, this is a variable-

sized structure and is thus slightly more complicated to deal with than your typi-

cal C structure. Figure 26.2 shows the memory layout for a 4-kB page. The entire

Tail and bit mask fits entirely within a single 32-byte block, which is also small-

er than the L1 and L2 cache line size for modern processors.

Figure 26.2. This is the memory layout for an example SBA page that totals 4 kB seg-

mented into 32-byte allocations. Notice the tail is the last block and is quite small relative

to the amount of memory managed.

32 bytes/block

FreeBlockBitMask

FreeBlockCount BlockSize

Prev Page Next Page

SmallBlockAllocator

kB Page

416 26.AHighlyOptimizedPortableMemoryManager

Due to the added complexity of handling variable-sized structures at the

Tail, the code is written in such a way that the fixed-sized portion of the struc-

ture is stored in the

Tail, and the bit mask is arranged such that it ends just be-

fore the start of the

Tail.

Why put the data at the end of the page rather than at the start? It’s done part-

ly out of habit and partly because placing a variable structure at the start of the

page would most likely require some decision making about whether the first

block should occur on the next byte, or on the next 32-byte address, or aligned

based on the last block in the page. The choices are likely to be platform-specific,

too. Since this is meant to be a simple, portable system that operates identically

as much as possible on all platforms, the decision to move the metadata to the

end of the page removes all doubt and simplifies the code somewhat. The main

risk is that a very misbehaving program could smash the entire page by overrun-

ning its memory and destroying the tracking information at the end of the page.

True, but this is one reason why the free block count exists in the

Tail struc-

ture—to sanity-check the validity of the block whenever problems arise.

HowAlloc()Works

Once a page is found that provides memory allocations of the appropriate size,

the job of the SBA is straightforward. First, we use the page size to determine the

location of the

Tail structure and use the block size to locate the start of the free

block bit mask. Then, we scan through the entire bit mask looking for a bit set to

one. A bit set to one indicates a free block at that index. Once found, the bit is set

to zero, the free count in the

Tail structure is decremented for accounting pur-

poses, and a pointer is returned to the caller that points to the newly allocated

block of memory.

Notice that we do not describe a failure case for allocation. Of course, we

could devise a system that allows for depleted pages to coexist in the same list as

pages with blocks remaining, but this would be wasteful. Completely depleted

pages mixed in with those that have allocations available would always degener-

ate into a linked-list search through pages, ending either in an allocation or a new

page being created in the list. A little experimentation with a profiler showed 75

percent of all CPU time was spent waiting for memory cache misses on each

page. So, the memory manager must also handle segregating depleted pages from

those that have blocks remaining. Thus, every time an allocation is requested, it

is always found on the very first page it checks. The result of an allocation can

cause that page to be put in the depleted page list or a new page to be created

should the free block page list be empty.

26.3SmallBlockAllocator 417

HowFree()Works

Deleting a block of memory is a thornier problem. All the system knows is that,

based on the address falling within an SBA page, the SBA must free this pointer.

To determine this, it must also have found the page it came from. Knowing that

all pages in the SBA are the same size, the

Tail structure is always at the same

offset from the start of a page. However, the start of the bit mask varies depend-

ing on the number of blocks in the page. This can be calculated from the number

of bytes per block, which we stored in the

Tail structure for this purpose.

Based on the block size and the pointer, the index into the free block bit

mask that represents this memory is simply

(ptr – pageStart) / blockSize.

We set the bit to one, indicating it is free for allocation. Notice that there is no

need to explicitly handle adjacent free block coalescing because every block in

the page is the exact size that is allocated.

Finally, the results of a free operation on a page that was completely depleted

of free blocks, is to reinsert the page into the list of available memory pages.

PerformanceAnalysis

The two major operations, allocation and deallocation, are very fast. Deallocation

is a constant-time





1O operation and has no loops once the SBA page has been

located. Locating the page is a separate operation, described in Section 26.6. Al-

location requires scanning for set bits in a bit mask, on average

2n , where a page

is divided into

n blocks. A naive C implementation of scanning for set bits using

bytes can be this slow, however, we can trivially test 32, 64, or even 128 bits at a

time by loading the mask into a register and comparing against zero, generally in

just two instructions. Intel CPUs can use the bit scan forward (

BSF) instruction to

check for the first set bit in either a 32- or 64-bit register in just a few cycles. In

addition, for most practical page and block sizes, the entire

Tail structure and

free block bit mask fits within a single L1 cache line, ensuring the fastest possi-

ble allocation performance.

As a specific example, given a 4-kB page with 32 bytes per block, there are

127 allocatable blocks per page plus one reserved block for the

Tail structure.

This is 127 bits for the bit mask, which fits in only four 32-bit words. Worst case,

obtaining an allocation requires four 4-byte requests from memory, but on aver-

age finds a free block in two requests. The first load causes a delay as a cache

line is flushed and another is pulled into the L2 and L1 caches, but the second

load requires only one or two cycles on typical hardware.

418 26.AHighlyOptimizedPortableMemoryManager

MemoryOverhead

The SBA scheme described is very memory-efficient. For a given page, there is a

fixed amount of overhead due to the

Tail structure of exactly 12 bytes. There is

also a variable amount of overhead for the bit mask, which is always rounded up

to the next four-byte boundary, and occupies one bit per block in the page. For

the 4-kB page with 32 bytes per block above, this works out to 16 bytes for the

bit mask. So the total amount of management overhead within a single SBA page

is less than two bits per allocation. In practical settings, pages are probably larger

than 4 kB, resulting in overhead that rapidly approaches one bit per allocation.

Contrast this with dlmalloc, which has a relatively efficient 16 bytes of overhead

per allocation.

26.4MediumBlockAllocator

The MBA serves a slightly different purpose than the SBA, but does so through

similar techniques. The MBA works by always rounding up allocations to a fixed

block size so that fragmentation is more manageable. Similar to the SBA, the

MBA expects to be handed pages of some fixed size, requiring a

Tail structure

at the end of each page to manage multiple memory pages. Also, a bit mask is

used to mark individual blocks as free or available. The similarities end there.

Since the MBA must service many different allocation lengths, it is infeasi-

ble to create separate large pages just for each granular jump in size. This is par-

ticularly true since the page size is fixed—as the medium allocations increase in

size, a single page might not hold more than one or two allocations of maximum

length, and any remainder is wasted. Thus, a new technique is required that al-

lows many allocations of varying length to share a single page. Further, all pages

need to have the same block size, or allocation granularity, as a method of im-

proving performance as well as simplifying the code that handles allocation and

deallocation. It would be difficult, if not impossible, to predict the ideal block

size for a page since allocation sizes vary, so it works out better to have the user

just supply one that is globally applied for MBA pages.

The logical first step is to allow sequential blocks to be considered as a single

allocation. The allocation function can be modified to scan the free block bit

mask to find a sequence of set bits long enough to enclose the allocation or step

to the next page. Should no page have enough blocks, a new page can be allocat-

ed. However, this is a slow operation due to the number of bit scans and potential

number of page traversals, sometimes resulting in a long linked list traversal that

ultimately fails.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 26. A Highly Optimized Portable Memory Manager (2/4)

Create new playlist

Sign In

Sign Up

Table of Contents for
26. A Highly Optimized Portable Memory Manager (2/4)