Memory
This chapter provides an overview of the memory subsystem and explains how it relates to the Compute Node Kernel (CNK).
4.1 Memory system overview
The Blue Gene/Q system contains a distributed memory system, which includes an on-chip cache hierarchy and an off-chip main store. It contains optimized on-chip symmetric multiprocessing (SMP) support for locking and communication between the 17 ASIC processors. Each processor supports four hardware threads.
The aggregate memory of the total system is distributed in the style of a multicomputer, with no hardware sharing between nodes. Each node contains 16 GB of physical main memory. This memory is non-stacked synchronous dynamic random access memory (SDRAM).
The first-level (L1) caches are contained in the A2 core macro. The L1P cache acts as a prefetch cache and as a write-back buffer for L1 data. The second-level (L2) cache is 32 MB.
Table 4-1 on page 27 lists the memory specifications for the Blue Gene/Q system.
4.1.1 L1 prefetch cache overview
The level 1 prefetch (L1p) cache is a module that provides the interface between an A2 core and the rest of the Blue Gene/Q system. It interfaces to the Blue Gene/Q switch, the device control register (DCR) device ring, a memory-mapped I/O space that is local to the core, and a static random‑access memory (SRAM) module that might be local to the core. The L1p cache manages a 32 × 128 byte cache structure to identify and prefetch memory access patterns. This functionality is critical for performance. The L1p cache also performs write combining. It presents multiple small writes to the switch as a single write, while maintaining data coherency.
The L1P cache has the following functions:
Provides A2 interfaces to the Blue Gene/Q system:
 – Request
 – Store
 – Reload
 – Synchronize
 – Reservation
 – Invalidation
DCR bridge visible as memory mapped I/O
Prefetching:
 – Stream prefetch engine with automatically detected and software‑hinted streams
 – List prefetch engine
 – Optional symmetrical treatment of instruction and data prefetch
Write combining support
Synchronization support using generation protocol
Pipelined switch interface:
 – Out-of‑order interface to distinct destinations
 – In‑order interface to a single destination
L1P instruction support for Blue Gene/Q compute chips
Table 4-1 Blue Gene/Q memory specifications
| Cache | Quantity¹ | Size | Latency² | Replacement policy | Other information | Clock domain |
|---|---|---|---|---|---|---|
| L1 instruction cache | 18 (1 per processor) | 16 KB | 3 processor clocks (pclk) | Pseudo least recently used (LRU) | 4-way set-associative; 64-byte line size | Pclk |
| L1 data cache | 18 (1 per processor) | 16 KB | 6 pclk (integer) | Pseudo LRU | 8-way set-associative; 64-byte line size | Pclk |
| L1 prefetch cache | 18 (1 per processor) | 32 × 128 bytes | 24 pclk | Depth stealing and round robin | 128-byte line | Pclk / 2 |
| L2 cache | 16 | 2 MB per slice; 32 MB total L2 cache on-chip | 82 pclk | LRU | 16-way set-associative; 16-way sliced; 4 banks per slice; 8 sub-banks per slice; 128-byte line | Pclk / 2 |
| Double-data rate (DDR) memory | 2 | 16 GB total | ≥ 350 pclk |  | 128-byte line | Pclk × (5 / 6) |
| Embedded dynamic random-access memory (eDRAM) | 1 | 256 KB | ≥ 80 pclk | Software control | 16 bytes wide; 8 eDRAM macro-internal banks | Pclk / 2 |

¹ This value is the quantity on each Blue Gene/Q compute chip.
² The latency value is determined relative to instruction dispatch.
Prefetch algorithms
Two prefetch algorithms are supported, linear streams and list streams.
Linear Streams
Up to 16 concurrent linear streams of consecutive addresses can be simultaneously prefetched. Linear streams can be automatically identified, or hinted, using data cache block touch (dcbt) instructions, or established optimistically for any miss. Stream underflow (a hit on a line that is being fetched from the switch) triggers a depth increase when adaptation is enabled. Stream replacement and depth stealing lines are selected with a least recently hit algorithm.
List Streams
With software cooperation, access patterns can be recorded and reused by a list fetch engine. This implementation allows iterative application software to make efficient use of completely general, but repetitive, access patterns. Because the hardware records patterns of physical memory accesses, virtual memory translation issues can be ignored.
A2 interface
The L1p cache accepts commands and data from the A2 core at the pclk period. Received commands are queued in a 32-deep lookup queue. This depth of 32 supports eight outstanding data load requests, four instruction load requests, and 20 store requests. The A2 core supports a maximum of eight outstanding data load requests and four instruction load requests. At reset, the A2 core is programmed to issue no more than 20 outstanding store commands. The number 20 corresponds to the maximum of 16 requests that can be accepted by the switch and an additional four active store commands, not committed to the switch, which the L1p cache can maintain for write combining. The elements of this queue include pointers to a request array that contains the address associated with that command. For store operations, this queue also includes a pointer to the location in the 20‑entry store buffer that contains the store data.
4.1.2 L2 cache functional overview
The L2 cache units provide most of the memory system caching on the Blue Gene/Q compute chip. There are 16 individual caches, or slices. Each cache is assigned to store a unique subset of the physical memory lines. The physical memory addresses that are assigned to each cache slice are static and configurable. The L2 line size is 128 bytes, which is twice the width of an L1 line. L2 slices are set-associative and organized as 1024 sets. Each set has 16‑way association. The L2 data store comprises embedded DRAM, and the tag store comprises SRAM. The main memory is accessed through two on-chip DDR controllers. Each controller manages eight L2 slices. The primary logic of the L2 caches operates at half the processor clock frequency. Some interface logic operates at lower frequencies. Each L2 slice has a single read data port that is 256 bits wide, a single write data port that is 256 bits wide, and a single request port. This port is shared by all processors through the crossbar switch.
The L2 caches primarily operate as normal, set-associative caches. They also support speculative threads and atomic memory transactions.
The L2 caches serve as the point of coherence for all processors. Therefore, they generate L1 invalidations when required. Because the L2 caches are inclusive of the L1 caches, they can remember which processors might have a valid copy of every line. They can multicast invalidations to only those processors. The L2 caches are also a synchronization point, so they coordinate synchronization (msync), load and reserve (lwarx), and store conditional (stwcx) instructions.
4.1.3 Boot eDRAM overview
The Blue Gene/Q system uses a boot eDRAM macro. The eDRAM macro has the following properties:
256 KB capacity
16‑byte‑wide access
1.25 ns cycle time, 5 ns latency
4-way banked, fully pipelined for high throughput
The module is directly operational after reset and provides the boot code. It is also used as a background communication path to the host system. The boot eDRAM macro is connected to the A2 cores with the device bus, which is directly connected to the cores with the crossbar switch. Joint Test Action Group (JTAG) access is managed by the JTAG controller, which is also connected to the device bus.
4.2 Memory management
For optimal performance, manage memory carefully on the Blue Gene/Q system. The memory subsystem of Blue Gene/Q nodes has specific characteristics and limitations. Although a Blue Gene/Q node has 16 GB of memory, physical memory size constraints must still be considered when writing, running, and debugging applications.
The CNK does not dynamically grow its memory usage over time. The CNK consumes a fixed size of 16 MB out of the 16 GB of memory. Therefore, additional threads, mmaps, system calls, buffers, and so on, do not change internal kernel memory usage. The remainder of the memory (16,368 MB on a 16 GB node) is partitioned for the application.
When the application is started, the CNK examines the following information:
Virtual addresses, sizes, and permissions for all application sections
Size of memory to parcel
Requested size of the shared memory segment
Persistent memory size and its present physical address
The number of processes to create
Whether an interpreter is required (an interpreter is commonly required by dynamically linked executable programs)
The CNK partitions memory to form a static memory map. This static memory map is a translation between virtual addresses (as seen by the application) into physical addresses in the DDR3 memory. This partitioning process is designed to generate a valid mapping that maximizes memory use.
4.3 Memory protection
The CNK has several mechanisms that provide protection against incorrect memory accesses:
All storage used by the kernel is inaccessible to user applications.
The text segment of a statically linked application is write protected.
The text segment of a dynamically linked application is write protected.
The nonshared address space of processes on the node is not directly accessible by other processes on the node.
Guard pages can be activated if the compiler does not insert speculative dcbt instructions.
4.4 Shared memory
The CNK supports shared memory between all the processes on a given node. The size of shared memory must be specified to the runjob command with an environment variable. Shared memory is supported for all process counts. However, shared memory is not necessary with one process per node because that process already has access to all of the node memory.
Shared memory is allocated with the standard Linux shm_open() and mmap() methods. The CNK does not have dynamic virtual pages. Therefore, the physical memory that backs the shared memory must come out of a memory region that is dedicated for shared memory. The size of this memory region is set when a job is started.
The BG_SHAREDMEMSIZE environment variable specifies the amount of memory to be allocated in MB. Use the runjob --envs flag. For example, BG_SHAREDMEMSIZE = 32 allocates 32 MB of shared memory storage. For more information about environment variables, see “Compute Node Kernel environment variables” on page 142.
 
The amount of memory to be set aside for this memory region can be changed at job launch.
Example 4-1 illustrates shared-memory allocation.
Example 4-1 Shared memory allocation
fd = shm_open( SHM_FILE, O_RDWR, 0600 );
ftruncate( fd, MAX_SHARED_SIZE );
shmptr1 = mmap( NULL, MAX_SHARED_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
Example 4-2 illustrates shared-memory deallocation.
Example 4-2 Shared memory deallocation
munmap(shmptr1, MAX_SHARED_SIZE);
close(fd);
shm_unlink(SHM_FILE);
The shm_open() and shm_unlink() routines access a pseudo-device, /dev/shm/filename, which the kernel interprets. Because multiple processes can access or close the shared-memory file, allocation and deallocation are tracked by a simple reference count. Therefore, the processes are not required to coordinate deallocation of the shared memory region.
The value of BG_SHAREDMEMSIZE can be queried by the application using the Kernel_GetMemorySize(KERNEL_MEMSIZE_SHARED, &shared_size) SPI call.
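Example 4-1 and Example 4-2 can be combined into a small self-contained program, as in the following sketch. The SHM_FILE name, the MAX_SHARED_SIZE value, the O_CREAT flag, and the error handling are illustrative assumptions; the region size must fit within the amount that is reserved with BG_SHAREDMEMSIZE.

#include <fcntl.h>     /* O_CREAT, O_RDWR */
#include <stdio.h>
#include <sys/mman.h>  /* shm_open, mmap, munmap, shm_unlink */
#include <sys/stat.h>
#include <unistd.h>    /* ftruncate, close */

#define SHM_FILE        "/my_region"        /* illustrative name */
#define MAX_SHARED_SIZE (16 * 1024 * 1024)  /* must fit within BG_SHAREDMEMSIZE */

int main(void)
{
    /* Allocation: open the shared-memory pseudo-device, size it, and map it. */
    int fd = shm_open(SHM_FILE, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("shm_open"); return 1; }
    if (ftruncate(fd, MAX_SHARED_SIZE) != 0) { perror("ftruncate"); return 1; }

    void *shmptr = mmap(NULL, MAX_SHARED_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
    if (shmptr == MAP_FAILED) { perror("mmap"); return 1; }

    /* ... all processes on the node that map SHM_FILE see the same storage ... */

    /* Deallocation: the kernel reference count releases the backing storage
       after the last process has unmapped and unlinked the region.          */
    munmap(shmptr, MAX_SHARED_SIZE);
    close(fd);
    shm_unlink(SHM_FILE);
    return 0;
}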
4.5 Persistent memory
Persistent memory is process memory that retains its contents from job to job. To allocate persistent memory, the environment variable BG_PERSISTMEMSIZE = X must be specified. “X” represents the number of megabytes to be allocated for use as persistent memory. For the persistent memory to be maintained across jobs, all job submissions must specify the same value for the BG_PERSISTMEMSIZE variable. The contents of persistent memory can be reinitialized during job startup either by changing the value of BG_PERSISTMEMSIZE or by specifying the environment variable BG_PERSISTMEMRESET = 1. The persist_open() kernel function supports persistent memory.
4.6 Compute node ramdisk
The CNK provides a random-access file system that is resident in compute node memory. This file system is local to the compute node. File system operations to the ramdisk do not result in I/O activity to the I/O node or other compute nodes.
There are three mount points for the compute ramdisk. Table 4-2 on page 31 shows the mount points.
 
Table 4-2 Mount points for the compute node ramdisk

| Name | Mount point | Size | Scope |
|---|---|---|---|
| Shared memory | /dev/shm/ | Size is determined by the BG_SHAREDMEMSIZE environment variable. MPICH and PAMI use some of this shared memory. | Node-wide; cleared when the job exits. |
| Persistent memory | /dev/persist/ | Size is determined by the BG_PERSISTMEMSIZE environment variable. | Node-wide; cleared only when BG_PERSISTMEMSIZE is specified differently or if BG_PERSISTMEMRESET is set. |
| Local memory | /dev/local | Size is determined by the Kernel_SetLocalFSWindow() SPI call. | Process-wide; cleared when the job exits. |
The CNK supports a range of system calls for the ramdisk, including the following calls: read, write, open, close, lseek, utime, rename, dup, dup2, mmap, munmap, truncate, ftruncate, stat, lstat, fstat, fsync, llseek, readv, writev, truncate64, ftruncate64, stat64, lstat64, and fstat64.
This support has the following limitations:
File access permissions and ownership are not tracked or honored. The chmod and chown functions do not have an effect. However, the access() system call can be used to determine file existence.
File directories are not modeled. So, the mkdir() and rmdir() system calls have no effect. Instead, the open() system call honors separate file namespaces that are specified with directory prefixes. (That is, the slash '/' character is treated as part of the file name, not as a separator of the directory hierarchy.)
The flock() system call handles only the LOCK_EX and LOCK_UN opcodes.
Advisory file access hints using the posix_fadvise() calls are ignored.
The unlink() system call cannot reclaim space for unlinked, memory-mapped files in the compute node ramdisk.
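The following sketch illustrates ramdisk usage and the directory-prefix behavior described in the preceding list. The file name and data are illustrative, and the /dev/local mount point is assumed to have been sized with the Kernel_SetLocalFSWindow() SPI call (see Table 4-2).

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* The '/' characters are simply part of the file name on the ramdisk;
       no directory hierarchy is created.                                   */
    const char *name = "/dev/local/checkpoint/step42";   /* illustrative name */
    int fd = open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "local, in-memory scratch data";
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
    }
    close(fd);

    /* The file stays in compute node memory; no I/O node traffic occurs. */
    unlink(name);
    return 0;
}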
4.7 Support for the /proc file system
The CNK creates several files that can be used to obtain process-specific data. These files are in the /proc/<pid> directory. Table 4-3 lists the files.
Table 4-3 Files in the /proc/<pid> directory

| File | Description |
|---|---|
| /proc/<pid>/exe | A symbolic link to the executable. |
| /proc/<pid>/cwd | A symbolic link to the current working directory. |
| /proc/<pid>/maps | A regular file that represents the memory map for the process. The memory map includes text, data, heap, stack, and dynamic library address ranges for the process. |
| /proc/<pid>/cmdline | A regular file that contains the command line passed into the process. |
| /proc/<pid>/environ | A regular file that contains the environment variables for the process at job start. |
The <pid> is the value that is returned by the getpid() function. For example, if the getpid() function returns 17, the file name is "/proc/17/maps". Alternatively, the /proc/self/<filename> syntax can be used.
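The following sketch dumps the memory map of the calling process through the /proc/self alias. It uses only standard C library calls; no Blue Gene/Q-specific interfaces are assumed.

#include <stdio.h>

int main(void)
{
    /* /proc/self/maps is equivalent to /proc/<getpid()>/maps. */
    FILE *f = fopen("/proc/self/maps", "r");
    if (f == NULL) { perror("fopen"); return 1; }

    char line[256];
    while (fgets(line, sizeof(line), f) != NULL) {
        fputs(line, stdout);   /* text, data, heap, stack, library ranges */
    }
    fclose(f);
    return 0;
}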
Regular files can be accessed through the Blue Gene/Q Code Development and Tools Interface (CDTI) Get File Names, Get File Stat Data, and Get File Contents commands. For more information, see the IBM System Blue Gene Solution: Blue Gene/Q Code Development and Tools Interface, REDP-4659 Redpapers publication.
4.8 L1P prefetcher
The Blue Gene/Q hardware has two prefetch engines that can be manipulated by the user to control cache prefetch behaviors.
The L1 prefetcher (L1p) is a module in the Blue Gene/Q compute chip that is interposed between the A2 bus and the rest of the Blue Gene/Q chip. There are 17 copies of the L1p. Each copy is attached to one of the 17 Blue Gene/Q A2 cores. Figure 4-1 shows the L1p module.
Figure 4-1 L1p module
The L1p performs the performance‑critical tasks of identifying and prefetching memory access patterns and managing a 32 x 128-byte cache structure for prefetch data. The prefetcher is designed to predict which pieces of data will be required by the processor. The L1 prefetcher in the Blue Gene/Q system provides two prefetch algorithms: a linear stream prefetcher and a perfect prefetcher.
4.8.1 Linear stream prefetcher overview
The linear stream prefetcher that is used in the Blue Gene/Q system detects positive sequential memory access strides and prefetches ahead of the stream when possible. It can track up to 16 streams of memory accesses. Each stream is sequential, with a stride of one L1 cache line (64 bytes).
Figure 4-2 on page 33 describes two streams of data that are required by the processor. The streams are at different addresses, but the linear stream prefetcher can track them independently. The goal is that when the processor needs address 0x100080 or 0x175080, the L1p has prefetched that data into its prefetch cache, reducing memory access latency.
Figure 4-2 Linear stream prefetcher
The linear stream prefetcher is the default prefetch algorithm and is always active.
Establishing a stream
The L1p only prefetches from established streams. The stream detection logic has several different modes that are settable by the application to control how streams are established.
Optimistic mode Assumes that all L1 misses will become streams. In this mode, the L1p immediately starts prefetching from the next cache line in the stream.
Confirmed mode The linear stream prefetcher waits until at least one additional L1 miss that corresponds to the stream is detected. After this confirmation, the L1p starts prefetching from the stream.
Confirmed or cache touch mode The prefetcher behaves like confirmed mode. Additionally, a stream will be established if an explicit dcbt (data cache block touch) instruction that results in an L1 cache miss is executed.
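The stream establishment policy can be selected with the L1P_SetStreamPolicy() SPI function that is described in 4.8.3, “L1P prefetcher API descriptions”. In the following sketch, the header name under spi/include/l1p is an assumption; check the headers that are provided in that directory.

#include <l1p/sprefetch.h>   /* assumed header name under spi/include/l1p */

void configure_stream_policy(void)
{
    /* Establish streams only after a confirming L1 miss, or immediately
       when the miss comes from an explicit dcbt instruction.            */
    L1P_SetStreamPolicy(L1P_stream_confirmed_or_dcbt);
}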
Adaptive prefetching
When a stream is established, it is internally assigned a stream number. This stream number is used to track the depth of the prefetch for that stream. Some streams advance more quickly than others. Therefore, certain streams should prefetch further ahead than other streams.
With adaptive prefetching, when the L1p detects that an established stream encounters an L1p miss that is not already in the prefetch cache, it automatically increases that stream's prefetch depth by one L1p 128‑byte cache line (up to the adaptive prefetch depth limit). The maximum and initial prefetch depths for adaptive streams are configurable between 1 and 8. The rate of adaptation can also be throttled to tune the performance of adaptive prefetching.
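The related SPI controls are shown in the following sketch. The chosen initial depth is illustrative, and the header name is an assumption.

#include <l1p/sprefetch.h>   /* assumed header name under spi/include/l1p */

void configure_adaptive_prefetch(void)
{
    L1P_SetStreamDepth(2);          /* new streams start two 128-byte lines ahead */
    L1P_SetStreamAdaptiveMode(1);   /* TRUE: allow per-stream depth to grow on underflow */
}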
Thread synchronization
Each core in a Blue Gene/Q system has a single dedicated linear stream prefetcher. However, all four hardware threads share the same linear stream prefetcher. This sharing can cause an atomicity and ordering problem if multiple threads modify the linear stream prefetcher’s configuration registers. The L1p and the SPI do not restrict or block access to the configuration registers. If the application developer is potentially modifying the configuration registers from multiple hardware threads, locking must be added or there might be some non-deterministic choices in the L1p configuration.
The following strategies can be used for locking:
pthread_mutexes Standard POSIX locking primitives. These primitives are simple and work well in threaded processes.
larx/stcx PowerPC load reservation locking mechanism. This mechanism can be used to create a lighter‑weight lock than pthread_mutex. The larx/stcx instructions can also be used for multiple processes that have a shared memory region.
L2 atomics Using the Blue Gene/Q L2 atomic operations to "take a ticket" with load and increment. The thread blocks until the "now serving" counter matches the ticket. The thread then updates the configuration register and performs a store and add operation to add 1 to the "now serving" counter.
L2 transactional memory regions cannot be used for atomically setting L1p configuration registers because the L2 does not version the L1p MMIO memory region.
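The following sketch shows the pthread_mutex strategy. The mutex is an application-defined variable, and the wrapped SPI call and the header name are illustrative assumptions.

#include <pthread.h>
#include <stdint.h>
#include <l1p/sprefetch.h>   /* assumed header name under spi/include/l1p */

/* Application-defined lock that serializes L1p configuration updates
   among the four hardware threads that share the core's L1p.          */
static pthread_mutex_t l1p_config_lock = PTHREAD_MUTEX_INITIALIZER;

void set_stream_depth_locked(uint32_t depth)
{
    pthread_mutex_lock(&l1p_config_lock);
    L1P_SetStreamDepth(depth);
    pthread_mutex_unlock(&l1p_config_lock);
}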
4.8.2 Perfect prefetcher overview
The perfect prefetcher in the Blue Gene/Q system uses a recorded pattern of memory accesses to effectively prefetch data into the caches. Unlike the linear stream prefetcher, the perfect prefetch algorithm requires that the application train the perfect prefetcher with specific patterns of memory accesses. When the application executes the same section of code again, the application must inform the L1p hardware that the previously recorded pattern will be reoccurring. As the hardware thread is executing this section of code, the L1p hardware is tracking the progress of the pattern and attempting to prefetch ahead of the anticipated data. Since the recorded pattern and the next pattern through a section of code might not be precisely the same, the L1p has some tolerance for pattern deviations.
There are four perfect prefetchers per L1p. Each perfect prefetcher is assigned to a separate hardware thread in the associated A2 core. This assignment allows each hardware thread to create and execute a separate list of prefetches without software coordination.
The headers and documentation for the L1p system programming interface (SPI) are provided in the spi/include/l1p directory.
Training the perfect prefetcher
To train the perfect prefetcher, the application configures the L1p with the L1P_PatternConfigure(size) L1p SPI call. The size parameter contains the maximum number of L1 misses that are expected through the code sequence. This is used for calculating the memory buffer space that is needed to hold the patterns. There is no enforced limit (outside of the size of available memory) for the maximum size of the pattern. If the buffer space is exceeded by the pattern, pattern recording is halted and the "Maximum" bit in the L1P_Status_t structure indicates this overflow condition.
Next, the application calls L1P_PatternStart() with the record flag set and then executes a section of code. Meanwhile, the L1p hardware is recording each L1 cache miss in the application-provided storage. When the application has completed the code section, it tells the L1p to stop recording by means of another L1p SPI call, L1P_PatternStop().
This prefetch algorithm works when there is a consistent L1 cache miss pattern. For applications that momentarily deviate from consistency, the L1p can disable or pause the prefetcher (both training and prefetching). This process prevents the prefetcher from recording L1 cache misses that are not likely to repeat during the next execution of the recorded code.
Using a trained pattern
When a list of L1 cache misses has been trained, the application calls the L1P_PatternStop() function. This stops training and sets up the trained pattern to be used for prefetching on the next iteration.
To start prefetching with this new pattern, the application then issues an L1P_PatternStart() call. This call tells the hardware thread's L1 perfect prefetcher to start loading the list into its cache. It then monitors the loaded section of the list and tracks the L1 cache misses against the list.
In this call to the L1P_PatternStart() function, the record flag can optionally be set. When set (in self-healing mode), the L1p creates a new revision of the list in a separate physical memory location for later use. The L1P_PatternStart() function then implicitly toggles between the two lists (current and next list) by swapping the physical addresses for the L1p perfect prefetcher's read/write base addresses.
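The basic record-and-reuse cycle that is described above is sketched in the following fragment. The pattern size and the compute_kernel() routine are illustrative assumptions, and the header name under spi/include/l1p is also assumed.

#include <l1p/pprefetch.h>   /* assumed header name under spi/include/l1p */

extern void compute_kernel(void);   /* application code with a repetitive L1 miss pattern */

void run_iterations(int iterations)
{
    /* Reserve storage for up to 100000 recorded L1 misses (an assumed size). */
    L1P_PatternConfigure(100000);

    for (int iter = 0; iter < iterations; iter++) {
        /* record = 1: record a fresh pattern on every pass while, from the
           second pass on, prefetching with the previously recorded pattern. */
        L1P_PatternStart(1);
        compute_kernel();
        L1P_PatternStop();
    }

    L1P_PatternUnconfigure();
}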
To synchronize the prefetching of the data pattern with the A2 execution, the perfect prefetcher must track where the application is executing with respect to the prefetch list. Since the application is not required to be perfectly reproducible with regards to L1 cache misses, some tolerance is provided for the appearance of L1 miss addresses that are not present in the prerecorded pattern and for addresses that are recorded in the pattern which are missing from the actual stream of L1 cache misses. Figure 4-3 shows an L1p cache miss.
Figure 4-3 L1p cache miss
In Figure 4-3, the address at location 'x' was not forecast in the list. However, the next L1 miss at location 'c' was expected. The perfect prefetcher ignores the rogue address 'x' and continues matching at location 'c'. Similarly, locations 'y' and 'z' were in the list, but were not presented to the L1p as L1 cache misses. Again, the L1p was able to look ahead in the pattern and adjust its list address offset to correspond to location 'e'.
Since the L1p cannot have a multi-megabyte metadata cache containing the full list, it fetches as many as 24 entries of the list to identify data to be prefetched and to synchronize with the stream of L1 misses. The pattern prefetching hardware is able to compare the current L1 miss address with the next, not-yet-matched list entry and up to 7 subsequent addresses in the prerecorded pattern. If an L1 miss is not present in this group of up to 8 pattern addresses available for comparison, the perfect prefetcher drops that L1 miss address but tracks the number of such consecutive non-list misses. If the number of consecutive misses exceeds a predefined miss threshold, the perfect prefetcher abandons the list and halts prefetching. This behavior is achieved by providing a 24-location buffer inside each perfect prefetcher into which the prerecorded pattern of addresses is automatically loaded from memory. These 24 locations are divided into two groups. The first group contains the first 8 addresses in the pattern presently being prefetched and is instrumented to permit the 8 comparisons. The remaining 16 locations are in standard SRAM and provide a buffering function, increasing the likelihood that the prerecorded addresses are available in the perfect prefetcher when needed.
Thus, there are two scenarios that can lead to list abandonment:
The thread started executing code whose memory accesses contain a sequence of addresses, longer than the preset threshold, that is unrelated to the recorded memory pattern. The preset threshold can be set with the L1P_PatternSetAbandonThreshold() function.
The thread's memory access pattern jumped ahead in the list by more than 8 entries (and therefore the L1p lost synchronization).
The perfect prefetcher status bits, returned by the L1P_PatternStatus() function, can be used to determine if the prefetch abandoned or completed the list.
A recorded pattern can be reused at any time, but only one prefetch list can be active per hardware thread.
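After a pass, the status bits can be inspected as in the following sketch, which uses the L1P_Status_t fields from Table 4-8; the header name is an assumption.

#include <stdio.h>
#include <l1p/pprefetch.h>   /* assumed header name under spi/include/l1p */

void check_pattern_status(void)
{
    L1P_Status_t status;
    L1P_PatternStatus(&status);

    if (status.Abandoned) {
        /* Too many consecutive non-matching misses; see L1P_PatternSetAbandonThreshold(). */
        printf("perfect prefetcher abandoned the list\n");
    }
    if (status.Maximum) {
        /* Recording overflowed the size given to L1P_PatternConfigure(). */
        printf("recorded pattern reached the configured maximum\n");
    }
    if (status.Finished) {
        printf("pattern completed\n");
    }
}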
Saving and restoring patterns
The L1p perfect prefetcher SPI only manages one pattern at a time. However, for greater flexibility, the L1P_GetPattern() function can be used to retrieve the active pattern from the SPI. This allows pattern storage allocated by the L1P_PatternConfigure() function to be detached and set aside for later usage without destroying the pattern or requiring regeneration of the pattern.
Later on, the application developer can restore the old pattern using the L1P_SetPattern() function. This is more efficient because it does not require a reallocation and regeneration of the pattern. After restoring the pattern, the application calls the L1P_PatternStart() function to begin executing the pattern.
The number of patterns that can be retained by the application is limited only by the amount of memory available to hold patterns.
When a pattern has been retrieved using the L1P_GetPattern() function, the application must manage the storage for the pattern. When the pattern is no longer needed, the application should call the L1P_DeallocatePattern() function to release the storage.
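The following sketch follows the sequence described above: record a pattern, detach it, reinstall it later, and release it when it is no longer needed. The pattern size and the phase_a() routine are illustrative assumptions, configuration teardown is omitted, and the header name is assumed.

#include <stddef.h>
#include <l1p/pprefetch.h>   /* assumed header name under spi/include/l1p */

extern void phase_a(void);           /* application code (illustrative) */

static L1P_Pattern_t *saved_pattern = NULL;

void record_and_detach(void)
{
    L1P_PatternConfigure(50000);     /* assumed maximum number of L1 misses */
    L1P_PatternStart(1);             /* record while executing */
    phase_a();
    L1P_PatternStop();
    L1P_GetPattern(&saved_pattern);  /* detach: the application now owns the storage */
}

void replay_later(void)
{
    L1P_SetPattern(saved_pattern);   /* restore without regenerating the pattern */
    L1P_PatternStart(0);             /* replay only; do not record a new pattern */
    phase_a();
    L1P_PatternStop();
}

void release_pattern(void)
{
    L1P_DeallocatePattern(saved_pattern);   /* reclaim the detached pattern storage */
    saved_pattern = NULL;
}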
Interaction with the linear stream prefetcher
Each L1p comprises a single copy of the linear stream prefetcher and four copies of the perfect prefetcher. Although the algorithms are separate, they share a significant amount of the internal L1p arrays.
The prefetch data array contains all of the demand-loaded and prefetched data, regardless of which prefetch algorithm fetched the data. Therefore, this array is shared between all five prefetchers (one linear stream + four perfect prefetchers). The L1P_SetStreamTotalDepth() routine can be used to limit the linear stream prefetcher's total depth. This reserves a section of the prefetch data array specifically for the perfect prefetcher.
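The following one-call sketch partitions the shared prefetch data array; the chosen value is illustrative and the header name is an assumption.

#include <l1p/sprefetch.h>   /* assumed header name under spi/include/l1p */

void reserve_prefetch_buffer_space(void)
{
    /* Let linear streams use at most 16 of the 128-byte prefetch lines;
       the unallocated lines remain available to the perfect prefetchers. */
    L1P_SetStreamTotalDepth(16);
}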
Nested patterns
If a segment of code that is using the perfect prefetcher starts executing a subroutine or library that is not well understood, the application programmer must consider the following scenarios:
1. If the target subroutine has a fairly predictable access pattern and contains no perfect prefetcher use, the subroutine call can also contribute to the pattern that is being recorded by the calling routine. No surrounding perfect prefetcher directives are required.
2. If the subroutine has some random access or data-dependent elements, it might be helpful for the application programmer to issue an L1P_PatternPause() command before the subroutine call and an L1P_PatternResume() command when that subroutine returns. Pausing the prefetcher might reduce the chance of pattern abandonment. When the subroutine is understood, reevaluate the usage of pause.
3. If the subroutine or library also uses the perfect prefetcher and makes its own L1P_Configure() calls, there are controls over the behavior of the L1p SPI when there are nested L1P_Configure() calls.
If an L1P_Configure() call occurs while a perfect prefetcher is active, the default action is to perform a context switch of the calling hardware thread's L1p perfect prefetcher hardware state. The addresses from which the pattern is being read and to which the new pattern is being written are stored to memory with the content of the prefetcher's configuration registers. When the matched L1P_Unconfigure() function is executed, the perfect prefetcher context is restored and its normal prefetching continues. The pattern that is produced during such a sequence can be successfully used to prefetch data during a later execution of the same code. Any number of such context switches can be nested, supporting a normal, modular programming environment. However, these context switches might require a system call, which can reduce performance.
If this default L1P_Configure() function behavior is not preferred, the application programmer can call the L1P_PatternSetNestingMode() function. This routine allows the application to manage the L1P SPI behavior when a nested routine is being called. There are four modes:
 – Save/Restore context is always performed when encountering L1P_Configure/Unconfigure() function calls. This is the default mode.
 – Nested L1P_Configure() routines disable the ability to change the active pattern (that is, the current pattern stays active).
 – Nested L1P_Configure() and L1P_SetPattern() routines are disabled. However, other L1P SPI routines that do not switch the active pattern continue to function.
 – Any nested L1P_Configure() routines can result in a fatal error. This can be used for debugging.
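The following sketch illustrates scenarios 2 and 3. The subroutine names are illustrative assumptions, and the header name under spi/include/l1p is assumed.

#include <l1p/pprefetch.h>   /* assumed header name under spi/include/l1p */

extern void unpredictable_subroutine(void);    /* data-dependent accesses (illustrative) */
extern void library_with_own_patterns(void);   /* makes its own nested configure calls */

void body(void)
{
    /* Scenario 2: keep data-dependent misses out of the recorded pattern. */
    L1P_PatternPause();
    unpredictable_subroutine();
    L1P_PatternResume();

    /* Scenario 3: choose how nested configure calls are handled before
       calling a library that manages its own patterns.                   */
    L1P_PatternSetNestingMode(L1P_NestingIgnore);
    library_with_own_patterns();
    L1P_PatternSetNestingMode(L1P_NestingSaveContext);   /* restore the default */
}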
Performance counter support
The perfect prefetcher makes full use of the performance counters in the Blue Gene/Q chip. Thus, software can monitor the number of:
Pattern write overflows
Times the pattern was abandoned
Pattern starts
Times the core was stalled waiting for the pattern to be read
Times the core was stalled waiting for the pattern to be written
Times the core address does not match any pattern address
Times an address in the pattern is skipped over
Times the pattern comparison catches up with the pattern prefetching
Comparisons with the pattern
These performance counters can be obtained through the Blue Gene/Q performance counter library and related tools.
4.8.3 L1P prefetcher API descriptions
This section describes the L1P prefetcher API.
Defines and enumerations
Table 4-4 describes the L1P_StreamPolicy_t enumeration.
Table 4-4 enum L1P_StreamPolicy_t
| Name | Description |
|---|---|
| L1P_stream_optimistic | Any L1 cache miss memory reference (optimistically) establishes a stream. |
| L1P_stream_confirmed | The L1p waits for confirmation before establishing the stream. |
| L1P_stream_confirmed_or_dcbt | Any L1 cache miss memory reference using a dcbt (data cache block touch) instruction automatically creates an established stream. Otherwise, the L1p waits for confirmation before establishing the stream. |
Table 4-5 describes the L1P_PatternNest_t enumeration.
Table 4-5 enum L1P_PatternNest_t
| Name | Description |
|---|---|
| L1P_NestingSaveContext (default) | Any nested L1P_Configure() routines result in an implicit context save. The matched L1P_Unconfigure() routine restores the L1P context. |
| L1P_NestingIgnore | The L1P pattern routines are disabled for the thread after a nested L1P_Configure() routine is performed. The pattern routines are re-enabled when the matching L1P_Unconfigure() routine is performed. |
| L1P_NestingFlat | When the application makes a nested L1P_Configure() call, it is reference counted and ignored. Any L1P_SetPattern() calls are ignored if they occur in a nested context. |
| L1P_NestingError | Nested L1P_Configure() routines cause an error message and assert a failure. This causes the termination of the application. |
Table 4-6 describes the L1P_PatternLimitPolicy_t enumeration.
Table 4-6 enum L1P_PatternLimitPolicy_t
| Name | Description |
|---|---|
| L1P_PatternLimit_Disable | The limit on the number of allocated patterns is disabled (that is, no limit). |
| L1P_PatternLimit_Error | Exceeding the limit on the number of allocated patterns causes the pattern allocation to fail with L1P_NOMEMORY. |
| L1P_PatternLimit_Assert | Exceeding the limit on the number of allocated patterns causes an assertion failure, and the process abnormally terminates. |
| L1P_PatternLimit_Prune | Exceeding the limit on the number of allocated patterns causes pattern allocations to be treated as though the nesting mode is L1P_NestingIgnore. |
The value for L1P_CACHELINESIZE is 128.
Data types
Table 4-7 describes the L1P_Pattern_t struct.
Table 4-7 Struct L1P_Pattern_t
| Name | Type | Width | Description |
|---|---|---|---|
| Size | size_t | 8 bytes | Size (in bytes) of the memory region that is allocated by the L1P_Allocate() routine. |
| ReadPattern | void* | 8 bytes | Virtual memory address for the read pattern. |
| WritePattern | void* | 8 bytes | Virtual memory address for the write/generated pattern. |
Table 4-8 on page 40 describes the L1P_Status_t struct.
Table 4-8 struct L1P_Status_t
| Name | Type | Width | Description |
|---|---|---|---|
| Finished | uint64_t | 1 bit | Boolean that indicates that the perfect prefetcher has completed the list. This bit is cleared when a list has started executing, and set when the list has completed. |
| Abandoned | uint64_t | 1 bit | Set if a failure to match causes list comparison to be abandoned. |
| Maximum | uint64_t | 1 bit | Set if the length of the update reaches the maximum. |
L1P perfect prefetcher configuration functions
Table 4-9 describes the L1P_PatternConfigure function.
Table 4-9 int L1P_PatternConfigure(uint64_t n)
| Item | Description |
|---|---|
| Parameter: uint64_t n (input) | The maximum number of L1 misses that can be tracked by the list. |
| Return code: L1P_NOMEMORY | The application was unable to allocate enough memory. |
| Return code: L1P_ALREADYCONFIGURED | The L1p perfect prefetcher was already configured. |
| Latency | The implementation might require system calls. On the CNK, this routine might use the glibc malloc() function internally. The malloc() function can then perform brk() or mmap() system calls to allocate storage. |
Description:
Allocates enough storage so that the perfect prefetcher can track up to <n> L1 misses.
Storage is retained until the following actions occur:
L1P_Unconfigure() is performed.
L1P_SetPattern() is performed.
If the L1P_Configure() command is nested:
If nesting mode has been set to L1P_NestingSaveContext, the L1P SPI pushes a L1P context structure onto a stack of L1P context structures. When an L1P_Unconfigure() function is called, this L1P context structure is restored. This is the default mode.
If nesting mode has been set to L1P_NestingIgnore, the L1P SPI will reference count the L1P_Configures. When nested, the SPI does not write new pattern addresses into the L1p hardware. When the same number of L1P_Unconfigure() routines have been called, the L1P SPI returns to normal function.
If the nesting mode has been set to L1P_NestingFlat, the L1P SPI will reference count and ignore nested L1P_Configure() calls. All L1P_SetPattern() calls are ignored if they occur in a nested context.
If nesting mode has been set to L1P_NestingError, the L1P SPI will display an error message and assert. This terminates the active process with a core file. This mode is to be used for debug purposes.
Example
Nested L1P_Configure:
L1P_Configure(1000);
// …code…
L1P_Configure(1500);
// …code…
L1P_Unconfigure();
// …code…
L1P_Unconfigure();

Unnested L1P_Configure:
L1P_Configure(1000);
// …code…
L1P_Unconfigure();
L1P_Configure(1500);
// …code…
L1P_Unconfigure();
Table 4-10 describes the int L1P_PatternUnconfigure() function.
Table 4-10 int L1P_PatternUnconfigure()
| Item | Description |
|---|---|
| Parameters | None |
| Return code: L1P_NOTCONFIGURED | The L1p has not been configured. |
| Latency | Implementation might require system calls. On the CNK, this routine might use the glibc free() routine internally. The free() call can then perform brk() or munmap() system calls to free storage. |
Description:
Deallocates storage used by the L1p SPI.
If one is available, the L1P SPI will pop a L1P context structure from the stack of L1P context structures. The context will then be used to restore the previous L1P pattern status and pointers.
 
L1p perfect prefetcher control functions
This section describes the L1p perfect prefetcher control functions.
Table 4-11 describes the L1P_PatternStart(int record) function.
Table 4-11 int L1P_PatternStart(int record)
| Item | Description |
|---|---|
| Parameter: int record (input) | Boolean that indicates whether L1P_PatternStart generates a new pattern. If set to TRUE, a new pattern is generated. Generation of a new pattern might occur simultaneously with the execution of an old pattern. |
| Return code: L1P_PATTERNACTIVE | L1P_PatternStart() was called while a pattern was active. |
| Return code: L1P_NOTCONFIGURED | The L1p has not been configured. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
The perfect prefetcher will start monitoring L1 misses and performing prefetch requests based on those misses. The 'record' parameter instructs the PatternStart to record the pattern of L1 misses for the next iteration.
This L1P_PatternStart() should be called at the beginning of every entrance into the section of code that has been recorded.
Table 4-12 describes the L1P_PatternPause() function.
Table 4-12 int L1P_PatternPause()
| Item | Description |
|---|---|
| Parameters | None |
| Return code: L1P_NOTCONFIGURED | The L1p prefetcher has not been configured. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Suspends the active perfect prefetcher. The Linear Stream Prefetcher and the other three perfect prefetchers on the core continue to execute.
This routine can be used in conjunction with L1P_PatternResume() function to avoid recording out-of-bound memory fetches, such as instructions performing a periodic printf. It can also be used to avoid sections of code that perform memory accesses that are inconsistent between iterations.
Table 4-13 describes the L1P_PatternResume() function.
Table 4-13 int L1P_PatternResume()
| Item | Description |
|---|---|
| Parameters | None |
| Return code: L1P_NOTCONFIGURED | The L1p has not been configured. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Resumes the perfect prefetcher from the last pattern offset location.
This routine can be used in conjunction with L1P_PatternPause() to avoid recording memory fetches that are not likely to repeat, such as instructions performing a periodic printf. It can also be used to avoid sections of code that perform memory accesses that are inconsistent between iterations.
Table 4-14 describes the L1P_PatternStop() function.
Table 4-14 int L1P_PatternStop()
| Item | Description |
|---|---|
| Parameters | None |
| Return code: L1P_NOTCONFIGURED | The L1p has not been configured. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Stops the perfect prefetcher and resets the list offsets to zero.
Table 4-15 describes the L1P_PatternStatus function.
Table 4-15 int L1P_PatternStatus(L1P_Status_t* status)

| Item | Description |
|---|---|
| Parameter: L1P_Status_t* status (output) | Perfect prefetcher status bits. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |

Description:
Returns the current status for the L1 perfect prefetcher.
Table 4-17 describes the L1P_PatternGetCurrentDepth function.
Table 4-17 int L1P_PatternGetCurrentDepth(uint64_t* fetch_depth, uint64_t* generate_depth)

| Item | Description |
|---|---|
| Parameter: uint64_t* fetch_depth (output) | Current depth of L1 misses in the prefetching pattern. |
| Parameter: uint64_t* generate_depth (output) | Current depth of L1 misses in the generated pattern. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses read-only user-space memory-mapped registers. |
Description:
Returns the current pattern depths for the L1 perfect prefetcher. The pattern depth is the current index into the pattern that the L1p is executing.
The fetch_depth parameter is used to determine how far in the current pattern/sequence the L1p has progressed.
The generate_depth parameter can be used to optimize the pattern length parameter to L1P_PatternConfigure() to reduce the memory footprint of the L1p pattern.
Table 4-18 describes the L1P_PatternGetNestingMode function.
Table 4-18 int L1P_PatternGetNestingMode(L1P_PatternNest_t* mode)

| Item | Description |
|---|---|
| Parameter: L1P_PatternNest_t* mode (output) | The current nesting mode. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Returns the current nesting mode for the L1 perfect prefetcher.
The supported nesting modes are L1P_NestingSaveContext, L1P_NestingIgnore, L1P_NestingFlat, L1P_NestingError. A description of each of these modes is in “Defines and enumerations” on page 38.
Table 4-19 describes the L1P_PatternSetNestingMode function.
Table 4-19 int L1P_PatternSetNestingMode(L1P_PatternNest_t mode)
| Item | Description |
|---|---|
| Parameter: L1P_PatternNest_t mode (input) | New nesting mode. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |

Description:
Sets the nesting mode for the L1 perfect prefetcher.
The default mode is L1P_NestingSaveContext. Other nesting modes are L1P_NestingIgnore, L1P_NestingFlat, L1P_NestingError. A description of each of these modes is in “Defines and enumerations” on page 38.
Table 4-20 on page 45 describes the L1P_PatternSetAbandonThreshold function.
Table 4-20 int L1P_PatternSetAbandonThreshold(uint64_t numL1misses)
| Item | Description |
|---|---|
| Parameter: uint64_t numL1misses (input) | The number of consecutive, non-matching L1 misses that will result in a pattern being abandoned. The valid range is 1 to 63. Default = 63. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |

Description:
Sets the number of consecutive L1 misses that did not match the current location in the pattern. After this number has been exceeded, the prefetching activity will cease and the pattern will be marked as "Abandoned" in the L1P_Status_t structure returned by the L1P_PatternStatus() function.
Table 4-22 describes the L1P_PatternGetAbandonThreshold function.
Table 4-22 int L1P_PatternGetAbandonThreshold(uint64_t* numL1misses)
| Item | Description |
|---|---|
| Parameter: uint64_t* numL1misses (output) | The number of consecutive, non-matching L1 misses that will result in a pattern being abandoned. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |

Description:
Returns the number of consecutive L1 misses that did not match the current location in the pattern. After this number has been exceeded, the prefetching activity will cease and the pattern will be marked as "Abandoned" in the L1P_Status_t structure returned by the L1P_PatternStatus() function.
Table 4-23 describes the L1P_PatternSetEnable function.
Table 4-23 int L1P_PatternSetEnable(int enable)
| Item | Description |
|---|---|
| Parameter: int enable (input) | L1p pattern prefetcher enable flag. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Sets a software enable/disable for L1p perfect prefetcher. This can be used to ascertain whether the usage of the prefetcher is improving performance.
Table 4-24 describes the L1P_PatternGetEnable function.
Table 4-24 int L1P_PatternGetEnable(int* enable)
| Item | Description |
|---|---|
| Parameter: int* enable (output) | L1p pattern prefetcher enable flag. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Returns the software enable/disable for L1p perfect prefetcher.
 
Explicit pattern management functions
This section describes explicit pattern management functions.
Table 4-25 describes the L1P_AllocatePattern function.
Table 4-25 int L1P_AllocatePattern(uint64_t n, L1P_Pattern_t** ptr)
| Item | Description |
|---|---|
| Parameter: uint64_t n (input) | The maximum number of L1 misses that can be tracked by the list. |
| Parameter: L1P_Pattern_t** ptr (output) | Location to store the pointer to the allocated memory access pattern. |
| Return code: L1P_NOMEMORY | The application was unable to allocate enough memory. |
| Latency | Implementation might require system calls. On the CNK, this routine can use the glibc malloc() function internally. The malloc() function call can then perform brk() or mmap() system calls to allocate storage. |
Description:
Allocates storage to hold an L1p pattern of L1 miss addresses. This allows for the application to allocate storage for uninitialized patterns. This pattern storage can be passed to L1P_SetPattern(). Storage must be deallocated with L1P_DeallocatePattern().
Table 4-26 describes the L1P_SetPattern function.
Table 4-26 int L1P_SetPattern(L1P_Pattern_t* pattern)
| Item | Description |
|---|---|
| Parameter: L1P_Pattern_t* pattern (input) | Pointer to a valid pattern. |
| Return code: L1P_NOTAPATTERN | The specified pointer is not a valid pointer. |
| Latency | Implementation might require system calls. On the CNK, because memory protection is a requirement, this routine results in a system call to validate the pattern and set up the physical addresses needed by the hardware. |
Description:
Sets the perfect prefetcher's hardware registers with a given pattern. This allows for retaining several patterns of memory accesses and finer control of the L1p. It is not required for the default usage model.
The L1p SPI will not deallocate the structure.
Table 4-27 describes the L1P_GetPattern function.
Table 4-27 int L1P_GetPattern(L1P_Pattern_t** pattern)
| Item | Description |
|---|---|
| Parameter: L1P_Pattern_t** pattern (output) | Location to store the pointer to the pattern structure. |
| Return code: L1P_NOTCONFIGURED | The L1p has not been configured. |
| Latency | Implementation might require system calls. |
Description:
Returns pointers to the current L1p pattern. Later, the pattern pointer can then be passed back into L1P_SetPattern().
After L1P_GetPattern is called, the application will own the pattern and must call L1P_DeallocatePattern() to reclaim that storage. This allows pattern storage that is allocated through L1P_PatternConfigure() to be detached and retained for later usage.
This allows for retaining several patterns of memory accesses and finer control of the L1p. It is not required for the default usage model.
Table 4-28 on page 48 describes the L1P_DeallocatePattern function.
Table 4-28 int L1P_DeallocatePattern(L1P_Pattern_t* ptr)
| Item | Description |
|---|---|
| Parameter: L1P_Pattern_t* ptr (input) | Pointer to an existing memory access pattern. |
| Return code: L1P_NOTAPATTERN | The specified pointer is not a valid pointer. |
| Latency | Implementation might require system calls. On the CNK, this routine can use the glibc free() routine internally. The free() call can then perform brk() or munmap() system calls to deallocate storage. |
Description:
Deallocates storage previously assigned to the list of addresses.
This allows for the application to deallocate storage for patterns that have been detached from normal L1p SPI control.
Do not use L1P_DeallocatePattern() on non-detached patterns.
Table 4-29 describes the L1P_PatternSetPatternLimit function.
Table 4-29 int L1P_PatternSetPatternLimit(L1P_PatternLimitPolicy_t policy, int numallocatedpatterns)

| Item | Description |
|---|---|
| Parameter: L1P_PatternLimitPolicy_t policy (input) | Specifies the behavior when the number of allocated patterns has been exceeded. |
| Parameter: int numallocatedpatterns (input) | Number of allocated patterns that are allowed in the application. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Sets behavior when the number of allocated patterns that the application can have active exceeds an artificial limit. This can be used to determine if there is a memory leak in the pattern allocations.
The default policy is L1P_PatternLimit_Disable.
Table 4-30 describes the L1P_PatternGetPatternLimit function.
Table 4-30 int L1P_PatternGetPatternLimit(L1P_PatternLimitPolicy_t* policy, int* numactivelists)
| Item | Description |
|---|---|
| Parameter: L1P_PatternLimitPolicy_t* policy (output) | Behavior when the number of allocated patterns has been exceeded. |
| Parameter: int* numactivelists (output) | Number of active/allocated patterns that are allowed in the application. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Returns the current behavior when the number of allocated patterns that the application can have active exceeds that limit. The current limit is also returned.
 
L1P linear stream prefetcher control functions
This section describes the L1p linear stream prefetcher control functions.
Table 4-31 describes the L1P_GetStreamAdaptiveMode function.
Table 4-31 int L1P_GetStreamAdaptiveMode(int* adaptiveState)
| Item | Description |
|---|---|
| Parameter: int* adaptiveState (output) | Boolean that indicates whether adaptive mode is enabled or disabled. TRUE = enabled. FALSE = disabled. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Returns enable/disable status of the linear stream prefetcher's adaptation mode.
Table 4-32 describes the L1P_SetStreamAdaptiveMode function.
Table 4-32 int L1P_SetStreamAdaptiveMode(int Enable)
| Item | Description |
|---|---|
| Parameter: int Enable (input) | Boolean that enables or disables adaptive mode. TRUE = enabled. FALSE = disabled. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |

Description:
Enables or disables the linear stream prefetcher's depth adaptation mode.
Table 4-33 describes the L1P_GetStreamPolicy function.
Table 4-33 int L1P_GetStreamPolicy(L1P_StreamPolicy_t* policy)
| Item | Description |
|---|---|
| Parameter: L1P_StreamPolicy_t* policy (output) | Current L1P stream policy. |
| Return codes | None defined. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Returns the linear stream prefetch policy in the specified pointer. The policy controls when a stream is established.
Table 4-34 describes the L1P_SetStreamPolicy function.
Table 4-34 int L1P_SetStreamPolicy(L1P_StreamPolicy_t policy)

| Item | Description |
|---|---|
| Parameter: L1P_StreamPolicy_t policy (input) | New policy. |
| Return code: L1P_PARMRANGE | An invalid stream policy was specified. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Changes the linear stream prefetch policy. The policy controls when a stream is established.
Table 4-35 describes the L1P_GetStreamDepth function.
Table 4-35 int L1P_GetStreamDepth(uint32_t* depth)
| Item | Description |
|---|---|
| Parameter: uint32_t* depth (output) | Integer 1 to 8 for the number of 128-byte lines ahead to fetch for all future established streams. |
| Return code: L1P_PARMRANGE | The specified address would have resulted in a segmentation violation. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Returns the default stream depth when a new stream has been created. This default depth can be modified on a per stream basis using the adaptive mode (if enabled).
Table 4-36 describes the L1P_SetStreamDepth function.
Table 4-36 int L1P_SetStreamDepth(uint32_t depth)
| Item | Description |
|---|---|
| Parameter: uint32_t depth (input) | Number of 128-byte lines ahead to fetch for all future established streams. The valid range is 1 to 8. |
| Return code: L1P_PARMRANGE | The specified stream depth is not within the valid range. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |

Description:
When a new stream is established, the stream is set to the initial target prefetch depth specified by L1P_SetStreamDepth(). A stream's prefetch depth can subsequently vary if the adaptive prefetch mode is enabled.
Table 4-37 describes the L1P_GetStreamTotalDepth function.
Table 4-37 int L1P_GetStreamTotalDepth(uint32_t* depth)
| Item | Description |
|---|---|
| Parameter: uint32_t* depth (output) | Integer 1 to 32 for the total footprint of 128-byte lines that the stream engine will endeavor to use. |
| Return code: L1P_PARMRANGE | The specified address will cause a segmentation violation. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Gets the number of 128-byte cache lines that can be used by the linear stream prefetcher. Unallocated lines will be used by the perfect prefetcher. This can help prevent thrashing between the prefetch algorithms.
Table 4-38 describes the L1P_SetStreamTotalDepth function.
Table 4-38 int L1P_SetStreamTotalDepth(uint32_t depth)
| Item | Description |
|---|---|
| Parameter: uint32_t depth (input) | Total footprint of 128-byte lines that the stream engine will endeavor to use. The valid range is 1 to 32. |
| Return code: L1P_PARMRANGE | The specified total stream depth is not within the valid range. |
| Latency | Inlineable function call that accesses user-space memory-mapped registers. |
Description:
Sets the number of 128-byte cache lines that can be used by the linear stream prefetcher. The unallocated lines will continue to be used by the perfect prefetcher to help prevent thrashing between the prefetch algorithms.
 
L1p error conditions
Table 4-39 on page 52 describes the L1p error conditions.
Table 4-39 L1p error conditions
| Error code | Description |
|---|---|
| 0 | No error. |
| L1P_NOMEMORY | There was not enough memory available to set up the L1P for the given pattern size. |
| L1P_PARMRANGE | The parameters that are passed to the L1P exceeded the valid range supported by the L1p hardware. |
| L1P_PATTERNACTIVE | Attempted to use a function when a pattern was already active. The application must issue an explicit L1P_PatternStop() before calling the function. |
| L1P_NOTAPATTERN | The application specified a pointer that either does not represent a generated pattern or is not valid. |
| L1P_ALREADYCONFIGURED | The L1p has already been configured without being previously unconfigured. |
| L1P_NOTCONFIGURED | The L1p has not been configured. |
 
4.8.4 Performance considerations
This section describes the performance considerations for the L1p prefetcher.
Pattern loading
The L1p hardware can simultaneously load an existing list and create a new list. This can be used to do continuous refinement of the L1 cache miss list of addresses. However, in some cases, the first iteration through a routine creates a good enough list, such that additional gains would be overshadowed by the cost of periodically writing the refined list to DDR memory. This behavior can be controlled using the record parameter to the L1P_PatternStart() function.
Pattern creation overhead
There is a memory overhead versus performance overhead optimization with regards to list creation. When a list has been created, the application can remember the list for future reference. There is not an architectural limit to the number of lists that can be maintained. However, each list consumes memory and there is a bookkeeping overhead associated with tracking the list and keeping it resident in memory. There is also an opportunity cost associated with using that memory for other optimizations (for example, bigger lookup tables).
When switching between different patterns, the SPI performs one system call to install the new pattern's address in the L1p registers. Because a system call is a relatively heavy-weight operation, avoid switching patterns for small sections of code. It is preferable to pause the pattern during these periods. Pausing or resuming a pattern is only a user-space MMIO write and can be accomplished with only a few instructions.
Prefetcher contention
Each A2 core's L1p is a shared resource: the four hardware threads on each A2 core share the L1p. Each hardware thread can be running a list using its perfect prefetcher. Some hardware threads can be performing many linear stream prefetches while another hardware thread is executing a prefetch list pattern.
All of these activities compete for prefetch buffer space in the L1p. The L1P_SetStreamDepth(), L1P_SetStreamAdaptiveMode(), and L1P_SetStreamTotalDepth() functions are designed to be used by application developers to balance applications for optimal performance.
4.9 L2 atomic operations
The Blue Gene/Q nodes have support for atomic memory operations in the L2 cache. In some circumstances, atomic memory operations can be more efficient than standard PowerPC larx/stcx atomic instructions. The larx/stcx instructions require at least two operations for atomicity:
1. A load with reservation, which brings the data back to a processor general-purpose register (GPR)
2. A store operation, which pushes out the data
Typically, there is also a simple arithmetic operation interposed between those steps. Blue Gene/Q L2 atomics allow for a single load or store operation to perform a simple arithmetic operation in the L2 cache. This method saves the latency of the load. If the L2 atomic operation is a store operation code, the store operation is placed on the queue and the A2 core does not stall.
The CNK has support for Blue Gene/Q L2 atomic operations. However, the memory regions that contain L2 atomic memory must be predesignated. This predesignation is required because the CNK must create special memory translation entries for L2 atomic memory. Use the following SPI routine to predesignate memory:
uint64_t Kernel_L2AtomicsAllocate(void* atomic_vaddress, size_t length);
There are a limited number of memory translation entries. The CNK tries various combinations of mappings for atomic operations. However, the call can fail. If a failure occurs, try the call with a different virtual address.
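The following sketch shows the predesignation step. The counter buffer and its size are illustrative, the header name for the SPI declaration is an assumption, and the L2 atomic load and store operations themselves are provided by separate SPI functions that are not shown here.

#include <stdint.h>
#include <stdio.h>
#include <kernel/memory.h>   /* assumed header for Kernel_L2AtomicsAllocate() */

/* Counters that the application intends to update with L2 atomic operations. */
static uint64_t counters[1024];

int designate_l2_atomic_region(void)
{
    /* Predesignate the region so the CNK can create the special
       memory translation entries that L2 atomics require.        */
    uint64_t rc = Kernel_L2AtomicsAllocate(counters, sizeof(counters));
    if (rc != 0) {
        /* The limited number of translation entries can make this call fail;
           retrying with a different virtual address might succeed.           */
        fprintf(stderr, "Kernel_L2AtomicsAllocate failed: %llu\n",
                (unsigned long long)rc);
        return -1;
    }
    return 0;
}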
4.10 Speculative execution
The Blue Gene/Q nodes contain a multiversion L2 memory cache that can be configured for speculative execution (also known as thread level speculation). This support enables the system software to simultaneously execute portions of the program on up to 128 hardware threads. The compiler generates multiple possible execution paths. The software runtime environment uses real-time performance data to determine which path is selected. If the system detects a conflict, it automatically reruns the code without speculation to ensure correct execution.
The Symmetric Multi-Processing Runtime (SMPRT) for the compiler and the CNK work together to configure the hardware for speculative execution support. For more information about using speculative execution, see the SMPRT #pragmas in the IBM XL compiler documentation. Section 7.2.1, “IBM XL compilers” on page 80 describes the IBM XL compilers.
4.11 Support for dynamic linking
The CNK uses the Linux user callable facility for dynamic linking, which loads a library image into the virtual address space of an application process. It loads only the executable and linking format (ELF) sections that are required by the application into the physical memory space. To release the library from both virtual and physical memory, call the dlclose() function. This function is similar to the function that is used on the Linux operating system.
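A short sketch of this facility, using the standard dlopen(), dlsym(), and dlclose() interface, follows. The library name and symbol are illustrative assumptions.

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* Load the library into the virtual address space of the process. */
    void *handle = dlopen("libexample.so", RTLD_NOW);   /* illustrative library */
    if (handle == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }

    double (*fn)(double) = (double (*)(double))dlsym(handle, "example_fn");
    if (fn != NULL) {
        printf("result = %f\n", fn(2.0));
    }

    /* Release the library from both virtual and physical memory. */
    dlclose(handle);
    return 0;
}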
The CNK supports Python-based applications with minimal or no modifications. In these applications, it is necessary to communicate and load appropriate sections of the scientific codes into the application.
The CNK does not support the fork() or exec() functions, which might be used to run shell commands in existing Python-based applications. If an application uses the fork() function, the exec() function, or shell commands at run time, it might require modification to execute correctly on the Blue Gene/Q system. Some of the required modifications might include:
Replacing use of the Linux cp (copy) command with inline code to copy files.
Replacing shell commands with system calls to delete files, for example, use unlink("path") instead of system("rm -f path").
Moving application setup to the front end node.
Each compute node independently requests dynamic libraries to be loaded. This solution relies on the file system caches on the Linux I/O node to avoid huge spikes in demand to the file system. It is possible that the file system caching might be insufficient for certain classes of dynamic applications.
4.12 Transactional memory
Transactional memory can be used to simplify simultaneous use of large numbers of threads. The Blue Gene/Q nodes contain a multiversion L2 memory cache that can be configured for transactional memory.
When transactional memory mode is used, the user defines the parallel work to be done. The user also defines which code is atomic. The hardware automatically detects memory read or write conflicts in the atomic region and the runtime retries the region. When many sections of code are marked atomic, performance can be reduced if these sections are frequently rerun.
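The exact directives and compiler options are defined by the XL compiler documentation that is referenced below. The following sketch assumes the tm_atomic pragma form and is not taken from this document; the loop and data are illustrative.

/* Sketch only: assumes XL C with transactional memory support enabled.
   The tm_atomic pragma name and its exact behavior are assumptions;
   see the SMPRT #pragmas in the IBM XL compiler documentation.         */
void accumulate(double *histogram, const int *bins, const double *values, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma tm_atomic
        {
            /* Conflicting updates to histogram[] are detected by the hardware,
               and the runtime reruns the atomic region when a conflict occurs. */
            histogram[bins[i]] += values[i];
        }
    }
}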
The XLSMP runtime and the CNK work together to configure the hardware to support transactional memory. For more information about using transactional memory, see the SMPRT #pragmas in the IBM XL compiler documentation. Section 7.2.1, “IBM XL compilers” on page 80 describes the IBM XL compilers.
 