Processes
The compute nodes on the Blue Gene/Q system comprise 17 cores on a single chip with 16 GB of dedicated physical memory. Applications run on 16 of the cores with the 17th core reserved for system software. Nearly the full 16 GB of physical memory is dedicated to application usage.
Within each job, processes (also known as tasks) are distributed to all of the compute nodes. Each node runs separate instantiations of the Compute Node Kernel (CNK). Each CNK can run multiple tasks or processes. This chapter describes some of the characteristics of those processes and how to configure them.
3.1 Importance of process count
When submitting a job to the Blue Gene/Q system with the runjob command or another job scheduler command, it is important to decide how many processes or tasks to run on each node. This decision significantly impacts performance and the memory that is allocated to each process, for example:
Running 16 processes per node divides the total number of cores and the total physical memory into 16ths. Thus, one core and roughly 1 GB is available to each task.
Running one process with 16 threads might yield equivalent performance but causes no subdivision of the memory layout.
When a job is submitted to the Control System, the user specifies the number of processes to create. This value is used to configure the memory and the number of cores that are assigned to each process.
3.2 Process creation
Jobs are submitted to the CNK through the Control System, which uses the functional network path through the I/O node to load the application. A job can be a subset of the block, and multiple jobs, each comprising one or more nodes, can exist in a block. However, a single job cannot span blocks. From a CNK perspective, a job comprises one or more processes on the node. With each job, the Control System provides information that describes the environment of the job (environment variables, program arguments, and so on).
For statically linked applications, the CNK loads the applications into memory at process creation time. To ensure that the application load scales to large system configurations, the application is loaded from a job leader node and broadcast to the other nodes in the job.
For dynamically linked applications, the CNK loads the dynamic interpreter (ld.so) into memory. The interpreter then pulls the target application and associated dynamic libraries into main memory by using standard system calls.
Processes are only created at the time of job initialization. The CNK does not support the fork system call and therefore cannot dynamically spawn processes.
3.3 Processes per node
When a job is submitted to the Control System, the user specifies the number of processes to create. This value is used to configure memory and the number of cores that are assigned to the process. To specify the process count, use the runjob -p (or --ranks-per-node) option.
When the user specifies the process count per node, the memory on that node is divided evenly among the processes. Generally, each process has roughly the same amount of memory, although there can be slight variations. Variations can result from the size of the application text, shared memory size, kernel storage, and so on. The CNK configures the hardware to avoid memory page translations while the application is running. This configuration is known as a static memory map.
To achieve a static memory map, the process count must be a power of two. Thus, the valid values are 1, 2, 4, 8, 16, 32, or 64 processes per node. User-submitted jobs run on 16 of the 17 cores.
The CNK allocates a number of cores to a process. Therefore, as shown in Table 3-1, the number of threads that each process can have active at a given moment is dictated by the processes per node value.
Table 3-1 Processes per node
Processes per node    Number of A2 cores per process    Maximum number of active hardware threads per process
1                     16                                64
2                     8                                 32
4                     4                                 16
8                     2                                 8
16                    1                                 4
32                    2 processes per core              2
64                    4 processes per core              1
3.4 Determining how many processes per node to use
The best configuration of processes per node depends on the type of application, the memory requirement, and the parallel paradigm that is implemented. There might be several options for applications that use a hybrid paradigm with both MPI and OpenMP or pthreads, depending on the memory footprint. Hybrid applications that support a high degree of threading might work well with a single process per node, but scenarios with 2, 4, 8, or 16 processes per node are more common. For single‑threaded applications, the memory requirement per process is the main consideration. If possible, use all 16 of the cores with 16, 32, or 64 processes per node.
One trade-off to consider is that each additional process that is running on a node has a fixed amount of overhead. This overhead consists of a replicated data segment (for example, the global storage for the process), storage for the main stack, and storage for the heap.
3.5 Specifying process count
The default mode for the runjob command is one process per node. To specify other values for processes per node, use the following commands:
runjob ... -p 2 ...
runjob ... --ranks-per-node 2 ...
runjob ... -p 16 ...
runjob ... -p 64 ...
3.6 Support for 64-bit applications
The Blue Gene/Q system only supports processes compiled for the 64-bit PowerPC application binary interface (ABI). 32-bit processes are not supported:
For compilations with the GCC compiler, the -m64 flag is the default.
For compilations with the XL compiler, the -q64 flag is the default.
3.7 Object identifiers
There are various identifiers related to the process objects on the compute node. Each of the 16 physical cores that support the application processes is assigned a processor core identifier. Each of the four hardware threads within that core is assigned a processor thread identifier. There is also a unique identifier for each hardware thread within the node termed the processor identifier. Table 3-2 describes the interrelationship between these hardware identifiers. The first 16 cores are used to run applications. The 17th core is reserved for system use.
Table 3-2 Physical core ID, thread ID, and processor ID
Processor core ID    Processor thread ID    Processor ID
0                    0, 1, 2, 3             0, 1, 2, 3
1                    0, 1, 2, 3             4, 5, 6, 7
2                    0, 1, 2, 3             8, 9, 10, 11
3                    0, 1, 2, 3             12, 13, 14, 15
4                    0, 1, 2, 3             16, 17, 18, 19
5                    0, 1, 2, 3             20, 21, 22, 23
6                    0, 1, 2, 3             24, 25, 26, 27
7                    0, 1, 2, 3             28, 29, 30, 31
8                    0, 1, 2, 3             32, 33, 34, 35
9                    0, 1, 2, 3             36, 37, 38, 39
10                   0, 1, 2, 3             40, 41, 42, 43
11                   0, 1, 2, 3             44, 45, 46, 47
12                   0, 1, 2, 3             48, 49, 50, 51
13                   0, 1, 2, 3             52, 53, 54, 55
14                   0, 1, 2, 3             56, 57, 58, 59
15                   0, 1, 2, 3             60, 61, 62, 63
16                   0, 1, 2, 3             64, 65, 66, 67
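As the table shows, the processor ID is the core ID multiplied by four plus the thread ID within the core. The following minimal helper, whose name is purely illustrative, captures the relationship:
/* Illustrative helper: compute the node-wide processor ID from the
   processor core ID (0 - 16) and the processor thread ID (0 - 3). */
static inline int processor_id(int coreID, int threadID)
{
    return (coreID * 4) + threadID;   /* for example, core 2, thread 1 yields processor ID 9 */
}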
3.7.1 Process identifier
The process identifier (PID) is a 4-byte signed number that identifies a process. Each process on a compute node has a unique PID. The PID value is not unique across compute nodes. The PID can be passed to various pthreads, signal APIs, and system calls as a thread group identifier (TGID) for the process. The thread identifier (TID) that corresponds to the primary thread of the process is the same value as the PID for the process.
3.7.2 Thread identifier
The thread identifier (TID) is a number assigned by the kernel that uniquely identifies a thread within the node. The TID of the process's main thread is the same as the PID of the process. See 3.7.1, “Process identifier” on page 16 for more information about PID number generation.
3.7.3 Thread group identifier
The thread group identifier (TGID) is an input parameter on several pthread and signal APIs and system calls. The PID (that is, the main thread TID of the process) serves as a valid TGID.
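As a minimal sketch, and assuming that the CNK exposes the standard Linux getpid() call and gettid system call implied by this terminology, the following code prints the PID (which also serves as the TGID) and the TID of the calling thread; on the main thread the two values are identical:
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    pid_t pid = getpid();                    /* PID of the process, also usable as the TGID */
    pid_t tid = (pid_t)syscall(SYS_gettid);  /* TID of the calling thread */
    /* For the main thread, pid == tid, as described in 3.7.1 and 3.7.2. */
    printf("PID (TGID) = %d, main thread TID = %d\n", (int)pid, (int)tid);
    return 0;
}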
3.7.4 T coordinate
Multiple processes within a node for a given job are assigned a “T” coordinate value. That value can be used with the A, B, C, D, and E coordinates to uniquely identify a specific process or rank in the block or sub-block. The “T” coordinates begin at 0 and are assigned in sequential order, up to a value equal to the number of processes in the node minus one. For example, if the node is configured to contain four processes, the “T” coordinates range from 0 to 3. The coordinate T = 0 corresponds to the first process, which contains processor IDs 0 - 15. The coordinate T = 3 corresponds to the last process, which contains processor IDs 48 - 63. Each unique rank within a job is identified by a corresponding set of A, B, C, D, E, T coordinates.
3.8 Sub-node jobs
The Compute Node Kernel supports sub-block jobs within a node; a sub-block job that occupies only part of a node is known as a sub-node job. A sub-node job occupies a subset of the 16 cores available for assignment to applications. Jobs that are running in a subset of the 16 cores in the node can be started and ended asynchronously. Only one core and only one process per node are supported for a sub-node job. Sub-node jobs are restricted to a single user per node.
3.9 Threading overview
The CNK provides a threading model based on the Native Portable Operating System Interface (POSIX) Thread Library (NPTL) available in the glibc library. The NPTL package is the default threading package for Linux applications. The NPTL threading package implements the POSIX pthread API. The NPTL package allows the same POSIX pthread API library that Linux uses (-lpthreads) to function on the CNK without special parameters.
3.9.1 Hardware thread over-commitment
More than one pthread can be assigned to a given hardware thread. These additional pthreads are supported by additional kernel thread structures, so each pthread is backed by its own kernel thread. By default, up to five pthreads can be assigned to one hardware thread. In the Blue Gene/Q threading model, pthreads have absolute affinity to the hardware threads they are associated with. There is no time-quantum driven preemption of pthreads running on a hardware thread. After a pthread begins to run on a hardware thread, it continues to run until one of the following events occurs:
The thread calls pthread_yield(), and an equal or higher‑priority thread available for dispatch is found.
A signal is being delivered to a higher‑priority pthread on the same hardware thread.
The thread enters a futex wait condition.
The thread enters a nanosleep system call.
A new pthread whose priority is higher than that of the currently running thread is created on, or migrated to, this hardware thread.
The priority of the running thread is lowered, or the priority of a thread that is ready to run on the same hardware thread is raised, such that a more eligible thread is available to be dispatched.
The thread exits.
A nanosleep previously initiated by a higher‑priority pthread on the same hardware thread expires.
A timed futex wait previously initiated by a higher‑priority pthread on the same hardware thread expires.
3.10 Thread scheduler
The kernel scheduler runs on each hardware thread independently. Each local dispatcher handles the dispatching of the software threads assigned to the one hardware thread that it controls. There is no global dispatcher. Therefore, no global locks or blocking conditions are required to manage the dispatching of threads.
3.10.1 Thread preemption
A pthread is preempted when and only when a pthread with a strictly higher software priority is available to be run on the same hardware thread. This scenario can occur for the following reasons:
A futex-wait by a higher-priority pthread is satisfied.
A signal is delivered to a higher-priority pthread.
A new pthread with a higher software priority is created on, or is migrated to, this hardware thread.
The software priority of the current pthread is lowered, or the priority of another pthread on the same hardware thread is raised.
A nanosleep initiated by a higher‑priority pthread on the same hardware thread expires.
A timed futex wait initiated by a higher‑priority pthread on the same hardware thread expires.
3.10.2 Thread yield
When a pthread executes the pthread_yield() function and another pthread with the same software priority is available to be dispatched, the current thread relinquishes control and the other thread is dispatched. If there is no other runnable thread of equal or higher priority, control returns to the thread that executed the pthread_yield() function.
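For example, a pthread that polls a flag set by another equal-priority pthread on the same hardware thread can yield between polls so that the other pthread is dispatched. The following is a minimal sketch; the flag and function names are illustrative:
#define _GNU_SOURCE      /* pthread_yield() is a GNU extension declared in pthread.h */
#include <pthread.h>

volatile int ready = 0;  /* illustrative flag set by another pthread on the same hardware thread */

void wait_for_ready(void)
{
    while (!ready) {
        pthread_yield();  /* let an equal-priority pthread on this hardware thread run */
    }
}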
3.10.3 Round-robin dispatch
When a thread relinquishes control due to a yield or a futex wait, and there are other pthreads with the same software priority, those pthreads are selected over the current thread for the next dispatch in round-robin order. In other words, when there are multiple equal-priority pthreads on a hardware thread and each pthread issues frequent yields, each of the pthreads makes progress. There is no guarantee that each thread makes equal progress. Interrupt conditions presented to the hardware thread might cause unbalanced thread dispatching within the scheduler's simple, lightweight, round-robin algorithm.
3.11 Thread affinity
When a pthread is created within a process, the CNK must select a hardware thread for the pthread. The kernel supports two layout algorithms for assigning pthreads to hardware threads. The number of hardware threads that are available to the process is dependent on the number of processes in the node. See 3.3, “Processes per node” on page 14. The layout types in the following sections can be activated through the use of an environment variable, BG_THREADLAYOUT. If required, additional layout algorithms can be added. When possible, the even-numbered processor IDs within the process are assigned before the odd-numbered processor IDs because of the configuration limitations that are imposed by the hardware universal performance counter implementation.
3.11.1 Breadth-first assignment
Breadth-first is the default thread layout algorithm. This algorithm corresponds to BG_THREADLAYOUT = 1. With breadth-first assignment, the hardware thread-selection algorithm progresses across the cores that are defined within the process before selecting additional threads within a given core.
3.11.2 Depth-first assignment
This algorithm corresponds to BG_THREADLAYOUT = 2. With depth-first assignment, the hardware thread-selection algorithm progresses within each core before moving to another core defined within the process.
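For example, when the job is launched directly with the runjob command, the variable can typically be passed to the compute nodes with the --envs option:
runjob ... --envs BG_THREADLAYOUT=2 ...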
3.11.3 Thread affinity control
Controlling the placement of pthreads on the existing hardware threads is supported by the kernel through the sched_setaffinity() system call. The target of the affinity operation must be one and only one hardware thread. The interface to specify the target hardware thread is defined by the glibc structure, cpu_set_t. The CPU numbers to be specified by the caller correspond to the processor IDs 0 - 63. The caller must be aware of the range of valid processor IDs for the current process. For a configuration where there is one process on the node, all processor IDs are owned by the process. However, on a system that has four processes in the node, the first process owns processor IDs 0 - 15. The second process owns processor IDs 16 - 31. Determining what processor IDs are controlled by a given process can be accomplished by using the Kernel_ThreadMask(T) and the Kernel_MyTcoord() SPIs. After the T coordinate is obtained using the Kernel_MyTcoord system programming interface (SPI), supply it to the Kernel_ThreadMask(T) SPI. The SPI returns a 64-bit mask representing the processor IDs owned by the currently running process.
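The following minimal sketch illustrates that sequence. The header paths and exact return types are assumptions based on the SPI names given here; check the SPI headers on your system for the exact signatures:
#include <stdio.h>
#include <stdint.h>
#include <spi/include/kernel/location.h>   /* assumed location of Kernel_MyTcoord() */
#include <spi/include/kernel/process.h>    /* assumed location of Kernel_ThreadMask() */

int main(void)
{
    uint32_t tcoord = Kernel_MyTcoord();          /* T coordinate of the current process */
    uint64_t mask   = Kernel_ThreadMask(tcoord);  /* 64-bit mask of the processor IDs owned by this process */
    printf("T coordinate %u owns processor ID mask 0x%016llx\n",
           (unsigned)tcoord, (unsigned long long)mask);
    return 0;
}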
There are two methods to set the affinity of a pthread. The first method is at pthread creation time through the pthread attributes structure. The second method is explicitly through the set_affinity system call.
3.11.4 Setting affinity with the pthread attribute
Example 3-1 shows how to set affinity with the pthread attributes at pthread creation.
Example 3-1 Setting affinity through the pthread attributes
#define _GNU_SOURCE              // needed for the nonportable (_np) affinity interfaces
#include <pthread.h>

pthread_attr_t attr;             // pthread attribute object
cpu_set_t cpumask;               // CPU mask object
pthread_attr_init(&attr);        // initialize the attribute object
CPU_ZERO(&cpumask);              // clear the CPU mask
CPU_SET(processorID, &cpumask);  // add the target processor ID to the mask
pthread_attr_setaffinity_np(&attr, sizeof(cpumask), &cpumask);
rc = pthread_create(&thread[t], &attr, myThreadFunction, NULL);
3.11.5 Setting affinity with the system call
The following code example shows how to set explicit affinity with the system call.
Example 3-2 Setting affinity using the system call
#define _GNU_SOURCE                    /* needed for pthread_setaffinity_np() */
#include <pthread.h>
#include <stdio.h>

/* tid is the pthread_t identifier of the target thread */
cpu_set_t mask;
CPU_ZERO( &mask );                     /* clear the CPU mask */
CPU_SET( processorID, &mask );         /* set only the bit corresponding to the target processor ID */
if( pthread_setaffinity_np( tid, sizeof(mask), &mask ) != 0 )  /* returns 0 on success */
{
    printf("WARNING: Could not set CPU Affinity, continuing...\n");
}
3.11.6 Extended thread affinity control
Extended thread affinity control is a facility that allows a process to place, using set affinity, software threads on hardware threads that were not originally allocated to that process. This feature is useful in application environments where an application might enter different phases of execution that require a larger number of threads to be used by a subset of the processes in a node while other processes in the node are not actively using their threads.
Enablement
An environment variable is used to enable the extended thread affinity control facility. If the BG_THREADMODEL environment variable is set to the value 2, set affinity APIs can be used to place a pthread onto a hardware thread that is not configured as a hardware thread owned by the current process. See Table D-5 on page 142 for more information about the BG_THREADMODEL variable.
For example, if the application must transition between 16 active processes, each using four hardware threads, to four active processes, each using 16 hardware threads, the application is started with 16 processes configured. Each process creates its pthreads using the pthread_create() function. Each of the four processes can use the pthread_create() function to create up to a total of 16 threads and use the setaffinity API interfaces to place these pthreads on hardware threads outside its configured set of hardware threads. When the application reaches the end of its first phase, 12 of the 16 processes block using a standard POSIX synchronization mechanism, such as a shared mutex, condition, or barrier. Then the four remaining processes begin running their additional pthreads using the additional hardware threads that were previously used by the now blocked 12 processes. When this phase of the application completes, the processes are unblocked and the application returns to its original behavior of having 16 active processes each using four hardware threads.
Usage restrictions
The following restrictions apply to the extended thread affinity control facility:
The job must be configured with 2, 4, 8, or 16 ranks per node.
A core can host the originally configured process plus 1, 2, 3, or 4 additional threads of any one additional process within the node.
MPI operations are not supported for pthreads that are executing on a hardware thread that was not originally configured to a pthread’s process.
Memory allocation across the processes in the node is based exclusively on the initial memory configuration at job start time.
Transactional memory and thread level speculation operations are not supported on pthreads that are executing on a hardware thread not originally configured to a pthread’s process.
Setting and handling of the itimer is not supported for pthreads that are executing on a hardware thread not originally configured to a pthread’s process.
Performance monitoring (BGPM) is not supported for pthreads that are executing on a hardware thread not originally configured to a pthread’s process.
Controlling application phases
The application can use any of the following CNK supported interprocess synchronization mechanisms to transition into and out of actively running pthreads that are executing on a hardware thread that is not configured to a pthread’s process:
Barriers using pthread_barrier with shared attribute set
Conditions using pthread_cond with shared attribute set
Mutexes using pthread_mutex with shared attribute set
The application can also use its own synchronization mechanisms as long as the blocked threads are not waiting on the completion of a function shipping system call.
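As an illustration of the first mechanism in the preceding list, a process-shared pthread barrier can be placed in a shared-memory region that all participating processes map. The following sketch assumes that a POSIX shm_open()/mmap() mapping is available to those processes; the object name, the leader-first initialization ordering, and the function name are illustrative only:
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* The leader process initializes the barrier; the others simply wait on it.
   For brevity, this sketch assumes the leader has initialized the barrier
   before any other process calls pthread_barrier_wait(). */
int phase_barrier(int nprocs, int is_leader)
{
    int fd = shm_open("/phase_barrier", O_CREAT | O_RDWR, 0600);  /* illustrative object name */
    if (fd < 0) return -1;
    if (ftruncate(fd, sizeof(pthread_barrier_t)) != 0) return -1;

    pthread_barrier_t *bar = mmap(NULL, sizeof(*bar), PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
    if (bar == MAP_FAILED) return -1;

    if (is_leader) {
        pthread_barrierattr_t attr;
        pthread_barrierattr_init(&attr);
        pthread_barrierattr_setpshared(&attr, PTHREAD_PROCESS_SHARED);
        pthread_barrier_init(bar, &attr, nprocs);
    }
    pthread_barrier_wait(bar);   /* all nprocs processes block here until every one arrives */
    return 0;
}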
3.12 Thread priority
The software thread priority can be set within a pthread either through the pthread attribute structure or through an explicit system call. The priority values that are supported depend on the scheduling policy that is set for the pthread. Thread priorities are evaluated in the scheduler when a condition occurs within a hardware thread that causes the scheduler to select a potentially different pthread for dispatching. Because the Blue Gene/Q system has absolute hardware thread affinity, the relative priorities of pthreads on different hardware threads have little consequence. The relative thread priorities are important for pthreads assigned to the same hardware thread.
There are conditions in which a communication thread might require control only when no other application threads are running. Because of this requirement, communication threads can specify a priority that is lower than any application thread. Conversely, there are situations when a communication thread might need to be the highest priority software thread on the hardware thread. Therefore, communication threads are allowed to set a priority value that is more favored than any application thread priority. This widened range of priorities is supported by the use of a special scheduling policy, SCHED_COMM. See 3.12.1, “Setting priority through the pthread attribute” on page 22.
Thread priority can be modified dynamically, for example, a pthread might want to raise or lower its priority before relinquishing control. A priority change results in a call to the kernel within the target thread. At that time, the relative priorities of the software threads on the hardware thread are re-evaluated. The most eligible software thread is dispatched.
3.12.1 Setting priority through the pthread attribute
Priority can be set through the pthread attribute structure that is supplied to pthread_create(). For the priority information in the attribute to be used, the inherit-scheduling attribute must first be set with the following API:
pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
Setting the priority within the pthread attribute can be done by either specifying just the priority or by specifying both the policy and priority. To specify both the policy and priority, see 3.12, “Thread priority” on page 21. To specify just the priority, use the following API:
pthread_attr_setschedparam(pthread_attr_t *attr, struct sched_param *param)
If this API is used and pthread_attr_setschedpolicy() was not called, the policy is inherited from the caller, even though the PTHREAD_EXPLICIT_SCHED value was specified on the pthread_attr_setinheritsched() API.
To determine the priority range for a given scheduler policy, use the following APIs:
sched_get_priority_max(int Policy)
sched_get_priority_min(int Policy)
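A minimal sketch that combines the attribute calls from this section with a range query follows; the SCHED_FIFO policy, the chosen priority value, and the thread function name are illustrative:
#include <pthread.h>
#include <sched.h>

pthread_attr_t attr;
struct sched_param param;

pthread_attr_init(&attr);
pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);    // use the attribute, not the creator's policy
pthread_attr_setschedpolicy(&attr, SCHED_FIFO);                 // illustrative policy choice
param.sched_priority = sched_get_priority_min(SCHED_FIFO) + 1;  // stay inside the queried range
pthread_attr_setschedparam(&attr, &param);
rc = pthread_create(&thread[t], &attr, myThreadFunction, NULL);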
Table 3-3 outlines the priorities for the Blue Gene/Q system. However, these ranges are subject to change. The application code must avoid making assumptions regarding the valid priority range for a given policy and use the previously mentioned APIs.
Table 3-3 Blue Gene/Q priorities
Policy         Minimum priority    Maximum priority
SCHED_OTHER    2                   98
SCHED_FIFO     2                   98
SCHED_COMM     1                   99
3.12.2 Explicit setting of priority
Priority can be set explicitly through the use of the pthread_setschedparam() API, shown in Example 3-3. This approach parallels the explicit setting of affinity shown in Example 3-2 on page 20.
Example 3-3 Setting priorities explicitly
#include <sched.h>
int pthread_setschedparam(pthread_t thread, int policy, const struct sched_param *param);
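For example, the API might be called as follows to change the calling thread's own policy and priority; the SCHED_FIFO policy and the chosen value are illustrative:
#include <pthread.h>
#include <sched.h>

struct sched_param param;
param.sched_priority = sched_get_priority_max(SCHED_FIFO) - 1;   /* illustrative value inside the valid range */
int rc = pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
if (rc != 0) {
    /* pthread_setschedparam() returns an error number; it does not set errno */
}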
3.12.3 Hardware thread priority
The hardware thread priority represents the relative proportion of available core cycles that are given to hardware threads that share the same core. There are seven PowerPC architected hardware thread priorities. However, the Blue Gene/Q system internally implements two priority levels. Based on internal configuration settings, the mapping between the architected and available priority levels is shown in Table 3-4.
Table 3-4 Priority level mapping
PowerPC architected priority level    PowerPC instruction    PPR32::PRI    Implemented Blue Gene/Q priority level
low                                   or 1, 1, 1             0b010         low
medium-low                            or 6, 6, 6             0b011         medium
medium                                or 2, 2, 2             0b100         medium
Setting hardware thread priority from an application
Applications can use the low and medium Blue Gene/Q hardware thread priorities. The following statement brings in the required header file that contains the inline interfaces:
#include <hwi/include/bqc/A2inlines.h>
The interfaces control hardware thread priorities.
 
Important: These priority terms refer to the architected PowerPC priority level terminology, not the Blue Gene/Q priority terminology.
The following inlines can be used to set hardware thread priorities:
void ThreadPriority_Medium();
void ThreadPriority_MediumLow();
void ThreadPriority_Low();
The ThreadPriority_Medium() and ThreadPriority_MediumLow() interfaces both map to the medium Blue Gene/Q priority level. The ThreadPriority_Low() interface sets the low Blue Gene/Q priority level.
The current priority of the hardware thread can be obtained by reading the PPR32 register. Table 3-4 describes the mapping of the priority levels. The thread priority can also be set by writing to this register. This approach can be useful when restoring a previously saved priority after priority modification. Example 3-4 on page 24 demonstrates a sequence of saving, modifying, and restoring the hardware priority of a thread.
Example 3-4 Sequence of saving, modifying, and restoring the hardware priority of a thread
// Save current hardware priority
uint64_t ppr32 = mfspr(SPRN_PPR32);
 
// Force hardware priority to low
ThreadPriority_Low();
 
// perform some function ...
 
// restore priority
mtspr(SPRN_PPR32, ppr32);