MPI and CNK environment variables
This appendix describes the environment variables that affect the runtime characteristics of programs that run on the Blue Gene/Q compute nodes. These variables configure settings for the Message Passing Interface (MPI) and the Compute Node Kernel (CNK).
Environment variables can be used to improve performance or modify functional attributes of the application.
The following topics are covered:
Message Passing Interface environment variables
The Blue Gene/Q Message Passing Interface (MPI) implementation provides several environment variables that affect its behavior. Setting these environment variables can allow a program to run faster, or, if set incorrectly, might cause the program not to run at all. None of these environment variables are required to be set for the Blue Gene/Q MPI implementation to work.
Table D-1 shows the MPI environment variables.
Table D-1 MPI environment variables
Environment variable
Description
Default value
COMMAGENT_RGETPACINGMAX
The maximum number of bytes allowed to be in the network at one time as a result of paced remote gets from each node. This number must be a multiple of COMMAGENT_RGETPACINGSUBSIZE.
The default value depends on the number of nodes in the block, as shown in Table D-2.
COMMAGENT_RGETPACINGSUBSIZE
The size, in bytes, of a submessage used for remote get pacing. The pacing logic breaks a large remote get into submessages of this size.
Table D-3 shows the default values for the COMMAGENT_RGETPACINGSUBSIZE environment variable. These values vary depending on the size of the block where the job is being run.
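As a rough illustration of how the two pacing variables relate, the following Python sketch splits one remote get into submessages and computes how many full submessages fit under the pacing limit. The function names are hypothetical, and the handling of a short final submessage is an assumption; the text above does not describe it. The values 65536 and 16384 are the small-block defaults listed in Tables D-2 and D-3.

```python
def split_remote_get(message_bytes, sub_size):
    """Return the submessage sizes for one paced remote get (illustrative only)."""
    if sub_size <= 0:
        raise ValueError("sub_size must be positive")
    full, remainder = divmod(message_bytes, sub_size)
    sizes = [sub_size] * full
    if remainder:
        # Assumption: the last submessage is simply shorter than the others.
        sizes.append(remainder)
    return sizes


def max_submessages_in_flight(pacing_max, sub_size):
    """Number of full submessages that fit under the pacing limit."""
    if pacing_max % sub_size != 0:
        # The documented rule: RGETPACINGMAX must be a multiple of RGETPACINGSUBSIZE.
        raise ValueError("pacing_max must be a multiple of sub_size")
    return pacing_max // sub_size


sizes = split_remote_get(40000, 16384)
window = max_submessages_in_flight(65536, 16384)
```

With these inputs, a 40000-byte remote get becomes two full submessages plus one short one, and at most four full submessages per node are in the network at a time.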
MUSPI_INJFIFOSIZE
The size, in bytes, of each injection first-in, first-out queue (FIFO). These FIFOs store 64-byte descriptors. Each descriptor describes a memory buffer to be sent on the torus. Making this size larger might reduce memory usage and latency when there are many outstanding messages. Reducing this size might increase that memory usage and latency. PAMI messaging optimally uses 10 injection FIFOs per context, although fewer FIFOs can be used when resources are constrained.
65536 (64 KB)
MUSPI_NUMBATIDS
The number of base address table IDs per process reserved for use by a messaging unit (MU) SPI application.
0
MUSPI_NUMCLASSROUTES
The number of collective class routes reserved for use by an MU SPI application. This value is also the number of global interrupt class routes reserved for use by an MU SPI application.
0
MUSPI_NUMINJFIFOS
The number of injection FIFOs per process reserved for use by an MU system programming interface (SPI) application.
0
MUSPI_NUMRECFIFOS
The number of reception FIFOs per process reserved for use by an MU SPI application.
0
MUSPI_RECFIFOSIZE
The size, in bytes, of each reception FIFO. Incoming torus packets are stored in this FIFO until software can process them. Making this size larger can reduce torus network congestion. Making this size smaller leaves more memory available to the application. PAMI messaging uses one reception FIFO per context.
1048576 bytes (1 MB)
PAMI_A2A_PACING_WINDOW
The number of simultaneous send operations to start on Alltoall(v) collectives. Additional send operations cannot be started until these operations finish. This requirement reduces the resource usage for large geometries.
1024
PAMI_ATOMICBARRIER_LOOPS
The number of attempts to complete the barrier in each pass.
32
PAMI_CLIENT_SHMEMSIZE
The number of bytes that are allocated from shared memory to each client. Use the "K" or "k" suffix as a 1024 multiplier, or the "M" or "m" suffix as a 1024 × 1024 multiplier.
The default value depends on the number of tasks on the node:
4M More than one task is on the node.
0 All other settings.
See description
PAMI_CLIENTS
A comma-separated ordered list of clients (no spaces). The complete syntax is [name][:repeat][/weight][,[name][:repeat][/weight]]*
Each client has the form [name][:repeat][/weight], where:
"name" is the name of the client. For example, the name of the Blue Gene/Q MPICH2 client is MPI. The default value for this option is the null string.
":repeat" is the repetition factor, where repeat is the number of clients having this same name. The default value for this option is 1.
"/weight" is the relative weight assigned to the client, where weight is the weight value. The default value for this option is 1. The weight is used to determine the portion of the messaging resources that is given to the client, relative to the other clients.
 
When middleware calls the PAMI_Client_create() function, it provides the name of the client. PAMI searches through the PAMI_CLIENTS in the order they are specified, looking for an exact name match. If there is not an exact name match with any of the PAMI_CLIENTS, PAMI searches through the PAMI_CLIENTS again, looking for a client with a null name string. The null name string is a wildcard and matches any client name. If there are exact or wildcard name matches, the first match that does not already have an active client is used, and the weight of that client determines the percentage of the available resources that are allocated to the client. If there are no available and matching clients, the PAMI client is not created.
 
The default value of the PAMI_CLIENTS environment variable is :1/1, which means that all resources are assigned to the first client created, regardless of the client name, and all subsequent attempts to create a client fail due to insufficient resources.
 
If any of the clients specified on PAMI_CLIENTS are unnamed, or more than one client has the same name, the order in which the clients are created must be the same on all processes in the job.
The first client listed has exclusive use of the message unit combining collective hardware for optimizing reduction operations. The other clients use algorithms that do not use the message unit combining collective hardware.
The following examples show how the PAMI_CLIENTS variable can be used.
PAMI_CLIENTS=MPI,CLIENTA means that up to two clients can use PAMI: one must be MPI, and the other must be CLIENTA. The MPI client is assigned the message unit combining collective hardware, and the two clients evenly split the remaining messaging resources.
PAMI_CLIENTS=MPI:3,CLIENTA/2,CLIENTB:2/3 means that up to six clients can use PAMI. Three can be MPI, one can be CLIENTA, and two can be CLIENTB. Each MPI client has weight 1. CLIENTA has weight 2, and each CLIENTB client has weight 3. In this example, each CLIENTB client gets three times the amount of resources as each MPI client, and the first MPI client created is assigned the message unit combining collective hardware.
PAMI_CLIENTS=MPI/3,/2, (note the trailing comma, which adds a second unnamed client with the default weight) means that up to three clients can use PAMI. Two of the clients are unnamed, meaning that they can be any of the PAMI clients, and one client can only be MPI. The first MPI client created has resource weight 3 and is assigned the message unit combining collective hardware. The first non-MPI client created (or possibly the second MPI client created) has resource weight 2, and the second non-MPI client created (or possibly the second or third MPI client created) has resource weight 1.
PAMI_CLIENTS is not specified. This setting means that there can be only one client, with any name, and it is assigned all of the resources.
Default ":1/1"
PAMI uses one reception FIFO per context and, optimally, uses 10 injection FIFOs per context, although fewer injection FIFOs can be used when resources are constrained.
For more information, see the descriptions of the following variables:
MUSPI_NUMBATIDS
MUSPI_NUMCLASSROUTES
MUSPI_NUMINJFIFOS
MUSPI_NUMRECFIFOS
MUSPI_INJFIFOSIZE
MUSPI_RECFIFOSIZE
PAMI_MU_RESOURCES
":1/1"
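The PAMI_CLIENTS syntax and matching rules described above can be sketched in Python as follows. This is a minimal model, not PAMI code: the names ClientSlot, parse_pami_clients, and create_client are hypothetical, an empty entry is treated as an unnamed client with the default repeat and weight (an assumption drawn from the syntax, in which every part is optional), and the resource share is assumed to be the slot weight divided by the sum of all weights.

```python
from dataclasses import dataclass


@dataclass
class ClientSlot:
    name: str             # "" is the wildcard and matches any client name
    weight: int
    in_use: bool = False


def parse_pami_clients(value=":1/1"):
    """Expand a PAMI_CLIENTS string into an ordered list of client slots."""
    slots = []
    for entry in value.split(","):
        head, _, weight = entry.partition("/")
        name, _, repeat = head.partition(":")
        for _ in range(int(repeat) if repeat else 1):
            slots.append(ClientSlot(name, int(weight) if weight else 1))
    return slots


def create_client(slots, name):
    """Claim a slot for a new client; return its resource share, or None."""
    total = sum(slot.weight for slot in slots)
    # Exact name matches are tried first, then wildcard (unnamed) slots.
    for wanted in (name, ""):
        for slot in slots:
            if slot.name == wanted and not slot.in_use:
                slot.in_use = True
                return slot.weight / total  # assumed: share is weight / total weight
    return None  # no available matching slot: the client is not created


slots = parse_pami_clients("MPI:3,CLIENTA/2,CLIENTB:2/3")
share = create_client(slots, "CLIENTB")  # 3 of 11 total weight units
```

In this model the default value ":1/1" yields a single wildcard slot, so the first client created receives a share of 1.0 and every later create_client call returns None.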
PAMI_CONTEXT_SHMEMSIZE
Number of bytes allocated from shared memory to every context in each client. Use the “K” and “k” suffix as a 1024 multiplier or the “M” and “m” suffix as a 1024 × 1024 multiplier.
135K
PAMI_GLOBAL_SHMEMSIZE
Number of bytes allocated from shared memory for global information such as the mapcache. Use the “K” and “k” suffix as a 1024 multiplier, or the “M” and “m” suffix as a 1024 × 1024 multiplier.
4M
PAMI_M2M_ROUTING
For an all-to-all message transfer, this setting specifies the network routing that is used.
"DETERMINISTIC" Use deterministic routing.
"DYNAMIC" Use dynamic routing.
"DYNAMIC"
PAMI_M2M_ZONE
For an all-to-all message transfer that uses DYNAMIC routing, this variable specifies the routing zone that is used:
0 Use zone 0.
1 Use zone 1.
2 Use zone 2.
3 Use zone 3.
 
The default settings depend on the size of the block:
1 The block is smaller than 512 nodes.
0 The block is 512 nodes or larger.
See description
PAMI_MAX_COMMTHREADS
Maximum number of commthreads to create. This setting can be used to avoid hardware thread oversubscription.
(64 / ranks per node) - 1
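The default formula can be worked through with a short Python sketch. The function name is hypothetical, reading the constant 64 as the number of hardware threads per node is an interpretation of the formula, and integer division is assumed.

```python
HARDWARE_THREADS_PER_NODE = 64  # the constant from the documented formula


def default_max_commthreads(ranks_per_node):
    """Evaluate (64 / ranks per node) - 1 with integer division (assumed)."""
    return HARDWARE_THREADS_PER_NODE // ranks_per_node - 1


for ranks in (1, 16, 64):
    print(ranks, default_max_commthreads(ranks))
```

For 1, 16, and 64 ranks per node, the formula gives 63, 3, and 0 commthreads respectively.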
PAMI_MEMORY_OPTIMIZED
Determines whether PAMI is configured for a restricted memory job. If not set, PAMI is not memory optimized and uses memory as needed to increase performance.
Not set.
PAMI_MU_RESOURCES
Determines whether PAMI calculates the number of available contexts based on an “optimal” or a “minimal” allocation of MU resources to each context. Supported environment variable values are not case-sensitive and include:
Optimal An optimal allocation of MU resources to each context limits the maximum number of contexts that can be created. Each context is allocated sufficient MU resources to fully use the MU hardware and torus network.
Minimal A minimal allocation of MU resources to each context allows the maximum number of contexts to be created regardless of MU hardware and torus network considerations.
"Optimal"
PAMI_NUMDYNAMICROUTING
Number of simultaneous dynamically routed messages per context. If more than this many messages are being transferred, the additional messages are deterministically routed. Dynamic routing can be faster than deterministic routing. However, dynamically routed messages require more storage to track their progress, hence the reason for this option. Specify this number in increments of 64 (for example: 64, 128, 192, 256, ...).
64
PAMI_RGETINJFIFOSIZE
The size, in bytes, of each remote get FIFO. These FIFOs store 64-byte descriptors. Each descriptor describes a memory buffer to be sent on the torus, and is used to queue requests for data (remote gets). Making this size larger might reduce torus network congestion and reduce overhead. Making this size smaller might increase that congestion, memory usage, and latency. PAMI messaging uses 10 remote get FIFOs per node.
65536 (64 KB)
PAMI_RGETPACING
Specifies whether to consider messages for pacing:
0 No messages are paced.
1 Messages are considered for pacing.
The default setting depends on the block size:
0 The block size is one rack (1024 nodes) or smaller.
1 The block size is larger than one rack.
See description
PAMI_RGETPACINGDIMS
Messages between nodes whose coordinates differ in more than this many dimensions in ABCD are considered for pacing. For example, node A has ABCD coordinates (0,0,0,0) and node B has (3,2,1,0). They differ in three dimensions (A, B, and C). Specifying 2 means that messages between these nodes are considered for pacing.
1
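The dimension test in the PAMI_RGETPACINGDIMS description can be sketched in Python as follows, using the coordinates from the example above. The function names are hypothetical, and the sketch covers only this one criterion; how it combines with the hop-count and message-size criteria is not stated in this appendix.

```python
def differing_dimensions(coords_a, coords_b):
    """Count the ABCD dimensions in which two nodes' coordinates differ."""
    return sum(1 for a, b in zip(coords_a, coords_b) if a != b)


def passes_dims_check(coords_a, coords_b, pacing_dims=1):
    """True when the nodes differ in more than pacing_dims dimensions."""
    return differing_dimensions(coords_a, coords_b) > pacing_dims


node_a = (0, 0, 0, 0)
node_b = (3, 2, 1, 0)
considered = passes_dims_check(node_a, node_b, pacing_dims=2)  # 3 > 2
```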
PAMI_RGETPACINGHOPS
Messages between nodes that are more than this many hops apart on the network are considered for pacing.
4
PAMI_RGETPACINGSIZE
Messages exceeding this size in bytes are considered for pacing.
65536 (64 KB)
PAMI_ROUTING
Specifies the PAMI network routing options to be used for point‑to‑point messages that are large enough to use the rendezvous protocol. That is, the messages are larger than the size specified for PAMID_EAGER.
The complete syntax is PAMI_ROUTING=[size][,[small][,[low:high][,[in][,out]]]]
When the source and destination nodes are on a network line (their ABCDE coordinates differ in at most one dimension), deterministic routing is always used. PAMI_ROUTING does not override this setting.
 
When the message size is less than or equal to "size", PAMI uses the "small" network routing.
 
When the message size is larger than "size", PAMI uses the "flexibility metric" to determine the network routing as follows:
The "low:high" range is the flexibility metric range. The flexibility metric gauges the routing flexibility between a source node and destination node. The values for "low" and "high" must be floating-point numbers in the range 0.0 through 4.0. The low value must be less than or equal to the high value. PAMI computes the flexibility metric between the source node and the destination node of a message transfer. The metric is the sum of the flexibility of dimensions A, B, C, and D between those nodes. The flexibility of a particular dimension is the ratio of the number of hops between the source and the destination in that dimension to the size of that dimension, and can range from 0.0 through 1.0.
 
The "in" value is the network routing to be used for a message transfer between two nodes when their flexibility metric is between "low" and "high", and the "out" value is the network routing otherwise. The values for "in", "out", and "small" can each be one of the following values.
 
0 Dynamic routing zone 0
1 Dynamic routing zone 1
2 Dynamic routing zone 2
3 Dynamic routing zone 3
4 Deterministic routing
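The routing decision described for PAMI_ROUTING can be modeled with the following Python sketch. It is an illustration under stated assumptions, not the PAMI implementation: the function names are hypothetical, the per-dimension hop counts are supplied by the caller because this appendix does not say how hops are counted, the "low" to "high" range is treated as inclusive, and the network-line case is passed in as a flag.

```python
DETERMINISTIC = 4  # routing value 4 in the list above


def flexibility_metric(hops, dim_sizes):
    """Sum over A, B, C, D of (hops in the dimension / size of the dimension)."""
    return sum(h / s for h, s in zip(hops, dim_sizes))


def select_routing(message_size, hops, dim_sizes, size, small, low, high,
                   in_range, out_of_range, on_network_line=False):
    """Pick a routing value following the PAMI_ROUTING description."""
    if on_network_line:
        return DETERMINISTIC          # always deterministic on a network line
    if message_size <= size:
        return small                  # small messages use the "small" routing
    metric = flexibility_metric(hops, dim_sizes)
    # Assumption: both ends of the low:high range are inclusive.
    return in_range if low <= metric <= high else out_of_range


# Settings equivalent to PAMI_ROUTING=65536,4,1.5:3.5,2,0
settings = dict(size=65536, small=4, low=1.5, high=3.5, in_range=2, out_of_range=0)
routing = select_routing(100000, hops=(4, 4, 4, 4), dim_sizes=(8, 8, 8, 8), **settings)
```

Here each dimension contributes 4 / 8 = 0.5, so the metric is 2.0, which falls inside the 1.5 to 3.5 range and selects the "in" routing (zone 2).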
PAMID_ASYNC_PROGRESS
This variable determines whether one or more communications threads are started to make asynchronous progress. This variable is required for maximum performance in message throughput cases:
0 No internal communications threads are started to assist with making communications progress
1 One or more communications threads can assist with making communications progress. Use this setting when high message throughput is required.
 
The default value depends on the settings that are used:
0 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), MPI_Init_thread() is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called.
1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and the MPI_Init_thread() function is called with MPI_THREAD_MULTIPLE.
 
The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.
See description
PAMID_COLLECTIVE_name
Turns on or off specific protocols for the MPI collective specified as name. Possible values for collectives are ALLTOALL, ALLTOALLV, ALLREDUCE, BARRIER, BCAST, SCATTER, SCATTERV, GATHER, GATHERV, ALLGATHER, ALLGATHERV, SCAN, and REDUCE.
The MPICH option can be used to turn off all optimizations for a specific collective and use the MPICH point-to-point protocol. For many MPI_collective operations, this setting can cause poor performance on larger blocks.
For information about other options and the default values, use the PAMID_VERBOSE=2 setting.
For some PAMID_COLLECTIVE_* environment variables, especially PAMID_COLLECTIVE_ALLREDUCE, some optimized protocols only work with specific parameters (such as data type, operation, or message size) that are specified for the particular MPI_collective invocation. Therefore, specifying a specific protocol on the PAMID_COLLECTIVE_* environment variable might not work for a given MPI_collective invocation. In that case, the MPICH protocol is used instead. To find out which protocol was used for a specific MPI_collective invocation, invoke the MPIX_Get_last_algorithm_name() function immediately after the MPI_collective invocation. For more information, see the mpix.h file.
See description
PAMID_COLLECTIVES
Controls whether optimized collectives are used. The possible values are:
0 Optimized collectives are not used. Only MPICH point-to-point based collectives are used.
1 Optimized collectives are used.
1
PAMID_CONTEXT_MAX
This variable sets the maximum allowable number of contexts. Contexts are a method of dividing hardware resources among a Parallel Active Messaging Interface (PAMI) client (for example, MPI) to set how many parallel operations can occur at one time. Contexts are similar to channels in a communications system. The practical maximum is usually 64 contexts per node.
The default value depends on the number of processes per node and the settings that are used:
1 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), the MPI_Init_thread() function is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called.
1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and the MPI_Init_thread() function is called with MPI_THREAD_MULTIPLE.
 
The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.
See description
PAMID_CONTEXT_POST
This variable must be enabled to allow parallelism of multiple contexts. It might increase latency. Enabling this variable is the only method to allow parallelism between contexts:
0 Only one parallel communications context can be used. Each operation runs in the application thread.
1 Multiple parallel communications contexts can be used. An operation is posted to one of the contexts, and communications for that context are driven by communications threads.
 
The default value depends on the settings that are used:
0 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), the MPI_Init_thread() function is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called.
1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and MPI_Init_thread() is called with MPI_THREAD_MULTIPLE.
 
The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.
See description
PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT
Overrides the default job size at which point the eager protocols are disabled for internal MPI operations. This override has the same effect as specifying the environment variable:
PAMID_PT2PT_LIMITS=::::0:0:0:0
 
This environment variable is processed before the PAMID_EAGER or PAMID_RZV, PAMID_EAGER_LOCAL or PAMID_RZV_LOCAL, PAMID_SHORT, and PAMID_PT2PT_LIMITS environment variables.
512k
PAMID_EAGER_LOCAL, PAMID_RZV_LOCAL
Sets the cutoff value for the switch to the rendezvous protocol when the destination rank is local. The two options are identical. This variable takes an argument, in bytes, to switch from the eager protocol to the rendezvous protocol for point-to-point messaging. The default value effectively disables the eager protocol for local transfers because the default value for PAMID_EAGER_LOCAL is less than the default value for PAMID_SHORT.
The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.
4097 bytes
PAMID_EAGER, PAMID_RZV
Sets the cutoff for the switch to the rendezvous protocol. These options are identical. This variable takes an argument, in bytes, to switch from the eager protocol to the rendezvous protocol for point-to-point messaging. Increasing the limit might help for larger blocks and if most of the communication is with the nearest neighbor.
The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.
4097 bytes
PAMID_PT2PT_LIMITS
Specify all point-to-point limit overrides. This environment variable is processed after the PAMID_EAGER or PAMID_RZV, PAMID_EAGER_LOCAL or PAMID_RZV_LOCAL, and PAMID_SHORT environment variables.
The entire point-to-point limit set is determined by three Boolean configuration values:
'is non-local limit' versus 'is local limit'
'is eager limit' versus 'is immediate limit'
'is application limit' versus 'is internal limit'
The point-to-point configuration limit values are specified in order and are delimited by ':' characters. If a value is not specified for a given configuration, the limit is not changed. There is no requirement to specify all eight configuration values. However, to set the last (eighth) configuration value, the previous seven configurations must be listed. The 'k', 'K', 'm', and 'M' multipliers can be specified. For example:
 
PAMID_PT2PT_LIMITS=":::::::10k"
The configuration entries can be described as:
0 remote eager application limit
1 local eager application limit
2 remote immediate application limit
3 local immediate application limit
4 remote eager internal limit
5 local eager internal limit
6 remote immediate internal limit
7 local immediate internal limit
 
The following examples show how the PAMID_PT2PT_LIMITS variable can be used.
"10K" sets the application internode eager limit (the "normal" eager limit)
"10240::64" sets the application internode eager and immediate limits
"::::0:0:0:0" disables 'eager' and 'immediate' for all internal point-to-point limits
This environment variable does not override any point-to-point limits by default.
 
If no other point-to-point limit environment variables are used, and if the job size is less than PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT, the effective default value is:
 
4097:4097:113:113:2049:64:113:113
 
If no other point-to-point limit environment variables are used, and if the job size is not less than PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT, the effective default value is:
 
4097:4097:113:113:0:0:0:0
See description
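The positional override format of PAMID_PT2PT_LIMITS can be illustrated with a small Python parser. The function names are hypothetical, and the sketch assumes that the 'k' and 'm' suffixes multiply by 1024 and 1024 × 1024, as stated for the shared memory size variables in this appendix.

```python
def parse_size(text):
    """Convert a value such as '10k' or '1M' to bytes (assumed 1024-based)."""
    multipliers = {"k": 1024, "m": 1024 * 1024}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


def apply_pt2pt_limits(current, override):
    """Apply a colon-delimited override string to the eight limits, in order."""
    fields = override.split(":")
    if len(fields) > len(current):
        raise ValueError("at most eight limit values can be specified")
    updated = list(current)
    for index, field in enumerate(fields):
        if field:                     # an empty field leaves the limit unchanged
            updated[index] = parse_size(field)
    return updated


# Effective defaults for small jobs, as listed in the description above.
defaults = [4097, 4097, 113, 113, 2049, 64, 113, 113]
no_internal_eager = apply_pt2pt_limits(defaults, "::::0:0:0:0")
```

Applying "::::0:0:0:0" to the small-job defaults reproduces the large-job defaults 4097:4097:113:113:0:0:0:0 given in the description.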
PAMID_RMA_PENDING
Maximum outstanding Remote Memory Access (RMA) requests. Limits the number of PAMI_Request objects allocated by MPI one-sided (also known as RMA) operations.
The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.
1000
PAMID_SHMEM_PT2PT
Determines whether intranode point-to-point communication uses the optimized shared memory protocols:
0 Optimized shared memory protocols are not used.
1 Optimized shared memory protocols are used.
1
PAMID_SHORT
Sets the cutoff for the switch to the eager protocol. This variable takes an argument, in bytes, to switch from the short protocol to the eager protocol for point-to-point messaging. If a value greater than 113 bytes is specified, 113 bytes are used.
The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.
113 bytes
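The way the PAMID_SHORT and PAMID_EAGER cutoffs partition message sizes can be sketched in Python as follows. The function name is hypothetical, and the exact boundary behavior (whether a message equal to a cutoff uses the smaller or the larger protocol) is an assumption; the sketch treats each cutoff as inclusive. The 113-byte cap on the short cutoff comes from the PAMID_SHORT description.

```python
SHORT_CAP = 113  # values above 113 bytes are treated as 113


def select_protocol(message_bytes, short_limit=113, eager_limit=4097):
    """Choose the point-to-point protocol for a message of the given size."""
    short_limit = min(short_limit, SHORT_CAP)
    if message_bytes <= short_limit:      # assumed inclusive cutoff
        return "short"
    if message_bytes <= eager_limit:      # larger than this uses rendezvous
        return "eager"
    return "rendezvous"


for size in (64, 1000, 100000):
    print(size, select_protocol(size))
```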
PAMID_STATISTICS
Turns on the printing of statistics for the message layer such as the maximum receive queue depth. Possible values:
0 No statistics are printed.
1 Statistics are printed.
0
PAMID_THREAD_MULTIPLE
Specifies the messaging execution environment. It specifically selects whether there can be multiple independent communications occurring in parallel, driven by internal communications threads:
0 The application threads drive the communications. No additional internal communications threads are used. This setting is equivalent to specifying PAMID_ASYNC_PROGRESS=0, PAMID_CONTEXT_POST=0, PAMID_CONTEXT_MAX=1.
1 There can be multiple independent communications occurring in parallel, driven by internal communications threads. This setting is equivalent to specifying PAMID_ASYNC_PROGRESS=1, PAMID_CONTEXT_POST=1, PAMID_CONTEXT_MAX>1.
 
The default value depends on the settings that are used:
0 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), or the MPI_Init_thread() function is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called.
1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and the MPI_Init_thread() function is called with MPI_THREAD_MULTIPLE.
 
The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.
See description
PAMID_VERBOSE
Provides debugging information during MPI_Abort() and during various MPI function calls. Some settings affect performance.
To simplify debugging, set this variable to 1 for all applications:
0 No additional information is provided.
1 Print summary information during the MPI_Init() function call on rank 0. Use this setting to simplify debugging. It only impacts performance by using the MPI_Init() function to print the information. A small amount of text is printed, including the PAMID_, PAMI_, MUSPI_, COMMAGENT_, and BG_ environment variables and other variables that the user specifies. The MPI_Init() function does not verify that variable names are specified correctly.
2 Print summary information for collective operations and print additional information for point-to-point. This information can be useful when debugging which collective is being used on a communicator. Approximately one line of output per rank per communicator is created, and one line of output per rank of point-to-point send statistics is provided on finalize. This setting can affect the performance of routines that are typically not timed (for example, MPI_Comm_create, MPI_Finalize, and so on).
3 Print detailed information. This setting generates extensive information when used with large numbers of ranks.
0
Table D-2 shows the default values for the COMMAGENT_RGETPACINGMAX environment variable.
Table D-2 Default values for the COMMAGENT_RGETPACINGMAX variable
Block size (racks)     COMMAGENT_RGETPACINGMAX value
racks < 2              65536
2 ≤ racks < 4          65536
4 ≤ racks < 8          32768
8 ≤ racks < 16         24576
16 ≤ racks < 32        24576
32 ≤ racks < 48        24576
48 ≤ racks < 64        24576
64 ≤ racks < 80        24576
80 ≤ racks < 96        24576
96 ≤ racks             24576
Table D-3 shows the default values for the COMMAGENT_RGETPACINGSUBSIZE environment variable. These values vary depending on the size of the block where the job is being run.
Table D-3 Default settings for the COMMAGENT_RGETPACINGSUBSIZE variable
Block size (racks)     COMMAGENT_RGETPACINGSUBSIZE value
racks < 2              16384
2 ≤ racks < 4          16384
4 ≤ racks < 8          8192
8 ≤ racks < 16         8192
16 ≤ racks < 32        8192
32 ≤ racks < 48        8192
48 ≤ racks < 64        8192
64 ≤ racks < 80        8192
80 ≤ racks < 96        8192
96 ≤ racks             8192
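The defaults in Tables D-2 and D-3 can be expressed as a simple lookup, shown in the following Python sketch. The function name is hypothetical, and the sketch reads each table row as a half-open range of racks (for example, 2 up to but not including 4). Because every row at 8 racks and above carries the same values, only three boundaries are needed.

```python
import bisect

# Upper bounds (exclusive, in racks) of the first three rows of Tables D-2 and D-3.
_BOUNDS = [2, 4, 8]
_PACING_MAX = [65536, 65536, 32768, 24576]   # Table D-2 values
_SUB_SIZE = [16384, 16384, 8192, 8192]       # Table D-3 values


def pacing_defaults(racks):
    """Return (COMMAGENT_RGETPACINGMAX, COMMAGENT_RGETPACINGSUBSIZE) defaults."""
    row = bisect.bisect_right(_BOUNDS, racks)
    return _PACING_MAX[row], _SUB_SIZE[row]


for racks in (1, 2, 4, 8, 96):
    pacing_max, sub_size = pacing_defaults(racks)
    # The documented constraint holds for every default pair.
    assert pacing_max % sub_size == 0
```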
Table D-4 shows the default values for the PAMI_ROUTING environment variable. These values vary depending on the size of the block where the job is being run.
Table D-4 Default values for the PAMI_ROUTING environment variable
Block size            Size     Small   Flexibility metric range (low - high)   Routing when in range   Routing when out of range
32 ≤ nodes < 64       65536    4       1.5 - 3.5                               3                       3
64 ≤ nodes < 128      65536    4       1.5 - 3.5                               3                       3
128 ≤ nodes < 256     65536    4       1.5 - 3.5                               3                       3
256 ≤ nodes < 512     65536    4       1.5 - 3.5                               3                       3
512 ≤ nodes < 1024    65536    4       1.5 - 3.5                               3                       3
1 ≤ racks < 2         65536    4       1.5 - 3.5                               2                       0
2 ≤ racks < 4         65536    4       1.5 - 3.5                               3                       0
4 ≤ racks < 8         65536    4       1.5 - 3.5                               3                       0
8 ≤ racks < 16        65536    4       1.5 - 2.5                               3                       0
16 ≤ racks < 32       65536    4       1.3 - 3.0                               3                       0
32 ≤ racks < 48       65536    4       1.3 - 3.0                               3                       0
48 ≤ racks < 64       65536    4       1.3 - 3.0                               3                       0
64 ≤ racks < 80       65536    4       0.75 - 3.0                              3                       0
80 ≤ racks < 96       65536    4       0.75 - 3.0                              3                       0
96 ≤ racks            65536    4       0.75 - 3.0                              3                       0
Compute Node Kernel environment variables
Several environment variables affect the runtime characteristics of the Compute Node Kernel (CNK). If these variables are set incorrectly, programs might not run. None of these environment variables are required to be set for the CNK to work.
Table D-5 lists the CNK environment variables.
Table D-5 CNK environment variables
Environment variable
Description
Default value
BG_AGENTHEAPSIZE
The heap size in MB that is allocated to the application agent process.
The default value is 16 if an application agent is defined in the BG_APPAGENT environment variable. Otherwise, it is 0.
16 or 0.
See description.
BG_AGENTCOMMHEAPSIZE
The heap size in MB that is allocated to the application agent that is reserved for use by the messaging software.
The default value is 16 if an application agent for messaging is not disabled by BG_APPAGENTCOMM=DISABLE. Otherwise, it is 0.
16 or 0.
See description.
BG_APPAGENT
The path to an application agent program. The default is no application agent.
See description
BG_APPAGENTCOMM
The path to an application agent that is reserved for use by the messaging software. To disable the default PAMI application agent, specify DISABLE.
The default value is /bgsys/drivers/ppcfloor/agents/bin/comm.elf
See description
BG_COREDUMPBINARY
Specifies the MPI ranks for which a binary core file is generated rather than a lightweight core file. This type of core file can be used with the GNU Project Debugger (GDB) but not the Blue Gene/Q Coreprocessor utility. If this variable is not set, all ranks generate a lightweight core file. To generate a binary core file, set the variable to a comma-separated list of the ranks. To have all ranks generate a binary core file, set the variable to “*” (an asterisk).
See description
BG_COREDUMPDISABLED
Boolean that specifies whether core files are created:
0 Enable creation of core files.
1 Disable creation of core files.
0
BG_COREDUMPFILEPREFIX
Sets the file name prefix of the core files. The MPI task number is appended to this prefix to form the file name.
“core.”
BG_COREDUMPFPR
Boolean that controls whether register information is included in the core files. BG_COREDUMPFPR controls the output of floating-point registers (FPRs):
0 Disable this setting.
1 Enable this setting.
1
BG_COREDUMPGPR
Boolean that controls whether register information is included in the core files. BG_COREDUMPGPR controls the output of integer general-purpose registers (GPRs):
0 Disable this setting.
1 Enable this setting.
1
BG_COREDUMPINTCOUNT
Boolean that controls whether the number of interrupts handled by the node is included in the core file:
0 Disable this setting.
1 Enable this setting.
1
BG_COREDUMPMAXNODES
Specifies the maximum number of nodes that generate core files for abnormally terminating processes. This variable can be used to limit the number of core files generated in cases when most of the processes in a very large block abnormally terminate.
2048
BG_COREDUMPONERROR
Boolean that controls the creation of core files when the application exits with a nonzero exit status. This variable is useful when the application performed an exit(1) operation and the cause and location of the exit(1) is not known:
0 Disable this setting.
1 Enable this setting.
0
BG_COREDUMPONEXIT
Boolean that controls the creation of core files when the application exits. This variable is useful when the application performed an exit() operation and the cause and location of the exit() operation is not known. To enable this setting, the value must be set to 1:
0 Disable this setting.
1 Enable this setting.
0
BG_COREDUMPPATH
Sets the directory for the core files.
The default value is the current working directory.
See description.
BG_COREDUMPPERS
Boolean that controls whether the node personality information (XYZ dimension location, memory size, and so on) is included in the core files:
0 Disable this setting.
1 Enable this setting.
1
BG_COREDUMPRANKS
Specifies a comma-separated list of ranks that generate a core file when the job ends. Core file generation for the ranks in this list is not prevented by any other BG_COREDUMP environment variable.
See description.
BG_COREDUMPREGS
Boolean that controls whether register information is included in the core files. BG_COREDUMPREGS is the master switch:
0 Disable this setting.
1 Enable this setting.
1
BG_COREDUMPSPR
Boolean that controls whether register information is included in the core files. BG_COREDUMPSPR controls the output of special-purpose registers (SPRs):
0 Disable this setting.
1 Enable this setting.
1
BG_COREDUMPSTACK
Boolean that controls whether the application stack addresses are to be included in the core file:
0 Disable this setting.
1 Enable this setting.
0
BG_COREDUMPTLBS
Boolean that controls whether the TLB layout at the time of the core dump is to be included in the core file:
0 Disable this setting.
1 Enable this setting.
1
BG_MAPCOMMONHEAP
This option obtains a uniform heap allocation between the processes; however, the trade-off is that memory protection between processes is not as stringent. In particular, when using the option, it is possible to write into another process's heap. Normally this would cause a segmentation violation, but with this option set, the protection mechanism is disabled to provide a balanced heap allocation. The processes still have independent heaps, and system calls return EFAULT if an address is passed in that is out-of-bounds.
0
BG_MAPNOALIASES
This option disables long-running alias mode. This feature is used for some transactional memory (TM) or speculative execution (SE) configurations.
0
BG_MAPALIGN16
This option changes the memory alignment restrictions for ppn = 16, ppn = 32, and ppn = 64. With the option enabled, the physical memory for each process begins on a 16 MB boundary, as opposed to a power-of-2 size. This has the potential for better memory mappings for ppn = 16. By default, BG_MAPALIGN16 is enabled.
1
BG_MAXALIGNEXP
The maximum number of floating-point alignment exceptions that the CNK can handle. If the maximum is exceeded, the application core dumps:
0 No alignment exceptions are processed.
-1 All alignment exceptions are processed.
<n> n alignment exceptions are processed.
1000
BG_PERSISTMEMRESET
Boolean that indicates that the persistent memory region must be cleared before the job starts:
0 Disable this setting.
1 Enable this setting.
0
BG_PERSISTMEMSIZE
Size, in MB, of the persistent memory region.
0
BG_POWERMGMTDUR
The number of microseconds spent in one proactive power management idle loop.
0
BG_POWERMGMTPERIOD
The number of microseconds between proactive power management idle loops. When 0, power management is disabled.
0
BG_SHAREDMEMSIZE
Size, in MB, of the shared memory region. To increase the default value by a specific number of MB, specify a '+' prefix with the value. To replace the default value with a new value, omit the '+'.
 
The default shared memory size is chosen by the CNK based on the known requirements of the current configuration:
 
If ranks-per-node = 1, the shared memory size defaults to 32 MB.
If ranks-per-node > 2, the shared memory size defaults to 64 MB.
See description.
BG_STACKGUARDENABLE
Boolean that indicates whether the CNK creates guard pages. If the variable is specified, its value must be either 0 or 1:
0 Do not create guard pages.
1 Create guard pages.
0
BG_STACKGUARDSIZE
The size, in bytes, of the main() function stack guard area. If the specified value is greater than zero but less than 512, 512 bytes are used.
4096
BG_SYSIODPOSIXMODE
Run I/O operations with POSIX rules:
0 An I/O operation that is initiated from a compute node can cause multiple I/O operations on the I/O node.
1 Each I/O operation that is initiated from a compute node completes atomically.
0
BG_THREADLAYOUT
Specifies the algorithm that the CNK uses to select a hardware thread during software thread creation:
1 Assign software threads across the cores within the process before assigning software threads to additional hardware threads within a core.
2 Assign software threads to all hardware threads within a core before assigning software threads on other cores.
1
BG_THREADMODEL
Activates a specific thread model:
0 Operate in the native Blue Gene/Q thread model, allowing multiple pthreads per hardware thread.
1 Allow only one application pthread per hardware thread.
2 For V1R2M0 and later releases, enable the extended thread affinity control.
0
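Two rules in the table above are easier to see with concrete numbers: the '+' prefix of BG_SHAREDMEMSIZE and the two BG_THREADLAYOUT placement orders. The following shell functions are a simplified model of those descriptions, not CNK code. The function names, the core and hardware-thread counts passed as arguments, and the numbering of software threads from 0 are assumptions made for illustration.

```shell
# Simplified model of two table entries; not taken from the CNK source.

# Resolve a BG_SHAREDMEMSIZE setting against the default size (both in MB).
# A leading '+' adds to the default; a plain number replaces it.
effective_shared_mem_mb() {
  default_mb=$1
  setting=$2
  case "$setting" in
    "") echo "$default_mb" ;;
    +*) extra=${setting#+}
        echo $((default_mb + extra)) ;;
    *)  echo "$setting" ;;
  esac
}

# Model the placement of the Nth software thread (index counted from 0, an
# assumption) in a process that owns `cores` cores, each with
# `threads_per_core` hardware threads.
#   layout 1: spread across the cores first
#   layout 2: fill every hardware thread of a core first
hw_thread_for() {
  layout=$1; index=$2; cores=$3; threads_per_core=$4
  if [ "$layout" -eq 1 ]; then
    core=$((index % cores)); slot=$((index / cores))
  else
    core=$((index / threads_per_core)); slot=$((index % threads_per_core))
  fi
  echo "core=$core slot=$slot"
}

effective_shared_mem_mb 32 +16   # prints 48
effective_shared_mem_mb 64 128   # prints 128
hw_thread_for 1 2 4 4            # prints core=2 slot=0
hw_thread_for 2 2 4 4            # prints core=0 slot=2
```

With a 32 MB default, +16 yields 48 MB, whereas 128 replaces a 64 MB default outright. For the third software thread on a hypothetical 4-core process, layout 1 moves to a new core while layout 2 stays on the first core. The model ignores any other factors the CNK might weigh, such as hardware threads that are already occupied.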
Setting environment variables
The simplest method to set environment variables is to specify them on the command line when running the runjob command. For example, to set the environment variable XYZ to the value ABC, call the runjob command as the following example shows:
$ runjob --envs XYZ=ABC myprogram.rts
To set multiple environment variables, separate them with spaces, for example:
$ runjob --envs XYZ=ABC DEF=123 myprogram.rts
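As a sketch of how variables from this appendix might be combined, the following snippet assembles a runjob command in the form shown above and prints it for review instead of launching a job. The choice of variables and their values is an illustrative assumption, myprogram.rts is the placeholder program name from the preceding examples, and any additional runjob options that a site requires are omitted.

```shell
# Illustrative settings only; choose values that suit the application.
# BG_COREDUMPONERROR=1   : write core files when the exit status is nonzero
# BG_COREDUMPMAXNODES=16 : limit how many nodes generate core files
# BG_SHAREDMEMSIZE=+16   : add 16 MB to the default shared memory size
envs="BG_COREDUMPONERROR=1 BG_COREDUMPMAXNODES=16 BG_SHAREDMEMSIZE=+16"

# Build the command line, separating the variables with spaces as above.
cmd="runjob --envs $envs myprogram.rts"

# Print the command for inspection; run it by hand once it looks right.
echo "$cmd"
```

Printing the command first makes it easy to catch a misspelled variable name before the job is submitted.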
 