Appendix D. MPI and CNK environment variables

MPI and CNK environment variables

This appendix describes the environment variables that affect the runtime characteristics of programs that run on the Blue Gene/Q compute nodes. These variables configure settings for the Message Passing Interface (MPI) and the Compute Node Kernel (CNK).

Environment variables can be used to improve performance or modify functional attributes of the application.

The following topics are covered:

•Message Passing Interface environment variables

•Compute Node Kernel environment variables

•Setting environment variables

Message Passing Interface environment variables

The Blue Gene/Q Message Passing Interface (MPI) implementation provides several environment variables that affect its behavior. Setting these environment variables can allow a program to run faster, or, if set incorrectly, might cause the program not to run at all. None of these environment variables are required to be set for the Blue Gene/Q MPI implementation to work.

Table D-1 shows the MPI environment variables.

Table D-1 MPI environment variables

Environment variable	Description	Default value
COMMAGENT_RGETPACINGMAX	The maximum number of bytes allowed to be in the network at one time as a result of paced remote gets from each node. This number must be a multiple of COMMAGENT_RGETPACINGSUBSIZE. The default value depends on the number of nodes in the block, as shown in Table D-2.	See Table D-2
COMMAGENT_RGETPACINGSUBSIZE	The size, in bytes, of a submessage used for remote get pacing. The pacing logic breaks a large remote get into submessages of this size. Table D-3 shows the default values for the COMMAGENT_RGETPACINGSUBSIZE environment variable. These values vary depending on the size of the block where the job is being run.	See Table D-3
MUSPI_INJFIFOSIZE	The size, in bytes, of each injection first-in, first-out queue (FIFO). These FIFOs store 64-byte descriptors. Each descriptor describes a memory buffer to be sent on the torus. Making this size larger might reduce memory usage and latency when there are many outstanding messages. Reducing this size might increase that memory usage and latency. PAMI messaging optimally uses 10 injection FIFOs per context, although fewer FIFOs can be used when resources are constrained.	65536 (64 KB)
MUSPI_NUMBATIDS	The number of base address table IDs per process reserved for use by a messaging unit (MU) SPI application.	0
MUSPI_NUMCLASSROUTES	The number of collective class routes reserved for use by an MU SPI application. This value is also the number of global interrupt class routes reserved for use by an MU SPI application.	0
MUSPI_NUMINJFIFOS	The number of injection FIFOs per process reserved for use by an MU system programming interface (SPI) application.	0
MUSPI_NUMRECFIFOS	The number of reception FIFOs per process reserved for use by an MU SPI application.	0
MUSPI_RECFIFOSIZE	The size, in bytes, of each reception FIFO. Incoming torus packets are stored in this FIFO until software can process them. Making this size larger can reduce torus network congestion. Making this size smaller leaves more memory available to the application. PAMI messaging uses one reception FIFO per context.	1048576 bytes (1 MB)
PAMI_A2A_PACING_WINDOW	The number of simultaneous send operations to start on Alltoall(v) collectives. Additional send operations cannot be started until these operations finish. This requirement reduces the resource usage for large geometries.	1024
PAMI_ATOMICBARRIER_LOOPS	The number of attempts to complete the barrier in each pass.	32
PAMI_CLIENT_SHMEMSIZE	The number of bytes that are allocated from shared memory to each client. Use the “K” and “k” suffix as a 1024 multiplier or the “M” and “m” suffix as a 1024 × 1024 multiplier. The default value depends on the number of tasks on the node: 4M when more than one task is on the node, and 0 in all other cases.	See description
PAMI_CLIENTS	A comma-separated ordered list of clients (no spaces). The complete syntax is [name][:repeat][/weight][,[name][:repeat][/weight]]* Each client has the form [name][:repeat][/weight], where: •"name" is the name of the client. For example, the name of the Blue Gene/Q MPICH2 client is MPI. The default value for this option is the null string. •":repeat" is the repetition factor, where repeat is the number of clients having this same name. The default value for this option is 1. •"/weight" is the relative weight assigned to the client, where weight is the weight value. The default value for this option is 1. The weight is used to determine the portion of the messaging resources that is given to the client, relative to the other clients. When middleware calls the PAMI_Client_create() function, it provides the name of the client. PAMI searches through the PAMI_CLIENTS in the order they are specified, looking for an exact name match. If there is not an exact name match with any of the PAMI_CLIENTS, PAMI searches through the PAMI_CLIENTS again, looking for a client with a null name string. The null name string is a wildcard and matches any client name. If there are exact or wildcard name matches, the first match that does not already have an active client is used, and the weight of that client determines the percentage of the available resources that are allocated to the client. If there are no available and matching clients, the PAMI client is not created. The default value of the PAMI_CLIENTS environment variable is :1/1, which means that all resources are assigned to the first client created, regardless of the client name, and all subsequent attempts to create a client fail due to insufficient resources. If any of the clients specified on PAMI_CLIENTS are unnamed, or more than one client has the same name, the order in which the clients are created must be the same on all processes in the job. The first client listed has exclusive use of the message unit combining collective hardware for optimizing reduction operations. The other clients use algorithms that do not use the message unit combining collective hardware.	":1/1"
PAMI_CLIENTS (continued)	The following examples show how the PAMI_CLIENTS variable can be used. •PAMI_CLIENTS=MPI,CLIENTA means that up to two clients can use PAMI: one must be MPI, and the other must be CLIENTA. The MPI client is assigned the message unit combining collective hardware, and the two clients evenly split the remaining messaging resources. •PAMI_CLIENTS=MPI:3,CLIENTA/2,CLIENTB:2/3 means that up to six clients can use PAMI. Three can be MPI, one can be CLIENTA, and two can be CLIENTB. Each MPI client has weight 1. CLIENTA has weight 2, and each CLIENTB client has weight 3. In this example, each CLIENTB client gets three times the amount of resources as each MPI client, and the first MPI client created is assigned the message unit combining collective hardware. •PAMI_CLIENTS=MPI/3,/2 means that up to three clients can use PAMI. Two of the clients are unnamed, meaning that they can be any of the PAMI clients, and one client can only be MPI. The first MPI client created has resource weight 3 and is assigned the message unit combining collective hardware. The first non-MPI client created (or possibly the second MPI client created) has resource weight 2, and the second non-MPI client created (or possibly the second or third MPI client created) has resource weight 1. •PAMI_CLIENTS is not specified. This setting means that there can be only one client, with any name, and it is assigned all of the resources. •Default ":1/1" PAMI uses one reception FIFO per context and, optimally, uses 10 injection FIFOs per context, although fewer injection FIFOs can be used when resources are constrained. For more information, see the descriptions of the following variables: •MUSPI_NUMBATIDS •MUSPI_NUMCLASSROUTES •MUSPI_NUMINJFIFOS •MUSPI_NUMRECFIFOS •MUSPI_INJFIFOSIZE •MUSPI_RECFIFOSIZE •PAMI_MU_RESOURCES	":1/1"
PAMI_CONTEXT_SHMEMSIZE	Number of bytes allocated from shared memory to every context in each client. Use the “K” and “k” suffix as a 1024 multiplier or the “M” and “m” suffix as a 1024 × 1024 multiplier.	135K
PAMI_GLOBAL_SHMEMSIZE	Number of bytes allocated from shared memory for global information such as the mapcache. Use the “K” and “k” suffix as a 1024 multiplier, or the “M” and “m” suffix as a 1024 × 1024 multiplier.	4M
PAMI_M2M_ROUTING	For an all-to-all message transfer, this setting specifies the network routing that is used. "DETERMINISTIC" Use deterministic routing. "DYNAMIC" Use dynamic routing.	"DYNAMIC"
PAMI_M2M_ZONE	For an all-to-all message transfer that uses DYNAMIC routing, this variable specifies the routing zone that is used: •0 Use zone 0. •1 Use zone 1. •2 Use zone 2. •3 Use zone 3. The default setting depends on the size of the block: •1 The block is smaller than 512 nodes. •0 The block is 512 nodes or larger.	See description
PAMI_MAX_COMMTHREADS	Maximum number of commthreads to create. This setting can be used to avoid hardware thread oversubscription.	(64 / ranks per node) - 1
PAMI_MEMORY_OPTIMIZED	Determines whether PAMI is configured for a restricted memory job. If not set, PAMI is not memory optimized and uses memory as needed to increase performance.	Not set.
PAMI_MU_RESOURCES	Determines whether PAMI calculates the number of available contexts based on an “optimal” or a “minimal” allocation of MU resources to each context. Supported environment variable values are not case-sensitive and include: Optimal An optimal allocation of MU resources to each context limits the maximum number of contexts that can be created. Each context is allocated sufficient MU resources to fully use the MU hardware and torus network. Minimal A minimal allocation of MU resources to each context allows the maximum number of contexts to be created regardless of MU hardware and torus network considerations.	"Optimal"
PAMI_NUMDYNAMICROUTING	Number of simultaneous dynamically routed messages per context. If more than this many messages are being transferred, the additional messages are deterministically routed. Dynamic routing can be faster than deterministic routing. However, dynamically routed messages require more storage to track their progress, hence the reason for this option. Specify this number in increments of 64 (for example: 64, 128, 192, 256, ...).	64
PAMI_RGETINJFIFOSIZE	The size, in bytes, of each remote get FIFO. These FIFOs store 64-byte descriptors. Each descriptor describes a memory buffer to be sent on the torus, and is used to queue requests for data (remote gets). Making this size larger might reduce torus network congestion and reduce overhead. Making this size smaller might increase that congestion, memory usage, and latency. PAMI messaging uses 10 remote get FIFOs per node.	65536 (64 KB)
PAMI_RGETPACING	Specifies whether to consider messages for pacing: •0 No messages are paced. •1 Messages are considered for pacing. The default setting depends on the block size. •0 The block size is one rack (1024 nodes) or smaller. •1 The block size is larger than one rack.	See description
PAMI_RGETPACINGDIMS	Messages between nodes whose coordinates differ in more than this many dimensions in ABCD are considered for pacing. For example, node A has ABCD coordinates (0,0,0,0) and node B has (3,2,1,0). They differ in three dimensions (A, B, and C). Specifying 2 means that messages between these nodes are considered for pacing.	1
PAMI_RGETPACINGHOPS	Messages between nodes that are more than this many hops apart on the network are considered for pacing.	4
PAMI_RGETPACINGSIZE	Messages exceeding this size in bytes are considered for pacing.	65536 (64 KB)
PAMI_ROUTING	Specifies the PAMI network routing options to be used for point‑to‑point messages that are large enough to use the rendezvous protocol. That is, the messages are larger than the size specified for PAMID_EAGER. The complete syntax is PAMI_ROUTING=[size][,[small][,[low:high][,[in][,out]]]] When the source and destination nodes are on a network line (their ABCDE coordinates differ in at most one dimension), deterministic routing is always used. PAMI_ROUTING does not override this setting. When the message size is less than or equal to "size", PAMI uses the "small" network routing. When the message size is larger than "size", PAMI uses the "flexibility metric" to determine the network routing as follows: The "low:high" range is the flexibility metric range. The flexibility metric gauges the routing flexibility between a source node and destination node. The values for "low" and "high" must be floating-point numbers in the range 0.0 through 4.0. The low value must be less than or equal to the high value. PAMI computes the flexibility metric between the source node and the destination node of a messaging transfer. The metric is the sum of the flexibility of dimensions A, B, C, and D between those nodes. The flexibility of a particular dimension is the ratio of the number of hops between the source and the destination in that dimension to the size of that dimension, and can range from 0.0 through 1.0. The "in" value is the network routing to be used for a message transfer between two nodes when their flexibility metric is between "low" and "high", and the "out" value is the network routing otherwise. The values for "in", "out", and "small" can each be one of the following values. 0 Dynamic routing zone 0 1 Dynamic routing zone 1 2 Dynamic routing zone 2 3 Dynamic routing zone 3 4 Deterministic routing	See Table D-4
PAMID_ASYNC_PROGRESS	This variable determines whether one or more communications threads are started to make asynchronous progress. This variable is required for maximum performance in message throughput cases: 0 No internal communications threads are started to assist with making communications progress 1 One or more communications threads can assist with making communications progress. Use this setting when high message throughput is required. The default value depends on the settings that are used: 0 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), MPI_Init_thread() is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called. 1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and the MPI_Init_thread() function is called with MPI_THREAD_MULTIPLE. The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.	See description
PAMID_COLLECTIVE_name	Turns on or off specific protocols for the MPI collective specified as name. Possible values for collectives are ALLTOALL, ALLTOALLV, ALLREDUCE, BARRIER, BCAST, SCATTER, SCATTERV, GATHER, GATHERV, ALLGATHER, ALLGATHERV, SCAN, and REDUCE. The MPICH option can be used to turn off all optimizations for a specific collective and use the MPICH point-to-point protocol. For many MPI_collective operations, this setting can cause poor performance on larger blocks. For information about other options and the default values, use the PAMID_VERBOSE=2 setting. For some PAMID_COLLECTIVE_* environment variables, especially PAMID_COLLECTIVE_ALLREDUCE, some optimized protocols only work with specific parameters (such as data type, operation, or message size) that are specified for the particular MPI_collective invocation. Therefore, specifying a specific protocol on the PAMID_COLLECTIVE_* environment variable might not work for a given MPI_collective invocation. In that case, the MPICH protocol is used instead. To find out which protocol was used for a specific MPI_collective invocation, invoke the MPIX_Get_last_algorithm_name() function immediately after the MPI_collective invocation. For more information, see the mpix.h file.	See description
PAMID_COLLECTIVES	Controls whether optimized collectives are used. The possible values are: 0 Optimized collectives are not used. Only MPICH point-to-point based collectives are used. 1 Optimized collectives are used.	1
PAMID_CONTEXT_MAX	This variable sets the maximum allowable number of contexts. Contexts are a method of dividing hardware resources among a Parallel Active Messaging Interface (PAMI) client (for example, MPI) to set how many parallel operations can occur at one time. Contexts are similar to channels in a communications system. The practical maximum is usually 64 contexts per node. The default value depends on the number of processes per node and the settings that are used: 1 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), the MPI_Init_thread() function is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called. ≥1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and the MPI_Init_thread() function is called with MPI_THREAD_MULTIPLE. The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.	See description
PAMID_CONTEXT_POST	This variable must be enabled to allow parallelism of multiple contexts. It might increase latency. Enabling this variable is the only method to allow parallelism between contexts: 0 Only one parallel communications context can be used. Each operation runs in the application thread. 1 Multiple parallel communications contexts can be used. An operation is posted to one of the contexts, and communications for that context are driven by communications threads. The default value depends on the settings that are used: 0 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), the MPI_Init_thread() function is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called. 1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and MPI_Init_thread() is called with MPI_THREAD_MULTIPLE. The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.	See description
PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT	Overrides the default job size at which point the eager protocols are disabled for internal MPI operations. This override has the same effect as specifying the environment variable: PAMID_PT2PT_LIMITS=::::0:0:0:0 This environment variable is processed before the PAMID_EAGER or PAMID_RZV, PAMID_EAGER_LOCAL or PAMID_RZV_LOCAL, PAMID_SHORT, and PAMID_PT2PT_LIMITS environment variables.	512k
PAMID_EAGER_LOCAL, PAMID_RZV_LOCAL	Sets the cutoff value for the switch to the rendezvous protocol when the destination rank is local. The two options are identical. This variable takes an argument, in bytes, to switch from the eager protocol to the rendezvous protocol for point-to-point messaging. The default value effectively disables the eager protocol for local transfers because the default value for PAMID_EAGER_LOCAL is less than the default value for PAMID_SHORT. The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.	4097 bytes
PAMID_EAGER, PAMID_RZV	Sets the cutoff for the switch to the rendezvous protocol. These options are identical. This variable takes an argument, in bytes, to switch from the eager protocol to the rendezvous protocol for point-to-point messaging. Increasing the limit might help for larger blocks and if most of the communication is with the nearest neighbor. The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.	4097 bytes
PAMID_PT2PT_LIMITS	Specify all point-to-point limit overrides. This environment variable is processed after the PAMID_EAGER or PAMID_RZV, PAMID_EAGER_LOCAL or PAMID_RZV_LOCAL, and PAMID_SHORT environment variables. The entire point-to-point limit set is determined by three Boolean configuration values: •'is non-local limit' versus 'is local limit' •'is eager limit' versus 'is immediate limit' •'is application limit' versus 'is internal limit' The point-to-point configuration limit values are specified in order and are delimited by ':' characters. If a value is not specified for a given configuration, the limit is not changed. There is no requirement to specify all eight configuration values. However, to set the last (eighth) configuration value, the previous seven configurations must be listed. The 'k', 'K', 'm', and 'M' multipliers can be specified. For example: PAMID_PT2PT_LIMITS=":::::::10k" The configuration entries can be described as: 0 remote eager application limit 1 local eager application limit 2 remote immediate application limit 3 local immediate application limit 4 remote eager internal limit 5 local eager internal limit 6 remote immediate internal limit 7 local immediate internal limit The following examples show how the PAMID_PT2PT_LIMITS variable can be used. •"10K" sets the application internode eager limit (the "normal" eager limit) •"10240::64" sets the application internode eager and immediate limits •"::::0:0:0:0" disables 'eager' and 'immediate' for all internal point-to-point limits This environment variable does not override any point-to-point limits by default. If no other point-to-point limit environment variables are used, and if the job size is less than PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT, the effective default value is: 4097:4097:113:113:2049:64:113:113 If no other point-to-point limit environment variables are used, and if the job size is not less than PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT, the effective default value is: 4097:4097:113:113:0:0:0:0	See description
PAMID_RMA_PENDING	Maximum outstanding Remote Memory Access (RMA) requests. Limits the number of PAMI_Request objects allocated by MPI one-sided (also known as RMA) operations. The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.	1000
PAMID_SHMEM_PT2PT	Determines whether intranode point-to-point communication uses the optimized shared memory protocols: 0 Optimized shared memory protocols are not used. 1 Optimized shared memory protocols are used.	1
PAMID_SHORT	Sets the cutoff for the switch to the eager protocol. This variable takes an argument, in bytes, to switch from the short protocol to the eager protocol for point-to-point messaging. If a value greater than 113 bytes is specified, 113 bytes are used. The 'K' and 'M' multipliers can be used in the value. For example, "16K" or "1M" can be used.	113 bytes
PAMID_STATISTICS	Turns on the printing of statistics for the message layer such as the maximum receive queue depth. Possible values: 0 No statistics are printed. 1 Statistics are printed.	0
PAMID_THREAD_MULTIPLE	Specifies the messaging execution environment. It specifically selects whether there can be multiple independent communications occurring in parallel, driven by internal communications threads: 0 The application threads drive the communications. No additional internal communications threads are used. This setting is equivalent to specifying PAMID_ASYNC_PROGRESS=0, PAMID_CONTEXT_POST=0, PAMID_CONTEXT_MAX=1. 1 There can be multiple independent communications occurring in parallel, driven by internal communications threads. This setting is equivalent to specifying PAMID_ASYNC_PROGRESS=1, PAMID_CONTEXT_POST=1, PAMID_CONTEXT_MAX≥1. The default value depends on the settings that are used: 0 The application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, xl.legacy.ndebug), or the MPI_Init_thread() function is called without MPI_THREAD_MULTIPLE, or the MPI_Init() function is called. 1 The application is linked with the gcc, xl, and xl.ndebug MPICH libraries and the MPI_Init_thread() function is called with MPI_THREAD_MULTIPLE. The default value cannot be changed when the application is linked with the "legacy" MPICH libraries (gcc.legacy, xl.legacy, or xl.legacy.ndebug). Attempting to set this environment variable causes an application that is linked with a "legacy" MPICH library to display a message and exit during the MPI_Init() function call.	See description
PAMID_VERBOSE	Provides debugging information during MPI_Abort() and during various MPI function calls. Some settings affect performance. To simplify debugging, set this variable to 1 for all applications: 0 No additional information is provided. 1 Print summary information during the MPI_Init() function call on rank 0. Use this setting to simplify debugging. It only impacts performance by using the MPI_Init() function to print the information. A small amount of text is printed, including the PAMID_, PAMI_, MUSPI_, COMMAGENT_, and BG_ environment variables and other variables that the user specifies. The MPI_Init() function does not verify that variable names are specified correctly. 2 Print summary information for collective operations and print additional information for point-to-point. This information can be useful when debugging which collective is being used on a communicator. Approximately one line of output per rank per communicator is created, and one line of output per rank of point-to-point send statistics are provided on finalize. This setting can affect the performance of routines that are typically not timed (for example, MPI_Comm_create, MPI_Finalize, and so on). 3 Print detailed information. This setting generates extensive information when used with large numbers of ranks.	0
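
The positional syntax of PAMID_PT2PT_LIMITS in Table D-1 can be easier to follow as a small model. The following Python sketch is illustrative only and is not part of the MPI library. The names parse_size and apply_pt2pt_limits are hypothetical, the 'K' and 'M' multipliers are assumed to be 1024 and 1024 × 1024, and the starting values are the effective defaults listed for a job smaller than PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT.

```python
# Illustrative only: models how a PAMID_PT2PT_LIMITS string overrides the
# eight point-to-point limits. The helper names are hypothetical.

LIMIT_NAMES = [
    "remote eager application",
    "local eager application",
    "remote immediate application",
    "local immediate application",
    "remote eager internal",
    "local eager internal",
    "remote immediate internal",
    "local immediate internal",
]

# Effective defaults from Table D-1 for a job smaller than
# PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT.
DEFAULT_LIMITS = [4097, 4097, 113, 113, 2049, 64, 113, 113]


def parse_size(text):
    """Convert '10k', '1M', or '64' to a byte count (assumes K = 1024)."""
    multipliers = {"k": 1024, "m": 1024 * 1024}
    suffix = text[-1].lower()
    if suffix in multipliers:
        return int(text[:-1]) * multipliers[suffix]
    return int(text)


def apply_pt2pt_limits(setting, defaults=DEFAULT_LIMITS):
    """Return the eight limits after applying the colon-delimited overrides."""
    limits = list(defaults)
    fields = setting.split(":")
    if len(fields) > len(limits):
        raise ValueError("at most eight values can be specified")
    for index, field in enumerate(fields):
        if field:  # an empty field leaves that limit unchanged
            limits[index] = parse_size(field)
    return limits


if __name__ == "__main__":
    for setting in ("10K", "10240::64", "::::0:0:0:0", ":::::::10k"):
        limits = apply_pt2pt_limits(setting)
        print(setting, "->", dict(zip(LIMIT_NAMES, limits)))
```

In this model, "::::0:0:0:0" produces the same eight values that the table gives as the effective default for jobs at or above PAMID_DISABLE_INTERNAL_EAGER_TASK_LIMIT.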

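The [name][:repeat][/weight] syntax of PAMI_CLIENTS in Table D-1 can also be sketched in Python. This is a rough model under stated assumptions, not PAMI code: the names parse_pami_clients and resource_shares are hypothetical, and the share calculation assumes only that resources are divided in proportion to weight. It does not model the name-matching order or the assignment of the combining collective hardware to the first client.

```python
# Illustrative only: expands a PAMI_CLIENTS value into (name, weight) slots
# and computes each slot's relative share. The helper names are hypothetical.
from fractions import Fraction


def parse_pami_clients(value=":1/1"):
    """Expand [name][:repeat][/weight] entries into a list of (name, weight)."""
    slots = []
    for entry in value.split(","):
        entry, _, weight = entry.partition("/")
        name, _, repeat = entry.partition(":")
        # An omitted repeat or weight defaults to 1; an empty name is a wildcard.
        slots.extend([(name, int(weight or 1))] * int(repeat or 1))
    return slots


def resource_shares(slots):
    """Return each slot's weight as a fraction of the total weight."""
    total = sum(weight for _, weight in slots)
    return [(name, Fraction(weight, total)) for name, weight in slots]


if __name__ == "__main__":
    for name, share in resource_shares(
            parse_pami_clients("MPI:3,CLIENTA/2,CLIENTB:2/3")):
        print(name, share)
```

For PAMI_CLIENTS=MPI:3,CLIENTA/2,CLIENTB:2/3, the sketch yields six slots, and each CLIENTB slot has three times the share of each MPI slot, which agrees with the example in the table.
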
Table D-2 shows the default values for the COMMAGENT_RGETPACINGMAX environment variable.

Table D-2 Default values for the COMMAGENT_RGETPACINGMAX variable

Block size (racks)	COMMAGENT_RGETPACINGMAX value
Racks < 2	65536
2 ≤ racks < 4	65536
4 ≤ racks < 8	32768
8 ≤ racks < 16	24576
16 ≤ racks < 32	24576
32 ≤ racks < 48	24576
48 ≤ racks < 64	24576
64 ≤ racks < 80	24576
80 ≤ racks < 96	24576
96 ≤ racks	24576

Table D-3 shows the default values for the COMMAGENT_RGETPACINGSUBSIZE environment variable. These values vary depending on the size of the block where the job is being run.

Table D-3 Default settings for the COMMAGENT_RGETPACINGSUBSIZE variable

Block size (racks)	COMMAGENT_RGETPACINGSUBSIZE value
Racks < 2	16384
2 ≤ racks < 4	16384
4 ≤ racks < 8	8192
8 ≤ racks < 16	8192
16 ≤ racks < 32	8192
32 ≤ racks < 48	8192
48 ≤ racks < 64	8192
64 ≤ racks < 80	8192
80 ≤ racks < 96	8192
96 ≤ racks	8192
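
The defaults in Tables D-2 and D-3 can be checked against the rule that COMMAGENT_RGETPACINGMAX must be a multiple of COMMAGENT_RGETPACINGSUBSIZE. The following Python sketch is illustrative only; the names default_rget_pacing and is_valid_pacing are hypothetical and do not correspond to any Blue Gene/Q interface.

```python
# Illustrative only: default remote get pacing values from Tables D-2 and D-3,
# keyed by block size in racks. The helper names are hypothetical.


def default_rget_pacing(racks):
    """Return (COMMAGENT_RGETPACINGMAX, COMMAGENT_RGETPACINGSUBSIZE) defaults."""
    if racks < 4:
        return 65536, 16384
    if racks < 8:
        return 32768, 8192
    return 24576, 8192


def is_valid_pacing(pacing_max, subsize):
    """COMMAGENT_RGETPACINGMAX must be a multiple of the submessage size."""
    return subsize > 0 and pacing_max % subsize == 0


if __name__ == "__main__":
    for racks in (1, 4, 96):
        pacing_max, subsize = default_rget_pacing(racks)
        # The quotient is the number of paced submessages in flight per node.
        print(racks, pacing_max, subsize, pacing_max // subsize)
```

A check of this kind can be useful before overriding only one of the two variables, because a new submessage size might no longer divide the default maximum evenly.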

Table D-4 shows the default values for the PAMI_ROUTING environment variable. These values vary depending on the size of the block where the job is being run.

Table D-4 Default values for the PAMI_ROUTING environment variable

Block size	Size	Small	Flexibility metric range, low - high	Routing when in range	Routing when out of range
32 ≤ nodes < 64	65536	4	1.5 - 3.5	3	3
64 ≤ nodes < 128	65536	4	1.5 - 3.5	3	3
128 ≤ nodes < 256	65536	4	1.5 - 3.5	3	3
256 ≤ nodes < 512	65536	4	1.5 - 3.5	3	3
512 ≤ nodes < 1024	65536	4	1.5 - 3.5	3	3
1 ≤ racks < 2	65536	4	1.5 - 3.5	2	0
2 ≤ racks < 4	65536	4	1.5 - 3.5	3	0
4 ≤ racks < 8	65536	4	1.5 - 3.5	3	0
8 ≤ racks < 16	65536	4	1.5 - 2.5	3	0
16 ≤ racks < 32	65536	4	1.3 - 3.0	3	0
32 ≤ racks < 48	65536	4	1.3 - 3.0	3	0
48 ≤ racks < 64	65536	4	1.3 - 3.0	3	0
64 ≤ racks < 80	65536	4	0.75 - 3.0	3	0
80 ≤ racks < 96	65536	4	0.75 - 3.0	3	0
96 ≤ racks	65536	4	0.75 - 3.0	3	0
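
The PAMI_ROUTING decision described in Table D-1 can be sketched with the Table D-4 defaults. The following Python sketch is a model of the documented rules only, not PAMI code. The function names are hypothetical, the hop counts are supplied by the caller rather than derived from coordinates, and the range check is assumed to include both the low and high values.

```python
# Illustrative only: models the PAMI_ROUTING decision described in Table D-1.
# The function names and the inclusive range check are assumptions.

DETERMINISTIC = 4  # values 0 through 3 select dynamic routing zones 0 through 3


def flexibility_metric(hops, dim_sizes):
    """Sum hops/size over dimensions A, B, C, and D (0.0 through 4.0)."""
    return sum(h / size for h, size in zip(hops[:4], dim_sizes[:4]))


def choose_routing(message_size, hops, dim_sizes, size=65536, small=4,
                   low=1.5, high=3.5, in_range=3, out_of_range=3):
    """Return the routing value for one rendezvous message.

    hops holds the hop counts in the A, B, C, D, and E dimensions.
    The keyword defaults are the Table D-4 values for blocks of
    32 through 1023 nodes.
    """
    # Nodes on a network line (differing in at most one of ABCDE)
    # always use deterministic routing.
    if sum(1 for h in hops if h != 0) <= 1:
        return DETERMINISTIC
    if message_size <= size:
        return small
    metric = flexibility_metric(hops, dim_sizes)
    return in_range if low <= metric <= high else out_of_range


if __name__ == "__main__":
    # A 100000-byte message on a block of one rack (in=2, out=0 in Table D-4).
    print(choose_routing(100000, (2, 2, 2, 2, 0), (4, 4, 4, 4),
                         in_range=2, out_of_range=0))
```

In this example the metric is 2/4 + 2/4 + 2/4 + 2/4 = 2.0, which falls inside the 1.5 - 3.5 range, so the "in" routing value is selected.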

Compute Node Kernel environment variables

Several environment variables affect the runtime characteristics of the Compute Node Kernel (CNK). If these variables are set incorrectly, programs might not run. None of these environment variables are required to be set for the CNK to work.

Table D-5 lists the CNK environment variables.

Table D-5 CNK environment variables

Environment variable	Description	Default value
BG_AGENTHEAPSIZE	The heap size in MB that is allocated to the application agent process. The default value is 16 if an application agent is defined in the BG_APPAGENT environment variable. Otherwise, it is 0.	16 or 0. See description.
BG_AGENTCOMMHEAPSIZE	The heap size in MB that is allocated to the application agent that is reserved for use by the messaging software. The default value is 16 if an application agent for messaging is not disabled by BG_APPAGENTCOMM=DISABLE. Otherwise, it is 0.	16 or 0. See description.
BG_APPAGENT	The path to an application agent program. The default is no application agent.	See description
BG_APPAGENTCOMM	The path to an application agent that is reserved for use by the messaging software. To disable the default PAMI application agent, specify DISABLE. The default value is /bgsys/drivers/ppcfloor/agents/bin/comm.elf	See description
BG_COREDUMPBINARY	Specifies the MPI ranks for which a binary core file is generated rather than a lightweight core file. This type of core file can be used with the GNU Project Debugger (GDB) but not the Blue Gene/Q Coreprocessor utility. If this variable is not set, all ranks generate a lightweight core file. To generate a binary core file, set the variable to a comma-separated list of the ranks. To have all ranks generate a binary core file, set the variable to “*” (an asterisk).	See description
BG_COREDUMPDISABLED	Boolean that specifies whether core files are created: 0 Enable creation of core files. 1 Disable creation of core files.	0
BG_COREDUMPFILEPREFIX	Sets the file name prefix of the core files. The MPI task number is appended to this prefix to form the file name.	“core.”
BG_COREDUMPFPR	Boolean that controls whether register information is included in the core files. BG_COREDUMPFPR controls output of floating-point registers (FPRs): 0 Disable this setting. 1 Enable this setting.	1
BG_COREDUMPGPR	Boolean that controls whether register information is included in the core files. BG_COREDUMPGPR controls output of integer general-purpose registers (GPRs): 0 Disable this setting. 1 Enable this setting.	1
BG_COREDUMPINTCOUNT	Boolean that controls whether the number of interrupts handled by the node is included in the core file: 0 Disable this setting. 1 Enable this setting.	1
BG_COREDUMPMAXNODES	Specifies the maximum number of nodes that generate core files for abnormally terminating processes. This variable can be used to limit the number of core files generated in cases when most of the processes in a very large block abnormally terminate.	2048
BG_COREDUMPONERROR	Boolean that controls the creation of core files when the application exits with a nonzero exit status. This variable is useful when the application performed an exit(1) operation and the cause and location of the exit(1) is not known: 0 Disable this setting. 1 Enable this setting.	0
BG_COREDUMPONEXIT	Boolean that controls the creation of core files when the application exits. This variable is useful when the application performed an exit() operation and the cause and location of the exit() operation is not known. To enable this setting, the value must be set to 1: 0 Disable this setting. 1 Enable this setting.	0
BG_COREDUMPPATH	Sets the directory for the core files. The default value is the current working directory.	See description.
BG_COREDUMPPERS	Boolean that controls whether the node personality information (XYZ dimension location, memory size, and so on) is included in the core files: 0 Disable this setting. 1 Enable this setting.	1
BG_COREDUMPRANKS	Specifies a comma-separated list of ranks that generate a core file when the job ends. Core file generation for the ranks in this list is not suppressed by any other BG_COREDUMP environment variable.	See description.
BG_COREDUMPREGS	Boolean that controls whether register information is included in the core files. BG_COREDUMPREGS is the master switch: 0 Disable this setting. 1 Enable this setting.	1
BG_COREDUMPSPR	Boolean that controls whether register information is included in the core files. BG_COREDUMPSPR controls the output of special-purpose registers (SPRs): 0 Disable this setting. 1 Enable this setting.	1
BG_COREDUMPSTACK	Boolean that controls whether the application stack addresses are to be included in the core file: 0 Disable this setting. 1 Enable this setting.	0
BG_COREDUMPTLBS	Boolean that controls whether the TLB layout at the time of the core is to be included in the core file: 0 Disable this setting. 1 Enable this setting.	1
BG_MAPCOMMONHEAP	This option obtains a uniform heap allocation between the processes; however, the trade-off is that memory protection between processes is not as stringent. In particular, when this option is used, it is possible to write into another process's heap. Normally this would cause a segmentation violation, but with this option set, the protection mechanism is disabled to provide a balanced heap allocation. The processes have independent heaps, and system calls return EFAULT if an out-of-bounds address is passed in.	0
BG_MAPNOALIASES	This option disables long-running alias mode. This feature is used for some transactional memory (TM) or speculative execution (SE) configurations.	0
BG_MAPALIGN16	This option changes the memory alignment restrictions for ppn = 16, ppn = 32, and ppn = 64. With the option enabled, the physical memory for each process begins on a 16 MB boundary, as opposed to a power-of-2 size. This has the potential for better memory mappings for ppn ≥ 16. By default, BG_MAPALIGN16 is enabled.	1
BG_MAXALIGNEXP	The maximum number of floating-point alignment exceptions that the CNK can handle. If the maximum is exceeded, the application core dumps: 0 No alignment exceptions are processed. -1 All alignment exceptions are processed. <n> n alignment exceptions are processed.	1000
BG_PERSISTMEMRESET	Boolean that indicates that the persistent memory region must be cleared before the job starts: 0 Disable this setting. 1 Enable this setting.	0
BG_PERSISTMEMSIZE	Size, in MB, of the persistent memory region.	0
BG_POWERMGMTDUR	The number of microseconds spent in one proactive power management idle loop.	0
BG_POWERMGMTPERIOD	The number of microseconds between proactive power management idle loops. When 0, power management is disabled.	0
BG_SHAREDMEMSIZE	Size, in MB, of the shared memory region. To increase the default value by a specific number of MB, specify a '+' prefix with the value. To replace the default value with a new value, omit the '+'. The default shared memory size is chosen by the CNK based on the known requirements of the current configuration: If ranks-per-node = 1, the shared memory size defaults to 32 MB. If ranks-per-node > 1, the shared memory size defaults to 64 MB.	See description.
BG_STACKGUARDENABLE	Boolean that indicates whether the CNK creates guard pages. If the variable is specified, its value must be either 0 or 1: 0 Do not create guard pages. 1 Create guard pages.	0
BG_STACKGUARDSIZE	The size, in bytes, of the main() function stack guard area. If the specified value is greater than zero but less than 512, 512 bytes are used.	4096
BG_SYSIODPOSIXMODE	Run I/O operations with POSIX rules: 0 I/O operation that is initiated from a compute node can cause multiple I/O operations on the I/O node. 1 Each I/O operation that is initiated from a compute node completes atomically.	0
BG_THREADLAYOUT	Specifies the algorithm that the CNK uses to select a hardware thread during software thread creation: 1 Assign software threads across the cores within the process before assigning software threads to additional hardware threads within a core. 2 Assign software threads to all hardware threads within a core before assigning software threads on other cores.	1
BG_THREADMODEL	Activates a specific thread model: 0 Operate in the native Blue Gene/Q thread model, allowing multiple pthreads per hardware thread. 1 Allow only one application pthread per hardware thread. 2 For V1R2M0 and later releases, enable the extended thread affinity control.	0
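
The '+' prefix rule for BG_SHAREDMEMSIZE and the 512-byte minimum for BG_STACKGUARDSIZE described in the table can be illustrated with a small shell sketch. The helper names effective_shm_mb and effective_guard_bytes are hypothetical and are not part of the CNK. The default size is passed in as an argument rather than derived from ranks-per-node, and the sketch only mirrors the rules as stated in the table, not how the CNK implements them.

```shell
# Hypothetical helpers that mirror the rules stated in the table.
# They are illustrative only and are not part of the CNK.

# $1 = default shared memory size in MB (chosen by the CNK)
# $2 = BG_SHAREDMEMSIZE value; empty means the variable is not set
effective_shm_mb() {
    default_mb=$1
    setting=$2
    case "$setting" in
        "")
            # Variable not set: the default applies.
            echo "$default_mb"
            ;;
        +*)
            # '+' prefix: increase the default by the given number of MB.
            extra=${setting#+}
            echo $(( default_mb + extra ))
            ;;
        *)
            # No prefix: replace the default.
            echo "$setting"
            ;;
    esac
}

# $1 = BG_STACKGUARDSIZE value in bytes (assumed to be a non-negative integer)
effective_guard_bytes() {
    if [ "$1" -gt 0 ] && [ "$1" -lt 512 ]; then
        # Greater than zero but less than 512: 512 bytes are used.
        echo 512
    else
        echo "$1"
    fi
}

effective_shm_mb 32 ""       # prints 32
effective_shm_mb 32 +16      # prints 48
effective_shm_mb 64 128      # prints 128
effective_guard_bytes 100    # prints 512
effective_guard_bytes 4096   # prints 4096
```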

Setting environment variables

The simplest way to set environment variables is to specify them on the command line of the runjob command. For example, to set the environment variable XYZ to the value ABC, call the runjob command as the following example shows:

$ runjob --envs XYZ=ABC myprogram.rts

To set multiple environment variables, separate them with spaces, for example:

$ runjob --envs XYZ=ABC DEF=123 myprogram.rts
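
As a hedged illustration, the same form can be used to pass CNK variables from this appendix, for example to request core files on exit. The sketch below only prints a command line in the form shown above and does not run it. The function name print_runjob_cmd and the directory /tmp/cores are hypothetical, and a real invocation might need further runjob options that this appendix does not cover.

```shell
# Hypothetical helper: prints (does not run) a runjob command line in the
# form shown above. $1 = executable; remaining arguments = NAME=VALUE pairs.
print_runjob_cmd() {
    exe=$1
    shift
    for pair in "$@"; do
        case "$pair" in
            *=*) ;;  # looks like NAME=VALUE, accept it
            *)
                echo "not a NAME=VALUE pair: $pair" >&2
                return 1
                ;;
        esac
    done
    # "$*" joins the pairs with single spaces (default IFS).
    echo "runjob --envs $* $exe"
}

# /tmp/cores is an assumed example directory, not a documented default.
print_runjob_cmd myprogram.rts BG_COREDUMPONEXIT=1 BG_COREDUMPPATH=/tmp/cores
# prints: runjob --envs BG_COREDUMPONEXIT=1 BG_COREDUMPPATH=/tmp/cores myprogram.rts
```

Values that contain spaces are not handled by this sketch.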
