34

–––––––––––––––––––––––

Parallel Programming Models for Scalable Computing

James Dinan and Pavan Balaji

34.1   INTRODUCTION TO PARALLEL PROGRAMMING MODELS

A programming model can be thought of as the abstract machine that a programmer is writing instructions for. Such models form a rich topic for computer science research since programmers prefer them to be (1) expressive (capable of expressing any abstract algorithm), (2) portable (capable of being used on any computer architecture), (3) efficient (capable of delivering performance commensurate with that of the underlying hardware), and (4) easy to use. Often, simultaneously meeting all four of these requirements is hard, so different programming models make different trade-offs. This is one of the primary reasons several dozen scientific programming models exist today.

At the core of parallel computing is the broad space of languages and libraries used to craft parallel programs. Recent changes in system architecture have led to renewed interest in programming models as a means for mapping computation to complex hardware architectures and for enhancing the productivity of parallel programming on these platforms. Deepening memory hierarchies and nonuniformity in both the communication and the memory subsystems has led to an increased interest in models that can effectively incorporate the cost of data access, allowing programmers to exploit data locality in their application.

A parallel programming model defines the methods and mechanisms used to express and manage parallelism within an application. Parallel programming models encapsulate many diverse concepts. At their core, such models must define two components: a data model and a computation model. The data model defines data visibility boundaries and methods of data exchange. The computation model specifies the control and execution models, defining the characteristics of the computational entities in a parallel application. In addition to these core properties, parallel programming models define methods by which parallelism can be expressed as well as models and methods for reasoning about and tuning the performance of a parallel program.

Trends in system architecture are driving rapid increases in the number of processing elements (cores and threads) per chip. Such trends have resulted in application programmers looking at what are referred to as hybrid programming models, in which more than one programming model is used in the same application. Using different programming models within the node and across nodes is a common example of a hybrid programming model.

34.1.1   Data Model

The data model defines the visibility of data across computational entities as well as how these entities exchange information. The two most common data visibility models are shared memory and private memory. In the shared-memory model, multiple computational entities (e.g., threads) share a common region of memory that can be used to exchange information using load and store instructions. In the private-memory model, the memory belonging to a computational entity (e.g., a process) is accessible only by that entity. Data exchange in this model is often accomplished by using explicit send and receive operations.

A third data model, remotely accessible memory, lies in between conventional shared and private models. In this model, an entity's memory is divided into private regions and remotely accessible regions. Data within shared regions is exchanged by using asynchronous, one-sided get and put operations that copy data between the data spaces of the origin and target entities. Data within private regions is exchanged by copying information into the shared space or through send and receive operations.
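
To make this model concrete, the following sketch uses MPI's one-sided (remote memory access) interface, one possible realization of remotely accessible memory, to expose a window of memory on each process and to read a neighbor's window with a one-sided get operation. The window size and the choice of neighbor are illustrative assumptions rather than part of any particular model.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, nprocs;
    double *base, remote;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Expose a remotely accessible region of 100 doubles per process. */
    MPI_Win_allocate(100 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &base, &win);

    base[0] = rank;            /* local store into the exposed region */
    MPI_Win_fence(0, win);     /* separate local initialization from remote access */

    /* One-sided get: copy element 0 from the next rank's window.
       No receive call is required on the target process. */
    int target = (rank + 1) % nprocs;
    MPI_Get(&remote, 1, MPI_DOUBLE, target, 0, 1, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);     /* complete the outstanding one-sided operation */

    printf("rank %d read %.0f from rank %d\n", rank, remote, target);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}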

34.1.2   Computation Model

The computational model defines the control and execution mechanisms of the programming model. Many control and execution models are possible; common control models define computational entities as processes, persistent threads, lightweight threads, or tasks. Often, the control model implies a data visibility model; however, the relationship between control and data visibility models is not strict since some models utilize both shared memory and processes (e.g., distributed shared memory or interprocess communication systems), whereas others provide private memory when using threads (e.g., Threaded MPI [1]).

34.1.3   Expression of Parallelism

Many diverse methods of expressing parallelism have been proposed in the parallel computing literature. Each method captures different classes of parallelism, affecting the programmability of a particular computational task under that method. Examples of these methods are loop-level data parallelism and fork‒join task parallelism.

Often the method of expressing parallelism also defines a model for how work will be mapped to the underlying computational entities. A common mechanism is static assignment of work to computational entities; this method provides very low overhead and is effective when the optimal mapping can be computed a priori. When a fixed work assignment is not possible, the work mapping is dynamic. The dynamic mapping can be achieved through collective rebalancing steps (e.g., graph partitioning, space-filling curve) or through noncollective redistribution of work. Noncollective work distribution can be achieved through centralized and distributed schemes. A common centralized scheme is master‒worker, and a commonly used distributed strategy is work stealing.

34.1.4   Performance Model

In the high-performance computing space, parallelism is a means for increasing the performance of a given computation. Therefore, most parallel programming models also define a performance model that is used by programmers to reason about the execution time of their algorithm. The performance model defines the relative cost of operations for a particular programming model. In addition, it may define ways to utilize low-level operations to increase performance.

34.2   THE MESSAGE-PASSING INTERFACE (MPI)

Traditionally, supercomputing vendors each provided different system-specific techniques for computation and data movement. While this approach allowed each vendor to provide an interface best suited for the platforms supported by the vendor, it created a portability hassle for applications; that is, applications written on one system could not easily be ported to other systems. This situation was cumbersome and counterproductive for large applications because they were either locked in to specific platforms or had to be rewritten for every new machine acquired.

In an attempt to mitigate these portability problems, several researchers designed portability interfaces, such as p4 [2], Chameleon [3], and PVM [4], that were intended to allow applications to execute on any platform without any modifications. However, it was hard for such interfaces to keep up with the rapidly evolving landscape of supercomputing architectures without support from vendors developing these architectures. In November 1992, a group of computer science researchers, application developers, and vendors gathered together to standardize the best practices in distributed-memory parallel programming into a common interface. The idea of such standardization was to provide a portable interface that is rich with respect to features and is expressive enough to allow different platforms to optimize their implementations. This standardized interface came to be known as MPI [5] and the standardization body as the MPI Forum.

In May 1994, the MPI Forum released the first version of the MPI standard: MPI-1.0. It provided a large set of functionality, including two-sided point-to-point communication and group/collective communication operations, that allowed applications to express their algorithms easily and to achieve high performance on every platform. MPI quickly gained popularity among application developers because of its standardized interface, which worked portably on all platforms. Consequently, many applications migrated to MPI.

In May 1995, the MPI Forum reconvened to correct minor errors within the MPI-1.0 standard (the versions were referred to as MPI-1.1, MPI-1.2, etc.) and to form the MPI-2.0 standard. This new standard consisted of major new feature additions to MPI, including one-sided communication operations, file I/O capabilities, interoperability with threading models, and dynamically spawned processes. The MPI-2.0 standard was released in July 1997.

Over the past two decades, several MPI implementations have emerged for various supercomputing platforms, allowing users to portably take advantage of the performance provided by those platforms. Today, MPI is considered the de facto standard of parallel programming and is available on virtually every supercomputing system in the world.

The rest of the section describes various capabilities in MPI. Section 34.2.1 discusses some of the core concepts in MPI, including its execution model semantics. Next, we discuss two fundamental communication methodologies within MPI: two-sided communication (Section 34.2.2) and group/collective communication (Section 34.2.3).

34.2.1 Core Concepts in MPI

MPI utilizes a process-centric model in which the user launches a number of processes and explicitly partitions work on these processes. In order to understand MPI, four concepts need to be clarified: communicators, ranks, tags, and data types.

34.2.1.1   Communicators   A communicator is a combination of a group (which lists a set of processes) and a context identifier (which, in some sense, is a name alias to the group of processes). Multiple communicators can be created on the same group of processes. A communicator provides a conduit for communication; use of separate communicators allows multiple libraries linked into a single executable to safely communicate without interfering with the communication of other libraries within the same executable.

34.2.1.2   Ranks   Every process in MPI has an identifier, known as the process rank, for every communicator that it is a part of. The end points of all communication in MPI are identified by using process ranks.

34.2.1.3   Tags  Every point-to-point message in MPI has an associated tag. The tag is user defined and can be used to distinguish messages from one process rank to another over the same communicator.

34.2.1.4   Data Types   MPI defines data types that describe the data being communicated. Most C, Fortran, and C++ data types have predefined equivalents in MPI. Further, MPI also allows users to create user-defined data types such as vectors, structures, and multidimensional subarrays.
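
As an illustration of these core concepts, the following minimal C sketch queries the rank and size of the predefined MPI_COMM_WORLD communicator and constructs a simple user-defined datatype; the vector layout chosen here (every other element of a small buffer) is purely illustrative.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Datatype alternating;

    MPI_Init(&argc, &argv);

    /* Ranks are relative to a communicator; MPI_COMM_WORLD contains
       every process launched as part of the job. */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am rank %d of %d\n", rank, size);

    /* A user-defined datatype describing every other double in a buffer
       of eight: four blocks of one element, separated by a stride of two. */
    MPI_Type_vector(4, 1, 2, MPI_DOUBLE, &alternating);
    MPI_Type_commit(&alternating);
    /* ... the type may now be used in any send or receive call ... */
    MPI_Type_free(&alternating);

    MPI_Finalize();
    return 0;
}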

34.2.2 Two-Sided Communication

Two-sided communication in MPI involves two process ranks: a sender and a receiver. Every message sent by a process rank has to be explicitly received by the destination process rank.

34.2.2.1   Communication Operations   A process rank can send a point-to-point message to another process rank using four types of send operations: regular send, buffered send, ready send, and synchronous send. A regular send operation is the most commonly used variant whereby the completion of the send operation guarantees only that the user buffer is free to be reused. The MPI implementation might choose to internally copy the data to a temporary buffer if available or to perform some form of zero-copy communication. Depending on how much memory the MPI implementation is willing to expend, it might or might not have temporary buffer space available to perform a copy—this implementation detail is completely hidden from the user. To provide better control over this to the user, MPI also provides a buffered send operation through which the user can provide this temporary intermediate buffer to the MPI implementation, thus potentially optimizing communication performance.

When a process performs a regular or buffered send, the receiver is not required to have called a corresponding receive operation yet. Thus, the MPI implementation has to perform the necessary synchronization between the processes in such cases. A ready send allows the user to specify to the MPI implementation that the receiver process is “ready” and has already called a corresponding receive operation.

Completion of a regular, buffered, or ready send operation does not guarantee that the message has been received by the receiver, just that the sender buffer is free to be reused. A synchronous send operation, on the other hand, allows the user to wait until the message has been received by the destination process.

For all four forms of communication, three variants are allowed: blocking, nonblocking, and persistent. With blocking communication, when a send call returns, the send operation has completed. With nonblocking communication, when a send call returns, it means only that the send operation has been initiated and can be tested or waited on for completion at a later time. This allows the MPI implementation to perform communication in the background while the application computes between the nonblocking send and wait calls. With persistent communication, the user can “initiate” a send operation prior to actually issuing it. This allows the MPI implementation to perform some optimizations before the actual send operation, such as registering the communication buffer with the network interface.

For receive operations, only a regular receive is provided, which can receive data sent using any form of send operation. All three variants of receive—blocking, nonblocking, and persistent—are provided as well.
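
The sketch below shows a minimal two-sided exchange between two ranks, assuming the communicator contains at least two processes; the tag value is arbitrary. The blocking calls return once the buffers can be reused, while the nonblocking receive allows computation to overlap the incoming transfer.

#include <mpi.h>

/* Ping-pong between ranks 0 and 1 of comm: a blocking send and receive,
   followed by a nonblocking receive of the reply on rank 0. */
void ping_pong(MPI_Comm comm) {
    int rank;
    double ping = 3.14, pong = 0.0;
    MPI_Comm_rank(comm, &rank);

    if (rank == 0) {
        /* Regular blocking send: returns once the send buffer may be reused. */
        MPI_Send(&ping, 1, MPI_DOUBLE, 1, 7, comm);

        /* Nonblocking receive: initiate, compute, then wait for completion. */
        MPI_Request req;
        MPI_Irecv(&pong, 1, MPI_DOUBLE, 1, 7, comm, &req);
        /* ... computation can overlap the incoming transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(&pong, 1, MPI_DOUBLE, 0, 7, comm, MPI_STATUS_IGNORE);
        MPI_Send(&pong, 1, MPI_DOUBLE, 0, 7, comm);   /* echo the value back */
    }
}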

34.2.2.2   Matching Semantics   MPI matches point-to-point communication based on three elements: the communicator, the source, and the tag. Two messages sent over the same communicator from the same source using the same tag are always ordered with respect to matching, though not necessarily with respect to completion; that is, if the receiver process posts two nonblocking receive operations on a communicator to receive data from the same source using the same tag, the first message sent by the sender will be deposited into the buffer pointed to by the first nonblocking receive, and the second message into the buffer pointed to by the second nonblocking receive.

However, no guarantees are made about the completion of the two send or receive operations for nonblocking communication; that is, the second send can finish before the first send; similarly, the second receive can finish before the first receive.

34.2.2.3   Wildcard Receives  For receive operations, MPI allows users to specify a wildcard source or tag, which matches all values of source ranks or tags. Such wildcards are useful when the destination process cannot deterministically decide which source rank might send it a message or which message tag it might receive next. Once the receive is complete, the user can query for which source sent the message and what tag the message contained.
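
The short sketch below illustrates a wildcard receive; the message contents and types are arbitrary. After completion, the status object reveals which source and tag actually matched.

#include <mpi.h>
#include <stdio.h>

/* Receive one integer from any sender using any tag, then inspect the
   status object to learn who sent the message and which tag it carried. */
void receive_any(MPI_Comm comm) {
    int value;
    MPI_Status status;

    MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, comm, &status);
    printf("received %d from rank %d with tag %d\n",
           value, status.MPI_SOURCE, status.MPI_TAG);
}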

34.2.3   Collective or Group Communication

Collective or group communication operations are "collective" over all process ranks in a communicator; that is, they must be called by all process ranks in the communicator. There are three major types of collective communication operations: synchronization, communication-only, and combined communication and computation. Other types of collective operations, such as communicator creation operations and process spawning operations, also exist, but we do not discuss them in this chapter.

34.2.3.1   Synchronization   MPI provides one operation for synchronizing processes within a communicator—MPI_Barrier. This operation performs a logical barrier across all processes: no process can exit the barrier before all other processes have arrived at it. It does not, however, guarantee any global synchronization with respect to wall-clock time.

34.2.3.2   Communication-Only   MPI provides several operations, such as MPI_Bcast, MPI_Gather, MPI_Allgather, and MPI_Alltoall, which are communication-only operations in that their only purpose is to move data between a group of processes. The MPI_Bcast operation broadcasts data from one process to all other processes in the communicator. The MPI_Gather operation gathers data provided by all processes in the communicator into a single process's buffer. The MPI_Allgather operation is similar to the MPI_Gather operation except that the data collected from all the processes are deposited at all processes, instead of a single process. The MPI_Alltoall operation allows each process to send a unique message to every other process in the communicator. Several other communication-only collective operations are also provided by MPI.
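
The following sketch combines two of these operations, a broadcast of a parameter followed by a gather of per-rank results; the parameter value and the use of rank 0 as the root are illustrative assumptions.

#include <mpi.h>
#include <stdlib.h>

/* Rank 0 broadcasts a parameter to all ranks, then gathers one result
   value from every rank into a root-side array. */
void distribute_and_collect(MPI_Comm comm) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double param = (rank == 0) ? 42.0 : 0.0;
    MPI_Bcast(&param, 1, MPI_DOUBLE, 0, comm);      /* every rank now holds 42.0 */

    double result = param * rank;                   /* stand-in for real work */
    double *all = (rank == 0) ? malloc(size * sizeof(double)) : NULL;
    MPI_Gather(&result, 1, MPI_DOUBLE, all, 1, MPI_DOUBLE, 0, comm);

    if (rank == 0) free(all);
}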

34.2.3.3   Communication and Computation   MPI provides a few combined communication‒computation operations, such as MPI_Reduce and MPI_Allreduce. The MPI_Reduce operation is similar to a gather operation except that, instead of simply accumulating the data, it performs a reduction operation on the data. Several reduction operations are defined in MPI, such as a sum, product, and bitwise AND/OR operations. The MPI_Allreduce operation is similar to an MPI_Reduce operation except that the resulting data are deposited at all processes instead of a single process.
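
A short sketch contrasting the two operations for a global sum is shown below; the choice of rank 0 as the root of the reduction is an assumption for illustration.

#include <mpi.h>

/* Sum a local partial result across all ranks.  MPI_Reduce deposits the
   total only at the root, whereas MPI_Allreduce deposits it at every rank. */
void global_sum(double local, MPI_Comm comm) {
    double total_at_root = 0.0, total_everywhere = 0.0;

    MPI_Reduce(&local, &total_at_root, 1, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Allreduce(&local, &total_everywhere, 1, MPI_DOUBLE, MPI_SUM, comm);
}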

34.3   PARTITIONED GLOBAL ADDRESS SPACE (PGAS) MODELS

The family of PGAS programming models shares a common data model that allows for global access to distributed, shared data. In this model, computational entities make a portion of their memory accessible to others in the computation. The aggregation of these shared-memory regions forms a global space (shown in Figure 34.1) that can be accessed throughout the computation by using portable global addresses. The global address space itself is partitioned: both the performance and the capability of an access depend on the locality of the data; if data are local, they can be accessed quickly, frequently through load and store operations.

Both language and library implementations of PGAS models have been proposed in the literature. In this section, we describe three of the most popular models: Unified Parallel C (UPC), Co-Array Fortran (CAF), and Global Arrays (GA). UPC and CAF are general-purpose language extensions that add a global address space to C and Fortran, respectively. GA takes a library approach that can be used with a variety of languages and focuses specifically on distributed shared arrays.

34.3.1   UPC

UPC [6] is an extension to the C programming language that adds support for a PGAS. UPC presents the programmer with a single-program, multiple-data (SPMD) execution model in which multiple instances of the same program are executed concurrently and each execution is parameterized by an integer identity. In the UPC terminology, an instance is referred to as a thread, the identity of the current instance is the constant MYTHREAD, and the number of instances is THREADS. The threads of execution in UPC can be implemented either as processes or as persistent threads; both are supported by some UPC environments.


FIGURE 34.1. Low-level implementation of a partitioned global address space model showing a one-sided access operation. NIC, network interface controller.

In a UPC execution, the standard C stack and heap are private; a shared heap is added that contains a given thread's partition of the global address space. A shared type qualifier is added to the C language that can be used to specify that a particular heap variable should be placed in the global address space or that a pointer points to a location on the shared heap. Shared pointers are portable and can be transmitted from one thread to another, making it possible to build distributed, shared, linked data structures (e.g., linked lists and trees) in UPC. A thread affinity is associated with the location targeted by a shared pointer. The affinity specifies on which thread the data resides and can be queried, allowing the programmer to leverage locality for improved performance. When a given piece of data has affinity to the local thread, the sharedness of the pointer can be type cast away, allowing for more efficient direct local access to the data.

Shared arrays in UPC have an associated data distribution. This data distribution determines the mapping between elements of the array and their thread affinity in the global address space. UPC supports a block-cyclic distribution method, and a layout qualifier can be provided in the array declaration to specify the block size of an array:

shared [B] double data[N];

In this example, the array data will contain N elements distributed with a block size of B. Data distribution in UPC is block-cyclic in the linearized representation of the array. When the layout qualifier is omitted, the default block size of 1 is used. An empty block size or a block size of 0 is referred to as an indefinite block size and results in all elements being allocated on thread 0. A block size of * tells UPC to create equal-sized blocks and to assign one block per thread.

UPC also adds a parallel loop construct, upc_forall, to the C language. This loop has the following form:

upc_forall (initialization-expr; conditional-expr; update-expr; affinity-expr) {…}

The affinity expression in the loop statement is an integer or shared expression that identifies the thread that should execute the given iteration. If a shared pointer or shared array reference is given, the thread with affinity to the given element executes the iteration. When an integer expression (e.g., i % THREADS) is given, the thread with the given identity executes the iteration.

A notable feature of UPC is its trade-off between programmability and performance. It is possible to quickly port a shared-memory or sequential application to UPC by adding the shared qualifier or by using UPC's shared-memory allocator for shared data structures and utilizing UPC's parallel loops. The programmer then can specify a data distribution and leverage locality to improve performance. In addition to these tools, the programmer can relax the consistency model on a per-object basis. The default strict consistency model provides strong, but costly, data consistency. The relaxed consistency model offers better performance but requires the programmer to perform explicit consistency-management operations. In addition, UPC provides one-sided data movement operations that can be used to further tune accesses to shared data.

34.3.2   CAF

CAF [7] is a PGAS extension to the Fortran programming language that adds support for distributed, shared array data structures. CAF was originally created as an extension to Fortran 95 and has recently been incorporated into the Fortran 2008 language standard [8]. The designers of CAF strove to create the smallest possible extension to Fortran that can enable its use as an effective and efficient parallel programming language. To this end, CAF extends Fortran with a PGAS data model and an SPMD execution model. In the CAF terminology, each SPMD execution instance is referred to as an image, and array data structures that are accessible from all images are referred to as coarrays.

Many PGAS models, such as UPC and GA, take a top-down approach to creating shared arrays, specifying the dimensions of the full array and how its elements should be distributed. In contrast, CAF takes a bottom-up approach, in which individual pieces located on different images are assembled to form the full array. This is accomplished through the addition of array codimensions that are used to create a logical array of the given corank, in which each element in the coarray is an array of the given rank.

Several examples of coarray specifications and their data layouts are shown in Figure 34.2. The dimensions of the array portion of a coarray are given in the standard parenthesized Fortran notation; the codimension is given in brackets. The corank, or number of codimensions, is one in Figure 34.2a and two in Figure 34.2b, c. The codimensions specify a logical organization of the constituent parts of the coarray, and the corank can be chosen independently of the array rank. A codimension of * indicates that CAF should calculate the given cobound to include as many images as possible. The coshape is the name used to describe the exact codimensions at runtime. In Figure 34.2 the coshape is shown for an execution with four images. For example, in Figure 34.2b, the coshape will be [2, N], where N = ⌈num_images/2⌉. An explicit corange can also be specified as the codimension (e.g., [-1:*]).


FIGURE 34.2. Several coarrays and their associated shapes for a four-image execution.

When accessing a coarray, programmers must specify the coindex that they wish to access. For the coarray given in Figure 34.2c, this could be performed as follows.

y = X(1, 1)

y = X(1, 1)[1, 1]

Z(:, :) = X(:, :)[2, 1]

In the first example, the coindex is omitted, resulting in an access to the image's local piece of the coarray. This type of array reference provides the fastest access to local data in the coarray. In the second example, an element of X is copied into the local variable y, potentially resulting in communication to fetch the value from the image that holds the given coindex. In the third example, the entire X array at coindex [2, 1] is copied into the local array Z.

Locality plays an important role in the performance of PGAS programs. CAF provides several mechanisms for querying locality with respect to a given image index. These mechanisms can be used to reduce communication overheads. Split-phase barriers and teamwise synchronization are also provided to localize synchronization and to reduce overheads. In addition, several synchronization and data consistency mechanisms are available to ensure that data are available for access by other images. CAF also provides several collective functions for operating on coarrays. These can be used to perform Boolean evaluation of coarrays, to locate minimum or maximum values, or to find the sum or product of elements in the coarray.

34.3.3   Global Arrays

Global Arrays (GA) [9, 10] is a library-based PGAS programming model that provides support for large, distributed, shared-array data structures. This model is especially appropriate when an array data structure that is larger than the memory of a single node is needed, and asynchronous or irregular accesses will be performed. This type of array is common to several important computational chemistry and physics methods that make extensive use of GA [11].

At present, GA provides C, C++, Fortran, and Python language interfaces. GA is built on top of the low-level Aggregate Remote Memory Copy Interface (ARMCI) [12], which provides support for mapping remotely accessible memory regions and for performing one-sided data movement operations. GA and ARMCI are designed to be interoperable with MPI, and many GA programs use both GA and MPI.

In the GA model, an array is allocated on a group of processes, and the elements of the array are distributed across the processes according to a specified data distribution. The programmer can choose from several standard distribution patterns or specify an arbitrary, customized data layout. Once an array has been allocated, it is accessed using get, put, and accumulate operations that perform one-sided data transfer from the array into a local buffer. The data to be transferred are specified as a rectangular patch in the index space of the array. The local buffer may also be a rectangular patch of a larger local buffer, and a leading dimensions argument is provided to GA to specify the local buffer layout. The location of an element in a global array and its distribution can be queried, enabling the programmer to leverage locality to reduce communication overheads. When the desired data are available locally, the programmer can call a local access routine that returns a pointer to the data and allows for direct load/store access.

Because of their one-sided communication model, GA and other PGAS models are frequently said to provide a get-compute-put model of computation. This name arises because, in contrast with shared-memory systems, data in the global address space are not operated on in place; data must be copied into a local buffer before they can be processed. Once processing has completed, the global address space is updated with the results.
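
A minimal sketch of the get-compute-put pattern using GA's C interface is shown below; the array dimensions, the patch bounds, and the arithmetic performed on the patch are illustrative assumptions, and startup of the GA runtime is assumed to have already been performed elsewhere.

#include "ga.h"
#include "macdecls.h"

/* Get-compute-put on a small patch of a two-dimensional global array.
   Assumes GA_Initialize() (and MPI startup) has already been called. */
void scale_patch(void) {
    int dims[2] = {1000, 1000};
    int g_a = NGA_Create(C_DBL, 2, dims, "A", NULL);   /* default distribution */
    GA_Zero(g_a);

    /* Copy a 4 x 4 patch into a local buffer with a one-sided get. */
    int lo[2] = {0, 0}, hi[2] = {3, 3}, ld[1] = {4};
    double buf[16];
    NGA_Get(g_a, lo, hi, buf, ld);

    for (int i = 0; i < 16; i++)        /* compute on the local copy */
        buf[i] = 2.0 * buf[i] + 1.0;

    NGA_Put(g_a, lo, hi, buf, ld);      /* write the results back */
    GA_Sync();                          /* make the update visible to all */
    GA_Destroy(g_a);
}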

In addition to these features, GA provides a library of high-level mathematical routines that operate on arrays in parallel. Routines include matrix multiplication, diagonalization, and solvers. GA also supports several specialized features for different types of computations, such as ghost cells, mirrored (i.e., replicated) arrays, and disk-resident arrays for out-of-core array processing.

34.4   TASK-PARALLEL PROGRAMMING MODELS

Many applications pose significant challenges to the efficient management of parallelism. In dynamic parallel computations, additional work is uncovered as the computation progresses. This new work must be distributed to workers, and any dependence and data flow must be handled. Irregular workloads arise when the domain of the computation is sparse or irregular in computational intensity. These types of computations can result in load imbalance at runtime, leading to inefficient use of compute resources.

Many parallel programming models have been developed to address the needs of dynamic, irregular, and unbalanced parallel computations. In this section, we describe three models that take a task-parallel approach to expressing parallelism. Task parallelism is generic parallelism that is expressed in terms of individual chunks of work. Tasks from different programs can have a variety of properties: divisible, idempotent, strict dependence, persistence, and so forth. Most task-parallel programming models assume a task model in which tasks have particular properties that are leveraged to enhance performance, expressivity, or fitness to a particular class of application.

34.4.1   Charm++

Charm++ is an object-oriented parallel programming model that extends the C++ programming language with concurrent objects stored in a global object space. Parallelism in Charm++ is expressed through concurrently executing objects called chares. A Charm++ program begins execution with one or more (typically one) main chares, which expose parallelism by creating additional chares. Once created, chares can communicate with each other by using remote method invocation. Unlike conventional models for parallel programming, Charm++ programs are expressed in terms of object-level concurrency rather than the number of processors or cores executing the computation.

Charm++ uses inheritance and other language features to extend C++ with parallelism. Because it works within the C++ standard, it does not require a special compiler and is compatible with existing C++ compilers. To create a Charm++ program, programmers must add a Charm++ interface file to their project specifying which C++ classes will be chares and indicating any entry methods, or methods that can be invoked by other chares. A Charm++ tool is then used to process the interface files and to generate additional source files that must be included in the project.

Invoking an entry method on a chare sends a message to the given object; the method is invoked asynchronously, and the call returns at the callsite as soon as the send has completed. Because of this asynchronous model, entry methods are not permitted to return a value; a return message can be generated by the callee by invoking an entry method on its caller. The Charm++ runtime system, referred to as the charm kernel, maintains a queue of incoming messages. Message forwarding is performed to route messages to the location of the desired object, and messages are processed from the queue as resources become available.

Parallelism in Charm++ can be structured through the use of chare collections. Collections can be used to influence the mapping of chares to processor cores and nodes as well as to express a structure in the set of chares used to define the computation. In addition, chare collections allow the programmer to act on many chares concurrently. The most commonly used chare collection is the chare array, which defines a multidimensional grid of chares that can be constructed to follow the structure of the computation. When a chare array is used, any chare can identify itself and others in the grid by index and perform, for example, neighborhood data exchange.

The charm kernel is also responsible for distributing chares across available resources and for balancing the workload by migrating chares from highly to lightly loaded nodes. Many load balancing strategies have been incorporated into Charm++, including asynchronous and synchronous methods. For many of its load balancing methods, Charm++ collects runtime information about the workload to inform the mapping of chares to computational resources. Because the runtime has control over the mapping of computation to resources, the Charm++ runtime system is also able to provide automated fault tolerance by migrating chares away from faulty nodes and recovering individual chares when they are lost because of failures.

34.4.2   Scioto

Scioto (scalable collections of task objects) [13] is a library extension to programming models with PGAS data spaces that adds support for task parallelism. Currently, Scioto supports the GA PGAS model; however, its task-based programming model was designed to be compatible with any global address space system.

In the Scioto model, the programmer expresses the computation as a collection of tasks that can be executed concurrently. Scioto tasks operate on data that are stored in a global address space (e.g., a global array); task inputs are fetched from and outputs written to globally accessible locations. This approach enables Scioto tasks to be executed by any process in the computation and gives the runtime system the flexibility to perform automatic load balancing. A Scioto task is defined by a callback function that operates on a task descriptor; the descriptor contains a header with task metadata and a user-defined body that holds the task's arguments.

A typical Scioto program begins in the SPMD execution model of the programming model that provides the global address space. The programmer creates a task collection and seeds it with an initial set of tasks. The programmer can select which process each task is assigned to, providing an initial work distribution. Once the task pool has been seeded, all processes participating in the task-parallel computation enter a task-processing phase by collectively calling the Scioto process function. During this phase, processes effectively give up their SPMD ranks and execute global address space tasks. Tasks can be dynamically created and added to the current task collection or to a new task collection that will be processed in a separate phase. The Scioto runtime system automatically detects quiescence of the task-parallel phase when all tasks have completed and returns the program to SPMD mode.

Scioto uses a strict task dependence model. In this model, tasks can create new subtasks, enabling the expression of dynamic parallelism. However, this type of ancestor‒descendant control flow is the only type of task‒task dependence allowed. If a task is not a descendant of another task, it must not depend on that task's execution in order to complete. This restriction allows Scioto to achieve high efficiency in the scheduling and execution of tasks by avoiding overheads associated with time-sharing resources and task migration.

Scioto's runtime system performs automatic, dynamic load balancing of tasks across available computational resources using work stealing [14]. In the Scioto performance model, the computation is made up of work-stealing end points (e.g., GA processes) that contain a local work queue. When a task is created, the programmer can select the work queue in which it should be placed. This approach allows the programmer to make an initial placement of tasks that can be fine-tuned by the Scioto runtime system. In addition, it allows the programmer to push tasks to specific locations where they should be executed. If the programmer would like to explicitly map tasks to end points, work stealing can be disabled while still providing support for dynamic parallelism.

34.4.3   ADLB

ADLB, the asynchronous dynamic load balancing library [15], is a library extension to MPI that adds support for task-parallel execution. ADLB provides a Linda [16]-like shared task space where the application puts work into the logically shared task space and gets work for execution. ADLB tasks contain user-defined data that provides the task's input. Tasks can produce output in the memory of the process that executes the task as well as spawn child tasks. In addition to their input buffer, tasks are tagged with a work type that is used to select matching tasks when a process performs a task get operation. A process can specify multiple work types when requesting work, and it can provide a wildcard if any work type is acceptable.

An ADLB work request is composed of two steps: reservation and retrieval. Work is first reserved, and work metadata are received by the requesting process. This approach allows the worker to allocate space in local memory for the task's input buffer before retrieving it. Once the worker is ready, it performs a get operation that transfers task inputs to the process's local memory. ADLB utilizes a strict task execution model in which tasks are not permitted to communicate with each other. Hence, the input buffer is the primary means for passing input to ADLB tasks. This design choice relaxes the load balancing problem and enables ADLB to achieve better performance.

The ADLB architecture utilizes a master‒worker execution model in which a user-defined subset of processes act as masters by calling the ADLB server function. The remaining processes act as workers, and each associates itself with a particular master that serves as its work manager. Masters communicate with one another to distribute work and assign tasks to workers. When a new task is created, the input data are stored on the originating worker to avoid resource exhaustion on the master. MPI one-sided communication is then used to retrieve input data when the get operation is performed for that task.

34.5  HIGH-PRODUCTIVITY PARALLEL PROGRAMMING MODELS

X10, Chapel, and Fortress are new parallel programming languages that have been developed as a part of the Defense Advanced Research Projects Agency (DARPA) High Productivity Computing Systems (HPCS) program. The goal of HPCS was to develop new high-performance computing systems, including both hardware and software components, that radically increase both performance and productivity. Thus, X10, Chapel, and Fortress take new and holistic approaches to parallel programming with the goals of greatly improving the ease of expressing parallelism and achieving high performance.

34.5.1   X10

X10 [17] is an object-oriented parallel programming language whose development has been led by IBM Research in conjunction with academic partners. The X10 language derives from popular imperative programming languages, especially Java, and focuses on distributed-memory fork‒join parallelism. An X10 execution is composed of multiple places, represented by the places array, each of which encapsulates a data locality domain (e.g., processor or compute node). When aggregated, X10 places form a PGAS for storing data such as distributed arrays.

An X10 program is composed of multiple activities. Execution begins with the main activity, which executes on the first place. Additional parallelism is expressed by spawning concurrent computations as async activities. Activities can be synchronized with a parent or other ancestor activity through a corresponding finish statement. Thus, X10 activities naturally form a tree with the main activity at the root. When a finish statement is encountered in an activity, the activity is suspended, and the next statement is not executed until the corresponding asynchronous activities and any descendant activities have completed. This model of joining activities is referred to as a terminally strict fork‒join model, which encapsulates and extends the fully strict model used by Cilk [18].

X10 provides support for exceptions using Java-like try-catch and throw constructs. Exceptions thrown by asynchronous activities are handled by using a rooted exception model. At any moment in time, every activity is said to have a root in the activity tree, which corresponds to the nearest ancestor that is awaiting termination of the activity. The path from an activity to its root is referred to as the activation path of the activity. When an exception occurs, X10 forwards the exception along the activation path until it is caught, possibly all the way to the main activity at the root of the tree.

An activity executes in the context of a particular place and can access data objects located at that place. When the programmer wishes to migrate the activity or access data in a different place, an at expression is used. A statement such as at(p) { fcn(); } executes the given statements (in this case, the function fcn) at place p. Asynchronous activities can also be spawned at different places by combining async and at expressions, for example, async at(p) {…}.

Activities can be synchronized through several mechanisms: atomic blocks, when clauses, and clocks. Atomic blocks guarantee that the given statement or block of statements executes atomically. A when clause is used to suspend an activity until a particular condition becomes true, for example, when (x > 5) {…}. Clocks can be used to synchronize execution phases across a set of activities. Activities can be registered with one or more clocks when they are created; when an activity completes its phase of the computation, it advances the corresponding clocks to indicate that it is ready to advance to the next stage of the computation. The activity then waits for all other activities registered on the clock to arrive before continuing.

34.5.2   Chapel

Chapel [19] is a multithreaded, object-oriented parallel programming language whose development has been led by Cray Inc. Chapel inherits features and concepts from many existing programming languages and models. One of the most significant relatives of Chapel is the ZPL programming language [20], which has influenced Chapel's array-based parallelism. As a language, Chapel is imperative and block structured, similar to many familiar languages such as C, C++, and Java. Like X10, Chapel also incorporates many popular, modern programming language features, including objects; generic programming; and a strong, static type system. Chapel also provides several novel features, including intents that are used to specify how function arguments will be used (as inputs, outputs, or both); type inference, which allows the programmer to omit type information when it can be inferred statically; and first-class domains, which incorporate index sets for aggregate data structures (e.g., arrays) as language-level entities.

A Chapel program begins execution as a single thread. Additional parallelism is exposed through a variety of task-parallel and data-parallel constructs. A Chapel execution is composed of a set of locales, which is exposed to the programmer through the locales array. A locale is an abstract unit of the system on which the program is running, most commonly a compute node or a processor. Chapel utilizes a global address space, where every data element is associated with the locale on which it resides.

Asynchronous tasks in a Chapel program can be spawned through begin statements, for example, begin { f(); g(); }. If the parent thread wishes to wait for the completion of a task, it synchronizes on begun tasks and their children by wrapping one or more such tasks in a sync block. Chapel also provides another mechanism for spawning tasks: cobegin { f(); g(); }. In a cobegin, each statement in the block is executed concurrently, and execution does not proceed past the cobegin until all tasks have completed.

Threads synchronize with each other using atomic sections or through synchronized variables in Chapel's global address space. An atomic block can be used to execute a block of statements as an indivisible update to memory that executes atomically with respect to the operations of other threads. Similar to the programming model of the Tera MTA [21] computer system, Chapel's synchronized variables are logically full or empty. Full variables become empty when read, and empty variables become full when written to. Reading from an empty location or writing to a full location causes the thread initiating the operation to block until the corresponding location becomes full or empty, respectively. Additional data access operations are provided that can be used to influence whether the initial state of a location is ignored and to set a specific state for a location after a data access operation has completed.

Chapel also provides several data-parallel operations, including forall and coforall loops and parallel reduction and scan operations. The forall construct defines a loop where each iteration of the loop can be executed in parallel; the coforall variation synchronizes completion of all iterations before proceeding to the next statement in the parent thread. Chapel domains play an important role in data-parallel expressions. Because the domain of an array or other aggregated data objects is a first-class object in Chapel, sliced and zippered subsets of the domain can be readily specified to provide the set of data elements that will be operated on by data-parallel operations.

34.5.3  Fortress

Fortress [22] is a new, high-productivity programming language whose development was initiated by Sun Microsystems and has been continued by Oracle. Fortress aims to duplicate the success of Fortran by providing a modern, robust language for expressing the mathematical formulation of technical computing problems. To this end, Fortress programs are encoded in Unicode, which makes it possible to use standard mathematical characters and notations when writing a program. In addition, Fortress incorporates many modern language features, such as objects; exceptions; generic programming; type inference; and built-in aggregate data structures, including sets, arrays, maps, and lists. Object traits, similar to interfaces in Java, are used to extend the functionality supported by a particular object. In contrast to Java's abstract interfaces, however, traits define code for their associated methods.

Both task and data parallelism are provided by Fortress. Many operations, such as whole-array operations, evaluation of function arguments, and for loops, are implicitly parallel and can be executed concurrently by the Fortress runtime system. Iterative and reductive parallelism in Fortress is controlled by a generator, or domain expression, that defines the iteration space. Asynchronous threads can also be created by using a spawn statement.

Fortress provides the programmer with a logical global view of data and allows the user to define the distribution of aggregate data structures, for example, arrays. When an aggregate data structure is used as the generator for iteration, it also defines the locality of each iteration. When synchronized access to shared data is needed, the programmer performs the access within a transactional atomic block.

34.6  SUMMARY AND CONCLUDING REMARKS

With the dramatic increase in the concurrency of large-scale systems, a broad spectrum of parallel programming models has been created. Parallel programming models are distinguished by unique approaches to managing and exchanging data and mechanisms for expressing the units of work that comprise the computation. This chapter has discussed some of the popular parallel programming models used on supercomputers today, including MPI; various PGAS models such as GA, UPC, and CAF; some of the HPCS models such as X10 and Chapel; and task-parallel programming and work-stealing models such as Scioto, Charm++, and ADLB.

ACKNOWLEDGMENT

This work was supported by the Office of Advanced Scientific Computing Research, Office of Science, U.S. Department of Energy, under contract DE-AC02-06CH11357.

REFERENCES

[1] H. Tang and T. Yang, “Optimizing threaded MPI execution on SMP clusters,” in Proceedings of the 15th International Conference on Supercomputing, ICS ’01, pp. 381‒392, New York: ACM, 2001.

[2] R. Butler and E. Lusk, “Monitors, messages, and clusters: The p4 parallel programming system,” Parallel Computing, 20(4): 547‒564, 1994.

[3] W. Gropp and B. Smith, “Users manual for the Chameleon parallel programming tools,” Technical Report ANL-93/23, Argonne National Laboratory, 1993.

[4] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine—A Users' Guide and Tutorial for Networked Parallel Computing. Scientific and Engineering Computation Series. Cambridge, MA: MIT Press, 1994.

[5] MPI Forum, “MPI-2: Extensions to the message-passing interface,” Technical Report, University of Tennessee, Knoxville, 1996.

[6] UPC Consortium, “UPC language specifications, v1.2,” Technical Report LBNL-59208, Lawrence Berkeley National Lab, 2005.

[7] R.W. Numrich and J. Reid, “Co-Array Fortran for parallel programming,” ACM SIGPLAN Fortran Forum, 17(2): 1–31, 1998.

[8] R.W. Numrich and J. Reid, “Co-arrays in the next Fortran standard,” ACM SIGPLAN Fortran Forum, 24: 4–17, 2005.

[9] J. Nieplocha, R.J. Harrison, and R.J. Littlefield, “Global Arrays: A portable 'shared-memory' programming model for distributed memory computers,” in Proceedings of the ACM/IEEE Conference on Supercomputing (SC ’94), pp. 340–349, 1994.

[10] J. Nieplocha, B. Palmer, V. Tipparaju, M. Krishnan, H. Trease, and E. Aprà, “Advances, applications and performance of the Global Arrays shared memory programming toolkit,” International Journal of High Performance Computing Applications, 20(2): 203–231, 2006.

[11] M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T.L. Windus, and W.A. de Jong, “NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations,” Computer Physics Communications, 181(9): 1477‒1489, 2010.

[12] J. Nieplocha and B. Carpenter, “ARMCI: A portable remote memory copy library for distributed array libraries and compiler run-time systems,” Lecture Notes in Computer Science, 1586: 533–546, 1999.

[13] J. Dinan, S. Krishnamoorthy, D.B. Larkins, J. Nieplocha, and P. Sadayappan, “Scioto: A framework for global-view task parallelism,” Proceedings of the 2008 37th International Conference on Parallel Processing, ICPP ’08, 2008.

[14] J. Dinan, D.B. Larkins, P. Sadayappan, S. Krishnamoorthy, and J. Nieplocha, “Scalable work stealing,” Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC ’09, 2009.

[15] E.L. Lusk, S.C. Pieper, and R.M. Butler, “More scalability, less pain: A simple programming model and its implementation for extreme computing,” SciDAC Review, 17: 30–37, 2010.

[16] N. Carriero and D. Gelernter, “Applications experience with Linda,” in Proceedings of the ACM/SIGPLAN Conference on Parallel Programming: Experience with Applications, Languages and Systems, PPEALS ’88, pp. 173–187, New York: ACM, 1988.

[17] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: An object-oriented approach to non-uniform cluster computing,” in Proceedings of the Conference on Object Oriented Programming Systems, Languages, and Applications (OOPSLA ’05), pp. 519–538, 2005.

[18] M. Frigo, C.E. Leiserson, and K.H. Randall, “The implementation of the Cilk-5 multi-threaded language,” in Proceedings of the Conference on Programming Language Design and Implementation (PLDI), pp. 212–223, ACM SIGPLAN, 1998.

[19] B.L. Chamberlain, D. Callahan, and H.P. Zima, “Parallel programmability and the Chapel language,” International Journal of High Performance Computing Applications, 21(3): 291–312, 2007.

[20] B.L. Chamberlain, S.-E. Choi, C. Lewis, C. Lin, L. Snyder, and W.D. Weathersby, “ZPL: A machine independent programming language for parallel computers,” IEEE Transactions on Software Engineering, 26(3): 197–211, 2000.

[21] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, “The Tera computer system,” in Proceedings of the 4th International Conference on Supercomputing, ICS ’90, pp. 1–6, New York: ACM, 1990.

[22] G.L. Steele, Jr., “Parallel programming and parallel abstractions in Fortress,” in Proceedings of the 14th International Conference on Parallel Architecture and Compilation Techniques (PACT ’05), p. 157, 2005.
