Glossary

Advisor Intel® Advisor XE, analysis tool for determining the potential benefits of various approaches to adding parallelism, without having to write, debug, and test code to study each approach.

affinity Specification of methods to associate a particular software thread with a particular hardware thread, usually with the objective of getting better or more predictable performance. Affinity specifications include spreading threads maximally apart to reduce contention, or packing them tightly together (compact) to minimize communication distances. OpenMP supports a rich set of affinity controls at various levels from abstract to full manual control. Fortran 2008 does not specify controls, but Intel reuses the OpenMP controls for “do concurrent.” Intel Threading Building Blocks (TBB) provides an abstract loop-to-loop affinity biasing capability. Intel Cilk™ Plus relies only on fully automatic mechanisms with no user controls or overrides.

aliasing When two distinct program identifiers or expressions refer to overlapping memory locations. For example, if two pointers p and q point to the same location, then p[x] and q[x] are said to alias each other. The potential for aliasing can severely restrict a compiler’s ability to optimize a program, even when there is no actual aliasing.
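
As a minimal sketch of why aliasing matters (function and variable names are illustrative): in the first function below the compiler must assume x and y may overlap, which can inhibit vectorization; the C99 restrict qualifier in the second promises that they do not.

    /* The compiler must assume x and y may overlap. */
    void scale_add(float *x, float *y, int n) {
        for (int i = 0; i < n; i++)
            y[i] += 2.0f * x[i];
    }

    /* restrict promises no aliasing, so the loop can be optimized freely. */
    void scale_add_fast(float *restrict x, float *restrict y, int n) {
        for (int i = 0; i < n; i++)
            y[i] += 2.0f * x[i];
    }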

Amdahl’s law Speedup is limited by the nonparallelizable serial portion of the work. For example, in a program where two thirds of the work can be run in parallel but one third cannot, speedup can only approach 3X and never exceed it, assuming the same work is done. If scaling the problem size places more demands on the parallel portions of the program, then Amdahl’s law is not as bad as it may seem; see Gustafson-Barsis’ law.
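
Expressed as a formula (with s the serial fraction of the work and n the number of processing units), the bound is:

    speedup(n) <= 1 / (s + (1 - s)/n)

which approaches 1/s as n grows; with s = 1/3 the limit is 3X.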

Amplifier See VTune.

application programming interface (API) An interface (set of function calls, operators, variables, and/or classes) through which an application developer uses a module. The implementation details of a module are ideally hidden from the application developer, and the functionality is defined only through the API.

atomic operation An operation that is guaranteed to appear as if it occurred indivisibly without interference from other threads. For example, a processor might provide a memory increment operation. This operation needs to read a value from memory, increment it, and write it back to memory. An atomic increment guarantees that the final memory value is the same as would have occurred if no other operations on that memory location were allowed to happen between the read and the write.
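
A minimal sketch using C++11 atomics (names illustrative):

    #include <atomic>

    std::atomic<int> counter{0};

    void worker() {
        // fetch_add performs the read-increment-write indivisibly;
        // no other thread can interleave between the read and the write.
        counter.fetch_add(1);
    }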

automatic offload (AO) A generic library feature that automatically redirects some computation to use a specialty device for data parallelism, such as a coprocessor (MIC). AO is supported by the Intel® Math Kernel Library (Intel MKL). When desired, AO can also be controlled more finely with more complex parameters through additional options within Intel MKL. A key concept is that offloading will occur when it will benefit the computation. The other key concept in AO is that the computation will be performed by processor(s) if offloading is not available. Put another way, if you ignore all AO-related extensions in the program, it will do the same computation but without use of the coprocessor(s). See Chapter 11.

bandwidth The rate at which information is transferred, either from memory or over a communications channel. This term is used when the process being measured can be given a frequency-domain interpretation. When applied to computation, it can be seen as being equivalent to throughput.

barrier When a computation is broken into phases, it is often necessary to ensure that all threads complete all the work in one phase before any thread moves onto another phase. A barrier is a form of synchronization that ensures this: threads arriving at a barrier wait there until the last thread arrives, then all threads continue. A barrier can be implemented using atomic operations. For example, all threads might try to increment a shared variable, then block if the value of that variable does not equal the number of threads that need to synchronize at the barrier. The last thread to arrive can then reset the barrier to zero and release all the blocked threads.
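
A single-use spin barrier along the lines just described can be sketched with C++11 atomics (illustrative only; a reusable barrier needs additional care):

    #include <atomic>

    std::atomic<int> arrived{0};
    std::atomic<bool> released{false};

    void barrier_wait(int nthreads) {
        // The last thread to arrive releases all the others.
        if (arrived.fetch_add(1) + 1 == nthreads)
            released.store(true);
        else
            while (!released.load()) { /* spin until the last thread arrives */ }
    }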

Basic Linear Algebra Subprograms See BLAS.

Berkeley Lab Checkpoint Restore See BLCR.

bitwise copyable A characteristic of a data structure that allows a simple bit-by-bit copy (sometimes called a “shallow” copy) operation to work properly; the closely corresponding term in the C++ standard is “trivially copyable.” A bitwise copyable data structure will not contain pointers and does not invoke constructors or destructors.

BLAS The BLAS (Basic Linear Algebra Subprograms) are routines that provide standard building blocks for basic vector and matrix operations. The Level 1 BLAS perform scalar, vector, and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high-quality linear algebra software such as LAPACK. A sophisticated and generic implementation of BLAS has been maintained for decades at http://netlib.org/blas. Vendor-specific implementations of BLAS are common, including the Intel Math Kernel Library (Intel MKL), a highly efficient version of BLAS and other standard routines for Intel architecture.

BLCR Berkeley Lab Checkpoint Restore allows one or more processes to be saved to a file and later restarted from that file. This can be used for scheduling, process migration, or failure recovery. The latter is often considered very important for jobs that run for multiple days or weeks: checkpointing periodically limits how much time is lost in the event of a system shutdown for any reason. BLCR is supported by the Intel MPSS for Intel Xeon Phi coprocessors.

block Block can be used in two senses: (1) a state in which a thread is unable to proceed while it waits for some synchronization event, or (2) a region of memory. The second meaning is also used in the sense of dividing a loop into a set of parallel tasks of a suitable granularity. To avoid confusion in this book the term tile is generally used for the second meaning, and likewise the term tiling is preferred over blocking.

C++ Composer Intel® C++ Composer XE, Intel C/C++ Compiler plus libraries. Supports both processors and coprocessors.

C-state Core idle state, a power savings capability on processors and coprocessors; the tradeoff is lower power but higher latency before the next real work can begin. Deeper sleep states offer lower power but take longer to revive to full performance.

cache A part of the memory system that temporarily stores copies of data in a fast memory so that future requests for that data can be handled more quickly than if it had to be fetched again from more distant storage. Caches are generally automatic and are designed to enhance programs with temporal locality and/or spatial locality. Caching systems in modern computers are usually multileveled.

cache line The units in which data are retrieved and held by a cache, which in order to exploit spatial locality are generally larger than a word. The general trend is for increasing cache line sizes, which are generally large enough to hold at least two double-precision floating-point numbers, but unlikely to hold more than eight on any current design. Larger cache lines allow for more efficient bulk transfers from main memory but worsen certain issues including false sharing, which generally degrades performance.

CCL See Coprocessor Communication Link.

Cilk Plus Intel® Cilk™ Plus, a parallel programming model for C and C++ with support for task, data, and vector parallelism. Cilk Plus is an open specification from Intel, based on decades of research and publications from M.I.T. The Cilk Plus specification has been implemented by the Intel compilers for Windows, Linux, and OS X, as well as by an experimental branch of the GNU C++ compiler.

cluster A set of computers with distributed memory communicating over a high-speed interconnect. The individual computers are often called nodes.

Cluster Ready A compliance program for cluster systems and cluster software to reduce cost through higher degrees of hardware and software compliance to a defined set of APIs and configurations.

Cluster Studio Intel® Cluster Studio XE, suite of tools from Intel consisting of Intel® Parallel Studio XE plus Intel tools for MPI including Intel® MPI Library and Intel® Trace Analyzer and Collector. Supports both processors and coprocessors.

COI See Coprocessor Offload Infrastructure.

communication Any exchange of data or synchronization between software tasks or threads. Understanding that communication costs are often a limiting factor in scaling is a critical concept for parallel programming.

composability The ability to use two components in concert with each other without causing failure or unreasonable conflict (ideally no conflict). Limitations on composability, if they exist, are best diagnosed completely at build time rather than left to testing; composability problems that manifest only at runtime are the biggest problem with non-composable systems. Can refer to system features, programming models, or software components.

Composer Intel compilers plus libraries.

concurrent Logically happening simultaneously. Two tasks that are both logically active at some point in time are considered to be concurrent. Contrast with parallel.

coprocessor A separate processor, often on an add-in card (such as a PCIe card), usually with its own physical memory, which may or may not be in a separate address space from the host processor. Often also known as an accelerator (although it may only accelerate specific workloads). An Intel Xeon Phi coprocessor is a computing device that cannot be the only computing device in a system design; in other words, it also requires a processor in the system design.

Coprocessor Communication Link Intel® Xeon Phi™ Coprocessor Communication Link (CCL).

Coprocessor Offload Infrastructure Intel® Coprocessor Offload Infrastructure (COI), a middleware layer written by Intel with an API that supports the asynchronous delivery and management of code and data buffers between an Intel Xeon® host processor and Intel Xeon Phi coprocessor(s). COI is primarily targeted at providing programmatic control for development tools and higher level interfaces such as compiler runtimes, system management tools, and OpenCL. Some applications may benefit from directly using the finer control COI provides, but will likely require a greater development investment than using a compiler or other runtime. COI is an advanced topic, generally not of use to application developers, and is not discussed in this book.

core A separate sub-processor on a multicore processor. A core should be able to support (at least one) separate and divergent flow of control from other cores on the same processor. Note: there is some inconsistency in the use of this term. For example, some graphic processor vendors use the term as well for SIMD lanes supporting fibers. However, the separate flows of control in fibers are simulated with masking on these devices, so there is a performance penalty for divergence. We will restrict the use of the term core to the case where control flow divergence can be done without penalty.

DAPL Direct Access Programming Library is a transport-independent, platform-independent, high-performance API for accessing the remote direct memory access (RDMA) capabilities of interconnects. The Intel MPI library provides high performance support for many interconnects by hooking into their DAPL API.

data parallelism An attempt to classify parallelism as more oriented around data than tasks. In practice, successful strategies in parallel algorithm development tend to focus on exploiting the parallelism in data, because data decomposition (generating tasks for different units of data) scales, but functional decomposition (generation of heterogeneous tasks for different functions) does not. See Amdahl’s law and Gustafson-Barsis’ law.

deadlock A programming error. Deadlock occurs when at least two tasks wait for each other and each will not resume until the other task proceeds. This happens easily when code requires locking multiple mutexes. For example, each task can be holding a mutex required by the other task.

deterministic A deterministic algorithm is an algorithm that behaves predictably. Given a particular input, a deterministic algorithm will always produce the same output. The definition of what is the “same” may be important due to limited precision in mathematical operations and the likelihood that optimizations, including parallelization, will rearrange the order of operations. These are often referred to as “rounding” differences, which result when the order of mathematical operations used to compute an answer differs between the original program and the final concurrent program. Concurrency is not the only factor that can lead to nondeterministic algorithms, but in practice it is often the cause. Use of programming models with sequential semantics and elimination of data races with proper access controls will generally eliminate the major effects of concurrency other than the “rounding” differences.

Direct Access Programming Library See DAPL.

distributed memory Memory that is physically located in separate computers. An indirect interface, such as message passing, is required to access memory on remote computers, while local memory can be accessed directly. Distributed memory is typically supported by clusters which, for purposes of this definition, we are considering to be a collection of computers. Since the memory on attached coprocessors also cannot typically be addressed directly from the host, it can be considered, for functional purposes, to be a form of distributed memory.

ECC Error Correction Code, a method to increase reliability by correcting transient errors on a device. Used extensively on Intel Xeon processors and Intel Xeon Phi coprocessors to offer high degrees of reliability.

embarrassing parallelism An algorithm has embarrassing parallelism if it can be decomposed into a large number of independent tasks with little or no synchronization or communication required.

EMON Event monitoring: counting of events such as cache misses on a processor or coprocessor. See Chapter 13.

ETC Elapsed Time Counter. The default clock source on the coprocessor is micetc. The micetc clock source is compensated for power management events, making it a very stable clock source. See Chapter 13.

false sharing Two separate tasks on two separate cores may write to separate locations in memory, but if those memory locations happen to be allocated in the same cache line, the cache coherence hardware will attempt to keep the cache lines coherent, resulting in extra interprocessor communication and reduced performance, even though the tasks are not actually sharing data.
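
A common mitigation is to align or pad per-thread data so that each item occupies its own cache line; a sketch assuming 64-byte cache lines:

    // Both counters share one cache line: writes by two threads
    // cause the line to ping-pong between cores.
    struct Shared { long a; long b; };

    // Each counter gets its own 64-byte cache line; no false sharing.
    struct Padded {
        alignas(64) long a;
        alignas(64) long b;
    };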

FMA Fused Multiply and Add, a capability to request a multiply and an add operation in one instruction while maintaining precision, thereby potentially doubling the computational throughput of a device.

Fortran Programming language used primarily for scientific and engineering problem solving. Originally spelled FORTRAN as an abbreviation for FORmula TRANslation.

Fortran Composer Intel® Fortran Composer XE, Intel Fortran Compiler plus libraries. Supports both processors and coprocessors.

forward scaling The concept of making a program or algorithm scalable in threads and/or vectors now, so that it is ready to take advantage of growth in parallelism in future hardware with a simple recompile with a new compiler or a relink to a new library. Using the right abstractions to express parallelism is normally the key to enabling forward scaling when writing a parallel program.

future-proofed A computer program written in a manner so it will survive future computer architecture changes without requiring significant changes to the program itself. Generally, the more abstract a programming method is, the more future-proof that program is. Lower level programming methods that in some way mirror computer architectural details will be less able to survive the future without change. Writing in an abstract, more future-proof fashion may involve tradeoffs in efficiency, however.

gather Gather-scatter is a type of memory access pattern that often arises when addressing vectors in sparse linear algebra operations. A gather utilizes indexed reads and a scatter utilizes indexed writes. Special vector instructions provide gather-scatter operations to assist.
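
In loop form, gather and scatter look like the following sketch (array and function names are illustrative):

    void gather(float *y, const float *x, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            y[i] = x[idx[i]];     /* gather: indexed reads */
    }

    void scatter(float *x, const float *y, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            x[idx[i]] = y[i];     /* scatter: indexed writes */
    }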

GDDR5 Graphics Double Data Rate, version 5, the memory type used by Intel Xeon Phi coprocessors. Use of ECC is not directly supported by GDDR5, and activation of ECC (at coprocessor boot time) requires 12.5 percent of memory be used for error correction bits but with only a small performance overhead.

Gustafson-Barsis’ law A different view on Amdahl’s law that takes into account the fact that as problem sizes grow, the serial portion of a computation tends to shrink as a percentage of the total work to be done. Compare with other attempts to characterize the bounds of parallelism: Amdahl’s law and span complexity.

hardware thread A hardware implementation of a task with a separate flow of control. Multiple hardware threads can be implemented using multiple cores, or can run concurrently or simultaneously on one core in order to hide latency using methods such as hyper-threading of a processor core. Intel Xeon Phi coprocessors do have four hardware threads per core, and they are not hyper-threads.

host processor The main control processor in a system, as opposed to any graphics processors or coprocessors. The host processor is responsible for booting and running the operating system.

HPC High performance computing. HPC refers to the highest performance computing available at a point in time, which today generally means at least a teraFLOP/s of computational capability. The term HPC is occasionally used as a synonym for supercomputing, although supercomputing is probably more specific to even higher performance systems (today, at least ten teraFLOP/s). While the use of HPC is spreading to more industries, it is generally associated with solution of the most challenging problems in science and engineering.

hyperobjects A mechanism in Cilk Plus to support operations such as reduction that combine multiple objects.
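
For example, a reducer hyperobject gives each parallel strand a private view of a variable, and the views are combined deterministically at the end; a sketch using the Cilk Plus reducer library (assuming the Intel compiler and its headers):

    #include <cilk/cilk.h>
    #include <cilk/reducer_opadd.h>

    long parallel_sum(const long *a, int n) {
        cilk::reducer_opadd<long> sum(0);
        cilk_for (int i = 0; i < n; i++)
            sum += a[i];            // each strand updates its own view; no race
        return sum.get_value();     // views are combined into the final total
    }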

Hyper-threading Multithreading on a single processor core with the purpose of more fully utilizing the functional units in an out-of-order core by bringing together instructions from more than one software thread. With hyper-threading, multiple hardware threads may run on one core and share resources, but some benefit is still obtained from parallelism or concurrency. Typically each hyper-thread has, at least, its own register file and program counter, so that switching between hyper-threads is relatively lightweight. Intel Xeon Phi coprocessors do have four threads per core, but they are not hyper-threads as they are utilized with an in-order core to hide latencies inherent in an in-order design.

IA A commonly used abbreviation for Intel architecture, also referred to as x86 in reference to Intel’s original 8088 and 8086 processors that implemented Intel architecture.

IMCI Intel® Initial Many Core Instructions is the official name for the new instructions available in the Intel® Xeon Phi™ coprocessor codenamed Knights Corner. The Intel® Xeon Phi™ coprocessor codenamed Knights Corner is the first device to offer 512-bit wide SIMD instructions. Intel has published a disclaimer that it does not guarantee these will be supported in future processors. One could guess that feedback will be critical to their future.

inlining An optimization that replaces a call to a subroutine or function with the actual code from the subroutine or function. This can be done by a compiler or by hand in the source code. Inlining improves performance in two ways: (1) it removes the call overhead, and (2) it enables optimizations by bringing code together instead of leaving it separated by a call. These benefits have to be weighed against the disadvantage of increasing the size of the program; some sophisticated compilers can do partial inlining to somewhat address this.

Inspector Intel® Inspector XE, analysis tool specializing in finding threading and memory related errors in a program. Can detect latent threading bugs (ones that are not causing program failure).

intrinsics Intrinsics appear to be functions in a language, but are supported by the compiler directly. In the case of SSE or vector intrinsics, the intrinsic function may map directly to a small number, often one, of machine instructions, which the compiler inserts without the overhead of a real function call.
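
For example, an SSE intrinsic add that maps to a single SIMD instruction (a minimal sketch):

    #include <xmmintrin.h>

    void add4(const float *a, const float *b, float *c) {
        __m128 va = _mm_loadu_ps(a);            /* load four floats */
        __m128 vb = _mm_loadu_ps(b);
        _mm_storeu_ps(c, _mm_add_ps(va, vb));   /* one SIMD add of four lanes */
    }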

ISA Instruction Set Architecture.

lambda function A lambda function, for programmers, is an anonymous function. Long a staple of languages such as LISP, it was only recently supported for C++ per the C++11 standard. A lambda function enables a fragment of code to be passed to another function without having to write a separate named function or functor. This ability is particularly handy for using TBB.
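
For example, passing a loop body to TBB’s parallel_for as a lambda (a sketch assuming the TBB headers are available):

    #include <tbb/parallel_for.h>

    void scale(float *a, size_t n, float s) {
        tbb::parallel_for(size_t(0), n, [=](size_t i) {
            a[i] *= s;   // the lambda captures a and s; no named functor needed
        });
    }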

lane An element of a SIMD register file and associated functional unit, considered as a unit of hardware for performing parallel computation. SIMD instructions execute computations across multiple lanes.

Language Extensions for Offloading (LEO) A feature of the Intel compiler to specify offload regions and data movement; it predated the standardization of target directives by OpenMP. LEO is a pragma-based (directive) compiler feature that allows a program to select computations to offload to coprocessors such as the Intel Xeon Phi coprocessors and that assists in the data movement between processor and coprocessor memories. This capability is available to software developers through several extensions to the C, C++, and Fortran languages, supported by the Intel compilers, including unofficial OpenMP extensions (see Chapter 7). A key concept in LEO is that the computation will be performed by host processor(s) if offloading is not available. Put another way, if you ignore all LEO-related extensions in the program, it will do the same computation but without use of the coprocessor(s). LEO will be rendered obsolete by OpenMP target directives; one can expect a migration path and legacy support.
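
A minimal offload sketch in the LEO style (array names illustrative):

    void double_it(const float *a, float *b, int n) {
        // Runs on a coprocessor when one is available;
        // otherwise the same loop runs on the host.
        #pragma offload target(mic) in(a : length(n)) out(b : length(n))
        for (int i = 0; i < n; i++)
            b[i] = 2.0f * a[i];
    }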

latency The time it takes to complete a task; that is, the time between when the task begins and when it ends. Latency has units of time. The scale can be anywhere from nanoseconds to days. Lower latency is better in general.

latency hiding Latency hiding schedules computations on a processing element while other tasks using that core are waiting for long-latency operations to complete, such as memory or disk transfers. The latency is not actually hidden, since each task still takes the same time to complete, but more tasks can be completed in a given time since resources are shared more efficiently, so throughput is improved.

load balancing Assigning tasks to resources while handling uneven sizes of tasks.

locality Locality refers to utilizing memory locations that are closer together rather than further apart. This will maximize reuse of cache lines, memory pages, and so on. Maintaining a high degree of locality of reference is a key to scaling.

lock A mechanism for implementing mutual exclusion. Before entering a mutual exclusion region, a thread must first try to acquire a lock on that region. If the lock has already been acquired by another thread, the current thread must block, which it may do by either suspending operation or spinning. When the lock is released, then the current thread is free to acquire it. Locks can be implemented using atomic operations, which are themselves a form of mutual exclusion on basic operations, implemented in hardware.
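
In C++11, a mutex with a scoped guard is the idiomatic form (a minimal sketch):

    #include <mutex>

    std::mutex m;
    long balance = 0;

    void deposit(long amount) {
        std::lock_guard<std::mutex> guard(m);  // acquire m; released on scope exit
        balance += amount;                     // the mutual exclusion region
    }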

many-core processor A multicore processor with so many cores that in practice we do not enumerate them; there are just “lots.” The term has been generally used with processors with 32 or more cores, but there is no precise definition.

Many-core Platform Software Stack Intel® Many-core Platform Software Stack (MPSS), the stack of software supplied by Intel in binaries and open source to support the Intel Xeon Phi coprocessor. It includes drivers, Intel COI, SCIF, plus mods for Linux OS, gcc, and gdb. Available freely from http://intel.com/software/mic.

Math Kernel Library Intel® Math Kernel Library (Intel MKL), includes numerous routines to provide a high level of performance from this hand-optimized library. Intel MKL includes highly vectorized and threaded linear algebra, fast Fourier transforms (FFTs), vector math and statistics functions. Through a single C or Fortran API call, these functions automatically scale across previous, current, and future processor architectures by selecting the best code path for each. See Chapter 11.

megahertz era A historical period of time during which processors doubled clock rates at a rate similar to the doubling of transistors in a design, roughly every two years. Such rapid rise in processor clock speeds ceased at just under 4 GHz (four thousand megahertz) in 2004. Designs shifted toward adding more cores, marking the shift to the multicore era.

memory hierarchy See memory subsystem.

memory subsystem The portion of a computer system responsible for moving code and data between the main system memory and the computational units. The memory subsystem may include additional connections to I/O devices including graphics cards, disk drives, and network interfaces. A modern memory subsystem will generally have many levels including some levels of caching both on and off the processor die. Coherent memory subsystems, which are used in most computers, provide for a single view of the contents of the main system memory despite temporary copies in caches and concurrency in the system.

Message Passing Interface (MPI) An industry-standard approach to distributed computing. Can be used between and within multicore processor and MIC coprocessor network nodes. Can be used for native, offload, and reverse offload approaches.

MIC Stands for “Intel Many Integrated Core Architecture.” Architecture from Intel designed for highly parallel workloads. The architecture emphasizes higher core counts on a single die, and simpler more efficient cores, than on a traditional CPU. See also Xeon Phi.

MIC Elapsed Time Counter (micetc or ETC) See ETC.

MKL See Math Kernel Library.

MPI See Message Passing Interface.

MPSS See Many-core Platform Software Stack.

multicore A processor with multiple sub-processors, each sub-processor (known as a core) supporting at least one hardware thread.

multicore era Time after which processor designs shifted away from rapidly rising clock rates and shifted toward adding more cores. This era began roughly in 2005.

mutual exclusion A mechanism for protecting a set of data values so that while they are manipulated by one parallel thread, they cannot be manipulated by another.

MYO A software shared memory capability supplied by Intel for use with Intel Xeon Phi coprocessors. Stands for Mine-Yours-Ours in recognition of the three states in which data can be shared between the host and a coprocessor. In MYO, data can have the same virtual address on both host and coprocessors so that pointers are valid on both and can be exchanged. This is not normally the case for data since the memories are not shared. The Intel compiler supports MYO through the _Cilk_Shared type as part of Intel Cilk Plus support for Intel Xeon Phi coprocessors.

node (in a cluster) A shared memory computer, often on a single board with multiple processors, that is connected with other nodes to form a cluster computer or supercomputer.

nondeterministic Exhibiting a lack of deterministic behavior, so results can vary from run to run of an algorithm. Concurrency is not the only factor that can lead to nondeterministic algorithms but in practice it is often the cause. See more in the definition for deterministic.

NTS Stands for “No Thermal Solution” referring to coprocessors shipped by Intel where the OEM is responsible for installing a thermal solution to keep the coprocessor cooled. Contrasts with actively cooled (fan) and passively cooled (heat sink) solutions that may be available on some coprocessor models.

OFA The Open Fabrics Alliance develops, tests, licenses, supports, and distributes OpenFabrics Enterprise Distribution (OFED) open source software to deliver high-efficiency computing, wire-speed messaging, ultra-low microsecond latencies, and fast I/O. The Alliance seeks to deliver a unified, cross-platform, transport-independent software stack for RDMA and kernel bypass so that users can utilize the same OpenFabrics RDMA and kernel bypass API and run their applications agnostically over various interconnects.

OFED Open Fabrics Enterprise Distribution. See OFA.

offload Placing part of a computation on an attached device such as a GPU or coprocessor. See also LEO, OpenMP and OpenACC.

OpenACC A specification formulated by four companies (NVIDIA, PGI, Cray, and CAPS) to provide pragma-based offload for NVIDIA GPUs. OpenACC will be rendered obsolete by OpenMP target directives. One can expect a migration path and legacy support.

OpenCL Open Computing Language, initiated by Apple, OpenCL is now a standard defined by the Khronos group for graphics processors and coprocessors. However, OpenCL can also be used to specify parallel and vectorized computations on multicore host processors. Supported by the Intel® SDK for OpenCL Applications.

OpenMP An API that supports multi-platform shared memory multiprocessing programming in C, C++, and Fortran, on most processor architectures and operating systems. It is made up of a set of compiler directives, library routines, and environment variables that influence runtime behavior. OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board and is jointly defined by a group of major computer hardware and software vendors (http://openmp.org). See Chapter 6.
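
For example, a loop parallelized and reduced with OpenMP directives:

    double sum_array(const double *a, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += a[i];   /* each thread accumulates a private partial sum;
                              the partials are combined at the end of the loop */
        return sum;
    }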

page The granularity at which virtual to physical address mapping is done. Within a page, the mapping of virtual to physical memory addresses is contiguous.

parallel Physically happening simultaneously. Two tasks that are both actually doing work at some point in time are considered to be operating in parallel. When a distinction is made between concurrent and parallel, the key is whether work can ever be done simultaneously. Multiplexing of a single processor core, by multitasking operating systems, has allowed concurrency for decades even when simultaneous execution was impossible because there was only one processing core.

Parallel Studio Intel® Parallel Studio XE, suite of tools for node level programming (no MPI support included). Consists of C/C++ and Fortran compilers, libraries, debugging and analysis tools. Supports both processors and coprocessors.

parallelism Doing more than one thing at a time. Attempts to classify types of parallelism are numerous.

parallelization The act of transforming code to enable simultaneous activities. The parallelization of a program allows at least parts of it to execute in parallel.

peel loop A loop, usually compiler generated, created to go before a highly efficient (main) loop to set up conditions needed for the efficient loop. This is commonly needed when the efficient loop assumes N aligned elements per iteration, usually for vectorization, and the peel loop has to do any iterations that precede the required alignment. See also remainder loop.

PMON Performance monitoring. See Chapter 13.

PMU Performance Monitoring Unit, programmable portion of Intel Xeon Phi coprocessor for monitoring performance counters. See Chapter 13.

pragma A pragma is used to give a hint to a compiler, but not to change the semantics of a program. OpenMP compiler directives are expressed as pragmas in C and C++. Cilk Plus includes some pragmas in its definition. Also called a “compiler directive.”

process An application-level unit of parallel work. A process has its own thread of control and is managed by the operating system. Usually, unless special provisions are made for shared memory, a process cannot access the memory of another process.

race condition Nondeterministic behavior in a parallel program that is generally a programming error. A race condition occurs when concurrent tasks perform operations on the same memory location without proper synchronization, and one of the memory operations is a write. Code with a race may operate correctly sometimes, and fail other times.
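
The canonical example is an unsynchronized read-modify-write (a sketch; see atomic operation and lock for ways to fix it):

    int counter = 0;   /* shared by multiple threads */

    void worker(void) {
        /* counter++ is really a read, an increment, and a write;
           two threads can interleave these steps and lose an update. */
        for (int i = 0; i < 1000; i++)
            counter++;
    }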

recursion Recursion is the act of a function being reentered while an instance of the function is still active in the same thread of execution. In the simplest and most common case, a function directly calls itself, although recursion can also occur between multiple functions. Recursion is supported by storing the state for the continuations of partially completed functions in dynamically allocated memory, such as on a stack, although if higher-order functions are supported a more complex memory allocation scheme may be required. Bounding the amount of recursion can be important to prevent excessive use of memory.

relaxed sequential semantics See sequential semantics for an explanation.

remainder loop A loop, usually compiler generated, created to go after a highly efficient (main) loop to clean up any remaining iterations that did not fit within the scope of the efficient loop. This is commonly needed when the efficient loop assumes N elements per iteration, usually for vectorization, and the remainder loop has to finish the fewer than N iterations that are left over. See also peel loop.
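
Conceptually, the compiler may split one loop into three; a hand-written sketch of the generated structure, where N is the vector width and is_aligned() and vadd_N() are stand-ins for the compiler’s alignment test and SIMD code:

    /* Original: for (i = 0; i < n; i++) a[i] += b[i]; */
    int i = 0;
    for (; i < n && !is_aligned(&a[i]); i++)   /* peel loop: reach alignment */
        a[i] += b[i];
    for (; i + N <= n; i += N)                 /* main loop: N elements at a time */
        vadd_N(&a[i], &b[i]);
    for (; i < n; i++)                         /* remainder loop: leftovers */
        a[i] += b[i];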

Reverse offload (RO) The concept of running a program on a coprocessor and offloading work to Intel Xeon processor(s). It is a concept only, and not supported directly by the Intel development tools (although programs can be written to do this with some manual effort).

ring level Protection rings are mechanisms to protect data and programs from faults and malicious attacks. Intel Architecture has a four-ring capability, which is generally used simply as user (ring 3, most restricted) and kernel (ring 0, least restricted).

SAE Suppress All Exceptions: the Intel® Xeon Phi™ coprocessor introduces the SAE attribute feature. An instruction with SAE set will not raise any kind of floating-point exception flags, independent of the inputs.

scalability Scalability is a measure of the increase in performance as a function of the availability of more hardware to use in parallel.

scalable An application is scalable if its performance increases when additional parallel hardware resources are added. See scalability.

scatter See gather.

SCI See Symmetric Communications Interface.

SCIF See Symmetric Communications Interface.

sequential consistency Sequential consistency is a relaxed memory consistency model in which it is assumed that every task in a concurrent system should see memory writes (updates) in the exact order issued by the originating task, but that knowledge of the relative ordering of writes issued by multiple tasks, or among any reads, is unimportant. If such ordering is important, further program control is required to ensure it. Strict consistency models, although generally considered impractical, require that all tasks observe the activities of all tasks in the order they actually occurred in real time.

sequential semantics Sequential semantics means that a (parallel) program can be executed using a single thread of control as an ordinary sequential program without changing the semantics of the program. Parallel programming with sequential semantics has many advantages over programming in a manner that precludes serial execution, and is therefore strongly encouraged. Such programs are considered easier to understand, easier to debug, more efficient on sequential machines and better at supporting nested parallelism. Sequential semantics casts parallelism as an accelerator and not as mandatory for correctness. This means that one does not need a conceptual parallel model to understand or execute a program with sequential semantics. Examples of mandatory parallelism include producer-consumer relationships with bounded buffers (hence the producer cannot necessarily be completely executed before the consumer because the producer can become blocked) and message passing (for example, MPI) programs with cyclic message passing. Due to timing, precision, and other sources of inexactness the results of a sequential execution may differ from the concurrent invocation of the same program. Sequential semantics solely means that any such variation is not due to the semantics of the program. The term “relaxed sequential semantics” is sometimes used to explicitly acknowledge the variations possible due to non-semantic differences in serial vs. concurrent executions.

serial Neither concurrent nor parallel.

serial elision The serial elision of a Cilk Plus program is generated by erasing occurrences of the cilk_spawn and cilk_sync keywords and replacing cilk_for with for. Cilk Plus is a faithful extension of C/C++ in the sense that the serial elision of any Cilk Plus program is both a serial C/C++ program and a semantically valid implementation of the Cilk Plus program. The term elision arose from earlier versions of Cilk that lacked cilk_for, and hence eliding (omitting) the two other keywords sufficed. The term “C elision” is sometimes used too, harking back to when Cilk was an extension of C but not C++.
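
For example, the classic Cilk Plus Fibonacci and its serial elision (a standard illustration, compiled with a Cilk Plus-capable compiler):

    #include <cilk/cilk.h>

    int fib(int n) {                     /* Cilk Plus version */
        if (n < 2) return n;
        int x = cilk_spawn fib(n - 1);   /* may run in parallel with next line */
        int y = fib(n - 2);
        cilk_sync;                       /* wait for the spawned call */
        return x + y;
    }

    int fib_elision(int n) {             /* serial elision: keywords erased */
        if (n < 2) return n;
        int x = fib_elision(n - 1);
        int y = fib_elision(n - 2);
        return x + y;
    }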

serial illusion The apparent serial execution order of machine language instructions in a computer. In fact, hardware is naturally parallel and many low-level optimizations and high-performance implementation techniques can reorder operations.

serial semantics Same as sequential semantics.

serial traps A serial trap is a programming construct that semantically requires serial execution for proper results in general even though common cases may be over-constrained with regards to concurrency by such semantics. The term “trap” acknowledges how such constructs can easily escape attention as barriers to parallelism, in part because they are so common and were not intentionally designed to preclude parallelism. For instance, for, in the C language, has semantics that dictate the order of iterations by allowing an iteration to assume that all prior iterations have been executed. Many loops do not rely upon side effects of prior iterations, and would be natural candidates for parallel execution, but require analysis in order for a system to determine that parallel execution would not violate the program semantics. Use of cilk_for, for instance, has no such serial semantic and therefore is not a serial trap.

serialization When the tasks in a potentially parallel algorithm are executed in a specific serial order, typically due to resource constraints. The opposite of parallelization.

shared address space Even if units of parallel work do not share a physical memory, they may agree on conventions that allow a single unified set of addresses to be used. For example, one range of addresses could refer to memory on the host, while another range could refer to memory on a specific coprocessor. The use of unified addresses simplifies memory management.

shared memory When two units of parallel work can access data in the same location. Normally doing this safely requires synchronization. The units of parallel work, processes, threads, tasks, and fibers can all share data this way, if the physical memory system allows it. However, processes do not share memory by default and special calls to the operating system are required to set it up.

SIMD Single-instruction-multiple-data referring to the ability to process multiple pieces of data (such as elements of an array) with all the same operation. SIMD is a computer architecture within a widely used classification system known as Flynn’s taxonomy, first proposed in 1966.

software thread A software thread is a virtual hardware thread; in other words, a single flow of execution in software intended to map one for one to a hardware thread. An operating system typically enables many more software threads to exist than there are actual hardware threads, by mapping software threads to hardware threads as necessary.

spatial locality Nearby when measured in terms of distance (in memory address). Compare with temporal locality. Spatial locality refers to a program behavior where the use of one data element indicates that data nearby, often the next data element, will probably be used soon. Algorithms exhibiting good spatial locality in data usage can benefit from cache line structures and prefetching hardware, both common components in modern computers.

spawn Generically, the creation of a new task. In terms of Cilk Plus, cilk_spawn creates a spawn, but the new task created is actually the continuation and not the statement that is the target of the spawn keyword.

spawning block The function, try block, or cilk_for body that contains the spawn. A sync (cilk_sync) waits only for spawns that have occurred in the same spawning block; it has no effect on spawns done by other tasks or threads, nor on those done prior to entering the current spawning block. A sync is always done, if there have been spawns, when exiting the enclosing spawning block.

speedup Speedup is the ratio between the latency for solving a problem with one processing unit versus the latency for solving the same problem with multiple processing units in parallel.
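
Expressed as a formula, with T(1) the latency on one processing unit and T(P) the latency on P units:

    speedup = T(1) / T(P)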

SPMD Single-program-multiple-data referring to the ability to process multiple pieces of data (such as elements of an array) with the same program, in contrast with a more restrictive SIMD architecture. SPMD most often refers to message passing programming on distributed memory computer architectures (See Chapter 12). SPMD is a subcategory of MIMD computer architectures within a widely used classification system known as Flynn’s taxonomy, first proposed in 1966.

Stampede The first announced computer system using Intel Xeon Phi coprocessors (outside Intel). It is located at the Texas Advanced Computing Center in Austin, Texas, and deployed in January 2013. Stampede will have a peak performance of more than 2 petaFLOP/sec from the base cluster of Intel Xeon processors and more than 7 petaFLOP/sec from the Intel® Xeon Phi™ coprocessors. See also http://www.tacc.utexas.edu/stampede.

strangled scaling A programming error in which the performance of parallel code is poor due to high contention or overhead, so much so that it may underperform the nonparallel (serial) code.

Symmetric Communications Interface Intel® Symmetric Communications Interface (SCIF). SCIF provides a mechanism for inter-node communications within a single platform. A node, for SCIF purposes, is defined as either an Intel Xeon Phi coprocessor or the Intel Xeon processor. In particular, SCIF abstracts the details of communicating over the PCI Express bus. The SCIF APIs are callable from both user space (uSCIF) and kernel space (kSCIF). SCIF provides only data transfer services. Code control is provided by COI or other operating system services. SCIF exposes a distributed communication software interface very similar to sockets programming. The same programmer interface is exposed whether the implementation is running on an Intel Xeon host or a MIC card, therefore making it “symmetric” as far as functionality and code development are concerned. This enables other communications layers such as TCP/IP, OFED, and standard sockets to be more easily built upon SCIF. Implementing these standard communication interfaces on top of SCIF allows a MIC card such as an Intel® Xeon Phi™ coprocessor to be assigned a standard IP address, enabling the card to be logically viewed as an independent computing node in a network or cluster. Familiar distributed usages and access models such as rsh, ssh, and Network File System (NFS) mounting are all made possible through SCIF, standard communications layers, and the MIC cards’ on-board Linux operating system. Some applications may benefit from direct access to SCIF but may require a significantly higher development investment versus using other standard data transfer and communication mechanisms. SCIF is an advanced topic, not often used directly in application code, and is not covered in this book.

sync In terms of Cilk Plus, cilk_sync creates a sync point. The program flow executing the sync will not progress until all tasks spawned in the same spawning block have completed. A sync is not affected by spawns done by other tasks or threads, nor by those done prior to entering the current spawning block. A sync is always done when exiting a spawning block that contained any spawns. This is required for program composability.

synchronization The coordination of tasks or threads in order to obtain the desired runtime order. Commonly used to avoid undesired race conditions.

task A lightweight unit of potential parallelism with its own control flow. Unlike threads, tasks are usually serialized on a single core and run to completion. When contrasted with “thread” the distinction is made that tasks are pieces of work without any assumptions about where they will run, while threads have a one-to-one mapping of software threads to hardware threads. Threads are a mechanism for executing tasks in parallel, while tasks are units of work that merely provide the opportunity for parallel execution; tasks are not themselves a mechanism of parallel execution.

task parallelism An attempt to classify parallelism as more oriented around tasks than data. We deliberately avoid this term, task parallelism, because its meaning varies. In particular, elsewhere “task parallelism” can refer to tasks generated by functional decomposition or to irregular tasks that are still generated by data decomposition. In this book, any parallelism generated by data decomposition, regular or irregular, is considered data parallelism.

TBB See Threading Building Blocks (TBB).

temporal locality Nearby when measured in terms of time. Compare with spatial locality. Temporal locality refers to a program behavior in which data is likely to be reused relatively soon. Algorithms exhibiting good temporal locality in data usage can benefit from data caching, which is common in modern computers. It is not unusual to be able to achieve both temporal and spatial locality in data usage. Computer systems are generally more able to achieve optimal performance when both are achieved hence the interest in algorithm design to do so.

thread In general, a “software thread” is any software unit of parallel work with an independent flow of control, and a “hardware thread” is any hardware unit capable of executing a single flow of control (in particular, a hardware unit that maintains a single program counter). When “thread” is compared with “task” the distinction is made that tasks are pieces of work without any assumptions about where they will run, while threads have a one-to-one mapping of software threads to hardware threads. Threads are a mechanism for implementing tasks. A multitasking or multithreading operating system will multiplex multiple software threads onto a single hardware thread by interleaving execution via software created time slices. A multicore or many-core processor consists of multiple cores to execute at least one independent software thread per core through duplication of hardware. A multithreaded or hyper-threaded processor core will multiplex a single core to execute multiple software threads through interleaving of software threads via hardware mechanisms.

thread parallelism A mechanism for implementing parallelism in hardware using a separate flow of control for each task.

Threading Building Blocks Intel® Threading Building Blocks (TBB) is the most popular abstract solution for parallel programming in C++. TBB is an open source project created by Intel that has been ported to a very wide range of operating systems and processors from many vendors. While TBB is more popular than OpenMP in terms of the number of developers using it, the two seldom compete for developers in reality: TBB is popular with C++ programmers, whereas OpenMP is most used by Fortran and C programmers.

throughput Given a set of tasks to be performed, the rate at which those tasks are completed. Throughput measures the rate of computation, and it is given in units of tasks per unit time. See bandwidth and latency.

tiling Dividing a loop into a set of parallel tasks of a suitable granularity. In general, tiling consists of applying multiple steps to a smaller part of a problem instead of running each step on the whole problem one after the other. The purpose of tiling is to increase reuse of data in caches. Tiling can lead to dramatic performance increases when a whole problem does not fit in cache. We prefer the term “tiling” instead of “blocking” and “tile” instead of “block”; tiling and tile have become the more common terms in recent times.
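
A sketch of a tiled (cache-blocked) matrix transpose, with a tile size chosen so that a tile fits in cache (the value 64 is illustrative):

    enum { T = 64 };   /* tile size, illustrative */

    void transpose(const double *a, double *b, int n) {
        for (int ii = 0; ii < n; ii += T)
            for (int jj = 0; jj < n; jj += T)
                /* work on one T x T tile at a time so it stays cache-resident */
                for (int i = ii; i < ii + T && i < n; i++)
                    for (int j = jj; j < jj + T && j < n; j++)
                        b[j * n + i] = a[i * n + j];
    }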

TLB An abbreviation for Translation Lookaside Buffer. A TLB is a specialized cache that is used to hold translations of virtual to physical page addresses. The number of elements in the TLB determines how many pages of memory can be accessed simultaneously with good efficiency. Accessing a page not in the TLB will cause a TLB miss. A TLB miss typically causes a trap to the operating system so that the page table can be referenced and the TLB updated.

Trace Analyzer and Collector Intel® Trace Analyzer and Collector, a tool for analyzing MPI communication traffic in order to detect opportunities for improvement. See Chapter 13.

Translation Lookaside Buffer See TLB.

trip count The number of times a given loop will execute (“trip”); same as “iteration count.”

TSC Timestamp Counter, standard counter in modern x86 processors including the Intel Xeon Phi coprocessor. Each core has a 64-bit timestamp counter that monotonically increments every clock cycle and is reset to 0 whenever the processor is reset. Having multiple counters in the coprocessor increases the complexity of synchronizing all of them when time measurements are required on different cores. The Read Time-Stamp Counter instruction RDTSC loads the content of the core’s timestamp counter into the EDX:EAX registers. Although this clock source is low overhead, it is greatly affected by changes in power management; therefore, it is not possible to assure that the timestamps on multiple cores will be synchronized. See Chapter 13.
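
The counter can be read with the __rdtsc intrinsic (a sketch; the header name varies by compiler):

    #include <x86intrin.h>

    unsigned long long cycles(void) {
        return __rdtsc();   /* executes RDTSC; 64-bit cycle count returned */
    }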

unroll Complete unrolling of a loop is accomplished by duplicating the body of the loop, once per iteration, into straight-line code so that no loop is needed. For instance, for (i=0;i<3;i++) a[i]=i; can be unrolled to a[0]=0; a[1]=1; a[2]=2;. Partial unrolling retains the loop but expands the loop body to do multiple iterations each time through the loop. This is commonly done to enable vectorization. Unrolling is a common compiler optimization; it has also been common in source code in the past, although it is a bad idea these days (see Section “Avoid manual loop unrolling” in Chapter 5).

vector operation A low-level operation that can act on multiple data elements at once in SIMD fashion.

vector parallelism A mechanism for implementing parallelism in hardware using the same flow of control on multiple data elements.

vector processing unit (VPU) The portion of the coprocessor dedicated to processing vector operations. See Chapter 5.

vectorization The act of transforming code to enable simultaneous computations using vector hardware. Instructions such as MMX, SSE, and AVX instructions utilize vector hardware. The vectorization of code tends to enhance performance because more data is processed per instruction than would be done otherwise. See also vectorize.

vectorize Converting a program from a scalar implementation to a vectorized implementation to utilize vector hardware such as SIMD instructions (MMX, SSE, AVX, and so on). Vectorization is a specialized form of parallelism.

virtual memory Virtual memory decouples the address used by software from the physical addresses of real memory. The translation from virtual addresses to physical addresses is done in hardware that is initialized and controlled by the operating system.

VPU Vector Processing Unit. The portion of the coprocessor dedicated to processing vector operations. See Chapter 5.

VTune Intel® VTune™ Amplifier XE, analysis tool specializing in use of EMON counters to profile activity on processors and coprocessors. See Chapter 13.

Xeon Phi Intel® Xeon Phi™ coprocessors based on Intel® Many Integrated Core (MIC) Architecture. A prototype with up to 32 cores, based on 45nm process technology and known as Knights Ferry, was made available, but not sold, by Intel in 2010 and 2011. A product, known as the Intel® Xeon Phi™ coprocessor, built on 22nm process technology with up to 61 cores, started shipping in late 2012 and was announced in November 2012 at the conference known as “SC12” (Supercomputing 2012 in Salt Lake City, Utah). The SC12 announcement coincided with seven machines using the Intel® Xeon Phi™ coprocessor appearing on the “Top 500 List,” and the most energy-efficient computer in the world (the #1 spot on the “Green 500”) utilized Intel® Xeon Phi™ coprocessors. See also MIC.
