Abstract: Impressive advances in computer design and technology have been made over the past several years. Computers have become a widely used tool in many areas of science and technology. Today, supercomputers are one of the main tools employed for scientific research in various fields, including oil and gas recovery, continuum mechanics, financial analysis, materials manufacturing, and other areas. That is why computational technologies, parallel programming, and efficient code-development tools are so important for specialists in applied mathematics and engineering.
The first electronic digital computers capable of being programmed to solve different computing problems appeared in the 1940s. In the 1970s, when Intel developed the first microprocessor, computers became available to the general public.
For decades, the computational performance of a single processor increased according to Moore’s law. New microarchitectural techniques, increased clock speeds, and instruction-level parallelism made it possible for old programs to run faster on new computers without reprogramming. Since the mid-2000s, however, there has been almost no increase in instruction rate and clock speed. Major manufacturers now emphasize multicore central processing units (CPUs) as the answer to scaling system performance, although initially this approach was used mainly in large supercomputers. Nowadays, multicore CPUs appear in home computers, laptops, and even smartphones. The downside of this approach is that software has to be written specifically to take full advantage of multicore architectures.
Parallel programming means that computations are performed on several processors simultaneously. This can be done on multicore processors, multiprocessor computers with shared memory, computer clusters with distributed memory, or hybrid architectures.
Computer architecture can be classified according to various criteria. The most popular taxonomy of computer architecture was defined by Flynn in 1966. The classifications introduced by Flynn are based on the number of concurrent instructions and data streams available in the architecture under consideration.
SISD (single instruction stream / single data stream) - A sequential computer which has a single instruction stream, executing a single operation on a single data stream. Instructions are processed sequentially, i.e. one operation at a time (the von Neumann model).
SIMD (single instruction stream / multiple data stream) - A computer that has a single instruction stream for processing multiple data flows, which may be naturally parallelized. Machines of this type usually have many identical interconnected processors under the supervision of a single control unit. Examples include array processors, graphics processing units (GPUs), and the SSE (Streaming SIMD Extensions) instruction sets of modern x86 processors.
MISD (multiple instruction stream / single data stream) - Multiple instructions for processing a single data flow. The same data stream flows through an array of processors executing different instruction streams. This architecture is uncommon, and in practice the class is considered to be almost empty.
MIMD (multiple instruction stream / multiple data stream) - Multiple instructions for processing multiple data flows. Multiple processor units execute various instructions on different data simultaneously. An MIMD computer has many interconnected processing elements, and each of them processes its own data with its own instructions. All multiprocessor systems fall under this classification.
Flynn’s taxonomy is the most widely used classification for an initial characterization of computer architecture. However, this classification has evident drawbacks. In particular, the MIMD class is overcrowded: most multiprocessor and multicomputer systems fall into this category, including any modern personal computer with an x86-based multicore processor.
Another method of classification of parallel computers is based on address-space organization. This classification reflects types of communication between processors.
Shared-memory multiprocessors (SMP) - Shared-memory multiprocessor systems have more than one processor sharing the same address space (main memory). This category includes traditional multicore and multiprocessor personal computers. Each processor in an SMP system may have its own cache memory, but all processors are connected to a common memory bus and memory bank (Figure 1.1). One of the main advantages of this architecture is the comparative simplicity of its programming model. Disadvantages include poor scalability (due to bus contention) and high price: SMP-based supercomputers are more expensive than MPP systems with the same number of processors.
Massively parallel processors (MPP) - Massively parallel processor systems are composed of multiple subsystems (usually standalone computers), each with its own memory and copy of the operating system (Figure 1.2). Subsystems are connected by a high-speed network (an interconnect). In particular, this category includes computing clusters, i.e. sets of computers interconnected using standard networking interfaces (Ethernet, InfiniBand, etc.).
MPP systems can easily have several thousand nodes. Their main advantages are scalability, flexibility, and relatively low price. MPP systems are usually programmed using message-passing libraries. Since nodes exchange data through the interconnection network, its speed, latency, and flexibility become very important: existing interconnects are considerably slower than data processing within the nodes.
Non-uniform memory access (NUMA) - NUMA architecture is intermediate between SMP and MPP. NUMA systems consist of multiple computational nodes, each with its own local memory. Every node can access the entire system memory; however, access to local memory is much faster than access to remote memory.
It should be mentioned that these categories are not mutually exclusive. For example, clusters of symmetric multiprocessors are relatively common on the TOP500 list.
Floating-point operations per second (FLOPS) is a measure of computer performance. The LINPACK software for performing numerical linear algebra operations is one of the most popular methods of measuring the performance of parallel computers. The TOP500 project ranks and details the 500 most powerful supercomputer systems in the world. The project started in 1993 and publishes an updated list of supercomputers twice a year. Over the lifetime of the rating, the peak performance of supercomputers has grown by several orders of magnitude (Table 1.1).
Name | Year | Peak performance |
---|---|---|
ENIAC | 1946 | 300 flops |
IBM 709 | 1957 | 5 Kflops |
Cray-1 | 1976 | 160 Mflops |
Cray Y-MP | 1988 | 2.3 Gflops |
Intel ASCI Red | 1997 | 1 Tflops |
IBM Blue Gene/L | 2006 | 478.2 Tflops |
IBM Roadrunner | 2008 | 1.042 Pflops |
Cray XT5 Jaguar | 2009 | 1.759 Pflops |
Tianhe-1A | 2010 | 2.507 Pflops |
Fujitsu K computer | 2011 | 8.162 Pflops |
IBM Sequoia | 2012 | 20 Pflops |
Cray XK7 Titan | 2012 | 27 Pflops |
Tianhe-2 | June 2014 | 54.9 Pflops |
It is convenient to have high-performance computational power in a desktop computer, either for computational tasks or to speed up standard applications. Since 2005, processor manufacturers have offered dual-core, quad-core, and even 8- and 16-core x86-compatible processors. Using a standard 4-processor motherboard, it is now possible to have up to 64 cores in a single personal computer.
In addition, the idea of creating a personal supercomputer is supported by graphics processing unit (GPU) manufacturers, who have adapted the technology and software for general-purpose calculations on GPUs. For instance, NVIDIA provides the CUDA technology, whereas AMD offers ATI Stream. A single GPU can demonstrate up to 1 Tflops, i.e. more than traditional x86 central processing units.
Nowadays, the majority of modern personal computers have two or more computational cores. As a result, parallel computing is used extensively around the world in a wide variety of applications. As stated above, such computers are SMP systems with shared memory. Different cores can run distinct command flows. A single program can have more than one command flow (thread), all of which operate in shared memory. A program can significantly increase its performance on a multicore system if it is designed to employ multiple threads efficiently.
The advantage of multi-threaded software for SMP systems is that data structures are shared among threads, so there is no need to copy data between execution contexts (threads, processes, or processes spread over several computers), as is done in the Message-Passing Interface (MPI) library. Also, access to system (shared) memory is usually much faster, in some scenarios by orders of magnitude, than the interconnects generally used in MPP systems (e.g. InfiniBand).
Writing complex parallel programs for modern computers requires designing the code for the multiprocessor system architecture. Although this is comparatively easy for symmetric multiprocessing, uniprocessor and SMP systems still require different programming methods to achieve maximum performance. Programmers need to know how modern operating systems support processes and threads, understand the performance limits of threaded software, and be able to predict the results.
Before studying multi-threaded programming, it is necessary to understand what processes and threads are in modern operating systems. First, we illustrate how threads and processes work.
In a multitasking operating system, multiple programs, also called processes, can be executed simultaneously without interfering with each other (Figure 1.3). Memory protection is applied at the hardware level to prevent a process from accessing the memory of another process. A virtual memory system provides a framework for the operating system to manage memory on behalf of the various processes. Each process is presented with its own virtual address space, while the hardware and the operating system prevent a process from accessing memory outside its own virtual address space.
When a process runs in protected mode, it has its own independent, contiguous, and accessible address space, and the virtual memory system is responsible for managing the mapping between the virtual address space of the process and the real physical memory of the computer (Figure 1.4). A virtual memory system also allows programmers to develop software with a simple memory model, without having to coordinate a global address space between different processes.
Just as an operating system can simultaneously execute several processes, each process can simultaneously execute several threads. Usually each process has at least one thread.
Each thread belongs to one process and threads cannot exist outside a process. Each thread represents a separate command flow executed inside a process (with its own program counter, system registers and stack). Both processes and threads can be seen as independent sequences of execution. The main difference between them is that while processes run in different contexts and virtual memory spaces, all threads of the same process share some resources, particularly memory address space.
Threads within a single process share the process’s resources, in particular its memory address space, and thus global variables, heap-allocated data, and open files. Each thread, however, has its own program counter, register values, and stack.
A processor core switches rapidly from one thread to another in order to maintain a large number of running processes and threads in the system. When the operating system decides to switch out the currently running thread, it saves the context of the thread/process (registers, the program counter, etc.) so that execution can be resumed at the same point later, and loads the context of a new thread/process into the processor. This also enables multiple threads/processes to share a single processor core.
There are different approaches, technologies, and parallel programming models available for the development of parallel software. Each technique has both advantages and disadvantages, and in each case it should be decided whether the development of a parallel version of the software justifies the additional effort and resources.
The main advantage of multi-threaded programming is obviously the effective use of SMP-architecture resources, including personal computers with multicore CPUs. A multi-threaded program can execute several commands simultaneously, and its performance is significantly higher as a result. The actual performance increase depends on the computer architecture and operating system, as well as on how the program is implemented, but it is still limited by Amdahl’s law.
Disadvantages include, first of all, possible loss of performance due to thread-management overheads. Secondly, multi-threaded programs are more difficult to write and debug. In addition to common programming mistakes (memory leaks, allocation failures, etc.), programmers face new problems in handling parallel code (race conditions, deadlocks, synchronization, etc.). To make matters worse, a program containing such a mistake will in many cases continue to work, because these mistakes manifest themselves only under very specific thread schedules.
One of the popular ways of implementing multi-threading is to use OpenMP shared memory programming technology. A brief overview of this programming environment will be given later in the book.
Another way is to use special operating system interfaces for developing multi-threaded programs. When using these low-level interfaces, threads must be explicitly created, synchronized, and destroyed. Another difference is that while high-level solutions are usually more task-specific, low-level interfaces are more flexible and give sophisticated control over thread management.