Mikhail Yu. Antonov

1 Architecture of parallel computing systems

Abstract: Impressive advances in computer design and technology have been made over the past several years. Computers have become a widely used tool in many areas of science and technology. Today, supercomputers are one of the main tools employed for scientific research in various fields, including oil and gas recovery, continuum mechanics, financial analysis, materials manufacturing, and other areas. That is why computational technologies, parallel programming, and efficient code development tools are so important for specialists in applied mathematics and engineering.

1.1 History of computers

The first electronic digital computers capable of being programmed to solve different computing problems appeared in the 1940s. In the 1970s, when Intel developed the first microprocessor, computers became available for the general public.

For decades, the computational performance of a single processor increased according to Moore’s law. New micro-architecture techniques, increased clock speeds, and instruction-level parallelism made it possible for old programs to run faster on new computers without any reprogramming. However, there has been almost no increase in instruction rate and clock speed since the mid-2000s. Major manufacturers now emphasize multicore central processing units (CPUs) as the answer to scaling system performance, although initially this approach was used mainly in large supercomputers. Nowadays, multicore CPUs are found in home computers, laptops, and even smartphones. The downside of this approach is that software has to be written in a special manner to take full advantage of multicore architectures.

1.2 Architecture of parallel computers

Parallel programming means that computations are performed on several processors simultaneously. This can be done on multicore processors, multiprocessor computers with shared memory, computer clusters with distributed memory, or hybrid architectures.

1.2.1 Flynn’s taxonomy of parallel architecture

Computer architecture can be classified according to various criteria. The most popular taxonomy of computer architecture was proposed by Flynn in 1966. Flynn’s classification is based on the number of concurrent instruction streams and data streams available in the architecture under consideration.

SISD (single instruction stream / single data stream) - A sequential computer which has a single instruction stream executing a single operation on a single data stream. Instructions are processed sequentially, i.e. one operation at a time (the von Neumann model).

SIMD (single instruction stream / multiple data stream) - A computer that has a single instruction stream processing multiple data flows, which may be naturally parallelized. Machines of this type usually have many identical interconnected processors under the supervision of a single control unit. Examples include array processors, graphics processing units (GPUs), and the SSE (Streaming SIMD Extensions) instruction set of modern x86 processors.
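To make the SIMD idea concrete, the following minimal C sketch (an illustration only, assuming an x86 compiler that provides the SSE intrinsics from <xmmintrin.h>) adds four pairs of single-precision numbers with one vector instruction instead of four scalar ones.

    #include <xmmintrin.h>   /* SSE intrinsics (illustrative example) */
    #include <stdio.h>

    int main(void)
    {
        float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load four floats into a 128-bit register */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction adds all four pairs      */
        _mm_storeu_ps(c, vc);            /* store the four results back to memory    */

        printf("%.1f %.1f %.1f %.1f\n", c[0], c[1], c[2], c[3]);
        return 0;
    }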

MISD (multiple instruction stream / single data stream) - Multiple instructions for processing a single data flow. The same data stream flows through an array of processors executing different instruction streams. Architectures of this type are uncommon, and the category is often considered to be practically empty.

MIMD (multiple instruction stream / multiple data stream) - Multiple instructions for processing multiple data flows. Multiple processor units execute various instructions on different data simultaneously. An MIMD computer has many interconnected processing elements, and each of them processes its own data with its own instructions. All multiprocessor systems fall under this classification.

Flynn’s taxonomy is the most widely used classification for initial characterization of computer architecture. However, this classification has evident drawbacks. In particular, the MIMD class is overcrowded. Most multiprocessor systems and multiple computer systems can be placed in this category, including any modern personal computer with x86-based multicore processors.

1.2.2 Address-space organization

Another way to classify parallel computers is by address-space organization. This classification reflects the types of communication between processors.

Shared-memory multiprocessors (SMP) - Shared-memory multiprocessor systems have more than one scalar processor, all of which share the same address space (main memory). This category includes traditional multicore and multiprocessor personal computers. Each processor in an SMP system may have its own cache memory, but all processors are connected to a common memory bus and memory bank (Figure 1.1). One of the main advantages of this architecture is the (comparative) simplicity of the programming model. Disadvantages include poor scalability (due to bus contention) and price: SMP-based supercomputers are more expensive than MPP systems with the same number of processors.


Fig. 1.1. Tightly coupled shared-memory system (SMP).

Massively parallel processors (MPP) - Massively parallel processor systems are composed of multiple subsystems (usually standalone computers), each with its own memory and copy of the operating system (Figure 1.2). Subsystems are connected by a high-speed network (an interconnect). In particular, this category includes computing clusters, i.e. sets of computers interconnected using standard networking interfaces (Ethernet, InfiniBand, etc.).


Fig. 1.2. MPP architecture.

MPP systems can easily have several thousand nodes. The main advantages of MPP systems are scalability, flexibility, and relatively low price. MPP systems are usually programmed using message-passing libraries. Because nodes exchange data through the interconnect, its speed, latency, and flexibility become very important: existing interconnects are much slower than data processing within a node.
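To give a flavour of the message-passing style used to program such systems, here is a minimal MPI sketch (a hello-world illustration only; it assumes an MPI implementation with its mpicc/mpirun tools is installed). Each node runs its own copy of the program as a separate process and learns its rank within the job.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);                /* start the MPI runtime        */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* number of this process       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes    */

        printf("Process %d of %d\n", rank, size);

        MPI_Finalize();                        /* shut down the MPI runtime    */
        return 0;
    }

Such a program is typically launched with a command like mpirun -np 4 ./a.out, with the processes distributed across the cluster nodes.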

Nonuniform memory access (NUMA) - NUMA architecture lies between SMP and MPP. NUMA systems consist of multiple computational nodes, each with its own local memory. Every node can access the entire system memory; however, access to local memory is much faster than access to remote memory.

It should be mentioned that this classification is not strictly mutually exclusive. For example, clusters of symmetric multiprocessors are relatively common in the TOP500 list.

1.3 Modern supercomputers

Floating-point operations per second (FLOPS) is a measure of computer performance. The LINPACK benchmark, which performs numerical linear algebra operations, is one of the most popular methods of measuring the performance of parallel computers. The TOP500 project ranks and details the 500 most powerful supercomputer systems in the world. The project started in 1993 and publishes an updated list of supercomputers twice a year. Since the rating began, the peak performance of supercomputers has increased by several orders of magnitude (Table 1.1).

Table 1.1. Supercomputer performance.

Name                  Year   Performance
ENIAC                 1946   300 Flops
IBM 709               1957   5 KFlops
Cray-1                1974   160 MFlops
Cray Y-MP             1988   2.3 GFlops
Intel ASCI Red        1997   1 TFlops
IBM Blue Gene/L       2006   478.2 TFlops
IBM Roadrunner        2008   1.042 PFlops
Cray XT5 Jaguar       2009   1.759 PFlops
Tianhe-1A             2010   2.507 PFlops
Fujitsu K computer    2011   8.162 PFlops
IBM Sequoia           2012   20 PFlops
Cray XK7 Titan        2012   27 PFlops
Tianhe-2              2014   54.9 PFlops

It is convenient to have high-performance computational power in a desktop computer, either for computational tasks or to speed up standard applications. Processor manufacturers have offered dual-core, quad-core, and even 8- and 16-core x86-compatible processors since 2005. Using a standard 4-processor motherboard, it is now possible to have up to 64 cores in a single personal computer.

In addition, the idea of creating a personal supercomputer is supported by graphics processing unit (GPU) manufacturers, who have adapted the technology and software for general-purpose calculations on GPUs. For instance, NVIDIA provides CUDA technology, whereas AMD offers ATI Stream technology. A single GPU can deliver up to 1 TFlops, i.e. more than a traditional x86 central processing unit.

1.4 Multicore computers

Nowadays, the majority of modern personal computers have two or more computational cores. As a result, parallel computing is used extensively around the world in a wide variety of applications. As stated above, these computers belong to SMP systems with shared memory. Different cores on these computers can run distinct command flows. A single program can have more than one command flow (thread), all of which operate in shared memory. A program can significantly increase its performance on a multicore system if it is designed to employ multiple threads efficiently.

The advantage of multi-threaded software for SMP systems is that data structures are shared among threads, so there is no need to copy data between execution contexts (threads, processes, or processes spread over several computers), as is done with the Message-Passing Interface (MPI) library. Also, system (shared) memory is usually substantially faster (by orders of magnitude in some scenarios) than the interconnects generally used in MPP systems (e.g. InfiniBand).

Writing complex parallel programs for modern computers requires designing code for the multiprocessor system architecture. While this is relatively easy for symmetric multiprocessing, uniprocessor and SMP systems require different programming methods in order to achieve maximum performance. Programmers need to know how modern operating systems support processes and threads, understand the performance limits of threaded software, and be able to predict results.

1.5 Operating system processes and threads

Before studying multi-threaded programming, it is necessary to understand what processes and threads are in modern operating systems. First, we illustrate how threads and processes work.

1.5.1 Processes

In a multitasking operating system, multiple programs, also called processes, can be executed simultaneously without interfering with each other (Figure 1.3). Memory protection is applied at the hardware level to prevent a process from accessing the memory of another process. A virtual memory system provides a framework for the operating system to manage memory on behalf of the various processes. Each process is presented with its own virtual address space, while the hardware and the operating system prevent a process from accessing memory outside this space.


Fig. 1.3. Multitasking operating system.

When a process runs in protected mode, it has its own independent, contiguous, and accessible address space, and the virtual memory system is responsible for managing the mapping between the virtual address space of the process and the real physical memory of the computer (Figure 1.4). A virtual memory system also allows programmers to develop software using a simple memory model without needing to synchronize the global address space between different processes.
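A minimal C sketch (for illustration; it assumes a Unix-like system with the POSIX fork() call) shows this isolation in practice: the child process modifies its copy of a variable, while the parent's copy in its own address space remains unchanged.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int value = 10;   /* each process gets its own copy of this variable */

    int main(void)
    {
        pid_t pid = fork();              /* create a second process         */

        if (pid == 0) {                  /* child process                   */
            value = 99;                  /* changes only the child's copy   */
            printf("child:  value = %d\n", value);
            exit(0);
        }
        wait(NULL);                      /* parent waits for the child      */
        printf("parent: value = %d\n", value);   /* still prints 10         */
        return 0;
    }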

1.5.2 Threads

Just as an operating system can simultaneously execute several processes, each process can simultaneously execute several threads. Usually each process has at least one thread.

Each thread belongs to one process and threads cannot exist outside a process. Each thread represents a separate command flow executed inside a process (with its own program counter, system registers and stack). Both processes and threads can be seen as independent sequences of execution. The main difference between them is that while processes run in different contexts and virtual memory spaces, all threads of the same process share some resources, particularly memory address space.
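As a concrete illustration (a minimal sketch assuming the POSIX threads library, which is only one of several possible threading interfaces), the program below starts two threads that update one shared global counter, while the variable local lives on each thread's private stack.

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;                            /* shared by all threads    */
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* protects the shared data */

    void *worker(void *arg)
    {
        int local = *(int *)arg;        /* private: lives on this thread's stack */

        pthread_mutex_lock(&lock);
        shared_counter += local;        /* both threads update the same variable */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];
        int args[2] = {1, 2};

        for (int i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, worker, &args[i]);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        printf("shared_counter = %d\n", shared_counter);   /* prints 3 */
        return 0;
    }

On most systems such a program is compiled with the -pthread option.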

Threads within a single process share

– global variables;
– descriptors;
– timers;
– semaphores; and more.

Fig. 1.4. Multitasking and virtual memory.

Each thread, however, has its own

– program counter;
– registers;
– stack;
– state.

A processor core switches rapidly from one thread to another in order to maintain a large number of different running processes and threads in the system. When the operating system decides to switch a currently running thread, it saves context information of the thread/process (registers, the program counter, etc.) so that the execution can be resumed at the same point later, and loads a new thread/process context into the processor. This also enables multiple threads/processes to share a single processor core.

1.6 Programming multi-threaded applications

There are different forms, technologies, and parallel programming models available for the development of parallel software. Each technique has both advantages and disadvantages, and in each case it should be decided whether the development of a parallel software version justifies the additional effort and resources.

1.6.1 Multi-threading: pros and cons

The main advantage of multi-thread programming is obviously the effective use of SMP-architecture resources, including personal computers with multicore CPUs. A multi-threaded program can execute several commands simultaneously, and as a result its performance can be significantly higher. The actual performance increase depends on the computer architecture and operating system, as well as on how the program is implemented, but it is still limited by Amdahl’s law.
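For reference, Amdahl’s law gives this limit: if a fraction p of the work can be parallelized and the remaining 1 - p must run sequentially, the speedup on N processors is at most

    S(N) = \frac{1}{(1 - p) + p/N}.

For example, with p = 0.9 and N = 8 the speedup is at most 1/(0.1 + 0.1125) ≈ 4.7, and even with an unlimited number of processors it cannot exceed 1/(1 - p) = 10.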

Disadvantages include, first of all, possible loss of performance due to thread management overheads. Secondly, it is more difficult to write and debug multi-threaded programs. In addition to common programming mistakes (memory leaks, allocation failures, etc.), programmers face new problems specific to parallel code (race conditions, deadlocks, synchronization, etc.). To make matters worse, in many cases a program containing such a mistake will appear to work correctly, because these mistakes manifest themselves only under very specific thread schedules.
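The following minimal sketch (again assuming POSIX threads; for illustration only) contains a deliberate race condition: two threads increment a shared counter without synchronization, and because the increment is not atomic, the final value is usually less than the expected 2 000 000 and varies from run to run.

    #include <pthread.h>
    #include <stdio.h>

    long counter = 0;                     /* shared and deliberately unprotected */

    void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;                    /* read-modify-write: not atomic       */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        printf("counter = %ld\n", counter);   /* rarely equals 2000000           */
        return 0;
    }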

1.6.2 Program models

One of the popular ways of implementing multi-threading is to use OpenMP shared memory programming technology. A brief overview of this programming environment will be given later in the book.
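As a brief preview (a sketch only; it assumes a compiler with OpenMP support, e.g. enabled with the -fopenmp option), a single directive is often enough to distribute a loop across the available threads:

    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;
        int i;

        /* The directive creates a team of threads, divides the loop iterations
           among them, and combines the partial sums at the end.               */
        #pragma omp parallel for reduction(+:sum)
        for (i = 0; i < n; i++)
            sum += 1.0 / (i + 1.0);

        printf("sum = %f\n", sum);
        return 0;
    }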

Another way is to use special operating system interfaces for developing multi-threaded programs. When using these low-level interfaces, threads must be explicitly created, synchronized, and destroyed. Another difference is that while high-level solutions are usually more task-specific, low-level interfaces are more flexible and give sophisticated control over thread management.
