Foreword

Large-scale computing—also known as supercomputing—is inherently about performance. We build supercomputers in order to solve the largest possible problems in a time that allows the answers to be relevant. However, application scientists spend the bulk of their time adding functionality to their simulations and are necessarily experts in the domains covered by those simulations. They are not experts in computer science in general and code optimization in particular. Thus, a book such as this one is essential—a comprehensive but succinct guide to achieving performance across the range of architectural space covered by large-scale systems using two widely available standard programming models (OpenMP and MPI) that complement each other.

Today’s large-scale systems consist of many nodes federated by a high-speed interconnect. Thus, multiprocess parallelism, as facilitated by MPI, is essential to use them well. However, individual nodes have become complex parallel systems in their own right. Each node typically consists of multiple processors, each of which has multiple cores. While applications have long treated these cores as virtual nodes, the decreasing memory capacity per core is best handled with multithreading, which OpenMP facilitates most naturally. Those cores now almost universally offer some sort of parallel (Single Instruction, Multiple Data, or SIMD) floating-point unit that provides yet another level of parallelism that the application scientist must exploit in order to use the system as effectively as possible. Since performance is the ultimate purpose of large-scale systems, multi-level parallelism is essential to them. This book will help application scientists tackle that complex computer science problem.

In general, performance optimization is most easily accomplished with the right tools for the task. Intel Parallel Studio XE Cluster Edition is a collection of tools that support efficient application development and performance optimization. While many other compilers are available for Intel architectures, including one from PGI, as well as the open source GNU Compiler Collection, the Intel compilers that are included in the Parallel Studio tool suite generate particularly efficient code for them.

To optimize interprocess communication, the application scientist needs to understand which message operations are most efficient. Many tools, including Intel Trace Analyzer and Collector, use the MPI Profiling Interface to measure MPI performance and to help the application scientist identify bottlenecks between nodes. Several others are available, including Scalasca, TAU, Paraver, and Vampir, the tool that inspired the Intel Trace Analyzer. The application scientist’s toolbox should include several of them.

Similarly, the application scientist needs to understand how well the capabilities of the node are utilized within each MPI process in order to achieve the best overall performance. Again, a wide range of tools is available for this purpose. Many build on hardware performance monitors to measure low-level details of on-node performance. VTune Amplifier XE provides these and other detailed measurements of single-node performance and helps the application scientist identify bottlenecks between and within threads. Several other tools, again including TAU and Paraver, provide similar capabilities. A particularly useful tool in addition to those already mentioned is HPCToolkit from Rice University, which offers many useful synthesized measurements that indicate how well the node’s capabilities are being used and where performance is being lost.

This book is organized in the way the successful application scientist approaches the problem of performance optimization. It starts with a brief overview of the performance optimization process. It then provides immediate assistance in addressing the most pressing optimization problems at the MPI and OpenMP levels. The following chapters take the reader on a detailed tour of performance optimization on large-scale systems, starting with an overview of the best approach for today’s architectures. Next, the book surveys the top-down optimization approach, which starts with identifying and addressing the most performance-limiting aspects of the application and repeats the process until sufficient performance is achieved. Then, the book discusses how to handle high-level bottlenecks, including file I/O, that are common in large-scale applications. The concluding chapters provide similar coverage of MPI, OpenMP, and SIMD bottlenecks. At the end, the authors provide general guidelines for application design that are derived from the top-down approach.

Overall, this text will prove a useful addition to the toolbox of any application scientist who understands that the goal of significant scientific achievements can be reached only with highly optimized code.

—Dr. Bronis R. de Supinski, CTO, Livermore Computing, LLNL
