Contents
Chapter 1: No Time to Read This Book?
Analyze Optimization and Vectorization Reports
Use Interprocedural Optimization
Chapter 2: Overview of Platform Architectures
Performance Metrics and Targets
Latency, Throughput, Energy, and Power
Peak Performance as the Ultimate Limit
Scalability and Maximum Parallel Speedup
Bottlenecks and a Bit of Queuing Theory
Performance Features of Computer Architectures
Increasing Single-Threaded Performance: Where You Can and Cannot Help
Process More Data with SIMD Parallelism
Distributed and Shared Memory Systems
HPC Hardware Architecture Overview
A Multicore Workstation or a Server Compute Node
Coprocessor for Highly Parallel Applications
A Group of Similar Nodes Forms an HPC Cluster
Other Important Components of HPC Systems
Chapter 3: Top-Down Software Optimization
The Three Levels and Their Impact on Performance
Workload, Application, and Baseline
Iterating the Optimization Process
Chapter 4: Addressing System Bottlenecks
Classifying System-Level Bottlenecks
Identifying Issues Related to System Condition
Characterizing Problems Caused by System Configuration
Understanding System-Level Performance Limits
Checking General Compute Subsystem Performance
Testing Memory Subsystem Performance
Testing I/O Subsystem Performance
Characterizing Application System-Level Issues
Selecting Performance Characterization Tools
Monitoring the I/O Utilization
Chapter 5: Addressing Application Bottlenecks: Distributed Memory
Algorithm for Optimizing MPI Performance
Comprehending the Underlying MPI Performance
Recalling Some Benchmarking Basics
Gauging Default Intranode Communication Performance
Gauging Default Internode Communication Performance
Discovering Default Process Layout and Pinning Details
Gauging Physical Core Performance
Doing Initial Performance Analysis
Getting an Overview of Scalability and Performance
Choosing Representative Workload(s)
Balancing Process and Thread Parallelism
Analyzing the Details of the Application Behavior
Choosing the Optimization Objective
Classifying the MPI Performance Issues
Addressing MPI Performance Issues
Mapping Application onto the Platform
Optimizing Application for Intel MPI
Using Advanced Analysis Techniques
Automatically Checking MPI Program Correctness
Instrumenting Application Code
Correlating MPI and Hardware Events
Chapter 6: Addressing Application Bottlenecks: Shared Memory
Using VTune Amplifier XE for Hotspots Profiling
Hotspots for the HPCG Benchmark
Compiler-Assisted Loop/Function Profiling
Sequential Code and Detecting Load Imbalances
Thread Synchronization and Locking
Dealing with Memory Locality and NUMA Effects
Controlling OpenMP Thread Placement
Thread Placement in Hybrid Applications
Chapter 7: Addressing Application Bottlenecks: Microarchitecture
Overview of a Modern Processor Pipeline
Out-of-order vs. In-order Execution
Speculative Execution: Branch Prediction
Putting It All Together: A Final Look at the Sandy Bridge Pipeline
A Top-Down Method for Categorizing the Pipeline Performance
Intel Composer XE Usage for Microarchitecture Optimizations
Basic Compiler Usage and Optimization
Using Optimization and Vectorization Reports to Read the Compiler’s Mind
When Optimization Leads to Wrong Results
Analyzing Pipeline Performance with Intel VTune Amplifier XE
Using a Standard Library Method
Chapter 8: Application Design Considerations
Abstraction and Generalization of the Platform Architecture
Levels of Abstraction and Complexities
Raw Hardware vs. Virtualized Hardware in the Cloud
Questions about Application Design
Designing for Performance and Scaling
Designing for Flexibility and Performance Portability
Understanding Bounds and Projecting Bottlenecks