Contents

About the Authors

About the Technical Reviewers

Acknowledgments

Foreword

Introduction

Chapter 1: No Time to Read This Book?

Using Intel MPI Library

Using Intel Composer XE

Tuning Intel MPI Library

Gather Built-in Statistics

Optimize Process Placement

Optimize Thread Placement

Tuning Intel Composer XE

Analyze Optimization and Vectorization Reports

Use Interprocedural Optimization

Summary

References

Chapter 2: Overview of Platform Architectures

Performance Metrics and Targets

Latency, Throughput, Energy, and Power

Peak Performance as the Ultimate Limit

Scalability and Maximum Parallel Speedup

Bottlenecks and a Bit of Queuing Theory

Roofline Model

Performance Features of Computer Architectures

Increasing Single-Threaded Performance: Where You Can and Cannot Help

Process More Data with SIMD Parallelism

Distributed and Shared Memory Systems

HPC Hardware Architecture Overview

A Multicore Workstation or a Server Compute Node

Coprocessor for Highly Parallel Applications

A Group of Similar Nodes Forms an HPC Cluster

Other Important Components of HPC Systems

Summary

References

Chapter 3: Top-Down Software Optimization

The Three Levels and Their Impact on Performance

System Level

Application Level

Microarchitecture Level

Closed-Loop Methodology

Workload, Application, and Baseline

Iterating the Optimization Process

Summary

References

Chapter 4: Addressing System Bottlenecks

Classifying System-Level Bottlenecks

Identifying Issues Related to System Condition

Characterizing Problems Caused by System Configuration

Understanding System-Level Performance Limits

Checking General Compute Subsystem Performance

Testing Memory Subsystem Performance

Testing I/O Subsystem Performance

Characterizing Application System-Level Issues

Selecting Performance Characterization Tools

Monitoring the I/O Utilization

Analyzing Memory Bandwidth

Summary

References

Chapter 5: Addressing Application Bottlenecks: Distributed Memory

Algorithm for Optimizing MPI Performance

Comprehending the Underlying MPI Performance

Recalling Some Benchmarking Basics

Gauging Default Intranode Communication Performance

Gauging Default Internode Communication Performance

Discovering Default Process Layout and Pinning Details

Gauging Physical Core Performance

Doing Initial Performance Analysis

Is It Worth the Trouble?

Getting an Overview of Scalability and Performance

Learning Application Behavior

Choosing Representative Workload(s)

Balancing Process and Thread Parallelism

Doing a Scalability Review

Analyzing the Details of the Application Behavior

Choosing the Optimization Objective

Detecting Load Imbalance

Dealing with Load Imbalance

Classifying Load Imbalance

Addressing Load Imbalance

Optimizing MPI Performance

Classifying the MPI Performance Issues

Addressing MPI Performance Issues

Mapping Application onto the Platform

Tuning the Intel MPI Library

Optimizing Application for Intel MPI

Using Advanced Analysis Techniques

Automatically Checking MPI Program Correctness

Comparing Application Traces

Instrumenting Application Code

Correlating MPI and Hardware Events

Summary

References

Chapter 6: Addressing Application Bottlenecks: Shared Memory

Profiling Your Application

Using VTune Amplifier XE for Hotspots Profiling

Hotspots for the HPCG Benchmark

Compiler-Assisted Loop/Function Profiling

Sequential Code and Detecting Load Imbalances

Thread Synchronization and Locking

Dealing with Memory Locality and NUMA Effects

Thread and Process Pinning

Controlling OpenMP Thread Placement

Thread Placement in Hybrid Applications

Summary

References

Chapter 7: Addressing Application Bottlenecks: Microarchitecture

Overview of a Modern Processor Pipeline

Pipelined Execution

Out-of-order vs. In-order Execution

Superscalar Pipelines

SIMD Execution

Speculative Execution: Branch Prediction

Memory Subsystem

Putting It All Together: A Final Look at the Sandy Bridge Pipeline

A Top-down Method for Categorizing the Pipeline Performance

Intel Composer XE Usage for Microarchitecture Optimizations

Basic Compiler Usage and Optimization

Using Optimization and Vectorization Reports to Read the Compiler’s Mind

Optimizing for Vectorization

Dealing with Disambiguation

Dealing with Branches

When Optimization Leads to Wrong Results

Analyzing Pipeline Performance with Intel VTune Amplifier XE

Using a Standard Library Method

Summary

References

Chapter 8: Application Design Considerations

Abstraction and Generalization of the Platform Architecture

Types of Abstractions

Levels of Abstraction and Complexities

Raw Hardware vs. Virtualized Hardware in the Cloud

Questions about Application Design

Designing for Performance and Scaling

Designing for Flexibility and Performance Portability

Understanding Bounds and Projecting Bottlenecks

Data Storage or Transfer vs. Recalculation

Total Productivity Assessment

Summary

References

Index