Optimizing HPC Applications with Intel® Cluster Tools
by Christopher Dahnken, Michael Klemm, Andrey Semin, and Alexander Supalov

Table of Contents
Cover
Title
Copyright
About ApressOpen
Dedication
Contents at a Glance
Contents
About the Authors
About the Technical Reviewers
Acknowledgments
Foreword
Introduction
Chapter 1: No Time to Read This Book?
    Using Intel MPI Library
    Using Intel Composer XE
    Tuning Intel MPI Library
    Gather Built-in Statistics
    Optimize Process Placement
    Optimize Thread Placement
    Tuning Intel Composer XE
    Analyze Optimization and Vectorization Reports
    Use Interprocedural Optimization
    Summary
    References
Chapter 2: Overview of Platform Architectures
    Performance Metrics and Targets
    Latency, Throughput, Energy, and Power
    Peak Performance as the Ultimate Limit
    Scalability and Maximum Parallel Speedup
    Bottlenecks and a Bit of Queuing Theory
    Roofline Model
    Performance Features of Computer Architectures
    Increasing Single-Threaded Performance: Where You Can and Cannot Help
    Process More Data with SIMD Parallelism
    Distributed and Shared Memory Systems
    HPC Hardware Architecture Overview
    A Multicore Workstation or a Server Compute Node
    Coprocessor for Highly Parallel Applications
    A Group of Similar Nodes Forms an HPC Cluster
    Other Important Components of HPC Systems
    Summary
    References
Chapter 3: Top-Down Software Optimization
    The Three Levels and Their Impact on Performance
    System Level
    Application Level
    Microarchitecture Level
    Closed-Loop Methodology
    Workload, Application, and Baseline
    Iterating the Optimization Process
    Summary
    References
Chapter 4: Addressing System Bottlenecks
    Classifying System-Level Bottlenecks
    Identifying Issues Related to System Condition
    Characterizing Problems Caused by System Configuration
    Understanding System-Level Performance Limits
    Checking General Compute Subsystem Performance
    Testing Memory Subsystem Performance
    Testing I/O Subsystem Performance
    Characterizing Application System-Level Issues
    Selecting Performance Characterization Tools
    Monitoring the I/O Utilization
    Analyzing Memory Bandwidth
    Summary
    References
Chapter 5: Addressing Application Bottlenecks: Distributed Memory
    Algorithm for Optimizing MPI Performance
    Comprehending the Underlying MPI Performance
    Recalling Some Benchmarking Basics
    Gauging Default Intranode Communication Performance
    Gauging Default Internode Communication Performance
    Discovering Default Process Layout and Pinning Details
    Gauging Physical Core Performance
    Doing Initial Performance Analysis
    Is It Worth the Trouble?
    Getting an Overview of Scalability and Performance
    Learning Application Behavior
    Choosing Representative Workload(s)
    Balancing Process and Thread Parallelism
    Doing a Scalability Review
    Analyzing the Details of the Application Behavior
    Choosing the Optimization Objective
    Detecting Load Imbalance
    Dealing with Load Imbalance
    Classifying Load Imbalance
    Addressing Load Imbalance
    Optimizing MPI Performance
    Classifying the MPI Performance Issues
    Addressing MPI Performance Issues
    Mapping the Application onto the Platform
    Tuning the Intel MPI Library
    Optimizing the Application for Intel MPI
    Using Advanced Analysis Techniques
    Automatically Checking MPI Program Correctness
    Comparing Application Traces
    Instrumenting Application Code
    Correlating MPI and Hardware Events
    Summary
    References
Chapter 6: Addressing Application Bottlenecks: Shared Memory
    Profiling Your Application
    Using VTune Amplifier XE for Hotspots Profiling
    Hotspots for the HPCG Benchmark
    Compiler-Assisted Loop/Function Profiling
    Sequential Code and Detecting Load Imbalances
    Thread Synchronization and Locking
    Dealing with Memory Locality and NUMA Effects
    Thread and Process Pinning
    Controlling OpenMP Thread Placement
    Thread Placement in Hybrid Applications
    Summary
    References
Chapter 7: Addressing Application Bottlenecks: Microarchitecture
    Overview of a Modern Processor Pipeline
    Pipelined Execution
    Out-of-Order vs. In-Order Execution
    Superscalar Pipelines
    SIMD Execution
    Speculative Execution: Branch Prediction
    Memory Subsystem
    Putting It All Together: A Final Look at the Sandy Bridge Pipeline
    A Top-Down Method for Categorizing the Pipeline Performance
    Intel Composer XE Usage for Microarchitecture Optimizations
    Basic Compiler Usage and Optimization
    Using Optimization and Vectorization Reports to Read the Compiler's Mind
    Optimizing for Vectorization
    Dealing with Disambiguation
    Dealing with Branches
    When Optimization Leads to Wrong Results
    Analyzing Pipeline Performance with Intel VTune Amplifier XE
    Summary
    References
Chapter 8: Application Design Considerations
    Abstraction and Generalization of the Platform Architecture
    Types of Abstractions
    Levels of Abstraction and Complexities
    Raw Hardware vs. Virtualized Hardware in the Cloud
    Questions about Application Design
    Designing for Performance and Scaling
    Designing for Flexibility and Performance Portability
    Understanding Bounds and Projecting Bottlenecks
    Data Storage or Transfer vs. Recalculation
    Total Productivity Assessment
    Summary
    References
Index