Introduction

Let’s optimize some programs. We have been doing this for years, and we still love doing it. One day we thought, Why not share this fun with the world? And just a year later, here we are.

Oh, you just need your program to run faster NOW? We understand. Go to Chapter 1 and get quick tuning advice. You can return later to see how the magic works.

Are you a student? Perfect. This book may help you pass that “Software Optimization 101” exam. Talking seriously about programming is a cool party trick, too. Try it.

Are you a professional? Good. You have hit the one-stop shop for Intel’s proven top-down optimization methodology and Intel Parallel Studio XE Cluster Edition, which includes the Message Passing Interface* (MPI), OpenMP, math libraries, compilers, and more.

Or are you just curious? Read on. You will learn how high-performance computing makes your life safer, your car faster, and your day brighter.

And, by the way: You will find all you need to carry on, including free trial software, code snippets, checklists, expert advice, fellow readers, and more at www.apress.com/source-code.

HPC: The Ever-Moving Frontier

High-performance computing, or simply HPC, is mostly concerned with floating-point operations per second, or FLOPS. The more FLOPS you get, the better. For convenience, FLOPS on large HPC systems are typically counted by the trillions (tera, or 10 to the power of 12) and by the quadrillions (peta, or 10 to the power of 15): hence, TeraFLOPS and PetaFLOPS. Performance of stand-alone computers currently hovers at around 1 to 2 TeraFLOPS, which is three orders of magnitude below PetaFLOPS. In other words, you need around a thousand modern computers to get the whole system to the PetaFLOPS level. It will not stay this way forever, for HPC is an ever-moving frontier: ExaFLOPS (10 to the power of 18) are three orders of magnitude above PetaFLOPS, and whole countries are now setting their sights on reaching this level of performance.
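The node-count estimate above is simple arithmetic, and can be checked in a few lines. A minimal sketch; the per-node figure of 2 TeraFLOPS is the one quoted in the text, not a measurement:

```python
# Back-of-the-envelope: how many nodes does it take to reach a given
# aggregate FLOPS level? The 2 TeraFLOPS per-node figure is the
# stand-alone performance quoted in the text above.
TERA = 10**12
PETA = 10**15
EXA = 10**18

per_node_flops = 2 * TERA  # a modern stand-alone computer, per the text

nodes_for_peta = PETA / per_node_flops
nodes_for_exa = EXA / per_node_flops

print(f"Nodes for 1 PetaFLOPS: {nodes_for_peta:.0f}")  # 500
print(f"Nodes for 1 ExaFLOPS:  {nodes_for_exa:.0f}")   # 500000
```

With 1 TeraFLOPS per node instead of 2, the PetaFLOPS figure doubles to a thousand nodes, which is the round number used in the text.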

We have come a long way since the days when computing started in earnest. Back then [sigh!], just before WWII, computing speed was measured by the two hours necessary to crack the daily key settings of the Enigma encryption machine. Tellingly, even then the computations were being done in parallel: each of the several “bombs”1 united six reconstructed Enigma machines and reportedly relieved a hundred human operators of boring and repetitive work.

Computing has progressed a lot since those heady days. There is hardly a better illustration of this than the famous TOP500 list.2 Twice a year, the teams running the most powerful non-classified computers on earth report their performance. This data is then collated and published in time for two major annual trade shows: the International Supercomputing Conference (ISC), typically held in Europe in June, and the Supercomputing Conference (SC), traditionally held in the United States in November.

Figure 1 shows how certain aspects of this list have changed over time.


Figure 1. Observed and projected performance of the Top500 systems (Source: top500.org; used with permission)

There are several observations we can make looking at this graph:3

  1. Performance available in every represented category is growing exponentially (hence, linear graphs in this logarithmic representation).
  2. Only part of this growth comes from the incessant improvement of processor technology, as represented, for example, by Moore’s Law.4 The other part is coming from putting many machines together to form still larger machines.
  3. An extrapolation made on the data obtained so far predicts that an ExaFLOPS machine is likely to appear by 2018. Very soon (around 2016) there may be PetaFLOPS machines at personal disposal.

So, it’s time to learn how to optimize programs for these systems.

Why Optimize?

Optimization is probably the most profitable time investment an engineer can make, as far as programming is concerned. Indeed, a day spent optimizing a program that takes an hour to complete may cut the program turn-around time in half. Each run then saves half an hour, so after 48 runs you will have recovered the full day (24 hours) invested in optimization, and then move into the black.
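The break-even arithmetic can be spelled out explicitly. A minimal sketch using the figures from the text, with a “day” taken as 24 hours so that the numbers match:

```python
# Break-even point for an optimization effort:
# time invested divided by time saved per run.
hours_invested = 24.0   # "a day" spent optimizing, taken as 24 hours
runtime_before = 1.0    # the program took one hour per run
runtime_after = 0.5     # turn-around time cut in half

saved_per_run = runtime_before - runtime_after
break_even_runs = hours_invested / saved_per_run

print(f"Break-even after {break_even_runs:.0f} runs")  # 48
```

The same formula works for any optimization effort: divide the hours invested by the hours saved per run to see how many runs it takes to move into the black.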

Optimization is also a measure of software maturity. Donald Knuth famously said, “Premature optimization is the root of all evil,”5 and he was right in some sense. We will deal with how far this goes when we get closer to the end of this book. In any case, no one should start optimizing what has not been proven to work correctly in the first place. And a correct program is still a very rare and very satisfying piece of art.

Yes, this is not a typo: art. Despite the zillions of thick volumes written and the conferences held almost daily, programming is still more art than science. The same is true of program optimization. It is somewhat akin to architecture: it must include flights of fancy, forensic attention to detail, deep knowledge of the underlying materials, and wide expertise in prior art. Only this combination, plus something else, something intangible and exciting that we call “talent,” makes a good programmer in general and a good optimizer in particular.

Finally, optimization is fun. Some 25 years later, one of us still cherishes the memories of a day when he made a certain graphical program run 300 times faster than it used to. A screen update that had been taking half a minute in the morning became almost instantaneous by midnight. It felt almost like love.

The Top-down Optimization Method

Of course, the optimization process we mention is of the most common type, namely performance optimization. We will deal with this kind of optimization almost exclusively in this book. There are other optimization targets beyond performance, such as code size, data size, and energy consumption, and pursuing them can sometimes hurt performance a lot.

The good news is, once you know what you want to achieve, the methodology is roughly the same. We will look into the details in Chapter 3. Briefly, you proceed in top-down fashion through the levels of the problem under analysis (platform, distributed memory, shared memory, microarchitecture), iterating in a closed-loop manner until you exhaust the optimization opportunities at each level. Keep in mind that a problem fixed at one level may expose a problem somewhere else, so you may need to revisit the higher levels once more.

This approach crystallized quite a while ago. An earlier incarnation of it was formulated by Intel application engineers working in Intel’s application solution centers in the 1990s.6 Our book builds on that solid foundation, taking some things a tad further to account for the time that has passed.

Now, what happens when top-down optimization meets the closed-loop approach? Well, this is a happy marriage. Every single level of the top-down method can be handled by the closed-loop approach. Moreover, the top-down method itself can be enclosed in another, bigger closed loop, where every iteration addresses the biggest remaining problem at whatever level it has been detected. This way, you keep your priorities straight and stay focused.
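The overall flow can be sketched in a few lines. The level names come from the text; `find_bottleneck` and `fix` are hypothetical placeholders standing in for the real analysis and tuning work covered in later chapters:

```python
# A sketch of the top-down, closed-loop optimization flow described above.
# find_bottleneck(level) returns a dict with a "cost" entry, or None if
# that level has no remaining problem; fix(level, problem) addresses it.
LEVELS = ["platform", "distributed memory", "shared memory", "microarchitecture"]

def optimize(find_bottleneck, fix, max_iterations=10):
    """Repeatedly fix the biggest remaining problem, at whatever level
    it shows up, until no level reports a bottleneck."""
    for _ in range(max_iterations):  # the outer, bigger closed loop
        found = [(level, find_bottleneck(level)) for level in LEVELS]
        found = [(lvl, prob) for lvl, prob in found if prob is not None]
        if not found:
            return  # optimization opportunities exhausted at every level
        # Address the biggest remaining problem first: priorities stay straight.
        level, problem = max(found, key=lambda pair: pair[1]["cost"])
        fix(level, problem)
        # A fix at one level may expose a problem elsewhere,
        # so the next iteration re-scans all levels.
```

The key property of the sketch is that the outer loop re-scans every level after each fix, so a problem exposed at another level is picked up on the next pass.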

Intel Parallel Studio XE Cluster Edition

Let there be no mistake: the bulk of HPC is still made up of C and Fortran, MPI, OpenMP, the Linux OS, and Intel Xeon processors. This is what we will focus on, with occasional excursions into several adjacent areas.

There are many good parallel programming packages around, some of them available for free, some sold commercially. However, to the best of our absolutely unbiased professional knowledge, none of them comes anywhere close to Intel Parallel Studio XE Cluster Edition7 in completeness.

Indeed, just look at what it has to offer—and for a very modest price that does not depend on the size of the machines you are going to use, or indeed on their number.

  • Intel Parallel Studio XE Cluster Edition8 compilers and libraries, including:
    • Intel Fortran Compiler9
    • Intel C++ Compiler10
    • Intel Cilk Plus11
    • Intel Math Kernel Library (MKL)12
    • Intel Integrated Performance Primitives (IPP)13
    • Intel Threading Building Blocks (TBB)14
  • Intel MPI Benchmarks (IMB)15
  • Intel MPI Library16
  • Intel Trace Analyzer and Collector17
  • Intel VTune Amplifier XE18
  • Intel Inspector XE19
  • Intel Advisor XE20

All these riches work on the Linux and Microsoft Windows operating systems (sometimes more); support all modern Intel platforms, including, of course, Intel Xeon processors and Intel Xeon Phi coprocessors; and come at a cumulative discount akin to the miracles of the Arabian 1001 Nights. Best of all, Intel runtime libraries traditionally come free of charge.

Certainly, there are good tools beyond Intel Parallel Studio XE Cluster Edition, both offered by Intel and available in the world at large. Whenever possible and sensible, we employ those tools in this book, highlighting their relative advantages and drawbacks compared to those described above. Some of these tools come as open source, some come with the operating system involved; some can be evaluated for free, while others may have to be purchased. Among the alternatives, we focus mostly on open-source, free tools that are easy to get and simple to use.

The Chapters of this Book

This is what awaits you, chapter by chapter:

  1. No Time to Read This Book? helps you out with that burning optimization assignment by providing several proven recipes from an Intel application engineer’s magic toolbox.
  2. Overview of Platform Architectures introduces common terminology, outlines performance features in modern processors and platforms, and shows you how to estimate peak performance for a particular target platform.
  3. Top-down Software Optimization introduces the generic top-down software optimization process flow and the closed-loop approach that will help you keep the challenge of multilevel optimization under secure control.
  4. Addressing System Bottlenecks demonstrates how you can utilize Intel Parallel Studio XE Cluster Edition and other tools to discover and remove system bottlenecks that limit the maximum achievable application performance.
  5. Addressing Application Bottlenecks: Distributed Memory shows how you can identify and remove distributed memory bottlenecks using Intel MPI Library, Intel Trace Analyzer and Collector, and other tools.
  6. Addressing Application Bottlenecks: Shared Memory explains how you can identify and remove threading bottlenecks using Intel VTune Amplifier XE and other tools.
  7. Addressing Application Bottlenecks: Microarchitecture demonstrates how you can identify and remove microarchitecture bottlenecks using Intel VTune Amplifier XE and Intel Composer XE, as well as other tools.
  8. Application Design Considerations deals with the key tradeoffs guiding the design and optimization of applications. You will learn how to make your next program fast from the start.
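As a small taste of the peak-performance estimate mentioned in the Chapter 2 description, a common back-of-the-envelope formula multiplies sockets, cores, clock frequency, and floating-point operations per cycle. The figures below are illustrative assumptions, not the specification of any particular processor:

```python
# Theoretical peak FLOPS of a node:
# sockets * cores per socket * clock frequency * FLOPs per cycle per core.
# All figures here are hypothetical sample values for illustration only.
sockets = 2
cores_per_socket = 12
frequency_hz = 2.7e9    # 2.7 GHz clock
flops_per_cycle = 16    # e.g., wide SIMD units doing fused multiply-adds

peak_flops = sockets * cores_per_socket * frequency_hz * flops_per_cycle
print(f"Theoretical peak: {peak_flops / 1e12:.3f} TeraFLOPS")  # 1.037
```

Note that these sample numbers land at around 1 TeraFLOPS, consistent with the stand-alone machine performance quoted earlier in this introduction; Chapter 2 shows how to do this estimate properly for a real target platform.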

Most chapters are sufficiently self-contained to permit individual reading in any order. However, if you are interested in one particular optimization aspect, you may decide to go through those chapters that naturally cover that topic. Here is a recommended reading guide for several selected topics:

Use your judgment and common sense to find your way around. Good luck!

References

1.    “Bomba_(cryptography),” [Online]. Available: http://en.wikipedia.org/wiki/Bomba_(cryptography).

2.    Top500.Org, “TOP500 Supercomputer Sites,” [Online]. Available: http://www.top500.org/.

3.    Top500.Org, “Performance Development TOP500 Supercomputer Sites,” [Online]. Available: http://www.top500.org/statistics/perfdevel/.

4.    G. E. Moore, “Cramming More Components onto Integrated Circuits,” Electronics, p. 114–117, 19 April 1965.

5.    “Knuth,” [Online]. Available: http://en.wikiquote.org/wiki/Donald_Knuth.

6.    Intel Corporation, “ASC Performance Methodology - Top-Down/Closed Loop Approach,” 1999. [Online]. Available: http://smartdata.usbid.com/datasheets/usbid/2001/2001-q1/asc_methodology.pdf.

7.    Intel Corporation, “Intel Cluster Studio XE,” [Online]. Available: http://software.intel.com/en-us/intel-cluster-studio-xe.

8.    Intel Corporation, “Intel Composer XE,” [Online]. Available: http://software.intel.com/en-us/intel-composer-xe/.

9.    Intel Corporation, “Intel Fortran Compiler,” [Online]. Available: http://software.intel.com/en-us/fortran-compilers.

10.    Intel Corporation, “Intel C++ Compiler,” [Online]. Available: http://software.intel.com/en-us/c-compilers.

11.    Intel Corporation, “Intel Cilk Plus,” [Online]. Available: http://software.intel.com/en-us/intel-cilk-plus.

12.    Intel Corporation, “Intel Math Kernel Library,” [Online]. Available: http://software.intel.com/en-us/intel-mkl.

13.    Intel Corporation, “Intel Integrated Performance Primitives,” [Online]. Available: http://software.intel.com/en-us/intel-ipp.

14.    Intel Corporation, “Intel Threading Building Blocks,” [Online]. Available: http://software.intel.com/en-us/intel-tbb.

15.    Intel Corporation, “Intel MPI Benchmarks,” [Online]. Available: http://software.intel.com/en-us/articles/intel-mpi-benchmarks/.

16.    Intel Corporation, “Intel MPI Library,” [Online]. Available: http://software.intel.com/en-us/intel-mpi-library/.

17.    Intel Corporation, “Intel Trace Analyzer and Collector,” [Online]. Available: http://software.intel.com/en-us/intel-trace-analyzer/.

18.    Intel Corporation, “Intel VTune Amplifier XE,” [Online]. Available: http://software.intel.com/en-us/intel-vtune-amplifier-xe.

19.    Intel Corporation, “Intel Inspector XE,” [Online]. Available: http://software.intel.com/en-us/intel-inspector-xe/.

20.    Intel Corporation, “Intel Advisor XE,” [Online]. Available: http://software.intel.com/en-us/intel-advisor-xe/.

______________________________

*Here and elsewhere, certain product names may be the property of their respective third parties.
