Chapter 12. Parallelism and Performance

In this chapter, we will cover the following recipes:

  • Just-in-time compiling with Numba
  • Speeding up numerical expressions with Numexpr
  • Running multiple threads with the threading module
  • Launching multiple tasks with the concurrent.futures module
  • Accessing resources asynchronously with the asyncio module
  • Distributed processing with execnet
  • Profiling memory usage
  • Calculating the mean, variance, skewness, and kurtosis on the fly
  • Caching with a least recently used cache
  • Caching HTTP requests
  • Streaming counting with the Count-min sketch
  • Harnessing the power of the GPU with OpenCL

Introduction

The ENIAC, built between 1943 and 1946, filled a large room with about eighteen thousand vacuum tubes and had a 20-bit memory. We have come a long way since then. The growth has been exponential, as predicted by Moore's law. Whether we are dealing with a self-fulfilling prophecy or a fundamental phenomenon is, of course, hard to say. Purportedly, the growth is starting to decelerate.

Given our current knowledge of technology, thermodynamics, and quantum mechanics, we can set hard limits for Moore's law. However, our assumptions may be wrong; for instance, scientists and engineers may come up with fundamentally better techniques to build chips. (One such development is quantum computing, which is currently far from widespread.) The biggest hurdle is heat dissipation, which is commonly measured in units of kT, with k the Boltzmann constant (about 10^-23 J/K) and T the temperature in kelvin (water freezes at 273.15 K). The heat dissipation per bit for a chip is at least kT (on the order of 10^-20 J at 350 K). Semiconductors in the 1990s consumed at least a hundred thousand kT per bit. A computational system undergoes changes in energy levels during operation, and the smallest tolerable difference in energy is roughly 100 kT. Even if we somehow manage to avoid this limit, we will soon be operating close to the atomic scale, which for quantum mechanical reasons is not practical (the information we can have about particles is fundamentally limited), unless we are talking about a quantum computer. Currently, the consensus is that we will reach the limit within decades. Another consideration is the increasingly complex wiring of chips, which lowers their life expectancy considerably.
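As a quick sanity check of these orders of magnitude, the following few lines of Python reproduce the kT figures quoted above (the 350 K operating temperature is simply the value used in the text):

    k = 1.380649e-23     # Boltzmann constant in J/K
    T = 350              # assumed chip operating temperature in kelvin

    kT = k * T           # minimum heat dissipation per bit
    print(kT)            # ~4.8e-21 J, on the order of 10^-20 J
    print(100 * kT)      # roughly the smallest tolerable energy difference
    print(100_000 * kT)  # rough 1990s semiconductor consumption per bit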

This chapter is about software performance; however, there are other, arguably more important, software aspects, such as maintainability, robustness, and usability. Betting on Moore's law is risky and not really practical, since we have other ways to improve performance. The first option is to do as much of the work as possible in parallel, using multiple machines, multiple cores on a single machine, GPUs, or other specialized hardware such as FPGAs. For instance, I am testing the code in this chapter on an eight-core machine. As a student, I was lucky enough to get involved in a project whose goal was to create a grid. The grid was supposed to bring together university computers into a single computational environment. In a later phase, there were plans to connect other computers too, a bit like the SETI project. (As you know, many office computers are idle during weekends and at night, so why not make them work too?)

Currently, of course, there are various commercial cloud systems, such as those provided by Amazon and Google. I will not discuss those here, as they are more specialized topics, although I did cover some Python-specific cloud systems in Python Data Analysis.

The second method to improve performance is to apply caching, thereby avoiding unnecessary function calls. I covered the joblib library, which has a caching feature, in Chapter 9, Ensemble Learning and Dimensionality Reduction. Python 3 has brought us new features for parallelism and caching.
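To give a small taste of the caching recipes later in this chapter, here is a minimal sketch using the least recently used cache that Python 3 provides in the functools module (the fibonacci function is just an illustrative example):

    from functools import lru_cache

    @lru_cache(maxsize=128)       # keep the 128 most recently used results
    def fibonacci(n):
        if n < 2:
            return n
        return fibonacci(n - 1) + fibonacci(n - 2)

    print(fibonacci(100))         # fast, because intermediate results are cached
    print(fibonacci.cache_info()) # hits, misses, and current cache size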

The third method is getting close to the metal. As you know, Python is a high-level programming language with a virtual machine and an interpreter. Python has an extra layer that a language such as C doesn't have. When I was a student, we were taught that C is a high-level language, with assembler and machine code as the lower levels. As far as I know, these days practically nobody codes in assembler. Via Cython (covered in Python Data Analysis) and similar software, we can compile our code to obtain performance on a par with C and C++. Compiling is a hassle, and it reduces portability because of platform dependence. A common solution is to automate compiling with shell scripts and makefiles. Numba and other similar projects make life even easier with just-in-time compiling, although with some limitations.
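As a preview of the first recipe in this chapter, the following minimal sketch shows the kind of just-in-time compilation Numba offers (the sum_squares function is a made-up example):

    import numpy as np
    from numba import jit

    @jit(nopython=True)           # compile to machine code on the first call
    def sum_squares(arr):
        total = 0.0
        for value in arr:
            total += value * value
        return total

    data = np.random.rand(10 ** 6)
    print(sum_squares(data))      # subsequent calls run at C-like speed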
