Getting ready

For best performance, a PyCUDA program must therefore exploit every type of memory. In particular, it should make the most of shared memory, minimizing accesses to global memory.

To do this, the problem domain is typically subdivided so that each block of threads can perform its processing on a self-contained subset of the data. In this way, the threads of a block all work on the same shared memory area, optimizing access.

The basic steps for each thread are as follows:

  1. Load data from global memory to shared memory.
  2. Synchronize all threads of the block so that each one can safely read the shared memory positions filled by the other threads.
  3. Process the data in shared memory. A further synchronization is necessary to ensure that shared memory has been updated with the results.
  4. Write the results to global memory.

To clarify this approach, in the following section, we will present an example based on the calculation of the product of two matrices.
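
The following minimal PyCUDA sketch illustrates the load/synchronize/process/synchronize/write pattern described above. The kernel name (block_sum), the block size of 256, and the per-block sum reduction are illustrative assumptions rather than the matrix product example of the next section: each block loads a tile of the input into shared memory, synchronizes, reduces the tile in place (synchronizing after each reduction step), and writes its partial result back to global memory.

```python
# A sketch of the shared memory pattern: load -> sync -> process -> sync -> write.
# The kernel, block size, and input data are illustrative assumptions.
import numpy as np
import pycuda.autoinit          # creates the CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

BLOCK_SIZE = 256                # threads per block (assumed value)

mod = SourceModule("""
#define BLOCK_SIZE 256

__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK_SIZE];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    // 1. Load data from global memory into shared memory.
    tile[tid] = (gid < n) ? in[gid] : 0.0f;

    // 2. Synchronize so that every position of the tile is filled.
    __syncthreads();

    // 3. Process the data in shared memory: pairwise sum reduction.
    //    Synchronizing after each step ensures the updated values are
    //    visible to all threads before the next step reads them.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    // 4. Write the per-block result to global memory.
    if (tid == 0)
        out[blockIdx.x] = tile[0];
}
""")

block_sum = mod.get_function("block_sum")

n = 1 << 20
a = np.random.rand(n).astype(np.float32)
num_blocks = (n + BLOCK_SIZE - 1) // BLOCK_SIZE
partial = np.empty(num_blocks, dtype=np.float32)

block_sum(drv.In(a), drv.Out(partial), np.int32(n),
          block=(BLOCK_SIZE, 1, 1), grid=(num_blocks, 1))

# The sum of the per-block partial results matches the CPU result.
print(np.allclose(partial.sum(), a.sum(), rtol=1e-4))
```

The same structure carries over to the matrix product: each block loads the tiles of the two input matrices it needs into shared memory, synchronizes, accumulates the partial products, and writes its portion of the result to global memory.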
