Chapter 9. GPGPU with F#

It’s easy and natural to write a math formula in F#, and F# is widely used in scientific computation applications, such as those used by trading companies and investment banks. Applications can benefit greatly from being able to directly leverage the hardware, but providing this ability can be a challenge for many .NET languages. The common way for .NET to directly access hardware is by using .NET interop. F# has another weapon called quotations. A quotation shows the program structure and can be used to translate F# code. In this chapter, I will demonstrate how to use quotations to translate F# code to GPU code.

The graphics processing unit (GPU) chip on a graphics card can do something other than render an image for you. Most developers get excited when their machine has four or eight processors, but that is a drop in the bucket when compared to what the GPU offers. GPUs generally have tens and even hundreds of processors, which often sit in an idle state. A general-purpose GPU (GPGPU) takes advantage of these processors and extends them to more general-purpose applications. GPGPUs are not focused on rendering images; they are designed to use a large number of processors to perform parallel actions or computations.

In this chapter, I will describe how to use .NET interop and quotations to directly access hardware. F# does not have built-in GPU support, so an F# library that translates quotations into code that can be executed on a GPU is needed. In addition to the F# translation library, some small samples are listed to show how to leverage the GPU to perform parallel computations.
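To make the idea of translating a quotation concrete, here is a minimal sketch, not the translation library used later in this chapter. It walks a quotation and emits C-like source text for a tiny subset of F# (int parameters, int constants, addition, and multiplication). The names toC, translate, and scaleAdd are hypothetical and exist only for this illustration.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// Translate a tiny subset of F# expressions (int variables, int constants,
// + and *) into C-like source text. Anything else is rejected.
let rec toC (expr : Expr) : string =
    match expr with
    | SpecificCall <@@ (+) @@> (_, _, [l; r]) -> sprintf "(%s + %s)" (toC l) (toC r)
    | SpecificCall <@@ ( * ) @@> (_, _, [l; r]) -> sprintf "(%s * %s)" (toC l) (toC r)
    | Var v -> v.Name
    | Value (v, t) when t = typeof<int> -> string (v :?> int)
    | _ -> failwithf "unsupported expression: %A" expr

// Peel off the lambda parameters, then emit a C-style function definition.
// Every parameter is assumed to be an int in this sketch.
let translate (name : string) (expr : Expr) : string =
    let rec peel acc e =
        match e with
        | Lambda (v, body) -> peel (v :: acc) body
        | body -> List.rev acc, body
    let parameters, body = peel [] expr
    let args = parameters |> List.map (fun v -> "int " + v.Name) |> String.concat ", "
    sprintf "int %s(%s) { return %s; }" name args (toC body)

let source = translate "scaleAdd" <@@ fun (x : int) (y : int) -> x * 2 + y @@>
printfn "%s" source
```

The quotation is never executed here; it is only inspected as data. That is the property that lets F# code be turned into source for another target, such as a GPU.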

Introducing GPU and GPGPU

One of the major F# application areas is financial services. A fundamental problem that financial businesses try to solve is how to perform mathematical computations in real time. From previous chapters, you know that F# is a perfect candidate for implementing mathematical functions, and GPGPUs include a mechanism to provide real-time computations. Adding a working knowledge of GPGPUs to your skill set can help you solve these kinds of real-world problems.

Note

This chapter is about how to use F# to leverage GPGPU to perform computations on the GPU. The code can be downloaded from the F# sample pack (http://fsharp3sample.codeplex.com). The code is located in the OtherSamples folder.

According to Wikipedia, the definition for GPU is the following.

A graphics processing unit (GPU), also occasionally called visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel. In a personal computer, a GPU can be present on a video card, or it can be on the motherboard or—in certain CPUs—on the CPU die. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a dedicated video card.

For many people, the GPU is most relevant to computer games. However, the GPU has a number of other uses. Do you know that Amazon’s GPU cluster (http://aws.amazon.com/ec2/instance-types/) can be used to enable high performance computations in the cloud? Maybe the application on your mobile phone or iPad is powered by the GPU running on a cloud cluster. Do you know that investment firms and brokerages are implementing GPU programming in their computation platforms to allow applications to quickly determine when to buy or sell stock? Any software developer working on these applications will have a lucrative career. Now might be a good time to take a second look at the small chip that has been ignored for a long time.

Wikipedia defines GPGPU as follows:

General-purpose computing on graphics processing units (GPGPU, GPGP or less often GP²U) is the means of using a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.

There are several GPU solutions on the market. OpenCL is the currently dominant open, general-purpose GPU computing language. The dominant proprietary framework is NVIDIA’s CUDA. In this chapter, I use CUDA as the target framework.

CUDA

According to NVIDIA, the definition of CUDA is as follows:

CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).

CUDA provides a GPGPU platform including drivers, a development SDK, and an execution environment. In this chapter, CUDA is used for all samples. To run the samples from this chapter, you should install an NVIDIA graphics card and download the CUDA SDK from http://developer.nvidia.com/cuda/cuda-downloads. There are three installation packages:

CUDA Toolkit
Graphics drivers
Development SDK

After you successfully install these three packages, the CUDA development environment is ready to use. The installation package creates several environment variables, as shown in Figure 9-1. And from the command line, you can execute the NVCC command, as shown in Figure 9-2.

Figure 9-1. Environment variables created by the CUDA installation package

Figure 9-2. Executing NVCC from the command line

The first task is to retrieve the graphics card property using F# interop. You need to define the struct that holds the graphics card information. Example 9-1 defines the CUDA structure in F#. cudaError is an enum structure that defines all CUDA error codes. SizeT is a struct wrapper for the IntPtr type. The CUDADeviceProp structure defines the graphics card properties. For this sample, there is a Name property that wraps the device property’s name value.

Example 9-1. CUDA data structure definition

module CUDADataStructure

open System
open System.Runtime.InteropServices

// CUDA error enumeration
type cudaError =
    | cudaErrorAddressOfConstant = 22
    | cudaErrorApiFailureBase = 10000
    | cudaErrorCudartUnloading = 29
    | cudaErrorInitializationError = 3
    | cudaErrorInsufficientDriver = 35
    | cudaErrorInvalidChannelDescriptor = 20
    | cudaErrorInvalidConfiguration = 9
    | cudaErrorInvalidDevice = 10
    | cudaErrorInvalidDeviceFunction = 8
    | cudaErrorInvalidDevicePointer = 17
    | cudaErrorInvalidFilterSetting = 26
    | cudaErrorInvalidHostPointer = 16
    | cudaErrorInvalidMemcpyDirection = 21
    | cudaErrorInvalidNormSetting = 27
    | cudaErrorInvalidPitchValue = 12
    | cudaErrorInvalidResourceHandle = 33
    | cudaErrorInvalidSymbol = 13
    | cudaErrorInvalidTexture = 18
    | cudaErrorInvalidTextureBinding = 19
    | cudaErrorInvalidValue = 11
    | cudaErrorLaunchFailure = 4
    | cudaErrorLaunchOutOfResources = 7
    | cudaErrorLaunchTimeout = 6
    | cudaErrorMapBufferObjectFailed = 14
    | cudaErrorMemoryAllocation = 2
    | cudaErrorMemoryValueTooLarge = 32
    | cudaErrorMissingConfiguration = 1
    | cudaErrorMixedDeviceExecution = 28
    | cudaErrorNoDevice = 37
    | cudaErrorNotReady = 34
    | cudaErrorNotYetImplemented = 31
    | cudaErrorPriorLaunchFailure = 5
    | cudaErrorSetOnActiveProcess = 36
    | cudaErrorStartupFailure = 127
    | cudaErrorSynchronizationError = 25
    | cudaErrorTextureFetchFailed = 23
    | cudaErrorTextureNotBound = 24
    | cudaErrorUnknown = 30
    | cudaErrorUnmapBufferObjectFailed = 15
    | cudaErrorIncompatibleDriverContext = 49
    | cudaSuccess = 0
[<Struct>]
type SizeT =
    val value : IntPtr
    new (n:int) = { value = IntPtr(n) }
    new (n:int64) = { value = IntPtr(n) }

[<Struct>]
type CUDADeviceProp =
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 256)>]
    val nameChar : char array
    val totalGlobalMem : SizeT
    val sharedMemPerBlock : SizeT
    val regsPerBlock : int
    val warpSize : int
    val memPitch : SizeT
    val maxThreadsPerBlock : int
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxThreadsDim : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxGridSize : int array
    val clockRate : int
    val totalConstMem : SizeT
    val major : int
    val minor : int
    val textureAlignment: SizeT
    val deviceOverlap : int
    val multiProcessorCount : int
    val kernelExecTimeoutEnabled : int
    val integrated : int
    val canMapHostMemory : int
    val computeMode : int
    val maxTexture1D : int
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 2)>]
    val maxTexture2D : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxTexture3D : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 2)>]
    val maxTexture1DLayered : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxTexture2DLayered : int array
    val surfaceAlignment : SizeT
    val concurrentKernels : int
    val ECCEnabled : int
    val pciBusID : int
    val pciDeviceID : int
    val pciDomainID : int
    val tccDriver : int
    val asyncEngineCount : int
    val unifiedAddressing : int
    val memoryClockRate : int
    val memoryBusWidth : int
    val l2CacheSize : int
    val maxThreadsPerMultiProcessor : int
    member this.Name = String(this.nameChar).Trim('\000')

Note

In F#, the NULL character is represented as '\000', which is different from C#’s '\0'.
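The following small sketch shows why the trim is needed. It simulates a fixed-size, null-padded name buffer like the one native code fills in and then removes the padding. It runs without a GPU; the buffer size and the device name are made up for the example.

```fsharp
open System

// Simulate the fixed-size, null-padded name buffer that native code fills in.
// The device name used here is made up for the example.
let buffer : char[] = Array.create 16 '\000'
"GeForce".ToCharArray() |> Array.iteri (fun i c -> buffer.[i] <- c)

// String(buffer) keeps the padding; Trim('\000') removes it.
let padded = String(buffer)
let name = padded.Trim('\000')
printfn "%d -> %d: %s" padded.Length name.Length name
```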

The interop code from the F# side is simple. The CUDA binary has 32-bit and 64-bit versions. Because of this, you define two modules, named CUDA32 and CUDA64. The CudaRT files are located under the CUDA installation folder. If the default installation path is used, the path should look like C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.2\C\common\bin. The interop code is shown in Example 9-2.

Example 9-2. F# CUDA interop code

open System
open System.Runtime.InteropServices
open CUDADataStructure

module CUDA32 =
    [<Literal>]
    let dllName = "cudart32_42_9"

    [<DllImport(dllName)>]
    extern cudaError cudaGetDeviceProperties(CUDADeviceProp& prop, int device)

    [<DllImport(dllName)>]
    extern cudaError cudaDeviceGetLimit(SizeT& pSize, cudaLimit limit)

module CUDA64 =
    [<Literal>]
    let dllName = "cudart64_42_9"

    [<DllImport(dllName)>]
    extern cudaError cudaGetDeviceProperties(CUDADeviceProp& prop, int device)

    [<DllImport(dllName)>]
    extern cudaError cudaDeviceGetLimit(SizeT& pSize, cudaLimit limit)

[<EntryPoint>]
let main argv =
    let mutable prop = CUDADeviceProp()

    // get the first graphic card by passing in 0
    let returnCode = CUDA32.cudaGetDeviceProperties (&prop, 0)

    printfn "%A - %A" returnCode prop.Name

    ignore <| System.Console.ReadKey()
    0 // return an integer exit code

Note

The constant value cudart64_42_9 could be different depending on the CUDA SDK installed on your computer. The file name shown in Example 9-2 is based on CUDA 4.2. For the CUDA 4.0 SDK, the DLL file name is something like cudart32_40_17.

The F# extern method requires an ampersand (&) to pass a variable reference.

The execution result is shown in Figure 9-3.

Figure 9-3. Getting the device property execution result
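Example 9-2 calls CUDA32 directly. Because the chapter defines both a CUDA32 and a CUDA64 module, one possible way to choose between them at run time is to check the pointer size of the current process. The sketch below is an assumption, not part of the sample pack: selectByBitness is a hypothetical helper, and fake32 and fake64 are stand-ins for wrapper functions around the two modules so that the code runs without a GPU.

```fsharp
open System

// Hypothetical helper: pick one of two implementations based on the
// pointer size of the current process (8 bytes means a 64-bit process).
let selectByBitness (impl32 : 'a -> 'b) (impl64 : 'a -> 'b) : 'a -> 'b =
    if IntPtr.Size = 8 then impl64 else impl32

// Stand-ins for wrappers around CUDA32 and CUDA64 functions, so the
// sketch runs without a GPU or the CUDA runtime installed.
let fake32 (device : int) = sprintf "32-bit runtime, device %d" device
let fake64 (device : int) = sprintf "64-bit runtime, device %d" device

let describeDevice = selectByBitness fake32 fake64
printfn "%s" (describeDevice 0)
```

Extern functions that take byref parameters cannot be passed around as first-class function values directly, so real code would first wrap each extern call in an ordinary function that returns a tuple, in the same way the Version member in Example 9-6 does.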

The graphics card has some limitations, such as stack size. You need to know these limitations because your future coding needs to take these into consideration. The limitations are defined in our F# code as shown in Example 9-3.

Example 9-3. cudaLimit enumeration

type cudaLimit =
    | cudaLimitStackSize = 0
    | cudaLimitPrintfFifoSize = 1
    | cudaLimitMallocHeapSize = 2

The interop code needed to get the device limitation information is shown in Example 9-4, and the execution result is displayed in Figure 9-4. For most development activities, it is uncommon to have to query your hardware’s limitations. This is because the operating system handles these limitations for you. However, this is not the case when programming against a GPU. For any code executed on the GPU, it is the developer’s responsibility to understand and respect these limitations.

Example 9-4. Querying the limitations for the graphics device

let limitCategories = [
        cudaLimit.cudaLimitStackSize
        cudaLimit.cudaLimitPrintfFifoSize
        cudaLimit.cudaLimitMallocHeapSize ]

limitCategories |>
Seq.iter (fun category ->
                let mutable limit = SizeT()
                let returnCode = CUDA32.cudaDeviceGetLimit(&limit, category)
                printfn "%A: %A - %A" returnCode category limit.value)

Figure 9-4. Getting the device limitations execution result

For device management, there are other commonly used functions (which are shown in Example 9-5):

Reset device function. This function cleans up all resources on the current device associated with the current process. If there are multiple threads on the current process, it is the developer’s responsibility to make sure the other thread or threads do not access this device.
Get device count function. This function returns the number of devices.
Set device flag function. This is useful for allowing you to set flags to configure how the GPU and CPU work together.

When the GPU is processing the data, there is a CPU thread waiting for the result to come back. The following flags configure how the CPU thread waits for the return value:

cudaDeviceScheduleAuto is the default value. If the number of active CUDA contexts is larger than the number of logical processors, the CUDA thread yields to other operating system threads while it waits. Otherwise, it spins on the processor and does not yield to other threads.
cudaDeviceScheduleSpin instructs the CUDA thread on the CPU to spin, never yielding to other threads, until it gets the result from the device. This can decrease the latency when waiting for results from the device, but it can slow down other threads running on the CPU.
cudaDeviceScheduleYield is the opposite of cudaDeviceScheduleSpin. It instructs the CUDA thread to yield to other operating system threads on the CPU when waiting for the result from the device.
cudaDeviceScheduleBlockingSync instructs the CUDA thread on the CPU to block on a synchronization primitive while waiting for the result from the device.

Example 9-5. Device flag definition

Code definition

type cudaDeviceFlag =
    | cudaDeviceScheduleAuto = 0
    | cudaDeviceScheduleSpin = 1
    | cudaDeviceScheduleYield = 2
    | cudaDeviceScheduleBlockingSync = 4

[<DllImport(dllName)>]
extern cudaError cudaDeviceReset()

[<DllImport(dllName)>]
extern cudaError cudaGetDeviceCount(int& count)

[<DllImport(dllName)>]
extern cudaError cudaSetDeviceFlags(cudaDeviceFlag flags)

Invoking the code

let mutable deviceCount = 0
let returnCode = CUDA32.cudaGetDeviceCount(&deviceCount)
printfn "%A: %A" returnCode deviceCount

let returnCode = CUDA32.cudaSetDeviceFlags(cudaDeviceFlag.cudaDeviceScheduleAuto)
printfn "set flag return code %A" returnCode

let returnCode = CUDA32.cudaDeviceReset()
printfn "reset device %A" returnCode

In addition to the hardware information, the driver information is also important, because hardware functionalities are exposed by the device driver. Different hardware exposes different APIs and has different memory capacity. The managed language developer does not have to consider the memory because the operating system and .NET Framework take care of memory management. However, the developer does need to keep the hardware configuration in mind when programming for the GPU. Therefore, it is important to know the current driver version. Example 9-6 shows how to get the driver version.

Example 9-6. Getting the driver version

type CUResult =
    | Success = 0
    | ErrorInvalidValue = 1
    | ErrorOutOfMemory = 2
    | ErrorNotInitialized = 3
    | ErrorDeinitialized = 4
    | ErrorNoDevice = 100
    | ErrorInvalidDevice = 101
    | ECCUncorrectable = 214
    | ErrorAlreadyAcquired = 210
    | ErrorAlreadyMapped = 208
    | ErrorArrayIsMapped = 207
    | ErrorContextAlreadyCurrent = 202
    | ErrorFileNotFound = 301
    | ErrorInvalidImage = 200
    | ErrorInvalidContext = 201
    | ErrorInvalidHandle = 400
    | ErrorInvalidSource = 300
    | ErrorLaunchFailed = 700
    | ErrorLaunchIncompatibleTexturing = 703
    | ErrorLaunchOutOfResources = 701
    | ErrorLaunchTimeout = 702
    | ErrorMapFailed = 205
    | ErrorNoBinaryForGPU = 209
    | ErrorNotFound = 500
    | ErrorNotMapped = 211
    | ErrorNotReady = 600
    | ErrorUnmapFailed = 206
    | NotMappedAsArray = 212
    | NotMappedAsPointer = 213
    | PointerIs64Bit = 800
    | SizeIs64Bit = 801
    | ErrorUnknown = 999

module InteropLibrary =
    [<DllImport("nvcuda")>]
    extern CUResult cuDriverGetVersion(int& driverVersion)

type CUDADriver() =
    member this.Version =
        let mutable version = 0
        (InteropLibrary.cuDriverGetVersion(&version), version)
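
The version comes back as a single integer. CUDA encodes it as 1000 times the major version plus 10 times the minor version, so CUDA 4.2 is reported as 4020. A small helper such as the following can split the value into its parts; decodeCudaVersion is a hypothetical name, not part of the CUDA API.

```fsharp
// Split a CUDA version number, encoded as 1000 * major + 10 * minor,
// into a (major, minor) tuple. For example, 4020 stands for CUDA 4.2.
let decodeCudaVersion (version : int) : int * int =
    (version / 1000, (version % 1000) / 10)

let major, minor = decodeCudaVersion 4020
printfn "CUDA %d.%d" major minor
```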

Note

There is a cudaError enumeration defined in Example 9-1. The CUResult enumeration in Example 9-6 seems to serve the same purpose. However, CUResult is part of the CUDA driver API, while cudaError is part of the CUDA runtime API. The runtime API is built on top of the driver API. Programming against the runtime API is simpler, but the driver API provides finer control over the device. There is no performance difference between these two types of APIs. It is recommended that you refrain from combining these two kinds of APIs in one application. Functions in the driver API are prefixed with cu, while functions in the runtime API are prefixed with cuda.

To make the GPU compute the data, the data should first be loaded into memory. The computations happen within the GPU’s memory. You refer to the GPU as device and the CPU as host. The CPU memory is called the host memory, and the GPU memory is called the device memory. The memory-management function is defined in Example 9-7. The enum type CUDAMemcpyKind defines four types of memory copy operations.

Example 9-7. CUDA device memory-management function

namespace CUDARuntime

open System
open System.Text
open System.Collections.Generic
open System.Runtime.InteropServices
open CUDADataStructure
type CUDAMemcpyKind =
    | cudaMemcpyHostToHost = 0
    | cudaMemcpyHostToDevice = 1
    | cudaMemcpyDeviceToHost = 2
    | cudaMemcpyDeviceToDevice = 3

module CUDARuntime64 =
    [<Literal>]
    let dllName = "cudart64_40_17"

    [<DllImport(dllName)>]
    extern cudaError cudaMemcpy(IntPtr dst, IntPtr src, SizeT count, CUDAMemcpyKind kind)

    [<DllImport(dllName)>]
    extern cudaError cudaMalloc(IntPtr& p, SizeT size)

    [<DllImport(dllName)>]
    extern cudaError cudaMemset(IntPtr p, int value, SizeT count)

module CUDARuntime32 =
    [<Literal>]
    let dllName = "cudart32_40_17"

    [<DllImport(dllName)>]
    extern cudaError cudaMemcpy(IntPtr dst, IntPtr src, SizeT count, CUDAMemcpyKind kind)

    [<DllImport(dllName)>]
    extern cudaError cudaMalloc(IntPtr& p, SizeT size)

    [<DllImport(dllName)>]
    extern cudaError cudaMemset(IntPtr p, int value, SizeT count)

Example 9-8 shows how to copy data between the host and device memory, and its result is shown in Figure 9-5. Because memory copying is a bottleneck for GPGPU computations, it is good practice to minimize the number of copies between host memory and device memory. The best practice is to keep the data in device memory as long as possible.

Example 9-8. Transferring data between device and host memory

open System
open System.Runtime.InteropServices
open CUDADataStructure
open CUDARuntime

let test5() =
    let arr = [|1.f; 2.f; 3.f; 4.f; 5.f; 1.f; 2.f; 3.f; 4.f; 5.f|]
    let arr2 = [|11.f; 12.f; 13.f; 14.f; 15.f; 11.f; 12.f; 13.f; 14.f; 15.f|]

    // pin both arrays so the garbage collector cannot move them during the copy
    let handle = GCHandle.Alloc(arr, GCHandleType.Pinned)
    let handle2 = GCHandle.Alloc(arr2, GCHandleType.Pinned)
    try
        let intptr = handle.AddrOfPinnedObject()
        let intptr2 = handle2.AddrOfPinnedObject()

        // the arrays hold float32 values, which are 4 bytes each
        let size = SizeT(arr.Length * sizeof<float32>)

        // allocate device memory, copy arr to the device, and copy it back into arr2
        let mutable ptr = IntPtr.Zero
        let error = CUDARuntime32.cudaMalloc(&ptr, size)
        let error = CUDARuntime32.cudaMemcpy(ptr,
                                             intptr,
                                             size,
                                             CUDAMemcpyKind.cudaMemcpyHostToDevice)
        let error = CUDARuntime32.cudaMemcpy(intptr2,
                                             ptr,
                                             size,
                                             CUDAMemcpyKind.cudaMemcpyDeviceToHost)
        printfn "%A - %A" arr arr2
    finally
        handle.Free()
        handle2.Free()

Figure 9-5. Execution result from the code in Example 9-8
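The memory functions take sizes in bytes, so the element type matters when computing how much to allocate or copy. In F#, float32 is a 4-byte value and float is an 8-byte value. The helper below is a hypothetical illustration of computing the byte size from the array's element type instead of hard-coding it; byteSize is not part of the sample pack.

```fsharp
open System.Runtime.InteropServices

// Hypothetical helper: number of bytes occupied by the elements of an array,
// derived from the element type rather than a hard-coded constant.
let byteSize (arr : 'T[]) : int = arr.Length * Marshal.SizeOf(typeof<'T>)

let singles = [| 1.f; 2.f; 3.f; 4.f; 5.f |]   // float32: 4 bytes each
let doubles = [| 1.0; 2.0; 3.0; 4.0; 5.0 |]   // float: 8 bytes each
printfn "%d %d" (byteSize singles) (byteSize doubles)
```

Passing a size that is too large to a device-to-host copy writes past the end of the managed array, so deriving the size from the element type is a cheap safeguard.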

This section demonstrated how to configure the device and copy data to or from the device. There are several CUDA libraries provided by NVIDIA. The CUDA runtime and CUDA driver API provide the basic ability to program against the GPU. In addition, CUDA provides a rich set of libraries to boost this development. In the next section, I am going to demonstrate how to use F# to invoke these libraries and perform basic computations on the GPU.

CUDA Toolkit

Some libraries, collectively known as the NVIDIA CUDA Toolkit, are provided with CUDA. According to the NVIDIA website (http://developer.nvidia.com/cuda/cuda-toolkit), the NVIDIA CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. The CUDA Toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of your applications. You’ll also find programming guides, user manuals, an API reference, and other documentation to help you get started with accelerating your application with the GPU.

In this section, two libraries—cuRAND and cuBLAS—are used to demonstrate how to use PInvoke to invoke CUDA functions from F#.

cuRAND Library

The first library presented is the NVIDIA CUDA Random Number Generation library (cuRAND). According to the NVIDIA website (http://developer.nvidia.com/cuda/curand), cuRAND delivers high-performance, GPU-accelerated random number generation (RNG). The cuRAND library delivers high-quality random numbers using hundreds of processor cores available in the NVIDIA GPU. Random number generation is a basic building block for simulations such as Monte Carlo simulation, so a faster generator can improve the performance of the whole simulation.
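To show why uniform random numbers are the building block of a Monte Carlo simulation, here is a minimal CPU-only sketch that estimates pi. It uses System.Random as a stand-in for the generator; on the GPU, the samples would come from cuRAND instead. The function name estimatePi, the seed, and the sample count are arbitrary choices for this illustration.

```fsharp
open System

// Estimate pi by sampling points uniformly in the unit square and counting
// how many fall inside the quarter circle of radius 1.
let estimatePi (seed : int) (samples : int) : float =
    let rng = Random(seed)
    let mutable inside = 0
    for i in 1 .. samples do
        let x = rng.NextDouble()
        let y = rng.NextDouble()
        if x * x + y * y <= 1.0 then inside <- inside + 1
    4.0 * float inside / float samples

let estimate = estimatePi 42 200000
printfn "pi is approximately %f" estimate
```

Almost all of the work is drawing samples, and every sample is independent of the others, which is why this kind of simulation maps well onto a GPU.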

There are several enumeration types defined for the cuRAND library. The RandGenerator structure, shown in Example 9-9, is used to reference the random generator.

Example 9-9. cuRAND enumeration types

type curandStatus =
    | CURAND_SUCCESS = 0
    | CURAND_VERSION_MISMATCH = 100
    | CURAND_NOT_INITIALIZED = 101
    | CURAND_ALLOCATION_FAILED = 102
    | CURAND_TYPE_ERROR = 103
    | CURAND_OUT_OF_RANGE = 104
    | CURAND_LENGTH_NOT_MULTIPLE = 105
    | CURAND_LAUNCH_FAILURE = 201
    | CURAND_PREEXISTING_FAILURE = 202
    | CURAND_INITIALIZATION_FAILED = 203
    | CURAND_ARCH_MISMATCH = 204
    | CURAND_INTERNAL_ERROR = 999

type CUDARandomRngType =
    | CURAND_TEST = 0
    | CURAND_PSEUDO_DEFAULT = 100
    | CURAND_PSEUDO_XORWOW = 101
    | CURAND_QUASI_DEFAULT = 200
    | CURAND_QUASI_SOBOL32 = 201
    | CURAND_QUASI_SCRAMBLED_SOBOL32 = 202
    | CURAND_QUASI_SOBOL64 = 203
    | CURAND_QUASI_SCRAMBLED_SOBOL64 = 204

type CUDARandomOrdering =
    | CURAND_PSEUDO_BEST = 100
    | CURAND_PSEUDO_DEFAULT = 101
    | CURAND_PSEUDO_SEEDED = 102
    | CURAND_QUASI_DEFAULT = 201

type CUDADirectionVectorSet =
    | CURAND_VECTORS_32_JOEKUO6 = 101
    | CURAND_DIRECTION_VECTORS_32_JOEKUO6 = 102
    | CURAND_VECTORS_64_JOEKUO6 = 103
    | CURAND_DIRECTION_VECTORS_64_JOEKUO6 = 104

[<Struct>]
type RandGenerator =
    val handle : uint32

Because the library has both an x86 and x64 flavor, the F# cuRAND library needs to provide two versions as well, as shown in Example 9-10. The CUDAPointer struct wraps the pointer to the data stored in the GPU memory.

Example 9-10. cuRAND library code

open System
open System.Text
open System.Collections.Generic
open System.Runtime.InteropServices
open CUDADataStructure
open CUDARuntime

[<Struct>]
type CUDAPointer =
    val Pointer : IntPtr
    new(ptr) = { Pointer = ptr }
    new(cudaPointer:CUDAPointer) = { Pointer = cudaPointer.Pointer }
    member this.PointerSize with get() = IntPtr.Size

[<Struct>]
type RandDirectionVectors32 =
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 32)>]
    val direction_vectors :  uint32[]

[<Struct>]
type RandDirectionVectors64 =
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 64)>]
    val direction_vectors :  uint64[]

// CUDA random generator x86 version
module CUDARandomDriver32 =
    [<Literal>]
    let dllName =  "curand32_40_17"
    [<DllImport(dllName)>]
    extern curandStatus curandCreateGenerator(RandGenerator& generator,
                                                      CUDARandomRngType rng_type)
    [<DllImport(dllName)>]
    extern curandStatus curandCreateGeneratorHost(RandGenerator& generator,
                                                           CUDARandomRngType rng_type)
    [<DllImport(dllName)>]
    extern curandStatus curandDestroyGenerator(RandGenerator generator)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerate(RandGenerator generator,
                                              IntPtr outputPtr,
                                              SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateLogNormal(RandGenerator generator,
                                                         IntPtr outputPtr,
                                                         SizeT n,
                                                         float32 mean,
                                                         float32 stddev) // the native parameters are 32-bit floats
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateLogNormalDouble(RandGenerator generator,
                                                                IntPtr outputPtr,
                                                                SizeT n, double mean,
                                                                double stddev)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateLongLong(RandGenerator generator,
                                                        IntPtr outputPtr,
                                                        SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateNormal(RandGenerator generator,
                                                     IntPtr outputPtr,
                                                     SizeT n, float32 mean, float32 stddev) // the native parameters are 32-bit floats
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateNormalDouble(RandGenerator generator,
                                                             IntPtr outputPtr,
                                                             SizeT n, double mean, double stddev)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateSeeds(RandGenerator generator)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateUniform(RandGenerator generator,
                                                       IntPtr outputPtr, SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateUniformDouble(RandGenerator generator,
                                                              IntPtr outputPtr, SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGetDirectionVectors32(RandDirectionVectors32& vectors,
                                                              CUDADirectionVectorSet set)
    [<DllImport(dllName)>]
    extern curandStatus curandGetDirectionVectors64(RandDirectionVectors64& vectors,
                                                              CUDADirectionVectorSet set)
    [<DllImport(dllName)>]
    extern curandStatus curandGetScrambleConstants32(IntPtr& constants)
    [<DllImport(dllName)>]
    extern curandStatus curandGetScrambleConstants64(IntPtr& constants)
    [<DllImport(dllName)>]
    extern curandStatus curandGetVersion(int& version)
    [<DllImport(dllName)>]
    extern curandStatus curandSetGeneratorOffset(RandGenerator generator, uint64 offset)
    [<DllImport(dllName)>]
    extern curandStatus curandSetGeneratorOrdering(RandGenerator generator,
                                                             CUDARandomOrdering order)
    [<DllImport(dllName)>]
    extern curandStatus curandSetPseudoRandomGeneratorSeed(RandGenerator generator,
                                                                      uint64 seed)
    [<DllImport(dllName)>]
    extern curandStatus curandSetQuasiRandomGeneratorDimensions(RandGenerator generator,
                                                                            uint32 num_dimensions)
    [<DllImport(dllName)>]
    extern curandStatus curandSetStream(RandGenerator generator, CUDAStream stream)

    let CreateGenerator(rng_type) =
        let mutable generator = Unchecked.defaultof<RandGenerator>
        let r = curandCreateGenerator(&generator, rng_type)
        (r, generator)
    let DestroyGenerator(generator) =
        curandDestroyGenerator(generator)

    let SetPseudoRandomGeneratorSeed(generator, seed) =
        curandSetPseudoRandomGeneratorSeed(generator, seed)

    let SetGeneratorOffset(generator, offset) =
        curandSetGeneratorOffset(generator, offset)

    let SetGeneratorOrdering(generator, order) =
        curandSetGeneratorOrdering(generator, order)

    let SetQuasiRandomGeneratorDimensions(generator, dimensions) =
        curandSetQuasiRandomGeneratorDimensions(generator, dimensions)

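    let CopyToHost(out:'T array, cudaPtr:CUDAPointer) =
        let devPtr = cudaPtr.Pointer
        // pin the managed array so that its address stays valid during the copy
        let handle = GCHandle.Alloc(out, GCHandleType.Pinned)
        try
            let outputPtr = handle.AddrOfPinnedObject()
            // use the element type of the output array to compute the byte size
            let unitSize = Marshal.SizeOf(typeof<'T>)
            let size = SizeT(out.Length * unitSize)
            CUDARuntime32.cudaMemcpy(outputPtr, devPtr, size,
                                     CUDAMemcpyKind.cudaMemcpyDeviceToHost)
        finally
            // release the pin so that the array can be moved and collected again
            handle.Free()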

    // Note: cudaMalloc takes a size in bytes, while the cuRAND generate
    // functions take the number of elements to generate.
    let GenerateUniform(generator, n:int) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime32.cudaMalloc(&devicePtr, size)
        let r = curandGenerateUniform(generator, devicePtr, SizeT(n))
        (r, CUDAPointer(devicePtr))

    let GenerateUniformDouble(generator, n:int) =
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime32.cudaMalloc(&devicePtr, size)
        let r = curandGenerateUniformDouble(generator, devicePtr, SizeT(n))
        (r, CUDAPointer(devicePtr))

    let GenerateNormal(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime32.cudaMalloc(&devicePtr, size)
        let r = curandGenerateNormal(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

    let GenerateNormalDouble(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime32.cudaMalloc(&devicePtr, size)
        let r = curandGenerateNormalDouble(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

    let GenerateLogNormal(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime32.cudaMalloc(&devicePtr, size)
        let r = curandGenerateLogNormal(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

    let GenerateLogNormalDouble(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime32.cudaMalloc(&devicePtr, size)
        let r = curandGenerateLogNormalDouble(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

// CUDA random generator x64 version
module CUDARandomDriver64 =
    [<Literal>]
    let dllName =  "curand64_40_17"
    [<DllImport(dllName)>]
    extern curandStatus curandCreateGenerator(RandGenerator& generator,
                                                      CUDARandomRngType rng_type)
    [<DllImport(dllName)>]
    extern curandStatus curandCreateGeneratorHost(RandGenerator& generator,
                                                           CUDARandomRngType rng_type)
    [<DllImport(dllName)>]
    extern curandStatus curandDestroyGenerator(RandGenerator generator)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerate(RandGenerator generator, IntPtr outputPtr,
                                       SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateLogNormal(RandGenerator generator,
                                                         IntPtr outputPtr,
                                                         SizeT n, float mean,
                                                         float stddev)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateLogNormalDouble(RandGenerator generator,
                                                                IntPtr outputPtr,
                                                                SizeT n, double mean,
                                                                double stddev)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateLongLong(RandGenerator generator,
                                                        IntPtr outputPtr, SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateNormal(RandGenerator generator,
                                                     IntPtr outputPtr,
                                                     SizeT n, float mean, float stddev)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateNormalDouble(RandGenerator generator,
                                                            IntPtr outputPtr,
                                                            SizeT n, double mean,
                                                            double stddev)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateSeeds(RandGenerator generator)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateUniform(RandGenerator generator,
                                                      IntPtr outputPtr, SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGenerateUniformDouble(RandGenerator generator,
                                                              IntPtr outputPtr, SizeT num)
    [<DllImport(dllName)>]
    extern curandStatus curandGetDirectionVectors32(RandDirectionVectors32& vectors,
                                                              CUDADirectionVectorSet set)
    [<DllImport(dllName)>]
    extern curandStatus curandGetDirectionVectors64(RandDirectionVectors64& vectors,
                                                              CUDADirectionVectorSet set)
    [<DllImport(dllName)>]
    extern curandStatus curandGetScrambleConstants32(IntPtr& constants)
    [<DllImport(dllName)>]
    extern curandStatus curandGetScrambleConstants64(IntPtr& constants)
    [<DllImport(dllName)>]
    extern curandStatus curandGetVersion(int& version)
    [<DllImport(dllName)>]
    extern curandStatus curandSetGeneratorOffset(RandGenerator generator, uint64 offset)
    [<DllImport(dllName)>]
    extern curandStatus curandSetGeneratorOrdering(RandGenerator generator,
                                                             CUDARandomOrdering order)
    [<DllImport(dllName)>]
    extern curandStatus curandSetPseudoRandomGeneratorSeed(RandGenerator generator,
                                                                      uint64 seed)
    [<DllImport(dllName)>]
    extern curandStatus curandSetQuasiRandomGeneratorDimensions(RandGenerator generator,
                                                                uint32 num_dimensions)
    [<DllImport(dllName)>]
    extern curandStatus curandSetStream(RandGenerator generator, CUDAStream stream)

    let CreateGenerator(rng_type) =
        let mutable generator = Unchecked.defaultof<RandGenerator>
        let r = curandCreateGenerator(&generator, rng_type)
        (r, generator)

    let DestroyGenerator(generator) =
        curandDestroyGenerator(generator)

    let SetPseudoRandomGeneratorSeed(generator, seed) =
        curandSetPseudoRandomGeneratorSeed(generator, seed)

    let SetGeneratorOffset(generator, offset) =
        curandSetGeneratorOffset(generator, offset)

    let SetGeneratorOrdering(generator, order) =
        curandSetGeneratorOrdering(generator, order)

    let SetQuasiRandomGeneratorDimensions(generator, dimensions) =
        curandSetQuasiRandomGeneratorDimensions(generator, dimensions)

    let GenerateUniform(generator, n:int) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)      // buffer size in bytes
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        // The allocation status is not checked here
        CUDARuntime64.cudaMalloc(&devicePtr, size) |> ignore
        // cuRAND expects the number of values to generate, not the byte count
        let r = curandGenerateUniform(generator, devicePtr, SizeT(n))
        (r, CUDAPointer(devicePtr))

    let GenerateUniformDouble(generator, n:int) =
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        CUDARuntime64.cudaMalloc(&devicePtr, size) |> ignore
        // The buffer holds doubles, so call the double-precision generator
        let r = curandGenerateUniformDouble(generator, devicePtr, SizeT(n))
        (r, CUDAPointer(devicePtr))

    let GenerateNormal(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        CUDARuntime64.cudaMalloc(&devicePtr, size) |> ignore
        let r = curandGenerateNormal(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

    let GenerateNormalDouble(generator, n:int, mean, stddev) =
        // Doubles are 8 bytes, so size the buffer with float rather than float32
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        CUDARuntime64.cudaMalloc(&devicePtr, size) |> ignore
        let r = curandGenerateNormalDouble(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

    let GenerateLogNormal(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        CUDARuntime64.cudaMalloc(&devicePtr, size) |> ignore
        let r = curandGenerateLogNormal(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))

    let GenerateLogNormalDouble(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        CUDARuntime64.cudaMalloc(&devicePtr, size) |> ignore
        let r = curandGenerateLogNormalDouble(generator, devicePtr, SizeT(n), mean, stddev)
        (r, CUDAPointer(devicePtr))
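
The wrappers above work with two different quantities: cudaMalloc takes a size in bytes, whereas the cuRAND generation functions take the number of values to produce. The following minimal sketch is not part of the chapter's library; byteSize is a hypothetical helper name used only to show the conversion for the two element types that the wrappers allocate.

```fsharp
open System.Runtime.InteropServices

/// Number of bytes needed to hold `count` elements of type 'T.
/// (byteSize is a hypothetical helper, not part of the chapter's library.)
let byteSize<'T> (count : int) : int =
    count * Marshal.SizeOf(typeof<'T>)

let n = 256

// The element count stays the same; only the byte size depends on the type.
printfn "float32: %d elements need %d bytes" n (byteSize<float32> n)
printfn "float:   %d elements need %d bytes" n (byteSize<float> n)
```

Keeping the element count and the byte size in separate values makes it harder to pass one where the other is expected.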

If you prefer to use a class, a class version of CUDARandom is defined in Example 9-11.

Example 9-11. CUDARandom class definition

type CUDARandom() =

    let is64bit = IntPtr.Size = 8

    member this.CreateGenerator(rand_type) =
        if is64bit then
            CUDARandomDriver64.CreateGenerator(rand_type)
        else
            CUDARandomDriver32.CreateGenerator(rand_type)
    member this.DestroyGenerator(g) =
        if is64bit then
            CUDARandomDriver64.DestroyGenerator(g)
        else
            CUDARandomDriver32.DestroyGenerator(g)
    member this.SetPseudoRandomGeneratorSeed(g, obj) =
        if is64bit then
            CUDARandomDriver64.SetPseudoRandomGeneratorSeed(g, obj |> unbox |> uint64)
        else
            CUDARandomDriver32.SetPseudoRandomGeneratorSeed(g, obj |> unbox |> uint64)
    member this.SetGeneratorOffset(g, obj) =
        if is64bit then
            CUDARandomDriver64.SetGeneratorOffset(g, obj |> unbox |> uint64)
        else
            CUDARandomDriver32.SetGeneratorOffset(g, obj |> unbox |> uint64)
    member this.SetGeneratorOrdering(g, ordering) =
        if is64bit then
            CUDARandomDriver64.SetGeneratorOrdering(g, ordering)
        else
            CUDARandomDriver32.SetGeneratorOrdering(g, ordering)
    member this.SetQuasiRandomGeneratorDimensions(g, obj) =
        if is64bit then
            CUDARandomDriver64.SetQuasiRandomGeneratorDimensions(
                g, obj |> unbox |> uint32)
        else
            CUDARandomDriver32.SetQuasiRandomGeneratorDimensions(
                g, obj |> unbox |> uint32)
    // In the members below, n is the number of values to generate
    // and stddev is the standard deviation of the distribution.
    member this.GenerateUniform(g, n) =
        if is64bit then
            CUDARandomDriver64.GenerateUniform(g, n)
        else
            CUDARandomDriver32.GenerateUniform(g, n)
    member this.GenerateUniformDouble(g, n) =
        if is64bit then
            CUDARandomDriver64.GenerateUniformDouble(g, n)
        else
            CUDARandomDriver32.GenerateUniformDouble(g, n)
    member this.GenerateNormal(g, n, mean, stddev) =
        if is64bit then
            CUDARandomDriver64.GenerateNormal(g, n, mean, stddev)
        else
            CUDARandomDriver32.GenerateNormal(g, n, mean, stddev)
    member this.GenerateNormalDouble(g, n, mean, stddev) =
        if is64bit then
            CUDARandomDriver64.GenerateNormalDouble(g, n, mean, stddev)
        else
            CUDARandomDriver32.GenerateNormalDouble(g, n, mean, stddev)
    member this.GenerateLogNormal(g, n, mean, stddev) =
        if is64bit then
            CUDARandomDriver64.GenerateLogNormal(g, n, mean, stddev)
        else
            CUDARandomDriver32.GenerateLogNormal(g, n, mean, stddev)
    member this.GenerateLogNormalDouble(g, n, mean, stddev) =
        if is64bit then
            CUDARandomDriver64.GenerateLogNormalDouble(g, n, mean, stddev)
        else
            CUDARandomDriver32.GenerateLogNormalDouble(g, n, mean, stddev)
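
Every member of the class repeats the same pointer-size check. As a minimal sketch of that dispatch pattern in isolation, the following code picks one of two implementations based on IntPtr.Size. Driver32, Driver64, Describe, and selectByBitness are hypothetical names that stand in for the real driver modules, so the sketch runs without CUDA.

```fsharp
open System

// Hypothetical stand-ins for the 32-bit and 64-bit driver modules
module Driver32 =
    let Describe () = "32-bit driver"

module Driver64 =
    let Describe () = "64-bit driver"

/// Runs on64 in a 64-bit process and on32 otherwise.
/// IntPtr.Size is 8 bytes in a 64-bit process and 4 bytes in a 32-bit one.
let selectByBitness (on64 : unit -> 'T) (on32 : unit -> 'T) : 'T =
    if IntPtr.Size = 8 then on64 () else on32 ()

let description = selectByBitness Driver64.Describe Driver32.Describe
printfn "%s (pointer size = %d bytes)" description IntPtr.Size
```

Because both branches must return the same type, the 32-bit and 64-bit wrappers need matching signatures for this kind of dispatch to compile.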

Example 9-12 shows sample code that invokes the CUDARandom class to generate 256 random numbers and copy them from the device to a host array.

Example 9-12. Sample code to invoke the CUDARandom class

open System.Runtime.InteropServices

let test6() =
    let n = 256
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then
        let status, v = r.GenerateUniform(g, n)
        if status = curandStatus.CURAND_SUCCESS then
            let array : float32 array = Array.zeroCreate n
            // Pin the array so the GC cannot move it during the copy
            let handle = GCHandle.Alloc(array, GCHandleType.Pinned)
            try
                CUDARuntime.CUDARuntime64.cudaMemcpy(
                    handle.AddrOfPinnedObject(),
                    v.Pointer,
                    SizeT(n * sizeof<float32>),     // size in bytes
                    CUDAMemcpyKind.cudaMemcpyDeviceToHost) |> ignore
            finally
                handle.Free()
            r.DestroyGenerator(g) |> ignore
            array
            |> Seq.iter (printfn "%A")
        else
            printfn "generation failed. status = %A" status
            r.DestroyGenerator(g) |> ignore
    else
        // No generator was created, so there is nothing to destroy
        printfn "create generator failed. status = %A" status
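
Copying device memory into a managed array requires the array to stay at a fixed address for the duration of the copy. The following sketch isolates that pinning step. withPinned is a hypothetical helper name, and Marshal.Copy stands in for cudaMemcpy so that the code runs on a machine without CUDA.

```fsharp
open System
open System.Runtime.InteropServices

/// Pins a managed array, passes its address to `action`, and always unpins it.
/// (withPinned is a hypothetical helper, not part of the chapter's library.)
let withPinned (data : 'T[]) (action : IntPtr -> 'R) : 'R =
    let handle = GCHandle.Alloc(data, GCHandleType.Pinned)
    try
        action (handle.AddrOfPinnedObject())
    finally
        // Unpin even if the action throws
        handle.Free()

// CPU-only demonstration: Marshal.Copy plays the role of the device-to-host copy.
let source = [| 1.0f; 2.0f; 3.0f; 4.0f |]
let target : float32[] = Array.zeroCreate source.Length
withPinned target (fun dst -> Marshal.Copy(source, 0, dst, source.Length))
printfn "%A" target
```

Wrapping the pin in try/finally keeps the handle from leaking when the copy fails.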

The generated random numbers follow a uniform distribution. If you need random numbers that follow a custom density function, you can use the acceptance-rejection algorithm. This method is based on the observation that sampling from a distribution is equivalent to sampling uniformly from the region under the graph of its density function. The algorithm works like this:

Sample a point x from a proposal distribution, for example, the uniform distribution.
Draw a vertical line at x up to the graph of the target density function.
Sample a point y uniformly along this vertical line. If (x, y) falls outside the region under the target function, reject it; otherwise, accept x.

You can then use a filter to keep only the accepted sample values for the target function f(x), as shown in Example 9-13.

Example 9-13. Accept-reject algorithm

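open System.Runtime.InteropServices

let test6_2() =
    let n = 256
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then
        let status, v = r.GenerateUniform(g, n)
        if status = curandStatus.CURAND_SUCCESS then
            let array : float32 array = Array.zeroCreate n
            // Pin the array so the GC cannot move it during the copy
            let handle = GCHandle.Alloc(array, GCHandleType.Pinned)
            try
                CUDARuntime.CUDARuntime64.cudaMemcpy(
                    handle.AddrOfPinnedObject(),
                    v.Pointer,
                    SizeT(n * sizeof<float32>),     // size in bytes
                    CUDAMemcpyKind.cudaMemcpyDeviceToHost) |> ignore
            finally
                handle.Free()
            r.DestroyGenerator(g) |> ignore
            array
        else
            r.DestroyGenerator(g) |> ignore
            // failwithf accepts a format string; failwith takes a plain string only
            failwithf "generation failed. status = %A" status
    else
        // No generator was created, so there is nothing to destroy
        failwithf "create generator failed. status = %A" status

let test7() =
    // Split one batch into x and y coordinates. Two generators created with the
    // same default seed would produce identical sequences, making x equal to y.
    let values = test6_2()
    let half = values.Length / 2
    let xArray = values.[.. half - 1]
    let yArray = values.[half ..]

    Array.zip xArray yArray
    // Keep only the points that satisfy the acceptance condition
    |> Array.filter (fun (x, y) -> x * x <= y)
    |> Seq.iter (fun (x, y) -> printfn "(%A, %A)" x y)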
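
The three steps of the algorithm can also be tried on the CPU. The following minimal sketch uses System.Random in place of cuRAND and accepts a candidate x when the point (x, y) lies under the curve f(x) = x * x on the interval [0, 1]. The name rejectionSample, the bound fMax, and the seed 42 are assumptions made for this illustration.

```fsharp
open System

/// Draws `count` accepted samples on [0, 1] whose density is proportional to `f`.
/// `fMax` must be an upper bound of `f` on that interval.
/// (rejectionSample is a hypothetical helper written for this illustration.)
let rejectionSample (rng : Random) (f : float -> float) (fMax : float) (count : int) =
    let rec draw () =
        let x = rng.NextDouble()            // step 1: candidate from the uniform proposal
        let y = rng.NextDouble() * fMax     // steps 2 and 3: uniform height on the line at x
        if y <= f x then x else draw ()     // accept only points under the curve
    Array.init count (fun _ -> draw ())

let rng = Random(42)
let samples = rejectionSample rng (fun x -> x * x) 1.0 10000

// For a density proportional to x * x on [0, 1], the theoretical mean is 3/4.
printfn "mean of accepted samples = %f" (Array.average samples)
```

Each accepted sample costs about three candidates on average for this target, because the area under x * x is one third of the unit square.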

cuBLAS Library

The second library is the CUDA Basic Linear Algebra Subroutines library (cuBLAS). According to the NVIDIA website (http://developer.nvidia.com/cuda/cublas), cuBLAS is a GPU-accelerated version of the complete standard BLAS library that runs 6 to 17 times faster than the latest MKL BLAS at the time of writing. The code that defines the cuBLAS data structures in F# is shown in Example 9-14.

Example 9-14. cuBLAS data structures

[<Struct>]
type CUDABLASHandle =
    // The native cuBLAS handle is a pointer, so it must be pointer-sized
    val handle : IntPtr

[<Struct>]
type CUDAStream =
    val Value : int

[<Struct>]
type CUDAFloatComplex =
    val real : float32
    val imag : float32

type CUBLASPointerMode =
    | Host = 0
    | Device = 1

type CUBLASStatus =
    | Success = 0
    | NotInitialized = 1
    | AllocFailed = 3
    | InvalidValue = 7
    | ArchMismatch = 8
    | MappingError = 11
    | ExecutionFailed = 13
    | InternalError = 14

The F# wrapper code for the cuBLAS library is shown in Example 9-15. The listing for the 32-bit version is abbreviated because the only difference between the two versions is the dllName literal: cublas64_42_9 for the 64-bit version and cublas32_42_9 for the 32-bit version.

Example 9-15. cuBLAS library

module CUDABLASDriver64 =
    [<Literal>]
    let dllName =  "cublas64_42_9"
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasInit()
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasShutdown()
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasGetError()
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasFree(CUDAPointer devicePtr)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCreate_v2(CUDABLASHandle& handle)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSetStream_v2(CUDABLASHandle handle, CUDAStream streamId)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasGetStream_v2(CUDABLASHandle handle, CUDAStream& streamId)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasGetPointerMode_v2(CUDABLASHandle handle,
                                                         CUBLASPointerMode& mode)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSetPointerMode_v2(CUDABLASHandle handle,
                                                         CUBLASPointerMode mode)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIcamax_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIdamax_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIsamax_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIzamax_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIcamin_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIdamin_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIsamin_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasIzamin_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, int& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSasum_v2(CUDABLASHandle handle,
                                              int n, IntPtr x, int incx, float32& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDasum_v2(CUDABLASHandle handle,
                                              int n, IntPtr x, int incx, float& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasScasum_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, float32& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDzasum_v2(CUDABLASHandle handle,
                                               int n, IntPtr x, int incx, float& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSaxpy_v2(CUDABLASHandle handle,
                                              int n, float32& alpha, IntPtr x,
                                              int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDaxpy_v2(CUDABLASHandle handle,
                                              int n, float& alpha, IntPtr x,
                                              int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCaxpy_v2(CUDABLASHandle handle, int n,
                                              CUDAFloatComplex& alpha, IntPtr x,
                                              int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZaxpy_v2(CUDABLASHandle handle, int n,
                                              CUDAFloatComplex& alpha, IntPtr x,
                                              int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasScopy_v2(CUDABLASHandle handle,
                                              int n, IntPtr x, int incx,
                                              IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDcopy_v2(CUDABLASHandle handle,
                                              int n, IntPtr x, int incx,
                                              IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCcopy_v2(CUDABLASHandle handle,
                                              int n, IntPtr x, int incx,
                                              IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZcopy_v2(CUDABLASHandle handle,
                                              int n, IntPtr x, int incx,
                                              IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSdot_v2(CUDABLASHandle handle,
                                             int n, IntPtr x, int incx, IntPtr y,
                                             int incy, float32& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDdot_v2(CUDABLASHandle handle,
                                             int n, IntPtr x, int incx,
                                             IntPtr y, int incy, float& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCdotu_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy,
                                              CUDAFloatComplex& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCdotc_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy,
                                              CUDAFloatComplex& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZdotu_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy,
                                              CUDAFloatComplex& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZdotc_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy,
                                              CUDAFloatComplex& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSnrm2_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx, float32& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDnrm2_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx, float& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasScnrm2_v2(CUDABLASHandle handle, int n,
                                               IntPtr x, int incx, float32& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDznrm2_v2(CUDABLASHandle handle, int n,
                                               IntPtr x, int incx, float& result)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrot_v2(CUDABLASHandle handle, int n,
                                             IntPtr x, int incx,
                                             IntPtr y, int incy, IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrot_v2(CUDABLASHandle handle, int n,
                                             IntPtr x, int incx,
                                             IntPtr y, int incy, IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCrot_v2(CUDABLASHandle handle, int n,
                                             IntPtr x, int incx,
                                             IntPtr y, int incy, IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCsrot_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy, IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZrot_v2(CUDABLASHandle handle, int n,
                                             IntPtr x, int incx,
                                             IntPtr y, int incy, IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZdrot_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy, IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b,
                                              IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b,
                                              IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b,
                                              IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b,
                                              IntPtr c, IntPtr s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrotm_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy, IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrotm_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx,
                                              IntPtr y, int incy, IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrotmg_v2(CUDABLASHandle handle, IntPtr d1,
                                               IntPtr d2, IntPtr x1, IntPtr y1,
                                               IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrotmg_v2(CUDABLASHandle handle, IntPtr d1,
                                               IntPtr d2, IntPtr x1, IntPtr y1,
                                               IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSscal_v2(CUDABLASHandle handle, int n,
                                              IntPtr alpha, IntPtr x, int incx)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDscal_v2(CUDABLASHandle handle, int n,
                                              IntPtr alpha, IntPtr x, int incx)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCscal_v2(CUDABLASHandle handle, int n,
                                              IntPtr alpha, IntPtr x, int incx)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCsscal_v2(CUDABLASHandle handle, int n,
                                               IntPtr alpha, IntPtr x, int incx)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZscal_v2(CUDABLASHandle handle, int n,
                                              IntPtr alpha, IntPtr x, int incx)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZdscal_v2(CUDABLASHandle handle, int n,
                                               IntPtr alpha, IntPtr x, int incx)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSswap_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDswap_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCswap_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx, IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZswap_v2(CUDABLASHandle handle, int n,
                                              IntPtr x, int incx, IntPtr y, int incy)

module CUDABLASDriver32 =
    [<Literal>]
    let dllName =  "cublas32_42_9"
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasInit()
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasShutdown()
    ...
    // other functions are same as the 64-bit version

Note

You can use dumpbin.exe to check the functions exported from the CUDA BLAS DLL. The cublas64_<version>.DLL file is located in the directory <CUDA installation folder>\C\common\bin. The output from dumpbin.exe shows that a function can have two versions. For example, both cublasZdscal and cublasZdscal_v2 appear in the list. It is recommended that you use the function with the _v2 suffix.
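
For orientation, the level-1 routines declared in Example 9-15 follow the standard BLAS definitions. The following CPU-only sketch shows what a few of the single-precision routines compute. The function names axpy, dot, asum, nrm2, and iamax are chosen for this illustration. Unlike the cuBLAS routine, which updates y in place on the device, this axpy returns a new array.

```fsharp
/// alpha * x + y, element by element (the *axpy routines)
let axpy (alpha : float32) (x : float32[]) (y : float32[]) =
    Array.map2 (fun xi yi -> alpha * xi + yi) x y

/// Sum of x.[i] * y.[i] (the *dot routines)
let dot (x : float32[]) (y : float32[]) =
    Array.fold2 (fun acc xi yi -> acc + xi * yi) 0.0f x y

/// Sum of absolute values (the *asum routines, real case)
let asum (x : float32[]) = Array.sumBy abs x

/// Euclidean norm (the *nrm2 routines)
let nrm2 (x : float32[]) = sqrt (Array.sumBy (fun v -> v * v) x)

/// 1-based index of the first element with the largest absolute value
/// (the I*amax routines; cuBLAS uses 1-based indexing for Fortran compatibility)
let iamax (x : float32[]) =
    let mutable best = 0
    for i in 1 .. x.Length - 1 do
        if abs x.[i] > abs x.[best] then best <- i
    best + 1

let x = [| 1.0f; -4.0f; 3.0f |]
let y = [| 2.0f; 2.0f; 2.0f |]
printfn "axpy  = %A" (axpy 2.0f x y)
printfn "dot   = %A" (dot x y)
printfn "asum  = %A" (asum x)
printfn "nrm2  = %A" (nrm2 x)
printfn "iamax = %d" (iamax x)
```

A reference implementation like this is also handy for checking the results returned by the GPU on small inputs.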

Some cuBLAS functions need more than one F# signature, for example one that takes scalar arguments as IntPtr values and one that takes them by reference. Because F# does not allow two functions with the same name in one module, the alternative signatures are declared in another module, as shown in Example 9-16. Some of the 64-bit version functions are not listed because they are the same as those in the 32-bit module.
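
A different approach, which this chapter does not use, is to keep both signatures in one module and give them distinct F# names that map to the same native export through the EntryPoint property of DllImport. The sketch below only declares the functions and never calls them, so it compiles and runs without the cuBLAS DLL. The module name CUDABLASDriver64Sketch and the function names cublasSrotg_ptr and cublasSrotg_ref are hypothetical, and the two types are minimal stand-ins for those in Example 9-14.

```fsharp
open System
open System.Runtime.InteropServices

// Minimal stand-ins for the types from Example 9-14, repeated so the sketch compiles alone
type CUBLASStatus =
    | Success = 0

[<Struct>]
type CUDABLASHandle =
    val handle : nativeint

module CUDABLASDriver64Sketch =
    [<Literal>]
    let dllName = "cublas64_42_9"

    [<Literal>]
    let srotgEntry = "cublasSrotg_v2"

    // Scalars passed as raw pointers
    [<DllImport(dllName, EntryPoint = srotgEntry)>]
    extern CUBLASStatus cublasSrotg_ptr(CUDABLASHandle handle,
                                        IntPtr a, IntPtr b, IntPtr c, IntPtr s)

    // The same native function, with scalars passed by reference
    [<DllImport(dllName, EntryPoint = srotgEntry)>]
    extern CUBLASStatus cublasSrotg_ref(CUDABLASHandle handle,
                                        float32& a, float32& b, float32& c, float32& s)

printfn "Two managed signatures are bound to %s in %s"
        CUDABLASDriver64Sketch.srotgEntry CUDABLASDriver64Sketch.dllName
```

The trade-off is naming: a second module keeps the native names unchanged, whereas EntryPoint keeps everything in one module at the cost of suffixed F# names.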

Example 9-16. cuBLAS library in a different module

module CUDABLASDriver32_2 =
    [<Literal>]
    let dllName =  "cublas32_42_9"

    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrotmg_v2(CUDABLASHandle handle,
                                               float32& d1, float32& d2,
                                               float32& x1, float32& y1,
                                               IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrotmg_v2(CUDABLASHandle handle,
                                               float& d1, float& d2,
                                               float& x1, float& y1,
                                               IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZrotg_v2(CUDABLASHandle handle,
                                                 CUDAFloatComplex& a,
                                                 CUDAFloatComplex& b,
                                                 float& c,
                                                 CUDAFloatComplex& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrot_v2(CUDABLASHandle handle,
                                                int n,
                                                IntPtr x, int incx,
                                                IntPtr y, int incy,
                                                float32& c,
                                                float32& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCrotg_v2(CUDABLASHandle handle,
                                                 CUDAFloatComplex& a,
                                                 CUDAFloatComplex& b,
                                                 float32& c,
                                                 CUDAFloatComplex& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrotg_v2(CUDABLASHandle handle,
                                                 float& a, float& b, float& c,
                                                 float& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrotg_v2(CUDABLASHandle handle,
                                                 float32& a, float32& b, float32& c,
                                                 float32& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZdrot_v2(CUDABLASHandle handle,
                                                 int n,
                                                 IntPtr x, int incx,
                                                 IntPtr y, int incy,
                                                 float& c,
                                                 float& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSaxpy_v2(CUDABLASHandle handle,
                                                 int n,
                                                 IntPtr alpha,
                                                 IntPtr x, int incx,
                                                 IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDaxpy_v2(CUDABLASHandle handle,
                                                 int n,
                                                 IntPtr alpha,
                                                 IntPtr x, int incx,
                                                 IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCaxpy_v2(CUDABLASHandle handle,
                                                 int n,
                                                 IntPtr alpha,
                                                 IntPtr x, int incx,
                                                 IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZaxpy_v2(CUDABLASHandle handle,
                                                 int n,
                                                 IntPtr alpha,
                                                 IntPtr x, int incx,
                                                 IntPtr y, int incy)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrot_v2(CUDABLASHandle handle,
                                                int n,
                                                IntPtr x, int incx,
                                                IntPtr y, int incy,
                                                float& c, float& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCrot_v2(CUDABLASHandle handle,
                                                int n,
                                                IntPtr x, int incx,
                                                IntPtr y, int incy,
                                                float32&c,
                                                CUDAFloatComplex& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasCsrot_v2(CUDABLASHandle handle,
                                                 int n,
                                                 IntPtr x, int incx,
                                                 IntPtr y, int incy,
                                                 float32& c, float32& s)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasZrot_v2(CUDABLASHandle handle,
                                                int n,
                                                IntPtr x, int incx,
                                                IntPtr y, int incy,
                                                float& c,
                                                CUDAFloatComplex& s)

module CUDABLASDriver64_2 =
    [<Literal>]
    let dllName =  "cublas64_42_9"

    [<DllImport(dllName)>]
    extern CUBLASStatus cublasSrotmg_v2(CUDABLASHandle handle,
                                                  float32& d1, float32& d2,
                                                  float32& x1, float32& y1,
                                                  IntPtr param)
    [<DllImport(dllName)>]
    extern CUBLASStatus cublasDrotmg_v2(CUDABLASHandle handle,
                                                  float& d1, float& d2,
                                                  float& x1, float& y1,
                                                  IntPtr param)
    ...
    // other 64-bit version functions

The code to invoke the preceding API is listed in Example 9-17. The variable a is a 100-element array with every element set to 2, and the variable b is a 100-element array with every element set to 1. The dot product of a and b is therefore 200.

Example 9-17. Invoking the CUDA BLAS library

open System.Runtime.InteropServices

let test8() =
    let n = 100
    let a = Array.create n 2.f
    let b = Array.create n 1.f
    let copyToDevice(array) =
        let nativePtr = Marshal.UnsafeAddrOfPinnedArrayElement(array, 0)
        let p = System.IntPtr(nativePtr.ToPointer())
        let mutable dst = IntPtr()
        let count = SizeT(Marshal.SizeOf(sizeof<float32>) * array.Length)
        let r = CUDARuntime.CUDARuntime64.cudaMalloc(&dst, count)

        if r = cudaError.cudaSuccess then
            let r = CUDARuntime.CUDARuntime64.cudaMemcpy(
                          dst,
                          p,
                          count,
                          CUDAMemcpyKind.cudaMemcpyHostToDevice)
            if r = cudaError.cudaSuccess then
                Some dst
            else
                None
        else
            None

    let mutable handle = CUDABLASHandle()
    let r = CUDABLASDriver64.cublasInit()
    let r = CUDABLASDriver64.cublasCreate_v2(&handle)
    let deviceA = copyToDevice(a)
    let deviceB = copyToDevice(b)
    let mutable result = 1.f

    match deviceA, deviceB with
    | Some(pA), Some(pB) ->
        let status = CUDABLASDriver64.cublasSdot_v2(
                             handle,
                             n, // number of elements, not the size in bytes
                             pA,
                             1,
                             pB,
                             1,
                             &result)
        printfn "result is %A" result
    | _ -> failwith "computation error"
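To check the value returned by cublasSdot_v2, it can help to compute the same dot product on the CPU first. The following is a minimal sketch that uses only the F# core library; the name expected is introduced here for illustration and is not part of the chapter's code.

```fsharp
// CPU reference for the dot product that Example 9-17 computes on the GPU.
let n = 100
let a = Array.create n 2.f   // 100 elements, all set to 2
let b = Array.create n 1.f   // 100 elements, all set to 1

// Sum of the element-wise products of a and b.
let expected = Array.fold2 (fun acc x y -> acc + x * y) 0.f a b

printfn "expected dot product: %A" expected
```

If the GPU result differs from expected, the element count or the device pointers passed to the cuBLAS call are the first things to inspect.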

F# Quotation and Transform

The F# quotation is a tree structure that represents the structure of an F# program. The ReflectedDefinition attribute is the key attribute used to get an F# quotation. According to MSDN (http://msdn.microsoft.com/en-us/library/ee353643.aspx), here is how you use it:

Add this attribute to the let-binding for the definition of a top-level value to make the quotation expression that implements the value available for use at runtime. Add this attribute to a type or module to make it apply recursively to all the values in the module or all the members of the type.

Example 9-18 shows how to get the quotation expression for a function. The <@@ ... @@> operator is used to get the quotation from a function that is decorated with the ReflectedDefinition attribute.

Example 9-18. Getting a quotation from a function with the ReflectedDefinition attribute

[<ReflectedDefinition>]
let f() =
    let b = 99
    b*2

<@@ f @@>

Execution result

val f : unit -> int
val it : Quotations.Expr = Lambda (arg00@, Call (None, f, []))
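The quotation in the execution result only records a call to f. To reach the body of f, the reflected definition has to be looked up from the method in that call, which is what the MethodWithReflectedDefinition active pattern does. The sketch below assumes a function named g and a helper named bodyOf; both are hypothetical names used only for illustration.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

[<ReflectedDefinition>]
let g () =
    let b = 99
    b * 2

// Returns the quotation of the function body when the quoted function
// carries the ReflectedDefinition attribute; otherwise returns None.
let bodyOf (q: Expr) : Expr option =
    match q with
    | Lambda (_, Call (_, MethodWithReflectedDefinition body, _)) -> Some body
    | _ -> None

match bodyOf <@@ g @@> with
| Some body -> printfn "%A" body
| None -> printfn "no reflected definition found"
```

The translation code later in this chapter relies on the same active pattern to move from a call site to the body that must be translated.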

There are several approaches to converting F# code to CUDA code. Figure 9-6 shows one of these approaches. A PTX file contains code in PTX, an assembly-like language for the GPU. In this chapter, the path from quotation via C code to PTX file is chosen.

Figure 9-6. Converting code from F# to GPU code

After the quotation is ready, a translation function can be applied. There are several ways to generate code from a quotation tree. Example 9-20 traverses the tree structure and emits CUDA C code, which keeps the generation step easy to understand. To support the code generation, CUDAPointer2 is defined, as shown in Example 9-19.

Example 9-19. CUDAPointer2 code for code generation

type CUDAPointer2<'T>(p:CUDAPointer) =
    new(ptr:IntPtr) = CUDAPointer2(CUDAPointer(ptr))
    member this.Pointer = p
    member this.PointerSize = p.PointerSize
    member this.Is64Bit = this.PointerSize = 8
    member this.Item
        with get (i:int) : float32 = failwith "for code generation only"
        and set (i:int) (v:float32) = failwith "for code generation only"
    member this.Set(x:float32, i:int) = failwith "for code generation only"

Note

CUDAPointer is defined in Example 9-10.

Example 9-20 implements only part of the conversion and does not provide full support for an F# quotation conversion.

Example 9-20. Quotation tree traversal

open System
open System.Reflection
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns
open Microsoft.FSharp.Quotations.ExprShape

let accessExpr exp =

    let addSemiCol (str:string) =
            if str.EndsWith(";") ||
                String.IsNullOrEmpty(str) ||
                String.IsNullOrWhiteSpace(str) then str
            else str + ";"

    let rec iterate exp : string=

        let print x =
            let str = sprintf "%A" x
            str

        let matchExp expOption =
            match expOption with
            | Some(n) -> iterate n
            | None -> String.Empty
        let isCUDAPointerType (exp:Expr Option) =
            match exp with
            | Some(n) ->
                n.Type.IsAssignableFrom(
                    typeof<CUDAPointer2<float>>) ||
                    n.Type.IsAssignableFrom(typeof<CUDAPointer2<float32>>)
            | _ ->
                false

        match exp with
        | DerivedPatterns.Applications (e, ell) ->
            let str0 = iterate e
            let str1 =
                ell
                |> Seq.map (fun n -> n |> Seq.map (fun m -> iterate m ))
                |> Seq.map (fun n -> String.Join("\n", n |> Seq.toArray))
            str0 + String.Join("\n", str1 |> Seq.toArray)
        | DerivedPatterns.AndAlso (e0, e1) ->
            (iterate e0) + (iterate e1)
        | DerivedPatterns.Bool e ->
            print e
        | DerivedPatterns.Byte e ->
            print e
        | DerivedPatterns.Char e ->
            print e
        | DerivedPatterns.Double e ->
            print e
        | DerivedPatterns.Int16 e->
            print e
        | DerivedPatterns.Int32 e->
            print e
        | DerivedPatterns.Int64 e ->
            print e
        | DerivedPatterns.OrElse (e0, e1)->
            (iterate e0) + (iterate e1)
        | DerivedPatterns.SByte e ->
            print e
        | DerivedPatterns.Single e ->
            print e
        | DerivedPatterns.String e ->
            print e
        | DerivedPatterns.UInt16 e ->
            print e
        | DerivedPatterns.UInt32 e ->
            print e
        | DerivedPatterns.UInt64 e ->
            print e
        | DerivedPatterns.Unit e ->
            String.Empty //"void"
        | Patterns.AddressOf address ->
            iterate address
        | Patterns.AddressSet (exp0, exp1) ->
            (iterate exp0) + (iterate exp1)
        | Patterns.Application (exp0, exp1) ->
            (iterate exp0) + (iterate exp1)
        | Patterns.Call (expOption, mi, expList)  ->
            if isCUDAPointerType expOption && mi.Name = "Set" then
                let callObject = matchExp expOption
                let index = iterate expList.[1]
                let postfix =
                    match mi with
                    | DerivedPatterns.MethodWithReflectedDefinition n -> iterate n
                    | _ -> iterate expList.[0]
                let s = sprintf "%s[%s] = %s;" callObject index postfix
                s
            else
                let callObject = matchExp expOption
                let returnType = translateFromNETType mi.ReturnType String.Empty
                let postfix =
                    match mi with
                    | DerivedPatterns.MethodWithReflectedDefinition n -> iterate n
                    | _ -> translateFromNETOperator mi expList
                let s = sprintf "%s%s" callObject postfix
                s
        | Patterns.Coerce (exp, t) ->
            let from = iterate exp
            //sprintf "coerce(%s, %s)" from t.Name
            sprintf "%s" from
        | Patterns.DefaultValue exp ->
            print exp
        | Patterns.FieldGet (expOption, fi) ->
            (matchExp expOption) + (print fi)
        | Patterns.FieldSet (expOption, fi, e) ->
            let callObj = matchExp expOption
            let fi = print fi
            let str = iterate e
            callObj + fi + str
        | Patterns.ForIntegerRangeLoop (v, e0, e1, e2) ->
            let from = iterate e0
            let toV = iterate e1
            let s = String.Format("for (int {0} = {1}; {0}<{2}; {0}++) {{ {3} }}",
                                       v, from ,toV, iterate e2)
            s
        | Patterns.IfThenElse (con, exp0, exp1) ->
            let condition = (iterate con)
            let ifClause = addSemiCol(iterate exp0)
            let elseClause = addSemiCol(iterate exp1)
            sprintf "if (%s) { %s }\nelse { %s }" condition ifClause elseClause
        | Patterns.Lambda (var,body) ->
            //let a = print var
            //let b = iterate body
            match exp with
            | DerivedPatterns.Lambdas (vll, e) ->
                let s =
                    vll
                    |> List.map (fun n-> n
                                         |> List.map (fun m ->
                                                             sprintf "%s %s"
                                                                 (translateFromNETType m.Type "")
                                                                 m.Name))
                    |> List.fold (fun acc l -> acc@l) []
                let parameterNames = vll |> List.map (fun n -> sprintf "%s" n.Head.Name)
                let returnType = getCallReturnType e
                let returnTypeID = translateFromNETTypeToFunReturn returnType ""
                let fid = code.FunctionID;
                code.IncreaseFunctionID()
                let functionName = sprintf "ff_%d" fid
                let statement = iterate e
                let functionCode = sprintf "__device__ %s %s(%s) { %s } "
                                           returnTypeID
                                           functionName
                                           (String.Join(", ", s))
                                           (addSemiCol(statement))
                code.Add(functionCode)
                sprintf "%s(%s)" functionName (String.Join(", ", parameterNames))
            | _ -> failwith "not supported lambda format"
        | Patterns.Let (var, exp0, exp1) ->
            let a = print var
            let b = iterate exp0
            let t = var.Type
            let s =
                if t.Name = "FSharpFunc`2" then
                    sprintf "__device__ %s; //function pointer" (translateFromNETType t a)
                else
                    String.Empty
            code.Add(s)
            let c = iterate exp1
            let assignment =
                if t.Name = "FSharpFunc`2" then
                    sprintf "%s;\n%s = %s;" (translateFromNETType t a) a b
                else
                    sprintf "%s %s;\n%s = %s;" (translateFromNETType t a) a a b
            sprintf "%s\n%s" assignment c
        | Patterns.LetRecursive (tupList, exp) ->
            let strList = tupList |> Seq.map (fun (var, e) -> (print var) + (iterate e))
            String.Join("\n", strList |> Seq.toArray) + (iterate exp)
        | Patterns.NewArray (t, expList) ->
            let str0 = print t
            let str1 = expList |> Seq.map (fun e -> iterate e)
            str0 + String.Join("\n", str1)
        | Patterns.NewDelegate (t, varList, exp) ->
            (print t) + (print varList) + (iterate exp)
        | Patterns.NewObject (t, expList) ->
            let str0 = print t
            let str1 = expList |> Seq.map (fun e -> iterate e)
            str0 + String.Join("\n", str1)
        | Patterns.NewRecord (t, expList) ->
            let str0 = print t
            let str1 = expList |> Seq.map (fun e -> iterate e)
            str0 + String.Join("\n", str1)
        | Patterns.NewTuple expList ->
            let ty = translateFromNETType (expList.[0].Type) String.Empty
            let l = expList |> Seq.map (fun e -> iterate e)
            let l = String.Join(", ", l)
            sprintf "newTuple<%s>(%s)" ty l
        | Patterns.NewUnionCase (t, expList) ->
            let str0 = print t
            let str1 = expList |> Seq.map (fun e -> iterate e)
            str0 + String.Join("\n", str1)
        | Patterns.PropertyGet (expOption, pi, expList) ->
            let callObj = matchExp expOption
            let r = match pi with
                    | DerivedPatterns.PropertyGetterWithReflectedDefinition e ->
                        iterate e
                    | _ -> pi.Name
            let l = expList |> List.map (fun n -> iterate n)
            if l.Length > 0 then
                if r = "Item" then
                    sprintf "%s[%s]" callObj (String.Join(", ", l))
                else
                    sprintf "%s.%s[%s]" callObj r (String.Join(", ", l))
            else
                if String.IsNullOrEmpty callObj then
                    sprintf "%s" r
                else
                    sprintf "%s.%s" callObj r
        | Patterns.PropertySet (expOption, pi, expList, e) ->
            let callObj = matchExp expOption
            let r = match pi with
                    | DerivedPatterns.PropertyGetterWithReflectedDefinition e ->
                        iterate e
                    | _ -> print pi
            let l = expList |> Seq.map (fun n -> iterate n)
            if r = "Item" then
                callObj + String.Join("\n", l) + (iterate e)
            else
                callObj + r + String.Join("\n", l) + (iterate e)
        | Patterns.Quote e ->
            iterate e
        | Patterns.Sequential (e0, e1) ->
            let statement0 = addSemiCol(iterate e0)
            let statement1 = addSemiCol(iterate e1)
            sprintf "%s\n%s" statement0 statement1
        | Patterns.TryFinally (e0, e1) ->
            (iterate e0) + (iterate e1)
        | Patterns.TryWith (e0, v0, e1, v1, e2) ->
            (iterate e0) + (print v0) + (iterate e1) + (print v1) + (iterate e2)
        | Patterns.TupleGet (e, i) ->
            (iterate e) + (print i)
        | Patterns.TypeTest (e, t) ->
            (iterate e) + (print t)
        | Patterns.UnionCaseTest (e, ui) ->
            (iterate e) + (print ui)
        | Patterns.Value (obj, t) ->
            (print obj) + (print t)
        | Patterns.Var v ->
            v.Name
        | Patterns.VarSet (v, e) ->
            let left = (print v)
            let right = (iterate e)
            sprintf "%s = %s" left right
        | Patterns.WhileLoop (e0, e1) ->
            let condition = iterate e0
            let body = iterate e1
            sprintf "while (%s) { \n %s \n}" condition (addSemiCol(body))
        | _ -> failwith "not supported pattern"

    and translateFromNETOperator (mi:MethodInfo) (exprList:Expr list) =
        let getList() = exprList |> List.map (fun n -> iterate n)
        let ty = translateFromNETType (exprList.[0].Type) String.Empty

        let generateFunction (mi:MethodInfo) (mappedMethodName:string)
                             (parameters:Expr list) =
            let result = sprintf "%s(%s)" mappedMethodName (String.Join(", ", getList()))
            result

        match mi.Name with
            | "op_Addition" ->
                let l = getList()
                sprintf "(%s) + (%s)" l.[0] l.[1]
            | "op_Subtraction" ->
                let l = getList()
                sprintf "(%s) - (%s)" l.[0] l.[1]
            | "op_Multiply" ->
                let l = getList()
                sprintf "(%s) * (%s)" l.[0] l.[1]
            | "op_Division" ->
                let l = getList()
                sprintf "(%s) / (%s)" l.[0] l.[1]
            | "op_LessThan" ->
                let l = getList()
                sprintf "(%s) < (%s)" l.[0] l.[1]
            | "op_LessThanOrEqual" ->
                let l = getList()
                sprintf "(%s) <= (%s)" l.[0] l.[1]
            | "op_GreaterThan" ->
                let l = getList()
                sprintf "(%s) > (%s)" l.[0] l.[1]
            | "op_GreaterThanOrEqual" ->
                let l = getList()
                sprintf "(%s) >= (%s)" l.[0] l.[1]
            | "op_Range" -> failwith "not support range on GPU"
            | "op_Equality" ->
                let l = getList()
                sprintf "(%s) == (%s)" l.[0] l.[1]
            | "GetArray" ->
                let l = getList()
                sprintf "%s[%s]" l.[0] l.[1]
            | "CreateSequence" -> failwith "not support createSeq on GPU"
            | "FailWith" -> failwith "not support exception on GPU"
            | "ToList" -> failwith "not support toList on GPU"
            | "Map" -> failwith "not support map on GPU"
            | "Delay" ->
                let l = getList()
                String.Join(", ", l)
            | "op_PipeRight" ->
                let l = getList()
                sprintf "%s ( %s )" l.[1] l.[0]
            | "ToSingle" ->
                let l = getList()
                sprintf "(float) (%s)" l.[0]
            | _ ->
                let l = getList()
                sprintf ".%s(%s)" (mi.Name) (String.Join(", ", l))

    let s = iterate exp
    addSemiCol(s)

Note

Some functions in the preceding code are defined in the rest of the chapter.

The transform function does not list all of the possible conversions inside the match expression. We will demonstrate how to expand this function in later sections of this chapter.
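To see the idea behind Example 9-20 without the full set of cases, a much smaller traversal can help. The sketch below is not the chapter's translator: the function name toC is hypothetical, and it assumes that only int constants, variables, addition, multiplication, and int-typed let bindings appear in the quotation.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns

// A reduced translator: handles int constants, variables, two operators,
// and let bindings of type int. Every other node is rejected.
let rec toC (e: Expr) : string =
    match e with
    | Int32 i -> string i
    | Var v -> v.Name
    | Call (None, mi, [l; r]) when mi.Name = "op_Addition" ->
        sprintf "(%s) + (%s)" (toC l) (toC r)
    | Call (None, mi, [l; r]) when mi.Name = "op_Multiply" ->
        sprintf "(%s) * (%s)" (toC l) (toC r)
    | Let (v, value, body) when v.Type = typeof<int> ->
        // Declare the variable, assign it, then emit the rest of the body.
        sprintf "int %s;\n%s = %s;\n%s" v.Name v.Name (toC value) (toC body)
    | _ -> failwith "not supported pattern"

let generated = toC <@@ let b = 99 in b * 2 @@>
printfn "%s" generated
```

Each case returns a C fragment as a string, and the recursive calls stitch the fragments together, which is the same shape as the iterate function in Example 9-20.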

Example 9-21 shows how to convert .NET types to C.

Example 9-21. Translating .NET types to C type

type Type with
    member this.HasInterface(t:Type) =
        this.GetInterface(t.FullName) <> null

let rec translateFromNETType (t:Type) (a:string) =
    if t = typeof<int> then "int"
    elif t = typeof<float32> then "float"
    elif t = typeof<float> then "double"
    elif t = typeof<bool> then "bool"
    elif t.IsArray then
        let elementTypeString = translateFromNETType (t.GetElementType()) a
        sprintf "List<%s>" elementTypeString
    elif t.HasInterface(typeof<System.Collections.IEnumerable>) then
        let elementTypeString = translateFromNETType (t.GetGenericArguments().[0]) a
        sprintf "List<%s>" elementTypeString
    elif t = typeof< Microsoft.FSharp.Core.unit > then String.Empty
    elif t = typeof< CUDAPointer2<float> > then sprintf "%s*" "double"
    elif t = typeof< CUDAPointer2<float32>> then sprintf "%s*" "float"
    elif t.Name = "FSharpFunc`2" then
        let input = translateFromNETType (t.GetGenericArguments().[0]) a
        let out = translateFromNETType (t.GetGenericArguments().[1]) a
        // C function pointer syntax: return_type (*name)(argument_type)
        sprintf "%s(*%s)(%s)" out a input
    elif t = typeof<System.Void> then
        String.Empty
    else failwith "not supported type"

let translateFromNETTypeToFunReturn (t:Type) (a:string) =
    let r = translateFromNETType t a
    if String.IsNullOrEmpty(r) then "void"
    else r

let translateFromNETTypeLength (t:Type) c =
    if t.IsArray then
        sprintf ", int %A_len" c
    else
        String.Empty

let isValueType (t:Type) =
    if t.IsValueType then true
    elif t.HasInterface(typeof<IEnumerable>) then false
    else failwith "is value type failed"

The code generation needs to generate a few intermediary functions. Example 9-22 shows the code structure used to generate these functions.

Example 9-22. Code structure hosts intermediate functions

type Code() =
    inherit System.Collections.Generic.List<string>()
    let mutable functionID = 0
    let mutable variableID = 0
    member this.FunctionID
        with get () = functionID
        and set (v) = functionID <- v
    member this.VariableID
        with get () = variableID
        and set(v) = variableID <- v
    member this.IncreaseFunctionID() =
        functionID <- this.FunctionID + 1
    member this.IncreaseVariableID() =
        variableID <- this.VariableID + 1

    member this.ToCode() =
        "#include \"CUDALibrary.h\"\n" + String.Join("\n", this) + "\n\n\n"

let code = Code()

Note

The CUDALibrary.h file is a placeholder file. In this sample, the file content is empty. You can implement additional functions in this file to make the translation easier and the code more readable.

Example 9-23 shows several functions used to handle the function definitions, including return type and function signatures, in F# and convert them to CUDA code.

Example 9-23. Functions for finding the return type and signature

open System
open System.Reflection
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns
open Microsoft.FSharp.Quotations.ExprShape

let rec getFunctionBody (exp:Expr) =
    match exp with
    | DerivedPatterns.Lambdas(c, callPattern) ->
        match callPattern with
        | Patterns.Call (e, mi, exprList) ->
            match mi with
                | DerivedPatterns.MethodWithReflectedDefinition n ->
                        callPattern
                | _ ->
                        callPattern
        | Patterns.Sequential _ -> callPattern
        | _ -> callPattern
    | Patterns.Sequential _ -> exp
    | _ -> failwith "Argument must be of the form <@ foo @>!"

let getFunctionParameterAndReturn (exp:Expr) =
    match exp with
        | DerivedPatterns.Lambdas (c, Patterns.Call(a, mi, b)) ->
            Some(b, mi.ReturnType)
        | _ -> None

let getFunctionName (exp:Expr) =
    match exp with
        | DerivedPatterns.Lambdas(c, Patterns.Call(a, mi, b)) ->
            mi.Name
        | _ ->
            failwith "Argument must be of the form <@ foo @>!"

let getFunctionTypes (exp:Expr) =
    match getFunctionParameterAndReturn(exp) with
        | Some(exprList ,t) ->
            let out = exprList |> List.map (fun n -> (n, n.Type))
            Some(out, t)
        | None -> None

let getFunctionReturnType (exp:Expr) =
    match getFunctionTypes(exp) with
    | Some(_, t) -> t
    | _ -> failwith "cannot find return type"

let rec getCallReturnType (exp:Expr) =
    match exp with
    | Patterns.Call (_, mi, _) -> mi.ReturnType
    | Patterns.Var n -> n.Type
    | Patterns.Let (var, e0, e1) -> getCallReturnType e1
    | Patterns.Sequential (e0, e1) -> getCallReturnType e1
    | Patterns.Value(v) -> snd v
    | Patterns.WhileLoop(e0, e1) -> typeof<System.Void>
    | Patterns.ForIntegerRangeLoop(var, e0, e1, e2) -> getCallReturnType e2
    | Patterns.IfThenElse(e0, e1, e2) -> typeof<System.Void>
    | _ -> failwith "not supported expr type"

let getFunctionSignature (exp:Expr) =
    let template = @"extern ""C"" __global__ void {0} ({1}) "
    let functionName = getFunctionName(exp)
    let parameters = getFunctionParameterAndReturn(exp)
    match parameters with
    | Some(exprList, _) ->
        let parameterNames =
            exprList
            |> Seq.map (fun n ->
                        match n with
                        | _ when n.Type.IsAssignableFrom(typeof<CUDAPointer2<float>>) ->
                            sprintf "double* %s" (n.ToString())
                        | _ when n.Type.IsAssignableFrom(typeof<CUDAPointer2<float32>>) ->
                            sprintf "float* %s" (n.ToString())
                        | _ -> sprintf "%s %s" (n.Type.Name) (n.ToString()) )
        String.Format(template, functionName, String.Join(", ", parameterNames))
    | None -> failwith "cannot get parameter and return type"

Some low-end GPU hardware supports only the float32 data type. The functions in Example 9-24 check whether a type or a value can be used on the GPU, and you can modify them to match your hardware. For example, if your hardware does not support float (a 64-bit double), change the corresponding branch to return false.

Example 9-24. Functions for checking supported types

let isTypeGPUOK t =
    if t = typeof<int> ||
        t = typeof<float32> ||
        t = typeof<float> then true
    else false

let isValueGPUOK (v:obj) =
    match v with
    | :? int | :? float32 -> true
    | :? float -> true    //if the hardware does not support this, you can make it false
    | _ -> false

Before you can proceed with the translation of the F# quotation, an attribute is needed, as shown in Example 9-25. The attribute marks the functions that need to be converted, so that they can be identified in the assembly by using reflection.

Example 9-25. GPUAttribute definition

type GPUAttribute() =
    inherit Attribute()
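One way the marked functions could be located is sketched below. This is an assumption about how such a lookup might be written, not code from the chapter: the Kernels module, its two functions, and the findGpuFunctions helper are hypothetical names, and GPUAttribute is repeated only so that the sketch is self-contained.

```fsharp
open System
open System.Reflection

// Marker attribute, repeated from Example 9-25 so the sketch is self-contained.
type GPUAttribute() =
    inherit Attribute()

module Kernels =
    [<GPU>]
    let addOne (x: int) = x + 1    // marked for translation
    let helper (x: int) = x * 2    // not marked

// Lists the names of all static methods in the assembly that carry GPUAttribute.
let findGpuFunctions (asm: Assembly) =
    let flags = BindingFlags.Static ||| BindingFlags.Public ||| BindingFlags.NonPublic
    asm.GetTypes()
    |> Array.collect (fun t -> t.GetMethods(flags))
    |> Array.filter (fun m -> m.IsDefined(typeof<GPUAttribute>, false))
    |> Array.map (fun m -> m.Name)

let gpuFunctions = findGpuFunctions (typeof<GPUAttribute>.Assembly)
printfn "%A" gpuFunctions
```

F# modules compile to static classes, so functions defined with let in a module appear as static methods, which is why the search uses BindingFlags.Static.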

F# Quotation on GPGPU

The goal for this section is to translate the F# code in Example 9-26 to CUDA code. The sample function adds two arrays, named a and b, and stores the result in array c. The pascalTriangle function is used to compute the Pascal Triangle values on a given line. The sample2 function demonstrates how for and while statements are translated from F# code to CUDA code.

CUDA uses an array of threads to process the data. The code output.[threadId] = input.[threadId] is therefore an operation that is executed by many threads, each with its own threadId value. Because of these threads, you can process a number of elements in an array simultaneously. More detailed information can be found in the NVIDIA documentation at http://www.nvidia.com/content/cudazone/download/Getting_Started_w_CUDA_Training_NVISION08.pdf.
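The execution model can be imitated on the CPU to make it concrete. The following sketch is only an analogy and does not touch the GPU: launch is a hypothetical helper that runs a kernel body once per index, possibly in parallel, by using Array.Parallel.iter from the F# core library.

```fsharp
// CPU stand-in for a kernel launch: the body runs once per index,
// and the runs may execute in parallel.
let launch (count: int) (kernel: int -> unit) =
    Array.Parallel.iter kernel [| 0 .. count - 1 |]

let input  = [| 1.f; 2.f; 3.f; 4.f |]
let output = Array.zeroCreate<float32> input.Length

// Each "thread" copies exactly one element, so no two runs touch the same slot.
launch input.Length (fun threadId -> output.[threadId] <- input.[threadId])

printfn "%A" output
```

Because every run writes to a different element, the runs need no synchronization, which is the property that makes this style of code a good fit for the GPU.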

Example 9-26. F# code for the translation to CUDA code

[<ReflectedDefinition; GPU>]
let sample (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>)=
    let x = blockIdx.x
    c.Set(a.[x] + b.[x], x) //c.[x] = a.[x] + b.[x]
    ()

[<ReflectedDefinition; GPU>]
let pascalTriangle (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let x = blockIdx.x
    if x = 0 then
        b.Set(a.[x], x)
    else
        b.Set(a.[x] + a.[x - 1], x)
    ()

[<ReflectedDefinition; GPU>]
let sample2 (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>)=
    let x = blockIdx.x
    for j = 0 to x do
        c.Set(a.[j] + b.[j], x)
    let mutable i = 3
    while i >= 8 do
        if i > 3 then
            c.Set(a.[i] + b.[i], x)
        else
            c.Set(a.[i] + b.[0], x)
        i <- i + 1

Note

To simplify the translation, the function must return unit.

To translate the code, you first need to define some basic data structures, which are shown in Example 9-27. The GPU kernel can launch multiple threads simultaneously. ThreadIdx identifies a thread within its block, and BlockDim gives the dimensions of a block, which determine how the data is segmented. Their relationship is shown in Figure 9-7.

Example 9-27. CUDA data structures

type dim3 (x, y, z) =
    new() = dim3(0, 0, 0)
    new(x) = dim3(x, 0, 0)
    new(x, y) = dim3(x, y, 0)
    member this.x = x
    member this.y = y
    member this.z = z

type ThreadIdx() =
    inherit dim3()

type BlockDim() =
    inherit dim3()

Figure 9-7. Block and thread relationship
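When a kernel is launched with more than one thread per block, CUDA code commonly combines the block index, the block dimension, and the thread index into a single position in the data. The samples in this chapter use only blockIdx.x; the helper below, globalIndex, is a hypothetical name that sketches the usual one-dimensional formula on the CPU.

```fsharp
// One-dimensional global position of a thread:
// blockIdx.x * blockDim.x + threadIdx.x
let globalIndex (blockIdx: int) (blockDim: int) (threadIdx: int) =
    blockIdx * blockDim + threadIdx

// Enumerate the positions covered by 2 blocks of 4 threads each.
let positions =
    [ for block in 0 .. 1 do
        for thread in 0 .. 3 do
            yield globalIndex block 4 thread ]

printfn "%A" positions
```

With 2 blocks of 4 threads, the positions run from 0 to 7, so each element of an 8-element array is handled by exactly one thread.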

The generated code has two sections: the function that starts with the __device__ keyword, and the function with the __global__ keyword. A __device__ function cannot be invoked from the host, whereas a __global__ function can. To reach the functionality in a __device__ function from the host, there must therefore be a __global__ function that calls it. The generated code is in the following format:

#include "CUDALibrary.h"

__device__ void ff_0(float* a, float* b) {
   // some generated code
}

extern "C" __global__ void sample4 (float* a, float* b)  {
ff_0(a, b);
}

The code has three parts:

Header file CUDALibrary.h
Generated __device__ function (ff_0 in the preceding code)
Generated __global__ function, which will be invoked from the host and call into the __device__ function ff_0
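The third part is mechanical enough to sketch on its own. The helper below, globalWrapper, is a hypothetical function and not the getFunctionSignature code from this chapter; it assumes the parameters are already available as pairs of a C type and a name.

```fsharp
// Builds the __global__ entry point that forwards its arguments to a
// generated __device__ function. Parameters are (C type, name) pairs.
let globalWrapper (name: string) (parameters: (string * string) list) (deviceFunction: string) =
    let signature =
        parameters |> List.map (fun (t, n) -> sprintf "%s %s" t n) |> String.concat ", "
    let arguments =
        parameters |> List.map snd |> String.concat ", "
    sprintf "extern \"C\" __global__ void %s (%s) {\n    %s(%s);\n}"
        name signature deviceFunction arguments

let wrapper = globalWrapper "sample4" [ ("float*", "a"); ("float*", "b") ] "ff_0"
printfn "%s" wrapper
```

The extern "C" prefix keeps the kernel name unmangled, so the host can look the function up by the same name that was used in F#.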

Example 9-28 shows the functions used to generate the CUDA code. The Code class holds the generated functions, and the getCUDACode function returns the CUDA code for a quoted function. The listing translates three samples from F# code to CUDA C code; together they show how if, for, and while structures are translated.

Example 9-28. Code-generation function and the generated code

let tempFileName = "temp1.cu"

type Code() =
    inherit System.Collections.Generic.List<string>()
    let mutable functionID = 0
    let mutable variableID = 0
    member this.FunctionID
        with get () = functionID
        and set (v) = functionID <-v
    member this.VariableID
        with get () = variableID
        and set(v) = variableID <- v
    member this.IncreaseFunctionID() =
        functionID <- this.FunctionID + 1
    member this.IncreaseVariableID() =
        variableID <- this.VariableID + 1

    member this.ToCode() =
        "#include \"CUDALibrary.h\"\n" + String.Join("\n", this) + "\n\n\n"

let code = Code()
let getCUDACode f =
    let s = getFunctionSignature f
    let body = f |> getFunctionBody
    let functionStr = sprintf "%s {\n%s\n}\n" s (accessExpr body)
    sprintf "%s" functionStr

let getCommonCode() = code.ToCode()

[<ReflectedDefinition; GPU>]
let pascalTriangle (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let x = blockIdx.x
    if x = 0 then
        b.Set(a.[x], x)
    else
        b.Set(a.[x] + a.[x-1], x)
    ()

[<ReflectedDefinition; GPU>]
let sample (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>)=
    let x = blockIdx.x
    c.Set(a.[x] + b.[x], x) //c.[x] = a.[x] + b.[x]
    ()

// sample2 calls sample, so sample must be defined first
[<ReflectedDefinition; GPU>]
let sample2 (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>)=
    sample a b c
    let x = blockIdx.x
    for j = 0 to x do
        c.Set(a.[j] + b.[j], x)
    // i starts at 3, so this loop body never runs; it only demonstrates the translation
    let mutable i = 3
    while i >= 8 do
        if i > 3 then
            c.Set(a.[i] + b.[i], x)
        else
            c.Set(a.[i] + b.[0], x)
        i <- i + 1

let WriteToFile() =
    let a1 = <@@ sample @@>
    let b = getCUDACode(a1)
    let a2 = <@@ sample2 @@>
    let b2 = getCUDACode(a2)
    let a3 = <@@ pascalTriangle @@>
    let b3 = getCUDACode(a3)
    let commonCode = getCommonCode()

    System.IO.File.WriteAllText(tempFileName, commonCode + b + b2 + b3)
    ()

Generated CUDA C code

#include "CUDALibrary.h"

__device__ void ff_0(double* a, double* b, double* c) { int x;
x = blockIdx.x;
c[x] = (a[x]) + (b[x]);
; }

__device__ void ff_1(float* a, float* b) { int x;
x = blockIdx.x;
if ((x) == (0)) { b[x] = a[x]; }
else { b[x] = (a[x]) + (a[(x) - (1)]); };
; }

__device__ void ff_3(double* a, double* b, double* c) { int x;
x = blockIdx.x;
c[x] = (a[x]) + (b[x]);
; }


__device__ void ff_2(double* a, double* b, double* c) { ff_3(a, b, c);
int x;
x = blockIdx.x;
for (int j = 0; j < x; j++) { c[x] = (a[j]) + (b[j]); };
int i;
i = 3;
while ((i) >= (8)) {
 if ((i) > (3)) { c[x] = (a[i]) + (b[i]); }
else { c[x] = (a[i]) + (b[0]); };
i = (i) + (1);
}; }


extern "C" __global__ void sample (double* a, double* b, double* c)  {
 ff_0(a, b, c);
}
extern "C" __global__ void pascalTriangle (float* a, float* b)  {
 ff_1(a, b);
}
extern "C" __global__ void sample2 (double* a, double* b, double* c)  {
 ff_2(a, b, c);
}

Note

The CUDALibrary.h header file serves as a placeholder, which can hold some customized functions. In this sample, the file is empty.
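The generated signatures above show that CUDAPointer2<float> parameters become double* and CUDAPointer2<float32> parameters become float*, because F# float is a 64-bit type and float32 is a 32-bit type. A minimal sketch of that mapping follows; the cudaTypeName and cudaPointerName helpers are hypothetical names used only for illustration, and the int case is an assumption based on the int locals in the generated code.

```fsharp
open System

// Hypothetical helper: maps a .NET element type to the CUDA C type name
// that appears in the generated signatures above.
let cudaTypeName (t: Type) =
    if t = typeof<float32> then "float"       // F# float32 is 32-bit
    elif t = typeof<float> then "double"      // F# float is 64-bit
    elif t = typeof<int> then "int"           // assumed mapping for int locals
    else failwithf "no CUDA mapping assumed for %s" t.Name

// Pointer parameters become C pointers to the mapped element type.
let cudaPointerName (t: Type) = cudaTypeName t + "*"

printfn "%s" (cudaPointerName typeof<float>)    // double*
printfn "%s" (cudaPointerName typeof<float32>)  // float*
```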

Wrapping each function that needs to be converted in <@@...@@> is inconvenient. With the GPUAttribute, you can instead scan the assembly automatically and generate CUDA code. Example 9-29 shows how to scan the assembly, identify the functions marked with the GPU attribute, and generate code from them. The generated code is stored in the “temp1.cu” file, which is specified by the tempFile value.

Example 9-29. Generating code from functions decorated with the GPU attribute

let getCommonCode() = code.ToCode()

let getGPUFunctions() =
    let currentAssembly = Assembly.GetExecutingAssembly()
    // collect every public method that carries the GPU attribute
    let gpuMethods =
        currentAssembly.GetTypes()
        |> List.ofArray
        |> List.collect (fun t -> t.GetMethods() |> List.ofArray)
        |> List.filter (fun mi ->
                mi.GetCustomAttributes(typeof<GPUAttribute>, true).Length > 0)

    // keep only the methods that also have a reflected definition (quotation)
    gpuMethods
    |> List.choose (fun mi ->
            match Quotations.Expr.TryGetReflectedDefinition mi with
            | Some expr -> Some (mi, expr)
            | None -> None)
    |> List.map (fun (mi, expr) ->
            sprintf "%s {\n %s \n}"
                (getFunctionSignatureFromMethodInfo mi)
                (accessExpr expr))

let tempFile = "temp1.cu"

let GenerateCodeToFile() =
    let gpuCode = getGPUFunctions()
    let commonCode = getCommonCode()
    let allCode = String.Join("\n", commonCode :: gpuCode)
    System.IO.File.Delete(tempFile)
    System.IO.File.WriteAllText(tempFile, allCode)

The generated CUDA C code cannot run on the GPU directly. You first use nvcc.exe to compile it to a PTX file, which can then be loaded and executed on the GPU. PTX is not an executable binary; it is closer to an assembly language for the GPU, and it is compiled to binary code for the specific target GPU at run time. The generated PTX file is shown in Example 9-30.

Example 9-30. Generated PTX file

    .version 1.4
    .target sm_10, map_f64_to_f32
    // compiled with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../open64/lib//be.exe
    // nvopencc 4.0 built on 2011-05-13

    //-----------------------------------------------------------
    // Compiling C:/Users/taliu/AppData/Local/Temp/tmpxft_00000a84_00000000-11_temp.cpp3.i (C:/Users/taliu/AppData/Local/Temp/ccBI#.a04920)
    //-----------------------------------------------------------
    //-----------------------------------------------------------
    // Options:
    //-----------------------------------------------------------
    //  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
    //  -O3    (Optimization level)
    //  -g0    (Debug level)
    //  -m2    (Report advisories)
    //-----------------------------------------------------------

    .file    1    "C:/Users/taliu/AppData/Local/Temp/tmpxft_00000a84_00000000-10_temp.cudafe2.gpu"
    .file    2    "c:\program files (x86)\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h"
    .file    3    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\crt/device_runtime.h"
    .file    4    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\host_defines.h"
    .file    5    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\builtin_types.h"
    .file    6    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\device_types.h"
    .file    7    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\driver_types.h"
    .file    8    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\surface_types.h"
    .file    9    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\texture_types.h"
    .file    10    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\vector_types.h"
    .file    11    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\builtin_types.h"
    .file    12    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\host_defines.h"
    .file    13    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\device_launch_parameters.h"
    .file    14    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\crt\storage_class.h"
    .file    15    "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin/../../VC/INCLUDE\time.h"
    .file    16    "temp.cu"
    .file    17    "c:\mycode\codecenter\f#\fsharpgpu\fsharpgpu\bin\debug\CUDALibrary.h"
    .file    18    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\common_functions.h"
    .file    19    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions.h"
    .file    20    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_constants.h"
    .file    21    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\device_functions.h"
    .file    22    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_11_atomic_functions.h"
    .file    23    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_12_atomic_functions.h"
    .file    24    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_13_double_functions.h"
    .file    25    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_20_atomic_functions.h"
    .file    26    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_20_intrinsics.h"
    .file    27    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\surface_functions.h"
    .file    28    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\texture_fetch_functions.h"
    .file    29    "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions_dbl_ptx1.h"


    .entry sample (
        .param .u64 __cudaparm_sample_a,
        .param .u64 __cudaparm_sample_b,
        .param .u64 __cudaparm_sample_c)
    {
    .reg .u32 %r<3>;
    .reg .u64 %rd<10>;
    .reg .f64 %fd<5>;
    .loc    16    33    0
$LDWbegin_sample:
    .loc    16    5    0
    cvt.s32.u16     %r1, %ctaid.x;
    cvt.s64.s32     %rd1, %r1;
    mul.wide.s32     %rd2, %r1, 8;
    ld.param.u64     %rd3, [__cudaparm_sample_a];
    add.u64     %rd4, %rd3, %rd2;
    ld.global.f64     %fd1, [%rd4+0];
    ld.param.u64     %rd5, [__cudaparm_sample_b];
    add.u64     %rd6, %rd5, %rd2;
    ld.global.f64     %fd2, [%rd6+0];
    add.f64     %fd3, %fd1, %fd2;
    ld.param.u64     %rd7, [__cudaparm_sample_c];
    add.u64     %rd8, %rd7, %rd2;
    st.global.f64     [%rd8+0], %fd3;
    .loc    16    35    0
    exit;
$LDWend_sample:
    } // sample

    .entry sample2 (
        .param .u64 __cudaparm_sample2_a,
        .param .u64 __cudaparm_sample2_b,
        .param .u64 __cudaparm_sample2_c)
    {
    .reg .u32 %r<7>;
    .reg .u64 %rd<12>;
    .reg .f64 %fd<8>;
    .reg .pred %p<4>;
    .loc    16    36    0
$LDWbegin_sample2:
    .loc    16    10    0
    cvt.s32.u16     %r1, %ctaid.x;
    cvt.s64.s32     %rd1, %r1;
    mul.wide.s32     %rd2, %r1, 8;
    ld.param.u64     %rd3, [__cudaparm_sample2_c];
    add.u64     %rd4, %rd3, %rd2;
    ld.param.u64     %rd5, [__cudaparm_sample2_b];
    ld.param.u64     %rd6, [__cudaparm_sample2_a];
    add.u64     %rd7, %rd2, %rd5;
    ld.global.f64     %fd1, [%rd7+0];
    add.u64     %rd8, %rd2, %rd6;
    ld.global.f64     %fd2, [%rd8+0];
    add.f64     %fd3, %fd1, %fd2;
    st.global.f64     [%rd4+0], %fd3;
    .loc    16    14    0
    mov.u32     %r2, 0;
    setp.le.s32     %p1, %r1, %r2;
    @%p1 bra     $Lt_1_3074;
    mov.s32     %r3, %r1;
    .loc    16    10    0
    ld.param.u64     %rd6, [__cudaparm_sample2_a];
    .loc    16    14    0
    mov.s64     %rd9, %rd6;
    .loc    16    10    0
    ld.param.u64     %rd5, [__cudaparm_sample2_b];
    .loc    16    14    0
    mov.s64     %rd10, %rd5;
    mov.s32     %r4, 0;
    mov.s32     %r5, %r3;
$Lt_1_3586:
 //<loop> Loop body line 14, nesting depth: 1, estimated iterations: unknown
    .loc    16    17    0
    ld.global.f64     %fd4, [%rd9+0];
    ld.global.f64     %fd5, [%rd10+0];
    add.f64     %fd6, %fd4, %fd5;
    st.global.f64     [%rd4+0], %fd6;
    add.s32     %r4, %r4, 1;
    add.u64     %rd10, %rd10, 8;
    add.u64     %rd9, %rd9, 8;
    setp.ne.s32     %p2, %r4, %r1;
    @%p2 bra     $Lt_1_3586;
$Lt_1_3074:
    .loc    16    38    0
    exit;
$LDWend_sample2:
    } // sample2

    .entry pascalTriangle (
        .param .u64 __cudaparm_pascalTriangle_a,
        .param .u64 __cudaparm_pascalTriangle_b)
    {
    .reg .u32 %r<4>;
    .reg .u64 %rd<8>;
    .reg .f32 %f<5>;
    .reg .pred %p<3>;
    .loc    16    39    0
$LDWbegin_pascalTriangle:
    .loc    16    40    0
    cvt.s32.u16     %r1, %ctaid.x;
    cvt.s64.s32     %rd1, %r1;
    mul.wide.s32     %rd2, %r1, 4;
    ld.param.u64     %rd3, [__cudaparm_pascalTriangle_a];
    add.u64     %rd4, %rd3, %rd2;
    ld.param.u64     %rd5, [__cudaparm_pascalTriangle_b];
    add.u64     %rd6, %rd5, %rd2;
    ld.global.f32     %f1, [%rd4+0];
    mov.u32     %r2, 0;
    setp.ne.s32     %p1, %r1, %r2;
    @%p1 bra     $Lt_2_1282;
    .loc    16    28    0
    st.global.f32     [%rd6+0], %f1;
    bra.uni     $Lt_2_1026;
$Lt_2_1282:
    .loc    16    29    0
    ld.global.f32     %f2, [%rd4+-4];
    add.f32     %f3, %f2, %f1;
    st.global.f32     [%rd6+0], %f3;
$Lt_2_1026:
    .loc    16    41    0
    exit;
$LDWend_pascalTriangle:
    } // pascalTriangle
    .global .u32 error;

The CUDARunTime class manages loading and executing functions on the GPU. GPUExecution is a wrapper class that also includes the function to generate the PTX file; it runs nvcc.exe with the -ptx switch to do so. In the Init function, GPUExecution calls the CUDARunTime LoadModule method to load the PTX file into the GPU, as shown in Example 9-31.

Example 9-31. GPUExecution and the CUDARuntime class

CUDA runtime and CUDA array class

open System
open System.Text
open System.Collections.Generic
open System.Runtime.InteropServices

type uint = uint32
[<Struct>]
type CUDAModule =
    val Pointer : IntPtr

[<Struct>]
type CUDADevice =
    val Pointer : int

[<Struct>]
type CUDAContext =
    val Pointer : IntPtr

[<Struct>]
type CUDAFunction =
    val Pointer : IntPtr

module CudaDataStructureExtensions =
    let is64Bit = IntPtr.Size = 8

module InteropLibrary2 =
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetv(CUDAFunction hfunc, int offset, IntPtr ptr, uint numbytes)

module InteropLibrary =
    [<DllImport("nvcuda")>]
    extern CUResult cuModuleLoad(CUDAModule& m, string fn)
    [<DllImport("nvcuda")>]
    extern CUResult cuDriverGetVersion(int& driverVersion)
    [<DllImport("nvcuda")>]
    extern CUResult cuInit(uint Flags)
    [<DllImport("nvcuda", EntryPoint = "cuCtxCreate_v2")>]
    extern CUResult cuCtxCreate(CUDAContext& pctx, uint flags, CUDADevice dev)
    [<DllImport("nvcuda")>]
    extern CUResult cuDeviceGet(CUDADevice& device, int ordinal)
    [<DllImport("nvcuda")>]
    extern CUResult cuModuleGetFunction(CUDAFunction& hfunc, CUDAModule hmod, string name)
    [<DllImport("nvcuda")>]
    extern CUResult cuFuncSetBlockShape(CUDAFunction hfunc, int x, int y, int z)
    [<DllImport("nvcuda")>]
    extern CUResult cuLaunch(CUDAFunction f)
    [<DllImport("nvcuda")>]
    extern CUResult cuLaunchGrid(CUDAFunction f, int grid_width, int grid_height)

    [<DllImport("nvcuda", EntryPoint = "cuMemAlloc_v2")>]
    extern CUResult cuMemAlloc(CUDAPointer& dptr, uint bytesize)
    [<DllImport("nvcuda", EntryPoint = "cuMemcpyDtoH_v2")>]
    extern CUResult cuMemcpyDtoH(IntPtr dstHost, CUDAPointer srcDevice, uint ByteCount)
    [<DllImport("nvcuda", EntryPoint = "cuMemcpyHtoD_v2")>]
    extern CUResult cuMemcpyHtoD(CUDAPointer dstDevice, IntPtr srcHost, uint ByteCount)
    [<DllImport("nvcuda", EntryPoint = "cuMemFree_v2")>]
    extern CUResult cuMemFree(CUDAPointer dptr)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSeti(CUDAFunction hfunc, int offset, uint value)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetf(CUDAFunction hfunc, int offset, float32 value)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetv(CUDAFunction hfunc, int offset, int64& value, uint numbytes)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetSize(CUDAFunction hfunc, uint numbytes)

    [<DllImport("nvcuda", EntryPoint = "cuMemsetD8_v2")>]
    extern CUResult cuMemsetD8(CUDAPointer dstDevice, byte uc, uint N)
    [<DllImport("nvcuda", EntryPoint = "cuMemsetD16_v2")>]
    extern CUResult cuMemsetD16(CUDAPointer dstDevice, uint16 us, uint N)

type CUDAArray<'T>(cudaPointer:CUDAPointer2<_>, size:uint, runtime:CUDARunTime) =
    let unitSize = uint32(sizeof<'T>)
    interface IDisposable with
        member this.Dispose() = runtime.Free(cudaPointer) |> ignore
    member this.Runtime with get() = runtime
    // size is the number of elements, not the number of bytes
    member this.SizeInByte with get() = size * unitSize
    member this.Pointer with get() = cudaPointer
    member this.UnitSize with get() = unitSize
    member this.Size with get() = int size
    member this.ToArray<'T>() =
        let out = Array.create (int(size)) Unchecked.defaultof<'T>
        this.Runtime.CopyDeviceToHost(this.Pointer, out)

and CUDARunTime(deviceID) =
    let mutable device = CUDADevice()
    let mutable deviceContext = CUDAContext()
    let mutable m = CUDAModule()

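    let init() =
        // cuInit requires a flags value of 0, and 0 selects the default context flags;
        // the device ordinal is passed only to cuDeviceGet
        if InteropLibrary.cuInit(0u) <> CUResult.Success then
            failwith "cuInit failed"
        if InteropLibrary.cuDeviceGet(&device, int(deviceID)) <> CUResult.Success then
            failwith "cuDeviceGet failed"
        if InteropLibrary.cuCtxCreate(&deviceContext, 0u, device) <> CUResult.Success then
            failwith "cuCtxCreate failed"
    do init()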

    let align(offset, alignment) = offset + alignment - 1 &&& ~~~(alignment - 1);
    new() = new CUDARunTime(0u)

    interface IDisposable with
        member this.Dispose() = ()

    member this.LoadModule(fn) =
        (InteropLibrary.cuModuleLoad(&m, fn), m)
    member this.Version
        with get() =
            let mutable a = 0
            (InteropLibrary.cuDriverGetVersion(&a), a)
    member this.Is64Bit with get() = CudaDataStructureExtensions.is64Bit
    member this.GetFunction(fn) =
        let mutable f = CUDAFunction()
        (InteropLibrary.cuModuleGetFunction(&f, m, fn), f)
    member this.ExecuteFunction(fn, x, y) =
        let r, f = this.GetFunction(fn)
        if r = CUResult.Success then
            InteropLibrary.cuLaunchGrid(f, x, y)
        else
            r
    member this.ExecuteFunction(fn) =
        let r, f = this.GetFunction(fn)
        if r = CUResult.Success then
            InteropLibrary.cuLaunch(f)
        else
            r
    member this.ExecuteFunction(fn, [<ParamArray>] parameters:obj list) =
        let func = this.GetFunctionPointer(fn)
        this.SetParameter(func, parameters)
        let r = InteropLibrary.cuLaunch(func)
        r
    member this.ExecuteFunction(fn, parameters:obj list, x, y) =
        let func = this.GetFunctionPointer(fn)
        let paras =
            parameters
            |> List.map (fun n -> match n with
                                  | :? CUDAPointer2<float> as p -> box(p.Pointer)
                                  | :? CUDAPointer2<float32> as p -> box(p.Pointer)
                                  | :? CUDAPointer2<_> as p -> box(p.Pointer)
                                  | _ -> n)

        this.SetParameter(func, paras)
        InteropLibrary.cuLaunchGrid(func, x, y)
    member private this.GetFunctionPointer(fn) =
        let r, p = this.GetFunction(fn)
        if r = CUResult.Success then p
        else failwith "cannot get function pointer"

    // allocate
    member this.Allocate(bytes:uint) =
        let mutable p = CUDAPointer()
        (InteropLibrary.cuMemAlloc(&p, bytes), CUDAPointer2(p))
    member this.Allocate(array) =
        let size = this.GetSize(array) |> uint32
        this.Allocate(size)
    member this.GetSize(data:'T array) =
        this.MSizeOf(typeof<'T>) * uint32(data.Length)
    member this.GetUnitSize(data:'T array) =
        this.MSizeOf(typeof<'T>)
    member private this.MSizeOf(t:Type) =
        if t = typeof<System.Char> then 2u
        else Marshal.SizeOf(t) |> uint32
    member this.Free(p:CUDAPointer2<_>) : CUResult =
        InteropLibrary.cuMemFree(p.Pointer)
    member this.CopyHostToDevice(data: 'T array) =
        let gCHandle = GCHandle.Alloc(data, GCHandleType.Pinned)
        let size = this.GetSize(data)
        let r, p = this.Allocate(size)
        let r = (InteropLibrary.cuMemcpyHtoD(p.Pointer, gCHandle.AddrOfPinnedObject(), size), p)
        gCHandle.Free()
        r
    member this.CopyDeviceToHost(p:CUDAPointer2<_>, data) =
        let gCHandle = GCHandle.Alloc(data, GCHandleType.Pinned)
        let r = (InteropLibrary.cuMemcpyDtoH(
                      gCHandle.AddrOfPinnedObject(),
                      p.Pointer,
                      this.GetSize(data)),
                      data)
        gCHandle.Free()
        r

    //parameter setting
    member private this.SetParameter<'T>(func, offset, vector:'T) =
        let gCHandle = GCHandle.Alloc(vector, GCHandleType.Pinned)
        let numbytes = uint32(Marshal.SizeOf(vector))
        let r = InteropLibrary2.cuParamSetv(func, offset, gCHandle.AddrOfPinnedObject(), numbytes)
        gCHandle.Free()
        r
    member private this.SetParameterSize(func, size) =
        if InteropLibrary.cuParamSetSize(func, size) = CUResult.Success then ()
        else failwith "set parameter size failed"
    member this.SetParameter(func, parameters) =
        let mutable num = 0
        for para in parameters do
            match box(para) with
            | :? uint32 as n ->
                num <- align(num, 4)
                if InteropLibrary.cuParamSeti(func, num, n) = CUResult.Success then ()
                else failwith "set uint32 failed"
                num <- num + 4
            | :? float32 as f ->
                num <- align(num, 4)
                if InteropLibrary.cuParamSetf(func, num, f) = CUResult.Success then ()
                else failwith "set float failed"
                num <- num + 4
            | :? int64 as i64 ->
                num <- align(num, 8)
                let mutable i64Ref = i64
                if InteropLibrary.cuParamSetv(func, num, &i64Ref, 8u) = CUResult.Success then ()
                else failwith "set int64 failed"
                num <- num + 8
            | :? char as ch ->
                num <- align(num, 2)
                let bytes = Encoding.Unicode.GetBytes([|ch|])
                let v = BitConverter.ToUInt16(bytes, 0)
                if this.SetParameter(func, num, v) = CUResult.Success then ()
                else failwith "set char failed"
                num <- num + 2
            | :? CUDAPointer as devPointer ->
                num <- align(num, devPointer.PointerSize)
                if devPointer.PointerSize = 8 then
                    if this.SetParameter(func,
                                             num,
                                             uint64(int64(devPointer.Pointer)))
                                             = CUResult.Success then ()
                    else failwith "set device pointer failed"
                else
                    if InteropLibrary.cuParamSeti(func,
                                                        num,
                                                        uint32(int(devPointer.Pointer)))
                                                        = CUResult.Success then ()
                    else failwith "set device pointer failed"
                num <- num + devPointer.PointerSize
            | :? CUDAArray<float32> as devArray ->
                let devPointer:CUDAPointer2<_> = devArray.Pointer
                num <- align(num, devPointer.PointerSize)
                if devPointer.PointerSize = 8 then
                    if this.SetParameter(func,
                                             num,
                                             uint64(int64(devPointer.Pointer.Pointer)))
                                             = CUResult.Success then ()
                    else failwith "set device pointer failed"
                else
                    if InteropLibrary.cuParamSeti(func,
                                                        num,
                                                        uint32(int(devPointer.Pointer.Pointer)))
                                                        = CUResult.Success then ()
                    else failwith "set device pointer failed"
                num <- num + devPointer.PointerSize
            | _ when para.GetType().IsValueType ->
                let n = int(this.MSizeOf(para.GetType()))
                num <- align(num, n)
                if this.SetParameter(func, num, box(para)) = CUResult.Success then ()
                else failwith "set no-char object"
                num <- num + n
            | _ -> failwith "not supported"
        this.SetParameterSize( func, uint32(num) )
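The align helper in CUDARunTime rounds a byte offset up to the next multiple of the parameter's alignment, which works because each alignment used is a power of two. The sketch below repeats that rounding rule and walks a list of parameter sizes to show where each parameter would land; the layout function is a hypothetical illustration and is not part of the runtime class.

```fsharp
// Same rounding rule as the align helper in CUDARunTime:
// round offset up to the next multiple of alignment (a power of two).
let align (offset: int, alignment: int) =
    (offset + alignment - 1) &&& ~~~(alignment - 1)

// Hypothetical walk over parameter sizes in bytes, returning each parameter's
// offset and the total size that would be passed to cuParamSetSize.
let layout (sizes: int list) =
    let folder (offsets, next) size =
        let start = align (next, size)
        (start :: offsets, start + size)
    let offsets, total = List.fold folder ([], 0) sizes
    (List.rev offsets, total)

// A float32 (4 bytes), a 64-bit pointer (8 bytes), then a char (2 bytes).
let offsets, total = layout [ 4; 8; 2 ]
printfn "%A %d" offsets total   // [0; 8; 16] 18
```

The pointer cannot start at offset 4 because it needs 8-byte alignment, so four bytes of padding are skipped and it starts at offset 8.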

Execution class

namespace FSharp.Execution

open System
open System.Diagnostics
open System.IO
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns
open Microsoft.FSharp.Quotations.ExprShape

type BlockID() =
    inherit dim3()

type GPUExecution () as this =
    let runtime = new CUDARunTime()
    let nvcc = "nvcc.exe"
    do this.Init() |> ignore

    interface IDisposable with
        member this.Dispose() = ()

    member this.Runtime with get() = runtime

    // compile the code to PTX file
    member private this.CompileToPTX() : string =
        let fn = @".\temp.cu"
        this.CompileToPTX(fn)

    // compile the file to PTX file and return PTX file name
    member private this.CompileToPTX(fn) : string =
        use p = new Process()
        let para = sprintf "%s -ptx" fn
        p.StartInfo <- ProcessStartInfo(nvcc, para)
        p.StartInfo.UseShellExecute <- false
        p.StartInfo.WindowStyle <- ProcessWindowStyle.Hidden
        p.Start() |> ignore
        p.WaitForExit()
        System.IO.Path.GetFileNameWithoutExtension(fn) + ".ptx"

    // compile to PTX file and load the PTX file to GPU
    member this.Init() =
        let fn = this.CompileToPTX()
        let r, m = runtime.LoadModule(fn)
        if isSuccess r then m
        else failwith "cannot load module"

    member this.Init(fn:string) =
        let fn = this.CompileToPTX(fn)
        let r,m = runtime.LoadModule(fn)
        if isSuccess r then m
        else failwith "cannot load module"

    // execute function loaded on GPU with parameter list
    member this.Execute(fn:string, list:'T array list) =
        let unitSize = (sizeof<'T>) |> uint32
        let size = list.Head.Length |> uint32
        let results =
            list
            |> List.map (fun l -> this.Runtime.CopyHostToDevice(l))
            |> List.map (fun (r,p) -> (r, new CUDAArray<'T>(p, size, this.Runtime)))
        let success = results |> Seq.forall (fun (r, _) -> isSuccess(r))
        if success then
            let pointers = results |> List.map snd
            let head = List.head list
            let result = this.Runtime.ExecuteFunction(
                               fn,
                               pointers |> List.map box,
                               head.Length,
                               1)
            let out = Array.create head.Length 0.f
            let a = this.Runtime.CopyDeviceToHost(pointers.[0].Pointer, out)
            (result, pointers)
        else
            failwith "copy host failed"

    // copy data from host (CPU) memory to device (GPU) memory
    member this.CopyHostToDevice(data: 'T array) =
        let r, out = this.Runtime.CopyHostToDevice(data)
        if r = CUResult.Success then out
        else failwith "cannot copy host to device"

    // copy data from device (GPU) memory to host (CPU) memory
    member this.CopyDeviceToHost(p:CUDAPointer2<_>, data) =
        let r, out = this.Runtime.CopyDeviceToHost(p, data)
        if r = CUResult.Success then out
        else failwith "cannot copy device to host"

    // convert a list to CUDA array
    member this.ToCUDAArray(l) =
        let r, array = this.Runtime.CopyHostToDevice(l)
        if r = CUResult.Success then array
        else failwith "cannot copy host to device"

    // execute function loaded on GPU with cuda array list
    member this.ExecuteFunction(fn:string, cudaArray:CUDAPointer list) =
        let r = this.Runtime.ExecuteFunction(
                               fn,
                               cudaArray |> List.map box,
                               cudaArray.Length,
                               1)
        r

Pascal Triangle

With everything ready, you can create a few examples that use the GPU. Example 9-32 compares the CPU and GPU versions of the Pascal Triangle computation. In the run shown, the CPU version takes about 103 ms and the GPU version about 45 ms, so the GPU finishes sooner even with the overhead of copying data to and from the device. If the data set is large, the transfer time is small relative to the computation and using the GPU is worthwhile. If the data set is small, most of the time is spent moving data to and from the GPU, which can make the GPU version slower.

Example 9-32. Pascal Triangle computation on CPU and GPU

let len = 1000

let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let pascalTriangle (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let x = blockIdx.x
    if x = 0 then
        b.Set(a.[x], x)
    else
        b.Set(a.[x] + a.[x - 1], x)
    ()

// GPU version
let test3() =
    WriteToFile()    // defined in Example 9-28
    let execution = new GPUExecution()
    let m = execution.Init(tempFileName)

    let stopWatch =  System.Diagnostics.Stopwatch()
    stopWatch.Reset()
    stopWatch.Start()

    let l0 = Array.zeroCreate len
    let l1 = Array.zeroCreate len
    l0.[0] <- 1.f
    l1.[0] <- 0.f
    let r, p = execution.Runtime.CopyHostToDevice(l0)
    let r, p2 = execution.Runtime.CopyHostToDevice(l1)
    let rs =
        [1..len]
        |> Seq.map (fun i ->
                        if i % 2 = 1 then
                            let r = execution.Runtime.ExecuteFunction(
                                          "pascalTriangle",
                                          [p; p2],
                                          len,
                                          1)
                            r
                        else
                            let r = execution.Runtime.ExecuteFunction(
                                          "pascalTriangle",
                                          [p2; p],
                                          len,
                                          1)
                            r)
        |> Seq.toList

    let result1, o1 = execution.Runtime.CopyDeviceToHost(p, l0)
    let result2, o2 = execution.Runtime.CopyDeviceToHost(p2, l1)
    stopWatch.Stop()
    printfn "%A" stopWatch.Elapsed
    ()

let computePascal(p:float32 array, p2:float32 array) =
    let len = p.Length
    [0..len-1]
    |> Seq.iter (fun i ->
                    if i = 0 then p2.[i] <- 1.f
                    else p2.[i] <- p.[i-1] + p.[i])

    ()

// CPU version of Pascal Triangle
let test4() =
    let stopWatch =  System.Diagnostics.Stopwatch()
    stopWatch.Reset()
    stopWatch.Start()

    let l0 = Array.zeroCreate len
    let l1 = Array.zeroCreate len
    l0.[0] <- 1.f
    l1.[0] <- 0.f

    [1..len]
    |> Seq.map (fun i ->
                        if i % 2 = 1 then
                            let r = computePascal(l0, l1)
                            r
                        else
                            let r = computePascal(l1, l0)
                            r)
    |> Seq.toList
    |> ignore

    stopWatch.Stop()
    printfn "%A" stopWatch.Elapsed

    ()

Execution result that runs the CPU version followed by the GPU version

00:00:00.1034888
temp.cu
c:\mycode\codecenter\f#\fsharpgpu\fsharpgpu\bin\debug\CUDALibrary.h(56): warning: variable "sizeT" was declared but never referenced

temp.cu
tmpxft_00001558_00000000-3_temp.cudafe1.gpu
tmpxft_00001558_00000000-10_temp.cudafe2.gpu
temp.cu
c:\mycode\codecenter\f#\fsharpgpu\fsharpgpu\bin\debug\CUDALibrary.h(56): warning: variable "sizeT" was declared but never referenced
temp.cu
tmpxft_00000790_00000000-3_temp.cudafe1.gpu
tmpxft_00000790_00000000-10_temp.cudafe2.gpu
00:00:00.0448262

Note

Some functions in the preceding listing were defined previously.
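Before trusting the timings, it helps to check the recurrence itself on a small input. The sketch below applies the same update rule as computePascal and swaps two buffers after every step, much as the listing alternates its two arrays; the names step and pascalRow are hypothetical and exist only for this check.

```fsharp
// One step of the recurrence used by computePascal in Example 9-32:
// next.[0] = 1 and next.[i] = prev.[i-1] + prev.[i].
let step (prev: float32[]) (next: float32[]) =
    for i in 0 .. prev.Length - 1 do
        next.[i] <- if i = 0 then 1.f else prev.[i - 1] + prev.[i]

// Hypothetical helper: apply the step n times, swapping the two buffers
// each time, and return the buffer that holds the latest row.
let pascalRow (n: int) (width: int) =
    let mutable current = Array.zeroCreate<float32> width
    let mutable scratch = Array.zeroCreate<float32> width
    current.[0] <- 1.f
    for _ in 1 .. n do
        step current scratch
        let t = current
        current <- scratch
        scratch <- t
    current

printfn "%A" (pascalRow 4 6)   // row 4 of the triangle, padded: 1 4 6 4 1 0
```

The same small input can be run through the GPU path and compared element by element with this result.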

Using Binomial Trees and the BOPM

If you’ve ever wondered about a real-world application for the Pascal Triangle code in Example 9-32, you’ll enjoy this section. The Pascal Triangle code shows how to use the GPU to process a binomial-tree-like structure, in which each row is computed from the previous one. The Pascal Triangle is generated as shown in Figure 9-8.

Figure 9-8. Pascal Triangle processing

In finance, the binomial options pricing model (BOPM) uses a binomial tree to value options, including American options, which are exercisable at any time in a given time interval. The pricing model generates a binomial tree like the one in Figure 9-9.

Figure 9-9. Binomial tree from BOPM

In the preceding diagram, you can find the following relationship:

You can easily change the Pascal Triangle function to the BOPM, as shown in Example 9-33.

Example 9-33. The BOPM function

let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let bopm (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let u = 0.2f
    let d = 1.f / u

    let x = blockIdx.x
    if x = 0 then
        b.Set(a.[x] * u, x)
    else
        b.Set(a.[x - 1] * d, x)
    ()
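A CPU mirror of the kernel makes it easy to see what one step does to a row. This is a sketch under stated assumptions: bopmStep is a hypothetical name, the loop variable x stands in for blockIdx.x, and u is set to 0.5 here (instead of the 0.2 used in Example 9-33) only so that the float32 arithmetic is exact.

```fsharp
// CPU mirror of the bopm kernel in Example 9-33; x plays the role of blockIdx.x.
let bopmStep (u: float32) (a: float32[]) (b: float32[]) =
    let d = 1.f / u
    for x in 0 .. a.Length - 1 do
        b.[x] <- if x = 0 then a.[x] * u else a.[x - 1] * d

let a = [| 100.f; 0.f; 0.f |]
let b = Array.zeroCreate<float32> 3
bopmStep 0.5f a b
printfn "%A" b   // element 0 is a.[0] * u, element 1 is a.[0] * d
```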

Maximum Values in Subarrays

Processing an array is one scenario where a GPU can be of help. Because the GPU has dozens of processors, it can process the elements simultaneously. In this section, the task is to find the largest element in an array from a given starting point onward. The function takes the array and the starting point of the search and returns the largest element at or after that point.

Example 9-34 shows a GPU version of the algorithm, and the code is straightforward. Each block starts at index x, which is taken from blockIdx.x, and iterates to the end of the array. The maximum value is kept in the max variable and is then written to position x of the output array. Some functions in Example 9-34 are defined in code shown earlier in the chapter.

Example 9-34. Finding the maximum value in an array

let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

let input = [1.f .. 15.f] |> Array.ofList

[<ReflectedDefinition; GPU>]
let sample4 (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) : unit =
    let x = blockIdx.x
    let mutable max = 0.f
    for i = x to 15 do
        if max < a.[i] then
            max <- a.[i]
        else
            ()
    b.Set(max, x)

let WriteToFile2() =
    let a1 = <@@ sample4 @@>
    let b = getCUDACode(a1)    // defined in Listing 9-28
    let commonCode = getCommonCode()   // defined in Listing 9-28

    System.IO.File.Delete(tempFileName)  // defined in Listing 9-28
    System.IO.File.WriteAllText(tempFileName, commonCode + b);

let getMax() =
    let tempFileName = @".\temp.cu"
    WriteToFile2()
    let execution = new GPUExecution()
    let m = execution.Init(tempFileName)
    let output = Array.create input.Length 0.f
    let r, ps = execution.Execute("sample4", [input; output;])
    let results =
        ps
        |> List.map (fun p -> p.ToArray() |> snd)
    ()

Generated code in temp.cu

#include "CUDALibrary.h"


__device__ void ff_0(float* a, float* b) { int x;
x = blockIdx.x;
float max;
max = 0.0f;
for (int i = x; i<15; i++) { if ((max) < (a[i])) { max = a[i]; }
else {  } };
b[x] = max; }


extern "C" __global__ void sample4 (float* a, float* b)  {
ff_0(a, b);
}
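
A CPU reference is handy for checking what the kernel should write into the output array. The following is a minimal sketch under stated assumptions: suffixMax is a hypothetical helper name, plain arrays replace CUDAPointer2, and the loop bound comes from the array length instead of the constant 15.

```fsharp
// CPU reference for Example 9-34 (hypothetical helper name).
// result.[x] is the largest element of a.[x ..], mirroring one GPU block per x.
let suffixMax (a: float32[]) : float32[] =
    Array.init a.Length (fun x ->
        let mutable max = 0.f            // like the kernel, this assumes non-negative input
        for i = x to a.Length - 1 do     // F# ranges are inclusive, so stop at Length - 1
            if max < a.[i] then max <- a.[i]
        max)

let sample = [| 3.f; 1.f; 4.f; 1.f; 5.f; 2.f |]
printfn "%A" (suffixMax sample)
```

For the chapter's input of 1.f through 15.f, which is in ascending order, every entry of the result is 15.f. Note that the generated CUDA loop uses i < 15, whereas an F# loop written as "to 15" would also visit index 15 if it ran on the CPU.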

Using the Monte Carlo Simulation to Compute the π Value on a GPU

In addition to array processing, GPUs are useful for simulations. In this section, you take a look at a small application that calculates π by using a Monte Carlo simulation. The algorithm counts how many randomly generated points fall in each of two areas: a square and, within it, a circle that touches each edge of the square. Figure 9-10 shows the positions of the circle and the square. If the radius of the circle is r, the area of the circle is circleArea = πr² and the area of the square is squareArea = 4r², so circleArea / squareArea = π / 4. Imagine a large number of random hits spread uniformly over the square. The fraction of hits that land inside the circle approximates this ratio, so π is approximately 4 times the number of hits in the circle divided by the total number of hits in the square.

Figure 9-10. Position of the circle and rectangle in the Monte Carlo simulation for π

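The cuRAND library provides functions that generate uniform random numbers between 0 and 1. Because both coordinates of every point then fall in the unit square, you can replace the model shown in Figure 9-10 with a quarter of the circle, whose area still involves π. The diagram is shown in Figure 9-11. The formula to compute the π value from area1 and area2 is listed here, taking area1 to be the quarter circle and area2 to be the unit square, which matches the factor of 4 applied in Example 9-36:

π = 4 × area1 / area2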

Figure 9-11. Monte Carlo computation for π using the cuRAND library
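
Before moving the computation to the GPU, the quarter-circle estimate can be tried on the CPU. This is a minimal sketch, not the chapter's implementation: estimatePi is a hypothetical helper, and System.Random with a fixed seed stands in for the cuRAND generator.

```fsharp
// CPU sketch of the quarter-circle estimate (hypothetical helper; System.Random replaces cuRAND).
let estimatePi (seed: int) (samples: int) : float =
    let rng = System.Random(seed)
    let mutable hits = 0
    for i = 1 to samples do
        let x = rng.NextDouble()
        let y = rng.NextDouble()
        // The point lies inside the quarter circle when its distance from the origin is at most 1.
        if sqrt (x * x + y * y) <= 1.0 then hits <- hits + 1
    // quarter-circle area / unit-square area = pi / 4
    4.0 * float hits / float samples

printfn "%f" (estimatePi 42 100000)
```

The estimate is statistical, so it only approaches π as the sample count grows. With 100,000 samples the standard error is roughly 0.005.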

Before you can generate the code, the translator must know how to translate the sqrt function. Example 9-35 shows the pattern-match case that does this. Add it to the translateFromNETOperator function shown in Example 9-20.

Example 9-35. Translating the sqrt function

| "Sqrt" ->
    let l = getList()
    sprintf "sqrt(%s)" l.[0]
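
To see where a case like this fits, the following self-contained sketch translates a tiny quotation into a C-style expression string. It is a toy under stated assumptions, not the chapter's translateFromNETOperator: the translate function and the distance quotation are hypothetical, only three operators are handled, and the local list l merely plays the role that getList() appears to play in Example 9-35.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

// Toy translator (hypothetical; far smaller than the chapter's translator).
let rec translate (e: Expr) : string =
    match e with
    | Lambda (_, body) -> translate body          // skip the parameter list
    | Var v -> v.Name
    | Call (None, mi, args) ->
        let l = args |> List.map translate        // translated arguments
        match mi.Name with
        | "Sqrt" -> sprintf "sqrt(%s)" l.[0]
        | "op_Addition" -> sprintf "(%s) + (%s)" l.[0] l.[1]
        | "op_Multiply" -> sprintf "(%s) * (%s)" l.[0] l.[1]
        | name -> failwithf "unsupported call: %s" name
    | other -> failwithf "unsupported expression: %A" other

let distance = <@@ fun (a: float32) (b: float32) -> sqrt (a * a + b * b) @@>
printfn "%s" (translate distance)
```

The F# sqrt operator appears in a quotation as a call to a method named Sqrt, which is why matching on the method name is enough to emit the C function call.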

The GPU code is defined in the sample3 function in Example 9-36. The WriteToFile2 function is used to get the function quotation and translate it to CUDA code. The computation result is passed back from the GPU to the CPU, allowing filtering and counting to happen on the CPU.

Example 9-36. Computing π using the GPU

let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let sample3 (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) (c:CUDAPointer2<float32>) =
    let x = blockIdx.x
    c.Set(sqrt(a.[x] * a.[x] + b.[x] * b.[x]), x)
    ()

let WriteToFile2() =
    let a1 = <@@ sample3 @@>
    let b = getCUDACode(a1)  // defined in Listing 9-28
    let commonCode = getCommonCode()  // defined in Listing 9-28

    System.IO.File.WriteAllText(tempFileName, commonCode + b);

let computePI() =
    let len = 1000
    WriteToFile2()
    let execution = new GPUExecution()
    let r = execution.Init(tempFileName)

    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then
        let status, l0 = r.GenerateUniform(g, len)
        let status, l1 = r.GenerateUniform(g, len)

        let output = Array.create len 0.f
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l0), output)
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l1), output)
        let r, l2 = execution.Runtime.CopyHostToDevice(output)

        let r = execution.Runtime.ExecuteFunction("sample3", [l0; l1; l2], len, 1)
        let result, output = execution.Runtime.CopyDeviceToHost(l2, output)
        float ( output |> Seq.filter (fun n-> n<=1.f) |> Seq.length) / float len * 4.0
    else
        failwith "execution error"

Note

This program uses the cuRAND library. Some of the types it relies on, such as curandStatus, are defined in Example 9-9.

The generated code is shown in Example 9-37.

Example 9-37. Generated code after running Example 9-36

#include "CUDALibrary.h"

__device__ void ff_0(float* a, float* b, float* c) { int x;
x = blockIdx.x;
c[x] = sqrt(((a[x]) * (a[x])) + ((b[x]) * (b[x])));
; }


extern "C" __global__ void sample3 (float* a, float* b, float* c)  {
ff_0(a, b, c);
}

Generated PTX file after running the code from Example 9-28

    .version 1.4
    .target sm_10, map_f64_to_f32
    // compiled with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin/../open64/lib//be.exe
    // nvopencc 4.1 built on 2012-04-07

    //-----------------------------------------------------------
    // Compiling C:/Users/taliu/AppData/Local/Temp/tmpxft_00000e3c_00000000-11_temp.cpp3.i (C:/Users/taliu/AppData/Local/Temp/ccBI#.a02796)
    //-----------------------------------------------------------

    //-----------------------------------------------------------
    // Options:
    //-----------------------------------------------------------
    //  Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
    //  -O3    (Optimization level)
    //  -g0    (Debug level)
    //  -m2    (Report advisories)
    //-----------------------------------------------------------

    .file    1    "C:/Users/taliu/AppData/Local/Temp/tmpxft_00000e3c_00000000-10_temp.cudafe2.gpu"
    .file    2    "c:\program files (x86)\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h"
    .file    3    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin/../include\crt/device_runtime.h"
    .file    4    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin/../include\host_defines.h"
    .file    5    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin/../include\builtin_types.h"
    .file    6    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\device_types.h"
    .file    7    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\host_defines.h"
    .file    8    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\driver_types.h"
    .file    9    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\surface_types.h"
    .file    10    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\texture_types.h"
    .file    11    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\vector_types.h"
    .file    12    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\builtin_types.h"
    .file    13    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin/../include\device_launch_parameters.h"
    .file    14    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\crt\storage_class.h"
    .file    15    "temp.cu"
    .file    16    "c:\mycode\codecenter\f#\fsharpgpu\fsharpgpu\bin\debug\CUDALibrary.h"
    .file    17    "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.2\bin/../include\common_functions.h"
    .file    18    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\math_functions.h"
    .file    19    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\math_constants.h"
    .file    20    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\device_functions.h"
    .file    21    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\sm_11_atomic_functions.h"
    .file    22    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\sm_12_atomic_functions.h"
    .file    23    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\sm_13_double_functions.h"
    .file    24    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\sm_20_atomic_functions.h"
    .file    25    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\sm_20_intrinsics.h"
    .file    26    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\sm_30_intrinsics.h"
    .file    27    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\surface_functions.h"
    .file    28    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\texture_fetch_functions.h"
    .file    29    "c:\program files\nvidia gpu computing toolkit\cuda\v4.2\include\math_functions_dbl_ptx1.h"


    .entry sample3 (
        .param .u64 __cudaparm_sample3_a,
        .param .u64 __cudaparm_sample3_b,
        .param .u64 __cudaparm_sample3_c)
    {
    .reg .u32 %r<3>;
    .reg .u64 %rd<10>;
    .reg .f32 %f<7>;
    .loc    15    9    0
$LDWbegin_sample3:
    .loc    15    5    0
    cvt.s32.u16     %r1, %ctaid.x;
    cvt.s64.s32     %rd1, %r1;
    mul.wide.s32     %rd2, %r1, 4;
    ld.param.u64     %rd3, [__cudaparm_sample3_b];
    add.u64     %rd4, %rd3, %rd2;
    ld.param.u64     %rd5, [__cudaparm_sample3_a];
    add.u64     %rd6, %rd5, %rd2;
    ld.global.f32     %f1, [%rd4+0];
    ld.global.f32     %f2, [%rd6+0];
    mul.f32     %f3, %f1, %f1;
    mad.f32     %f4, %f2, %f2, %f3;
    sqrt.approx.f32     %f5, %f4;
    ld.param.u64     %rd7, [__cudaparm_sample3_c];
    add.u64     %rd8, %rd7, %rd2;
    st.global.f32     [%rd8+0], %f5;
    .loc    15    11    0
    exit;
$LDWend_sample3:
    } // sample3
    .global .u32 error;

The filtering step can be moved from the CPU to the GPU, as shown in Example 9-38. This new version performs the comparison inside the sample3 function, which is executed on the GPU. The result is an array of 1s and 0s, so the CPU side only has to sum the array elements.

Example 9-38. GPU function that performs checks and related CPU functions

let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let sample3 (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) (c:CUDAPointer2<float32>)=
    let x = blockIdx.x
    if sqrt(a.[x] * a.[x] + b.[x] * b.[x]) <= 1.f then
        c.Set(1.f, x)
    else
        c.Set(0.f, x)
    ()
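
The two counting strategies give the same hit count, which the following CPU-only sketch illustrates. The values in distances are made up for the illustration, and the names countByFilter, flags, and countBySum are hypothetical.

```fsharp
// Distances for four sample points (made-up values that are exact in float32).
let distances = [| 0.25f; 1.5f; 1.0f; 2.0f |]

// Example 9-36 style: the CPU filters the distances and counts the survivors.
let countByFilter =
    distances |> Seq.filter (fun n -> n <= 1.f) |> Seq.length

// Example 9-38 style: one 1.f or 0.f flag per point, so the CPU only has to sum.
let flags = distances |> Array.map (fun n -> if n <= 1.f then 1.f else 0.f)
let countBySum = flags |> Array.sum

printfn "%d %f" countByFilter countBySum
```

One caveat of summing float32 flags is precision: a float32 represents consecutive integers exactly only up to 16,777,216, so for very large sample counts an integer flag array or a wider accumulator would be a safer choice.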

computePI function

let computePI() =
    let len = 1000
    WriteToFile2()    // defined in Listing 9-34
    let execution = new GPUExecution()
    let r = execution.Init(tempFileName)

    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then //defined in Listing 9-9
        let status, l0 = r.GenerateUniform(g, len)
        let status, l1 = r.GenerateUniform(g, len)

        let output = Array.create len 0.f
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l0), output)
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l1), output)
        let r, l2 = execution.Runtime.CopyHostToDevice(output)

        let r = execution.Runtime.ExecuteFunction("sample3", [l0; l1; l2], len, 1)
        let result, output = execution.Runtime.CopyDeviceToHost(l2, output)
        float ( output |> Seq.sum) / float len * 4.0
    else
        failwith "execution error"

Now let’s examine the performance as the number of random data points increases. Example 9-39 executes the computePI function with different array lengths. The execution result shows that the execution time does not increase significantly even though the array length grows from 50 to 10,000. The exception is the first run, which is much slower because it performs a few one-time initialization operations.

Example 9-39. Measuring the GPU performance

let computePI() =
    WriteToFile2()    // defined in Listing 9-34
    let execution = new GPUExecution()
    let r = execution.Init(tempFileName)

    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    let sw = System.Diagnostics.Stopwatch()

    if status = curandStatus.CURAND_SUCCESS then // curandStatus is defined in Listing 9-9
        let compute(len) =
            sw.Reset()
            sw.Start()
            let status, l0 = r.GenerateUniform(g, len)
            let status, l1 = r.GenerateUniform(g, len)

            let output = Array.create len 0.f
            let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l0), output)
            let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l1), output)
            let r, l2 = execution.Runtime.CopyHostToDevice(output)

            let r = execution.Runtime.ExecuteFunction("sample3", [l0; l1; l2], len, 1)
            let result, output = execution.Runtime.CopyDeviceToHost(l2, output)
            let pi = float ( output |> Seq.sum) / float len * 4.0
            sw.Stop()
            pi, sw.ElapsedTicks

        [50; 100; 500; 1000; 5000; 10000]
        |> Seq.map compute
    else
        failwith "execution error"

Execution result

(3.04, 142194L)
(2.84, 1643L)
(3.152, 3595L)
(3.144, 3369L)
(3.1632, 4685L)
(3.1588, 3511L)

Note

Depending on your graphics card, you might run into errors when you pass large values to the computePI function.

Useful Resources

If your requirements involve matrix manipulation or linear algebra, Statfactory’s FCore numerical library (http://www.statfactory.co.uk/) is a good choice. This library provides GPGPU-based matrix, linear algebra, and random number generation functions.

In addition to the Statfactory library, the following websites are good resources for more information:

General-Purpose Computation on Graphics Hardware (http://gpgpu.org/developer/cuda)
CUDA Zone (https://developer.nvidia.com/category/zone/cuda-zone)
OpenCL on NVIDIA (https://developer.nvidia.com/opencl)
The Khronos Group (http://www.khronos.org/opencl/)

In Closing

For a C# developer, functional programming might not be a familiar concept. Chapters 1 through 3 introduced the imperative and object-oriented (OO) features of F#. If you are planning to use F# in your project, you do not have to dedicate three months to learning a new language from scratch. Instead, you can start by implementing some components with the material presented in Chapters 1 through 3. Chapters 4 through 6 introduced some features unique to F#, such as type providers. Chapters 7 through 9 introduced a few F# applications, which demonstrate how to solve complex problems by using the features introduced in the earlier chapters.

Functional programming is not a silver bullet. F# is a language that provides both functional and OO features. Having knowledge in these two areas is a perfect complementary skill set for the C# developer to solve daily programming tasks more efficiently. For example, the LINQ feature in C#, which is a functional programming concept, increasingly attracts developer interest and dramatically changes the way developers write code. A number of problems are solved more naturally by applying these functional programming concepts. If you are curious and motivated to explore a new way to talk to the computer, F# is a great candidate for further exploration.
