It’s easy and natural to write a math formula in F#, and F# is widely used in scientific computation applications, such as those used by trading companies and investment banks. Applications can benefit greatly from being able to directly leverage the hardware, but providing this ability can be a challenge for many .NET languages. The common way for .NET to directly access hardware is by using .NET interop. F# has another weapon called quotations. A quotation shows the program structure and can be used to translate F# code. In this chapter, I will demonstrate how to use quotations to translate F# code to GPU code.
The graphics processing unit (GPU) chip on a graphics card can do something other than render an image for you. Most developers get excited when their machine has four or eight processors, but that is a drop in the bucket when compared to what the GPU offers. GPUs generally have tens and even hundreds of processors, which often sit in an idle state. A general-purpose GPU (GPGPU) takes advantage of these processors and extends them to more general-purpose applications. GPGPUs are not focused on rendering images; they are designed to use a large number of processors to perform parallel actions or computations.
In this chapter, I will describe how to use .NET interop and quotations to directly access hardware. F# does not have built-in support for the GPU, so an F# library that translates quotations into code that can be executed on a GPU is needed. In addition to the F# translation library, some small samples are listed to show how to leverage the GPU to perform parallel computations.
One of the major F# application areas is financial services. A fundamental problem that financial businesses try to solve is how to perform mathematical computations in real time. From previous chapters, you know that F# is a perfect candidate for implementing mathematical functions, and GPGPUs include a mechanism to provide real-time computations. Adding a working knowledge of GPGPUs to your skill set can help you solve these kinds of real-world problems.
This chapter is about how to use F# to leverage GPGPU to perform computations on the GPU. The code can be downloaded from the F# sample pack (http://fsharp3sample.codeplex.com). The code is located in the OtherSamples folder.
According to Wikipedia, the definition for GPU is the following.
A graphics processing unit (GPU), also occasionally called visual processing unit (VPU), is a specialized electronic circuit designed to rapidly manipulate and alter memory to accelerate the building of images in a frame buffer intended for output to a display. GPUs are used in embedded systems, mobile phones, personal computers, workstations, and game consoles. Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel. In a personal computer, a GPU can be present on a video card, or it can be on the motherboard or—in certain CPUs—on the CPU die. More than 90% of new desktop and notebook computers have integrated GPUs, which are usually far less powerful than those on a dedicated video card.
For many people, the GPU is most relevant to computer games. However, the GPU has a number of other uses. Do you know that Amazon’s GPU cluster (http://aws.amazon.com/ec2/instance-types/) can be used to enable high performance computations in the cloud? Maybe the application on your mobile phone or iPad is powered by the GPU running on a cloud cluster. Do you know that investment firms and brokerages are implementing GPU programming in their computation platforms to allow applications to quickly determine when to buy or sell stock? Any software developer working on these applications will have a lucrative career. Now might be a good time to take a second look at the small chip that has been ignored for a long time.
Wikipedia defines GPGPU as follows:
General-purpose computing on graphics processing units (GPGPU, GPGP or less often GP2U) is the means of using a graphics processing unit (GPU), which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU). Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing.
There are several GPU solutions on the market. OpenCL is the currently dominant open, general-purpose GPU computing language. The dominant proprietary framework is NVIDIA’s CUDA. In this chapter, I use CUDA as the target framework.
According to NVIDIA, the definition of CUDA is as follows:
CUDA is a parallel computing platform and programming model invented by NVIDIA. It enables dramatic increases in computing performance by harnessing the power of the graphics processing unit (GPU).
CUDA provides a GPGPU platform including drivers, a development SDK, and an execution environment. In this chapter, CUDA is used for all samples. To run the samples from this chapter, you should install an NVIDIA graphics card and download the CUDA SDK from http://developer.nvidia.com/cuda/cuda-downloads. There are three installation packages:
CUDA Toolkit
Graphics drivers
Development SDK
After you successfully install these three packages, the CUDA development environment is ready to use. The installation package creates several environment variables, as shown in Figure 9-1. And from the command line, you can execute the NVCC command, as shown in Figure 9-2.
The first task is to retrieve the graphics card property using F# interop. You need to define the struct that holds the graphics card information. Example 9-1 defines the CUDA structure in F#. cudaError is an enum structure that defines all CUDA error codes. SizeT is a struct wrapper for the IntPtr type. The CUDADeviceProp structure defines the graphics card properties. For this sample, there is a Name property that wraps the device property’s name value.
module CUDADataStructure

open System
open System.Runtime.InteropServices

// CUDA error enumeration
type cudaError =
    | cudaErrorAddressOfConstant = 22
    | cudaErrorApiFailureBase = 10000
    | cudaErrorCudartUnloading = 29
    | cudaErrorInitializationError = 3
    | cudaErrorInsufficientDriver = 35
    | cudaErrorInvalidChannelDescriptor = 20
    | cudaErrorInvalidConfiguration = 9
    | cudaErrorInvalidDevice = 10
    | cudaErrorInvalidDeviceFunction = 8
    | cudaErrorInvalidDevicePointer = 17
    | cudaErrorInvalidFilterSetting = 26
    | cudaErrorInvalidHostPointer = 16
    | cudaErrorInvalidMemcpyDirection = 21
    | cudaErrorInvalidNormSetting = 27
    | cudaErrorInvalidPitchValue = 12
    | cudaErrorInvalidResourceHandle = 33
    | cudaErrorInvalidSymbol = 13
    | cudaErrorInvalidTexture = 18
    | cudaErrorInvalidTextureBinding = 19
    | cudaErrorInvalidValue = 11
    | cudaErrorLaunchFailure = 4
    | cudaErrorLaunchOutOfResources = 7
    | cudaErrorLaunchTimeout = 6
    | cudaErrorMapBufferObjectFailed = 14
    | cudaErrorMemoryAllocation = 2
    | cudaErrorMemoryValueTooLarge = 32
    | cudaErrorMissingConfiguration = 1
    | cudaErrorMixedDeviceExecution = 28
    | cudaErrorNoDevice = 37
    | cudaErrorNotReady = 34
    | cudaErrorNotYetImplemented = 31
    | cudaErrorPriorLaunchFailure = 5
    | cudaErrorSetOnActiveProcess = 36
    | cudaErrorStartupFailure = 127
    | cudaErrorSynchronizationError = 25
    | cudaErrorTextureFetchFailed = 23
    | cudaErrorTextureNotBound = 24
    | cudaErrorUnknown = 30
    | cudaErrorUnmapBufferObjectFailed = 15
    | cudaErrorIncompatibleDriverContext = 49
    | cudaSuccess = 0

// Wrapper for the native size_t type
[<Struct>]
type SizeT =
    val value : IntPtr
    new (n:int) = { value = IntPtr(n) }
    new (n:int64) = { value = IntPtr(n) }

// Graphics card properties; the field order must match the native structure
[<Struct>]
type CUDADeviceProp =
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 256)>]
    val nameChar : char array
    val totalGlobalMem : SizeT
    val sharedMemPerBlock : SizeT
    val regsPerBlock : int
    val warpSize : int
    val memPitch : SizeT
    val maxThreadsPerBlock : int
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxThreadsDim : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxGridSize : int array
    val clockRate : int
    val totalConstMem : SizeT
    val major : int
    val minor : int
    val textureAlignment: SizeT
    val deviceOverlap : int
    val multiProcessorCount : int
    val kernelExecTimeoutEnabled : int
    val integrated : int
    val canMapHostMemory : int
    val computeMode : int
    val maxTexture1D : int
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 2)>]
    val maxTexture2D : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxTexture3D : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 2)>]
    val maxTexture1DLayered : int array
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxTexture2DLayered : int array
    val surfaceAlignment : SizeT
    val concurrentKernels : int
    val ECCEnabled : int
    val pciBusID : int
    val pciDeviceID : int
    val pciDomainID : int
    val tccDriver : int
    val asyncEngineCount : int
    val unifiedAddressing : int
    val memoryClockRate : int
    val memoryBusWidth : int
    val l2CacheSize : int
    val maxThreadsPerMultiProcessor : int
    // Device name with the trailing null characters removed
    member this.Name = String(this.nameChar).Trim('\000')
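To see what the MarshalAs attributes in a structure like this do, it can help to measure a much smaller struct. The following is a minimal sketch, not part of the sample pack: MiniProp is a hypothetical two-field type that I made up for illustration, and it only borrows two field names from CUDADeviceProp.

```fsharp
open System.Runtime.InteropServices

// Hypothetical two-field struct, used only to illustrate the layout rules.
[<Struct; StructLayout(LayoutKind.Sequential)>]
type MiniProp =
    val warpSize : int
    // ByValArray embeds the three ints inside the struct itself
    // instead of marshalling a reference to a separate array.
    [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 3)>]
    val maxThreadsDim : int array

// Unmanaged size: 4 bytes for warpSize plus 3 * 4 bytes for the inline array.
let miniPropSize = Marshal.SizeOf(typeof<MiniProp>)
printfn "MiniProp occupies %d unmanaged bytes" miniPropSize
```

If a SizeConst value or a field type does not match the native header, the measured size changes and every later field is read from the wrong offset, so comparing sizes in this way is a cheap sanity check.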
The interop code from the F# side is simple. The CUDA binary has 32-bit and 64-bit versions. Because of this, you define two modules, named CUDA32 and CUDA64. The CudaRT files are located under the CUDA installation folder. If the default installation path is used, the path should look like C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.2\C\common\bin. The interop code is shown in Example 9-2.
open System
open System.Runtime.InteropServices
open CUDADataStructure

module CUDA32 =
    [<Literal>]
    let dllName = "cudart32_42_9"

    [<DllImport(dllName)>]
    extern cudaError cudaGetDeviceProperties(CUDADeviceProp& prop, int device)

    [<DllImport(dllName)>]
    extern cudaError cudaDeviceGetLimit(SizeT& pSize, cudaLimit limit)

module CUDA64 =
    [<Literal>]
    let dllName = "cudart64_42_9"

    [<DllImport(dllName)>]
    extern cudaError cudaGetDeviceProperties(CUDADeviceProp& prop, int device)

    [<DllImport(dllName)>]
    extern cudaError cudaDeviceGetLimit(SizeT& pSize, cudaLimit limit)

[<EntryPoint>]
let main argv =
    let mutable prop = CUDADeviceProp()
    // get the first graphic card by passing in 0
    let returnCode = CUDA32.cudaGetDeviceProperties (&prop, 0)
    printfn "%A - %A" returnCode prop.Name
    ignore <| System.Console.ReadKey()
    0 // return an integer exit code
The constant value cudart64_42_9 could be different depending on the CUDA SDK installed on your computer. The file name shown in Example 9-2 is based on CUDA 4.2. For the CUDA 4.0 SDK, the DLL file name is something like cudart32_40_17.
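Because the file name depends on both the process bitness and the SDK release, it can be handy to compute the expected name and check that the file is present before making the first interop call. Here is a minimal sketch under that assumption; cudartDllName and its versionTag parameter are hypothetical names of my own, not part of CUDA or of the sample pack.

```fsharp
open System

// Hypothetical helper: builds a CUDA runtime DLL name such as "cudart32_42_9".
// `versionTag` is the part that changes between SDK releases ("42_9", "40_17", ...).
let cudartDllName (is64Bit: bool) (versionTag: string) =
    let bits = if is64Bit then 64 else 32
    sprintf "cudart%d_%s" bits versionTag

// IntPtr.Size is 8 in a 64-bit process and 4 in a 32-bit process.
let currentProcessIs64Bit = IntPtr.Size = 8
printfn "%s" (cudartDllName currentProcessIs64Bit "42_9")
```

Note that DllImport needs a compile-time literal, so a computed name like this is only useful for diagnostics. The actual calls still go through the CUDA32 or CUDA64 module.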
The F# extern method requires an ampersand (&) to pass a variable reference.
The execution result is shown in Figure 9-3.
The graphics card has some limitations, such as stack size. You need to know these limitations because your future coding needs to take these into consideration. The limitations are defined in our F# code as shown in Example 9-3.
type cudaLimit =
    | cudaLimitStackSize = 0
    | cudaLimitPrintfFifoSize = 1
    | cudaLimitMallocHeapSize = 2
The interop code needed to get the device limitation information is shown in Example 9-4, and the execution result is displayed in Figure 9-4. For most development activities, it is uncommon to have to query your hardware’s limitations. This is because the operating system handles these limitations for you. However, this is not the case when programming against a GPU. For any code executed on the GPU, it is the developer’s responsibility to understand and respect these limitations.
let limitCategories =
    [ cudaLimit.cudaLimitStackSize
      cudaLimit.cudaLimitPrintfFifoSize
      cudaLimit.cudaLimitMallocHeapSize ]

limitCategories
|> Seq.iter (fun category ->
    let mutable limit = SizeT()
    let returnCode = CUDA32.cudaDeviceGetLimit(&limit, category)
    printfn "%A: %A - %A" returnCode category limit.value)
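Every runtime call returns a status code that is easy to ignore by accident. One possible way to make the check explicit is to fold the code and the value into a single Result, as in the following sketch. It assumes F# 4.1 or later for the Result type; toResult and fakeGetStackLimit are hypothetical helpers, and the stand-in function returns an arbitrary number so that the sketch runs without a GPU.

```fsharp
// A three-case subset of the cudaError enumeration from Example 9-1,
// repeated here so that the sketch compiles on its own.
type cudaError =
    | cudaSuccess = 0
    | cudaErrorMemoryAllocation = 2
    | cudaErrorInvalidDevice = 10

// Hypothetical helper: turns a status code plus a payload into a Result.
let toResult (code: cudaError) (value: 'T) : Result<'T, cudaError> =
    if code = cudaError.cudaSuccess then Ok value else Error code

// Stand-in for a native call (no GPU needed): only device 0 "exists",
// and 1024 is an arbitrary placeholder, not a real device limit.
let fakeGetStackLimit (device: int) =
    if device = 0 then (cudaError.cudaSuccess, 1024n)
    else (cudaError.cudaErrorInvalidDevice, 0n)

for device in [ 0; 1 ] do
    let code, limit = fakeGetStackLimit device
    match toResult code limit with
    | Ok bytes -> printfn "device %d: stack limit %d bytes" device bytes
    | Error e -> printfn "device %d: failed with %A" device e
```

With a wrapper like this, the value cannot be read without also deciding what to do when the call fails.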
For device management, there are other commonly used functions (which are shown in Example 9-5):
Reset device function. This function cleans up all resources on the current device associated with the current process. If there are multiple threads on the current process, it is the developer’s responsibility to make sure the other thread or threads do not access this device.
Get device count function. This function returns the number of devices.
Set device flag function. This is useful for allowing you to set flags to configure how the GPU and CPU work together.
When the GPU is processing the data, there is a CPU thread waiting for the result to come back. The following flags configure how the CPU thread waits for the return value:
cudaDeviceScheduleAuto is used as a default value. If the number of active CUDA contexts is larger than the number of logical processors, the thread can yield to other operating system threads. Otherwise, CUDA will spin on the processor and will not yield to other threads.
cudaDeviceScheduleSpin instructs the CUDA thread on the CPU to never yield to other threads until it gets the result from the device. This can increase the CPU latency, but it can decrease the latency when waiting for results from the device.
cudaDeviceScheduleYield is the opposite of cudaDeviceScheduleSpin. It instructs the CUDA thread to yield to other operating system threads on the CPU when waiting for the result from the device.
cudaDeviceScheduleBlockingSync instructs the CUDA thread on the CPU to block on a synchronization primitive while waiting for the result from the device.
Code definition
type cudaDeviceFlag =
    | cudaDeviceScheduleAuto = 0
    | cudaDeviceScheduleSpin = 1
    | cudaDeviceScheduleYield = 2
    | cudaDeviceScheduleBlockingSync = 4

[<DllImport(dllName)>]
extern cudaError cudaDeviceReset()

[<DllImport(dllName)>]
extern cudaError cudaGetDeviceCount(int& count)

[<DllImport(dllName)>]
extern cudaError cudaSetDeviceFlags(cudaDeviceFlag count)
let mutable deviceCount = 0
let returnCode = CUDA32.cudaGetDeviceCount(&deviceCount)
printfn "%A: %A" returnCode deviceCount

let returnCode = CUDA32.cudaSetDeviceFlags(cudaDeviceFlag.cudaDeviceScheduleAuto)
printfn "set flag return code %A" returnCode

let returnCode = CUDA32.cudaDeviceReset()
printfn "reset device %A" returnCode
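The rule behind the automatic scheduling flag can be written down in a couple of lines. The following sketch is only a model of the rule as described for cudaDeviceScheduleAuto. WaitStrategy and resolveAuto are hypothetical names and do not exist in CUDA.

```fsharp
// Hypothetical model of the wait strategies; these are not CUDA types.
type WaitStrategy =
    | Spin    // busy-wait on the CPU until the device finishes
    | Yield   // let other operating system threads run while waiting

// The rule described for cudaDeviceScheduleAuto: yield only when there are
// more active CUDA contexts than logical processors.
let resolveAuto (activeContexts: int) (logicalProcessors: int) =
    if activeContexts > logicalProcessors then Yield else Spin

printfn "%A" (resolveAuto 1 System.Environment.ProcessorCount)
printfn "%A" (resolveAuto 16 4)
```

Seeing the rule in this form makes it clear why an explicit flag is useful: on a machine with few cores and many contexts, the automatic choice trades device latency for CPU availability.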
In addition to the hardware information, the driver information is also important, because hardware functionalities are exposed by the device driver. Different hardware exposes different APIs and has different memory capacity. The managed language developer does not have to consider the memory because the operating system and .NET Framework take care of memory management. However, the developer does need to keep the hardware configuration in mind when programming for the GPU. Therefore, it is important to know the current driver version. Example 9-6 shows how to get the driver version.
type CUResult =
    | Success = 0
    | ErrorInvalidValue = 1
    | ErrorOutOfMemory = 2
    | ErrorNotInitialized = 3
    | ErrorDeinitialized = 4
    | ErrorNoDevice = 100
    | ErrorInvalidDevice = 101
    | ECCUncorrectable = 214
    | ErrorAlreadyAcquired = 210
    | ErrorAlreadyMapped = 208
    | ErrorArrayIsMapped = 207
    | ErrorContextAlreadyCurrent = 202
    | ErrorFileNotFound = 301
    | ErrorInvalidImage = 200
    | ErrorInvalidContext = 201
    | ErrorInvalidHandle = 400
    | ErrorInvalidSource = 300
    | ErrorLaunchFailed = 700
    | ErrorLaunchIncompatibleTexturing = 703
    | ErrorLaunchOutOfResources = 701
    | ErrorLaunchTimeout = 702
    | ErrorMapFailed = 205
    | ErrorNoBinaryForGPU = 209
    | ErrorNotFound = 500
    | ErrorNotMapped = 211
    | ErrorNotReady = 600
    | ErrorUnmapFailed = 206
    | NotMappedAsArray = 212
    | NotMappedAsPointer = 213
    | PointerIs64Bit = 800
    | SizeIs64Bit = 801
    | ErrorUnknown = 999

module InteropLibrary =
    [<DllImport("nvcuda")>]
    extern CUResult cuDriverGetVersion(int& driverVersion)

type CUDADriver() =
    member this.Version =
        let mutable version = 0
        (InteropLibrary.cuDriverGetVersion(&version), version)
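The version comes back as a single integer. As far as I know, CUDA encodes it as 1000 times the major version plus 10 times the minor version, so 4020 stands for version 4.2. Under that assumption, a small hypothetical helper (decodeDriverVersion is my own name) can split the number for display:

```fsharp
// Hypothetical helper: splits the integer reported by cuDriverGetVersion
// into (major, minor), assuming the 1000 * major + 10 * minor encoding.
let decodeDriverVersion (version: int) =
    (version / 1000, (version % 1000) / 10)

let major, minor = decodeDriverVersion 4020
printfn "CUDA driver API version %d.%d" major minor
```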
There is a cudaError enumeration defined in Example 9-1. The CUResult enumeration in Example 9-6 seems to serve the same purpose. However, CUResult mirrors the error codes of the CUDA driver API, while cudaError is part of the CUDA runtime API. The runtime API is built on top of the driver API. Programming against the runtime API is simpler, but the driver API provides better control over the device. There is no performance difference between these two types of APIs. It is recommended that you refrain from combining these two kinds of APIs in one application. Functions in the driver API are prefixed with cu, while functions in the runtime API are prefixed with cuda.
To make the GPU compute the data, the data should first be loaded into memory. The computations happen within the GPU’s memory. You refer to the GPU as device and the CPU as host. The CPU memory is called the host memory, and the GPU memory is called the device memory. The memory-management function is defined in Example 9-7. The enum type CUDAMemcpyKind defines four types of memory copy operations.
namespace CUDARuntime

open System
open System.Text
open System.Collections.Generic
open System.Runtime.InteropServices
open CUDADataStructure

type CUDAMemcpyKind =
    | cudaMemcpyHostToHost = 0
    | cudaMemcpyHostToDevice = 1
    | cudaMemcpyDeviceToHost = 2
    | cudaMemcpyDeviceToDevice = 3

module CUDARuntime64 =
    [<Literal>]
    let dllName = "cudart64_40_17"

    [<DllImport(dllName)>]
    extern cudaError cudaMemcpy(IntPtr dst, IntPtr src, SizeT count, CUDAMemcpyKind kind)

    [<DllImport(dllName)>]
    extern cudaError cudaMalloc(IntPtr& p, SizeT size)

    // cudaMemset takes the device pointer by value and a size_t byte count
    [<DllImport(dllName)>]
    extern cudaError cudaMemset(IntPtr p, int value, SizeT count)

module CUDARuntime32 =
    [<Literal>]
    let dllName = "cudart32_40_17"

    [<DllImport(dllName)>]
    extern cudaError cudaMemcpy(IntPtr dst, IntPtr src, SizeT count, CUDAMemcpyKind kind)

    [<DllImport(dllName)>]
    extern cudaError cudaMalloc(IntPtr& p, SizeT size)

    // cudaMemset takes the device pointer by value and a size_t byte count
    [<DllImport(dllName)>]
    extern cudaError cudaMemset(IntPtr p, int value, SizeT count)
Example 9-8 shows how to copy data between the host and device memory, and its result is shown in Figure 9-5. Because memory copying is a bottleneck for GPGPU computations, it is not a good practice to copy data back and forth between host memory and device memory more often than necessary. The best practice is to keep the data in device memory as long as possible.
open System
open System.Runtime.InteropServices
open CUDADataStructure
open CUDARuntime

let test5() =
    // Pin a managed array so that its address stays valid during the native calls
    let pin (arr: float32[]) = GCHandle.Alloc(arr, GCHandleType.Pinned)
    let mutable ptr = IntPtr()
    let arr = [|1.f; 2.f; 3.f; 4.f; 5.f; 1.f; 2.f; 3.f; 4.f; 5.f|]
    let arr2 = [|11.f; 12.f; 13.f; 14.f; 15.f; 11.f; 12.f; 13.f; 14.f; 15.f|]
    let handle = pin arr
    let handle2 = pin arr2
    try
        let intptr = handle.AddrOfPinnedObject()
        let intptr2 = handle2.AddrOfPinnedObject()
        // The arrays hold float32 values (4 bytes each); an F# float is 8 bytes
        let size = arr.Length * sizeof<float32>
        let error = CUDARuntime32.cudaMalloc(&ptr, SizeT(size))
        // host -> device, then device -> host into the second array
        let error = CUDARuntime32.cudaMemcpy(ptr, intptr, SizeT(size), CUDAMemcpyKind.cudaMemcpyHostToDevice)
        let error = CUDARuntime32.cudaMemcpy(intptr2, ptr, SizeT(size), CUDAMemcpyKind.cudaMemcpyDeviceToHost)
        printfn "%A - %A" arr arr2
    finally
        handle.Free()
        handle2.Free()
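The byte counts passed to cudaMalloc and cudaMemcpy depend on the element type, and this is an easy place to slip because the F# float type is a 64-bit double while float32 is the 32-bit type. The following sketch runs without a GPU and simply prints the sizes. byteCount is a hypothetical helper of my own.

```fsharp
open System.Runtime.InteropServices

// Hypothetical helper: number of bytes occupied by the elements of an array.
let byteCount (arr: 'T[]) = arr.Length * Marshal.SizeOf(typeof<'T>)

let singles = [| 1.f; 2.f; 3.f; 4.f; 5.f |]   // float32: 4 bytes each
let doubles = [| 1.0; 2.0; 3.0; 4.0; 5.0 |]   // float (double): 8 bytes each

printfn "float32 array: %d bytes" (byteCount singles)
printfn "float array:   %d bytes" (byteCount doubles)
printfn "sizeof<float> = %d, sizeof<float32> = %d" sizeof<float> sizeof<float32>
```

Deriving the size from the array's own element type, rather than typing the type name a second time, keeps the allocation and the copy in agreement.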
This section demonstrated how to configure the device and copy data to or from the device. There are several CUDA libraries provided by NVIDIA. The CUDA runtime and CUDA driver API provide the basic ability to program against the GPU. In addition, CUDA provides a rich set of libraries to boost this development. In the next section, I am going to demonstrate how to use F# to invoke these libraries and perform basic computations on the GPU.
Some libraries, collectively known as the NVIDIA CUDA Toolkit, are provided with CUDA. According to the NVIDIA website (http://developer.nvidia.com/cuda/cuda-toolkit), the NVIDIA CUDA Toolkit provides a comprehensive development environment for C and C++ developers building GPU-accelerated applications. The CUDA Toolkit includes a compiler for NVIDIA GPUs, math libraries, and tools for debugging and optimizing the performance of your applications. You’ll also find programming guides, user manuals, an API reference, and other documentation to help you get started with accelerating your application with the GPU.
In this section, two libraries—cuRAND and cuBLAS—are used to demonstrate how to use PInvoke to invoke CUDA functions from F#.
The first library presented is the NVIDIA CUDA Random Number Generation library (cuRAND). According to the NVIDIA website (http://developer.nvidia.com/cuda/curand), cuRAND delivers high-performance, GPU-accelerated random number generation (RNG). The cuRAND library delivers high-quality random numbers using hundreds of processor cores available in the NVIDIA GPU. A random number generator is a basic building block for simulations such as Monte Carlo simulations, so faster random number generation can improve the performance of the simulation as a whole.
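To make the Monte Carlo connection concrete, here is a minimal CPU-only sketch that estimates pi from uniform random numbers. It uses System.Random as a stand-in for a GPU generator, so it shows the kind of workload that consumes random numbers, not cuRAND itself. estimatePi is a hypothetical name.

```fsharp
// CPU stand-in for a GPU generator: System.Random supplies the uniform samples.
let estimatePi (seed: int) (samples: int) =
    let rng = System.Random(seed)
    let mutable inside = 0
    for _ in 1 .. samples do
        let x = rng.NextDouble()
        let y = rng.NextDouble()
        // Count the points that fall inside the unit quarter circle.
        if x * x + y * y <= 1.0 then inside <- inside + 1
    // The quarter circle covers pi / 4 of the unit square.
    4.0 * float inside / float samples

printfn "pi is approximately %f" (estimatePi 42 100000)
```

The loop draws two random numbers per sample and does almost no other work, which is why the speed of the generator dominates the running time of simulations like this one.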
There are several enumeration structures defined for the CUDA library. The RandGenerator structure, shown in Example 9-9, is used to reference the random generator.
type curandStatus =
    | CURAND_SUCCESS = 0
    | CURAND_VERSION_MISMATCH = 100
    | CURAND_NOT_INITIALIZED = 101
    | CURAND_ALLOCATION_FAILED = 102
    | CURAND_TYPE_ERROR = 103
    | CURAND_OUT_OF_RANGE = 104
    | CURAND_LENGTH_NOT_MULTIPLE = 105
    | CURAND_LAUNCH_FAILURE = 201
    | CURAND_PREEXISTING_FAILURE = 202
    | CURAND_INITIALIZATION_FAILED = 203
    | CURAND_ARCH_MISMATCH = 204
    | CURAND_INTERNAL_ERROR = 999

type CUDARandomRngType =
    | CURAND_TEST = 0
    | CURAND_PSEUDO_DEFAULT = 100
    | CURAND_PSEUDO_XORWOW = 101
    | CURAND_QUASI_DEFAULT = 200
    | CURAND_QUASI_SOBOL32 = 201
    | CURAND_QUASI_SCRAMBLED_SOBOL32 = 202
    | CURAND_QUASI_SOBOL64 = 203
    | CURAND_QUASI_SCRAMBLED_SOBOL64 = 204

type CUDARandomOrdering =
    | CURAND_PSEUDO_BEST = 100
    | CURAND_PSEUDO_DEFAULT = 101
    | CURAND_PSEUDO_SEEDED = 102
    | CURAND_QUASI_DEFAULT = 201

type CUDADirectionVectorSet =
    | CURAND_VECTORS_32_JOEKUO6 = 101
    | CURAND_DIRECTION_VECTORS_32_JOEKUO6 = 102
    | CURAND_VECTORS_64_JOEKUO6 = 103
    | CURAND_DIRECTION_VECTORS_64_JOEKUO6 = 104

[<Struct>]
type RandGenerator =
    // The native generator handle is pointer sized, so nativeint is used
    // instead of uint32 to keep the struct valid in 64-bit processes
    val handle : nativeint
Because the library has both an x86 and x64 flavor, the F# cuRAND library needs to provide two versions as well, as shown in Example 9-10. The CUDAPointer struct wraps the pointer to the data stored in the GPU memory.
open System open System.Text open System.Collections.Generic open System.Runtime.InteropServices open CUDADataStructure open CUDARuntime [<Struct>] type CUDAPointer = val Pointer : IntPtr new(ptr) = { Pointer = ptr } new(cudaPointer:CUDAPointer) = { Pointer = cudaPointer.Pointer } member this.PointerSize with get() = IntPtr.Size [<Struct>] type RandDirectionVectors32 = [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 32)>] val direction_vectors : uint32[] [<Struct>] type RandDirectionVectors64 = [<MarshalAs(UnmanagedType.ByValArray, SizeConst = 64)>] val direction_vectors : uint64[] // CUDA random generator x86 version module CUDARandomDriver32 = [<Literal>] let dllName = "curand32_40_17" [<DllImport(dllName)>] extern curandStatus curandCreateGenerator(RandGenerator& generator, CUDARandomRngType rng_type) [<DllImport(dllName)>] extern curandStatus curandCreateGeneratorHost(RandGenerator& generator, CUDARandomRngType rng_type) [<DllImport(dllName)>] extern curandStatus curandDestroyGenerator(RandGenerator generator) [<DllImport(dllName)>] extern curandStatus curandGenerate(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGenerateLogNormal(RandGenerator generator, IntPtr outputPtr, SizeT n, float mean, float stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateLogNormalDouble(RandGenerator generator, IntPtr outputPtr, SizeT n, double mean, double stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateLongLong(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGenerateNormal(RandGenerator generator, IntPtr outputPtr, SizeT n, float mean, float stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateNormalDouble(RandGenerator generator, IntPtr outputPtr, SizeT n, double mean, double stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateSeeds(RandGenerator generator) [<DllImport(dllName)>] extern curandStatus 
curandGenerateUniform(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGenerateUniformDouble(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGetDirectionVectors32(RandDirectionVectors32& vectors, CUDADirectionVectorSet set) [<DllImport(dllName)>] extern curandStatus curandGetDirectionVectors64(RandDirectionVectors64& vectors, CUDADirectionVectorSet set) [<DllImport(dllName)>] extern curandStatus curandGetScrambleConstants32(IntPtr& constants) [<DllImport(dllName)>] extern curandStatus curandGetScrambleConstants64(IntPtr& constants) [<DllImport(dllName)>] extern curandStatus curandGetVersion(int& version) [<DllImport(dllName)>] extern curandStatus curandSetGeneratorOffset(RandGenerator generator, uint64 offset) [<DllImport(dllName)>] extern curandStatus curandSetGeneratorOrdering(RandGenerator generator, CUDARandomOrdering order) [<DllImport(dllName)>] extern curandStatus curandSetPseudoRandomGeneratorSeed(RandGenerator generator, uint64 seed) [<DllImport(dllName)>] extern curandStatus curandSetQuasiRandomGeneratorDimensions(RandGenerator generator, uint32 num_ dimensions) [<DllImport(dllName)>] extern curandStatus curandSetStream(RandGenerator generator, CUDAStream stream) let CreateGenerator(rng_type) = let mutable generator = Unchecked.defaultof<RandGenerator> let r = curandCreateGenerator(&generator, rng_type) (r, generator) let DestroyGenerator(generator) = curandDestroyGenerator(generator) let SetPseudoRandomGeneratorSeed(generator, seed) = curandSetPseudoRandomGeneratorSeed(generator, seed) let SetGeneratorOffset(generator, offset) = curandSetGeneratorOffset(generator, offset) let SetGeneratorOrdering(generator, order) = curandSetGeneratorOrdering(generator, order) let SetQuasiRandomGeneratorDimensions(generator, dimensions) = curandSetQuasiRandomGeneratorDimensions(generator, dimensions) let CopyToHost(out:'T array, cudaPtr:CUDAPointer) = let devPtr = 
cudaPtr.Pointer let outputPtr = GCHandle.Alloc(out, GCHandleType.Pinned).AddrOfPinnedObject() let unitSize = Marshal.SizeOf(typeof<float32>) let n = out.Length let size = SizeT(n * unitSize) let r = CUDARuntime32.cudaMemcpy(outputPtr, devPtr, size, CUDAMemcpyKind.cudaMemcpyDeviceToHost) r let GenerateUniform(generator, n:int) = let unitSize = Marshal.SizeOf(typeof<float32>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime32.cudaMalloc(&devicePtr, size) let r = curandGenerateUniform(generator, devicePtr, size) (r, CUDAPointer(devicePtr)) let GenerateUniformDouble(generator, n:int) = let unitSize = Marshal.SizeOf(typeof<float>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime32.cudaMalloc(&devicePtr, size) let r = curandGenerateUniform(generator, devicePtr, size) (r, CUDAPointer(devicePtr)) let GenerateNormal(generator, n:int, mean, stddev) = let unitSize = Marshal.SizeOf(typeof<float32>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime32.cudaMalloc(&devicePtr, size) let r = curandGenerateNormal(generator, devicePtr, size, mean, stddev) (r, CUDAPointer(devicePtr)) let GenerateNormalDouble(generator, n:int, mean, stddev) = let unitSize = Marshal.SizeOf(typeof<float32>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime32.cudaMalloc(&devicePtr, size) let r = curandGenerateNormalDouble(generator, devicePtr, size, mean, stddev) (r, CUDAPointer(devicePtr)) let GenerateLogNormal(generator, n:int, mean, stddev) = let unitSize = Marshal.SizeOf(typeof<float32>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime32.cudaMalloc(&devicePtr, size) let r = curandGenerateLogNormal(generator, devicePtr, size, mean, stddev) (r, CUDAPointer(devicePtr)) let GenerateLogNormalDouble(generator, n:int, mean, stddev) = let unitSize 
= Marshal.SizeOf(typeof<float>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime32.cudaMalloc(&devicePtr, size) let r = curandGenerateLogNormalDouble(generator, devicePtr, size, mean, stddev) (r, CUDAPointer(devicePtr)) // CUDA random generator x64 version module CUDARandomDriver64 = [<Literal>] let dllName = "curand64_40_17" [<DllImport(dllName)>] extern curandStatus curandCreateGenerator(RandGenerator& generator, CUDARandomRngType rng_type) [<DllImport(dllName)>] extern curandStatus curandCreateGeneratorHost(RandGenerator& generator, CUDARandomRngType rng_type) [<DllImport(dllName)>] extern curandStatus curandDestroyGenerator(RandGenerator generator) [<DllImport(dllName)>] extern curandStatus curandGenerate(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGenerateLogNormal(RandGenerator generator, IntPtr outputPtr, SizeT n, float mean, float stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateLogNormalDouble(RandGenerator generator, IntPtr outputPtr, SizeT n, double mean, double stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateLongLong(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGenerateNormal(RandGenerator generator, IntPtr outputPtr, SizeT n, float mean, float stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateNormalDouble(RandGenerator generator, IntPtr outputPtr, SizeT n, double mean, double stddev) [<DllImport(dllName)>] extern curandStatus curandGenerateSeeds(RandGenerator generator) [<DllImport(dllName)>] extern curandStatus curandGenerateUniform(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGenerateUniformDouble(RandGenerator generator, IntPtr outputPtr, SizeT num) [<DllImport(dllName)>] extern curandStatus curandGetDirectionVectors32(RandDirectionVectors32& vectors, CUDADirectionVectorSet 
set) [<DllImport(dllName)>] extern curandStatus curandGetDirectionVectors64(RandDirectionVectors64& vectors, CUDADirectionVectorSet set) [<DllImport(dllName)>] extern curandStatus curandGetScrambleConstants32(IntPtr& constants) [<DllImport(dllName)>] extern curandStatus curandGetScrambleConstants64(IntPtr& constants) [<DllImport(dllName)>] extern curandStatus curandGetVersion(int& version) [<DllImport(dllName)>] extern curandStatus curandSetGeneratorOffset(RandGenerator generator, uint64 offset) [<DllImport(dllName)>] extern curandStatus curandSetGeneratorOrdering(RandGenerator generator, CUDARandomOrdering order) [<DllImport(dllName)>] extern curandStatus curandSetPseudoRandomGeneratorSeed(RandGenerator generator, uint64 seed); [<DllImport(dllName)>] extern curandStatus curandSetQuasiRandomGeneratorDimensions(RandGenerator generator, uint32 num_ dimensions) [<DllImport(dllName)>] extern curandStatus curandSetStream(RandGenerator generator, CUDAStream stream) let CreateGenerator(rng_type) = let mutable generator = Unchecked.defaultof<RandGenerator> let r = curandCreateGenerator(&generator, rng_type) (r, generator) let DestroyGenerator(generator) = curandDestroyGenerator(generator) let SetPseudoRandomGeneratorSeed(generator, seed) = curandSetPseudoRandomGeneratorSeed(generator, seed) let SetGeneratorOffset(generator, offset) = curandSetGeneratorOffset(generator, offset) let SetGeneratorOrdering(generator, order) = curandSetGeneratorOrdering(generator, order) let SetQuasiRandomGeneratorDimensions(generator, dimensions) = curandSetQuasiRandomGeneratorDimensions(generator, dimensions) let GenerateUniform(generator, n:int) = let unitSize = Marshal.SizeOf(typeof<float32>) let size = SizeT(n * unitSize) let mutable devicePtr = Unchecked.defaultof<IntPtr> let r = CUDARuntime64.cudaMalloc(&devicePtr, size) let r = curandGenerateUniform(generator, devicePtr, size) (r, CUDAPointer(devicePtr)) let GenerateUniformDouble(generator, n:int) = let unitSize = 
Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime64.cudaMalloc(&devicePtr, size)
        // Use the double-precision generator to match the double-sized buffer
        let r = curandGenerateUniformDouble(generator, devicePtr, size)
        (r, CUDAPointer(devicePtr))
    let GenerateNormal(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime64.cudaMalloc(&devicePtr, size)
        let r = curandGenerateNormal(generator, devicePtr, size, mean, stddev)
        (r, CUDAPointer(devicePtr))
    let GenerateNormalDouble(generator, n:int, mean, stddev) =
        // Double-precision output needs a buffer sized for float (System.Double)
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime64.cudaMalloc(&devicePtr, size)
        let r = curandGenerateNormalDouble(generator, devicePtr, size, mean, stddev)
        (r, CUDAPointer(devicePtr))
    let GenerateLogNormal(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float32>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime64.cudaMalloc(&devicePtr, size)
        let r = curandGenerateLogNormal(generator, devicePtr, size, mean, stddev)
        (r, CUDAPointer(devicePtr))
    let GenerateLogNormalDouble(generator, n:int, mean, stddev) =
        let unitSize = Marshal.SizeOf(typeof<float>)
        let size = SizeT(n * unitSize)
        let mutable devicePtr = Unchecked.defaultof<IntPtr>
        let r = CUDARuntime64.cudaMalloc(&devicePtr, size)
        let r = curandGenerateLogNormalDouble(generator, devicePtr, size, mean, stddev)
        (r, CUDAPointer(devicePtr))
If you prefer to use a class, a class version of CUDARandom is defined in Example 9-11.
type CUDARandom() =
    // Choose the 32-bit or 64-bit driver module from the pointer size of the process
    let is64bit = IntPtr.Size = 8
    member this.CreateGenerator(rand_type) =
        if is64bit then CUDARandomDriver64.CreateGenerator(rand_type)
        else CUDARandomDriver32.CreateGenerator(rand_type)
    member this.DestroyGenerator(g) =
        if is64bit then CUDARandomDriver64.DestroyGenerator(g)
        else CUDARandomDriver32.DestroyGenerator(g)
    member this.SetPseudoRandomGeneratorSeed(g, obj) =
        if is64bit then CUDARandomDriver64.SetPseudoRandomGeneratorSeed(g, obj |> unbox |> uint64)
        else CUDARandomDriver32.SetPseudoRandomGeneratorSeed(g, obj |> unbox |> uint64)
    member this.SetGeneratorOffset(g, obj) =
        if is64bit then CUDARandomDriver64.SetGeneratorOffset(g, obj |> unbox |> uint64)
        else CUDARandomDriver32.SetGeneratorOffset(g, obj |> unbox |> uint64)
    member this.SetGeneratorOrdering(g, ordering) =
        if is64bit then CUDARandomDriver64.SetGeneratorOrdering(g, ordering)
        else CUDARandomDriver32.SetGeneratorOrdering(g, ordering)
    member this.SetQuasiRandomGeneratorDimensions(g, obj) =
        if is64bit then CUDARandomDriver64.SetQuasiRandomGeneratorDimensions(g, obj |> unbox |> uint32)
        else CUDARandomDriver32.SetQuasiRandomGeneratorDimensions(g, obj |> unbox |> uint32)
    // In the Generate* members, n is the number of values to generate
    member this.GenerateUniform(g, n) =
        if is64bit then CUDARandomDriver64.GenerateUniform(g, n)
        else CUDARandomDriver32.GenerateUniform(g, n)
    member this.GenerateUniformDouble(g, n) =
        if is64bit then CUDARandomDriver64.GenerateUniformDouble(g, n)
        else CUDARandomDriver32.GenerateUniformDouble(g, n)
    member this.GenerateNormal(g, n, mean, stddev) =
        if is64bit then CUDARandomDriver64.GenerateNormal(g, n, mean, stddev)
        else CUDARandomDriver32.GenerateNormal(g, n, mean, stddev)
    member this.GenerateNormalDouble(g, n, mean, stddev) =
        if is64bit then CUDARandomDriver64.GenerateNormalDouble(g, n, mean, stddev)
        else CUDARandomDriver32.GenerateNormalDouble(g, n, mean, stddev)
    member this.GenerateLogNormal(g, n, mean, stddev) =
        if is64bit then CUDARandomDriver64.GenerateLogNormal(g, n, mean, stddev)
        else CUDARandomDriver32.GenerateLogNormal(g, n, mean, stddev)
    member this.GenerateLogNormalDouble(g, n, mean, stddev) =
        if is64bit then CUDARandomDriver64.GenerateLogNormalDouble(g, n, mean, stddev)
        else CUDARandomDriver32.GenerateLogNormalDouble(g, n, mean, stddev)
The sample code needed to invoke the CUDARandom class is shown in Example 9-12. The sample code generates 256 random numbers.
open System.Runtime.InteropServices

let test6() =
    let n = 256
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then
        let status, v = r.GenerateUniform(g, n)
        if status = curandStatus.CURAND_SUCCESS then
            let array : float32 array = Array.zeroCreate n
            // Note: the array should be pinned (for example, with GCHandle) before its address is taken
            let p = Marshal.UnsafeAddrOfPinnedArrayElement(array, 0)
            // Copy the generated numbers from device memory into the host array
            CUDARuntime.CUDARuntime64.cudaMemcpy(
                p, v.Pointer, SizeT(n * sizeof<float32>),
                CUDAMemcpyKind.cudaMemcpyDeviceToHost) |> ignore
            r.DestroyGenerator(g) |> ignore
            array |> Seq.iter (printfn "%A")
        else
            printfn "generation failed. status = %A" status
            r.DestroyGenerator(g) |> ignore
    else
        // No generator was created, so there is nothing to destroy
        printfn "create generator failed. status = %A" status
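Marshal.UnsafeAddrOfPinnedArrayElement does not pin the array itself; it expects the caller to have pinned it already. The following is a minimal sketch of one way to do that with GCHandle. The helper name withPinned is hypothetical and is not part of the chapter's library, and Marshal.Copy stands in for cudaMemcpy so that the sketch runs without a GPU.

```fsharp
open System
open System.Runtime.InteropServices

// Hypothetical helper: pins a managed array, passes its address to an action,
// and always releases the pin afterwards.
let withPinned (data: 'T[]) (action: IntPtr -> 'R) : 'R =
    let handle = GCHandle.Alloc(data, GCHandleType.Pinned)
    try
        action (handle.AddrOfPinnedObject())
    finally
        handle.Free()

// Stand-in for a device-to-host copy: write four values into the pinned buffer.
let buffer : float32[] = Array.zeroCreate 4
let source = [| 1.0f; 2.0f; 3.0f; 4.0f |]
withPinned buffer (fun p -> Marshal.Copy(source, 0, p, source.Length))
```

In the GPU version, the body of the action would be the cudaMemcpy call, so the garbage collector cannot move the array while the copy is in progress.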
The generated random numbers follow a uniform distribution. If the random numbers need to follow a custom distribution instead, you can use an acceptance-rejection algorithm. This method is based on the observation that sampling from a distribution is equivalent to sampling uniformly from the region under the graph of its density function. The algorithm works like this:
Sample a point x from a proposal distribution, such as the uniform distribution.
Draw a vertical line at x that crosses the graph of the target function.
Sample a value y uniformly along this vertical line. If the point (x, y) lies outside the region under the target function's graph, that is, if y > f(x), reject it; otherwise, accept it.
You can then use a filter to keep only the accepted sample values for the function f(x), as shown in Example 9-13.
open System.Runtime.InteropServices

let test6_2() =
    let n = 256
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then
        let status, v = r.GenerateUniform(g, n)
        if status = curandStatus.CURAND_SUCCESS then
            let array : float32 array = Array.zeroCreate n
            // Note: the array should be pinned (for example, with GCHandle) before its address is taken
            let p = Marshal.UnsafeAddrOfPinnedArrayElement(array, 0)
            CUDARuntime.CUDARuntime64.cudaMemcpy(
                p, v.Pointer, SizeT(n * sizeof<float32>),
                CUDAMemcpyKind.cudaMemcpyDeviceToHost) |> ignore
            r.DestroyGenerator(g) |> ignore
            array
        else
            r.DestroyGenerator(g) |> ignore
            failwithf "generation failed. status = %A" status
    else
        failwithf "create generator failed. status = %A" status

let test7() =
    // Take x and y from the two halves of one batch; two generators created with
    // the same default seed would produce identical sequences
    let values = test6_2()
    let half = values.Length / 2
    let xArray = values.[.. half - 1]
    let yArray = values.[half ..]
    Array.zip xArray yArray
    // Accept only the points on or under the curve f(x) = x * x
    |> Array.filter (fun (x, y) -> y <= x * x)
    |> Seq.iter (fun (x, y) -> printfn "(%A, %A)" x y)
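The acceptance-rejection steps can also be tried on the CPU before involving the GPU. The following is a minimal sketch under stated assumptions: System.Random replaces the cuRAND generator, the proposal is uniform on [0, 1), and the function name sampleByRejection and its parameters are hypothetical names introduced for this illustration.

```fsharp
open System

// Hypothetical helper: draws `count` candidate points and returns the x values
// whose (x, y) point lies on or under the curve f.
// fMax is assumed to be an upper bound of f on [0, 1).
let sampleByRejection (rng: Random) (f: float -> float) (fMax: float) (count: int) =
    Array.init count (fun _ ->
        let x = rng.NextDouble()          // candidate from the uniform proposal
        let y = rng.NextDouble() * fMax   // uniform height along the vertical line at x
        x, y)
    |> Array.filter (fun (x, y) -> y <= f x)  // reject points above the curve
    |> Array.map fst

// Target shape f(x) = x * x on [0, 1), bounded above by 1.0
let samples = sampleByRejection (Random(42)) (fun x -> x * x) 1.0 10000
printfn "accepted %d of 10000 candidates" samples.Length
```

Because the area under x * x on [0, 1) is 1/3 of the unit square, roughly a third of the candidates should be accepted, which is a quick sanity check on the filter direction.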
The second library is the CUDA Basic Linear Algebra Subroutines library (cuBLAS). According to the NVIDIA website (http://developer.nvidia.com/cuda/cublas), the cuBLAS library is a GPU-accelerated version of the complete standard BLAS library that delivers performance that’s 6 to 17 times faster than the latest MKL BLAS. The code that defines the data structure in F# is shown in Example 9-14.
[<Struct>]
type CUDABLASHandle =
    val handle : uint32

[<Struct>]
type CUDAStream =
    val Value : int

[<Struct>]
type CUDAFloatComplex =
    val real : float32
    val imag : float32

type CUBLASPointerMode =
    | Host = 0
    | Device = 1

type CUBLASStatus =
    | Success = 0
    | NotInitialized = 1
    | AllocFailed = 3
    | InvalidValue = 7
    | ArchMismatch = 8
    | MappingError = 11
    | ExecutionFailed = 13
    | InternalError = 14
The F# wrapper code for the cuBLAS library is shown in Example 9-15. The 32-bit version does not list all of the functions. The only difference between the 32-bit version and the 64-bit version is the value of dllName: the 64-bit module uses cublas64_42_9, and the 32-bit module uses cublas32_42_9.
module CUDABLASDriver64 = [<Literal>] let dllName = "cublas64_42_9" [<DllImport(dllName)>] extern CUBLASStatus cublasInit() [<DllImport(dllName)>] extern CUBLASStatus cublasShutdown() [<DllImport(dllName)>] extern CUBLASStatus cublasGetError() [<DllImport(dllName)>] extern CUBLASStatus cublasFree(CUDAPointer devicePtr) [<DllImport(dllName)>] extern CUBLASStatus cublasCreate_v2(CUDABLASHandle& handle) [<DllImport(dllName)>] extern CUBLASStatus cublasSetStream_v2(CUDABLASHandle handle, CUDAStream streamId) [<DllImport(dllName)>] extern CUBLASStatus cublasGetStream_v2(CUDABLASHandle handle, CUDAStream& streamId) [<DllImport(dllName)>] extern CUBLASStatus cublasGetPointerMode_v2(CUDABLASHandle handle, CUBLASPointerMode& mode) [<DllImport(dllName)>] extern CUBLASStatus cublasSetPointerMode_v2(CUDABLASHandle handle, CUBLASPointerMode mode) [<DllImport(dllName)>] extern CUBLASStatus cublasIcamax_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIdamax_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIsamax_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIzamax_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIcamin_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIdamin_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIsamin_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasIzamin_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, int& result) [<DllImport(dllName)>] extern CUBLASStatus cublasSasum_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float32& result) [<DllImport(dllName)>] extern CUBLASStatus 
cublasDasum_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float& result) [<DllImport(dllName)>] extern CUBLASStatus cublasScasum_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float32& result) [<DllImport(dllName)>] extern CUBLASStatus cublasDzasum_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float& result) [<DllImport(dllName)>] extern CUBLASStatus cublasSaxpy_v2(CUDABLASHandle handle, int n, float32& alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasDaxpy_v2(CUDABLASHandle handle, int n, float& alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasCaxpy_v2(CUDABLASHandle handle, int n, CUDAFloatComplex& alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasZaxpy_v2(CUDABLASHandle handle, int n, CUDAFloatComplex& alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasScopy_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasDcopy_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasCcopy_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasZcopy_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasSdot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float32& result) [<DllImport(dllName)>] extern CUBLASStatus cublasDdot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float& result) [<DllImport(dllName)>] extern CUBLASStatus cublasCdotu_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, CUDAFloatComplex& result) [<DllImport(dllName)>] extern CUBLASStatus cublasCdotc_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int 
incy, CUDAFloatComplex& result) [<DllImport(dllName)>] extern CUBLASStatus cublasZdotu_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, CUDAFloatComplex& result) [<DllImport(dllName)>] extern CUBLASStatus cublasZdotc_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, CUDAFloatComplex& result) [<DllImport(dllName)>] extern CUBLASStatus cublasSnrm2_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float32&result) [<DllImport(dllName)>] extern CUBLASStatus cublasDnrm2_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float& result) [<DllImport(dllName)>] extern CUBLASStatus cublasScnrm2_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float32&result) [<DllImport(dllName)>] extern CUBLASStatus cublasDznrm2_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, float& result) [<DllImport(dllName)>] extern CUBLASStatus cublasSrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasDrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasCrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasCsrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasZrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasZdrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasSrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasDrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern 
CUBLASStatus cublasCrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasZrotg_v2(CUDABLASHandle handle, IntPtr a, IntPtr b, IntPtr c, IntPtr s) [<DllImport(dllName)>] extern CUBLASStatus cublasSrotm_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasDrotm_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasSrotmg_v2(CUDABLASHandle handle, IntPtr d1, IntPtr d2, IntPtr x1, IntPtr y1, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasDrotmg_v2(CUDABLASHandle handle, IntPtr d1, IntPtr d2, IntPtr x1, IntPtr y1, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasSscal_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx) [<DllImport(dllName)>] extern CUBLASStatus cublasDscal_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx) [<DllImport(dllName)>] extern CUBLASStatus cublasCscal_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx) [<DllImport(dllName)>] extern CUBLASStatus cublasCsscal_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx) [<DllImport(dllName)>] extern CUBLASStatus cublasZscal_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx) [<DllImport(dllName)>] extern CUBLASStatus cublasZdscal_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx) [<DllImport(dllName)>] extern CUBLASStatus cublasSswap_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasDswap_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasCswap_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasZswap_v2(CUDABLASHandle handle, int n, IntPtr x, int 
incx, IntPtr y, int incy) module CUDABLASDriver32 = [<Literal>] let dllName = "cublas32_42_9" [<DllImport(dllName)>] extern CUBLASStatus cublasInit() [<DllImport(dllName)>] extern CUBLASStatus cublasShutdown() ... // other functions are same as the 64-bit version
You can use dumpbin.exe to check the functions exported from the cuBLAS DLL. The cublas64_<version>.dll file is located in the directory <CUDA installation folder>\C\common\bin. The output from dumpbin.exe shows that a function can have two versions. For example, both cublasZdscal and cublasZdscal_v2 appear in the export list. It is recommended that you use the function with the _v2 suffix.
Some functions in the cuBLAS library can be declared with more than one parameter signature. Because F# does not allow two functions with the same name in one module, another module is needed to declare the alternative signatures, as shown in Example 9-16. Only some of the 64-bit functions are listed; the rest are the same as those in the 32-bit module.
module CUDABLASDriver32_2 = [<Literal>] let dllName = "cublas32_42_9" [<DllImport(dllName)>] extern CUBLASStatus cublasSrotmg_v2(CUDABLASHandle handle, float32& d1, float32& d2, float32& x1, float32& y1, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasDrotmg_v2(CUDABLASHandle handle, float& d1, float& d2, float& x1, float& y1, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasZrotg_v2(CUDABLASHandle handle, CUDAFloatComplex& a, CUDAFloatComplex& b, float& c, CUDAFloatComplex& s) [<DllImport(dllName)>] extern CUBLASStatus cublasSrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float32&c, float32&s) [<DllImport(dllName)>] extern CUBLASStatus cublasCrotg_v2(CUDABLASHandle handle, CUDAFloatComplex& a, CUDAFloatComplex& b, float32& c, CUDAFloatComplex& s) [<DllImport(dllName)>] extern CUBLASStatus cublasDrotg_v2(CUDABLASHandle handle, float& a, float& b, float& c, float& s) [<DllImport(dllName)>] extern CUBLASStatus cublasSrotg_v2(CUDABLASHandle handle, float32& a, float32& b, float32& c, float32& s) [<DllImport(dllName)>] extern CUBLASStatus cublasZdrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float& c, float& s) [<DllImport(dllName)>] extern CUBLASStatus cublasSaxpy_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasDaxpy_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasCaxpy_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasZaxpy_v2(CUDABLASHandle handle, int n, IntPtr alpha, IntPtr x, int incx, IntPtr y, int incy) [<DllImport(dllName)>] extern CUBLASStatus cublasDrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float& c, float& s) [<DllImport(dllName)>] extern CUBLASStatus 
cublasCrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float32&c, CUDAFloatComplex& s) [<DllImport(dllName)>] extern CUBLASStatus cublasCsrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float32& c, float32& s) [<DllImport(dllName)>] extern CUBLASStatus cublasZrot_v2(CUDABLASHandle handle, int n, IntPtr x, int incx, IntPtr y, int incy, float& c, CUDAFloatComplex& s) module CUDABLASDriver64_2 = [<Literal>] let dllName = "cublas64_42_9" [<DllImport(dllName)>] extern CUBLASStatus cublasSrotmg_v2(CUDABLASHandle handle, float32& d1, float32& d2, float32& x1, float32& y1, IntPtr param) [<DllImport(dllName)>] extern CUBLASStatus cublasDrotmg_v2(CUDABLASHandle handle, float& d1, float& d2, float& x1, float& y1, IntPtr param) ... // other 64-bit version functions
The code to invoke the preceding API is listed in Example 9-17. The variable a is a 100-element array with every element set to 2, and the variable b is a 100-element array with every element set to 1, so the dot product of a and b is 200.
open System.Runtime.InteropServices

let test8() =
    let n = 100
    let a = Array.create n 2.f
    let b = Array.create n 1.f

    // Allocate device memory and copy a host array into it
    let copyToDevice(array : float32[]) =
        // Note: the array should be pinned (for example, with GCHandle) before its address is taken
        let p = Marshal.UnsafeAddrOfPinnedArrayElement(array, 0)
        let mutable dst = IntPtr()
        let count = SizeT(sizeof<float32> * array.Length)
        let r = CUDARuntime.CUDARuntime64.cudaMalloc(&dst, count)
        if r = cudaError.cudaSuccess then
            let r = CUDARuntime.CUDARuntime64.cudaMemcpy(
                        dst, p, count, CUDAMemcpyKind.cudaMemcpyHostToDevice)
            if r = cudaError.cudaSuccess then Some dst else None
        else
            None

    let mutable handle = CUDABLASHandle()
    let r = CUDABLASDriver64.cublasInit()
    let r = CUDABLASDriver64.cublasCreate_v2(&handle)
    let deviceA = copyToDevice(a)
    let deviceB = copyToDevice(b)
    let mutable result = 0.f
    match deviceA, deviceB with
    | Some(pA), Some(pB) ->
        // The second argument is the number of elements, not the size in bytes
        let status = CUDABLASDriver64.cublasSdot_v2(handle, n, pA, 1, pB, 1, &result)
        if status = CUBLASStatus.Success then printfn "result is %A" result
        else printfn "cublasSdot_v2 failed. status = %A" status
    | _ -> failwith "computation error"
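A CPU reference value is a convenient way to check the number that comes back from the GPU. The following is a minimal sketch; the function name dot is a hypothetical helper that is not part of the cuBLAS wrapper.

```fsharp
// Hypothetical CPU reference for the dot product computed by cublasSdot_v2.
let dot (a: float32[]) (b: float32[]) : float32 =
    if a.Length <> b.Length then invalidArg "b" "arrays must have the same length"
    Array.fold2 (fun acc x y -> acc + x * y) 0.0f a b

// Same inputs as the GPU sample: 100 elements of 2 and 100 elements of 1
let expected = dot (Array.create 100 2.0f) (Array.create 100 1.0f)
printfn "expected dot product = %A" expected
```

Comparing this value with the GPU result catches mistakes such as passing a byte count where the API expects an element count.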
An F# quotation is a tree structure that represents the structure of F# code. The ReflectedDefinition attribute is the key attribute used to get an F# quotation. According to MSDN (http://msdn.microsoft.com/en-us/library/ee353643.aspx), here is how you use it:
Add this attribute to the let-binding for the definition of a top-level value to make the quotation expression that implements the value available for use at runtime. Add this attribute to a type or module to make it apply recursively to all the values in the module or all the members of the type.
Example 9-18 shows how to get the quotation expression for a function. The <@@ ... @@> operator is used to get the quotation from a function that is decorated with the ReflectedDefinition attribute.
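As an illustration of the mechanism (a minimal sketch, not the book's listing), the following code quotes a function marked with ReflectedDefinition and then looks up the quotation of its body. The function square is a hypothetical example; Expr.TryGetReflectedDefinition is the standard F# library call that returns the stored definition, if one exists.

```fsharp
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns

// Hypothetical sample function; the attribute stores its quotation for use at run time.
[<ReflectedDefinition>]
let square (x: float32) = x * x

// Quoting the function name yields a lambda that wraps a call to the method.
let quoted : Expr = <@@ square @@>

// Follow the call to the method and ask for its reflected definition.
let definition : Expr option =
    match quoted with
    | Lambda (_, Call (None, mi, _)) -> Expr.TryGetReflectedDefinition(mi)
    | _ -> None

match definition with
| Some body -> printfn "%A" body
| None -> printfn "no reflected definition found"
```

Without the attribute, the lookup returns None, because only the call to the method is available and not its body.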
There are several approaches to converting F# code to CUDA code. Figure 9-6 shows one of these approaches. PTX is an assembly-like language for the GPU, and a PTX file contains code in that language. In this chapter, the path from quotation via C code to a PTX file is chosen.
After the quotation is ready, a translation function can be applied. There are several ways to generate code from a quotation tree. Example 9-20 traverses the tree structure and emits CUDA C code, which makes the code generation easy to follow. To support the code generation, CUDAPointer2 is defined, as shown in Example 9-19.
type CUDAPointer2<'T>(p:CUDAPointer) =
    new(ptr:IntPtr) = CUDAPointer2(CUDAPointer(ptr))
    member this.Pointer = p
    member this.PointerSize = p.PointerSize
    member this.Is64Bit = this.PointerSize = 8
    // The indexer and Set exist only so that quoted code type-checks; they are never executed
    member this.Item
        with get (i:int) : float32 = failwith "for code generation only"
        and set (i:int) (v:float32) = failwith "for code generation only"
    member this.Set(x:float32, i:int) = failwith "for code generation only"
CUDAPointer is defined in Example 9-10.
Example 9-20 implements only part of the conversion and does not provide full support for an F# quotation conversion.
open System open System.Reflection open Microsoft.FSharp.Quotations open Microsoft.FSharp.Quotations.Patterns open Microsoft.FSharp.Quotations.DerivedPatterns open Microsoft.FSharp.Quotations.ExprShape let accessExpr exp = let addSemiCol (str:string) = if str.EndsWith(";") || String.IsNullOrEmpty(str) && String.IsNullOrWhiteSpace(str) then str else str + ";" let rec iterate exp : string= let print x = let str = sprintf "%A" x str let matchExp expOption = match expOption with | Some(n) -> iterate n | None -> String.Empty let isCUDAPointerType (exp:Expr Option) = match exp with | Some(n) -> n.Type.IsAssignableFrom( typeof<CUDAPointer2<float>>) || n.Type.IsAssignableFrom(typeof<CUDAPointer2<float32>>) | _ -> false match exp with | DerivedPatterns.Applications (e, ell) -> let str0 = iterate e let str1 = ell |> Seq.map (fun n -> n |> Seq.map (fun m -> iterate m )) |> Seq.map (fun n -> String.Join(" ", n |> Seq.toArray)) str0 + String.Join(" ", str1 |> Seq.toArray) | DerivedPatterns.AndAlso (e0, e1) -> (iterate e0) + (iterate e1) | DerivedPatterns.Bool e -> print e | DerivedPatterns.Byte e -> print e | DerivedPatterns.Char e -> print e | DerivedPatterns.Double e -> print e | DerivedPatterns.Int16 e-> print e | DerivedPatterns.Int32 e-> print e | DerivedPatterns.Int64 e -> print e | DerivedPatterns.OrElse (e0, e1)-> (iterate e0) + (iterate e1) | DerivedPatterns.SByte e -> print e | DerivedPatterns.Single e -> print e | DerivedPatterns.String e -> print e | DerivedPatterns.UInt16 e -> print e | DerivedPatterns.UInt32 e -> print e | DerivedPatterns.UInt64 e -> print e | DerivedPatterns.Unit e -> String.Empty //"void" | Patterns.AddressOf address -> iterate address | Patterns.AddressSet (exp0, exp1) -> (iterate exp0) + (iterate exp1) | Patterns.Application (exp0, exp1) -> (iterate exp0) + (iterate exp1) | Patterns.Call (expOption, mi, expList) -> if isCUDAPointerType expOption && mi.Name = "Set" then let callObject = matchExp expOption let index = iterate expList.[1] let 
postfix = match mi with | DerivedPatterns.MethodWithReflectedDefinition n -> iterate n | _ -> iterate expList.[0] let s = sprintf "%s[%s] = %s;" callObject index postfix s else let callObject = matchExp expOption let returnType = translateFromNETType mi.ReturnType String.Empty let postfix = match mi with | DerivedPatterns.MethodWithReflectedDefinition n -> iterate n | _ -> translateFromNETOperator mi expList let s = sprintf "%s%s" callObject postfix s | Patterns.Coerce (exp, t) -> let from = iterate exp //sprintf "coerce(%s, %s)" from t.Name sprintf "%s" from | Patterns.DefaultValue exp -> print exp | Patterns.FieldGet (expOption, fi) -> (matchExp expOption) + (print fi) | Patterns.FieldSet (expOption, fi, e) -> let callObj = matchExp expOption let fi = print fi let str = iterate e callObj + fi + str | Patterns.ForIntegerRangeLoop (v, e0, e1, e2) -> let from = iterate e0 let toV = iterate e1 let s = String.Format("for (int {0} = {1}; {0}<{2}; {0}++) {{ {3} }}", v, from ,toV, iterate e2) s | Patterns.IfThenElse (con, exp0, exp1) -> let condition = (iterate con) let ifClause = addSemiCol(iterate exp0) let elseClause = addSemiCol(iterate exp1) sprintf "if (%s) { %s } else { %s }" condition ifClause elseClause | Patterns.Lambda (var,body) -> //let a = print var //let b = iterate body match exp with | DerivedPatterns.Lambdas (vll, e) -> let s = vll |> List.map (fun n-> n |> List.map (fun m -> sprintf "%s %s" (translateFromNETType m.Type "") m.Name)) |> List.fold (fun acc l -> acc@l) [] let parameterNames = vll |> List.map (fun n -> sprintf "%s" n.Head.Name) let returnType = getCallReturnType e let returnTypeID = translateFromNETTypeToFunReturn returnType "" let fid = code.FunctionID; code.IncreaseFunctionID() let functionName = sprintf "ff_%d" fid let statement = iterate e let functionCode = sprintf "__device__ %s %s(%s) { %s } " returnTypeID functionName (String.Join(", ", s)) (addSemiCol(statement)) code.Add(functionCode) sprintf "%s(%s)" functionName (String.Join(", 
", parameterNames)) | _ -> failwith "not supported lambda format" | Patterns.Let (var, exp0, exp1) -> let a = print var let b = iterate exp0 let t = var.Type let s = if t.Name = "FSharpFunc'2" then sprintf "__device__ %s; //function pointer" (translateFromNETType t a) else String.Empty code.Add(s) let c = iterate exp1 let assignment = if t.Name = "FSharpFunc'2" then sprintf "%s; %s = %s;" (translateFromNETType t a) a b else sprintf "%s %s; %s = %s;" (translateFromNETType t a) a a b sprintf "%s %s" assignment c | Patterns.LetRecursive (tupList, exp) -> let strList = tupList |> Seq.map (fun (var, e) -> (print var) + (iterate e)) String.Join(" ", strList |> Seq.toArray) + (iterate exp) | Patterns.NewArray (t, expList) -> let str0 = print t let str1 = expList |> Seq.map (fun e -> iterate e) str0 + String.Join(" ", str1) | Patterns.NewDelegate (t, varList, exp) -> (print t) + (print varList) + (iterate exp) | Patterns.NewObject (t, expList) -> let str0 = print t let str1 = expList |> Seq.map (fun e -> iterate e) str0 + String.Join(" ", str1) | Patterns.NewRecord (t, expList) -> let str0 = print t let str1 = expList |> Seq.map (fun e -> iterate e) str0 + String.Join(" ", str1) | Patterns.NewObject (t, expList) -> let str0 = print t let str1 = expList |> Seq.map (fun e -> iterate e) str0 + String.Join(" ", str1) | Patterns.NewRecord (t, expList) -> let str0 = print t let str1 = expList |> Seq.map (fun e -> iterate e) str0 + String.Join(" ", str1) | Patterns.NewTuple expList -> let ty = translateFromNETType (expList.[0].Type) String.Empty let l = expList |> Seq.map (fun e -> iterate e) let l = String.Join(", ", l) sprintf "newTuple<%s>(%s)" ty l | Patterns.NewUnionCase (t, expList) -> let str0 = print t let str1 = expList |> Seq.map (fun e -> iterate e) str0 + String.Join(" ", str1) | Patterns.PropertyGet (expOption, pi, expList) -> let callObj = matchExp expOption let r = match pi with | DerivedPatterns.PropertyGetterWithReflectedDefinition e -> iterate e | _ -> pi.Name 
let l = expList |> List.map (fun n -> iterate n) if l.Length > 0 then if r = "Item" then sprintf "%s[%s]" callObj (String.Join(", ", l)) else sprintf "%s.%s[%s]" callObj r (String.Join(", ", l)) else if String.IsNullOrEmpty callObj then sprintf "%s" r else sprintf "%s.%s" callObj r | Patterns.PropertySet (expOption, pi, expList, e) -> let callObj = matchExp expOption let r = match pi with | DerivedPatterns.PropertyGetterWithReflectedDefinition e -> iterate e | _ -> print pi let l = expList |> Seq.map (fun n -> iterate n) if r = "Item" then callObj + String.Join(" ", l) + (iterate e) else callObj + r + String.Join(" ", l) + (iterate e) | Patterns.Quote e -> iterate e | Patterns.Sequential (e0, e1) -> let statement0 = addSemiCol(iterate e0) let statement1 = addSemiCol(iterate e1) sprintf "%s %s" statement0 statement1 | Patterns.TryFinally (e0, e1) -> (iterate e0) + (iterate e1) | Patterns.TryWith (e0, v0, e1, v1, e2) -> (iterate e0) + (print v0) + (iterate e1) + (print v1) + (iterate e2) | Patterns.TupleGet (e, i) -> (iterate e) + (print i) | Patterns.TypeTest (e, t) -> (iterate e) + (print t) | Patterns.UnionCaseTest (e, ui) -> (iterate e) + (print ui) | Patterns.Value (obj, t) -> (print obj) + (print t) | Patterns.Var v -> v.Name | Patterns.VarSet (v, e) -> let left = (print v) let right = (iterate e) sprintf "%s = %s" left right | Patterns.WhileLoop (e0, e1) -> let condition = iterate e0 let body = iterate e1 sprintf "while (%s) { %s }" condition (addSemiCol(body)) | _ -> failwith "not supported pattern" and translateFromNETOperator (mi:MethodInfo) (exprList:Expr list) = let getList() = exprList |> List.map (fun n -> iterate n) let ty = translateFromNETType (exprList.[0].Type) String.Empty let generateFunction (mi:MethodInfo) (mappedMethodName:string) (parameters:Expr list) = let result = sprintf "%s(%s)" mappedMethodName (String.Join(", ", getList())) result match mi.Name with | "op_Addition" -> let l = getList() sprintf "(%s) + (%s)" l.[0] l.[1] | 
"op_Subtraction" -> let l = getList() sprintf "(%s) - (%s)" l.[0] l.[1] | "op_Multiply" -> let l = getList() sprintf "(%s) * (%s)" l.[0] l.[1] | "op_Division" -> let l = getList() sprintf "(%s) / (%s)" l.[0] l.[1] | "op_LessThan" -> let l = getList() sprintf "(%s) < (%s)" l.[0] l.[1] | "op_LessThanOrEqual" -> let l = getList() sprintf "(%s) <= (%s)" l.[0] l.[1] | "op_GreaterThan" -> let l = getList() sprintf "(%s) > (%s)" l.[0] l.[1] | "op_GreaterThanOrEqual" -> let l = getList() sprintf "(%s) >= (%s)" l.[0] l.[1] | "op_Range" -> failwith "not support range on GPU" | "op_Equality" -> let l = getList() sprintf "(%s) == (%s)" l.[0] l.[1] | "GetArray" -> let l = getList() sprintf "%s[%s]" l.[0] l.[1] | "CreateSequence" -> failwith "not support createSeq on GPU" | "FailWith" -> failwith "not support exception on GPU" | "ToList" -> failwith "not support toList on GPU" | "Map" -> failwith "not support map on GPU" | "Delay" -> let l = getList() String.Join(", ", l) | "op_PipeRight" -> let l = getList() sprintf "%s ( %s )" l.[1] l.[0] | "ToSingle" -> let l = getList() sprintf "(float) (%s)" l.[0] | _ -> let l = getList() sprintf ".%s(%s)" (mi.Name) (String.Join(", ", l)) let s = iterate exp addSemiCol(s)
Some functions in the preceding code are defined in the rest of the chapter.
The transform function does not list all of the possible conversions inside the match expression. We will demonstrate how to expand this function in later sections of this chapter.
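As a rough illustration of what such an expansion might involve, the sketch below maps a few .NET operator method names to C operators through a small lookup function. The names tryMapBinaryOperator and renderBinary are hypothetical helpers, not part of the chapter's translator, and the two extra operators (op_Modulus and op_Inequality) are simply examples of cases the preceding listing does not handle.

```fsharp
// Hypothetical helper (not part of the chapter's translator): maps the
// .NET method name of a binary operator to the matching C operator.
let tryMapBinaryOperator (methodName: string) : string option =
    match methodName with
    | "op_Addition" -> Some "+"
    | "op_Subtraction" -> Some "-"
    | "op_Multiply" -> Some "*"
    | "op_Division" -> Some "/"
    | "op_Modulus" -> Some "%"       // not handled in the preceding listing
    | "op_Inequality" -> Some "!="   // not handled in the preceding listing
    | _ -> None

// Renders a binary operation in the same parenthesized style the listing uses.
let renderBinary (methodName: string) (left: string) (right: string) : string =
    match tryMapBinaryOperator methodName with
    | Some op -> sprintf "(%s) %s (%s)" left op right
    | None -> failwithf "operator %s is not supported" methodName
```

With a table like this, supporting another binary operator would only need one more match case rather than a new sprintf branch.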
Example 9-21 shows how to convert .NET types to their C equivalents.
type Type with
    member this.HasInterface(t:Type) = this.GetInterface(t.FullName) <> null

let rec translateFromNETType (t:Type) (a:string) =
    if t = typeof<int> then "int"
    elif t = typeof<float32> then "float"
    elif t = typeof<float> then "double"
    elif t = typeof<bool> then "bool"
    elif t.IsArray then
        let elementTypeString = translateFromNETType (t.GetElementType()) a
        sprintf "List<%s>" elementTypeString
    elif t.HasInterface(typeof<System.Collections.IEnumerable>) then
        let elementTypeString = translateFromNETType (t.GetGenericArguments().[0]) a
        sprintf "List<%s>" elementTypeString
    elif t = typeof< Microsoft.FSharp.Core.unit > then String.Empty
    elif t = typeof< CUDAPointer2<float> > then sprintf "%s*" "double"
    elif t = typeof< CUDAPointer2<float32> > then sprintf "%s*" "float"
    elif t.Name = "FSharpFunc`2" then
        let input = translateFromNETType (t.GetGenericArguments().[0]) a
        let out = translateFromNETType (t.GetGenericArguments().[1]) a
        sprintf "%s(*%s)(%s)" input a out
    elif t = typeof<System.Void> then String.Empty
    else failwith "not supported type"

let translateFromNETTypeToFunReturn (t:Type) (a:string) =
    let r = translateFromNETType t a
    if String.IsNullOrEmpty(r) then "void" else r

let translateFromNETTypeLength (t:Type) c =
    if t.IsArray then sprintf ", int %A_len" c
    else String.Empty

let isValueType (t:Type) =
    if t.IsValueType then true
    elif t.HasInterface(typeof<IEnumerable>) then false
    else failwith "is value type failed"
The code generation needs to generate a few intermediary functions. Example 9-22 shows the code structure used to generate these functions.
type Code() =
    inherit System.Collections.Generic.List<string>()
    let mutable functionID = 0
    let mutable variableID = 0
    member this.FunctionID with get () = functionID and set (v) = functionID <- v
    member this.VariableID with get () = variableID and set (v) = variableID <- v
    member this.IncreaseFunctionID() = functionID <- this.FunctionID + 1
    member this.IncreaseVariableID() = variableID <- this.VariableID + 1
    member this.ToCode() = "#include \"CUDALibrary.h\"\n" + String.Join("\n", this) + "\n"

let code = Code()
The CUDALibrary.h file is a placeholder file. In this sample, the file content is empty. You can implement additional functions in this file to make the translation easier and the code more readable.
Example 9-23 shows several functions used to handle the function definitions, including return type and function signatures, in F# and convert them to CUDA code.
open System
open System.Reflection
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns
open Microsoft.FSharp.Quotations.ExprShape

let rec getFunctionBody (exp:Expr) =
    match exp with
    | DerivedPatterns.Lambdas(c, callPattern) ->
        match callPattern with
        | Patterns.Call (e, mi, exprList) ->
            match mi with
            | DerivedPatterns.MethodWithReflectedDefinition n -> callPattern
            | _ -> callPattern
        | Patterns.Sequential _ -> callPattern
        | _ -> callPattern
    | Patterns.Sequential _ -> exp
    | _ -> failwith "Argument must be of the form <@ foo @>!"

let getFunctionParameterAndReturn (exp:Expr) =
    match exp with
    | DerivedPatterns.Lambdas (c, Patterns.Call(a, mi, b)) -> Some(b, mi.ReturnType)
    | _ -> None

let getFunctionName (exp:Expr) =
    match exp with
    | DerivedPatterns.Lambdas(c, Patterns.Call(a, mi, b)) -> mi.Name
    | _ -> failwith "Argument must be of the form <@ foo @>!"

let getFunctionTypes (exp:Expr) =
    match getFunctionParameterAndReturn(exp) with
    | Some(exprList, t) ->
        let out = exprList |> List.map (fun n -> (n, n.Type))
        Some(out, t)
    | None -> None

let getFunctionReturnType (exp:Expr) =
    match getFunctionTypes(exp) with
    | Some(_, t) -> t
    | _ -> failwith "cannot find return type"

let rec getCallReturnType (exp:Expr) =
    match exp with
    | Patterns.Call (_, mi, _) -> mi.ReturnType
    | Patterns.Var n -> n.Type
    | Patterns.Let (var, e0, e1) -> getCallReturnType e1
    | Patterns.Sequential (e0, e1) -> getCallReturnType e1
    | Patterns.Value(v) -> snd v
    | Patterns.WhileLoop(e0, e1) -> typeof<System.Void>
    | Patterns.ForIntegerRangeLoop(var, e0, e1, e2) -> getCallReturnType e2
    | Patterns.IfThenElse(e0, e1, e2) -> typeof<System.Void>
    | _ -> failwith "not supported expr type"

let getFunctionSignature (exp:Expr) =
    let template = @"extern ""C"" __global__ void {0} ({1}) "
    let functionName = getFunctionName(exp)
    let parameters = getFunctionParameterAndReturn(exp)
    match parameters with
    | Some(exprList, _) ->
        let parameterNames =
            exprList
            |> Seq.map (fun n ->
                match n with
                | _ when n.Type.IsAssignableFrom(typeof<CUDAPointer2<float>>) ->
                    sprintf "double* %s" (n.ToString())
                | _ when n.Type.IsAssignableFrom(typeof<CUDAPointer2<float32>>) ->
                    sprintf "float* %s" (n.ToString())
                | _ -> sprintf "%s %s" (n.Type.Name) (n.ToString()))
        String.Format(template, functionName, String.Join(", ", parameterNames))
    | None -> failwith "cannot get parameter and return type"
Some low-end GPU hardware supports only the float32 data type. Example 9-24 checks whether a type or a value can be used on the GPU, and you can modify it to match your hardware. If your hardware does not support int or float, change the corresponding result to false.
let isTypeGPUOK t =
    if t = typeof<int> || t = typeof<float32> || t = typeof<float> then true
    else false

let isValueGPUOK (v:obj) =
    match v with
    | :? int
    | :? float32 -> true
    | :? float -> true // if the hardware does not support this, you can make it false
    | _ -> false
Before you can translate the F# quotation, you need an attribute, as shown in Example 9-25. The attribute marks the functions that need to be converted, so that reflection can find the GPU functions in the assembly.
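A marker attribute of this kind can be very small. The following is a minimal sketch, assuming the attribute carries no data and is named GPUAttribute (the name the later scanning code looks for). The Kernels type and the gpuMethodNames helper are hypothetical and exist only to demonstrate the reflection lookup.

```fsharp
open System
open System.Reflection

// Assumed shape of the marker attribute: it carries no data of its own.
type GPUAttribute() =
    inherit Attribute()

// Hypothetical host type, used only to demonstrate the lookup.
type Kernels() =
    [<GPU>]
    static member AddOne (x: int) = x + 1
    static member NotAKernel (x: int) = x

// Returns the names of the public static methods on a type that carry the GPU attribute.
let gpuMethodNames (t: Type) : string list =
    t.GetMethods(BindingFlags.Public ||| BindingFlags.Static)
    |> Array.filter (fun mi -> mi.GetCustomAttributes(typeof<GPUAttribute>, true).Length > 0)
    |> Array.map (fun mi -> mi.Name)
    |> Array.toList
```

Calling gpuMethodNames typeof<Kernels> returns only the method marked with the attribute, which is the same filtering idea the assembly scan relies on.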
The goal for this section is to translate the F# code in Example 9-26 to CUDA code. The sample function adds two arrays, named a and b, and stores the result in array c. The pascalTriangle function computes the Pascal Triangle values on a given row. The sample2 function demonstrates how for and while loops are translated from F# code to CUDA code.
CUDA uses an array of threads to process the data, so the code output.[threadId] = input.[threadId] is an operation that will be executed by many threads. Because of these threads, you can process a number of elements in an array simultaneously. More detailed information can be found in the NVIDIA documentation at http://www.nvidia.com/content/cudazone/download/Getting_Started_w_CUDA_Training_NVISION08.pdf.
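To make the one-thread-per-element idea concrete, here is a CPU-side emulation in F#. It is only a sketch: Array.Parallel stands in for the GPU's threads, and the function name copyPerThread is hypothetical.

```fsharp
// CPU-side emulation (assumption: Array.Parallel stands in for GPU threads).
let copyPerThread (input: float32[]) : float32[] =
    let output = Array.zeroCreate<float32> input.Length
    // Each index plays the role of one thread ID; every "thread" touches exactly one element.
    Array.Parallel.iteri (fun threadId value -> output.[threadId] <- value) input
    output
```

Because no two indices write to the same slot, the work can run in any order, which is the property that lets the GPU run it across many threads at once.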
[<ReflectedDefinition; GPU>]
let sample (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>) =
    let x = blockIdx.x
    c.Set(a.[x] + b.[x], x) //c.[x] = a.[x] + b.[x]
    ()

[<ReflectedDefinition; GPU>]
let pascalTriangle (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let x = blockIdx.x
    if x = 0 then b.Set(a.[x], x)
    else b.Set(a.[x] + a.[x - 1], x)
    ()

[<ReflectedDefinition; GPU>]
let sample2 (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>) =
    let x = blockIdx.x
    for j = 0 to x do
        c.Set(a.[j] + b.[j], x)
    let mutable i = 3
    while i >= 8 do
        if i > 3 then
            c.Set(a.[i] + b.[i], x)
        else
            c.Set(a.[i] + b.[0], x)
        i <- i + 1
To translate the code, you need to first define some basic data structures, which are defined in Example 9-27. The GPU kernel can launch multiple threads simultaneously. ThreadIdx is the thread identifier. BlockDim is a way to segment the data. Their relationship is shown in Figure 9-7.
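In CUDA, a one-dimensional launch conventionally combines these values into a flat element index as blockIdx.x * blockDim.x + threadIdx.x. The sketch below spells out that arithmetic on the CPU; the function names are hypothetical, and the chapter's own samples index with blockIdx.x alone, which appears to correspond to one thread per block.

```fsharp
// Flat element index for a one-dimensional launch (standard CUDA convention).
let globalIndex (blockIdx: int) (blockDim: int) (threadIdx: int) : int =
    blockIdx * blockDim + threadIdx

// Enumerates the index each (block, thread) pair would handle; illustrative only.
let indicesForGrid (gridDim: int) (blockDim: int) : int list =
    [ for b in 0 .. gridDim - 1 do
        for t in 0 .. blockDim - 1 do
            yield globalIndex b blockDim t ]
```

For a grid of 2 blocks with 3 threads each, every index from 0 through 5 is produced exactly once, so each array element is owned by a single thread.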
type dim3 (x, y, z) =
    new() = dim3(0, 0, 0)
    new(x) = dim3(x, 0, 0)
    new(x, y) = dim3(x, y, 0)
    member this.x = x
    member this.y = y
    member this.z = z

type ThreadIdx() =
    inherit dim3()

type BlockDim() =
    inherit dim3()
The generated code has two sections: the function that starts with the __device__ keyword, and the function with the __global__ keyword. A __device__ function cannot be invoked from the host, but a __global__ function can. To reach the functionality in a __device__ function from the host, there must be a __global__ function that calls it. The generated code is in the following format:
#include "CUDALibrary.h"

__device__ void ff_0(float* a, float* b) {
    // some generated code
}

extern "C" __global__ void sample4 (float* a, float* b) {
    ff_0(a, b);
}
The code has three parts:
Header file CUDALibrary.h
Generated __device__ function (ff_0 in the preceding code)
Generated __global__ function, which will be invoked from the host and call into the __device__ function ff_0
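The three parts can be assembled with plain string formatting. The following sketch is not the chapter's generator: generatePair is a hypothetical helper that takes an ID, a kernel name, a list of (C type, name) parameter pairs, and an already-translated body, and emits the header line, the __device__ function, and the __global__ wrapper that forwards to it.

```fsharp
// Hypothetical generator: emits a __device__ function plus a __global__
// wrapper that forwards its arguments. Parameters are (C type, name) pairs.
let generatePair (id: int) (kernelName: string) (parameters: (string * string) list) (body: string) : string =
    let signature =
        parameters
        |> List.map (fun (ty, name) -> sprintf "%s %s" ty name)
        |> String.concat ", "
    let arguments = parameters |> List.map snd |> String.concat ", "
    let deviceName = sprintf "ff_%d" id
    [ "#include \"CUDALibrary.h\""
      sprintf "__device__ void %s(%s) { %s }" deviceName signature body
      sprintf "extern \"C\" __global__ void %s (%s) { %s(%s); }" kernelName signature deviceName arguments ]
    |> String.concat "\n"
```

For example, generatePair 0 "sample4" [ ("float*", "a"); ("float*", "b") ] "/* body */" yields a wrapper named sample4 whose only statement is the call ff_0(a, b);.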
Example 9-28 shows the function used to generate the CUDA code. The Code class is used to hold the generated functions. The getCUDACode function is used to get the CUDA code. The translation code translates three samples from F# code to CUDA C code. These three samples show how while and for loops are translated.
let tempFileName = "temp1.cu"

type Code() =
    inherit System.Collections.Generic.List<string>()
    let mutable functionID = 0
    let mutable variableID = 0
    member this.FunctionID with get () = functionID and set (v) = functionID <- v
    member this.VariableID with get () = variableID and set (v) = variableID <- v
    member this.IncreaseFunctionID() = functionID <- this.FunctionID + 1
    member this.IncreaseVariableID() = variableID <- this.VariableID + 1
    member this.ToCode() = "#include \"CUDALibrary.h\"\n" + String.Join("\n", this) + "\n"

let code = Code()

let getCUDACode f =
    let s = getFunctionSignature f
    let body = f |> getFunctionBody
    let functionStr = sprintf "%s { %s } " s (accessExpr body)
    sprintf "%s" functionStr

let getCommonCode() = code.ToCode()

[<ReflectedDefinition; GPU>]
let pascalTriangle (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let x = blockIdx.x
    if x = 0 then b.Set(a.[x], x)
    else b.Set(a.[x] + a.[x-1], x)
    ()

// sample is defined before sample2 because sample2 calls it
[<ReflectedDefinition; GPU>]
let sample (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>) =
    let x = blockIdx.x
    c.Set(a.[x] + b.[x], x) //c.[x] = a.[x] + b.[x]
    ()

[<ReflectedDefinition; GPU>]
let sample2 (a:CUDAPointer2<float>) (b:CUDAPointer2<float>) (c:CUDAPointer2<float>) =
    sample a b c
    let x = blockIdx.x
    for j = 0 to x do
        c.Set(a.[j] + b.[j], x)
    let mutable i = 3
    while i >= 8 do
        if i > 3 then
            c.Set(a.[i] + b.[i], x)
        else
            c.Set(a.[i] + b.[0], x)
        i <- i + 1

let WriteToFile() =
    let a1 = <@@ sample @@>
    let b = getCUDACode(a1)
    let a2 = <@@ sample2 @@>
    let b2 = getCUDACode(a2)
    let a3 = <@@ pascalTriangle @@>
    let b3 = getCUDACode(a3)
    let commonCode = getCommonCode()
    System.IO.File.WriteAllText(tempFileName, commonCode + b + b2 + b3)
    ()
#include "CUDALibrary.h"

__device__ void ff_0(double* a, double* b, double* c) {
    int x;
    x = blockIdx.x;
    c[x] = (a[x]) + (b[x]);
    ;
}

__device__ void ff_1(float* a, float* b) {
    int x;
    x = blockIdx.x;
    if ((x) == (0)) { b[x] = a[x]; }
    else { b[x] = (a[x]) + (a[(x) - (1)]); };
    ;
}

__device__ void ff_3(double* a, double* b, double* c) {
    int x;
    x = blockIdx.x;
    c[x] = (a[x]) + (b[x]);
    ;
}

__device__ void ff_2(double* a, double* b, double* c) {
    ff_3(a, b, c);
    int x;
    x = blockIdx.x;
    for (int j = 0; j < x; j++) { c[x] = (a[j]) + (b[j]); };
    int i;
    i = 3;
    while ((i) >= (8)) {
        if ((i) > (3)) { c[x] = (a[i]) + (b[i]); }
        else { c[x] = (a[i]) + (b[0]); };
        i = (i) + (1);
    };
}

extern "C" __global__ void sample (double* a, double* b, double* c) {
    ff_0(a, b, c);
}

extern "C" __global__ void pascalTriangle (float* a, float* b) {
    ff_1(a, b);
}

extern "C" __global__ void sample2 (double* a, double* b, double* c) {
    ff_2(a, b, c);
}
The CUDALibrary.h header file serves as a placeholder, which can hold some customized functions. In this sample, the file is empty.
It is not convenient to use <@@...@@> on each function that needs to be converted. The GPUAttribute enables you to automatically scan the assembly and generate CUDA code. Example 9-29 shows how to scan the assembly, identify the functions with the GPU attribute, and generate code from these functions. The generated code is stored in the “temp1.cu” file, which is specified by the tempFile value.
let getCommonCode() = code.ToCode()

let getGPUFunctions() =
    let currentAssembly = Assembly.GetExecutingAssembly()
    let gpuMethods =
        currentAssembly.GetTypes()
        |> List.ofArray
        |> List.collect (fun t -> t.GetMethods() |> List.ofArray)
        |> List.filter (fun mi -> mi.GetCustomAttributes(typeof<GPUAttribute>, true).Length > 0)
    gpuMethods
    |> List.map (fun mi -> (mi, (Quotations.Expr.TryGetReflectedDefinition mi)))
    |> List.map (fun (mi, Some(expr)) -> (mi, expr))
    |> List.map (fun (mi, expr) ->
        sprintf "%s { %s }" (getFunctionSignatureFromMethodInfo mi) (accessExpr expr))

let tempFile = "temp1.cu"

let GenerateCodeToFile() =
    let gpuCode = getGPUFunctions()
    let commonCode = getCommonCode()
    let allCode = String.Join(" ", commonCode :: gpuCode)
    System.IO.File.Delete(tempFile)
    System.IO.File.WriteAllText(tempFile, allCode)
The generated C code cannot be executed on the GPU directly. You can use NVCC.exe to compile the C code to a PTX file, which can be loaded and executed on the GPU. The PTX code is not an executable binary. It is more like an assembly language for the GPU, and it is compiled to binary code for the specific target GPU at run time. The generated PTX file is shown in Example 9-30.
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../open64/lib//be.exe
// nvopencc 4.0 built on 2011-05-13

//-----------------------------------------------------------
// Compiling C:/Users/taliu/AppData/Local/Temp/tmpxft_00000a84_00000000-11_temp.cpp3.i (C:/Users/taliu/AppData/Local/Temp/ccBI#.a04920)
//-----------------------------------------------------------

//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
// -O3 (Optimization level)
// -g0 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------

.file 1 "C:/Users/taliu/AppData/Local/Temp/tmpxft_00000a84_00000000-10_temp.cudafe2.gpu"
.file 2 "c:\program files (x86)\microsoft visual studio 10.0\vc\include\codeanalysis\sourceannotations.h"
.file 3 "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\crt/device_runtime.h"
.file 4 "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\host_defines.h"
.file 5 "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\builtin_types.h"
.file 6 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\device_types.h"
.file 7 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\driver_types.h"
.file 8 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\surface_types.h"
.file 9 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\texture_types.h"
.file 10 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\vector_types.h"
.file 11 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\builtin_types.h"
.file 12 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\host_defines.h"
.file 13 "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\device_launch_parameters.h"
.file 14 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\crt\storage_class.h"
.file 15 "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin/../../VC/INCLUDE\time.h"
.file 16 "temp.cu"
.file 17 "c:mycodecodecenterf#fsharpgpufsharpgpuindebugCUDALibrary.h"
.file 18 "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v4.0\bin/../include\common_functions.h"
.file 19 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions.h"
.file 20 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_constants.h"
.file 21 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\device_functions.h"
.file 22 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_11_atomic_functions.h"
.file 23 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_12_atomic_functions.h"
.file 24 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_13_double_functions.h"
.file 25 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_20_atomic_functions.h"
.file 26 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\sm_20_intrinsics.h"
.file 27 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\surface_functions.h"
.file 28 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\texture_fetch_functions.h"
.file 29 "c:\program files\nvidia gpu computing toolkit\cuda\v4.0\include\math_functions_dbl_ptx1.h"

.entry sample (
    .param .u64 __cudaparm_sample_a,
    .param .u64 __cudaparm_sample_b,
    .param .u64 __cudaparm_sample_c)
{
    .reg .u32 %r<3>;
    .reg .u64 %rd<10>;
    .reg .f64 %fd<5>;
    .loc 16 33 0
$LDWbegin_sample:
    .loc 16 5 0
    cvt.s32.u16 %r1, %ctaid.x;
    cvt.s64.s32 %rd1, %r1;
    mul.wide.s32 %rd2, %r1, 8;
    ld.param.u64 %rd3, [__cudaparm_sample_a];
    add.u64 %rd4, %rd3, %rd2;
    ld.global.f64 %fd1, [%rd4+0];
    ld.param.u64 %rd5, [__cudaparm_sample_b];
    add.u64 %rd6, %rd5, %rd2;
    ld.global.f64 %fd2, [%rd6+0];
    add.f64 %fd3, %fd1, %fd2;
    ld.param.u64 %rd7, [__cudaparm_sample_c];
    add.u64 %rd8, %rd7, %rd2;
    st.global.f64 [%rd8+0], %fd3;
    .loc 16 35 0
    exit;
$LDWend_sample:
} // sample

.entry sample2 (
    .param .u64 __cudaparm_sample2_a,
    .param .u64 __cudaparm_sample2_b,
    .param .u64 __cudaparm_sample2_c)
{
    .reg .u32 %r<7>;
    .reg .u64 %rd<12>;
    .reg .f64 %fd<8>;
    .reg .pred %p<4>;
    .loc 16 36 0
$LDWbegin_sample2:
    .loc 16 10 0
    cvt.s32.u16 %r1, %ctaid.x;
    cvt.s64.s32 %rd1, %r1;
    mul.wide.s32 %rd2, %r1, 8;
    ld.param.u64 %rd3, [__cudaparm_sample2_c];
    add.u64 %rd4, %rd3, %rd2;
    ld.param.u64 %rd5, [__cudaparm_sample2_b];
    ld.param.u64 %rd6, [__cudaparm_sample2_a];
    add.u64 %rd7, %rd2, %rd5;
    ld.global.f64 %fd1, [%rd7+0];
    add.u64 %rd8, %rd2, %rd6;
    ld.global.f64 %fd2, [%rd8+0];
    add.f64 %fd3, %fd1, %fd2;
    st.global.f64 [%rd4+0], %fd3;
    .loc 16 14 0
    mov.u32 %r2, 0;
    setp.le.s32 %p1, %r1, %r2;
    @%p1 bra $Lt_1_3074;
    mov.s32 %r3, %r1;
    .loc 16 10 0
    ld.param.u64 %rd6, [__cudaparm_sample2_a];
    .loc 16 14 0
    mov.s64 %rd9, %rd6;
    .loc 16 10 0
    ld.param.u64 %rd5, [__cudaparm_sample2_b];
    .loc 16 14 0
    mov.s64 %rd10, %rd5;
    mov.s32 %r4, 0;
    mov.s32 %r5, %r3;
$Lt_1_3586:
    //<loop> Loop body line 14, nesting depth: 1, estimated iterations: unknown
    .loc 16 17 0
    ld.global.f64 %fd4, [%rd9+0];
    ld.global.f64 %fd5, [%rd10+0];
    add.f64 %fd6, %fd4, %fd5;
    st.global.f64 [%rd4+0], %fd6;
    add.s32 %r4, %r4, 1;
    add.u64 %rd10, %rd10, 8;
    add.u64 %rd9, %rd9, 8;
    setp.ne.s32 %p2, %r4, %r1;
    @%p2 bra $Lt_1_3586;
$Lt_1_3074:
    .loc 16 38 0
    exit;
$LDWend_sample2:
} // sample2

.entry pascalTriangle (
    .param .u64 __cudaparm_pascalTriangle_a,
    .param .u64 __cudaparm_pascalTriangle_b)
{
    .reg .u32 %r<4>;
    .reg .u64 %rd<8>;
    .reg .f32 %f<5>;
    .reg .pred %p<3>;
    .loc 16 39 0
$LDWbegin_pascalTriangle:
    .loc 16 40 0
    cvt.s32.u16 %r1, %ctaid.x;
    cvt.s64.s32 %rd1, %r1;
    mul.wide.s32 %rd2, %r1, 4;
    ld.param.u64 %rd3, [__cudaparm_pascalTriangle_a];
    add.u64 %rd4, %rd3, %rd2;
    ld.param.u64 %rd5, [__cudaparm_pascalTriangle_b];
    add.u64 %rd6, %rd5, %rd2;
    ld.global.f32 %f1, [%rd4+0];
    mov.u32 %r2, 0;
    setp.ne.s32 %p1, %r1, %r2;
    @%p1 bra $Lt_2_1282;
    .loc 16 28 0
    st.global.f32 [%rd6+0], %f1;
    bra.uni $Lt_2_1026;
$Lt_2_1282:
    .loc 16 29 0
    ld.global.f32 %f2, [%rd4+-4];
    add.f32 %f3, %f2, %f1;
    st.global.f32 [%rd6+0], %f3;
$Lt_2_1026:
    .loc 16 41 0
    exit;
$LDWend_pascalTriangle:
} // pascalTriangle

.global .u32 error;
The CUDARunTime class manages loading and executing functions on the GPU. GPUExecution is a wrapper class that also includes the function to generate the PTX file. The GPUExecution class uses nvcc.exe with the -ptx switch to generate the PTX file. In the Init function, GPUExecution calls the CUDARunTime method to load the PTX file into the GPU, as shown in Example 9-31.
CUDA runtime and CUDA array class
open System
open System.Text
open System.Collections.Generic
open System.Runtime.InteropServices

type uint = uint32

[<Struct>]
type CUDAModule =
    val Pointer : IntPtr

[<Struct>]
type CUDADevice =
    val Pointer : int

[<Struct>]
type CUDAContext =
    val Pointer : IntPtr

[<Struct>]
type CUDAFunction =
    val Pointer : IntPtr

module CudaDataStructureExtensions =
    let is64Bit = IntPtr.Size = 8

module InteropLibrary2 =
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetv(CUDAFunction hfunc, int offset, IntPtr ptr, uint numbytes)

module InteropLibrary =
    [<DllImport("nvcuda")>]
    extern CUResult cuModuleLoad(CUDAModule& m, string fn)
    [<DllImport("nvcuda")>]
    extern CUResult cuDriverGetVersion(int& driverVersion)
    [<DllImport("nvcuda")>]
    extern CUResult cuInit(uint Flags)
    [<DllImport("nvcuda", EntryPoint = "cuCtxCreate_v2")>]
    extern CUResult cuCtxCreate(CUDAContext& pctx, uint flags, CUDADevice dev)
    [<DllImport("nvcuda")>]
    extern CUResult cuDeviceGet(CUDADevice& device, int ordinal)
    [<DllImport("nvcuda")>]
    extern CUResult cuModuleGetFunction(CUDAFunction& hfunc, CUDAModule hmod, string name)
    [<DllImport("nvcuda")>]
    extern CUResult cuFuncSetBlockShape(CUDAFunction hfunc, int x, int y, int z)
    [<DllImport("nvcuda")>]
    extern CUResult cuLaunch(CUDAFunction f)
    [<DllImport("nvcuda")>]
    extern CUResult cuLaunchGrid(CUDAFunction f, int grid_width, int grid_height)
    [<DllImport("nvcuda", EntryPoint = "cuMemAlloc_v2")>]
    extern CUResult cuMemAlloc(CUDAPointer& dptr, uint bytesize)
    [<DllImport("nvcuda", EntryPoint = "cuMemcpyDtoH_v2")>]
    extern CUResult cuMemcpyDtoH(IntPtr dstHost, CUDAPointer srcDevice, uint ByteCount)
    [<DllImport("nvcuda", EntryPoint = "cuMemcpyHtoD_v2")>]
    extern CUResult cuMemcpyHtoD(CUDAPointer dstDevice, IntPtr srcHost, uint ByteCount)
    [<DllImport("nvcuda", EntryPoint = "cuMemFree_v2")>]
    extern CUResult cuMemFree(CUDAPointer dptr)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSeti(CUDAFunction hfunc, int offset, uint value)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetf(CUDAFunction hfunc, int offset, float32 value)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetv(CUDAFunction hfunc, int offset, int64& value, uint numbytes)
    [<DllImport("nvcuda")>]
    extern CUResult cuParamSetSize(CUDAFunction hfunc, uint numbytes)
    [<DllImport("nvcuda", EntryPoint = "cuMemsetD8_v2")>]
    extern CUResult cuMemsetD8(CUDAPointer dstDevice, byte uc, uint N)
    [<DllImport("nvcuda", EntryPoint = "cuMemsetD16_v2")>]
    extern CUResult cuMemsetD16(CUDAPointer dstDevice, uint16 us, uint N)

type CUDAArray<'T>(cudaPointer:CUDAPointer2<_>, size:uint, runtime:CUDARunTime) =
    let unitSize = uint32(sizeof<'T>)
    interface IDisposable with
        member this.Dispose() = runtime.Free(cudaPointer) |> ignore
    member this.Runtime with get() = runtime
    member this.SizeInByte with get() = size
    member this.Pointer with get() = cudaPointer
    member this.UnitSize with get() = unitSize
    member this.Size with get() = int( this.SizeInByte / this.UnitSize )
    member this.ToArray<'T>() =
        let out = Array.create (int(size)) Unchecked.defaultof<'T>
        this.Runtime.CopyDeviceToHost(this.Pointer, out)

and CUDARunTime(deviceID) =
    let mutable device = CUDADevice()
    let mutable deviceContext = CUDAContext()
    let mutable m = CUDAModule()
    let init() =
        let r = InteropLibrary.cuInit(deviceID)
        let r = InteropLibrary.cuDeviceGet(&device, int(deviceID))
        let r = InteropLibrary.cuCtxCreate(&deviceContext, deviceID, device)
        ()
    do init()
    let align(offset, alignment) = offset + alignment - 1 &&& ~~~(alignment - 1)

    new() = new CUDARunTime(0u)

    interface IDisposable with
        member this.Dispose() = ()

    member this.LoadModule(fn) = (InteropLibrary.cuModuleLoad(&m, fn), m)
    member this.Version
        with get() =
            let mutable a = 0
            (InteropLibrary.cuDriverGetVersion(&a), a)
    member this.Is64Bit with get() = CudaDataStructureExtensions.is64Bit
    member this.GetFunction(fn) =
        let mutable f = CUDAFunction()
        (InteropLibrary.cuModuleGetFunction(&f, m, fn), f)
    member this.ExecuteFunction(fn, x, y) =
        let r, f = this.GetFunction(fn)
        if r = CUResult.Success then InteropLibrary.cuLaunchGrid(f, x, y)
        else r
    member this.ExecuteFunction(fn) =
        let r, f = this.GetFunction(fn)
        if r = CUResult.Success then InteropLibrary.cuLaunch(f)
        else r
    member this.ExecuteFunction(fn, [<ParamArray>] parameters:obj list) =
        let func = this.GetFunctionPointer(fn)
        this.SetParameter(func, parameters)
        let r = InteropLibrary.cuLaunch(func)
        r
    member this.ExecuteFunction(fn, parameters:obj list, x, y) =
        let func = this.GetFunctionPointer(fn)
        let paras =
            parameters
            |> List.map (fun n ->
                match n with
                | :? CUDAPointer2<float> as p -> box(p.Pointer)
                | :? CUDAPointer2<float32> as p -> box(p.Pointer)
                | :? CUDAPointer2<_> as p -> box(p.Pointer)
                | _ -> n)
        this.SetParameter(func, paras)
        InteropLibrary.cuLaunchGrid(func, x, y)
    member private this.GetFunctionPointer(fn) =
        let r, p = this.GetFunction(fn)
        if r = CUResult.Success then p
        else failwith "cannot get function pointer"

    // allocate
    member this.Allocate(bytes:uint) =
        let mutable p = CUDAPointer()
        (InteropLibrary.cuMemAlloc(&p, bytes), CUDAPointer2(p))
    member this.Allocate(array) =
        let size = this.GetSize(array) |> uint32
        this.Allocate(size)
    member this.GetSize(data:'T array) =
        this.MSizeOf(typeof<'T>) * uint32(data.Length)
    member this.GetUnitSize(data:'T array) =
        this.MSizeOf(typeof<'T>)
    member private this.MSizeOf(t:Type) =
        if t = typeof<System.Char> then 2u
        else Marshal.SizeOf(t) |> uint32
    member this.Free(p:CUDAPointer2<_>) : CUResult =
        InteropLibrary.cuMemFree(p.Pointer)
    member this.CopyHostToDevice(data: 'T array) =
        let gCHandle = GCHandle.Alloc(data, GCHandleType.Pinned)
        let size = this.GetSize(data)
        let r, p = this.Allocate(size)
        let r = (InteropLibrary.cuMemcpyHtoD(p.Pointer, gCHandle.AddrOfPinnedObject(), size), p)
        gCHandle.Free()
        r
    member this.CopyDeviceToHost(p:CUDAPointer2<_>, data) =
        let gCHandle = GCHandle.Alloc(data, GCHandleType.Pinned)
        let r = (InteropLibrary.cuMemcpyDtoH(gCHandle.AddrOfPinnedObject(), p.Pointer, this.GetSize(data)), data)
        gCHandle.Free()
        r

    // parameter setting
    member private this.SetParameter<'T>(func, offset, vector:'T) =
        let gCHandle = GCHandle.Alloc(vector, GCHandleType.Pinned)
        let numbytes = uint32(Marshal.SizeOf(vector))
        let r = InteropLibrary2.cuParamSetv(func, offset, gCHandle.AddrOfPinnedObject(), numbytes)
        gCHandle.Free()
        r
    member private this.SetParameterSize(func, size) =
        if InteropLibrary.cuParamSetSize(func, size) = CUResult.Success then ()
        else failwith "set parameter size failed"
    member this.SetParameter(func, parameters) =
        let mutable num = 0
        for para in parameters do
            match box(para) with
            | :? uint32 as n ->
                num <- align(num, 4)
                if InteropLibrary.cuParamSeti(func, num, n) = CUResult.Success then ()
                else failwith "set uint32 failed"
                num <- num + 4
            | :? float32 as f ->
                num <- align(num, 4)
                if InteropLibrary.cuParamSetf(func, num, f) = CUResult.Success then ()
                else failwith "set float failed"
                num <- num + 4
            | :? int64 as i64 ->
                num <- align(num, 8)
                let mutable i64Ref = i64
                if InteropLibrary.cuParamSetv(func, num, &i64Ref, 8u) = CUResult.Success then ()
                else failwith "set int64 failed"
                num <- num + 8
            | :? char as ch ->
                num <- align(num, 2)
                let bytes = Encoding.Unicode.GetBytes([|ch|])
                let v = BitConverter.ToUInt16(bytes, 0)
                if this.SetParameter(func, num, v) = CUResult.Success then ()
                else failwith "set char failed"
                num <- num + 2
            | :? CUDAPointer as devPointer ->
                num <- align(num, devPointer.PointerSize)
                if devPointer.PointerSize = 8 then
                    if this.SetParameter(func, num, uint64(int64(devPointer.Pointer))) = CUResult.Success then ()
                    else failwith "set device pointer failed"
                else
                    if InteropLibrary.cuParamSeti(func, num, uint32(int(devPointer.Pointer))) = CUResult.Success then ()
                    else failwith "set device pointer failed"
                num <- num + devPointer.PointerSize
            | :? CUDAArray<float32> as devArray ->
                let devPointer:CUDAPointer2<_> = devArray.Pointer
                num <- align(num, devPointer.PointerSize)
                if devPointer.PointerSize = 8 then
                    if this.SetParameter(func, num, uint64(int64(devPointer.Pointer.Pointer))) = CUResult.Success then ()
                    else failwith "set device pointer failed"
                else
                    if InteropLibrary.cuParamSeti(func, num, uint32(int(devPointer.Pointer.Pointer))) = CUResult.Success then ()
                    else failwith "set device pointer failed"
                num <- num + devPointer.PointerSize
            | _ when para.GetType().IsValueType ->
                let n = int(this.MSizeOf(para.GetType()))
                num <- align(num, n)
                if this.SetParameter(func, num, box(para)) = CUResult.Success then ()
                else failwith "set no-char object"
                num <- num + n
            | _ -> failwith "not supported"
        this.SetParameterSize(func, uint32(num))
Execution class
namespace FSharp.Execution

open System
open System.Diagnostics
open System.IO
open Microsoft.FSharp.Quotations
open Microsoft.FSharp.Quotations.Patterns
open Microsoft.FSharp.Quotations.DerivedPatterns
open Microsoft.FSharp.Quotations.ExprShape

type BlockID() =
    inherit dim3()

type GPUExecution () as this =
    let runtime = new CUDARunTime()
    let nvcc = "nvcc.exe"
    do this.Init() |> ignore

    interface IDisposable with
        member this.Dispose() = ()

    member this.Runtime with get() = runtime

    // compile the code to PTX file
    member private this.CompileToPTX() : string =
        let fn = @".\temp.cu"
        this.CompileToPTX(fn)

    // compile the file to PTX file and return PTX file name
    member private this.CompileToPTX(fn) : string =
        use p = new Process()
        let para = sprintf "%s -ptx" fn
        p.StartInfo <- ProcessStartInfo(nvcc, para)
        p.StartInfo.UseShellExecute <- false
        p.StartInfo.WindowStyle <- ProcessWindowStyle.Hidden
        p.Start() |> ignore
        p.WaitForExit()
        System.IO.Path.GetFileNameWithoutExtension(fn) + ".ptx"

    // compile to PTX file and load the PTX file to GPU
    member this.Init() =
        let fn = this.CompileToPTX()
        let r, m = runtime.LoadModule(fn)
        if isSuccess r then m
        else failwith "cannot load module"

    member this.Init(fn:string) =
        let fn = this.CompileToPTX(fn)
        let r, m = runtime.LoadModule(fn)
        if isSuccess r then m
        else failwith "cannot load module"

    // execute function loaded on GPU with parameter list
    member this.Execute(fn:string, list:'T array list) =
        let unitSize = (sizeof<'T>) |> uint32
        let size = list.Head.Length |> uint32
        let results =
            list
            |> List.map (fun l -> this.Runtime.CopyHostToDevice(l))
            |> List.map (fun (r, p) -> (r, new CUDAArray<'T>(p, size, this.Runtime)))
        let success = results |> Seq.forall (fun (r, _) -> isSuccess(r))
        if success then
            let pointers = results |> List.map snd
            let head = List.head list
            let result = this.Runtime.ExecuteFunction(fn, pointers |> List.map box, head.Length, 1)
            let out = Array.create head.Length 0.f
            let a = this.Runtime.CopyDeviceToHost(pointers.[0].Pointer, out)
            (result, pointers)
        else
            failwith "copy host failed"

    // copy data from host (CPU) memory to device (GPU) memory
    member this.CopyHostToDevice(data: 'T array) =
        let r, out = this.Runtime.CopyHostToDevice(data)
        if r = CUResult.Success then out
        else failwith "cannot copy host to device"

    // copy data from device (GPU) memory to host (CPU) memory
    member this.CopyDeviceToHost(p:CUDAPointer2<_>, data) =
        let r, out = this.Runtime.CopyDeviceToHost(p, data)
        if r = CUResult.Success then out
        else failwith "cannot copy device to host"

    // convert a list to CUDA array
    member this.ToCUDAArray(l) =
        let r, array = this.Runtime.CopyHostToDevice(l)
        if r = CUResult.Success then array
        else failwith "cannot copy host to device"

    // execute function loaded on GPU with cuda array list
    member this.ExecuteFunction(fn:string, cudaArray:CUDAPointer list) =
        let r = this.Runtime.ExecuteFunction(fn, cudaArray |> List.map box, cudaArray.Length, 1)
        r
With everything ready, you can create a few examples that use the GPU. Example 9-32 compares the CPU and GPU versions of the Pascal Triangle computation. The execution result shows that the GPU finishes the computation faster, even with the additional overhead of copying the data to and from the GPU. If the data set is large, the data transfer time is relatively small and using the GPU is worthwhile. If the data set is small, most of the time is spent copying data to and from the GPU, which can make the GPU version slower.
let len = 1000
let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let pascalTriangle (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) =
    let x = blockIdx.x
    if x = 0 then b.Set(a.[x], x)
    else b.Set(a.[x] + a.[x - 1], x)
    ()

// GPU version
let test3() =
    WriteToFile() // defined in Example 9-28
    let execution = new GPUExecution()
    let m = execution.Init(tempFileName)
    let stopWatch = System.Diagnostics.Stopwatch()
    stopWatch.Reset()
    stopWatch.Start()
    let l0 = Array.zeroCreate len
    let l1 = Array.zeroCreate len
    l0.[0] <- 1.f
    l1.[0] <- 0.f
    let r, p = execution.Runtime.CopyHostToDevice(l0)
    let r, p2 = execution.Runtime.CopyHostToDevice(l1)
    let rs =
        [1..len]
        |> Seq.map (fun i ->
            if i % 2 = 1 then
                let r = execution.Runtime.ExecuteFunction("pascalTriangle", [p; p2], len, 1)
                r
            else
                let r = execution.Runtime.ExecuteFunction("pascalTriangle", [p2; p], len, 1)
                r)
        |> Seq.toList
    let result1, o1 = execution.Runtime.CopyDeviceToHost(p, l0)
    let result2, o2 = execution.Runtime.CopyDeviceToHost(p2, l1)
    stopWatch.Stop()
    printfn "%A" stopWatch.Elapsed
    ()

let computePascal(p:float32 array, p2:float32 array) =
    let len = p.Length
    [0..len-1]
    |> Seq.iter (fun i ->
        if i = 0 then p2.[i] <- 1.f
        else p2.[i] <- p.[i-1] + p.[i])
    ()

// CPU version of Pascal Triangle
let test4() =
    let stopWatch = System.Diagnostics.Stopwatch()
    stopWatch.Reset()
    stopWatch.Start()
    let l0 = Array.zeroCreate len
    let l1 = Array.zeroCreate len
    l0.[0] <- 1.f
    l1.[0] <- 0.f
    [1..len]
    |> Seq.map (fun i ->
        if i % 2 = 1 then
            let r = computePascal(l0, l1)
            r
        else
            let r = computePascal(l1, l0)
            r)
    |> Seq.toList
    |> ignore
    stopWatch.Stop()
    printfn "%A" stopWatch.Elapsed
    ()
Execution result that runs the CPU version followed by the GPU version
00:00:00.1034888
temp.cu
c:mycodecodecenterf#fsharpgpufsharpgpuindebugCUDALibrary.h(56): warning : variable "sizeT" was declared but never referenced
temp.cu
tmpxft_00001558_00000000-3_temp.cudafe1.gpu
tmpxft_00001558_00000000-10_temp.cudafe2.gpu
temp.cu
c:mycodecodecenterf#fsharpgpufsharpgpuindebugCUDALibrary.h(56): warning : variable "sizeT" was declared but never referenced
temp.cu
tmpxft_00000790_00000000-3_temp.cudafe1.gpu
tmpxft_00000790_00000000-10_temp.cudafe2.gpu
00:00:00.0448262
If you’ve ever wondered about a real-world application for the Pascal Triangle code in Example 9-32, you’ll enjoy this section. The Pascal Triangle code shows one way to use the GPU to process a binomial-tree-like structure. The Pascal Triangle is generated as shown in Figure 9-8.
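To make the generation rule concrete, the following CPU-only sketch mirrors the pascalTriangle kernel from Example 9-32: every element of the new row is the sum of the element at the same position and its left neighbor in the previous row, and two buffers are swapped after each step. The names step and pascalRow are hypothetical and are not part of the chapter's library.

```fsharp
// One generation step: dst.[x] depends only on src.[x] and src.[x - 1],
// so every element can be computed independently (one GPU thread each).
let step (src: float32 []) (dst: float32 []) =
    for x in 0 .. src.Length - 1 do
        dst.[x] <- if x = 0 then src.[x] else src.[x] + src.[x - 1]

// Compute row n of the Pascal Triangle by swapping two buffers n times.
let pascalRow (n: int) : float32 [] =
    let mutable src = Array.zeroCreate<float32> (n + 1)
    let mutable dst = Array.zeroCreate<float32> (n + 1)
    src.[0] <- 1.f
    for _i in 1 .. n do
        step src dst
        // The buffer just written becomes the input of the next step.
        let t = src
        src <- dst
        dst <- t
    src

printfn "%A" (pascalRow 4)
```

Because a step never reads an element of the buffer it is writing, the loop body in step could run in any order, which is what allows the GPU version to assign one thread to each element.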
In the financial sector in the United States, the binomial options pricing model (BOPM) uses a binomial tree to value options that are exercisable at any time in a given time interval. The pricing model generates a binomial tree like the one in Figure 9-9.
In the preceding diagram, you can find the following relationship:
You can easily change the Pascal Triangle function to the BOPM, as shown in Example 9-33.
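As a rough illustration of why the two computations are related, here is a minimal CPU-only sketch of a binomial tree for a put option. It assumes the widely used Cox-Ross-Rubinstein parameterization, which may differ from the model used in Example 9-33, and the function name binomialPut and its parameters are hypothetical. The point to notice is the inner loop: each node is computed from two neighboring nodes of the adjacent level, just as in the Pascal Triangle.

```fsharp
// Cox-Ross-Rubinstein binomial tree for a put option (illustrative sketch).
// s0: spot price, k: strike, r: risk-free rate, sigma: volatility,
// t: time to expiry in years, steps: number of tree levels,
// american: allow early exercise at every node.
let binomialPut (s0: float) (k: float) (r: float) (sigma: float) (t: float) (steps: int) (american: bool) : float =
    let dt = t / float steps
    let u = exp (sigma * sqrt dt)          // up factor
    let d = 1.0 / u                        // down factor
    let p = (exp (r * dt) - d) / (u - d)   // risk-neutral up probability
    let disc = exp (-r * dt)               // one-step discount factor
    // Stock price at level i after j up moves.
    let price i j = s0 * (u ** float j) * (d ** float (i - j))
    // Option values at expiry (the last level of the tree).
    let v = Array.init (steps + 1) (fun j -> max (k - price steps j) 0.0)
    // Walk back toward the root. Node j on level i needs only nodes j and j + 1
    // of level i + 1, the same neighbor pattern as the Pascal Triangle.
    for i in steps - 1 .. -1 .. 0 do
        for j in 0 .. i do
            let hold = disc * (p * v.[j + 1] + (1.0 - p) * v.[j])
            let exercise = k - price i j
            v.[j] <- if american then max hold exercise else hold
    v.[0]

let americanPut = binomialPut 100.0 100.0 0.05 0.2 1.0 500 true
let europeanPut = binomialPut 100.0 100.0 0.05 0.2 1.0 500 false
printfn "American put: %.4f, European put: %.4f" americanPut europeanPut
```

The in-place update is safe here only because the loop runs sequentially in ascending j. A GPU version that computes all nodes of a level at the same time would need two buffers that are swapped after every level, as in the Pascal Triangle example.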
Processing an array is one scenario where a GPU can be of help. Because the GPU has dozens of processors, it can process the elements simultaneously. In this section, you need to find the largest element in an array. The function takes the array and the search starting point and returns the largest element from the starting point.
Example 9-34 shows a GPU version of the algorithm. The code is straightforward: each GPU thread takes its block index x as the starting point and iterates from x to the end of the array. The maximum value is stored in the max variable and is then written to element x of the output array. Some functions in Example 9-34 are defined in code shown earlier in the chapter.
let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()
let input = [1.f .. 15.f] |> Array.ofList

[<ReflectedDefinition; GPU>]
let sample4 (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) : unit =
    let x = blockIdx.x
    let mutable max = 0.f
    for i = x to 15 do
        if max < a.[i] then max <- a.[i]
        else ()
    b.Set(max, x)

let WriteToFile2() =
    let a1 = <@@ sample4 @@>
    let b = getCUDACode(a1) // defined in Listing 9-28
    let commonCode = getCommonCode() // defined in Listing 9-28
    System.IO.File.Delete(tempFileName) // defined in Listing 9-28
    System.IO.File.WriteAllText(tempFileName, commonCode + b);

let getMax() =
    let tempFileName = @".\temp.cu"
    WriteToFile2()
    let execution = new GPUExecution()
    let m = execution.Init(tempFileName)
    let output = Array.create input.Length 0.f
    let r, ps = execution.Execute("sample4", [input; output;])
    let results = ps |> List.map (fun p -> p.ToArray() |> snd)
    ()
#include "CUDALibrary.h"
__device__ void ff_0(float* a, float* b) {
    int x;
    x = blockIdx.x;
    float max;
    max = 0.0f;
    for (int i = x; i<15; i++) {
        if ((max) < (a[i])) { max = a[i]; }
        else { }
    };
    b[x] = max;
}
extern "C" __global__ void sample4 (float* a, float* b) {
    ff_0(a, b);
}
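When you experiment with a kernel like this, it helps to have a CPU reference to compare the GPU output against. The following is a minimal sketch of such a reference. The name suffixMax is hypothetical, and the code relies only on the standard F# Array module.

```fsharp
// For every starting index x, return the largest element of a.[x..].
// Each output element is independent of the others, which is why the
// GPU version can assign one thread to each starting index.
let suffixMax (a: float32 []) : float32 [] =
    Array.init a.Length (fun x -> Array.max a.[x..])

let sample = [| 3.f; 1.f; 4.f; 1.f; 5.f; 9.f; 2.f; 6.f |]
printfn "%A" (suffixMax sample)
```

One difference is worth noting: the sample4 kernel starts from max = 0.f, so it reports 0 for a range that contains only negative values, whereas Array.max returns the true maximum.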
Other than using GPUs for array processing, you can also use them in simulations. In this section, you take a look at a small application designed to calculate π using a Monte Carlo simulation. The algorithm counts how many randomly generated points fall in each of two areas: a square and, within it, a circle that touches each edge of the square. Figure 9-10 shows the positions of the circle and the square. If the radius of the circle is $r$, the area of the circle is $\text{circleArea} = \pi r^2$, and the area of the square is $\text{squareArea} = 4r^2$.
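Imagine a large number of random hits spread uniformly over the square. The fraction of hits that land inside the circle approaches $\text{circleArea} / \text{squareArea} = \pi / 4$, so the value of π can be estimated as four times the number of hits in the circle divided by the number of hits in the square.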
The cuRAND library provides functions that generate uniformly distributed random numbers between 0 and 1. Because these numbers cover only the unit square, you can replace the model shown in Figure 9-10 with a quarter of the circle, whose area still involves π. The diagram is shown in Figure 9-11. The formula to compute the value of π from area1 and area2 is listed here:
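For a quarter circle of radius 1, the quarter circle has area π/4 and the enclosing unit square has area 1, so π is approximately four times the fraction of uniformly random points (x, y) that satisfy x² + y² ≤ 1. The following CPU-only sketch shows the idea before any GPU code is involved. It uses System.Random in place of cuRAND, and the name estimatePi and the fixed seed are assumptions made for illustration.

```fsharp
// Estimate pi from n uniformly random points in the unit square.
// A point (x, y) is a hit when it lies inside the quarter circle of radius 1.
let estimatePi (seed: int) (n: int) : float =
    let rng = System.Random(seed)
    let mutable hits = 0
    for _i in 1 .. n do
        let x = rng.NextDouble()
        let y = rng.NextDouble()
        if x * x + y * y <= 1.0 then hits <- hits + 1
    // hits / n approximates (quarter-circle area) / (square area) = pi / 4.
    4.0 * float hits / float n

printfn "%f" (estimatePi 42 200000)
```

The GPU version in this section splits the same work differently: the random numbers are generated by cuRAND, and the distance test runs once per point in its own GPU thread.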
Before you can generate the code, the translator must be able to handle the sqrt function. Example 9-35 shows the code translation function for sqrt, which needs to be added to the translateFromNETOperator function shown in Example 9-20.
The GPU code is defined in the sample3 function in Example 9-36. The WriteToFile2 function is used to get the function quotation and translate it to CUDA code. The computation result is passed back from the GPU to the CPU, allowing filtering and counting to happen on the CPU.
let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let sample3 (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) (c:CUDAPointer2<float32>) =
    let x = blockIdx.x
    c.Set(sqrt(a.[x] * a.[x] + b.[x] * b.[x]), x)
    ()

let WriteToFile2() =
    let a1 = <@@ sample3 @@>
    let b = getCUDACode(a1) // defined in Listing 9-28
    let commonCode = getCommonCode() // defined in Listing 9-28
    System.IO.File.WriteAllText(tempFileName, commonCode + b);

let computePI() =
    let len = 1000
    WriteToFile2()
    let execution = new GPUExecution()
    let r = execution.Init(tempFileName)
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then
        let status, l0 = r.GenerateUniform(g, len)
        let status, l1 = r.GenerateUniform(g, len)
        let output = Array.create len 0.f
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l0), output)
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l1), output)
        let r, l2 = execution.Runtime.CopyHostToDevice(output)
        let r = execution.Runtime.ExecuteFunction("sample3", [l0; l1; l2], len, 1)
        let result, output = execution.Runtime.CopyDeviceToHost(l2, output)
        float ( output |> Seq.filter (fun n-> n<=1.f) |> Seq.length) / float len * 4.0
    else
        failwith "execution error"
This program uses the cuRAND library. Types such as curandStatus are defined in Example 9-9.
The generated code is shown in Example 9-37.
#include "CUDALibrary.h"
__device__ void ff_0(float* a, float* b, float* c) {
    int x;
    x = blockIdx.x;
    c[x] = sqrt(((a[x]) * (a[x])) + ((b[x]) * (b[x])));
    ;
}
extern "C" __global__ void sample3 (float* a, float* b, float* c) {
    ff_0(a, b, c);
}
Generated PTX file after running the code from Example 9-28
.version 1.4
.target sm_10, map_f64_to_f32
// compiled with C:Program FilesNVIDIA GPU Computing ToolkitCUDAv4.2\bin/../ open64/lib//be.exe
// nvopencc 4.1 built on 2012-04-07
//-----------------------------------------------------------
// Compiling C:/Users/taliu/AppData/Local/Temp/tmpxft_00000e3c_00000000-11_temp.cpp3.i (C:/Users/taliu/AppData/Local/Temp/ccBI#.a02796)
//-----------------------------------------------------------
//-----------------------------------------------------------
// Options:
//-----------------------------------------------------------
// Target:ptx, ISA:sm_10, Endian:little, Pointer Size:64
// -O3 (Optimization level)
// -g0 (Debug level)
// -m2 (Report advisories)
//-----------------------------------------------------------
.file 1 "C:/Users/taliu/AppData/Local/Temp/tmpxft_00000e3c_00000000-10_temp. cudafe2.gpu"
.file 2 "c:program files (x86)microsoft visual studio 10.0vcincludecodeanalysissourceannotations.h"
.file 3 "C:Program FilesNVIDIA GPU Computing ToolkitCUDAv4.2in/../include crt/device_runtime.h"
.file 4 "C:Program FilesNVIDIA GPU Computing ToolkitCUDAv4.2in/../include host_defines.h"
.file 5 "C:Program FilesNVIDIA GPU Computing ToolkitCUDAv4.2in/../include builtin_types.h"
.file 6 "c:program files vidia gpu computing toolkitcudav4.2includedevice_ types.h"
.file 7 "c:program files vidia gpu computing toolkitcudav4.2includehost_ defines.h"
.file 8 "c:program files vidia gpu computing toolkitcudav4.2includedriver_ types.h"
.file 9 "c:program files vidia gpu computing toolkitcudav4.2include surface_types.h"
.file 10 "c:program files vidia gpu computing toolkitcudav4.2include texture_types.h"
.file 11 "c:program files vidia gpu computing toolkitcudav4.2include vector_types.h"
.file 12 "c:program files vidia gpu computing toolkitcudav4.2include builtin_types.h"
.file 13 "C:Program FilesNVIDIA GPU Computing ToolkitCUDAv4.2in/../includedevice_launch_parameters.h"
.file 14 "c:program files vidia gpu computing toolkitcudav4.2includecrtstorage_class.h"
.file 15 "temp.cu"
.file 16 "c:mycodecodecenterf#fsharpgpufsharpgpuindebugCUDALibrary.h"
.file 17 "C:Program FilesNVIDIA GPU Computing ToolkitCUDAv4.2in/../includecommon_functions.h"
.file 18 "c:program files vidia gpu computing toolkitcudav4.2includemath_ functions.h"
.file 19 "c:program files vidia gpu computing toolkitcudav4.2includemath_ constants.h"
.file 20 "c:program files vidia gpu computing toolkitcudav4.2includedevice_functions.h"
.file 21 "c:program files vidia gpu computing toolkitcudav4.2includesm_11_ atomic_functions.h"
.file 22 "c:program files vidia gpu computing toolkitcudav4.2includesm_12_ atomic_functions.h"
.file 23 "c:program files vidia gpu computing toolkitcudav4.2includesm_13_ double_functions.h"
.file 24 "c:program files vidia gpu computing toolkitcudav4.2includesm_20_ atomic_functions.h"
.file 25 "c:program files vidia gpu computing toolkitcudav4.2includesm_20_ intrinsics.h"
.file 26 "c:program files vidia gpu computing toolkitcudav4.2includesm_30_ intrinsics.h"
.file 27 "c:program files vidia gpu computing toolkitcudav4.2include surface_functions.h"
.file 28 "c:program files vidia gpu computing toolkitcudav4.2include texture_fetch_functions.h"
.file 29 "c:program files vidia gpu computing toolkitcudav4.2includemath_ functions_dbl_ptx1.h"
.entry sample3 (
    .param .u64 __cudaparm_sample3_a,
    .param .u64 __cudaparm_sample3_b,
    .param .u64 __cudaparm_sample3_c)
{
    .reg .u32 %r<3>;
    .reg .u64 %rd<10>;
    .reg .f32 %f<7>;
    .loc 15 9 0
$LDWbegin_sample3:
    .loc 15 5 0
    cvt.s32.u16 %r1, %ctaid.x;
    cvt.s64.s32 %rd1, %r1;
    mul.wide.s32 %rd2, %r1, 4;
    ld.param.u64 %rd3, [__cudaparm_sample3_b];
    add.u64 %rd4, %rd3, %rd2;
    ld.param.u64 %rd5, [__cudaparm_sample3_a];
    add.u64 %rd6, %rd5, %rd2;
    ld.global.f32 %f1, [%rd4+0];
    ld.global.f32 %f2, [%rd6+0];
    mul.f32 %f3, %f1, %f1;
    mad.f32 %f4, %f2, %f2, %f3;
    sqrt.approx.f32 %f5, %f4;
    ld.param.u64 %rd7, [__cudaparm_sample3_c];
    add.u64 %rd8, %rd7, %rd2;
    st.global.f32 [%rd8+0], %f5;
    .loc 15 11 0
    exit;
$LDWend_sample3:
} // sample3
.global .u32 error;
The filtering function can be moved from the CPU to the GPU, as shown in Example 9-38. This new version performs the comparison inside the sample3 function, which is executed on the GPU. The result is an array of 1s and 0s, and the CPU side can simply add the array elements.
let blockIdx = new BlockDim()
let threadIdx = new ThreadIdx()

[<ReflectedDefinition; GPU>]
let sample3 (a:CUDAPointer2<float32>) (b:CUDAPointer2<float32>) (c:CUDAPointer2<float32>)=
    let x = blockIdx.x
    if sqrt(a.[x] * a.[x] + b.[x] * b.[x]) <= 1.f then c.Set(1.f, x)
    else c.Set(0.f, x)
    ()
computePI function
let computePI() =
    let len = 1000
    WriteToFile2() // defined in Listing 9-34
    let execution = new GPUExecution()
    let r = execution.Init(tempFileName)
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    if status = curandStatus.CURAND_SUCCESS then //defined in Listing 9-9
        let status, l0 = r.GenerateUniform(g, len)
        let status, l1 = r.GenerateUniform(g, len)
        let output = Array.create len 0.f
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l0), output)
        let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l1), output)
        let r, l2 = execution.Runtime.CopyHostToDevice(output)
        let r = execution.Runtime.ExecuteFunction("sample3", [l0; l1; l2], len, 1)
        let result, output = execution.Runtime.CopyDeviceToHost(l2, output)
        float ( output |> Seq.sum) / float len * 4.0
    else
        failwith "execution error"
Now let’s examine the performance when the number of random data points increases. Example 9-39 executes the function computePI with different array lengths. From the execution result, you can tell that the execution time does not increase significantly even when the array length grows from 100 to 10,000. The first run is the exception because it performs a few one-time initialization operations.
let computePI() =
    WriteToFile2() // defined in Listing 9-34
    let execution = new GPUExecution()
    let r = execution.Init(tempFileName)
    let r = CUDARandom()
    let status, g = r.CreateGenerator(CUDARandomRngType.CURAND_PSEUDO_DEFAULT)
    let sw = System.Diagnostics.Stopwatch()
    if status = curandStatus.CURAND_SUCCESS then // curandStatus is defined in Listing 9-9
        let compute(len) =
            sw.Reset()
            sw.Start()
            let status, l0 = r.GenerateUniform(g, len)
            let status, l1 = r.GenerateUniform(g, len)
            let output = Array.create len 0.f
            let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l0), output)
            let _ = execution.Runtime.CopyDeviceToHost(CUDAPointer2<_>(l1), output)
            let r, l2 = execution.Runtime.CopyHostToDevice(output)
            let r = execution.Runtime.ExecuteFunction("sample3", [l0; l1; l2], len, 1)
            let result, output = execution.Runtime.CopyDeviceToHost(l2, output)
            let pi = float ( output |> Seq.sum) / float len * 4.0
            sw.Stop()
            pi, sw.ElapsedTicks
        [50; 100; 500; 1000; 5000; 10000] |> Seq.map compute
    else
        failwith "execution error"
Execution result
(3.04, 142194L)
(2.84, 1643L)
(3.152, 3595L)
(3.144, 3369L)
(3.1632, 4685L)
(3.1588, 3511L)
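The estimates for the shorter arrays are noticeably further from π, which is what you would expect from a Monte Carlo method. Each point lands inside the quarter circle with probability p = π/4, so the standard error of the estimate is 4·sqrt(p(1 − p)/n) for n points. The following sketch compares the estimates reported above with that theoretical figure. The name standardError is hypothetical, and the list simply pairs each array length from Example 9-39 with the corresponding value from the execution result.

```fsharp
// Theoretical standard error of the Monte Carlo estimate of pi with n points.
let standardError (n: int) : float =
    let p = System.Math.PI / 4.0
    4.0 * sqrt (p * (1.0 - p) / float n)

// (array length, estimate) pairs taken from the execution result.
let reported =
    [ 50, 3.04; 100, 2.84; 500, 3.152; 1000, 3.144; 5000, 3.1632; 10000, 3.1588 ]

for (n, estimate) in reported do
    let error = abs (estimate - System.Math.PI)
    printfn "n = %5d  estimate = %.4f  |error| = %.4f  std. error = %.4f" n estimate error (standardError n)
```

Because the standard error shrinks only with the square root of n, you need 100 times as many points to gain one more correct decimal digit. That is why it matters that the GPU's execution time stays nearly flat as the array grows.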
If your requirements involve matrix manipulation or linear algebra, Statfactory’s FCore numerical library (http://www.statfactory.co.uk/) is a good choice. This library provides GPGPU-based matrix, linear algebra, and random number generation functions.
In addition to the Statfactory library, the following websites are good resources for more information:
General-Purpose Computation on Graphics Hardware (http://gpgpu.org/developer/cuda)
CUDA Zone (https://developer.nvidia.com/category/zone/cuda-zone)
OpenCL on NVIDIA (https://developer.nvidia.com/opencl)
The Khronos Group (http://www.khronos.org/opencl/)
For a C# developer, functional programming might not be a familiar concept. Chapters 1 through 3 introduced the imperative and object-oriented (OO) features of F#. If you are planning to use F# in your project, you do not have to dedicate three months to learning a new language from scratch. Instead, you can start by implementing some components with the material presented in Chapters 1 through 3. Chapters 4 through 6 introduced some features unique to F#, such as type providers. Chapters 7 through 9 introduced a few F# applications that demonstrate how to solve complex problems using the features introduced in the earlier chapters.
Functional programming is not a silver bullet. F# is a language that provides both functional and OO features. Knowledge of both areas gives the C# developer a complementary skill set for solving daily programming tasks more efficiently. For example, the LINQ feature in C#, which is based on functional programming concepts, increasingly attracts developer interest and has dramatically changed the way developers write code. A number of problems are solved more naturally by applying these functional programming concepts. If you are curious and motivated to explore a new way to talk to the computer, F# is a great candidate for further exploration.