Programming Notes for Using NVIDIA® CUDA™ Toolkit
This reference material is intended for users who want to use the computational resources of their NVIDIA GPU board for numeric processing with the IMSL C Numerical Library. Users who do not have an NVIDIA GPU board can ignore this section.
Rationale and General Algorithm
NVIDIA® CUDA™ technology leverages the massively parallel processing power of NVIDIA GPUs. The NVIDIA CUDA Toolkit provides functions that can be used as building blocks for an application taking advantage of this technology. The IMSL C Numerical Library uses some of these functions to improve the overall performance of the library.
No direct use or knowledge of the NVIDIA CUDA Toolkit is required to take advantage of these functions. The program or application is simply rebuilt using the environment variables that link it with the NVIDIA CUDA Toolkit libraries.
The strategy for using the NVIDIA GPU is given by the following algorithm:
If an NVIDIA-enabled version of an IMSL function is called and the maximum of the vector or matrix dimensions is greater than or equal to a threshold value, then
    Copy the required vector and matrix data from the CPU to the GPU
    Compute the result on the GPU
    Copy the result from the GPU to the CPU
Else, use the equivalent IMSL version of the function that does not use the GPU.
Normally, code that calls an IMSL/NVIDIA function does not need to be aware of the copy steps or the threshold size; both are hidden from the user code. Users can, however, change the threshold size. This matters because the GPU version may be slower than the CPU version until array sizes become sufficiently large. Beyond that point the GPU version is typically faster, and its advantage grows as the problem size increases. The default threshold value is 32, which may not be optimal but allows the functions to perform correctly without initial attention to this value.
The user can change the threshold value for all or specific IMSL/NVIDIA functions by using the IMSL function imsl_cuda_set. The threshold values can be obtained using the IMSL function imsl_cuda_get.
The floating-point results obtained on the CPU and on the GPU will likely differ in the low-order bits of each component. These differences arise because the floating-point arithmetic strategies and rounding modes implemented on the NVIDIA board are not equivalent to those of the CPU. This can be an important detail when comparing results for benchmarking or regression testing. Generally, either result should be acceptable for numerical work.