Programming Notes for Using NVIDIA® CUDA™ Toolkit

This reference material is intended for users who want to use the computational resources of their NVIDIA GPU board for numeric processing with the IMSL C Numerical Library. Users who do not have an NVIDIA GPU board can ignore this section.

Rationale and General Algorithm

NVIDIA® CUDA™ technology leverages the massively parallel processing power of NVIDIA GPUs. The NVIDIA CUDA Toolkit provides functions that can be used as building blocks for an application taking advantage of this technology. The IMSL C Numerical Library uses some of these functions to improve the overall performance of the library.

No direct use or knowledge of the NVIDIA CUDA Toolkit is required to take advantage of these functions. The program or application is simply rebuilt using environment variables which link with the NVIDIA CUDA Toolkit libraries.

The strategy for using the NVIDIA GPU is given by the following algorithm:

• If an NVIDIA-enabled version of an IMSL function is called and the largest vector or matrix dimension is greater than or equal to a threshold value, then:

    • Copy the required vector and matrix data from the CPU to the GPU.

    • Compute the result on the GPU.

    • Copy the result from the GPU to the CPU.

• Otherwise, use the equivalent IMSL function that does not use the GPU.
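
The strategy above can be sketched in C as follows. This is only an illustration under stated assumptions, not the library's internal code: the names gemv_cpu, gemv_gpu_stand_in, and gemv_dispatch are hypothetical, the GPU path is a stand-in that reuses the CPU kernel so the sketch runs without a GPU, and treating a threshold of zero as "never use the GPU" is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Which path the dispatcher chose (hypothetical type, for illustration). */
typedef enum { PATH_CPU, PATH_GPU } sketch_path;

/* CPU kernel: y = A*x for a row-major m-by-n matrix. */
void gemv_cpu(int m, int n, const float *a, const float *x, float *y)
{
    for (int i = 0; i < m; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += a[(size_t)i * n + j] * x[j];
        y[i] = sum;
    }
}

/* Stand-in for the GPU path. A real version would copy a and x to the
 * GPU, compute there, and copy y back. Here it just calls the CPU
 * kernel so the sketch is runnable on any machine. */
void gemv_gpu_stand_in(int m, int n, const float *a, const float *x, float *y)
{
    gemv_cpu(m, n, a, x, y);
}

/* Threshold dispatch: take the GPU path only when the largest dimension
 * reaches the threshold. Assumption: threshold == 0 disables the GPU. */
sketch_path gemv_dispatch(int m, int n, const float *a, const float *x,
                          float *y, int threshold)
{
    assert(m > 0 && n > 0 && threshold >= 0);
    int max_dim = m > n ? m : n;
    if (threshold > 0 && max_dim >= threshold) {
        gemv_gpu_stand_in(m, n, a, x, y);
        return PATH_GPU;
    }
    gemv_cpu(m, n, a, x, y);
    return PATH_CPU;
}
```

With a threshold of 32, a 2-by-3 product stays on the CPU path; lowering the threshold to 3 routes the same call to the stand-in GPU path.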

Normally, code that calls an IMSL/NVIDIA function does not need to be aware of the copy steps or the threshold size; both are hidden from the user code. Users can, however, change the threshold size. This matters because the GPU version may be slower than the CPU version until array sizes become sufficiently large. Beyond that point the GPU version is typically faster, and its advantage grows as the problem size increases. The default threshold value is 32, which lets the functions perform correctly without initial attention to this value, but it may not be optimal.

The user can change the threshold value for all or specific IMSL/NVIDIA functions by using the IMSL function imsl_cuda_set. The threshold values can be obtained using the IMSL function imsl_cuda_get.
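
To make the idea of per-function thresholds concrete, the following is a minimal model of a threshold table with a setter and a getter. It does not call or reproduce imsl_cuda_set or imsl_cuda_get, whose actual signatures are given in their function descriptions; every name here (sketch_fn, sketch_set_threshold, sketch_set_all_thresholds, sketch_get_threshold) is hypothetical.

```c
#include <assert.h>

/* Hypothetical identifiers for a few GPU-enabled functions. */
typedef enum {
    SKETCH_SGEMV,
    SKETCH_DGEMM,
    SKETCH_FFT_2D,
    SKETCH_COUNT
} sketch_fn;

/* One threshold per function, each starting at the default of 32. */
static int sketch_threshold[SKETCH_COUNT] = {32, 32, 32};

/* Change the threshold for one specific function. */
void sketch_set_threshold(sketch_fn fn, int value)
{
    assert((int)fn >= 0 && (int)fn < SKETCH_COUNT);
    assert(value >= 0);
    sketch_threshold[fn] = value;
}

/* Change the threshold for all functions at once. */
void sketch_set_all_thresholds(int value)
{
    for (int i = 0; i < SKETCH_COUNT; i++)
        sketch_set_threshold((sketch_fn)i, value);
}

/* Read back the current threshold for one function. */
int sketch_get_threshold(sketch_fn fn)
{
    assert((int)fn >= 0 && (int)fn < SKETCH_COUNT);
    return sketch_threshold[fn];
}
```

Keeping one value per function lets a large-matrix routine use a different crossover point from a transform without affecting the others.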

Floating-point results computed on the CPU and on the GPU will likely differ in the low-order bits of each component. These differences arise because the NVIDIA board implements floating-point arithmetic and rounding modes differently from the CPU. This can be an important detail when comparing results for benchmarking or regression testing. Generally, either result should be acceptable for numerical work.
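
When CPU and GPU results are compared for regression purposes, one common way to absorb low-order-bit differences is a tolerance-based comparison instead of exact equality. The sketch below is not part of the IMSL library; nearly_equal_arrays and its tolerance parameters are hypothetical, and suitable tolerance values depend on the problem.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without libm, so the sketch has no link-time dependency. */
static double abs_value(double v)
{
    return v < 0.0 ? -v : v;
}

/* Returns 1 when every pair a[i], b[i] agrees to within
 * max(abs_tol, rel_tol * max(|a[i]|, |b[i]|)), and 0 otherwise.
 * A NaN in either array counts as a mismatch. */
int nearly_equal_arrays(const double *a, const double *b, size_t n,
                        double rel_tol, double abs_tol)
{
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);
    for (size_t i = 0; i < n; i++) {
        double diff = abs_value(a[i] - b[i]);
        double mag_a = abs_value(a[i]);
        double mag_b = abs_value(b[i]);
        double scale = mag_a > mag_b ? mag_a : mag_b;
        double tol = rel_tol * scale;
        if (tol < abs_tol)
            tol = abs_tol;
        if (!(diff <= tol))   /* written this way so NaN fails the check */
            return 0;
    }
    return 1;
}
```

The absolute tolerance acts as a floor for components near zero, where a purely relative test would be too strict.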

Implementation

Basic Linear Algebra Subprograms

The IMSL C Numerical Library uses many Basic Linear Algebra Subprograms (BLAS) throughout the product. These functions are named using IMSL conventions and are used internally; they are not directly accessible to the user.

NVIDIA Corp. implemented certain Level 1, 2 and 3 BLAS in the NVIDIA CUDA Toolkit. The NVIDIA external names and argument protocols are different from those used by the IMSL C Numerical Library. Wrappers have been written to allow the IMSL C Numerical Library to access selected routines in the NVIDIA CUDA Toolkit.

Table 12.1 documents an enumeration that includes those BLAS for which a CUDA Toolkit implementation is provided in the IMSL C Numerical Library. Each enumeration name is the name of the BLAS function prefixed with ‘IMSL_CUDA_’.

Transforms

NVIDIA CUDA Toolkit implementations of complex two-dimensional FFT (Fast Fourier Transform) functions can be accessed when using functions imsl_c_fft_2d_complex and imsl_z_fft_2d_complex. The enumerations defined to enable the user to manipulate the parameters used by these functions are documented in Table 12.1.

Utility Functions

The IMSL C Math Library provides three utility functions that help manage the use of the NVIDIA CUDA Toolkit. These utilities appear in Table 12.2 and are described in more detail in their corresponding function descriptions.

Note: Some NVIDIA hardware does not provide double precision arithmetic. Because the double precision functions are included in the NVIDIA CUDA Toolkit library, those functions will appear to execute correctly even though they do not return correct results. When the IMSL software detects that correct results are not returned, it prints a warning message and uses the IMSL equivalent of the function that does not use the GPU. The user can eliminate this warning by using function imsl_cuda_set to set the threshold value to zero.

Table 12.1. Enumerations of NVIDIA Toolkit-Enabled Functions

IMSL_CUDA_SGEMV	IMSL_CUDA_DGER	IMSL_CUDA_STRSM
IMSL_CUDA_SGER	IMSL_CUDA_DSYR	IMSL_CUDA_DTRSM
IMSL_CUDA_SSYR	IMSL_CUDA_DGEMM	IMSL_CUDA_C_FFT_2D_COMPLEX
IMSL_CUDA_SGEMM	IMSL_CUDA_SGBMV	IMSL_CUDA_Z_FFT_2D_COMPLEX
IMSL_CUDA_DGEMV	IMSL_CUDA_DGBMV

Table 12.2. NVIDIA CUDA Toolkit Utilities

imsl_cuda_get

imsl_cuda_set

imsl_cuda_free

Required NVIDIA Copyright Notice:

Portions of the NVIDIA SGEMM and DGEMM library routines were written by Vasily Volkov and are subject to the Modified Berkeley Software Distribution License as follows:

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. (See CUDA Toolkit 4.0, CUBLAS Library, April, 2011, for these remaining conditions.)
