Remember that CUME stands for CUda Made Easy and that it is available on SourceForge.
The cume_base.h file introduces a set of macro instructions to simplify the use of the CUDA API for memory allocation.
The cume_new_var and cume_new_array macro instructions help allocate memory on the device:
// we allocate one integer in the device memory
int *gpu_integer;
// cume_new_var(pointer, type)
cume_new_var(gpu_integer, int);
// allocate an array of 100 integers in device memory
int *gpu_array;
// cume_new_array(pointer, type, nbr_items)
cume_new_array(gpu_array, int, 100);
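For comparison, here is a rough sketch of the raw CUDA calls that such macros presumably wrap (error checking omitted; the actual expansion in cume_base.h may differ):
// hedged sketch of the equivalent raw CUDA allocations
int *gpu_integer;
cudaMalloc((void **) &gpu_integer, sizeof(int));
int *gpu_array;
cudaMalloc((void **) &gpu_array, 100 * sizeof(int));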
Another interesting macro instruction is cume_new_array_zero, which has the same behavior as cume_new_array but initializes the memory with zero bytes.
int *gpu_array;
cume_new_array_zero(gpu_array, int, 100);
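In raw CUDA this presumably amounts to a cudaMalloc followed by a cudaMemset, roughly:
int *gpu_array;
cudaMalloc((void **) &gpu_array, 100 * sizeof(int));
// set all 100 * sizeof(int) bytes to zero
cudaMemset(gpu_array, 0, 100 * sizeof(int));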
To free memory allocated on the device, use the cume_free macro instruction:
cume_free(gpu_array);
To transfer data between host and device memory, we use the cume_push and cume_pull macro instructions:
int *cpu_array = new int [100];
cume_new_array(gpu_array, int, 100);
// cume_push(destination in device memory, source in host memory, type, nbr_items)
cume_push(gpu_array, cpu_array, int, 100);
// ... call kernel here
// cume_pull(destination in host memory, source in device memory, type, nbr_items)
cume_pull(cpu_array, gpu_array, int, 100);
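For comparison, a sketch of the raw cudaMemcpy calls that cume_push and cume_pull presumably wrap:
// host to device (cume_push)
cudaMemcpy(gpu_array, cpu_array, 100 * sizeof(int), cudaMemcpyHostToDevice);
// device to host (cume_pull)
cudaMemcpy(cpu_array, gpu_array, 100 * sizeof(int), cudaMemcpyDeviceToHost);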
The CUME Array class defines a generic array that maintains two pointers: one to the data in host memory and one to the data in device memory.
To define an array, use the following code:
Array<int> a(1000);
Array<double> b(500);
This defines a as an array of 1000 integers and b as an array of 500 doubles.
You can modify data using the overloaded operator[] or use the STL algorithms as the Array class defines the begin and end iterators.
a[0] = 0;
fill(a.begin(), a.end(), 1);
Use the push and pull methods of the Array class to transfer data from host to device and from device to host memory, respectively.
The operator& has been overloaded and returns the address of data in the device memory.
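Putting the pieces together, a typical Array workflow might look like the following sketch. The kernel kernel_twice and the launch configuration are illustrative assumptions, not part of CUME; the Array methods are the ones described above:
__global__ void kernel_twice(int *v, int size) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gtid < size) {
        v[gtid] *= 2;
    }
}

Array<int> a(1000);
fill(a.begin(), a.end(), 1);   // initialize host data with the STL
a.push();                      // copy host data to device memory
// operator& returns the device pointer, so &a can be passed to the kernel
kernel_twice<<<4, 256>>>(&a, a.get_size());   // 4 * 256 = 1024 >= 1000 threads
a.pull();                      // copy results back to host memory
// a[0] is now 2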
The Kernel class is the most interesting class of CUME: it is used to set up the grid and block dimensions and to call the kernel.
First you must define the size of the grid and block:
Kernel k(REQUIRED_THREADS);
k.configure(GRID_TYPE, BLOCK_TYPE, parameters);
where GRID_TYPE is one of the GRID_* constants, BLOCK_TYPE is one of the BLOCK_* constants, and parameters are the corresponding grid and block dimensions.
The constants GRID_1, GRID_X, ... have the following meanings:
The last constant, GRID_GUESS, can be combined with either GRID_X or GRID_XY to let the Kernel class determine the correct grid dimension from the number of required threads and the block size.
The constants BLOCK_1, BLOCK_X, ... have the following meanings:
For example, if we need to work with 1024 threads using a grid of 2 x 16 blocks where each block has 32 threads (2 x 16 x 32 = 1024), we write:
Kernel k(1024);
k.configure(GRID_XY, BLOCK_X, 2, 16, 32);
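Conceptually, and as a hedged guess at what the Kernel class builds internally, this configuration corresponds to the following CUDA launch geometry:
dim3 grid(2, 16);   // GRID_XY: 2 x 16 = 32 blocks
dim3 block(32);     // BLOCK_X: 32 threads per block
// total: 32 blocks * 32 threads = 1024 threads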
If you need 1027 threads and want the Kernel class to determine the grid size for you, knowing that you want a 1D grid of 1D blocks with 32 threads per block, use the following code:
Kernel k(1027);
k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
The grid will then be of type GRID_X with 33 blocks, since 33 blocks of 32 threads each are required to cover 1027 threads.
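The grid size is presumably obtained by a ceiling division of the required number of threads by the block size:
// ceiling division: enough blocks of 32 threads to cover 1027 threads
int blocks = (1027 + 32 - 1) / 32;   // 1058 / 32 = 33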
Once you have defined the size of the grid and block, you can call the kernel using one of the two macro instructions defined in cume_kernel.h: kernel_call (With Resource) or kernel_call_no_resource (No Resource).
The difference between No Resource and With Resource is that, in the latter case, a data structure Kernel::Resource is passed as the first argument of the kernel, so the global thread index can be obtained automatically from the Resource using the get_global_tid() function.
Let's compare the two methods:
In this example (the No Resource variant) you need to use the gtid formula that corresponds to the organization of threads in terms of grid and blocks. If we use a 1D grid composed of 1D blocks, then we use the cume_gtid_x_x() macro instruction:
Kernel k(SIZE);
k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
kernel_call_no_resource(kernel_sum, k, &a, &b, &c, a.get_size());
__global__ void kernel_sum(int *a, int *b, int *c, int size) {
    // **************************************************************
    // get global thread index with cume macro instruction
    // **************************************************************
    int gtid = cume_gtid_x_x();
    if (gtid < size) {
        c[gtid] = a[gtid] + b[gtid];
    }
}
If you later switch to a 1D grid with 2D blocks, you will need to change the line
int gtid = cume_gtid_x_x();
to
int gtid = cume_gtid_x_xy();
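For reference, the standard CUDA index formulas that these two macro instructions would compute look like this (a sketch; the actual definitions in cume_kernel.h may differ):
// 1D grid of 1D blocks (cume_gtid_x_x)
int gtid_x_x = blockIdx.x * blockDim.x + threadIdx.x;
// 1D grid of 2D blocks (cume_gtid_x_xy)
int gtid_x_xy = blockIdx.x * (blockDim.x * blockDim.y)
              + threadIdx.y * blockDim.x + threadIdx.x;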
With the With Resource variant, by using the res->get_global_tid() function you automatically get the right formula.
Kernel k(SIZE);
k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
kernel_call(kernel_sum, k, &a, &b, &c, a.get_size());
__global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
    // **************************************************************
    // automatically get the global thread index for this kernel
    // configuration: no need to wonder which formula to use
    // **************************************************************
    int gtid = res->get_global_tid();
    if (gtid < size) {
        c[gtid] = a[gtid] + b[gtid];
    }
}
If you later switch to a 1D grid with 2D blocks, you will not need to modify the code inside the kernel.