CUME Tutorial

Remember that CUME stands for CUda Made Easy and that you can find it on SourceForge.

1. How to use CUME functions

The cume_base.h file introduces a set of macro instructions that simplify the use of the CUDA API for memory management.

a) memory allocation

The cume_new_var and cume_new_array macro instructions help allocate memory on the device:


    // we allocate one integer in the device memory
    int *gpu_integer;
    
    // cume_new_var(pointer, type) 
    cume_new_var(gpu_integer, int);

    // allocate an array of 100 integers in device memory
    int *gpu_array;

    // cume_new_array(pointer, type, nbr_items)
    cume_new_array(gpu_array, int, 100);
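
These macros presumably wrap cudaMalloc together with error checking; in plain CUDA, the equivalent calls would look something like this:

    // plain CUDA equivalents (assumed; without CUME's error checking)
    cudaMalloc((void **) &gpu_integer, sizeof(int));
    cudaMalloc((void **) &gpu_array, 100 * sizeof(int));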
    

Another interesting macro instruction is cume_new_array_zero, which has the same behavior as cume_new_array but also initializes the memory to zero.


    int *gpu_array;
    cume_new_array_zero(gpu_array, int, 100);
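
In plain CUDA this presumably corresponds to a cudaMalloc followed by a cudaMemset:

    // plain CUDA equivalent (assumed)
    cudaMalloc((void **) &gpu_array, 100 * sizeof(int));
    cudaMemset(gpu_array, 0, 100 * sizeof(int));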
    

b) memory deallocation

Use the cume_free macro instruction:


    cume_free(gpu_array);
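
This presumably wraps cudaFree together with error checking; the plain CUDA equivalent is:

    // plain CUDA equivalent (assumed)
    cudaFree(gpu_array);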

c) memory transfer

We use two macro instructions:

  1. cume_push to transfer data from host to device memory,
  2. cume_pull to transfer data from device to host memory.

    int *cpu_array = new int [100];
    int *gpu_array;

    cume_new_array(gpu_array, int, 100);

    // cume_push(destination in device memory, source in host memory, type, nbr_items)
    cume_push(gpu_array, cpu_array, int, 100);

    // ... call kernel ...

    // cume_pull(destination in host memory, source in device memory, type, nbr_items)
    cume_pull(cpu_array, gpu_array, int, 100);
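
These two macro instructions presumably wrap cudaMemcpy; the plain CUDA equivalents would be:

    // plain CUDA equivalents (assumed)
    cudaMemcpy(gpu_array, cpu_array, 100 * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(cpu_array, gpu_array, 100 * sizeof(int), cudaMemcpyDeviceToHost);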

2. How to use the CUME Array class to handle arrays

a) definition of an array

The CUME Array class defines a generic array that maintains two pointers: one to the data in host memory and one to the data in device memory.

To define an array, use the following code:

 
    Array<int> a(1000);
    Array<double> b(500);

This will define a as an array of 1000 integers and b as an array of 500 doubles.

b) modification of array data

You can modify the data using the overloaded operator[] or use the STL algorithms, since the Array class defines begin and end iterators.

 
    a[0] = 0;
    std::fill(a.begin(), a.end(), 1);   // requires <algorithm>
    

c) transfer of data between host and device memory

Use the push and pull methods of the Array class to transfer data from host to device memory and from device to host memory, respectively.
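
A minimal usage sketch, assuming push and pull take no arguments:

    Array<int> a(1000);
    std::fill(a.begin(), a.end(), 1);   // modify the data in host memory
    a.push();                           // copy the data from host to device memory
    // ... call a kernel that works on the device copy ...
    a.pull();                           // copy the data back from device to host memory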

d) redefinition of operator &

The operator& has been overloaded and returns the address of the data in device memory, which makes it possible to pass an Array directly to a kernel, as in the examples of the next section.
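
For example, the following sketch stores the device address of the array data:

    Array<int> a(1000);
    int *device_ptr = &a;   // device address of the data, not the address of the Array object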

3. How to use the Kernel class to call a kernel

This is the most interesting class of CUME: it is used to set up the grid and block dimensions and to call the kernel.

First you must define the size of the grid and block:

 
    Kernel k(REQUIRED_THREADS);
    k.configure(GRID_TYPE, BLOCK_TYPE, parameters);

where REQUIRED_THREADS is the total number of threads you need, GRID_TYPE is one of the grid constants (GRID_1, GRID_X, ...) and BLOCK_TYPE is one of the block constants (BLOCK_1, BLOCK_X, ...).

The grid constants select the dimensionality of the grid and the block constants select the dimensionality of each block. The special constant GRID_GUESS can be combined with either GRID_X or GRID_XY to let the Kernel class determine the correct grid dimension from the number of required threads and the size of the blocks.

For example, if we need to work with 1024 threads organized as a grid of 2 x 16 blocks where each block has 32 threads (2 x 16 x 32 = 1024), then we will write:


    Kernel k(1024);
    k.configure(GRID_XY, BLOCK_X, 2, 16, 32);
    

If you need 1027 threads and want the Kernel class to determine the size of the grid for you, knowing only that you want a 1D grid of 1D blocks with 32 threads each, then use the following code:

 
    Kernel k(1027);
    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
    

The grid will then be defined as a 1D grid (GRID_X) of 33 blocks, since 33 blocks of 32 threads give 1056 >= 1027 threads.

4. How to get the global thread index inside the kernel

Once you have defined the size of the grid and blocks, you can call the kernel using one of the two macro instructions defined in cume_kernel.h: kernel_call_no_resource and kernel_call.

The difference between the No Resource and With Resource variants is that, with the latter, a Kernel::Resource data structure is passed as the first argument of the kernel, and the right global thread index formula can then be obtained automatically from the Resource using the get_global_tid() method.

Let's compare the two methods:

a) call with no Resource

In this example you need to use the gtid (global thread index) formula that corresponds to the organization of threads in terms of grid and blocks. If we use a 1D grid composed of 1D blocks, then we use the cume_gtid_x_x() macro instruction:

 
    Kernel k(SIZE);
    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
    kernel_call_no_resource(kernel_sum, k, &a, &b, &c, a.get_size());


    __global__ void kernel_sum(int *a, int *b, int *c, int size) {
        // **************************************************************
        // get global thread index with cume macro instruction
        // **************************************************************
        int gtid = cume_gtid_x_x();     
        
        if (gtid < size) {
                c[gtid] = a[gtid] + b[gtid];
        }
    }
    

If you later want to switch to a 1D grid with 2D blocks, then you will need to replace the line


    int gtid = cume_gtid_x_x();

with


    int gtid = cume_gtid_x_xy();
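
For reference, these macro instructions presumably expand to the standard CUDA index formulas, something like:

    // assumed definitions (standard index formulas)
    // 1D grid of 1D blocks
    #define cume_gtid_x_x()  (blockIdx.x * blockDim.x + threadIdx.x)
    // 1D grid of 2D blocks
    #define cume_gtid_x_xy() (blockIdx.x * blockDim.x * blockDim.y + \
                              threadIdx.y * blockDim.x + threadIdx.x)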

b) call with Resource

In this case, the res->get_global_tid() method automatically gives you the right formula.


    Kernel k(SIZE);
    k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
    kernel_call(kernel_sum, k, &a, &b, &c, a.get_size());

    
    __global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
        // **************************************************************
        // automatically get global thread index in function of kernel
        // type: no need to wonder which formula to use
        // **************************************************************
        int gtid = res->get_global_tid();
        
        if (gtid < size) {
                c[gtid] = a[gtid] + b[gtid];
        }
    }


If you later want to switch to a 1D grid with 2D blocks, you won't need to modify the code inside the kernel.