Remember that CUME stands for CUda Made Easy and that it is available on SourceForge.
The cume_base.h file introduces a set of macro instructions to simplify the use of the CUDA API for memory allocation.
The cume_new_var and cume_new_array macro instructions help allocate memory on the device:
// we allocate one integer in the device memory
int *gpu_integer;
// cume_new_var(pointer, type)
cume_new_var(gpu_integer, int);
// allocate an array of 100 integers in device memory
int *gpu_array;
// cume_new_array(pointer, type, nbr_items)
cume_new_array(gpu_array, int, 100);
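For comparison, here is a rough sketch of the raw CUDA calls that such macros presumably wrap (error checking omitted; the actual expansion in cume_base.h may differ):
// hedged sketch of the equivalent raw CUDA allocations
int *gpu_integer;
cudaMalloc((void **) &gpu_integer, sizeof(int));
int *gpu_array;
cudaMalloc((void **) &gpu_array, 100 * sizeof(int));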
Another interesting macro instruction is cume_new_array_zero, which has the same behavior as cume_new_array but initializes the memory with zero bytes.
int *gpu_array;
cume_new_array_zero(gpu_array, int, 100);
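In raw CUDA this presumably amounts to a cudaMalloc followed by a cudaMemset, roughly:
int *gpu_array;
cudaMalloc((void **) &gpu_array, 100 * sizeof(int));
// set all 100 * sizeof(int) bytes to zero
cudaMemset(gpu_array, 0, 100 * sizeof(int));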
To free memory allocated on the device, use the cume_free macro instruction:
cume_free(gpu_array);
To transfer data between host and device memory, we use the cume_push and cume_pull macro instructions:
int *cpu_array = new int [100];
cume_new_array(gpu_array, int, 100);
// cume_push(destination in device memory, source in host memory, type, nbr_items)
cume_push(gpu_array, cpu_array, int, 100);
// ... call kernel here
// cume_pull(destination in host memory, source in device memory, type, nbr_items)
cume_pull(cpu_array, gpu_array, int, 100);
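For comparison, a sketch of the raw cudaMemcpy calls that cume_push and cume_pull presumably wrap:
// host to device (cume_push)
cudaMemcpy(gpu_array, cpu_array, 100 * sizeof(int), cudaMemcpyHostToDevice);
// device to host (cume_pull)
cudaMemcpy(cpu_array, gpu_array, 100 * sizeof(int), cudaMemcpyDeviceToHost);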
The CUME Array class defines a generic array that maintains two pointers: one to the data in host memory and one to the data in device memory.
To define an array, use the following code:
Array<int> a(1000);
Array<double> b(500);
This defines a as an array of 1000 integers and b as an array of 500 doubles.
You can modify data using the overloaded operator[] or use the STL algorithms as the Array class defines the begin and end iterators.
a[0] = 0;
fill(a.begin(), a.end(), 1);
Use the push and pull methods of the Array class to transfer data from host to device and from device to host memory, respectively.
The operator& has been overloaded and returns the address of data in the device memory.
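Putting the pieces together, a typical Array workflow might look like the following sketch. The kernel kernel_twice and the launch configuration are illustrative assumptions, not part of CUME; the Array methods are the ones described above:
__global__ void kernel_twice(int *v, int size) {
    int gtid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gtid < size) {
        v[gtid] *= 2;
    }
}

Array<int> a(1000);
fill(a.begin(), a.end(), 1);   // initialize host data with the STL
a.push();                      // copy host data to device memory
// operator& returns the device pointer, so &a can be passed to the kernel
kernel_twice<<<4, 256>>>(&a, a.get_size());   // 4 * 256 = 1024 >= 1000 threads
a.pull();                      // copy results back to host memory
// a[0] is now 2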
The Kernel class is the most interesting class of CUME: it is used to set up the grid and block dimensions and to call the kernel.
First you must define the size of the grid and block:
Kernel k(REQUIRED_THREADS);
k.configure(GRID_TYPE, BLOCK_TYPE, parameters);
where GRID_TYPE is one of the GRID_* constants, BLOCK_TYPE is one of the BLOCK_* constants, and parameters are the corresponding grid and block dimensions.
The constants GRID_1, GRID_X, ... have the following meanings:
The last constant, GRID_GUESS, can be combined with either GRID_X or GRID_XY to let the Kernel class determine the correct grid dimension from the number of required threads and the block size.
The constants BLOCK_1, BLOCK_X, ... have the following meanings:
For example, if we need to work with 1024 threads using a grid of 2 x 16 blocks where each block has 32 threads (2 x 16 x 32 = 1024), we write:
Kernel k(1024);
k.configure(GRID_XY, BLOCK_X, 2, 16, 32);
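Conceptually, and as a hedged guess at what the Kernel class builds internally, this configuration corresponds to the following CUDA launch geometry:
dim3 grid(2, 16);   // GRID_XY: 2 x 16 = 32 blocks
dim3 block(32);     // BLOCK_X: 32 threads per block
// total: 32 blocks * 32 threads = 1024 threads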
If you need 1027 threads and want the Kernel class to determine the grid size for you, knowing that you want a 1D grid of 1D blocks with 32 threads per block, use the following code:
Kernel k(1027);
k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
The grid will then be of type GRID_X with 33 blocks, since 33 blocks of 32 threads each are required to cover 1027 threads.
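The grid size is presumably obtained by a ceiling division of the required number of threads by the block size:
// ceiling division: enough blocks of 32 threads to cover 1027 threads
int blocks = (1027 + 32 - 1) / 32;   // 1058 / 32 = 33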
Once you have defined the size of the grid and block, you can call the kernel using one of the two macro instructions defined in cume_kernel.h: kernel_call (With Resource) or kernel_call_no_resource (No Resource).
The difference between No Resource and With Resource is that, in the latter case, a data structure Kernel::Resource is passed as the first argument of the kernel, so the global thread index can be obtained automatically from the Resource using the get_global_tid() function.
Let's compare the two methods:
In this example (the No Resource variant) you need to use the gtid formula that corresponds to the organization of threads in terms of grid and blocks. If we use a 1D grid composed of 1D blocks, then we use the cume_gtid_x_x() macro instruction:
Kernel k(SIZE);
k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
kernel_call_no_resource(kernel_sum, k, &a, &b, &c, a.get_size());
__global__ void kernel_sum(int *a, int *b, int *c, int size) {
    // **************************************************************
    // get global thread index with cume macro instruction
    // **************************************************************
    int gtid = cume_gtid_x_x();
    if (gtid < size) {
        c[gtid] = a[gtid] + b[gtid];
    }
}
If you later switch to a 1D grid with 2D blocks, you will need to change the line
int gtid = cume_gtid_x_x();
to
int gtid = cume_gtid_x_xy();
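For reference, the standard CUDA index formulas that these two macro instructions would compute look like this (a sketch; the actual definitions in cume_kernel.h may differ):
// 1D grid of 1D blocks (cume_gtid_x_x)
int gtid_x_x = blockIdx.x * blockDim.x + threadIdx.x;
// 1D grid of 2D blocks (cume_gtid_x_xy)
int gtid_x_xy = blockIdx.x * (blockDim.x * blockDim.y)
              + threadIdx.y * blockDim.x + threadIdx.x;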
With the With Resource variant, by using the res->get_global_tid() function you automatically get the right formula.
Kernel k(SIZE);
k.configure(GRID_GUESS | GRID_X, BLOCK_X, 32);
kernel_call(kernel_sum, k, &a, &b, &c, a.get_size());
__global__ void kernel_sum(Kernel::Resource *res, int *a, int *b, int *c, int size) {
    // **************************************************************
    // automatically get the global thread index for this kernel
    // configuration: no need to wonder which formula to use
    // **************************************************************
    int gtid = res->get_global_tid();
    if (gtid < size) {
        c[gtid] = a[gtid] + b[gtid];
    }
}
If you later switch to a 1D grid with 2D blocks, you will not need to modify the code inside the kernel.