Omp target code example

Kewei Yan edited this page Apr 4, 2022 · 23 revisions

Omp Target Implementation

Sample code

CPU version

// cnn_cpu.c
...
network->input = data->X;
network->truth = data->y;
...
count = 0;
while (count < max_batches) {                                              // host
    for (int i = 0; i < img.col/batch; i++) {
        forward(network->input, network->truth, network->weights, ...);    // device
        backward(network->output, network->truth, network->weights, ...);  // device
        update(network->weights, ...);                                     // device
    }
    count++;                                                               // host
}
...

GPU offloading version (OMP TARGET)

// cnn_target.c
...
network->input = data->X;
network->truth = data->y;
...
count = 0;
#pragma omp target data map(tofrom: network)   // How to map data held in a structure? Split the data arrays from the other parameters?
{
while (count < max_batches) {
    for (int i = 0; i < img.col/batch; i++) {
#pragma omp target                             // How to handle sync?
{
        forward(network->input, network->truth, network->weights, ...);
        backward(network->output, network->truth, network->weights, ...);
        update(network->weights, ...);
} // target region ends
    } // for loop ends

    count++;
} // while loop ends
} // target data region ends

Within forward/backward/update modules

while(count< max_batches) {
    for (int l = 0; l < network->n; l++) {
#pragma omp parallel for
        for (int i = 0; i < network->batch*network->outputs; i++) {
            for (int j = 0; j < network->channels; j++) {
                // conv...
            }
        }

#pragma omp parallel for
        for (;;) {
            for (;;) {
                for (;;) {
                    // conv...
                }
            }
        }

...

All the data, including the images and the model parameters, live in the structure network. Of its members, network->input and network->truth need to be copied to the device but do not need to be copied back; network->weights and network->output need to be copied to the device and copied back at the end.
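One possible way to express this per-member mapping is sketched below. It is an illustration under assumptions, not the project's code: network_t, its field n, and train_on_device are hypothetical stand-ins, and the loop body is a toy computation in place of forward/backward/update. The idea is to copy each member pointer into a local variable and map it as an array section with the map-type it needs.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for the wiki's network structure. */
typedef struct {
    int n;           /* number of elements in every array (assumption) */
    float *input;    /* only read on the device                        */
    float *truth;    /* only read on the device                        */
    float *weights;  /* updated on the device, needed back on the host */
    float *output;   /* written on the device, needed back on the host */
} network_t;

/* Toy training loop: output = weights * input, then move weights toward truth. */
void train_on_device(network_t *net, int steps)
{
    /* Local copies of the member pointers, so each array can be mapped
       as its own array section instead of mapping the struct pointer. */
    float *input = net->input, *truth = net->truth;
    float *weights = net->weights, *output = net->output;
    int n = net->n;

    #pragma omp target data map(to: input[0:n], truth[0:n]) \
                            map(tofrom: weights[0:n], output[0:n])
    {
        for (int s = 0; s < steps; s++) {
            /* The arrays are already present on the device, so this
               region reuses them without any extra transfer. */
            #pragma omp target teams distribute parallel for
            for (int i = 0; i < n; i++) {
                output[i] = weights[i] * input[i];
                weights[i] += 0.5f * (truth[i] - output[i]) * input[i];
            }
        }
    } /* weights and output are copied back to the host here */
}
```

Compiled without OpenMP support the pragmas are ignored and the same loop runs on the host, which is convenient for checking results before offloading.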

On entry to a target data region, each mapped array that is not already present on the device is allocated there. With map-type to or tofrom, the host values are also copied to the device at that point; with map-type alloc, the memory is only allocated, with no initialization and no data copy. omp target regions enclosed in the target data region find the array already present, so they reuse the device copy without allocating or copying it again. On exit from the target data region, the array on the host is updated if the map-type is from or tofrom, and after that the array on the device is deallocated. An omp target that is not inside a target data region behaves the same way on its own entry and exit.

An array that only needs to be copied to the device can be mapped with map-type to, either on omp target data or on omp target itself. An array that needs to be copied back to the host is handled by omp target data or omp target with map-type from or tofrom, or explicitly by omp target update. In this case, the map-type of omp target data is tofrom, which maps the memory in both directions between host and device. Because the target regions sit inside the target data region, data is not copied back to the host on exit from a target region, even if the map-type on that omp target is tofrom; the copy back happens at the end of the target data region or through omp target update.
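The interaction between the enclosing target data region and omp target update can be sketched as follows. The function name peek_weights and the increment loop are hypothetical and only serve to show where the host copy gets refreshed.

```c
/* Adds 1.0 to every weight `steps` times on the device and returns the
   value of weights[0] as seen by the host after the last step. */
float peek_weights(float *weights, int n, int steps)
{
    float snapshot = 0.0f;

    #pragma omp target data map(tofrom: weights[0:n])
    {
        for (int s = 0; s < steps; s++) {
            /* tofrom here does not trigger a copy back: the array is
               already present because of the enclosing target data. */
            #pragma omp target teams distribute parallel for map(tofrom: weights[0:n])
            for (int i = 0; i < n; i++) {
                weights[i] += 1.0f;
            }

            /* Explicitly refresh the host copy; without this line the host
               array may still hold stale values when a real device is used. */
            #pragma omp target update from(weights[0:n])
            snapshot = weights[0];
        }
    }
    return snapshot;
}
```

If the host only needs the final result, the update directive can be dropped and the copy at the end of the target data region is enough.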

Compile & run

Compile

The Clang version and the LLVM libraries it links against must match. For instance, on cci-carina the library path for llvm-14 is shown below.

clang target.c -L/opt/llvm/llvm-14.x-install/lib/ -lm -fopenmp -fopenmp-targets=nvptx64 -o target

Run

./target
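To confirm that a build really offloads instead of silently falling back to the host, a small check along these lines may help. It is only a sketch: ran_on_device is a hypothetical helper, and the _OPENMP guards exist solely so the file also compiles without -fopenmp. omp_is_initial_device() is the standard OpenMP runtime call that reports whether the calling code runs on the host.

```c
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if a target region executed on a device, 0 if it ran on the host. */
int ran_on_device(void)
{
    int on_host = 1;  /* stays 1 when the pragma is ignored */

    #pragma omp target map(from: on_host)
    {
#ifdef _OPENMP
        on_host = omp_is_initial_device();
#else
        on_host = 1;
#endif
    }
    return !on_host;
}

/* Prints a one-line report about where the target region ran. */
void report_offload(void)
{
    printf("target region ran on %s\n", ran_on_device() ? "device" : "host");
}
```

Setting the environment variable OMP_TARGET_OFFLOAD=MANDATORY before running makes the runtime abort rather than fall back to the host when no device is usable.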

Performance

Without nowait on omp target, the run time is very long: according to nvprof, most of the time is spent on synchronization.
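A minimal sketch of what nowait could look like is given below. The function forward_update_async and its two toy kernels are hypothetical; the depend clauses keep the two target tasks in order, and taskwait is the single point where the host blocks. Whether this removes the synchronization cost seen in nvprof has not been measured here.

```c
/* Launches two dependent kernels asynchronously and waits once at the end.
   act = w * in (forward-like step), then w += act (update-like step). */
void forward_update_async(float *w, float *act, const float *in, int n)
{
    #pragma omp target data map(to: in[0:n]) map(tofrom: w[0:n], act[0:n])
    {
        /* First kernel: the host thread continues without waiting. */
        #pragma omp target teams distribute parallel for nowait depend(out: act[0:n])
        for (int i = 0; i < n; i++) {
            act[i] = w[i] * in[i];
        }

        /* Second kernel: starts only after the first has produced act. */
        #pragma omp target teams distribute parallel for nowait \
                depend(in: act[0:n]) depend(inout: w[0:n])
        for (int i = 0; i < n; i++) {
            w[i] += act[i];
        }

        /* Single synchronization point before the data is copied back. */
        #pragma omp taskwait
    }
}
```

The taskwait has to come before the end of the target data region, otherwise the copy back could run while a kernel is still in flight.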

TODOs

  1. Try to pipeline the data copy of the next batch with the computation of the current batch.
  2. Optimization: derive the batch size and an adapted schedule from the input data.
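The first TODO item could be approached roughly as follows. This is an untested design sketch, not a measured optimization: sum_batches_pipelined is a hypothetical function, the per-batch work is a plain sum standing in for the real computation, and the data is assumed to be one contiguous array of equally sized batches. The copy of batch b+1 is started with target enter data nowait before the kernel for batch b is launched.

```c
#include <stddef.h>

/* Sums all batches on the device while prefetching the following batch.
   data holds n_batches contiguous batches of batch_len floats each. */
float sum_batches_pipelined(float *data, int n_batches, int batch_len)
{
    float total = 0.0f;
    if (n_batches <= 0 || batch_len <= 0) {
        return total;
    }

    /* Blocking copy of the first batch. */
    #pragma omp target enter data map(to: data[0:batch_len])

    for (int b = 0; b < n_batches; b++) {
        float *cur = data + (size_t)b * batch_len;
        float *next = cur + batch_len;

        if (b + 1 < n_batches) {
            /* Start copying batch b+1; the host does not wait for it. */
            #pragma omp target enter data map(to: next[0:batch_len]) \
                    nowait depend(out: next[0])
        }

        float partial = 0.0f;
        /* Kernel for batch b; waits only for the copy of batch b itself. */
        #pragma omp target teams distribute parallel for \
                reduction(+: partial) map(tofrom: partial) depend(in: cur[0])
        for (int i = 0; i < batch_len; i++) {
            partial += cur[i];
        }
        total += partial;

        /* Batch b is no longer needed on the device. */
        #pragma omp target exit data map(delete: cur[0:batch_len])
    }
    return total;
}
```

Whether the copy and the kernel actually overlap depends on the runtime and the device, so a profile with nvprof would be needed to confirm any gain.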