Optimize transformation of firstprivate clause in OpenMP GPU offloading #157

ouankou · 2023-02-15T20:46:53Z

In both ROSE and REX, the private/shared/firstprivate clauses in a target region are first converted into map clause. The lowering module only handles map clauses and ignores all the other data clauses.

Given the following code as an example:

for (int j = 0; j < 100; j++)
#pragma omp target teams distribute parallel for map(to: x[0:200]) map(from: y[0:200]) firstprivate(a, n)
  for (int i = 0; i < 200; i++)
    y[i] += a * x[i];

REX will convert it to:

for (int j = 0; j < 100; j++)
#pragma omp target teams distribute parallel for map(to: x[0:200], a, n) map(from: y[0:200]) firstprivate(a, n)
  for (int i = 0; i < 200; i++)
    y[i] += a * x[i];

The firstprivate clause will be used for kernel generation (e.g. private variable initilization) but not for data transferring. However, LLVM transforms the original code without such conversion.

For example code, in LLVM, a and n is not mapped but are directly passed by value. REX creates a mapping between the host and the device. As a result, LLVM performed 200 times of data transfers (100 for x, 100 for y), but REX performed 400 times (100 for x, y, a, and n). It won't cause incorrect computing results but may cause significant performance differences.

In NeoRodinia nn benchmark, in each iteration of a while loop, it launches an omp target region. Because of the issue descriable above, the REX version has 12000 times of data transfers, and the LLVM version only has 4000 times. The data transfer time is 24ms vs. 6.7ms on Carina. When we manually change the mapping type in the REX version from map to to firstprivate, the REX version also shows 4000 times of data transfers, which takes 6.7ms.

Therefore, we need to make significant changes to the way of handling data transfers in REX to address this issue.

The text was updated successfully, but these errors were encountered:

ouankou · 2023-02-20T14:35:50Z

The transformation for firstprivate clause has some differences compared to map clause.

If variable a is in map(to) clause, while using LLVM runtime API, its base pointer and pointer should be &a and &a.

  ...
  void *__args_base[] = {&a, x};
  void *__args[] = {&a, x + 0};
  int64_t __arg_sizes[] = {((int64_t )(sizeof(int ))), ((int64_t )(sizeof(int ) * 100))};
  int64_t __arg_types[] = {33, 33};
  __tgt_target_teams(__device_id,__host_ptr,__arg_num,__args_base,__args,__arg_sizes,__arg_types,_num_blocks_,_threads_per_block_);

If it's in firstprivate clause, the code should be:

  ...
  void *__args_base[] = {(void *)a, x};
  void *__args[] = {(void *)a, x + 0};
  int64_t __arg_sizes[] = {((int64_t )(sizeof(int ))), ((int64_t )(sizeof(int ) * 100))};
  int64_t __arg_types[] = {288, 33};
  __tgt_target_teams(__device_id,__host_ptr,__arg_num,__args_base,__args,__arg_sizes,__arg_types,_num_blocks_,_threads_per_block_);

Instead of passing the address of a, we pass its values directly as void *. The mapping type is also different. Original 33 (= 1 + 32) means copying from host to device (1) and kernel argument (32). The new mapping type value 288 (= 256 + 32) means passing by value (256) and ernel argument (32).

The code works fine now. However, the compiler will warn that we are casting an integer to void * with a larger size. We can cast the variable as int64_t and then void * to eliminate the warning.

void *__args_base[] = {(void *)((int64_t)a), x + 0};
void *__args[] = {(void *)((int64_t)a), x + 0};

For the kernel, while using map(to), a was passed by the pointer as follows.

__global__ void outlined__kernel__(int *a, int *_dev_x) { ... }

With the change, it was passed by value.

__global__ void outlined__kernel__(int a, int *_dev_x) { ... }

This change was tested with gaussian in NeoRodinia on Carina. Before the change, REX spent 10.5 ms on HtoD data transfer and LLVM spent 2.8 ms. After the change, REX only spent 1.7 ms.
The computing results of the REX and LLVM versions are verified to be the same.

ouankou added the bug Something isn't working label Feb 15, 2023

ouankou self-assigned this Feb 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Optimize transformation of firstprivate clause in OpenMP GPU offloading #157

Optimize transformation of firstprivate clause in OpenMP GPU offloading #157

ouankou commented Feb 15, 2023 •

edited

Loading

ouankou commented Feb 20, 2023 •

edited

Loading

Optimize transformation of firstprivate clause in OpenMP GPU offloading #157

Optimize transformation of firstprivate clause in OpenMP GPU offloading #157

Comments

ouankou commented Feb 15, 2023 • edited Loading

ouankou commented Feb 20, 2023 • edited Loading

ouankou commented Feb 15, 2023 •

edited

Loading

ouankou commented Feb 20, 2023 •

edited

Loading