Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize transformation of firstprivate clause in OpenMP GPU offloading #157

Open
ouankou opened this issue Feb 15, 2023 · 1 comment
Open
Assignees
Labels
bug Something isn't working

Comments

@ouankou
Copy link
Contributor

ouankou commented Feb 15, 2023

In both ROSE and REX, the private/shared/firstprivate clauses in a target region are first converted into map clause. The lowering module only handles map clauses and ignores all the other data clauses.

Given the following code as an example:

for (int j = 0; j < 100; j++)
#pragma omp target teams distribute parallel for map(to: x[0:200]) map(from: y[0:200]) firstprivate(a, n)
  for (int i = 0; i < 200; i++)
    y[i] += a * x[i];

REX will convert it to:

for (int j = 0; j < 100; j++)
#pragma omp target teams distribute parallel for map(to: x[0:200], a, n) map(from: y[0:200]) firstprivate(a, n)
  for (int i = 0; i < 200; i++)
    y[i] += a * x[i];

The firstprivate clause will be used for kernel generation (e.g. private variable initilization) but not for data transferring. However, LLVM transforms the original code without such conversion.

For example code, in LLVM, a and n is not mapped but are directly passed by value. REX creates a mapping between the host and the device. As a result, LLVM performed 200 times of data transfers (100 for x, 100 for y), but REX performed 400 times (100 for x, y, a, and n). It won't cause incorrect computing results but may cause significant performance differences.

In NeoRodinia nn benchmark, in each iteration of a while loop, it launches an omp target region. Because of the issue descriable above, the REX version has 12000 times of data transfers, and the LLVM version only has 4000 times. The data transfer time is 24ms vs. 6.7ms on Carina. When we manually change the mapping type in the REX version from map to to firstprivate, the REX version also shows 4000 times of data transfers, which takes 6.7ms.

Therefore, we need to make significant changes to the way of handling data transfers in REX to address this issue.

@ouankou ouankou added the bug Something isn't working label Feb 15, 2023
@ouankou ouankou self-assigned this Feb 15, 2023
@ouankou
Copy link
Contributor Author

ouankou commented Feb 20, 2023

The transformation for firstprivate clause has some differences compared to map clause.

If variable a is in map(to) clause, while using LLVM runtime API, its base pointer and pointer should be &a and &a.

  ...
  void *__args_base[] = {&a, x};
  void *__args[] = {&a, x + 0};
  int64_t __arg_sizes[] = {((int64_t )(sizeof(int ))), ((int64_t )(sizeof(int ) * 100))};
  int64_t __arg_types[] = {33, 33};
  __tgt_target_teams(__device_id,__host_ptr,__arg_num,__args_base,__args,__arg_sizes,__arg_types,_num_blocks_,_threads_per_block_);

If it's in firstprivate clause, the code should be:

  ...
  void *__args_base[] = {(void *)a, x};
  void *__args[] = {(void *)a, x + 0};
  int64_t __arg_sizes[] = {((int64_t )(sizeof(int ))), ((int64_t )(sizeof(int ) * 100))};
  int64_t __arg_types[] = {288, 33};
  __tgt_target_teams(__device_id,__host_ptr,__arg_num,__args_base,__args,__arg_sizes,__arg_types,_num_blocks_,_threads_per_block_);

Instead of passing the address of a, we pass its values directly as void *. The mapping type is also different. Original 33 (= 1 + 32) means copying from host to device (1) and kernel argument (32). The new mapping type value 288 (= 256 + 32) means passing by value (256) and ernel argument (32).

The code works fine now. However, the compiler will warn that we are casting an integer to void * with a larger size. We can cast the variable as int64_t and then void * to eliminate the warning.

void *__args_base[] = {(void *)((int64_t)a), x + 0};
void *__args[] = {(void *)((int64_t)a), x + 0};

For the kernel, while using map(to), a was passed by the pointer as follows.

__global__ void outlined__kernel__(int *a, int *_dev_x) { ... }

With the change, it was passed by value.

__global__ void outlined__kernel__(int a, int *_dev_x) { ... }

This change was tested with gaussian in NeoRodinia on Carina. Before the change, REX spent 10.5 ms on HtoD data transfer and LLVM spent 2.8 ms. After the change, REX only spent 1.7 ms.
The computing results of the REX and LLVM versions are verified to be the same.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant