
OpenCL sin Performance


Similar to the CPU test, we will first measure the performance while involving as little memory access as possible. Due to the complexity of the OpenCL driver, we will also measure some of the overhead related to scheduling a job from the CPU, waiting for a previous dependent job to finish, and running a dummy kernel. These numbers should give us an idea of how much work we need to schedule to avoid being bottlenecked by this overhead. The roundtrip time for scheduling a single dummy kernel will also give us an upper bound on the overhead latency we should expect, though we still need to measure that in a more realistic setting later.
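Conceptually, the per-job overhead measurement boils down to putting a host-side timer around enqueueing a dummy kernel and waiting for it to finish. The snippet below is only a minimal sketch of that idea, not the actual test code: the helper name dummy_roundtrip_ns is made up, and the queue, kernel and global size are assumed to have been created elsewhere.

#include <CL/cl.h>
#include <chrono>
#include <cstdint>

// Rough roundtrip measurement: enqueue one dummy kernel and block until it
// finishes.  Returns the wall-clock time in nanoseconds as seen by the host.
static int64_t dummy_roundtrip_ns(cl_command_queue queue, cl_kernel kernel,
                                  size_t gsize)
{
    auto t0 = std::chrono::steady_clock::now();
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gsize,
                           nullptr, 0, nullptr, nullptr);
    clFinish(queue); // wait for the kernel (and anything queued before it)
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}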

On the CPU, we forced the computation to happen without accessing memory by using asm volatile to create a dummy use of the result. AFAICT, OpenCL does not have anything as direct as this, so we need to find another way. In this test, we do it by storing the result to memory behind a branch that will never be taken at runtime. We also make the condition of the store depend on the calculated value so that the compiler cannot move the computation into the branch. More specifically, we have something similar to

float res = amp * sin(...);
if (res > threshold) {
    // store `res` to memory
}

As long as we pass in an amp that is significantly smaller than threshold, the branch will never be taken and the compiler will generally not optimize this case out.
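For illustration, a kernel using this trick could look roughly like the following. This is only a sketch, not the actual test kernel: the kernel name, the argument list and the phase computed from the global ID are made up for the example.

__kernel void dry_sin(float amp, float threshold, float freq,
                      __global float *out)
{
    size_t id = get_global_id(0);
    // The phase is arbitrary; it only needs to depend on the work item
    // so the compiler cannot fold the whole computation into a constant.
    float res = amp * sin(freq * (float)id);
    if (res > threshold) {
        // Never taken as long as amp is well below threshold, so no actual
        // memory traffic happens, but the compiler must still compute `res`.
        out[id] = res;
    }
}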

As mentioned in the accuracy test, we will test both sin and native_sin. For each test, including the dummy one mentioned above, we will vary the global size we run each kernel on and the number of repetitions we schedule in the command queue. We do this for both an in-order and an out-of-order command queue. For the computation tests (i.e. not the dummy one), we also vary the number of times we evaluate the sin/native_sin function inside the kernel to minimize the effect of the kernel overhead on the measurement.
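The in-order/out-of-order distinction is simply a property of the command queue. As a rough sketch (not the actual test code; make_queue is a hypothetical helper and ctx/dev are assumed to have been obtained elsewhere), the two kinds of queues can be created like this:

#include <CL/cl.h>

// Create a profiling-enabled command queue; `ooo` selects out-of-order
// execution.
static cl_command_queue make_queue(cl_context ctx, cl_device_id dev, bool ooo)
{
    cl_command_queue_properties flags = CL_QUEUE_PROFILING_ENABLE;
    if (ooo)
        flags |= CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE;
    cl_queue_properties props[] = {CL_QUEUE_PROPERTIES, flags, 0};
    cl_int err;
    return clCreateCommandQueueWithProperties(ctx, dev, props, &err);
}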

The full code for the test can be found in opencl-dry-compute.cpp and the results can be found under data/cl-dry-compute.

As mentioned before, we have three different platforms to test.

  1. Intel OpenCL CPU runtime

    1. i7-6700K

      [Performance plot]

    2. i9-10885H

      [Performance plot]

    Different lines correspond to different global work sizes. The plots in the left column use the in-order queue whereas those in the right column use the out-of-order queue.

    There does not seem to be a significant difference between the in-order and the out-of-order queue.

    For the measurement of the dummy kernel, the time plotted is the average time per processed element (i.e. the total time divided by the cumulative global size). It seems that each job enqueued takes at least about 200 us, and there may be a comparable amount of overhead for the initial command. (Note that the blue solid line has a global size of 2, so the total time it takes for that line per run is twice the plotted value.) On the other hand, the minimum time we've measured per element is between 20 and 30 ps, which seems to be mostly the minimum time per run spread over a large global size. There does not seem to be any measurable overhead related to running the kernel on each element.

    Given the huge overhead, the measurement of the time for fewer than about 0.5 to 1 M evaluations of sin/native_sin is dominated by the overhead. With more evaluations per run, the time per evaluation is about 200 ps on the i7-6700K and about 100 ps on the i9-10885H, most likely due to the difference in core count. The native_sin version is slightly faster than the sin version, but not by much. This is about 4 times slower than our CPU implementation and is also much less accurate, so there's really no reason for us to use this...

  2. Intel Compute OpenCL runtime

    1. UHD 530

      [Performance plot]

    2. UHD 640

      [Performance plot]

    The Intel GPU appears to behave much better than their CPU, which isn't something I'd expect from Intel...

    From the dummy run, the overhead per run seems to be much lower, at about 12 us on the UHD 530 and 15 us on the UHD 640 for the out-of-order queue, with about 60 us and 50 us of initial overhead respectively. The in-order queue seems to have a similar initial overhead but a slightly higher overhead per run. Based on the timing for the run with 8 M elements, there also seems to be a fixed overhead of about 100 ps per element, which needs to be overcome by increasing the amount of computation inside each kernel (see the rough cost model sketched after this list).

    Due to the per-element overhead, the timing is not flat as a function of evaluations per kernel even for large global sizes. This overhead becomes insignificant at about 16 evaluations of the function per kernel for sin, but remains significant for native_sin until about 32 or 64 evaluations per kernel due to the higher performance of native_sin. On both devices, the time is about 100 ps per evaluation for sin and about 30 ps for native_sin. This is comparable to or slightly worse than what we achieved on the respective CPU. The native_sin version would be very nice to use based on the performance, but its precision is unfortunately a little too low compared to what we need.

  3. AMD ROCm OpenCL driver

    AMD Radeon RX 5500 XT

    [Performance plot]

    Unlike with the Intel GPU driver, the out-of-order queue does not seem to make much of a difference here. The initial overhead for scheduling the kernel seems to be slightly higher, at about 100 us, which makes some sense due to the PCIe bus, but the time per run is much lower at about 2.5 us. There also seems to be a fixed overhead per element of about 20 ps.

    For the performance of sin and native_sin, the native_sin version appears to be much faster. The time is about 13 ps per evaluation for sin and about 2 ps for native_sin. Since native_sin already gives us enough accuracy on this device, we should be able to use it to generate hundreds of traps on this GPU.
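To put the numbers from the three platforms in a common frame, the total time of a test roughly decomposes into an initial overhead, a per-enqueue overhead, a per-element overhead and the actual per-evaluation cost. The following is only an illustrative model, not code from the test: the names t_init, t_run, t_elem and t_eval are mine, the repetition count is arbitrary, and the parameter values plugged in are the UHD 530 numbers quoted above.

#include <cstdio>

// Rough cost model; all times are in nanoseconds.
struct Overhead {
    double t_init; // one-time overhead for the first command
    double t_run;  // overhead per enqueued kernel
    double t_elem; // fixed overhead per work item
    double t_eval; // cost of one sin/native_sin evaluation
};

// Average time per evaluation for `nrep` enqueues of a kernel with
// `gsize` work items and `neval` evaluations per work item.
static double time_per_eval(const Overhead &o, double nrep, double gsize,
                            double neval)
{
    double total =
        o.t_init + nrep * (o.t_run + gsize * (o.t_elem + neval * o.t_eval));
    return total / (nrep * gsize * neval);
}

int main()
{
    // Numbers quoted above for the UHD 530 (out-of-order queue, sin):
    // ~60 us initial, ~12 us per run, ~100 ps per element, ~100 ps per eval.
    Overhead uhd530{60e3, 12e3, 0.1, 0.1};
    // With 8 M elements, 16 evaluations per kernel and 32 repetitions
    // (chosen arbitrarily), the overheads are already mostly amortized.
    std::printf("%.1f ps/eval\n",
                1e3 * time_per_eval(uhd530, 32, 8 << 20, 16));
    return 0;
}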

Given the performance and accuracy of the Intel CPU OpenCL implementation, we will ignore it from this point on and run our tests only on the GPUs. The next thing we will test is the memory bandwidth of the GPU.