OpenCL sin Performance
Similar to the CPU test, we will first measure the performance while involving as little memory access as possible. Due to the complexity of the OpenCL driver, we will also measure some of the overhead related to scheduling a job from the CPU, waiting for the previous dependent job to finish, and running a dummy kernel. These numbers should give us an idea of how much work we need to schedule to avoid being bottlenecked by these overheads. The roundtrip time for scheduling a single dummy kernel will also give us an upper bound on the overhead latency we should expect, though we still need to measure that in a more realistic setting later.
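
For the dummy kernel, something essentially empty should suffice; a minimal sketch (assuming the simplest possible form, the actual kernel in `opencl-dry-compute.cpp` may differ) is:

```c
// An empty kernel: enqueueing this and waiting for it measures only
// the scheduling and roundtrip overhead, with no real computation.
__kernel void dummy(void)
{
}
```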
In order to force the CPU to do computation without accessing memory, we use `asm volatile` to create a dummy use of the result on the CPU. AFAICT, we do not have anything as direct as this in OpenCL, so we need to find another way. In this test, we do this by storing the result to memory behind a branch that will never be taken at runtime. We also make the condition of the store depend on the calculated value so that the compiler will not be able to move the computation into the branch. More specifically, we have something similar to:

```c
float res = amp * sin(...);
if (res > threshold) {
    // store `res` to memory
}
```

As long as we pass in an `amp` that is significantly smaller than `threshold`, the branch will never be taken and the compiler will generally not optimize this case out.
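
As a concrete illustration, a kernel using this trick could look like the sketch below. The argument names and the exact expression are made up for illustration and are not taken from the actual test code.

```c
__kernel void dry_sin(__global float *out, float amp, float freq,
                      float threshold)
{
    size_t i = get_global_id(0);
    float res = amp * sin(freq * (float)i);
    // The branch condition depends on `res`, so the compiler cannot
    // sink the computation into the branch; with `amp` well below
    // `threshold`, the store never actually executes.
    if (res > threshold)
        out[i] = res;
}
```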
As mentioned in the accuracy test, we will test both `sin` and `native_sin`.
For each test, including the dummy one mentioned above, we will vary the dimension (global size) we run each kernel on and the number of repetitions we schedule in the command queue. We do this for a command queue that is either in order or out of order. For the computation test (i.e. not the dummy), we also vary the number of times we evaluate the `sin`/`native_sin` function inside the kernel to minimize the effect of the kernel overhead on the measurement. The full code for the test can be found in `opencl-dry-compute.cpp` and the results can be found under `data/cl-dry-compute`.
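
Below is a minimal sketch of what the host-side measurement loop could look like (illustrative only, with made-up names; it does not reproduce the actual code in `opencl-dry-compute.cpp`). Whether the queue is in order or out of order is chosen when the queue is created, e.g. via the `CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE` property.

```cpp
#include <CL/cl.h>
#include <chrono>

// Enqueue `nrep` runs of `kernel` over `global_size` elements and
// return the average wall-clock time per run.
static double time_runs(cl_command_queue queue, cl_kernel kernel,
                        size_t global_size, int nrep)
{
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < nrep; i++) {
        // On an out of order queue these runs may overlap; an in order
        // queue serializes them implicitly.
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                               nullptr, 0, nullptr, nullptr);
    }
    clFinish(queue); // wait for all enqueued runs to finish
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count() / nrep;
}
```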
As mentioned before, we have three different platforms to test.

- Intel OpenCL CPU runtime
  - i7-6700K
  - i9-10885H

Different lines correspond to different global work sizes. The plots in the left column use the in order queue whereas the ones in the right column use the out of order queue.
There does not seem to be a significant difference between the in order and out of order queue.
From the measurement of the dummy kernel, where the time plotted is the average time per processed element (i.e. total time divided by the cumulative global size), it seems that each job enqueued takes at least about 200 us and there may be a comparable amount of overhead for the initial command. (Note that the blue solid line has a global size of 2, so the total time for that line per run is twice the plotted value.) On the other hand, the minimum time we've measured per execution of the kernel on an element is between 20 and 30 ps, which seems to be mostly accounted for by the minimum time per run; there does not seem to be any additional measurable overhead per element.

Given the huge overhead, the measured time for fewer than 0.5 to 1 M evaluations of `sin`/`native_sin` is dominated by the overhead (with about 200 us of overhead per run and roughly 200 ps per evaluation, the two only become comparable around a million evaluations). With more evaluations per run, the time per evaluation is about 200 ps on i7-6700K and about 100 ps on i9-10885H, most likely due to the difference in core count. The `native_sin` is slightly faster than the `sin` version but not by much. This is about 4 times slower than our CPU implementation and is also much less accurate, so there's really no reason for us to use this...

- Intel Compute OpenCL runtime
  - UHD 530
  - UHD 640

The Intel GPU appears to behave much better compared to their CPU, which isn't something I'd expect from Intel...
From the dummy run, the overhead per run seems to be much lower, at about 12 us on UHD 530 and 15 us on UHD 640 for the out of order queue, with 60 us and 50 us of initial overhead respectively. The in order queue seems to have a similar initial overhead but slightly higher overhead per run. Based on the timing for the run with 8 M elements, there also seems to be a fixed overhead of 100 ps per element, which needs to be overcome by increasing the computation inside each kernel.

Due to the per-element overhead, the timing is not flat as a function of evaluations per kernel even for large global sizes. This overhead becomes insignificant at about 16 evaluations of the function per kernel for `sin`, but remains significant for `native_sin` until about 32 or 64 evaluations per kernel due to the higher performance of `native_sin`. On both devices, the time per evaluation is about 100 ps for `sin` and about 30 ps for `native_sin`. This is comparable to or slightly worse than what we achieved on the respective CPU. The `native_sin` would be very nice to use based on the performance, but its precision is unfortunately a little too low compared to what we need.

- AMD ROCm OpenCL driver
  - AMD Radeon RX 5500 XT

Unlike the Intel GPU driver, the out of order queue does not seem to make much of a difference here. The initial overhead for scheduling the kernel seems to be slightly higher, at about 100 us, which makes some sense due to the PCIe bus, but the time per run is much lower at about 2.5 us. There also seems to be a fixed overhead per element of about 20 ps.

For the performance of `sin` and `native_sin`, the `native_sin` appears to be much faster. The time per evaluation is about 13 ps for `sin` and about 2 ps for `native_sin`. Since `native_sin` already gives us enough accuracy, we should be able to use it to generate hundreds of traps on this GPU.
Given the performance and accuracy of the Intel CPU OpenCL implementation, we will ignore it from this point on and run our tests only on the GPUs. The next thing we will test is the memory bandwidth of the GPU.