GPU custom kernels #107
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master     #107   +/-   ##
=======================================
  Coverage   97.56%   97.56%
=======================================
  Files          31       31
  Lines        3617     3617
=======================================
  Hits         3529     3529
  Misses         88       88
```

Continue to review full report at Codecov.
@stavros11 this PR is ready for test and review. Could you please try to run your gates/circuit benchmarks comparing this approach vs the einsum implementation?
Thanks for implementing this. Performance seems very good. Here are some benchmarks:

One-qubit gates: [benchmark plot]

Two-qubit gates: [benchmark plot]

Variational circuit from #106: [benchmark plot]
Regarding the code, one thing I noticed is that we have to redefine [...]. This PR seems good in terms of performance, although it might make sense to compare with another GPU library (e.g. QCGPU, or something similar/more reliable if it exists). I also think that we should find a way to run tests on GPU as well, to verify that all gates work properly there. Tests currently pass on dom, where I think the GPU is used.
.Attr("T: {complex64, complex128}") \ | ||
.Input("state: T") \ | ||
.Input("gate: T") \ | ||
.Input("tensor_controls: int32") \ |
Is it required to pass controls both as a tensor and an attribute? I thought the attribute is sufficient.
For multi-control gates we need the control array allocated inside the GPU memory, so this was my 'lazy' attempt after failing to use the `allocate_temp` method to create a temporary tensor on the device. I will play with that a little bit more, and if it doesn't work then I will use the `cudaMalloc` function.
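For reference, a minimal sketch of the `allocate_temp` route, assuming the standard TensorFlow `OpKernelContext` API (the `controls` vector and function name are illustrative):

```cpp
// Sketch: allocate_temp places the temporary on the op's device, so on GPU
// the data pointer below is device memory and can receive a cudaMemcpy.
#include <vector>
#include <cuda_runtime.h>
#include "tensorflow/core/framework/op_kernel.h"

void CopyControlsToDevice(tensorflow::OpKernelContext* context,
                          const std::vector<tensorflow::int32>& controls,
                          tensorflow::Tensor* tmp) {
  using namespace tensorflow;
  const int64 ncontrols = static_cast<int64>(controls.size());
  OP_REQUIRES_OK(context,
                 context->allocate_temp(DT_INT32, TensorShape({ncontrols}), tmp));
  int32* d_controls = tmp->flat<int32>().data();  // device pointer on GPU
  cudaMemcpy(d_controls, controls.data(), ncontrols * sizeof(int32),
             cudaMemcpyHostToDevice);
}
```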
Note that in the current code, the multi-control two-qubit gates compute the qubits array at every call (which is pretty bad), but I think we can avoid that with one of the previous approaches.
> Note that in the current code, the multi-control two-qubit gates compute the qubits array at every call (which is pretty bad), but I think we can avoid that with one of the previous approaches.
In principle we can avoid recalculating `qubits` if we calculate this during gate creation (as we were doing with einsum strings etc.). I am not sure if that's a better approach, as we would probably need to write a separate kernel that does this calculation. It would also require passing `qubits` from C++ to Python and back, so it might end up being slower. I am not sure if there is a better way to do this that avoids Python.
I think it is fine to perform this computation in C++. I don't think we need to cache this info; instead I just want to avoid this kernel line:
https://github.com/Quantum-TII/qibo/blob/3a68e233ae022c812eae83a322f33af6cb4ad8cc/src/qibo/tensorflow/custom_operators/cc/kernels/apply_gate_kernels.cu.cc#L642
because it recomputes the same information at each thread call (it looks ugly and can add some overhead).
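A minimal sketch of that alternative, with illustrative names rather than the actual qibo kernel signature: build the sorted qubit array once on the host per launch (or cache it at gate creation) and hand the kernel a device pointer, so no thread repeats the computation.

```cpp
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Each thread just reads the precomputed qubits[] array; no per-thread setup.
__global__ void ApplyGateKernel(const int* qubits, int nfixed, long nstates
                                /*, state, gate, ... */) {
  long g = blockIdx.x * (long)blockDim.x + threadIdx.x;
  if (g >= nstates) return;
  // ... compute state indices from qubits[] and apply the gate ...
}

void LaunchApplyGate(const std::vector<int>& controls,
                     const std::vector<int>& targets, int nqubits) {
  // Host side, once per launch: positions counted from the most significant qubit.
  std::vector<int> qubits;
  for (int c : controls) qubits.push_back(nqubits - c - 1);
  for (int t : targets) qubits.push_back(nqubits - t - 1);
  std::sort(qubits.begin(), qubits.end());

  int* d_qubits;
  cudaMalloc(&d_qubits, qubits.size() * sizeof(int));
  cudaMemcpy(d_qubits, qubits.data(), qubits.size() * sizeof(int),
             cudaMemcpyHostToDevice);

  const long nstates = 1L << (nqubits - (int)qubits.size());
  const int threads = 1024;
  const int blocks = (int)((nstates + threads - 1) / threads);
  ApplyGateKernel<<<blocks, threads>>>(d_qubits, (int)qubits.size(), nstates);
  cudaFree(d_qubits);
}
```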
Thanks for the extensive performance checks. I am really happy about these numbers. Concerning the QFT, did you check whether the output for large-qubit circuits matches the einsum/CPU implementation? I will have another go at simplifications, trying to remove redundant code and add some documentation.
I just checked benchmark outputs between Qibo [...].
@joseignaciolatorre now we are ultra fast: 4 seconds for a 30-qubit QFT!
@stavros11 perfect, thanks.
@stavros11 on the dom GPU some tests are failing (not only for this PR). Do you understand why?
Yes, I also noticed that but didn't mention it because it is not important for the kernels. The two measurement tests involve sampling, which is non-deterministic. I use [...]

[...] we see that it is just a statistical difference (due to sampling), so there is nothing to worry about (the distribution must be the same). We should just hard-code the GPU numbers when we make the tests run on GPU. The VQE test fails because it assumes that [...]
Thanks for checking. I have pushed the fix for the tests and it is working now (but I have the suspicion that the Cirq tests are failing on GPU). Let's merge #111, port the changes here, and then perform another round of tests.
I think the failing Cirq tests are related to the bug I fixed in #108. In the two-qubit gate kernel we define [...]

[...] then tests pass. In #108 I modified the two-qubit kernel to fix this issue. I think this fix is not ported here; once we port it, the Cirq tests should be fine.
Indeed, now it works, thanks.
@stavros11 I think we should merge this PR as-is ASAP, so you can rebase the multi-GPU branch and check. Technically, this PR contains GPU kernels in sync with the master kernels for CPU. All tests are passing. The only thing I still have to check is how to allocate tensors directly on the GPU for the [...]. Could you please review?
I agree we can merge this for now so that we can start testing the multi-GPU. Actually, I already merged this in another multi-GPU branch and was updating my old code to use it. I think the main bottleneck now will be transferring state pieces from CPU to GPUs; I was planning to try a custom kernel that does this indexing on CPU. I will update on #76 once I have some results.
Regarding `qubits`, another simple approach is to do the calculation in Python and pass this array to the kernels instead of `controls`. Essentially in vector notation `qubits = nqubits - controls - 1`, but it also includes the targets and is sorted, so something like:

```python
import numpy as np

qubits = list(nqubits - np.array(controls) - 1)
qubits.extend(nqubits - np.array(targets) - 1)
qubits = sorted(qubits)
```

We can pass this list to the kernel instead of the `controls` list we are passing now. This should be fine for the GPU as we can pass it as a tensor (like `state` and `gate`).

I think this approach would simplify the C++ code and we could also do the calculation once during gate creation, not every time the kernel is called. I would not expect much difference in performance compared to the current approach.
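To illustrate how a kernel could consume such a `qubits` array (a sketch of the standard bit-insertion trick, not the existing qibo code): each thread expands its reduced index, which enumerates only the free qubits, into a full state index by inserting a zero bit at every position listed in `qubits`; control bits are then forced to 1 and target bits toggled to address the amplitudes the gate acts on.

```cpp
// Sketch: expand a reduced thread index g into a full state index by
// inserting a zero bit at each (ascending) position in qubits[].
__device__ long GetFullIndex(long g, const int* qubits, int nfixed) {
  long i = g;
  for (int k = 0; k < nfixed; ++k) {
    const int q = qubits[k];
    const long low = i & ((1L << q) - 1);  // bits below the insertion point
    i = ((i >> q) << (q + 1)) | low;       // shift the rest up; bit q stays 0
  }
  return i;
}
```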
I can try implementing this idea in a separate PR if you like.
Ok, let's merge this now so you can continue with the multi-GPU as priority, while I have a look at the `qubits` replacement.
This PR closes #26. Here is a list of required points before merging: [...]

Then: [...]