
GPU custom kernels #107

Merged
merged 25 commits into from
Jun 8, 2020

Conversation

@scarrazza (Member) commented Jun 5, 2020

This PR closes #26. Here is a list of required points before merging:

  • add GPU kernel layout
  • BaseOneQubitGateFunctor
  • ApplyGateFunctor
  • ApplyXFunctor
  • ApplyYFunctor
  • ApplyZFunctor
  • ApplyZPowFunctor
  • BaseTwoQubitGateFunctor
  • ApplyTwoQubitGateFunctor
  • ApplyFsimFunctor
  • ApplySwapFunctor

Then:

  • Test performance
  • Make sure tests are also passing on GPU
  • Make sure the design is clean
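As a rough illustration of what the apply-gate functors above compute, here is a NumPy sketch of applying a one-qubit gate to a chosen target qubit. The helper name is hypothetical and this mirrors the einsum-style semantics, not the actual CUDA indexing:

```python
import numpy as np

def apply_one_qubit_gate(state, gate, nqubits, target):
    """Contract a 2x2 `gate` with the `target` axis of the state vector.

    NumPy sketch of the semantics the custom kernels implement with
    explicit index arithmetic; not the actual CUDA implementation.
    """
    tensor = state.reshape(nqubits * (2,))
    tensor = np.tensordot(gate, tensor, axes=[[1], [target]])
    tensor = np.moveaxis(tensor, 0, target)
    return tensor.reshape(2 ** nqubits)
```

The specialized functors (ApplyXFunctor, ApplyZFunctor, ...) perform this same operation for a fixed matrix, which lets them skip the generic multiply where possible.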

codecov bot commented Jun 5, 2020

Codecov Report

Merging #107 into master will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##           master     #107   +/-   ##
=======================================
  Coverage   97.56%   97.56%           
=======================================
  Files          31       31           
  Lines        3617     3617           
=======================================
  Hits         3529     3529           
  Misses         88       88           
Flag Coverage Δ
#unittests 97.56% <ø> (ø)

Last update 7963d98...0ff72d2.

@scarrazza scarrazza requested a review from stavros11 June 6, 2020 11:56
@scarrazza scarrazza marked this pull request as ready for review June 6, 2020 11:56
@scarrazza (Member Author) commented Jun 6, 2020

@stavros11 this PR is ready for test and review. Could you please try to run your gates/circuit benchmarks comparing this approach vs tf.einsum on GPU? (You can use the TITAN V.)

@scarrazza scarrazza changed the title [WIP] GPU custom kernels GPU custom kernels Jun 6, 2020
@stavros11 (Member) commented
Thanks for implementing this. Performance seems very good. Here are some benchmarks:
I used single precision and 100 layers because one-layer times were very small and mostly noise. For tf.einsum all one/two-qubit gates are essentially equivalent as it just does 2x2 / 4x4 matrix multiplication.

One-qubit gates (times in seconds for 100 layers; columns are numbers of qubits):

| Gate | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- |
| tf.einsum | 49.82841492 | 104.1803401 | – | – |
| H | 6.127989292 | 12.95185041 | 27.32096434 | 57.70882773 |
| X | 6.152362108 | 12.98752093 | 27.38330579 | 57.75994301 |
| Y | 7.868867397 | 16.57449579 | 34.87748337 | 73.28885698 |
| Z | 3.366233587 | 7.083585501 | 14.89782739 | 31.2736063 |
| RX | 7.230115652 | 13.98320127 | 28.49803305 | 62.8688333 |

Two-qubit gates:

| Gate | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- |
| tf.einsum (two-qubit) | 40.34099483 | 80.16095805 | – | – |
| CNOT | 5.406490803 | 11.08659792 | 22.48678398 | 45.97074556 |
| CZPow | 3.482161522 | 6.683050871 | 13.45482922 | 26.82113028 |
| SWAP | 3.299152851 | 6.96425724 | 14.66655588 | 30.83993149 |
| fSim | 6.187955618 | 11.16268635 | 22.07861543 | 45.51510334 |

QFT (single precision; timing plots dom_qft_custom_c64_gpu / dom_qft_custom_c64 omitted):

| Qubits | Custom (i9, 36 threads) | Custom (TITAN V) | tf.einsum (TITAN V) |
| --- | --- | --- | --- |
| 26 | – | 0.3277387619 | 5.71264863 |
| 27 | 13.76589632 | 0.6096761227 | 12.13366461 |
| 28 | 25.55270457 | 1.191361427 | 26.04767585 |
| 29 | 61.55770898 | 2.376187086 | – |
| 30 | 146.8903966 | 4.718497753 | – |

Variational circuit from #106:

| | 27 | 28 | 29 | 30 |
| --- | --- | --- | --- | --- |
| Custom (All gates) | 0.2576124668 | 0.5082602501 | 0.9988548756 | 2.066425085 |
| tf.matmul (All gates) | 1.634342909 | 3.077134132 | – | – |
| Custom (Gate union) | 0.3028848171 | 0.3951649666 | 0.532115221 | 0.8434355259 |
| tf.matmul (Gate union) | 0.6267392635 | 1.120274067 | – | – |

Regarding the code, one thing I noticed is that we have to redefine nocontrolwork, singlecontrolwork, etc. for each gate. From the benchmarks I am not sure how useful it is to have many different kernels on the GPU (e.g. X and Y performance is similar to the generic ApplyGate used by H and RX). For two-qubit gates there is some improvement (e.g. CZPow is faster than fSim). It might still be better to keep the current structure so that it mirrors the CPU code.
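For intuition on why a specialized kernel like Z can benchmark roughly twice as fast as the generic ApplyGate: Z needs no 2x2 matrix multiply and only writes the half of the amplitudes whose target bit is 1. A NumPy sketch with a hypothetical helper, assuming the big-endian qubit convention (qubit 0 maps to the most significant bit):

```python
import numpy as np

def apply_z(state, nqubits, target):
    """Specialized Z 'kernel': negate amplitudes whose target bit is 1.

    No matrix multiply is performed, and only half of the amplitudes
    are touched; a sketch of why Z is faster than H or RX above.
    """
    shift = nqubits - target - 1  # big-endian: qubit 0 is the leftmost bit
    idx = np.arange(state.shape[0])
    mask = ((idx >> shift) & 1).astype(bool)
    state[mask] *= -1
    return state
```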

This PR seems good in terms of performance, although it might make sense to compare with another GPU library (e.g. QCGPU, or something similar and more reliable if it exists). I also think we should find a way to run the tests on GPU as well, to verify that all gates work properly there. Tests currently pass on dom, where I think the GPU is used.

.Attr("T: {complex64, complex128}") \
.Input("state: T") \
.Input("gate: T") \
.Input("tensor_controls: int32") \
@stavros11 (Member):

Is it required to pass controls both as a tensor and an attribute? I thought the attribute is sufficient.

@scarrazza (Member Author):

For multi-control gates we need the control array allocated inside GPU memory, so this was my 'lazy' attempt after failing to use the allocate_temp method to create a temporary tensor on the device. I will play with that a little bit more, and if it doesn't work I will use cudaMalloc directly.

Note that in the current code, the multi-control path for two-qubit gates computes the qubits array at every call (which is pretty bad), but I think we can avoid that with one of the previous approaches.

@stavros11 (Member):

Note that in the current code, the multi-control path for two-qubit gates computes the qubits array at every call (which is pretty bad), but I think we can avoid that with one of the previous approaches.

In principle we can avoid recalculating qubits if we calculate this during gate creation (as we were doing with the einsum strings etc.). I am not sure that is a better approach, as we would probably need to write a separate kernel that does this calculation. It would also require passing qubits from C++ to Python and back, so it might end up being slower. I am not sure if there is a better way to do this that avoids Python.

@scarrazza (Member Author):

I think it is fine to perform this computation in C++; I don't think we need to cache this info, but I just want to avoid this kernel line:
https://github.com/Quantum-TII/qibo/blob/3a68e233ae022c812eae83a322f33af6cb4ad8cc/src/qibo/tensorflow/custom_operators/cc/kernels/apply_gate_kernels.cu.cc#L642
because it recomputes the same information at each thread call (it looks ugly and can add some overhead).

@scarrazza (Member Author) commented Jun 6, 2020

Thanks for the extensive performance checks. I am really happy about these numbers.

Concerning the QFT, did you check whether the output for large-qubit circuits matches the einsum/CPU implementation?

I will have another go with simplifications, trying to remove redundant code, and adding some documentation.

@stavros11 (Member) commented

Concerning the QFT, did you check whether the output for large-qubit circuits matches the einsum/CPU implementation?

I just checked benchmark outputs between Qibo tf.einsum, custom GPU and CPU, and Cirq, and they all agree.

@scarrazza (Member Author) commented

@joseignaciolatorre now we are ultra fast: 4 seconds for a 30-qubit QFT!

@scarrazza (Member Author) commented

@stavros11 perfect, thanks.

@scarrazza (Member Author) commented

@stavros11 on dom GPU some tests are failing (not only for this PR).
In particular:

  • test_probabilistic_measurement
  • test_unbalanced_probabilistic_measurement
  • test_vqe_compile_error

Do you understand why?

@stavros11 (Member) commented Jun 7, 2020

@stavros11 on dom GPU some tests are failing (not only for this PR).
In particular:

  • test_probabilistic_measurement
  • test_unbalanced_probabilistic_measurement
  • test_vqe_compile_error

Do you understand why?

Yes, I also noticed that but didn't mention it because it is not important for the kernels.

The two measurement tests involve sampling, which is non-deterministic. I use tf.random.seed to make sure the results are always the same, but it seems that tf.random works differently on CPU and GPU, which is why the tests fail. Looking at the error:

E           {0: 273} != {0: 271}
E           {1: 233} != {1: 239}
E           {3: 252} != {3: 248}

we see that it is just a statistical difference (due to sampling), so nothing to worry about (the underlying distribution is the same). We should just hard-code the GPU numbers when we make the tests run on GPU.

The VQE test fails because it assumes that `from qibo import gates` imports the custom gates. Currently `from qibo import gates` imports our old TensorFlow gates when running on GPU. Since this PR implements the GPU kernels, we should change the default gates to be the custom ones both for CPU and GPU.

@scarrazza (Member Author) commented

Thanks for checking, I have pushed the fix for the tests and it is working now (but I suspect the Cirq tests are failing on GPU). Let's merge #111, port the changes here, and then perform another round of tests.

@stavros11 (Member) commented Jun 7, 2020

Thanks for checking, I have pushed the fix for the tests and it is working now (but I suspect the Cirq tests are failing on GPU).

I think the failing Cirq tests are related to the bug I fixed in #108. In the two-qubit gate kernel we define `t1 = max(target1, target2); t2 = min(target1, target2)`, which switches the order of targets whenever target1 < target2. This can lead to wrong application of some gates (e.g. a general 4x4 unitary matrix). In the failing tests the targets are generated randomly, but I noticed that if I do

if targets[0] > targets[1]: 
    targets[0], targets[1] = targets[1], targets[0]

then tests pass.

In #108 I modified the two-qubit kernel to fix this issue, but I think that fix has not been ported here yet. Once we port it, the Cirq tests should be fine.
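To see why silently sorting the targets is wrong for non-symmetric gates, here is a NumPy sketch (hypothetical helper, mirroring einsum-style gate application rather than the actual kernel) showing that a CNOT applied with targets (0, 1) versus (1, 0) gives different states:

```python
import numpy as np

def apply_two_qubit_gate(state, gate, nqubits, t1, t2):
    """Apply a 4x4 `gate` on target axes (t1, t2) of the state vector."""
    tensor = state.reshape(nqubits * (2,))
    tensor = np.tensordot(gate.reshape(2, 2, 2, 2), tensor,
                          axes=[[2, 3], [t1, t2]])
    tensor = np.moveaxis(tensor, [0, 1], [t1, t2])
    return tensor.ravel()

cnot = np.eye(4)[[0, 1, 3, 2]]   # control = first target, flips the second
state = np.zeros(4); state[2] = 1.0  # |10>
# control on qubit 0 (which is 1) flips qubit 1:
a = apply_two_qubit_gate(state, cnot, 2, 0, 1)  # -> |11>
# targets swapped: control sits on qubit 1 (which is 0), so no flip:
b = apply_two_qubit_gate(state, cnot, 2, 1, 0)  # -> |10>
```

Reordering the targets without permuting the 4x4 matrix accordingly is exactly the inconsistency the #108 fix addresses.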

@scarrazza (Member Author) commented

Indeed, now it works, thanks.

@scarrazza (Member Author) commented

@stavros11 I think we should merge this PR as-is ASAP, so you can rebase the multi-GPU branch and check.

Technically, this PR contains GPU kernels in sync with the master kernels for CPU. All tests are passing. The only thing I still have to check is how to allocate Tensors directly on GPU for the qubits array calculation, but I can try this later in another PR and it should change the performance drastically.

Could you please review?

@stavros11 (Member) left a comment

I agree we can merge this for now so that we can start testing the multi-GPU. Actually, I have already merged this into another multigpu branch and was updating my old code to use it. I think the main bottleneck now will be transferring state pieces from CPU to GPUs, and I was planning to try a custom kernel that does this indexing on the CPU. I will update on #76 once I have some results.

Regarding qubits, another simple approach is to do the calculation in Python and pass this array to the kernels instead of controls. Essentially, in vector notation, qubits = nqubits - controls - 1, but it also includes the targets and is sorted, so something like:

import numpy as np

qubits = list(nqubits - np.array(controls) - 1)
qubits.extend(nqubits - np.array(targets) - 1)
qubits = sorted(qubits)

We can pass this list to the kernel instead of the controls list we are passing now. This should be fine for the GPU as we can pass it as a tensor (like state and gate).

I think this approach would simplify the C++ code, and we could also do the calculation once during gate creation rather than every time the kernel is called. I would not expect much difference in performance compared to the current approach.

I can try implementing this idea in a separate PR if you like.
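For reference, the usual way a sorted qubits array like this is consumed is the bit-insertion trick: each thread's compact counter over the free bit positions is expanded to a full state-vector index by inserting a set bit at each entry of qubits. A Python sketch (hypothetical helper; bit positions are little-endian here, and the real kernel indexing may differ):

```python
def expand_index(g, qubits):
    """Insert a set bit at each position in `qubits` (ascending order),
    mapping a compact counter `g` to a full state-vector index whose
    bits at those positions are all 1 (e.g. the control bits)."""
    i = g
    for q in sorted(qubits):
        low = i & ((1 << q) - 1)              # bits below the insert position
        i = ((i >> q) << (q + 1)) | (1 << q) | low
    return i
```

For example, with qubits = [1] the counter g = 0, 1, 2, 3 enumerates exactly the indices 2, 3, 6, 7 whose bit 1 is set, which is how a kernel visits only the amplitudes where the control qubit is 1.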

@scarrazza (Member Author) commented

OK, let's merge this now so you can continue with the multi-GPU as a priority, while I have a look at the qubits replacement.

@scarrazza scarrazza merged commit 5300f33 into master Jun 8, 2020
@scarrazza scarrazza deleted the gpukernels branch June 18, 2020 07:53
scarrazza added a commit that referenced this pull request Nov 1, 2022
Successfully merging this pull request may close these issues.

Custom operator for GPU
2 participants