GPU custom kernels #107
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##           master     #107   +/-   ##
=======================================
  Coverage   97.56%   97.56%
=======================================
  Files          31       31
  Lines        3617     3617
=======================================
  Hits         3529     3529
  Misses         88       88
```

Continue to review full report at Codecov.
@stavros11 this PR is ready for test and review. Could you please try to run your gates/circuit benchmarks comparing this approach vs the einsum implementation?
Thanks for implementing this. Performance seems very good. Here are some benchmarks:

One-qubit gates: [benchmark plot]

Two-qubit gates: [benchmark plot]

Variational circuit from #106: [benchmark plot]
Regarding the code, one thing I noticed is that we have to redefine [...]. This PR seems good in terms of performance, although it might make sense to compare with another GPU library (e.g. QCGPU, or something similar/more reliable if it exists). I also think that we should find a way to run tests on GPU as well, to verify that all gates work properly there. Tests currently pass on dom, where I think the GPU is used.
.Attr("T: {complex64, complex128}") \ | ||
.Input("state: T") \ | ||
.Input("gate: T") \ | ||
.Input("tensor_controls: int32") \ |
Is it required to pass controls both as a tensor and an attribute? I thought the attribute is sufficient.
For multi-control gates we need the control array allocated inside the GPU memory, so this was my 'lazy' attempt after failing to use the `allocate_temp` method to create a temporary tensor on the device. I will play with that a little bit more, and if it doesn't work then I will use the `cudaMalloc` function.
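For reference, a minimal sketch of the `allocate_temp` route, assuming the standard TensorFlow `OpKernelContext` API (the `controls` vector and function name are illustrative):

```cpp
// Sketch: allocate_temp places the temporary on the op's device, so on GPU
// the data pointer below is device memory and can receive a cudaMemcpy.
#include <vector>
#include <cuda_runtime.h>
#include "tensorflow/core/framework/op_kernel.h"

void CopyControlsToDevice(tensorflow::OpKernelContext* context,
                          const std::vector<tensorflow::int32>& controls,
                          tensorflow::Tensor* tmp) {
  using namespace tensorflow;
  const int64 ncontrols = static_cast<int64>(controls.size());
  OP_REQUIRES_OK(context,
                 context->allocate_temp(DT_INT32, TensorShape({ncontrols}), tmp));
  int32* d_controls = tmp->flat<int32>().data();  // device pointer on GPU
  cudaMemcpy(d_controls, controls.data(), ncontrols * sizeof(int32),
             cudaMemcpyHostToDevice);
}
```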
Note that in the current code, the multi-control two-qubit gates compute the qubits array at every call (which is pretty bad), but I think we can avoid that with one of the previous approaches.
> Note that in the current code, the multi-control two-qubit gates compute the qubits array at every call (which is pretty bad), but I think we can avoid that with one of the previous approaches.
In principle we can avoid recalculating `qubits` if we calculate this during gate creation (as we were doing with einsum strings etc.). I am not sure if that's a better approach, as we would probably need to write a separate kernel that does this calculation. It would also require passing `qubits` from C++ to Python and back, so it might end up being slower. I am not sure if there is a better way to do this that avoids Python.
I think it is fine to perform this computation in C++. I don't think we need to cache this info; instead I just want to avoid this kernel line:
https://github.com/Quantum-TII/qibo/blob/3a68e233ae022c812eae83a322f33af6cb4ad8cc/src/qibo/tensorflow/custom_operators/cc/kernels/apply_gate_kernels.cu.cc#L642
because it recomputes the same information at each thread call (it looks ugly and can add some overhead).
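A minimal sketch of that alternative, with illustrative names rather than the actual qibo kernel signature: build the sorted qubit array once on the host per launch (or cache it at gate creation) and hand the kernel a device pointer, so no thread repeats the computation.

```cpp
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Each thread just reads the precomputed qubits[] array; no per-thread setup.
__global__ void ApplyGateKernel(const int* qubits, int nfixed, long nstates
                                /*, state, gate, ... */) {
  long g = blockIdx.x * (long)blockDim.x + threadIdx.x;
  if (g >= nstates) return;
  // ... compute state indices from qubits[] and apply the gate ...
}

void LaunchApplyGate(const std::vector<int>& controls,
                     const std::vector<int>& targets, int nqubits) {
  // Host side, once per launch: positions counted from the most significant qubit.
  std::vector<int> qubits;
  for (int c : controls) qubits.push_back(nqubits - c - 1);
  for (int t : targets) qubits.push_back(nqubits - t - 1);
  std::sort(qubits.begin(), qubits.end());

  int* d_qubits;
  cudaMalloc(&d_qubits, qubits.size() * sizeof(int));
  cudaMemcpy(d_qubits, qubits.data(), qubits.size() * sizeof(int),
             cudaMemcpyHostToDevice);

  const long nstates = 1L << (nqubits - (int)qubits.size());
  const int threads = 1024;
  const int blocks = (int)((nstates + threads - 1) / threads);
  ApplyGateKernel<<<blocks, threads>>>(d_qubits, (int)qubits.size(), nstates);
  cudaFree(d_qubits);
}
```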
Thanks for the extensive performance checks. I am really happy about these numbers. Concerning the QFT, did you check whether the output for large-qubit circuits matches the einsum/CPU implementation? I will have another go at simplifications, trying to remove redundant code and add some documentation.
I just checked benchmark outputs between Qibo [...].
@joseignaciolatorre now we are ultra fast: 4 seconds for a 30-qubit QFT!
@stavros11 perfect, thanks.
@stavros11 on the dom GPU some tests are failing (not only for this PR). Do you understand why?
Yes, I also noticed that but didn't mention it because it is not important for the kernels. The two measurement tests involve sampling, which is non-deterministic. I use [...]

[...] we see that it is just a statistical difference (due to sampling), so there is nothing to worry about (the distribution must be the same). We should just hard-code the GPU numbers when we make the tests run on GPU. The VQE test fails because it assumes that [...]
Thanks for checking. I have pushed the fix for the tests and it is working now (but I have the suspicion that the Cirq tests are failing on GPU). Let's merge #111, port the changes here, and then perform another round of tests.
I think the failing Cirq tests are related to the bug I fixed in #108. In the two-qubit gate kernel we define [...]

[...] then tests pass. In #108 I modified the two-qubit kernel to fix this issue. I think this fix is not ported here; once we port it, the Cirq tests should be fine.
Indeed, now it works, thanks.
@stavros11 I think we should merge this PR as-is ASAP, so you can rebase the multi-GPU branch and check. Technically, this PR contains GPU kernels in sync with the master kernels for CPU. All tests are passing. The only thing I still have to check is how to allocate tensors directly on the GPU for the [...]. Could you please review?
I agree we can merge this for now so that we can start testing the multi-GPU. Actually, I already merged this in another multi-GPU branch and was updating my old code to use it. I think the main bottleneck now will be transferring state pieces from CPU to GPUs; I was planning to try a custom kernel that does this indexing on CPU. I will update on #76 once I have some results.
Regarding `qubits`, another simple approach is to do the calculation in Python and pass this array to the kernels instead of `controls`. Essentially in vector notation `qubits = nqubits - controls - 1`, but it also includes the targets and is sorted, so something like:

```python
import numpy as np

qubits = list(nqubits - np.array(controls) - 1)
qubits.extend(nqubits - np.array(targets) - 1)
qubits = sorted(qubits)
```

We can pass this list to the kernel instead of the `controls` list we are passing now. This should be fine for the GPU as we can pass it as a tensor (like `state` and `gate`).

I think this approach would simplify the C++ code and we could also do the calculation once during gate creation, not every time the kernel is called. I would not expect much difference in performance compared to the current approach.
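To illustrate how a kernel could consume such a `qubits` array (a sketch of the standard bit-insertion trick, not the existing qibo code): each thread expands its reduced index, which enumerates only the free qubits, into a full state index by inserting a zero bit at every position listed in `qubits`; control bits are then forced to 1 and target bits toggled to address the amplitudes the gate acts on.

```cpp
// Sketch: expand a reduced thread index g into a full state index by
// inserting a zero bit at each (ascending) position in qubits[].
__device__ long GetFullIndex(long g, const int* qubits, int nfixed) {
  long i = g;
  for (int k = 0; k < nfixed; ++k) {
    const int q = qubits[k];
    const long low = i & ((1L << q) - 1);  // bits below the insertion point
    i = ((i >> q) << (q + 1)) | low;       // shift the rest up; bit q stays 0
  }
  return i;
}
```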
I can try implementing this idea in a separate PR if you like.
Ok, let's merge this now so you can continue with the multi-GPU as priority, while I have a look at the `qubits` replacement.
This PR closes #26. Here is a list of required points before merging: [...]

Then: [...]