[Proof of Concept] custom apply_gate on GPU #88
Conversation
Codecov Report
@@           Coverage Diff           @@
##           master      #88   +/-  ##
=======================================
  Coverage   97.33%   97.33%
=======================================
  Files          30       30
  Lines        2965     2965
=======================================
  Hits         2886     2886
  Misses         79       79
Continue to review full report at Codecov.
Thanks for implementing this. I tested this, but apply_gate does not seem to work on GPU: I don't get any errors, but the state is not modified. Have you checked if it is working? It is likely that I am doing something wrong.
const int64 k = std::pow(2, nqubits - target - 1);
int threads = nstates / (2 * k);
int blocks = (nstates / (2 * k) + threads - 1) / threads;
std::cout << threads << " " << blocks << std::endl;
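One common way to make this launch configuration more robust (a sketch only, not part of the PR; the helper name and the 1024 cap are assumptions for illustration) is to cap the threads per block at the usual hardware limit and derive the number of blocks from the remaining work:

#include <cstdint>

// Sketch: cap threads per block at 1024 (the typical hardware limit)
// and size the grid to cover all amplitude pairs.
void launch_config(int nqubits, int target, int64_t nstates,
                   int* threads, int* blocks) {
  const int64_t k = int64_t{1} << (nqubits - target - 1);
  const int64_t work = nstates / (2 * k);  // one thread per (i, i + k) pair
  *threads = work < 1024 ? static_cast<int>(work) : 1024;
  *blocks = static_cast<int>((work + *threads - 1) / *threads);
}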
I think this cout should be deleted for the final version.
Yes, the tests are working on GPU for me. Since this is a proof of concept, I believe the distribution of blocks and threads can be improved. That being said, I believe we should not merge this PR.
I think in my case it has to do with some compilation issues. When I try to run the tests on dom I get the [...] I mentioned some time ago. Following this I tried removing the [...]. When I run [...]
On dom it is working fine for me, using the default CUDA_PATH and TF2.1 installation. On AWS it also works, but somehow the installation directories are strange; I had to create a symlink in [...]
I was using tf2.2 before. I retried with tf2.1 and it works both for the tests and my script; I am not sure what the issue with tf2.2 is, though. It also seems to be super fast and better in terms of memory: my benchmark script with einsum takes 0.895 sec for 26 qubits and fails for 27, while with this it could run up to 28 qubits in 0.000343 sec (is this correct?!). Why do you think this should not be merged? Even if it is not fully optimized, it seems to be much faster than what we currently have, so it is definitely an improvement (excluding the gradients and possible compilation issues).
Ok, I think we should probably implement this approach in #87 (if you decide to merge it). I still have to study what the best block/thread distribution is.
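One option for that study (again a sketch, not something from this PR; apply_gate_kernel below is a placeholder name) is to let the CUDA runtime suggest an occupancy-maximizing block size via cudaOccupancyMaxPotentialBlockSize and then size the grid to cover the work:

#include <cuda_runtime.h>

// Placeholder kernel; the real kernel and its arguments live in the PR.
__global__ void apply_gate_kernel() {}

// Sketch: ask the runtime for an occupancy-maximizing block size, then
// compute how many blocks are needed to cover `work` threads.
void pick_launch_config(long long work, int* blocks, int* threads) {
  int min_grid = 0, block_size = 0;
  cudaOccupancyMaxPotentialBlockSize(&min_grid, &block_size,
                                     apply_gate_kernel, 0, 0);
  *threads = block_size;
  *blocks = static_cast<int>((work + block_size - 1) / block_size);
}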
Closing this in favour of #107.
@stavros11 following the developments in #86, this is a proof-of-concept PR where the kernel is distributed using CUDA. The structure should probably be more robust in terms of thread/block distribution; however, as it is, we should be able to gain orders of magnitude in comparison to tf.einsum.
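For reference, a minimal illustrative kernel applying a 2x2 gate to the amplitude pairs selected by k = 2^(nqubits - target - 1) could look like the sketch below; this is not the kernel in this PR, and every name in it is an assumption for illustration only.

#include <cuComplex.h>

// Sketch of a single-qubit gate kernel: each thread updates one amplitude
// pair (i, i + k) with the 2x2 matrix `gate` stored row-major.
__global__ void apply_gate(cuFloatComplex* state, const cuFloatComplex* gate,
                           long long k, long long npairs) {
  long long g = (long long)blockIdx.x * blockDim.x + threadIdx.x;
  if (g >= npairs) return;
  // Map the flat pair index to the index of the first amplitude of the pair.
  long long i = (g / k) * 2 * k + (g % k);
  cuFloatComplex a = state[i];
  cuFloatComplex b = state[i + k];
  state[i]     = cuCaddf(cuCmulf(gate[0], a), cuCmulf(gate[1], b));
  state[i + k] = cuCaddf(cuCmulf(gate[2], a), cuCmulf(gate[3], b));
}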