
[Proof of Concept] custom apply_gate on GPU #88

Closed
wants to merge 2 commits into from

Conversation

scarrazza
Member

@scarrazza scarrazza commented May 23, 2020

@stavros11 following the developments in #86, this is a proof-of-concept PR where the kernel is distributed using CUDA. The structure should probably be made more robust in terms of thread/block distribution; however, even as it is, we should be able to gain orders of magnitude compared to tf.einsum.
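The thread/block distribution mentioned above can be sketched on the host side as follows. This is an illustrative Python sketch, not the actual C++/CUDA code in this PR; the 1024-thread cap is a typical CUDA per-block hardware limit and is an assumption added here to make the distribution more robust.

```python
# Illustrative host-side sketch of a thread/block distribution for a
# single-qubit gate kernel; not the actual implementation in this PR.
MAX_THREADS_PER_BLOCK = 1024  # typical CUDA hardware limit (assumption)

def launch_config(nqubits, target):
    nstates = 2 ** nqubits
    k = 2 ** (nqubits - target - 1)   # stride between paired amplitudes
    work = nstates // (2 * k)         # one thread per amplitude pair
    threads = min(work, MAX_THREADS_PER_BLOCK)
    blocks = (work + threads - 1) // threads  # ceiling division
    return blocks, threads
```

For example, for 12 qubits with target 11 this gives 2 blocks of 1024 threads, whereas the unbounded version would request a single block of 2048 threads, exceeding the hardware limit.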

@scarrazza scarrazza requested a review from stavros11 May 23, 2020 20:30
@codecov

codecov bot commented May 23, 2020

Codecov Report

Merging #88 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #88   +/-   ##
=======================================
  Coverage   97.33%   97.33%           
=======================================
  Files          30       30           
  Lines        2965     2965           
=======================================
  Hits         2886     2886           
  Misses         79       79           
Flag Coverage Δ
#unittests 97.33% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11b90f5...55bc18d.

Member

@stavros11 stavros11 left a comment


Thanks for implementing this.

I tested this, but apply_gate does not seem to work on GPU. I don't get any errors, but the state is not modified. Have you checked whether it works on your side? It is likely that I am doing something wrong.

const int64 k = std::pow(2, nqubits - target - 1);
int threads = nstates / (2 * k);
int blocks = (nstates / (2 * k) + threads - 1) / threads;
std::cout << threads << " " << blocks << std::endl;
Member


I think this cout should be deleted in the final version.

@scarrazza
Member Author

Yes, the tests are working on GPU for me. Since this is just a proof of concept, the distribution of blocks and threads can certainly be improved; that said, I believe we should not merge this PR.

@stavros11
Member

Yes, the tests are working on GPU for me.

I think in my case it has to do with some compilation issues. When I try to run the tests on dom I get the error

tensorflow.python.framework.errors_impl.NotFoundError: /home/stavrose/qibo/src/qibo/tensorflow/custom_operators/python/ops/_qibo_tf_custom_operators.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringB5cxx11EPNS_15OpKernelContextEb

that I mentioned some time ago. Following this, I tried removing -D_GLIBCXX_USE_CXX11_ABI=0 from the compilation flags, but the problem remains.

When I run python setup.py build on AWS I get a different error:

/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
Makefile:38: recipe for target 'cpugpuoperator' failed
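The "undefined symbol" failure earlier in this comment typically means the extension was compiled against a different TensorFlow build (or C++ ABI) than the one installed at runtime. One way to reproduce the load failure outside TensorFlow is to dlopen the `.so` directly with `ctypes` (illustrative sketch; `try_load` is a hypothetical helper, not part of qibo):

```python
import ctypes

def try_load(path):
    """Attempt to dlopen a shared library; return the loader error, if any.

    An "undefined symbol" message here usually indicates an ABI mismatch
    between the compiled extension and the installed TensorFlow."""
    try:
        ctypes.CDLL(path)
        return None  # library loaded cleanly
    except OSError as err:
        return str(err)
```

If the library loads cleanly here but still fails inside TensorFlow, the problem is more likely in how the op is registered than in the ABI.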

@scarrazza
Member Author

On dom it is working fine for me, using the default CUDA_PATH and TF2.1 installation.

On AWS it also works, but the installation directories are somewhat unusual: I had to create a symlink at /usr/local/cuda-10.1/lib/libcudart.so pointing to /usr/local/cuda-10.1/lib64/libcudart.so. Let's experiment a little more; before the release, if needed, we can replace this simple makefile with a CMake or Meson implementation that discovers the path automatically.
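The symlink workaround described above can be sketched as follows (illustrative Python; `link_cudart` is a hypothetical helper, and the default path is the one from the comment — root permissions would be required on a real system):

```python
import os

def link_cudart(cuda_root="/usr/local/cuda-10.1"):
    """Expose libcudart.so under <cuda_root>/lib so that the linker's
    -lcudart search succeeds, by linking to the copy in <cuda_root>/lib64."""
    src = os.path.join(cuda_root, "lib64", "libcudart.so")
    dst_dir = os.path.join(cuda_root, "lib")
    dst = os.path.join(dst_dir, "libcudart.so")
    os.makedirs(dst_dir, exist_ok=True)
    if not os.path.exists(dst):
        os.symlink(src, dst)
    return dst
```

An equivalent alternative, without touching the filesystem, would be to add the lib64 directory to the linker search path (e.g. via `-L` in the makefile), which is roughly what a CMake/Meson-based discovery would do.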

@stavros11
Member

On dom it is working fine for me, using the default CUDA_PATH and TF2.1 installation.

I was using tf2.2 before. I retried with tf2.1 and it works both for the tests and for my script. I am not sure what the issue with tf2.2 is, though.

It also seems to be super fast and better in terms of memory. My benchmark script with einsum takes 0.895 s for 26 qubits and fails for 27. With this it can run up to 28 qubits in 0.000343 s (is this correct?!). Why do you think this should not be merged? Even if it is not fully optimized, it seems to be much faster than what we currently have, so it is definitely an improvement (excluding the gradients and possible compilation issues).
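The suspiciously small GPU number above may come from measuring only the kernel launch: GPU frameworks dispatch work asynchronously, so a fair benchmark should force the result (e.g. copy the final state back to the host) before stopping the clock. A framework-agnostic timing pattern (illustrative sketch; `timed` is a hypothetical helper):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds).

    For GPU code, fn should materialize its output (e.g. call .numpy()
    on the final state) so the measurement includes actual execution,
    not just the asynchronous kernel launch."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With this pattern, wrapping the circuit execution plus a host copy in `fn` would give a wall-clock figure that is directly comparable to the einsum baseline.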

@scarrazza
Member Author

Ok, I think we should probably implement this approach in #87 (if you decide to merge it). I still have to study what the best block/thread distribution is.

@scarrazza
Member Author

Closing this in favour of #107 .

@scarrazza scarrazza closed this Jun 5, 2020
@scarrazza scarrazza added this to the First QIBO public release milestone Jun 6, 2020
@scarrazza scarrazza deleted the pocgpuapplygate branch June 18, 2020 07:53