
[Proof of Concept] custom apply_gate on GPU #88

Closed
wants to merge 2 commits into from

Conversation

scarrazza
Member

@scarrazza scarrazza commented May 23, 2020

@stavros11 following the developments in #86, this is a proof-of-concept PR where the kernel is distributed using CUDA. The structure should probably be made more robust in terms of thread/block distribution; however, even as it is, we should be able to gain orders of magnitude compared to tf.einsum.
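The thread/block distribution mentioned above can be sketched on the host side as follows. This is an illustrative Python sketch, not the actual C++/CUDA code in this PR; the 1024-thread cap is a typical CUDA per-block hardware limit and is an assumption added here to make the distribution more robust.

```python
# Illustrative host-side sketch of a thread/block distribution for a
# single-qubit gate kernel; not the actual implementation in this PR.
MAX_THREADS_PER_BLOCK = 1024  # typical CUDA hardware limit (assumption)

def launch_config(nqubits, target):
    nstates = 2 ** nqubits
    k = 2 ** (nqubits - target - 1)   # stride between paired amplitudes
    work = nstates // (2 * k)         # one thread per amplitude pair
    threads = min(work, MAX_THREADS_PER_BLOCK)
    blocks = (work + threads - 1) // threads  # ceiling division
    return blocks, threads
```

For example, for 12 qubits with target 11 this gives 2 blocks of 1024 threads, whereas the unbounded version would request a single block of 2048 threads, exceeding the hardware limit.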

@scarrazza scarrazza requested a review from stavros11 May 23, 2020 20:30
@codecov

codecov bot commented May 23, 2020

Codecov Report

Merging #88 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master      #88   +/-   ##
=======================================
  Coverage   97.33%   97.33%           
=======================================
  Files          30       30           
  Lines        2965     2965           
=======================================
  Hits         2886     2886           
  Misses         79       79           
Flag Coverage Δ
#unittests 97.33% <ø> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11b90f5...55bc18d.

Member

@stavros11 stavros11 left a comment


Thanks for implementing this.

I tested this, but apply_gate does not seem to work on GPU. I don't get any errors, but the state is not modified. Have you checked whether it works on your side? It is likely that I am doing something wrong.

const int64 k = std::pow(2, nqubits - target - 1);
int threads = nstates / (2 * k);
int blocks = (nstates / (2 * k) + threads - 1) / threads;
std::cout << threads << " " << blocks << std::endl;
Member


I think this cout should be deleted in the final version.

@scarrazza
Member Author

Yes, the tests are working on GPU for me. Since this is just a proof of concept, the distribution of blocks and threads can certainly be improved; that said, I believe we should not merge this PR.

@stavros11
Member

Yes, the tests are working on GPU for me.

I think in my case it has to do with some compilation issues. When I try to run the tests on dom I get the error

tensorflow.python.framework.errors_impl.NotFoundError: /home/stavrose/qibo/src/qibo/tensorflow/custom_operators/python/ops/_qibo_tf_custom_operators.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringB5cxx11EPNS_15OpKernelContextEb

that I mentioned some time ago. Following this, I tried removing -D_GLIBCXX_USE_CXX11_ABI=0 from the compilation flags, but the problem remains.

When I run python setup.py build on AWS I get a different error:

/usr/bin/ld: cannot find -lcudart
collect2: error: ld returned 1 exit status
Makefile:38: recipe for target 'cpugpuoperator' failed
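The "undefined symbol" failure earlier in this comment typically means the extension was compiled against a different TensorFlow build (or C++ ABI) than the one installed at runtime. One way to reproduce the load failure outside TensorFlow is to dlopen the `.so` directly with `ctypes` (illustrative sketch; `try_load` is a hypothetical helper, not part of qibo):

```python
import ctypes

def try_load(path):
    """Attempt to dlopen a shared library; return the loader error, if any.

    An "undefined symbol" message here usually indicates an ABI mismatch
    between the compiled extension and the installed TensorFlow."""
    try:
        ctypes.CDLL(path)
        return None  # library loaded cleanly
    except OSError as err:
        return str(err)
```

If the library loads cleanly here but still fails inside TensorFlow, the problem is more likely in how the op is registered than in the ABI.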

@scarrazza
Member Author

On dom it is working fine for me, using the default CUDA_PATH and TF2.1 installation.

On AWS it also works, but the installation directories are somewhat unusual: I had to create a symlink at /usr/local/cuda-10.1/lib/libcudart.so pointing to /usr/local/cuda-10.1/lib64/libcudart.so. Let's experiment a little more; before the release, if needed, we can replace this simple makefile with a CMake or Meson implementation that discovers the path automatically.
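The symlink workaround described above can be sketched as follows (illustrative Python; `link_cudart` is a hypothetical helper, and the default path is the one from the comment — root permissions would be required on a real system):

```python
import os

def link_cudart(cuda_root="/usr/local/cuda-10.1"):
    """Expose libcudart.so under <cuda_root>/lib so that the linker's
    -lcudart search succeeds, by linking to the copy in <cuda_root>/lib64."""
    src = os.path.join(cuda_root, "lib64", "libcudart.so")
    dst_dir = os.path.join(cuda_root, "lib")
    dst = os.path.join(dst_dir, "libcudart.so")
    os.makedirs(dst_dir, exist_ok=True)
    if not os.path.exists(dst):
        os.symlink(src, dst)
    return dst
```

An equivalent alternative, without touching the filesystem, would be to add the lib64 directory to the linker search path (e.g. via `-L` in the makefile), which is roughly what a CMake/Meson-based discovery would do.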

@stavros11
Member

On dom it is working fine for me, using the default CUDA_PATH and TF2.1 installation.

I was using tf2.2 before. I retried with tf2.1 and it works both for the tests and for my script. I am not sure what the issue with tf2.2 is, though.

It also seems to be super fast and better in terms of memory. My benchmark script with einsum takes 0.895 s for 26 qubits and fails for 27. With this it can run up to 28 qubits in 0.000343 s (is this correct?!). Why do you think this should not be merged? Even if it is not fully optimized, it seems to be much faster than what we currently have, so it is definitely an improvement (excluding the gradients and possible compilation issues).
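The suspiciously small GPU number above may come from measuring only the kernel launch: GPU frameworks dispatch work asynchronously, so a fair benchmark should force the result (e.g. copy the final state back to the host) before stopping the clock. A framework-agnostic timing pattern (illustrative sketch; `timed` is a hypothetical helper):

```python
import time

def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds).

    For GPU code, fn should materialize its output (e.g. call .numpy()
    on the final state) so the measurement includes actual execution,
    not just the asynchronous kernel launch."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

With this pattern, wrapping the circuit execution plus a host copy in `fn` would give a wall-clock figure that is directly comparable to the einsum baseline.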

@scarrazza
Member Author

Ok, I think we should probably implement this approach in #87 (if you decide to merge it). I still have to study what the best block/thread distribution is.

@scarrazza
Member Author

Closing this in favour of #107 .

@scarrazza scarrazza closed this Jun 5, 2020
@scarrazza scarrazza added this to the First QIBO public release milestone Jun 6, 2020
@scarrazza scarrazza deleted the pocgpuapplygate branch June 18, 2020 07:53