Multi-controlled gates in `apply_gate` #87

stavros11 · 2020-05-22T16:29:01Z

This extends `op.apply_gate` to support multi-controlled gates. I am not sure this is the best possible implementation, but performance and memory requirements are the same as with our current `apply_gate` (controlled gates are actually slightly faster because they require fewer updates).

With this kernel we can implement all Qibo gates except SWAP and measurements. For SWAP we can either decompose it into three CNOTs or write another special kernel.
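
The three-CNOT decomposition mentioned above can be sketched as follows. This is only an illustration, not the PR's kernel: `cnot` and `swap_via_cnots` are hypothetical helpers, `int64_t` stands in for TensorFlow's `int64`, and qubit 0 is assumed to be the most significant bit of the amplitude index (consistent with a stride of 2^(nqubits - target - 1)).

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <utility>
#include <vector>

using State = std::vector<std::complex<float>>;

// Hypothetical helper: apply CNOT(control, target) in place.
// Qubit q is assumed to have stride 2^(nqubits - q - 1) in the index.
void cnot(State& state, int nqubits, int control, int target) {
  const int64_t ck = int64_t{1} << (nqubits - control - 1);
  const int64_t tk = int64_t{1} << (nqubits - target - 1);
  const int64_t size = static_cast<int64_t>(state.size());
  for (int64_t i = 0; i < size; ++i) {
    // Visit each pair once: control bit set, target bit clear.
    if ((i & ck) && !(i & tk)) std::swap(state[i], state[i + tk]);
  }
}

// SWAP(a, b) written as CNOT(a, b), CNOT(b, a), CNOT(a, b).
void swap_via_cnots(State& state, int nqubits, int a, int b) {
  cnot(state, nqubits, a, b);
  cnot(state, nqubits, b, a);
  cnot(state, nqubits, a, b);
}
```

On a two-qubit state this exchanges the amplitudes at indices 1 and 2 and leaves indices 0 and 3 alone, at the cost of three passes over the state instead of one.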

I will implement Circuit in a different branch using this custom operator instead of matmul/einsum and redo the gate benchmarks to see how it compares to Cirq.

scarrazza · 2020-05-22T17:01:06Z

@stavros11 the implementation looks good. Please rebase this PR on #86 (if you agree with the changes) so we can check that everything is working properly.

codecov · 2020-05-23T08:29:25Z

Codecov Report

Merging #87 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #87      +/-   ##
==========================================
+ Coverage   97.33%   97.37%   +0.03%     
==========================================
  Files          30       30              
  Lines        2965     3005      +40     
==========================================
+ Hits         2886     2926      +40     
  Misses         79       79

Flag	Coverage Δ
#unittests	`97.37% <100.00%> (+0.03%)`	⬆️

Impacted Files	Coverage Δ
...m_operators/python/ops/qibo_tf_custom_operators.py	`100.00% <100.00%> (ø)`
src/qibo/tests/test_custom_operators.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 11b90f5...67a76f1.

stavros11 · 2020-05-23T12:26:09Z

I integrated this into the Qibo Circuit in a different branch and ran some of the gate benchmarks from #47 to compare with Cirq. It performs much better than our current implementation, but it still does not exploit gate sparsity, and Cirq is faster for some gates. The table shows the ratio of Qibo custom-op time to Cirq 0.8 time, so values below 1 mean Qibo is faster.

nqubits	27	28	29
H (1 thread)	0.670	0.709	0.768
H (8 threads)	0.422	0.428	0.389
H (36 threads)	0.289	0.280	0.307
Z (1 thread)	2.694	2.787	3.069
Z (8 threads)	1.730	1.715	1.657
Z (36 threads)	1.208	1.222	1.136
CNOT (1 thread)	2.726	2.782	2.856
CNOT (8 threads)	1.137	1.132	1.099
CNOT (36 threads)	0.741	0.694	0.695

As in our previous implementation, Qibo's execution time is similar for all gates. The ratios differ only because Cirq is much faster for some gates (such as CNOT and Z) than it is for H.

scarrazza · 2020-05-23T12:46:04Z

Good, do you have a complete QFT example?

scarrazza · 2020-05-23T16:04:39Z

src/qibo/tensorflow/custom_operators/cc/kernels/apply_gate_kernels.cc

+    const int64 tk = std::pow(2, nqubits - target - 1);
+
+    int64 cktot = 0;
+    std::set<int64> cks;


Do you really need a set? If not, we could use a vector of known size, `vector<int64> cks(ncontrols)`, and simply assign element by element, avoiding the resizing that comes with insert.
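
A minimal sketch of this suggestion, under stated assumptions: `control_strides` is a hypothetical helper (not a function from the PR), `int64_t` stands in for TensorFlow's `int64`, and an integer shift replaces `std::pow`, which returns a `double`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: collect the stride 2^(nqubits - c - 1) of every
// control qubit c into a vector allocated once at its final size.
std::vector<int64_t> control_strides(const std::vector<int>& controls,
                                     int nqubits) {
  std::vector<int64_t> cks(controls.size());  // sized up front, no inserts
  for (std::size_t j = 0; j < controls.size(); ++j) {
    // Integer shift instead of std::pow, so no floating-point round trip.
    cks[j] = int64_t{1} << (nqubits - controls[j] - 1);
  }
  return cks;
}
```

Because the number of controls is known before the loop, the vector never reallocates, and element access is a plain index rather than a tree lookup.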

scarrazza · 2020-05-23T16:06:15Z

src/qibo/tensorflow/custom_operators/cc/kernels/apply_gate_kernels.cc

+      for (auto g = t; g < w; g += 2 * tk) {
+        for (auto i = g; i < g + tk; i++) {
+          bool apply = true;
+          for (std::set<int64>::iterator q = cks.begin(); q != cks.end(); q++) {


What about a C++11 range-based for loop, e.g. `for (const auto &i: cks)`?
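
For illustration, the control check written with a range-based loop might look like the sketch below. It assumes the check is a bit test of the amplitude index against each control stride, which is a guess at what the kernel does rather than a copy of it; `controls_active` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: true when every control bit is set in index i.
// With no controls the gate always applies.
bool controls_active(int64_t i, const std::vector<int64_t>& cks) {
  for (const auto& ck : cks) {
    if ((i & ck) == 0) return false;  // one unset control disables the gate
  }
  return true;
}
```

The loop body no longer needs to name the iterator type, so it keeps working unchanged if the container is switched from `std::set` to `std::vector`.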

scarrazza

Looks good to me. We can address the slower gates in another PR.

stavros11 · 2020-05-23T17:43:07Z

> Good, do you have a complete QFT example?

Exact numbers for many qubits (everything single precision):

nqubits	27	28	29	30	31
Cirq0.8	53.754	112.990	235.410	495.829	1044.066
Custom (all threads)	57.498	118.274	243.665	505.251	1047.843
MatmulEinsum (all threads)	239.516	495.313	1036.507	2161.165	4559.753

Note that Cirq is single-threaded, while our single-thread performance would be ~2-3 times worse. However, QFT also involves a lot of CZPow gates, for which Cirq is particularly fast.

I do not have results for more qubits because of time, not memory. 32 qubits take ~48GB, so it might be possible to run 33 on dom.

> Looks good to me. We can address the slower gates in another PR.

Thanks for the comments. I switched from set to vector, as it has simpler syntax and is also slightly faster. I am trying to think of a clean way to fix the slower gates without having to write a separate kernel for each gate. Sure, we can do this in a different PR.

scarrazza · 2020-05-23T18:12:21Z

Great, we are almost there. Concerning the other operators, we could have a functor per gate and thus pass the gate name together with the state.
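
One way to read the functor-per-gate idea is sketched below. Everything here is hypothetical: the functor names, `apply_single_qubit`, and `apply_by_name` are illustrative and do not come from the PR, and controls are left out to keep the example short. Each functor updates one pair of amplitudes, a shared templated loop visits the pairs, and the gate name selects the functor.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Complex = std::complex<float>;
using State = std::vector<Complex>;

// Hypothetical per-gate functors: each updates one (|0>, |1>) amplitude pair.
struct XFunctor {
  void operator()(Complex& a0, Complex& a1) const { std::swap(a0, a1); }
};
struct YFunctor {
  void operator()(Complex& a0, Complex& a1) const {
    const Complex old0 = a0;
    a0 = Complex(0.0f, -1.0f) * a1;   // Y = [[0, -i], [i, 0]]
    a1 = Complex(0.0f, 1.0f) * old0;
  }
};

// Shared loop over amplitude pairs; the gate is fixed at compile time.
template <typename Gate>
void apply_single_qubit(State& state, int nqubits, int target, Gate gate) {
  const int64_t tk = int64_t{1} << (nqubits - target - 1);
  const int64_t size = static_cast<int64_t>(state.size());
  for (int64_t g = 0; g < size; g += 2 * tk) {
    for (int64_t i = g; i < g + tk; ++i) gate(state[i], state[i + tk]);
  }
}

// Dispatch on the gate name; returns false for an unknown name.
bool apply_by_name(const std::string& name, State& state, int nqubits,
                   int target) {
  if (name == "X") {
    apply_single_qubit(state, nqubits, target, XFunctor{});
    return true;
  }
  if (name == "Y") {
    apply_single_qubit(state, nqubits, target, YFunctor{});
    return true;
  }
  return false;
}
```

The string comparison happens once per gate call, not per amplitude, so the inner loop can still be inlined for each functor.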

stavros11 · 2020-05-23T18:38:26Z

For the gates we currently have in Qibo I think the situation is the following:

H, RX, RY, RZ should work very well (better than Cirq) with this kernel.
Z / CZPow require updating only half / a quarter of the state. This means that we could get rid of one of the two state updates in the loop (e.g. line 45 in the kernel).
X and SWAP are just rearrangements and in principle do not require any calculation. I think (but am not really sure) that the most efficient way to implement this at the C++ level is by swapping the pointers (`state[i1], state[i2] = state[i2], state[i1]`) without affecting any actual values in memory.
Y is the same as X but also requires multiplying by ±i after rearranging the pointers.

The easiest and probably most efficient solution is to write three or four different kernels that implement these. The only disadvantage is that it would result in many C++ files. Alternatively, we could try to fit everything in the same kernel, but I am not sure how easy this would be.
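
To make the "half / quarter of the state" point concrete, here is a rough sketch that counts how many amplitudes each gate actually writes. The function names are hypothetical, `apply_controlled_phase` is a generic stand-in for a CZPow-like gate with the phase passed in by the caller, and both functions scan every index for clarity, whereas a tuned kernel would iterate only over the affected indices.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <vector>

using Complex = std::complex<float>;
using State = std::vector<Complex>;

// Z flips the sign only where the target bit is 1.
// Returns the number of amplitudes written (half of the state).
int64_t apply_z(State& state, int nqubits, int target) {
  const int64_t tk = int64_t{1} << (nqubits - target - 1);
  const int64_t size = static_cast<int64_t>(state.size());
  int64_t writes = 0;
  for (int64_t i = 0; i < size; ++i) {
    if (i & tk) {
      state[i] = -state[i];
      ++writes;
    }
  }
  return writes;
}

// A controlled phase multiplies only where control and target bits are
// both 1. Returns the number of amplitudes written (a quarter of the state).
int64_t apply_controlled_phase(State& state, int nqubits, int control,
                               int target, Complex phase) {
  const int64_t mask = (int64_t{1} << (nqubits - control - 1)) |
                       (int64_t{1} << (nqubits - target - 1));
  const int64_t size = static_cast<int64_t>(state.size());
  int64_t writes = 0;
  for (int64_t i = 0; i < size; ++i) {
    if ((i & mask) == mask) {
      state[i] *= phase;
      ++writes;
    }
  }
  return writes;
}
```

For a three-qubit state of 8 amplitudes, `apply_z` writes 4 of them and `apply_controlled_phase` writes 2, which is the saving a gate-specific kernel could exploit.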

Also, there is an indexing issue with 32 qubits:

Actual backend: NoneType
2020-05-23 20:09:57.186323: F ./tensorflow/core/platform/blocking_counter.h:30] Check failed: initial_count >= 0 (-2147483648 vs. 0)
Aborted (core dumped)
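
The value -2147483648 in the log is the minimum of a signed 32-bit integer, which is what 2^31 typically becomes when a 64-bit count is narrowed to 32 bits. Whether that is what happens here is not established in this thread; the sketch below only illustrates the arithmetic. `fits_in_int32` and `pair_count` are hypothetical helpers, and the pair count is just one example of a quantity that reaches 2^31 at 32 qubits.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical guard: does a 64-bit count fit in a signed 32-bit int?
bool fits_in_int32(int64_t count) {
  return count >= std::numeric_limits<int32_t>::min() &&
         count <= std::numeric_limits<int32_t>::max();
}

// Illustration only: number of amplitude pairs a single-qubit gate
// touches, 2^(nqubits - 1).
int64_t pair_count(int nqubits) { return int64_t{1} << (nqubits - 1); }
```

With 31 qubits the pair count is 2^30 and still fits; with 32 qubits it is 2^31, one more than the largest `int32_t`, so any API that takes the count as a 32-bit `int` would need the work split up first.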

scarrazza · 2020-05-23T18:49:06Z

Thanks for the summary, having 3-4 kernels is not a problem at all.

stavros11 · 2020-05-23T18:56:09Z

> Thanks for the summary, having 3-4 kernels is not a problem at all.

Great. I will try to implement something around this idea and check how it compares to Cirq. If it works as expected, I will open a PR, or push it directly here if we leave this open.

scarrazza · 2020-05-23T19:18:59Z

Perfect. Concerning the crash, I have just tried a simple `apply_gate` on dom with single precision and 32 qubits and I don't see any problem. Do you have a specific example that I can try to reproduce?

stavros11 · 2020-05-23T20:20:09Z

> Perfect, concerning the crash, I have just tried a simple apply_gate on dom with single precision and 32 qubits and I don't see any problem. Do you have a specific example that I can try to reproduce?

Thanks for testing. I also tried a simpler example and it is possible to run up to 32 qubits. I guess the problem is associated with the way I integrated the custom op into Qibo circuits, but that was a quick implementation just for the benchmarks. I will revisit this anyway when we have the final kernels.

scarrazza

Ok, if you prefer to merge this now and then open other PRs, it is fine by me too.

stavros11 added 6 commits May 22, 2020 12:51

Custom operator for controlled gates (not working)

74189e5

Merge shardcpu

61a2bca

Controlled apply_gate compiling

2066d29

Refactor apply_gate tests

423b88d

Add controlled gate tests

bddc966

Add docstring in apply gate

a149b21

stavros11 added 3 commits May 23, 2020 11:22

Tests failing

ba640ae

Fix tests

906fc1f

Remove import

8552604

scarrazza reviewed May 23, 2020

Switch set to vector

67a76f1

scarrazza approved these changes May 23, 2020

scarrazza mentioned this pull request May 24, 2020

[Proof of Concept] custom apply_gate on GPU #88

Closed

stavros11 mentioned this pull request May 24, 2020

Gate specific CPU kernels #90

Merged

5 tasks

scarrazza merged commit 67a76f1 into master May 27, 2020

stavros11 deleted the controlled-gates branch May 27, 2020 20:01

stavros11 mentioned this pull request Jun 4, 2020

Variational layer with gate union #106

Merged

scarrazza added this to the First QIBO public release milestone Jun 6, 2020
