Multi-controlled gates in apply_gate #87

Merged
merged 10 commits into from
May 27, 2020

stavros11
Member

This extends op.apply_gate to support multi-control gates. I am not sure this is the best possible implementation, but performance and memory requirements are the same as for our current apply_gate (controlled gates are actually slightly faster because they require fewer updates).

With this kernel we can implement all Qibo gates except SWAP and measurements. For SWAP we can either decompose it into three CNOTs or write another special kernel.
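As a rough illustration of the idea (not the kernel from this PR), a multi-controlled single-qubit gate on a state vector could be sketched as below. The names `Amp` and `apply_controlled_gate` are hypothetical, the state is assumed to be a flat array of `std::complex<float>`, and qubit 0 is assumed to be the most significant bit, which matches the stride `tk = 2^(nqubits - target - 1)` in the kernel excerpt quoted in the review.

```cpp
#include <array>
#include <cassert>
#include <complex>
#include <cstddef>
#include <cstdint>
#include <vector>

using Amp = std::complex<float>;

// Hypothetical helper: applies the 2x2 matrix `gate` (row-major) to `target`,
// but only on basis states where every qubit listed in `controls` is 1.
// Assumption: qubit 0 is the most significant bit, so qubit q has
// stride 2^(nqubits - q - 1).
void apply_controlled_gate(std::vector<Amp>& state, int nqubits,
                           const std::array<Amp, 4>& gate, int target,
                           const std::vector<int>& controls) {
  assert(target >= 0 && target < nqubits);
  assert(state.size() == (std::size_t{1} << nqubits));
  const std::int64_t nstates = std::int64_t{1} << nqubits;
  const std::int64_t tk = std::int64_t{1} << (nqubits - target - 1);

  // OR all control strides into one mask so the control test is a single AND.
  std::int64_t cmask = 0;
  for (const int c : controls) {
    assert(c >= 0 && c < nqubits && c != target);
    cmask |= std::int64_t{1} << (nqubits - c - 1);
  }

  for (std::int64_t g = 0; g < nstates; g += 2 * tk) {
    for (std::int64_t i = g; i < g + tk; ++i) {
      if ((i & cmask) != cmask) continue;  // some control is 0: skip this pair
      const Amp a0 = state[i];       // amplitude with target bit 0
      const Amp a1 = state[i + tk];  // amplitude with target bit 1
      state[i] = gate[0] * a0 + gate[1] * a1;
      state[i + tk] = gate[2] * a0 + gate[3] * a1;
    }
  }
}
```

With an empty `controls` list this reduces to a plain single-qubit gate, and each added control halves the number of amplitude pairs that are written. Under the same assumptions, a SWAP of qubits a and b can be expressed as three calls with the X matrix: control a / target b, control b / target a, control a / target b.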

I will implement Circuit in a different branch using this custom operator instead of matmul/einsum and redo the gate benchmarks to see how it compares to Cirq.

@scarrazza
Member

@stavros11 the implementation looks good. Please rebase this PR on #86 (if you agree with the changes) so we can check that everything is working properly.

@codecov

codecov bot commented May 23, 2020

Codecov Report

Merging #87 into master will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #87      +/-   ##
==========================================
+ Coverage   97.33%   97.37%   +0.03%     
==========================================
  Files          30       30              
  Lines        2965     3005      +40     
==========================================
+ Hits         2886     2926      +40     
  Misses         79       79              
| Flag | Coverage Δ |
|---|---|
| #unittests | 97.37% <100.00%> (+0.03%) ⬆️ |

| Impacted Files | Coverage Δ |
|---|---|
| ...m_operators/python/ops/qibo_tf_custom_operators.py | 100.00% <100.00%> (ø) |
| src/qibo/tests/test_custom_operators.py | 100.00% <100.00%> (ø) |

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 11b90f5...67a76f1. Read the comment docs.

@stavros11
Member Author

I integrated this into the Qibo Circuit in a different branch and ran some of the gate benchmarks from #47 to compare with Cirq. It performs much better than our current implementation, but it still does not exploit gate sparsity, and Cirq is faster for some gates. The table shows the ratio of Qibo custom op time to Cirq 0.8 time (values below 1 mean the custom op is faster).

| nqubits | 27 | 28 | 29 |
|---|---|---|---|
| H (1 thread) | 0.670 | 0.709 | 0.768 |
| H (8 threads) | 0.422 | 0.428 | 0.389 |
| H (36 threads) | 0.289 | 0.280 | 0.307 |
| Z (1 thread) | 2.694 | 2.787 | 3.069 |
| Z (8 threads) | 1.730 | 1.715 | 1.657 |
| Z (36 threads) | 1.208 | 1.222 | 1.136 |
| CNOT (1 thread) | 2.726 | 2.782 | 2.856 |
| CNOT (8 threads) | 1.137 | 1.132 | 1.099 |
| CNOT (36 threads) | 0.741 | 0.694 | 0.695 |

As in our previous implementation, execution time is similar for all gates. The difference is just because Cirq is much faster for some gates (such as CNOT and Z) than it is for H.

@scarrazza
Member

Good, do you have a complete QFT example?

const int64 tk = std::pow(2, nqubits - target - 1);

int64 cktot = 0;
std::set<int64> cks;
Member

Do you really need a set? If not, we could use a vector of known size, vector<int64> cks(ncontrols), and then simply assign element by element, avoiding resizing with insert.

for (auto g = t; g < w; g += 2 * tk) {
for (auto i = g; i < g + tk; i++) {
bool apply = true;
for (std::set<int64>::iterator q = cks.begin(); q != cks.end(); q++) {
Member

What about a C++11 range-based for loop, e.g. for (const auto &i: cks)?
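The two review suggestions (a pre-sized vector instead of a set, and a range-based for loop instead of explicit iterators) can be sketched together as follows. This is only an illustration: `control_strides` and `controls_active` are hypothetical helpers, and the bitwise control test is an assumption about how the check could be written, not a copy of the kernel's logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: one stride 2^(nqubits - c - 1) per control qubit,
// stored in a vector that is sized once up front and filled by assignment.
std::vector<std::int64_t> control_strides(int nqubits,
                                          const std::vector<int>& controls) {
  std::vector<std::int64_t> cks(controls.size());
  for (std::size_t k = 0; k < controls.size(); ++k) {
    assert(controls[k] >= 0 && controls[k] < nqubits);
    cks[k] = std::int64_t{1} << (nqubits - controls[k] - 1);
  }
  return cks;
}

// Returns true when every control bit of basis-state index i is set.
// The range-based for replaces the explicit std::set iterator loop.
bool controls_active(std::int64_t i, const std::vector<std::int64_t>& cks) {
  for (const auto& ck : cks) {
    if ((i & ck) == 0) return false;
  }
  return true;
}
```

Because the number of controls is known before the loop starts, the vector never needs to grow, and iteration walks contiguous memory rather than tree nodes.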

Member

@scarrazza scarrazza left a comment

Looks good to me. We can address the slower gates in another PR.

@stavros11
Member Author

> Good, do you have a complete QFT example?

image

Exact numbers for many qubits (everything single precision):

| nqubits | 27 | 28 | 29 | 30 | 31 |
|---|---|---|---|---|---|
| Cirq0.8 | 53.754 | 112.990 | 235.410 | 495.829 | 1044.066 |
| Custom (all threads) | 57.498 | 118.274 | 243.665 | 505.251 | 1047.843 |
| MatmulEinsum (all threads) | 239.516 | 495.313 | 1036.507 | 2161.165 | 4559.753 |

It should be noted that Cirq is single-threaded, while our single-thread performance would be ~2-3 times worse. However, QFT also involves a lot of CZPow gates, for which Cirq is particularly fast.

I do not have results for more qubits because of run time, not memory. 32 qubits take ~48GB, so it might be possible to run 33 on dom.

> Looks good to me. We can address the slower gates in another PR.

Thanks for the comments. I switched from set to vector, as it has simpler syntax and is also slightly faster. I am trying to think of a clean way to fix the slower gates without having to write a separate kernel for each gate. Sure, we can do this in a different PR.

@scarrazza
Member

Great, we are almost there. Concerning the other operators we could have a functor per gate and thus pass the gate name together with the state.
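One way to read the functor-per-gate suggestion is a shared loop over amplitude pairs, templated on a small functor that does the gate-specific work, plus a dispatch on the gate name. The sketch below is a guess at that shape under simplifying assumptions (no controls, a `std::vector` state, qubit 0 as the most significant bit); `XFunctor`, `ZFunctor`, `apply_pairwise` and `apply_named_gate` are hypothetical names, not part of Qibo.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using Amp = std::complex<float>;

// Each functor updates one (target bit 0, target bit 1) amplitude pair in place.
struct XFunctor {
  void operator()(Amp& a0, Amp& a1) const { std::swap(a0, a1); }
};
struct ZFunctor {
  void operator()(Amp&, Amp& a1) const { a1 = -a1; }  // a0 is left untouched
};

// Shared loop over amplitude pairs; the per-gate work is supplied by Functor.
template <typename Functor>
void apply_pairwise(std::vector<Amp>& state, int nqubits, int target, Functor f) {
  assert(target >= 0 && target < nqubits);
  const std::int64_t nstates = std::int64_t{1} << nqubits;
  assert(static_cast<std::int64_t>(state.size()) == nstates);
  const std::int64_t tk = std::int64_t{1} << (nqubits - target - 1);
  for (std::int64_t g = 0; g < nstates; g += 2 * tk) {
    for (std::int64_t i = g; i < g + tk; ++i) {
      f(state[i], state[i + tk]);
    }
  }
}

// Hypothetical dispatch on a gate name; returns false for unknown names.
bool apply_named_gate(std::vector<Amp>& state, int nqubits, int target,
                      const std::string& name) {
  if (name == "X") {
    apply_pairwise(state, nqubits, target, XFunctor{});
    return true;
  }
  if (name == "Z") {
    apply_pairwise(state, nqubits, target, ZFunctor{});
    return true;
  }
  return false;
}
```

Since the functor is a template parameter, the compiler can inline it into the loop, so gates such as Z avoid the general 2x2 multiply without needing a separate copy of the loop.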

@stavros11
Member Author

stavros11 commented May 23, 2020

For the gates we currently have in Qibo I think the situation is the following:

  • H, RX, RY, RZ should work very well (better than Cirq) with this kernel.
  • Z / CZPow require updating only half / a quarter of the state. This means that we could drop one of the two state updates in the loop (e.g. line 45 in the kernel).
  • X, SWAP are just rearrangements and in principle do not require any calculation. I think (but am not really sure) that the most efficient way to implement this at the C++ level is by swapping the pointers (state[i1], state[i2] = state[i2], state[i1]) without affecting any actual values in memory.
  • Y is the same as X but also requires multiplying by ±i after rearranging the pointers.

The easiest and probably most efficient solution is to write three or four different kernels that implement these. The only disadvantage is that it would result in many C files. Alternatively, we could try to fit everything in the same kernel, but I am not sure how easy this would be.
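To make the cases in the list above concrete, here is a hedged sketch of two specialised updates: a controlled phase, which writes only the quarter of the amplitudes where both control and target bits are 1, and Y, written as a swap followed by multiplication by ∓i. The function names are hypothetical, the controlled phase is used as a stand-in for a CZPow-type gate, and for simplicity it scans every index rather than enumerating only the affected ones as a tuned kernel would. Note that for a contiguous array `std::swap` exchanges the stored values.

```cpp
#include <cassert>
#include <complex>
#include <cstdint>
#include <utility>
#include <vector>

using Amp = std::complex<float>;

// Multiplies by `phase` only where both control and target bits are 1.
// Returns the number of amplitudes written (a quarter of the state).
// Assumption: qubit q has stride 2^(nqubits - q - 1).
std::int64_t apply_controlled_phase(std::vector<Amp>& state, int nqubits,
                                    int control, int target, Amp phase) {
  assert(control >= 0 && control < nqubits);
  assert(target >= 0 && target < nqubits && target != control);
  const std::int64_t nstates = std::int64_t{1} << nqubits;
  assert(static_cast<std::int64_t>(state.size()) == nstates);
  const std::int64_t mask = (std::int64_t{1} << (nqubits - control - 1)) |
                            (std::int64_t{1} << (nqubits - target - 1));
  std::int64_t touched = 0;
  for (std::int64_t i = 0; i < nstates; ++i) {
    if ((i & mask) == mask) {
      state[i] *= phase;
      ++touched;
    }
  }
  return touched;
}

// Y as "X plus phases": swap each amplitude pair, then multiply by -i and +i.
void apply_y(std::vector<Amp>& state, int nqubits, int target) {
  assert(target >= 0 && target < nqubits);
  const std::int64_t nstates = std::int64_t{1} << nqubits;
  assert(static_cast<std::int64_t>(state.size()) == nstates);
  const std::int64_t tk = std::int64_t{1} << (nqubits - target - 1);
  const Amp I(0.0f, 1.0f);
  for (std::int64_t g = 0; g < nstates; g += 2 * tk) {
    for (std::int64_t i = g; i < g + tk; ++i) {
      std::swap(state[i], state[i + tk]);
      state[i] *= -I;      // new a0 = -i * old a1
      state[i + tk] *= I;  // new a1 = +i * old a0
    }
  }
}
```

Calling `apply_controlled_phase` with a phase of -1 gives CZ; other unit-modulus phases cover the controlled-phase family.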

Also, there is an indexing issue with 32 qubits:

Actual backend: NoneType
2020-05-23 20:09:57.186323: F ./tensorflow/core/platform/blocking_counter.h:30] Check failed: initial_count >= 0 (-2147483648 vs. 0)
Aborted (core dumped)
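The value in the failed check, -2147483648, is the smallest 32-bit signed integer, which is what 2^31 typically becomes when it is narrowed to a 32-bit `int`. Whether that is the actual cause of this abort is not established here; the sketch below only illustrates the arithmetic. `num_pairs` and `fits_in_int32` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Number of (target bit 0, target bit 1) amplitude pairs in an
// n-qubit state: 2^(n-1). Computed in 64 bits to avoid overflow.
std::int64_t num_pairs(int nqubits) {
  assert(nqubits >= 1 && nqubits < 64);
  return std::int64_t{1} << (nqubits - 1);
}

// True when `value` can be stored in a 32-bit signed integer without wrapping.
bool fits_in_int32(std::int64_t value) {
  return value >= std::numeric_limits<std::int32_t>::min() &&
         value <= std::numeric_limits<std::int32_t>::max();
}
```

For 31 qubits the pair count is 2^30 and still fits, while for 32 qubits it is 2^31, one more than the largest 32-bit signed value, so any count or index of that size would need to stay in `int64` end to end.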

@scarrazza
Member

Thanks for the summary, having 3-4 kernels is not a problem at all.

@stavros11
Member Author

> Thanks for the summary, having 3-4 kernels is not a problem at all.

Great. I will try to implement something around this idea and check how it compares to Cirq. If it works as expected I will open a PR or push it directly here if we leave this open.

@scarrazza
Member

Perfect, concerning the crash, I have just tried a simple apply_gate on dom with single precision and 32 qubits and I don't see any problem. Do you have a specific example that I can try to reproduce?

@stavros11
Member Author

> Perfect, concerning the crash, I have just tried a simple apply_gate on dom with single precision and 32 qubits and I don't see any problem. Do you have a specific example that I can try to reproduce?

Thanks for testing. I also tried a simpler example and it is possible to run up to 32 qubits. I guess the problem is associated with the way I integrated the custom op in qibo circuits but this was a quick implementation just for the benchmarks. I will revisit this anyway when we have the final kernels.

Member

@scarrazza scarrazza left a comment

Ok, if you prefer to merge this now and then open other PRs, it is fine by me too.

@stavros11 stavros11 mentioned this pull request May 24, 2020
5 tasks
@scarrazza scarrazza merged commit 67a76f1 into master May 27, 2020
@stavros11 stavros11 deleted the controlled-gates branch May 27, 2020 20:01
@scarrazza scarrazza added this to the First QIBO public release milestone Jun 6, 2020