Optimizations for the qibo.quantum_info.basis.pauli_basis and vectorization function #1459

Merged · 26 commits into master from pauli_basis_speedup · Oct 8, 2024

Conversation

@BrunoLiegiBastonLiegi (Contributor) commented Sep 20, 2024

This improves the construction of the Pauli basis by moving everything to tensor notation and removing explicit loops: basis_full is now constructed via a single einsum contraction. This should also scale well with GPU backends; oddly, for standard NumPy on CPU there is no speedup.
These are the results:

| nqubits | old (numpy) | new (numpy) | new (cupy) |
|---------|-------------|-------------|------------|
| 4       | ~1 s        | ~1 s        | ~2.7 s     |
| 7       | ~9.5 s      | ~9.1 s      | ~3.5 s     |
| 8       | killed      | out of memory | out of memory |

The GPU always takes ~1 s to set up. With 8 qubits, the old implementation gets killed together with the shell session, whereas the new one, on both CPU and GPU, raises an out-of-memory error. To run 8 qubits you apparently need ~64 GB of memory (which checks out: 4^8 basis elements of size 2^8 × 2^8 in complex128 take 4^8 · 2^8 · 2^8 · 16 bytes = 2^36 bytes = 64 GiB).

To perform the einsum I originally used all 48 available ASCII characters, which limited us to 48/3 = 16 qubits (unless other characters can be used in einsum); in any case, the memory requirements for 16 qubits would probably be very taxing. I am now using integers as the einsum indices, so this is no longer limited by the number of ASCII characters. It may be possible to obtain a speedup with numba as well, but I still have to investigate.
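For illustration, a minimal sketch in plain NumPy of the idea (not the actual qibo implementation; pauli_basis_full and paulis are illustrative names): each tensor-product factor contributes one "which Pauli" index plus a row/column pair, and a single einsum with integer sublists builds all 4^n basis elements at once.

```python
import numpy as np

# single-qubit Pauli basis, shape (4, 2, 2)
paulis = np.array([
    [[1, 0], [0, 1]],     # I
    [[0, 1], [1, 0]],     # X
    [[0, -1j], [1j, 0]],  # Y
    [[1, 0], [0, -1]],    # Z
])

def pauli_basis_full(nqubits):
    # operand q carries indices (3q, 3q+1, 3q+2): basis label, row, column
    operands = []
    for q in range(nqubits):
        operands += [paulis, [3 * q, 3 * q + 1, 3 * q + 2]]
    # output: all labels first, then all rows, then all columns,
    # so that a final reshape recovers the stack of Kronecker products
    out = ([3 * q for q in range(nqubits)]
           + [3 * q + 1 for q in range(nqubits)]
           + [3 * q + 2 for q in range(nqubits)])
    full = np.einsum(*operands, out)
    return full.reshape(4**nqubits, 2**nqubits, 2**nqubits)

# sanity check against the explicit Kronecker construction
from functools import reduce
from itertools import product
ref = np.stack([reduce(np.kron, ops) for ops in product(paulis, repeat=2)])
assert np.allclose(pauli_basis_full(2), ref)
```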

EDIT 1: unfortunately, einsum is not supported by numba; however, if you jit the old implementation:

```python
from itertools import product

import numpy as np
from numba import njit, prange

basis_single = list(product(basis_single, repeat=nqubits))

@njit(fastmath=True, parallel=True, cache=True)
def _build_basis(basis_single):
    # preallocate the output with placeholder entries
    rows = [basis_single[0][0] for i in range(len(basis_single))]
    for i in prange(len(basis_single)):
        row = basis_single[i][0]
        # note: this krons elements 0..n-2 onto element 0, duplicating the
        # first factor and skipping the last one; basis_single[i][j] should
        # read basis_single[i][j + 1], which may explain the wrong results
        # mentioned below
        for j in range(len(basis_single[i][1:])):
            row = np.kron(row, basis_single[i][j])
        rows[i] = row
    return rows

basis_full = backend.cast(_build_basis(basis_single))
```

you get a nice speedup (~5 s for 7 qubits), though only from the second call onwards, since the first call still pays the compilation cost. Unfortunately, this implementation was yielding wrong results, so I had to roll back to the einsum approach. Further investigation is needed to understand whether it is possible to parallelize and improve this with numba.

EDIT 2:
At some point I realized that a possible bottleneck, or rather inefficiency, was qibo.quantum_info.superoperator_transformation.vectorization, which could only run on a single input, either a state vector or a density matrix, thus forcing loops over each element of the basis (which can grow large very quickly). The impact on runtime is still marginal on CPU (for 7 qubits it contributed roughly 1-2 s out of 10 s), but on GPUs this starts to become relevant. Furthermore, vectorization appears to be widely used in the quantum_info module, which convinced me to generalize it to accept batches of state vectors or density matrices, thereby removing the need for explicit loops and instead leveraging tensor primitives directly. This is applied to pauli_basis for now, but it should be propagated throughout the whole module.
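A minimal NumPy sketch of the batching principle (the name vectorization_batched and the row/column order convention are illustrative, not qibo's exact API): with a leading batch axis, vectorizing a whole stack of density matrices is a single reshape instead of a Python loop.

```python
import numpy as np

def vectorization_batched(states, order="row"):
    # states: (batch, d, d) stack of density matrices
    batch, d, _ = states.shape
    if order == "row":
        # row-major flattening of each matrix
        return states.reshape(batch, d * d)
    if order == "column":
        # column-major: transpose each matrix before flattening
        return states.transpose(0, 2, 1).reshape(batch, d * d)
    raise ValueError(f"unknown order {order!r}")

rho = np.random.rand(16, 4, 4)       # e.g. the 16-element 2-qubit basis
vec = vectorization_batched(rho)     # shape (16, 16), no explicit loop
```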

Checklist:

  • Reviewers confirm new code works as expected.
  • Tests are passing.
  • Coverage does not decrease.
  • Documentation is updated.

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.93%. Comparing base (a59e5d7) to head (2fc4114).

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #1459   +/-   ##
=======================================
  Coverage   99.93%   99.93%
=======================================
  Files          81       81
  Lines       11777    11785    +8
=======================================
+ Hits        11769    11777    +8
  Misses          8        8
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 99.93% <100.00%> (+<0.01%) ⬆️ |


@alecandido (Member) commented Sep 21, 2024

@BrunoLiegiBastonLiegi you don't have to use ASCII characters for einsum:

> einsum also provides an alternative way to provide the subscripts and operands as einsum(op0, sublist0, op1, sublist1, ..., [sublistout]). If the output shape is not provided in this format einsum will be calculated in implicit mode, otherwise it will be performed explicitly. The examples below have corresponding einsum calls with the two parameter methods.

https://numpy.org/doc/stable/reference/generated/numpy.einsum.html#numpy-einsum

In this way, you can use integers instead of characters, which is much better if you are generating the einsum input programmatically (and you're not limited by the number of characters).
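For instance, the same matrix product written in both notations:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# subscript string: "ij,jk->ik"
c1 = np.einsum("ij,jk->ik", a, b)
# integer sublists: 0 = i, 1 = j, 2 = k
c2 = np.einsum(a, [0, 1], b, [1, 2], [0, 2])
assert np.allclose(c1, c2)
```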

(This is not a suggestion to ban characters for einsum in every situation: they are handy when there are only a few and you are writing them manually, since that is much more readable. But the case where you spell out a small operation by hand and the one where you generate a complex one programmatically have to be handled in different ways, and luckily NumPy offers both.)

@renatomello (Contributor)

One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.

@renatomello renatomello added this to the Qibo 0.2.13 milestone Sep 23, 2024
@renatomello renatomello added the enhancement, documentation, and quantum_info module labels Sep 23, 2024
@BrunoLiegiBastonLiegi (Contributor, Author)

> One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.

mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory.
Maybe what we could do is define the generator of a single element of the basis for each backend:

```python
@cache
def _pauli_basis_element(self, i, nqubits):
    # do things according to backend
```

and then you can build the complete basis as a generator as you suggested

```python
def pauli_basis(...):
    return (backend._pauli_basis_element(i, nqubits) for i in range(4**nqubits))
```

but GPU-wise this will be less efficient, and it will only work as long as you don't need the complete basis at the same time.
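A concrete sketch of what such a generator could look like, assuming a plain NumPy backend (pauli_basis_lazy and PAULIS are illustrative names):

```python
import numpy as np
from functools import reduce
from itertools import product

PAULIS = (
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
)

def pauli_basis_lazy(nqubits):
    # yields the 4**nqubits basis elements one at a time, keeping only a
    # single (2**n x 2**n) matrix in memory instead of the full stack
    for factors in product(PAULIS, repeat=nqubits):
        yield reduce(np.kron, factors)
```

With this, even at 8 qubits only one 256 × 256 matrix is resident at a time, at the cost of recomputing the Kronecker products element by element and losing the GPU-friendly single-contraction form.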

@renatomello (Contributor)

> mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory. […]

The Hilbert space is too big; unfortunately, there's no way around that. I wouldn't spend much time on this, since there are bigger priorities in terms of optimization.

@BrunoLiegiBastonLiegi (Contributor, Author)

In the end I was not able to make the parallelized numba implementation work; it worked with parallel=False, but there was no speedup in that case. Thus I just rolled back to using einsum in every case. We still get nice scaling on GPUs. I will get rid of the ASCII characters now.

@renatomello renatomello self-requested a review September 25, 2024 05:11
@BrunoLiegiBastonLiegi (Contributor, Author) commented Sep 25, 2024

Ok, I ended up also generalizing qibo.quantum_info.superoperator_transformation.vectorization, which now accepts batches of state vectors or density matrices as well, thus allowing the whole basis to be vectorized in one shot rather than in a loop. In the process, I found a weird behaviour in tensorflow: if you have a complex tensor, the elements with zero real part but nonzero imaginary part are not recovered by nonzero:

```python
import numpy as np
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

a = tf.Variable([0 + 1j, 1 + 0j, 1 + 1j])
tnp.nonzero(a)
# this finds the last two only
# [<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>]
np.nonzero(a)
# numpy gives the correct result
# (array([0, 1, 2]),)
```
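A possible workaround (an assumption on my part, not something tested in this PR) is to compare moduli instead of the raw complex values:

```python
# |a| != 0 should recover all three indices, since the modulus
# is nonzero whenever either the real or imaginary part is nonzero
tnp.nonzero(tnp.abs(a))
# expected: [<tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>]
```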

@renatomello renatomello self-requested a review September 27, 2024 04:20
@renatomello (Contributor)

@stavros11 since you looked into #1462, could you look into this too? Thanks.

@renatomello renatomello added this pull request to the merge queue Oct 8, 2024
Merged via the queue into master with commit 3a7a951 Oct 8, 2024
25 checks passed
@renatomello renatomello deleted the pauli_basis_speedup branch October 8, 2024 11:35