Optimizations for the qibo.quantum_info.basis.pauli_basis and vectorization function #1459

Merged · 26 commits into master from pauli_basis_speedup · Oct 8, 2024

Conversation

@BrunoLiegiBastonLiegi (Contributor) commented Sep 20, 2024

This improves the construction of the Pauli basis by moving everything to tensor notation and removing explicit loops: basis_full is now constructed via a single einsum contraction. This should also scale well with GPU backends; oddly, for standard NumPy on CPU there is no speedup.
These are the results:

| nqubits | old (numpy) | new (numpy) | new (cupy) |
|---------|-------------|-------------|------------|
| 4       | ~1 s        | ~1 s        | ~2.7 s     |
| 7       | ~9.5 s      | ~9.1 s      | ~3.5 s     |
| 8       | killed      | out of memory | out of memory |

The GPU always takes ~1 s to set up. With 8 qubits, the old implementation gets killed together with the shell session, whereas the new one, on both CPU and GPU, raises an out-of-memory error. To run 8 qubits you apparently need ~64 GB of memory (which checks out: 4^8 basis elements of size 2^8 × 2^8 in complex128 take 4^8 · 2^8 · 2^8 · 16 bytes = 2^36 bytes = 64 GiB).

To perform the einsum I originally used all 48 available ASCII characters, which limited us to 48/3 = 16 qubits (unless other characters can be used in einsum); in any case, the memory requirements for 16 qubits would probably be very taxing. I am now using integers as the einsum indices, so this is no longer limited by the number of ASCII characters. It may be possible to obtain a speedup with numba as well, but I still have to investigate.
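For illustration, a minimal sketch in plain NumPy of the idea (not the actual qibo implementation; pauli_basis_full and paulis are illustrative names): each tensor-product factor contributes one "which Pauli" index plus a row/column pair, and a single einsum with integer sublists builds all 4^n basis elements at once.

```python
import numpy as np

# single-qubit Pauli basis, shape (4, 2, 2)
paulis = np.array([
    [[1, 0], [0, 1]],     # I
    [[0, 1], [1, 0]],     # X
    [[0, -1j], [1j, 0]],  # Y
    [[1, 0], [0, -1]],    # Z
])

def pauli_basis_full(nqubits):
    # operand q carries indices (3q, 3q+1, 3q+2): basis label, row, column
    operands = []
    for q in range(nqubits):
        operands += [paulis, [3 * q, 3 * q + 1, 3 * q + 2]]
    # output: all labels first, then all rows, then all columns,
    # so that a final reshape recovers the stack of Kronecker products
    out = ([3 * q for q in range(nqubits)]
           + [3 * q + 1 for q in range(nqubits)]
           + [3 * q + 2 for q in range(nqubits)])
    full = np.einsum(*operands, out)
    return full.reshape(4**nqubits, 2**nqubits, 2**nqubits)

# sanity check against the explicit Kronecker construction
from functools import reduce
from itertools import product
ref = np.stack([reduce(np.kron, ops) for ops in product(paulis, repeat=2)])
assert np.allclose(pauli_basis_full(2), ref)
```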

EDIT 1: unfortunately, einsum is not supported by numba; however, if you jit the old implementation:

```python
from itertools import product

import numpy as np
from numba import njit, prange

basis_single = list(product(basis_single, repeat=nqubits))

@njit(fastmath=True, parallel=True, cache=True)
def _build_basis(basis_single):
    # preallocate the output with placeholder entries
    rows = [basis_single[0][0] for i in range(len(basis_single))]
    for i in prange(len(basis_single)):
        row = basis_single[i][0]
        # note: this krons elements 0..n-2 onto element 0, duplicating the
        # first factor and skipping the last one; basis_single[i][j] should
        # read basis_single[i][j + 1], which may explain the wrong results
        # mentioned below
        for j in range(len(basis_single[i][1:])):
            row = np.kron(row, basis_single[i][j])
        rows[i] = row
    return rows

basis_full = backend.cast(_build_basis(basis_single))
```

you get a nice speedup (~5 s for 7 qubits), though only from the second call onwards, since the first call still pays the compilation cost. Unfortunately, this implementation was yielding wrong results, so I had to roll back to the einsum approach. Further investigation is needed to understand whether it is possible to parallelize and improve this with numba.

EDIT 2:
At some point I realized that a possible bottleneck, or rather inefficiency, was qibo.quantum_info.superoperator_transformation.vectorization, which could only run on a single input, either a state vector or a density matrix, thus forcing loops over each element of the basis (which can grow large very quickly). The impact on runtime is still marginal on CPU (for 7 qubits it contributed roughly 1-2 s out of 10 s), but on GPUs this starts to become relevant. Furthermore, vectorization appears to be widely used in the quantum_info module, which convinced me to generalize it to accept batches of state vectors or density matrices, thereby removing the need for explicit loops and instead leveraging tensor primitives directly. This is applied to pauli_basis for now, but it should be propagated throughout the whole module.
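A minimal NumPy sketch of the batching principle (the name vectorization_batched and the row/column order convention are illustrative, not qibo's exact API): with a leading batch axis, vectorizing a whole stack of density matrices is a single reshape instead of a Python loop.

```python
import numpy as np

def vectorization_batched(states, order="row"):
    # states: (batch, d, d) stack of density matrices
    batch, d, _ = states.shape
    if order == "row":
        # row-major flattening of each matrix
        return states.reshape(batch, d * d)
    if order == "column":
        # column-major: transpose each matrix before flattening
        return states.transpose(0, 2, 1).reshape(batch, d * d)
    raise ValueError(f"unknown order {order!r}")

rho = np.random.rand(16, 4, 4)       # e.g. the 16-element 2-qubit basis
vec = vectorization_batched(rho)     # shape (16, 16), no explicit loop
```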

Checklist:

  • Reviewers confirm new code works as expected.
  • Tests are passing.
  • Coverage does not decrease.
  • Documentation is updated.

codecov bot commented Sep 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.93%. Comparing base (a59e5d7) to head (2fc4114).

Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #1459   +/-   ##
=======================================
  Coverage   99.93%   99.93%
=======================================
  Files          81       81
  Lines       11777    11785    +8
=======================================
+ Hits        11769    11777    +8
  Misses          8        8
```

| Flag | Coverage Δ |
|------|------------|
| unittests | 99.93% <100.00%> (+<0.01%) ⬆️ |


@alecandido (Member) commented Sep 21, 2024

@BrunoLiegiBastonLiegi you don't have to use ASCII characters for einsum:

> einsum also provides an alternative way to provide the subscripts and operands as einsum(op0, sublist0, op1, sublist1, ..., [sublistout]). If the output shape is not provided in this format einsum will be calculated in implicit mode, otherwise it will be performed explicitly. The examples below have corresponding einsum calls with the two parameter methods.

https://numpy.org/doc/stable/reference/generated/numpy.einsum.html#numpy-einsum

In this way, you can use integers instead of characters, which is much better if you are generating the einsum input programmatically (and you're not limited by the number of characters).
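For instance, the same matrix product written in both notations:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(12).reshape(3, 4)

# subscript string: "ij,jk->ik"
c1 = np.einsum("ij,jk->ik", a, b)
# integer sublists: 0 = i, 1 = j, 2 = k
c2 = np.einsum(a, [0, 1], b, [1, 2], [0, 2])
assert np.allclose(c1, c2)
```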

(This is not a suggestion to ban characters for einsum in every situation: they are handy when there are only a few and you are writing them manually, since that is much more readable. But the case where you spell out a small operation by hand and the one where you generate a complex one programmatically have to be handled in different ways, and luckily NumPy offers both.)

@renatomello (Contributor)

One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.

@renatomello renatomello added this to the Qibo 0.2.13 milestone Sep 23, 2024
@renatomello renatomello added the enhancement, documentation, and quantum_info module labels Sep 23, 2024
@BrunoLiegiBastonLiegi (Contributor, Author)

> One way we could move forward is to return generators for n >= 8 and let the user do their own memory management by accessing the elements on the fly.

mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory.
Maybe what we could do is define the generator of a single element of the basis for each backend:

```python
@cache
def _pauli_basis_element(self, i, nqubits):
    # do things according to backend
```

and then you can build the complete basis as a generator as you suggested

```python
def pauli_basis(...):
    return (backend._pauli_basis_element(i, nqubits) for i in range(4**nqubits))
```

but GPU-wise this will be less efficient, and it will only work as long as you don't need the complete basis at the same time.
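A concrete sketch of what such a generator could look like, assuming a plain NumPy backend (pauli_basis_lazy and PAULIS are illustrative names):

```python
import numpy as np
from functools import reduce
from itertools import product

PAULIS = (
    np.eye(2, dtype=complex),
    np.array([[0, 1], [1, 0]], dtype=complex),
    np.array([[0, -1j], [1j, 0]], dtype=complex),
    np.array([[1, 0], [0, -1]], dtype=complex),
)

def pauli_basis_lazy(nqubits):
    # yields the 4**nqubits basis elements one at a time, keeping only a
    # single (2**n x 2**n) matrix in memory instead of the full stack
    for factors in product(PAULIS, repeat=nqubits):
        yield reduce(np.kron, factors)
```

With this, even at 8 qubits only one 256 × 256 matrix is resident at a time, at the cost of recomputing the Kronecker products element by element and losing the GPU-friendly single-contraction form.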

@renatomello (Contributor)

> mmh yeah, that's a possibility, but only in certain cases. For the sake of parallelization, having everything represented by a single tensor contraction is surely better, but you are limited by memory. […]

The Hilbert space is too big; unfortunately, there's no way around that. I wouldn't spend much time on this, since there are bigger priorities in terms of optimization.

@BrunoLiegiBastonLiegi (Contributor, Author)

In the end I was not able to make the parallelized numba implementation work; it worked with parallel=False, but there was no speedup in that case. Thus I just rolled back to using einsum in every case. We still get nice scaling on GPUs. I will get rid of the ASCII characters now.

@renatomello renatomello self-requested a review September 25, 2024 05:11
@BrunoLiegiBastonLiegi (Contributor, Author) commented Sep 25, 2024

Ok, I ended up also generalizing qibo.quantum_info.superoperator_transformation.vectorization, which now accepts batches of state vectors or density matrices as well, thus allowing the whole basis to be vectorized in one shot rather than in a loop. In the process, I found a weird behaviour in tensorflow: if you have a complex tensor, the elements with zero real part but nonzero imaginary part are not recovered by nonzero:

```python
import numpy as np
import tensorflow as tf
import tensorflow.experimental.numpy as tnp

a = tf.Variable([0 + 1j, 1 + 0j, 1 + 1j])
tnp.nonzero(a)
# this finds the last two only
# [<tf.Tensor: shape=(2,), dtype=int64, numpy=array([1, 2])>]
np.nonzero(a)
# numpy gives the correct result
# (array([0, 1, 2]),)
```
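A possible workaround (an assumption on my part, not something tested in this PR) is to compare moduli instead of the raw complex values:

```python
# |a| != 0 should recover all three indices, since the modulus
# is nonzero whenever either the real or imaginary part is nonzero
tnp.nonzero(tnp.abs(a))
# expected: [<tf.Tensor: shape=(3,), dtype=int64, numpy=array([0, 1, 2])>]
```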

@renatomello renatomello self-requested a review September 27, 2024 04:20
@renatomello (Contributor)

@stavros11 since you looked into #1462, could you look into this too? Thanks.

@renatomello renatomello added this pull request to the merge queue Oct 8, 2024
Merged via the queue into master with commit 3a7a951 Oct 8, 2024
25 checks passed
@renatomello renatomello deleted the pauli_basis_speedup branch October 8, 2024 11:35