
Backend review #138

Open
alecandido opened this issue Aug 15, 2023 · 7 comments
Labels
enhancement New feature or request

Comments

@alecandido
Member

This actually spans both Qibojit and Qibo itself, but since it is specific to backends, I decided to avoid polluting Qibo's tracker.

It is only a proposal and definitely not urgent. The goal is to simplify the code (for maintenance), and potentially also ease the implementation of new backends.

The main observation is that most of the work done at the backend level relies on a NumPy-compatible API.
This has been recognized since the beginning, and indeed there is a self.np attribute to access the API specific to each backend.
However, NumPy has far more refined approaches to interoperability, and since they are widely adopted by similar libraries, in principle some of the tasks performed by Qibo could be delegated to the libraries themselves.

In particular, the main mechanisms are __array_ufunc__ and __array_function__, which allow a NumPy call on a foreign object to be handled by the external library defining that object. They are essentially hooks, called by the NumPy function with all the details of the original call.
They work not only on functions processing existing arrays, but also on creation routines, through the like argument (see e.g. np.zeros).
Libraries like CuPy already implement this mechanism themselves. In principle, all the backend methods that only use the NumPy API should not be implemented more than once; at most, the underlying NumPy operations should be hooked, by providing an __array_function__ implementation ourselves (possibly a wrapper over an existing one, if that is not sufficiently complete).
Essentially, we could act at the level of NumPy functions, filling the gaps, instead of at the level of quantum operations.
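As a toy illustration of the protocol (all names here are hypothetical, not Qibo code), a class defining __array_function__ intercepts NumPy calls on its instances:

```python
import numpy as np

# Hypothetical sketch of NEP 18 dispatch: `LoggedArray` is an illustrative
# wrapper type, not part of Qibo. NumPy functions called on it are routed
# to its __array_function__ hook, which receives the original call details.
class LoggedArray:
    def __init__(self, data):
        self.data = np.asarray(data)

    def __array_function__(self, func, types, args, kwargs):
        # Unwrap any LoggedArray arguments and delegate to plain NumPy.
        unwrapped = [a.data if isinstance(a, LoggedArray) else a for a in args]
        print(f"intercepted {func.__name__}")
        return func(*unwrapped, **kwargs)

arr = LoggedArray([1.0, 2.0, 3.0])
total = np.sum(arr)  # handled by LoggedArray.__array_function__
```

A real backend hook would delegate to the accelerator library instead of plain NumPy; this is exactly the mechanism CuPy uses to make np.* calls work on its arrays.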

E.g. the zero_state method is implemented over and over:

but it should always perform the same operations.

In practice, there are many limitations that should be discussed separately:

  • the CuPy backend computes the exponentiation with a bit shift:
    n = 1 << nqubits

    however, this happens purely in Python, so, if more efficient, it could simply be adopted by the single shared implementation (in other places the same backend uses exponentiation)
    state = self.cp.ones(2**nqubits, dtype=self.dtype)
  • the CuPy backend uses a kernel to set an element:
    kernel = self.gates.get(f"initial_state_kernel_{self.kernel_type}")
    state = self.cp.zeros(n, dtype=self.dtype)
    kernel((1,), (1,), [state])

    where NumPy uses "fancy indexing", i.e. arr[idx] = el.
    However, if indexing is a problem for CuPy (or other backends), and in case it would be problematic to hook on its own, NumPy itself has an equivalent function, np.put. From the hooking perspective, the kernel implementation can be the np.put replacement (incidentally, CuPy has the same function, cp.put, and I'm pretty sure it is already hooked - but I also suspect indexing works, and I could quickly check, so I might be missing something about the kernel...)
  • the CuPy backend requires a further operation to finalize the function - but this could also be embedded in one of two ways: adding it to the np.put replacement, or adding it at the end. However, this choice would become global (while currently it can differ method by method); we should investigate whether this is a true limitation (most likely whoever implemented the backend has a better understanding of it)
  • the main outlier in this landscape is TensorFlow: NumPy and similar libraries are working together to standardize interfaces, through NumPy interoperability, the array API, DLPack, and NumPy-like namespaces; but while CuPy and PyTorch back almost all of these efforts, TensorFlow is only mentioned in the last two cases, and always as part of its experimental API. I suspect this might have affected past choices, since TensorFlow is one of the main backends (the only one in Qibo, other than NumPy itself) - in case this is blocking all possible updates, an alternative might be ditching TensorFlow in favor of PyTorch
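On the indexing point above, the equivalence between element assignment and np.put is easy to check in plain NumPy (CuPy exposes the same cp.put):

```python
import numpy as np

# The two element-setting styles discussed above produce the same result;
# a backend could hook either one (e.g. replace np.put with a kernel).
state_a = np.zeros(8, dtype=complex)
state_a[0] = 1  # indexed assignment, arr[idx] = el

state_b = np.zeros(8, dtype=complex)
np.put(state_b, 0, 1)  # functional equivalent, dispatchable via __array_function__

assert np.array_equal(state_a, state_b)
```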

As I said, the main observation is that the current Qibo backends contain a lot of duplicated operations, at a higher level than required (an even better example would be matrices, which should definitely not be repeated more than once).
However, the update would require some effort and some (possibly deep) refactoring of the backends. The good part is that this would be fully internal: there is no need to break any interface for the Qibo user.

Given all these points, take this as a report about an investigation for possible improvements. There is no hurry to do anything.

@alecandido
Member Author

Keep thinking about it: is there a reason to support anything other than NumPy for CPU and CuPy for GPU?

TensorFlow is much more complex than CuPy, and more dissimilar from NumPy.
Numba for CPU should not provide a significant advantage over NumPy if array operations are used (even though there is an advantage: according to the QiboJIT paper, it can simulate 5-6 more qubits, i.e. a factor of ~50... I really wonder where it comes from...), and CuQuantum mostly duplicates CuPy.

If we ever go through this exercise, I'd really consider trimming down the number of backends, in such a way as to support all platforms while shaving off as much overhead as possible.
(In principle, we could swap CuPy with TensorFlow, relegating the existing backends to QiboJIT, but in practice this would mean keeping the whole backends' mechanism as it is, while if we only had to support NumPy and NumPy-compatible libraries for simulation + Qibolab we might be able to refactor more.)

@renatomello
Contributor

renatomello commented Aug 17, 2023

@alecandido I agree with your point about cuquantum and tensorflow. However, there has been some demand for a pytorch backend, which could be a replacement for the tensorflow backend.

@alecandido
Member Author

alecandido commented Aug 17, 2023

@alecandido I agree with your point about cuquantum and tensorflow. However, there has been some demand for a pytorch backend, which could be a replacement for the tensorflow backend.

In the spirit of the first message, PyTorch would definitely be better than TensorFlow.

However, considering the potential simplification, even PyTorch is still one more backend.
@renatomello are you aware of the benefit of a PyTorch backend?
If it's only to use PyTorch somewhere (possibly external to Qibo, or at least outside the circuit simulation), we could use DLPack and friends to cast arrays (zero-copy) from one library to another, without the need for a full backend.

But if there is a need deeply connected to circuit simulation, of course it's much better to plan to include a PyTorch backend from the beginning (if we ever start a refactor; this issue was mostly investigation until now - I just wanted to check whether there is room for improvements and simplifications).

@renatomello
Contributor

@alecandido I personally haven't used pytorch (I just haven't needed it yet). But what I heard from multiple people using Qibo for optimization is that these tensor-based backends allow for automatic differentiation. If one simulates circuits instead of sending them to actual hardware, AD becomes a basic necessity. After that, it's a matter of preferring pytorch over tensorflow in general. But the main point of having at least one tensor-based backend is AD.

@stavros11
Member

stavros11 commented Aug 17, 2023

I like the suggestions of the first post. I need to read it in more detail later, but I agree that the methods of Qibo's AbstractBackend could be simplified.

Other than that, regarding the existing backends:

Numba for CPU should not provide a significant advantage over NumPy if array operations are used (even though there is an advantage, since, according to the QiboJIT paper, it can simulate 5-6 qubits more, i.e. a factor 50... I really wonder where it is coming from...),

The advantage appears only when the custom kernels are used, which apply gates to states and perform some state initialization. All other operations are delegated to numpy. I would say (without real proof) that the advantage comes from the following points, ordered by decreasing importance:

  1. In-place updates. In numba we modify the state vector in place, while np.einsum creates a copy. An easy way to test this:
import qibo
import numpy as np
from qibo import gates, Circuit

qibo.set_backend("qibojit") # or "numpy"

c = Circuit(2)
c.add(gates.H(0))
c.add(gates.H(1))

state = np.random.random(4).astype(complex)
state2 = c(state)
print(state)
print(state2)

With numpy, state2 != state, while with numba, state2 == state (the initial state is overwritten in place).

  2. Numpy is single-threaded, while our numba kernels use parallelization (prange) to take advantage of multi-threaded CPUs. That being said, maybe there are simpler ways to make Numpy (in particular np.einsum) compatible with multi-threading.
  3. We use some binary operations to find the indices during gate multiplications, which are fast; however, we never really proved how much advantage we get from this. I am guessing that the low-level implementation of
np.einsum("ec,abcd->abed", gate, state)

which applies a single-qubit gate to the 3rd qubit of a 4-qubit state, uses similar tricks, but I have never checked the actual code.
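For concreteness, the quoted call can be run on a toy state (a sketch with a Hadamard gate; the indices abcd label the four qubits, and the contracted index c is qubit 2, 0-based):

```python
import numpy as np

# Apply a single-qubit gate (here a Hadamard) to qubit 2 (0-based)
# of a 4-qubit state, via the einsum call quoted above.
nqubits = 4
gate = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)

state = np.zeros(2**nqubits, dtype=complex)
state[0] = 1  # |0000>
state = state.reshape(nqubits * (2,))

# Contract the gate index `c` with qubit 2 of the state tensor.
state = np.einsum("ec,abcd->abed", gate, state)

amplitudes = state.ravel()  # (|0000> + |0010>) / sqrt(2)
```

Note that this allocates a fresh output array, which is the copy mentioned in point 1.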

and CuQuantum is mostly duplicating CuPy.

That's true: CuQuantum is there only to support an additional backend, which is backed by NVIDIA, and to allow easy benchmarking (CuPy vs CuQuantum). It does not offer any additional features.

TensorFlow is much more complex than CuPy, and more dissimilar to NumPy.

As @renatomello said, the main motivation for using Tensorflow is automatic differentiation. Compared to numpy, it also supports multi-threading and GPUs, but it is still slower than qibojit, primarily due to creating copies (point 1 above), which are needed for automatic differentiation. Indeed, there are alternative backends we could add for this purpose (PyTorch, JAX, etc.); I think we only have TensorFlow for historical reasons, as we started with it and with qibotf, the predecessor of qibojit.

@alecandido
Member Author

Thanks, @stavros11, for the summary; I believe everything should now be clear enough.

My current understanding is that we'll need:

  1. basic and parallel CPU support
  2. hardware accelerators support (mostly GPU, but if possible any)
  3. automatic differentiation

So, I'm not sure that point 3. is strictly required for simulation, because strict simulation cannot differentiate a circuit (otherwise the same code would not run on hardware out of the box).
However, we could assume that we want it; unless it blocks greater improvements, it would also be fine like that.

On the one side, I have always been tempted to add a further requirement: going beyond Python. However, this, together with the three above, would be incredibly time-consuming, and I'm pretty sure it's not worth it for the current state of the project.
In Python many array libraries are available, with a NumPy-like API and broad hardware support, while to move to C the only strategy I can think of would be to use XLA directly, with all the niceties of Bazel...

Speaking of XLA, it seems like all the major ML frameworks are using it (in particular TensorFlow, JAX, and even PyTorch), and it should satisfy all the conditions above on its own.

Thinking twice, I actually wonder whether it would be worth investigating CuPy vs XLA-based libraries more deeply. Because if JAX or PyTorch are good enough (maybe not TF, since it's the least interoperable one, and it already "failed" somehow), and they support all the use cases, why should we dedicate effort to developing/maintaining multiple simulation backends ourselves?

Eventually, if we really needed something more fine-grained than what these libraries provide, a trip into XLA itself might even be worth it (but I really hope not, at least for a long while... also because we would lose all/most of the autodiff...).
In particular: is there anything to be executed on GPU or differentiated that cannot be implemented with TensorFlow(/...)?

P.S.: about the copies, I was worried the problem could persist with the others, but there is room in JAX and PyTorch (all the trailing-underscore methods) for in-place operations. However, since it could even be an outer product, there is no way that einsum could be in-place (the output can require more memory than the input); we'd need explicit contractions (as I believe you implemented in qibojit).
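For reference, an explicit in-place contraction for a single-qubit gate could look like the following sketch (an assumption about the general shape of such kernels, not qibojit's actual code):

```python
import numpy as np

# Sketch of an explicit in-place single-qubit gate application:
# the state vector is updated pairwise, so no einsum-style copy is made.
def apply_gate_inplace(state, gate, nqubits, target):
    k = 1 << (nqubits - target - 1)  # stride separating each amplitude pair
    for i in range(0, 2**nqubits, 2 * k):
        for j in range(i, i + k):
            a, b = state[j], state[j + k]
            state[j] = gate[0, 0] * a + gate[0, 1] * b
            state[j + k] = gate[1, 0] * a + gate[1, 1] * b
    return state

h = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
state = np.zeros(4, dtype=complex)
state[0] = 1
apply_gate_inplace(state, h, nqubits=2, target=0)  # (|00> + |10>) / sqrt(2), in place
```

In numba, the outer loop would become a prange over the pairs; this pairwise update is exactly what einsum cannot express without allocating its output.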

@renatomello
Copy link
Contributor

renatomello commented Aug 18, 2023

So, I'm not sure that point 3. is strictly required for simulation, because strict simulation can not derive a circuit (otherwise the same code would not run on hardware out of the box).
However, we could assume that we want it, unless it's blocking greater improvements it would also be fine like that.

Yes, AD is much better for gradient simulation than any other hardware-compatible method. So it is very necessary to keep.

Getting the same computational complexity as AD on hardware is actually a hot topic right now in QML circles, and there are some theoretical results showing that it may even be impossible for a general circuit without violating complexity bounds. Of course, it can still be possible for specific circuits.

But the point is that AD is indispensable.

@scarrazza scarrazza added the enhancement New feature or request label Aug 29, 2023