
Add support for BLAS SVD functions in MPS simulation #1897

Merged (20 commits) Jan 10, 2024

Conversation

Patataman
Contributor

@Patataman Patataman commented Aug 21, 2023

Summary

Hello, in this PR I add support for using the OpenBLAS/LAPACK SVD functions to replace Qiskit's sequential SVD implementation in https://github.com/Qiskit/qiskit-aer/blob/main/src/simulators/matrix_product_state/svd.cpp#L148 for the MPS simulation.

Details and comments

Some points about the implementation:

  1. I could not work out why you are not using the OpenBLAS library for the SVD. Therefore, as a first approach, I have used an environment variable, QISKIT_LAPACK_SVD, to activate/deactivate the replacement, to simplify testing and benchmarking. From the code and comments I understood that the current implementation is based on this paper (https://dl.acm.org/doi/10.1145/363235.363249), so LAPACK's zgesvd function should compute the same thing. If you agree with the replacement, it is as simple as removing the old code and keeping the new.
  2. I have implemented support for both the zgesvd and zgesdd SVD routines. Why? Because zgesdd is a divide-and-conquer approach whose performance on bigger matrices is much better than zgesvd's. Ideally the selection would be automatic, based on the matrix size, but as a first approach (again) it can be selected manually by setting QISKIT_LAPACK_SVD=DC.
  3. The PR also includes a couple of changes around the print_to_log functions. While profiling, I saw that these functions were quite expensive, and outside of debugging you usually don't care about the logs, so I wrapped them in #ifdef DEBUG / #endif to improve performance. If I am wrong, say so and I will undo it.

Finally, to see whether this improves performance, I have used Random Quantum Circuits (https://arxiv.org/pdf/2207.14280.pdf) on this server configuration:

  • CPU: Intel Xeon Gold 6148
  • Sockets: 2
  • Cores: 20
  • RAM: 192 GB
  • GPU: none
  • OS: Ubuntu 22.04.1 LTS
  • Python: 3.10
  • OpenBLAS/LAPACK: 0.3.21
  • gcc: 11.3


And the average time (in seconds) over 5 executions:

| Depth | Base | LAPACK SVD | LAPACK D&C |
|------:|------------:|------------:|------------:|
| 1 | 0.043029881 | 0.04652729 | 0.049358368 |
| 3 | 0.116557598 | 0.109769773 | 0.126575661 |
| 5 | 0.14433198 | 0.148080826 | 0.16601491 |
| 10 | 74.70666704 | 28.21881118 | 21.8562469 |
| 12 | 1568.871852 | 429.5337416 | 208.0082648 |
| 15 | 33927.16216 | 12694.68723 | 1790.687791 |

As you can see, for the deepest circuits the current implementation takes (several) hours, while the D&C approach takes minutes.

@CLAassistant

CLAassistant commented Aug 21, 2023

CLA assistant check
All committers have signed the CLA.

@Patataman Patataman changed the title Pr lapack svd Add support for BLAS SVD functions in MPS simulation Aug 21, 2023
@doichanj doichanj added the enhancement New feature or request label Aug 21, 2023
@doichanj doichanj self-requested a review August 22, 2023 14:35
@doichanj
Collaborator

Could you fix the errors so that all checks pass?

@Patataman
Contributor Author

Patataman commented Aug 24, 2023

With the last commit I think I have solved the Windows build problem.

However, for macOS the error is:
/Applications/Xcode_14.2.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/wchar.h:123:15: fatal error: 'wchar.h' file not found
and I have no idea how to fix it. I have never developed on macOS, nor did I modify anything related to CMake in this PR... Any idea how to address it?

EDIT: No, I didn't fix everything...

@Patataman
Contributor Author

Finally, all ok :)

@doichanj doichanj added the performance Performance improvements label Aug 24, 2023
Contributor

@merav-aharoni merav-aharoni left a comment


This is a very nice performance improvement for MPS. To answer your question about why we didn't implement this: it was planned, but we never actually got to it.
I reviewed your code, but I am not very familiar with LAPACK, so I cannot say much about that part.
A few general comments:

  1. Most important: I saw the performance comparison, which looks very nice. However, did you compare results for deep and large circuits? In your test the circuit is shallow, which is fine for regression, but before merging please check the results. In particular, I know the previous version used long double precision in some places, but LAPACK uses double, so this might make a difference.
  2. Also compare results when using approximation.
  3. I think it would be a good idea to turn on the validate_SVD_result function for the near future after this is merged. @doichanj - you should probably turn this off again after a few months of usage.
  4. Add to the documentation the actual matrix sizes at which one algorithm becomes better than the other.
  5. Also, it would be best to choose between the two algorithms automatically, depending on the matrix size.
  6. It is not a good idea to include the print_to_log change here; best to separate it into a different PR. In any case, I am not sure this change is needed after my comment above.

@Patataman
Contributor Author

Sorry for the delay, and thanks for the comments.

I will try to fix and change the things you mentioned. However, this PR was part of my work during an internship that has already finished, so I no longer have access to the "big" server, and it will take me a while to replicate everything. I will come back to you once things are done.

@Patataman
Contributor Author

Patataman commented Oct 9, 2023

Hello, as mentioned before, I no longer have access to the nice server I was using, and I didn't manage to get access to anything similar, so now (sadly) I am using my laptop. Nevertheless, I can simulate up to 16-18 qubits, which I think is enough for the remaining tests.

Here is the plot for deeper circuits, from depth 10 to 200 with 16 qubits, using the automatic selector between the QR-based and divide-and-conquer SVD functions in LAPACK.


Time in seconds

| Depth | Base | LAPACK SVD |
|------:|------------:|------------:|
| 10 | 6.201454782 | 3.320371771 |
| 20 | 11.47285843 | 3.320371771 |
| 40 | 22.15550818 | 4.246894503 |
| 60 | 32.74408193 | 7.875970221 |
| 80 | 43.35610499 | 10.47702546 |
| 100 | 53.8639502 | 11.49814119 |
| 150 | 80.57371044 | 15.84590697 |
| 200 | 107.1309861 | 20.91695046 |

I still have more things to do, like the approximation (I tested it a little back then and it looked good, but I didn't generate speed-up graphs), code style, and documentation.

@doichanj doichanj added this to the Aer 0.14.0 milestone Oct 11, 2023
@Patataman
Contributor Author

Patataman commented Oct 12, 2023

Results using approximation. I used 16 qubits and depth 40, as that was the configuration with the biggest speed-up in the previous results.
For the fidelity I used state_fidelity on the statevector from the execution without LAPACK and the one from the execution with LAPACK.

This first plot uses the truncation_threshold parameter:

Time in seconds:

| Threshold | Base | LAPACK SVD |
|----------:|------------:|------------:|
| 1e-16 | 17.64385796 | 5.667835045 |
| 1e-10 | 17.89117179 | 5.667835045 |
| 1e-8 | 16.61686311 | 7.67423768 |
| 1e-6 | 14.70690207 | 7.358585072 |
| 1e-4 | 13.58314958 | 6.801743698 |
| 1e-2 | 17.81055226 | 5.805790138 |

This second plot uses the max_bond_dimension parameter, over a wide range of values, since the effect depends a lot on the matrix sizes in the SVD:

Time in seconds:

| Bond Dimension | Base | LAPACK SVD |
|---------------:|-------------:|-------------:|
| 10 | 0.06917600632 | 0.07078566551 |
| 20 | 0.1661295414 | 0.1548344612 |
| 40 | 0.8819941998 | 0.443166399 |
| 60 | 1.863601351 | 0.8205073357 |
| 80 | 3.030164289 | 1.139424276 |
| 100 | 4.378917599 | 3.04997468 |
| 150 | 10.09798179 | 4.118888235 |
| 200 | 12.50887923 | 4.118888235 |

SV fidelity is usually 0.999999..., which I guess is just floating-point error and nothing to worry about.

The documentation is still left to do.

@merav-aharoni
Contributor

Hi @Patataman - nice work! I am not sure I understand your graphs. In the table above, the performance improvement seems to be ~x2000 or so, but in your current graphs the improvement seems to be x3 at most. What am I missing?
I think it would be helpful to plot the performance of the existing version and of the new version, rather than plotting the improvement.

@Patataman
Contributor Author

> Hi @Patataman - nice work! I am not sure I understand your graphs. In the table above, it seems the performance improvement is ~x2000 or so. In your current graphs the improvement seems to be x3 at most. What am I missing? I think it would be helpful to plot the performance of the existing version and of the new version, rather than plotting the improvement.

The main differences are:

  • In the original post I was using a nice server with 40 cores; now I have 8, so parallelism is very limited compared with the original results.
  • The original results were with 30 qubits, but now I can only simulate up to 16 because of execution time and RAM.

Execution times are strongly tied to the size of the matrices passed to the SVD function. During my original evaluation I saw that this size is closely related to the number of qubits and the circuit's entanglement. Therefore, since I am now using 16 qubits, the matrix sizes and the speed-up gains are smaller.

Also, in the original results the maximum speed-up was ~x20, not ~x2000; maybe if you extrapolate it looks like the potential speed-up is x2000, but as mentioned before it does not depend so much on the depth of the circuit as on the number of qubits. If I remember correctly, the maximum matrix size for a fully entangled circuit was something like 2^(n-1), with n the number of qubits.

I will edit the previous post to include the execution times with approximation

@@ -140,9 +141,28 @@ double reduce_zeros(cmatrix_t &U, rvector_t &S, cmatrix_t &V,
return discarded_value;
}

void validate_SVdD_result(const cmatrix_t &A, const cmatrix_t &U,
Contributor Author

@Patataman Patataman Oct 26, 2023


I added this function to avoid applying AER::Utils::dagger to V every time in lapack_csvd_wrapper.

@doichanj doichanj merged commit 86a27e3 into Qiskit:main Jan 10, 2024
31 checks passed