Experiments on accelerating algorithms #1

Open
3 of 7 tasks
jurihock opened this issue May 18, 2022 · 1 comment

Comments


jurihock commented May 18, 2022

  • vDSP vector based processing (Apple specific)
  • Metal GPU parallelization (Apple specific)
  • OpenCL GPU parallelization
  • Limiting signal bandwidth
  • Utilize both CPU and GPU simultaneously
  • atan2 approximation (instead of std::arg)
  • Reduce sample rate (e.g. downsample input and upsample output)

jurihock commented May 18, 2022

vDSP

The vDSP library looks interesting, since it can be mixed with native C++ code. However, the complex number arithmetic appears to require the "split complex" memory layout instead of the regular interleaved std::complex representation.

The forward SDFT can be implemented by the following sequence of vDSP calls:

vDSP_vsadd
vDSP_zvmul
vDSP_zvadd
vDSP_zvsub
vDSP_zvsub
vDSP_zvzsml

Compared to the vanilla C++ implementation, I can't see any significant performance difference, just the same time measurements and the same CPU usage, so 👎.

The allocated memory is already aligned as required by default, and the clang compiler seems to be doing its job very well.

According to the LLVM docs, auto-vectorization is enabled by default. When it is explicitly switched off via the compiler flags -fno-vectorize -fno-slp-vectorize, the difference is noticeable.

Metal

The first Metal experiment reveals a typical "command queue" overhead problem. Although the SDFT can be computed in parallel for a single sample, the same computation must be repeated sequentially for all samples of the frame buffer. Maybe indirect command encoding can help to deal with that.

OpenCL

Same story as with Metal... The OpenCL 2.0 spec describes a mechanism for enqueuing kernels from kernels. Still not sure if, and for how long, OpenCL 2.0 will be supported by Apple. It's still OpenCL 1.2 in 2022...

Limiting signal bandwidth

Probably the fastest way of computing the SDFT is not computing it at all... One main feature of the SDFT is its arbitrary spectral resolution, and thus the possibility of limiting the signal bandwidth to save CPU cycles.

As long as the source signal bandwidth is known in advance, there is no need to compute all spectral bands at the analysis step. At the synthesis step, the destination signal bandwidth can also be adjusted according to the applied pitch shifting factor.

Utilize both CPU and GPU simultaneously

If delayed by one frame, the computation task can be split between the CPU and GPU. For example, in the case of the SDFT, the frame size can be reduced to something like 64 or 32 samples, which results in a latency of about 1 ms at 44.1 kHz and is still an order of magnitude better than the STFT.

Reduce sample rate

This is currently the most useful hack, which is actually another form of bandwidth limitation.

E.g. the sample rate conversion 48000 Hz (ADC) => 16000 Hz (DSP) => 48000 Hz (DAC) works just fine on the CPU, with headroom left for spectral processing.

jurihock added a commit that referenced this issue Jun 18, 2022
jurihock added a commit that referenced this issue Jan 2, 2023
jurihock added a commit that referenced this issue Jan 2, 2023
jurihock added a commit that referenced this issue Jan 2, 2023