Experiments on accelerating algorithms #1

Open
3 of 7 tasks
jurihock opened this issue May 18, 2022 · 1 comment

Comments


jurihock commented May 18, 2022

  • vDSP vector based processing (Apple specific)
  • Metal GPU parallelization (Apple specific)
  • OpenCL GPU parallelization
  • Limiting signal bandwidth
  • Utilize both CPU and GPU simultaneously
  • atan2 approximation (instead of std::arg)
  • Reduce sample rate (e.g. downsample input and upsample output)

jurihock commented May 18, 2022

vDSP

The vDSP library looks interesting, since it can be mixed with native C++ code. However, the complex number arithmetic appears to require the "split complex" memory layout instead of the regular interleaved std::complex representation.

The forward SDFT can be implemented by the following sequence of vDSP calls:

vDSP_vsadd
vDSP_zvmul
vDSP_zvadd
vDSP_zvsub
vDSP_zvsub
vDSP_zvzsml

Compared to the vanilla C++ implementation, I can't see any significant performance difference, just the same time measurements and the same CPU usage, so 👎.

The allocated memory is already aligned as required by default, and the clang compiler seems to be doing its job very well.

According to the LLVM docs, auto-vectorization is enabled by default. When it is explicitly switched off via the compiler flags -fno-vectorize -fno-slp-vectorize, the difference is noticeable.

Metal

The first Metal experiment reveals a typical "command queue" overhead problem. Although the SDFT can be computed in parallel for a single sample, the same computation must be repeated sequentially for all samples of the frame buffer. Maybe indirect command encoding can help to deal with that.

OpenCL

Same story as with Metal... The OpenCL 2.0 spec describes a mechanism for enqueuing kernels from kernels. Still not sure if, and for how long, OpenCL 2.0 will be supported by Apple. It's still OpenCL 1.2 in 2022...

Limiting signal bandwidth

Probably the fastest way of computing the SDFT is not computing it at all... One main feature of the SDFT is its arbitrary spectral resolution, and thus the possibility of limiting the signal bandwidth to save CPU cycles.

As long as the source signal bandwidth is known in advance, there is no need to compute all spectral bands at the analysis step. At the synthesis step, the destination signal bandwidth can also be adjusted according to the applied pitch shifting factor.

Utilize both CPU and GPU simultaneously

If delayed by one frame, the computation task can be split between the CPU and GPU. For example, in the case of the SDFT, the frame size can be reduced to something like 64 or 32 samples, which results in a latency of about 1 ms at 44.1 kHz and is still an order of magnitude better than the STFT.

Reduce sample rate

This is currently the most useful hack, which is actually another form of bandwidth limitation.

E.g. the sample rate conversion 48000 Hz (ADC) => 16000 Hz (DSP) => 48000 Hz (DAC) works just fine on the CPU, with headroom left for spectral processing.

jurihock added a commit that referenced this issue Jun 18, 2022
jurihock added a commit that referenced this issue Jan 2, 2023
jurihock added a commit that referenced this issue Jan 2, 2023
jurihock added a commit that referenced this issue Jan 2, 2023