In some multi-threaded versions of OpenBLAS, the ICOPY_OPERATION (packing for A) seems to take 6-13X more time than the actual KERNEL_OPERATION(s). Is it possible to make this operation multi-threaded?
/* Copy local region of A into workspace */
START_RPCC();
ICOPY_OPERATION(min_l, min_i, a, lda, ls, m_from, sa);
STOP_RPCC(copy_A);
I don't think anybody has tried yet, but there is probably no fundamental argument against doing it, apart from the fact that the copy operation itself is already called from a multithreaded region. The least invasive way of trying would probably be to add the blas_level1_thread() mechanism to an existing ICOPY kernel, similar to how some of the DOT kernels are parallelized on arm64 and x86_64 targets. I must admit I have not really thought this through, though; this is just a first impression.
I think we already have an open issue about low performance of ICOPY, though that may have been more concerned with the use of outdated instructions on modern targets.
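To make the idea concrete, here is a minimal standalone sketch of splitting a packing copy across threads. It is not OpenBLAS code and does not use blas_level1_thread(); it swaps in a plain OpenMP loop instead. The function name pack_a_parallel, the panel height MR, and the packed layout are all assumptions made for illustration, and the real ICOPY kernels use target-specific layouts.

```c
#include <assert.h>
#include <stddef.h>

#define MR 4 /* hypothetical micro-panel height, not an OpenBLAS constant */

/* Pack a rows x cols block of a column-major matrix `a` (leading dimension
 * lda) into `sa` as consecutive panels of MR rows. Within a panel, the MR
 * elements of each column are stored contiguously; rows past the end of the
 * block are zero-padded. `sa` must hold ceil(rows / MR) * MR * cols doubles.
 * Each panel writes a disjoint slice of `sa`, so panels can be packed
 * independently by different threads. */
void pack_a_parallel(const double *a, size_t lda, size_t rows, size_t cols,
                     double *sa, int nthreads)
{
    assert(nthreads > 0);
    long npanels = (long)((rows + MR - 1) / MR);

    /* Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for num_threads(nthreads) schedule(static)
    for (long p = 0; p < npanels; p++) {
        size_t i0 = (size_t)p * MR;    /* first source row of this panel */
        double *dst = sa + i0 * cols;  /* start of this panel in the packed buffer */
        for (size_t l = 0; l < cols; l++) {
            for (size_t r = 0; r < MR; r++) {
                size_t i = i0 + r;
                *dst++ = (i < rows) ? a[i + l * lda] : 0.0;
            }
        }
    }
}
```

Build with the compiler's OpenMP flag (for example -fopenmp with GCC) to actually get threads. The caveat mentioned above still applies: if this were called from inside an already-parallel region, the inner loop would either be serialized or oversubscribe the cores, depending on how the OpenMP runtime is configured.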