This library implements the `kronmult_batched` function, which computes `output[k] += kron(matrix_list[k]) * input[k]` (`k` being an index in a batch). It is a batched version of the product of the Kronecker product of several matrices with a given vector.
We provide efficient parallel implementations, for both CPU (using OpenMP) and GPU (using CUDA), that have been tuned for the needs of ASGarD. In particular, we expect the input matrices to be stored in *col-major* order and some of the output pointers to overlap.
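Overlapping output pointers mean that several batch elements may add into the same array at the same time, so the accumulation has to be thread-safe. The sketch below shows one common way to do this on CPU with an OpenMP atomic update. It is an illustration under stated assumptions, not the library's code: `atomic_accumulate`, `accumulate_batch` and `results_batched` are hypothetical names, and the actual implementation may use a different strategy.

```cpp
#include <cstddef>

// Hypothetical helper: adds `values` into `output` one element at a time.
// Each update is atomic, so concurrent calls on the same `output` are safe.
template <typename T>
void atomic_accumulate(T* output, const T* values, std::size_t size)
{
    for (std::size_t i = 0; i < size; ++i)
    {
        #pragma omp atomic
        output[i] += values[i];
    }
}

// Hypothetical helper: accumulates one result array per batch element.
// Several entries of `output_batched` may point to the same array.
template <typename T>
void accumulate_batch(T* const output_batched[], const T* const results_batched[],
                      std::size_t size, int nb_batch)
{
    #pragma omp parallel for
    for (int k = 0; k < nb_batch; ++k)
    {
        atomic_accumulate(output_batched[k], results_batched[k], size);
    }
}
```

Without the atomic update, two threads writing to the same output element could each read the old value and one of the two contributions would be lost.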
We implement a variant of the backward version of Algorithm 993 (Algorithm 993: Efficient Computation with Kronecker Products), chosen to perform well on col-major matrices and to take into account the fact that the right-hand side is a vector and not a matrix (thus not needing an additional transposition).
We highly recommend reading "On Kronecker Products, Tensor Products and Matrix Differential Calculus" by Stephen Pollock to get more familiar with the linear algebra and reshaping tricks used in the algorithm.
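The reshaping trick can be illustrated on the two-matrix case. The sketch below is not the library's implementation: it assumes two square col-major matrices of the same size `n`, and the function names `kron_matvec_naive` and `kron_matvec_reshaped` are hypothetical. It relies on the identity kron(A, B) * vec(X) = vec(B * X * A^T), where vec stacks the columns of the `n` by `n` matrix X, so the Kronecker product never has to be formed explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Naive reference: applies kron(A, B) entry by entry, O(n^4) operations.
// All matrices are col-major: M(i, j) is stored at M[i + j * n].
std::vector<double> kron_matvec_naive(const std::vector<double>& A, const std::vector<double>& B,
                                      const std::vector<double>& x, std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n && x.size() == n * n);
    std::vector<double> y(n * n, 0.0);
    // kron(A, B)(i1 * n + i2, j1 * n + j2) = A(i1, j1) * B(i2, j2)
    for (std::size_t i1 = 0; i1 < n; ++i1)
        for (std::size_t i2 = 0; i2 < n; ++i2)
            for (std::size_t j1 = 0; j1 < n; ++j1)
                for (std::size_t j2 = 0; j2 < n; ++j2)
                    y[i1 * n + i2] += A[i1 + j1 * n] * B[i2 + j2 * n] * x[j1 * n + j2];
    return y;
}

// Reshaped version: reads x as a col-major n x n matrix X and computes
// Y = B * X * A^T with two small matrix products, O(n^3) operations.
std::vector<double> kron_matvec_reshaped(const std::vector<double>& A, const std::vector<double>& B,
                                         const std::vector<double>& x, std::size_t n)
{
    assert(A.size() == n * n && B.size() == n * n && x.size() == n * n);
    std::vector<double> T(n * n, 0.0);
    std::vector<double> y(n * n, 0.0);
    // T = B * X
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t r = 0; r < n; ++r)
                T[r + c * n] += B[r + k * n] * x[k + c * n];
    // Y = T * A^T, that is Y(r, c) = sum over k of T(r, k) * A(c, k)
    for (std::size_t c = 0; c < n; ++c)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t r = 0; r < n; ++r)
                y[r + c * n] += T[r + k * n] * A[c + k * n];
    return y;
}
```

Applying the same idea one matrix at a time extends it to a Kronecker product of more than two matrices, which is why the cost stays far below that of building the full product.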
You can use either the `kronmult_omp` (CPU parallelism) or the `kronmult_gpu` (GPU parallelism) CMake target to link this library. See the corresponding folders for further information on both installation and implementation.
Include either `kronmult.hpp` (CPU) or `kronmult.cuh` (GPU) to get access to the `kronmult_batched` function, which computes `output[k] += kron(matrix_list[k]) * input[k]` for `0 <= k < nb_batch`, assuming that some output pointers will be equal (thus requiring a thread-safe addition).
void kronmult_batched(int const matrix_number, int const matrix_size, T const * const matrix_list_batched[], int const matrix_stride,
T* input_batched[], T* output_batched[], T* workspace_batched[], int const nb_batch)
- `matrix_list_batched` is an array of `nb_batch * matrix_count` pointers to square matrices of size `matrix_size` by `matrix_size` and stride `matrix_stride` (`matrix_count` is the `matrix_number` parameter of the signature)
- `input_batched` is an array of `nb_batch` pointers to arrays of size `matrix_size ^ matrix_count`
- `output_batched` is an array of `nb_batch` pointers to arrays of size `matrix_size ^ matrix_count`, to which the outputs will be added
- `workspace_batched` is an array of `nb_batch` pointers to arrays of size `matrix_size ^ matrix_count`, to be used as workspaces

Note that:

- `input_batched` and `workspace_batched` will be used as temporary workspaces and thus modified
- the matrices are assumed to be stored in col-major order
- the sizes are assumed to be correct
- the GPU version assumes that all the arrays have already been allocated on the GPU (using `cudaMalloc` for example)
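The argument layout described above can be sketched as a small host-side helper that allocates the arrays and fills the pointer tables. This is only an illustration: `KronmultBatch` is a hypothetical type that is not part of the library, it assumes densely packed matrices (`matrix_stride == matrix_size`), it assumes that the matrices of batch element `k` occupy entries `k * matrix_count` to `(k + 1) * matrix_count - 1` of `matrix_list_batched`, and it makes every output pointer target the same array to mimic overlapping outputs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the library): owns the storage of one batch
// and exposes pointer arrays shaped like the arguments of kronmult_batched.
struct KronmultBatch
{
    int matrix_count, matrix_size, matrix_stride, nb_batch;
    std::size_t vector_size; // matrix_size ^ matrix_count
    std::vector<double> matrices, inputs, workspaces, output;
    std::vector<const double*> matrix_list_batched;
    std::vector<double*> input_batched, output_batched, workspace_batched;

    KronmultBatch(int matrix_count_, int matrix_size_, int nb_batch_)
        : matrix_count(matrix_count_), matrix_size(matrix_size_),
          matrix_stride(matrix_size_), // assumption: no padding between columns
          nb_batch(nb_batch_), vector_size(1)
    {
        for (int m = 0; m < matrix_count; ++m) vector_size *= matrix_size;
        const std::size_t matrix_elements = std::size_t(matrix_stride) * matrix_size;

        matrices.assign(std::size_t(nb_batch) * matrix_count * matrix_elements, 0.0);
        inputs.assign(nb_batch * vector_size, 0.0);
        workspaces.assign(nb_batch * vector_size, 0.0);
        output.assign(vector_size, 0.0);

        for (int k = 0; k < nb_batch; ++k)
        {
            // assumed ordering: the matrix_count matrices of batch element k are contiguous
            for (int m = 0; m < matrix_count; ++m)
                matrix_list_batched.push_back(
                    matrices.data() + (std::size_t(k) * matrix_count + m) * matrix_elements);
            input_batched.push_back(inputs.data() + k * vector_size);
            workspace_batched.push_back(workspaces.data() + k * vector_size);
            output_batched.push_back(output.data()); // all outputs overlap on purpose
        }
    }

    // The pointer tables refer to this object's own storage, so copying is disabled.
    KronmultBatch(const KronmultBatch&) = delete;
    KronmultBatch& operator=(const KronmultBatch&) = delete;
};

// With kronmult.hpp included, the CPU call would then look like:
// kronmult_batched(b.matrix_count, b.matrix_size, b.matrix_list_batched.data(), b.matrix_stride,
//                  b.input_batched.data(), b.output_batched.data(), b.workspace_batched.data(),
//                  b.nb_batch);
```

For the GPU version, the same layout applies, but the data arrays and the pointer tables would need to live in device memory rather than in `std::vector` storage.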