Implementations of selected sparse matrix formats for linear algebra supporting scientific and machine learning applications.
Machine learning applications typically model entities as vectors of numerical features so that they may be compared and analysed quantitively. Typically the majority of the elements in these vectors are zeros. In the case of text mining applications, each document within a corpus is represented as a vector and its features represent the vocabulary of unique words. A corpus of several thousand documents might utilise a vocabulary of hundreds of thousands (or perhaps even millions) of unique words but each document will typically only contain a couple of hundred unique words. This means the number of non-zero values in the matrix might only be around 1%.
Sparse matrix formats capitalise on this premise by only storing the non-zero values thereby reducing both storage/memory requirements and processing effort for manipulating the data. Sparse matrices can effectively be divided into 3 main categories:
-
Creational - Sparse matrix formats suited to construction and building of matrices. Matrix formats in this category include DOK (Dictionary Of Keys) and COO (COOrdinate aka triplet).
-
Operational - Sparse matrix formats suited to arithmetic operations e.g. multiplication. Matrix formats in this category include CSR (Compressed Sparse Row aka CRS - Compressed Row Storage) and CSC (Compressed Sparse Column aka CCS - Compressed Column Storage)
-
Specialised - Specialised matrix formats suiting specific sparsity patterns. Matrix formats in this category include DIA (DIAgonal) for efficiently storing and manipulating symmetric diagonal matrices.
A common practice is to construct sparse matrices using a creational format e.g. DOK or COO and then convert them to an operational format e.g. CSR for arithmetic operations.
- DOK (Dictionary Of Keys) format
- COO (COOrdinate) format (sometimes referred to as 'triplet')
- CSR (Compressed Sparse Row) format
- CSC (Compressed Sparse Column) format
- DIA (DIAgonal) format
- CSR dot product (matrix multiplication) of 2 matrices (with optimisations for operands of type DIA (as LHS or RHS operand), CSR (LHS operand only) and CSC (RHS operand only when LHS operand is CSR) but supporting any implementation of gonum/mat64.Matrix interface).
- CSR addition of 2 matrices (with optimisations for operands of type CSR but supporting any implementation of gonum/mat64.Matrix interface).
- Row and column slicing of CSR and CSC types.
- Further optimisations of CSR dot product for sparse matrix type operands (only considering non-zero values as with CSR operands currently), even as RHS operand ((AB)^T = B^T A^T)
- Consider implicitly converting sparse matrix operands to CSR/CSC for arithmetic operations
- Implement Parallel/fast matrix multiplication algorithm for sparse matrices
- Implement further arithmetic operations e.g. subtract, divide, element wise multiplication, etc.
- Consider utilising LAPACK/BLAS/etc. to perform matrix arithmetic (as an option if available on host).
- Improve memory allocation for matrix multiplication - pre-calculating sparsity pattern for product and allocate storage in advance rather than incrementally.
- standard API for iteration over non-zero elements for each sparse format type
- row and column slicing of other types