Added new specialized sparse-matrix extensions. #924

m4rs-mt · 2023-02-06T16:26:18Z

This PR adds new helper classes and methods to process sparse matrices on the GPU while supporting specific masking operations that avoid computing intermediate values. It also adds a new ConcurrentStreamProcessor to perform generic operations concurrently on multiple GPU streams.

A sample use case that multiplies masked sparse matrices is shown below. There are two options available:

- Create a single-stream-focused masked multiplier to have explicit control over matrix multiplications
- Create a stream-based concurrent processor to perform batched multiplications

// Create a single-stream masked sparse matrix multiplier
var processor = accel.CreateSparseTransposedMatrixMultiplierMasked<
    float, // Matrix type
    FloatEpsPredicate<Stride2D.DenseY>, // A generic predicate - a float-eps-value predicate in this case
    Stride2D.DenseY, // Custom striding of the matrix to use
    FloatMaskedSparseMatrixProcessor>(); // A float-specific FMA operation to accumulate dot products

// Multiply a single masked sparse matrix
processor(
    accel.DefaultStream,
    new(denseMatrixBuffer.View, 2.0f),
    denseMatrixBuffer.View,
    // Sparse matrix view, obtained by explicit construction or by using one of the
    // available sparsification methods; it is transposed at runtime (see below)
    sparseView,
    outBuffer.View);

// Create a concurrent processor that supports batched multiplication
using var maskedProcessor = new MaskedMatrixProcessor<
    float,
    FloatEpsPredicate<Stride2D.DenseY>,
    Stride2D.DenseY,
    FloatMaskedSparseMatrixProcessor>(accel);
maskedProcessor.Predicate = /* custom predicate here */;
maskedProcessor.MultiplyBatchedTransposed(
    accel.DefaultStream, // The main accelerator stream to attach the next multiplication operations to
    aViewList, // A list of dense input matrices to multiply with the sparse matrices in bViewList
    bViewList, // A list of sparse input matrices (will be transposed)
    outViewList); // A list of output views that receive the results
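
To make the masking idea concrete, here is a minimal CPU-only sketch of a masked, transposed multiplication. It is not the ILGPU implementation. The class name, method name, and parameters are hypothetical, and the masking rule (compute an output entry only where the absolute mask value exceeds eps) is an assumption based on the FloatEpsPredicate name, not something this PR confirms.

```csharp
using System;

// Hypothetical CPU reference, for illustration only (not part of ILGPU).
public static class MaskedMultiplySketch
{
    // Computes result = a * transpose(b), but only at positions (i, j) where
    // |mask[i, j]| > eps (assumed predicate semantics). All other entries are
    // left at zero and their dot products are never evaluated.
    public static float[,] MultiplyTransposedMasked(
        float[,] mask, float eps, float[,] a, float[,] b)
    {
        int rows = a.GetLength(0);
        int inner = a.GetLength(1);
        int cols = b.GetLength(0); // b is used transposed: its rows act as columns

        if (b.GetLength(1) != inner)
            throw new ArgumentException("Inner dimensions of a and b must match.");
        if (mask.GetLength(0) != rows || mask.GetLength(1) != cols)
            throw new ArgumentException("Mask must have the shape of the result.");

        var result = new float[rows, cols];
        for (int i = 0; i < rows; ++i)
        {
            for (int j = 0; j < cols; ++j)
            {
                // Skip masked-out entries before doing any arithmetic
                if (Math.Abs(mask[i, j]) <= eps)
                    continue;

                // Dot product of row i of a with row j of b (b transposed)
                float sum = 0.0f;
                for (int k = 0; k < inner; ++k)
                    sum += a[i, k] * b[j, k];
                result[i, j] = sum;
            }
        }
        return result;
    }
}
```

With a diagonal mask, only the diagonal of the product is computed, so the work drops from rows * cols dot products to one per row.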

The following sample uses the newly added extensions to create a sparse matrix directly on the accelerator, which speeds up the creation of huge sparse matrices:

// Setup sparse 2D matrix
var matrix = new float[4, 4]
{
    { 1, 0, 0, 0 },
    { 2, 1, 0, 0 },
    { 3, 0, 1, 0 },
    { 4, 0, 0, 1 },
};

// Allocate basic matrix on the accelerator and transfer it to the device
using var matrixBuffer = accel.Allocate2DDenseY<float>(matrix.GetExtent());
matrixBuffer.View.CopyFromCPU(matrix);

// Allocate a temp buffer (or use existing memory from somewhere else)
using var tempBuffer = accel.Allocate1D<int>(1);

// Initialize the basic shape converter and data converters
var shapeConverter = accel.CreateSparseTransposedMatrixShapeConverter<
    float,
    FloatEpsPredicate<Stride2D.DenseY>,
    Stride2D.DenseY>(tempBuffer.View);
var converter = accel.CreateSparseMatrixConverter<float, Stride2D.DenseY>();

// Get basic shape of the sparse matrix living on the device
using var numNeighborsBuffer = accel.Allocate1D<int>(matrix.GetLength(1));
var shapeView = shapeConverter(
    accel.DefaultStream,
    matrixBuffer,
    new(matrixBuffer, 0.0f),
    numNeighborsBuffer.View,
    maxNumNeighbors =>
        // Allocate a shape-view buffer to store the neighbor lists
        accel.Allocate2DDenseY<int>((matrix.GetLength(1), maxNumNeighbors)));

// Allocate the actual sparse data buffer
using var dataBuffer = accel.Allocate2DDenseY<float>(
    (matrix.GetLength(1),
    shapeView.MaxNonZeroEntries));

// Convert data and fill our sparse matrix structure
var sparseView = converter(accel.DefaultStream, matrixBuffer, shapeView, dataBuffer);
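
For readers unfamiliar with this kind of layout, the two-step conversion above (shape first, then data) can be mirrored on the CPU. The sketch below is not the ILGPU implementation. It assumes a padded neighbor-list layout inferred from the buffer extents in the sample: one row per column of the dense matrix, holding the row indices and values of the entries that pass the predicate. The class name, method name, and the `> eps` comparison are hypothetical.

```csharp
using System;

// Hypothetical CPU reference, for illustration only (not part of ILGPU).
public static class SparseTransposedSketch
{
    // Converts a dense matrix into an assumed transposed neighbor-list layout:
    //   Counts[c]       number of kept entries in column c
    //   Neighbors[c, s] row index of the s-th kept entry in column c
    //   Data[c, s]      value of that entry
    // Slots at or beyond Counts[c] are zero padding and must be ignored.
    public static (int[] Counts, int[,] Neighbors, float[,] Data) Convert(
        float[,] matrix, float eps)
    {
        int rows = matrix.GetLength(0);
        int cols = matrix.GetLength(1);

        // Pass 1 (shape): count kept entries per column and track the maximum
        var counts = new int[cols];
        int maxCount = 0;
        for (int c = 0; c < cols; ++c)
        {
            for (int r = 0; r < rows; ++r)
            {
                if (Math.Abs(matrix[r, c]) > eps)
                    ++counts[c];
            }
            maxCount = Math.Max(maxCount, counts[c]);
        }

        // Pass 2 (data): fill index and value buffers sized by the maximum count
        var neighbors = new int[cols, maxCount];
        var data = new float[cols, maxCount];
        for (int c = 0; c < cols; ++c)
        {
            int slot = 0;
            for (int r = 0; r < rows; ++r)
            {
                float value = matrix[r, c];
                if (Math.Abs(value) > eps)
                {
                    neighbors[c, slot] = r;
                    data[c, slot] = value;
                    ++slot;
                }
            }
        }
        return (counts, neighbors, data);
    }
}
```

Under these assumptions, the 4x4 sample matrix has four kept entries in its first column and one in each of the others, so the padded buffers are 4x4 even though only seven slots hold real data.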

Note: This PR requires PR #989 to be merged.

Co-authored by @corwinjoy who wrote the initial POC.

Src/ILGPU.Algorithms/MatrixOperations/CondensedProductRows.cs

Src/ILGPU.Algorithms/MatrixOperations/MaskedMatrixProcessor.cs

Src/ILGPU.Algorithms/MatrixOperations/SparseMatrixShape.cs

Src/ILGPU.Algorithms/ConcurrentStreamProcessor.cs

Src/ILGPU.Algorithms/MatrixOperations/MaskedSparseMatrixExtensions.cs

Src/ILGPU.Algorithms.Tests/MatrixTests.cs

corwinjoy

Overall looks pretty good. I suggested a few minor changes.
I think the big piece that is missing here is documentation as to the motivation of this class and what problem it solves. We know, but I think users will be mystified.

corwinjoy

Looks good. As mentioned, I think the only thing that is missing is a piece of documentation explaining why end-users would want to use this / what problem this solves.

…rocessing on GPUs using multiple streams.

…shapes in GPU memory.

…in GPU memory.

…dense matrices into sparse ones in GPU memory.

m4rs-mt · 2023-07-20T08:47:37Z

> Looks good. As mentioned, I think the only thing that is missing is a piece of documentation explaining why end-users would want to use this / what problem this solves.

Absolutely! I think we should create a sample in a separate PR and add further documentation to the wiki before the final release.

corwinjoy

OK. Updated tests look good.

m4rs-mt force-pushed the sparse_matrix branch 2 times, most recently from a929e5f to ad17267 Compare February 6, 2023 16:32

corwinjoy reviewed Feb 6, 2023

m4rs-mt force-pushed the sparse_matrix branch from ad17267 to 6ceedf0 Compare February 7, 2023 08:09

m4rs-mt added this to the v1.4 milestone Feb 16, 2023

m4rs-mt modified the milestones: v1.4, v1.5 Mar 16, 2023

m4rs-mt force-pushed the sparse_matrix branch 4 times, most recently from c0ba115 to f944ca3 Compare April 12, 2023 21:11

m4rs-mt requested a review from corwinjoy April 12, 2023 21:13

m4rs-mt force-pushed the sparse_matrix branch 3 times, most recently from aa46563 to 0bd56f2 Compare April 12, 2023 21:19

m4rs-mt marked this pull request as ready for review April 12, 2023 21:19

m4rs-mt added the feature A new feature (or feature request) label Apr 12, 2023

corwinjoy reviewed Apr 13, 2023

m4rs-mt force-pushed the sparse_matrix branch 3 times, most recently from 356eb44 to 46fbc48 Compare April 17, 2023 15:01

m4rs-mt force-pushed the sparse_matrix branch 2 times, most recently from 79b48df to ccc0f34 Compare April 23, 2023 23:42

m4rs-mt force-pushed the sparse_matrix branch from ccc0f34 to 33ce55a Compare July 12, 2023 09:17