Refactoring MetaDataObject out of DenseMatrix #758

corepointer · 2024-06-17T16:13:03Z

This PR moves the MetaDataObject (MDO) functionality out of DenseMatrix and generalizes it so that other classes derived from Structure can use it as well.

Furthermore, this contains a performance improvement to prevent excessive allocation ID lookups and a separation of ranged and full allocations.

All tests are running except the distributed ones.

* This commit introduces the meta data object to the CSR data type
* Memory pinning
  To prevent excessive allocation ID lookups in the hot path when using --vec, this change "pins" memory by allocation type of previous accesses.

… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.
* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, ranges are handled separately and local computation always assumes a full allocation. This should result in fewer unnecessary "if range not null do this, else do that" branches.
* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses. Simply put, as long as there is no different access type (e.g., calling getValues() for host vs. device memory), it is assumed that the data has not changed and the meta data object does not need to be queried.

Closes daphne-eu#758
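To illustrate the pinning idea described in the commit message, here is a minimal sketch. It is not the code from this PR: `AllocType`, `MetaDataSketch`, and `PinnedMatrixSketch` are hypothetical names made up for the example, and data transfer between host and device is left out entirely. It only shows how repeated accesses of the same type can skip the lookup.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical allocation kinds; the names are illustrative only.
enum class AllocType { Host, Device };

// Hypothetical stand-in for a meta data object that knows where data lives.
struct MetaDataSketch {
    std::size_t lookups = 0;                      // counts slow-path queries
    std::vector<double> hostBuf{1.0, 2.0, 3.0};
    std::vector<double> deviceBuf{1.0, 2.0, 3.0}; // pretend device memory

    // Slow path: search for the allocation of the requested type.
    double *find(AllocType type) {
        ++lookups;
        return type == AllocType::Host ? hostBuf.data() : deviceBuf.data();
    }
};

// Hypothetical data structure that remembers ("pins") the last access.
class PinnedMatrixSketch {
    MetaDataSketch mdo;
    bool pinned = false;
    AllocType pinnedType = AllocType::Host;
    double *pinnedPtr = nullptr;

  public:
    double *getValues(AllocType type) {
        // Fast path: same access type as before, so no query is needed.
        if (pinned && pinnedType == type)
            return pinnedPtr;
        // Slow path: a different access type, so query and re-pin.
        pinnedPtr = mdo.find(type);
        pinnedType = type;
        pinned = true;
        return pinnedPtr;
    }

    std::size_t lookupCount() const { return mdo.lookups; }
};
```

In this sketch, calling `getValues(AllocType::Host)` repeatedly in a loop triggers a single lookup; only a switch to `AllocType::Device` (or back) goes through the slow path again.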

corepointer · 2024-10-18T17:13:51Z

The numerous force pushes are a result of my local clang-format disagreeing with the CI's clang-format:

--- src/runtime/local/datastructures/AllocationDescriptorGRPC.h	(original)
+++ src/runtime/local/datastructures/AllocationDescriptorGRPC.h	(reformatted)
@@ -35,7 +35,7 @@
   public:
     AllocationDescriptorGRPC() = default;
     AllocationDescriptorGRPC(DaphneContext *ctx, const std::string &address, const DistributedData &data)
-        : ctx(ctx), workerAddress(address), distributedData(data) {};
+        : ctx(ctx), workerAddress(address), distributedData(data){};
 
     ~AllocationDescriptorGRPC() override = default;
     [[nodiscard]] ALLOCATION_TYPE getType() const override { return type; };

corepointer · 2024-10-18T17:20:24Z

Explaining the labels:

feature: CUDA handling CSRMatrix is new
Accelerator: it's (also) about CUDA ops
Distributed: the refactoring affects this component
Performance: besides this one explaining itself, the pinning and being able to run sparse stuff on GPU help with performance 💪

…not throw

Changing the behavior of fileExists() to a boolean operation as suggested by the method's name. Throwing an exception is up to the caller of this method.

Closes daphne-eu#867
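The pattern described in this commit message could look roughly like the following. This is a sketch under assumptions, not the actual ConfigParser code: `fileExistsSketch` and `requireConfigFile` are hypothetical free functions, and using `std::filesystem` (C++17) is my choice for the example.

```cpp
#include <filesystem>
#include <stdexcept>
#include <string>
#include <system_error>

// Illustrative free function, not the real ConfigParser::fileExists().
// Returns true or false and never throws: the error_code overload of
// std::filesystem::exists reports failures through ec instead.
bool fileExistsSketch(const std::string &filename) {
    std::error_code ec;
    return std::filesystem::exists(filename, ec);
}

// Hypothetical caller that decides a missing file is an error.
void requireConfigFile(const std::string &filename) {
    if (!fileExistsSketch(filename))
        throw std::runtime_error("could not find config file: " + filename);
}
```

A caller that can fall back to defaults simply checks the boolean, while a caller that needs the file throws with its own message.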


Because a pointer to a local variable was used, the distributed (GRPC_SYNC) mode crashed in test cases. This patch fixes this by using std::unique_ptr appropriately. Furthermore, a check for nullptr is performed before getting distributed data, so that a message indicates that execution failed at this point.
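The kind of fix described here can be sketched as follows. The types are hypothetical (`DescriptorSketch`, `StructureSketch`, and `getWorkerAddressChecked` do not exist in the code base). The sketch only shows the two ideas from the commit message: ownership via std::unique_ptr instead of a pointer to a local variable, and a nullptr check with a message before the data is used.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical descriptor standing in for one that carries distributed data.
struct DescriptorSketch {
    std::string workerAddress;
};

// Hypothetical owner of descriptors.
class StructureSketch {
    std::vector<std::unique_ptr<DescriptorSketch>> descriptors;

  public:
    // Takes ownership, so the descriptor outlives the caller's scope.
    // Storing the address of a caller's local variable here would dangle.
    DescriptorSketch *addDescriptor(std::unique_ptr<DescriptorSketch> d) {
        descriptors.push_back(std::move(d));
        return descriptors.back().get();
    }

    // Returns nullptr if no descriptor matches the given address.
    const DescriptorSketch *findDescriptor(const std::string &address) const {
        for (const auto &d : descriptors)
            if (d->workerAddress == address)
                return d.get();
        return nullptr;
    }
};

// Checks for nullptr before dereferencing and reports where it failed.
std::string getWorkerAddressChecked(const StructureSketch &s, const std::string &address) {
    const DescriptorSketch *d = s.findDescriptor(address);
    if (d == nullptr)
        throw std::runtime_error("no distributed data found for worker " + address);
    return d->workerAddress;
}
```

Because the descriptor is moved into the owning structure, it stays valid after the scope that created it ends.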

corepointer mentioned this pull request Jul 22, 2024

Dnn ops #734

Draft

corepointer force-pushed the mdo_csr_cuda_refactor branch from df4702e to cfd8053 Compare October 18, 2024 15:15

corepointer force-pushed the mdo_csr_cuda_refactor branch from cfd8053 to 17d3baa Compare October 18, 2024 15:31

corepointer force-pushed the mdo_csr_cuda_refactor branch from 17d3baa to d9d1b59 Compare October 18, 2024 17:04

corepointer force-pushed the mdo_csr_cuda_refactor branch from d9d1b59 to 9016ae9 Compare October 18, 2024 17:06

corepointer force-pushed the mdo_csr_cuda_refactor branch from 9016ae9 to 6f6da3b Compare October 18, 2024 17:10

corepointer force-pushed the mdo_csr_cuda_refactor branch from 6f6da3b to ce36921 Compare October 18, 2024 17:12

corepointer marked this pull request as ready for review October 18, 2024 17:15

corepointer added the feature, performance, Accelerators, and Distributed labels Oct 18, 2024

corepointer requested a review from pdamme October 18, 2024 17:20

corepointer added 3 commits October 19, 2024 02:52

[DAPHNE-daphne-eu#867] Change ConfigParser::fileExists() behavior to …

3e442ed


corepointer force-pushed the mdo_csr_cuda_refactor branch from ce36921 to d434bf5 Compare October 19, 2024 00:59
