
Distributed Compressed Sparse Column Matrix #1377

Merged: 26 commits merged from sparse_csc into main on Jun 7, 2024

Conversation

@Mystic-Slice (Collaborator) commented Feb 20, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • documentation updated where needed

Description

A new sparse format, Compressed Sparse Column (CSC), similar to the DCSR format implemented in #1028.

Most of the code from the previous implementation is shared between the two formats.

Steps:

  • Implement a common base class
  • Factory methods
  • Sparse and Dense formats interconversion
  • Arithmetic Operations
  • Tests

NOTE: In the future, when arithmetic operations are supported by torch for sparse_csc, they can be enabled for DCSC_matrix by restoring the code removed in commit 6d727af. The binary operator has already been modified to be generic for both DCSR_matrix and DCSC_matrix.
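For orientation, a minimal usage sketch assembled from the factory call and attributes that appear in the tests quoted later in this thread; the torch input values are illustrative:

```python
import torch
import heat as ht

# [[1, 0, 2], [0, 3, 0]] in torch's CSC layout
t = torch.sparse_csc_tensor(
    torch.tensor([0, 1, 2, 3]),     # ccol_indices (column pointers)
    torch.tensor([0, 1, 0]),        # row_indices
    torch.tensor([1.0, 3.0, 2.0]),  # values, read column by column
    size=(2, 3),
)

dcsc = ht.sparse.sparse_csc_matrix(t)  # factory method added in this PR
print(dcsc.dtype)                      # ht.float32
as_uint8 = dcsc.astype(ht.uint8)       # dtype cast, as exercised in test_astype
```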

Type of change

New feature


@Mystic-Slice (Collaborator, Author)

I just discovered something that I should have seen a long time ago.

torch only officially supports the matmul operation for its CSR and CSC formats (https://pytorch.org/docs/stable/sparse.html#csr-tensor-operations). The elementwise operations (add and mul) are implemented for CSR (although not mentioned in the docs) but not for CSC.

It is still possible for us to implement matmul for the ht.sparse module. I just wanted to know how we should proceed.
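A small torch-only probe of that support matrix (illustrative; the exact behavior varies with the torch version):

```python
import torch

# The same 2x3 matrix [[1, 0, 2], [0, 3, 0]] in both compressed layouts
csr = torch.sparse_csr_tensor(
    torch.tensor([0, 2, 3]),        # crow_indices
    torch.tensor([0, 2, 1]),        # col_indices
    torch.tensor([1.0, 2.0, 3.0]),  # values
    size=(2, 3),
)
csc = torch.sparse_csc_tensor(
    torch.tensor([0, 1, 2, 3]),     # ccol_indices
    torch.tensor([0, 1, 0]),        # row_indices
    torch.tensor([1.0, 3.0, 2.0]),  # values
    size=(2, 3),
)

dense = torch.ones(3, 2)
print(csr @ dense)   # matmul: the officially supported operation
print(csc @ dense)

print(csr + csr)     # elementwise add: implemented for CSR in practice
try:
    csc + csc        # but not for CSC at the time of this PR
except RuntimeError as err:
    print("CSC add unsupported:", err)
```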

@mrfh92 marked this pull request as ready for review on Mar 7, 2024, 09:23

codecov bot commented Mar 7, 2024

Codecov Report

Attention: Patch coverage is 92.85714% with 9 lines in your changes missing coverage. Please review.

Project coverage is 91.74%. Comparing base (ee0d72a) to head (90dc1d6).
Report is 325 commits behind head on main.

Files                         Patch %   Lines
heat/sparse/_operations.py    78.94%    4 Missing ⚠️
heat/sparse/dcsx_matrix.py    93.02%    3 Missing ⚠️
heat/sparse/factories.py      96.96%    1 Missing ⚠️
heat/sparse/manipulations.py  95.00%    1 Missing ⚠️
@@            Coverage Diff             @@
##             main    #1377      +/-   ##
==========================================
- Coverage   91.76%   91.74%   -0.03%     
==========================================
  Files          80       80              
  Lines       11640    11683      +43     
==========================================
+ Hits        10682    10719      +37     
- Misses        958      964       +6     
Flag Coverage Δ
unit 91.74% <92.85%> (-0.03%) ⬇️


@ClaudiaComito (Contributor)

> I just discovered something that I should have seen a long time ago.
>
> torch only officially supports the matmul operation for its CSR and CSC formats (https://pytorch.org/docs/stable/sparse.html#csr-tensor-operations). The elementwise operations (add and mul) are implemented for CSR (although not mentioned in the docs) but not for CSC.
>
> It is still possible for us to implement matmul for the ht.sparse module. I just wanted to know how we should proceed.

@Mystic-Slice thank you so much for this, and apologies for the delay.

It's OK not to provide the element-wise operations for CSC if torch doesn't implement them yet.

The basic operation we want to provide with this class is matrix multiplication CSR @ CSC, where the CSR operand is distributed (split=0) and the CSC operand is not distributed. This operation requires no MPI communication and returns a distributed CSR. It would be a great starting point, and it doesn't need to be part of this PR.
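A minimal single-process sketch of why no communication is needed, with two row slices standing in for two ranks (plain torch; all names illustrative):

```python
import torch

A = torch.randn(6, 4).relu().to_sparse_csr()  # stands in for the CSR operand (split=0)
B = torch.randn(4, 5).relu().to_sparse_csc()  # stands in for the replicated CSC operand

blocks = []
for lo, hi in [(0, 3), (3, 6)]:               # each slice mimics one rank's row slab
    local = A.to_dense()[lo:hi].to_sparse_csr()
    blocks.append(local @ B.to_dense())       # purely local product, no MPI traffic

# Concatenating the per-rank row blocks reproduces the full product,
# so each rank already holds its slab of the distributed CSR result.
assert torch.allclose(torch.cat(blocks), A.to_dense() @ B.to_dense())
```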

Is this PR, the CSC class, ready for review, @Mystic-Slice?

Thanks again, we highly appreciate your contribution!


@Mystic-Slice (Collaborator, Author)

Hi @ClaudiaComito,
Sorry for the delay from my side too.
The implementation of DCSC_matrix is complete; I just need to write a few tests.
I will complete them ASAP.

If tests aren't a blocker for review, we can proceed with that.


@ClaudiaComito (Contributor)

> Hi @ClaudiaComito, sorry for the delay from my side too. The implementation of DCSC_matrix is complete; I just need to write a few tests. I will complete them ASAP.
>
> If tests aren't a blocker for review, we can proceed with that.

Hi @Mystic-Slice, thanks. The tests make the review much easier, so please go ahead and sketch them first. Thanks a lot!


@ClaudiaComito (Contributor) commented May 27, 2024

@Mystic-Slice on the CUDA runner, with PyTorch 2.2, we get the following error:

___________ ERROR collecting heat/sparse/tests/test_manipulations.py ___________
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/runner.py:341: in from_call
    result: Optional[TResult] = func()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/runner.py:372: in <lambda>
    call = CallInfo.from_call(lambda: list(collector.collect()), "collect")
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:531: in collect
    self._inject_setup_module_fixture()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:545: in _inject_setup_module_fixture
    self.obj, ("setUpModule", "setup_module")
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:310: in obj
    self._obj = obj = self._getobj()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:528: in _getobj
    return self._importtestmodule()
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/python.py:617: in _importtestmodule
    mod = import_path(self.path, mode=importmode, root=self.config.rootpath)
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/pathlib.py:565: in import_path
    importlib.import_module(module_name)
/opt/conda/envs/heat_dev/lib/python3.10/importlib/__init__.py:126: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
<frozen importlib._bootstrap>:1050: in _gcd_import
    ???
<frozen importlib._bootstrap>:1027: in _find_and_load
    ???
<frozen importlib._bootstrap>:992: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:241: in _call_with_frames_removed
    ???
<frozen importlib._bootstrap>:1050: in _gcd_import
    ???
<frozen importlib._bootstrap>:1027: in _find_and_load
    ???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:688: in _load_unlocked
    ???
<frozen importlib._bootstrap_external>:883: in exec_module
    ???
<frozen importlib._bootstrap>:241: in _call_with_frames_removed
    ???
heat/sparse/tests/__init__.py:1: in <module>
    from .test_arithmetics_csr import *
<frozen importlib._bootstrap>:1027: in _find_and_load
    ???
<frozen importlib._bootstrap>:1006: in _find_and_load_unlocked
    ???
<frozen importlib._bootstrap>:688: in _load_unlocked
    ???
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/_pytest/assertion/rewrite.py:178: in exec_module
    exec(co, module.__dict__)
heat/sparse/tests/test_arithmetics_csr.py:4: in <module>
    import heat as ht
heat/__init__.py:20: in <module>
    from . import utils
heat/utils/__init__.py:5: in <module>
    from . import data
heat/utils/data/__init__.py:7: in <module>
    from . import mnist
heat/utils/data/mnist.py:7: in <module>
    from torchvision import datasets
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/torchvision/__init__.py:6: in <module>
    from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/torchvision/_meta_registrations.py:26: in <module>
    def meta_roi_align(input, rois, spatial_scale, pooled_height, pooled_width, sampling_ratio, aligned):
/opt/conda/envs/heat_dev/lib/python3.10/site-packages/torchvision/_meta_registrations.py:18: in wrapper
    if torchvision.extension._has_ops():
E   AttributeError: partially initialized module 'torchvision' has no attribute 'extension' (most likely due to a circular import)

Many test files fail with the same error, but we can't reproduce it in other PRs. Can you check it out?
The ROCm runner runs fine, by the way.

@Mystic-Slice (Collaborator, Author)

I was able to reproduce this error on my local machine.
It occurs when the torch and torchvision versions do not match.
In my case specifically, torch==2.2 with torchvision==0.18.0 (which corresponds to torch==2.3).
When I made the versions match, everything worked fine.

I believe the torch version on the CUDA runner is 2.2 and not 2.3, but I am not sure.

Is there a way to check that?
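One quick way to confirm the pairing on the runner (standard version attributes, nothing runner-specific):

```python
import torch
import torchvision

# torchvision 0.18.x is built against torch 2.3.x; any other pairing can
# trigger the circular-import error shown above.
print(torch.__version__, torchvision.__version__)
```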

@ClaudiaComito (Contributor) commented Jun 3, 2024

> I was able to reproduce this error on my local machine. It occurs when the torch and torchvision versions do not match. In my case specifically, torch==2.2 with torchvision==0.18.0 (which corresponds to torch==2.3). When I made the versions match, everything worked fine.
>
> I believe the torch version on the CUDA runner is 2.2 and not 2.3, but I am not sure.
>
> Is there a way to check that?

~~@mtar can you look into this? thanks~~ never mind


@ClaudiaComito (Contributor)

@Mystic-Slice this looks amazing. Will you add your name to the CITATION.cff file, under the # release contributors - add as needed header? Thanks!

@Mystic-Slice (Collaborator, Author)

Done!


@ClaudiaComito previously approved these changes on Jun 6, 2024 and left a comment:

Amazing work @Mystic-Slice ! 👏🏼

@ClaudiaComito (Contributor)

@Mystic-Slice a single test fails with torch<2, see below. From my side it's absolutely fine to skip the test when torch<2:

_________________________ TestDCSC_matrix.test_astype __________________________

self = <heat.sparse.tests.test_dcscmatrix.TestDCSC_matrix testMethod=test_astype>

    def test_astype(self):
        heat_sparse_csc = ht.sparse.sparse_csc_matrix(self.ref_torch_sparse_csc)
    
        # check starting invariant
        self.assertEqual(heat_sparse_csc.dtype, ht.float32)
    
        # check the copy case for uint8
>       as_uint8 = heat_sparse_csc.astype(ht.uint8)

heat/sparse/tests/test_dcscmatrix.py:273: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = , dtype = <class 'heat.core.types.uint8'>, copy = True

    def astype(self, dtype, copy=True) -> __DCSX_matrix:
        """
        Returns a casted version of this matrix.
        Casted matrix is a new matrix of the same shape but with given type of this matrix. If copy is ``True``, the
        same matrix is returned instead.
    
        Parameters
        ----------
        dtype : datatype
            HeAT type to which the matrix is cast
        copy : bool, optional
            By default the operation returns a copy of this matrix. If copy is set to ``False`` the cast is performed
            in-place and this matrix is returned
        """
        dtype = canonical_heat_type(dtype)
>       casted_matrix = self._array.type(dtype.torch_type())
E       RuntimeError: torch.copy_: only sparse compressed tensors with the same number of specified elements are supported.

heat/sparse/dcsx_matrix.py:309: RuntimeError

@Mystic-Slice (Collaborator, Author) commented Jun 7, 2024

@ClaudiaComito
I referred to https://discuss.pytorch.org/t/why-does-pytorch-needs-the-three-functions-to-type-and-type-as/20164.
It seems that .type() doesn't support conversion for many tensor types; I should have used the .to() method instead.
Made the change.
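Concretely, the change swaps the casting call shown in the traceback above (a sketch of the diff, not the full method):

```python
# before: fails for sparse compressed tensors
casted_matrix = self._array.type(dtype.torch_type())

# after: cast via Tensor.to(), which handles more tensor types
casted_matrix = self._array.to(dtype.torch_type())
```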


@ClaudiaComito previously approved these changes on Jun 7, 2024 and left a comment:

Thanks @Mystic-Slice ! (take 2)

@Mystic-Slice (Collaborator, Author)

Seems like it didn't work.
I will just skip the test for torch < 2.0.
Sorry for the delay; I thought the previous fix would work.
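One standard way to express that guard (a sketch; the decorator actually used in the PR isn't shown in this thread):

```python
import unittest
import torch

TORCH_MAJOR = int(torch.__version__.split(".")[0])

class TestDCSC_matrix(unittest.TestCase):
    @unittest.skipIf(
        TORCH_MAJOR < 2,
        "casting sparse compressed tensors is unreliable on torch < 2.0",
    )
    def test_astype(self):
        ...
```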


@ClaudiaComito (Contributor) left a comment:

Take 3!

@ClaudiaComito merged commit 0e40d14 into main on Jun 7, 2024 (53 checks passed)
@ClaudiaComito deleted the sparse_csc branch on Jun 7, 2024, 12:40
@ClaudiaComito added the enhancement (New feature or request) label on Aug 22, 2024