Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#4783 Added nan_euclidean distance metric to pairwise_distances #4797

Merged

Conversation

SreekiranprasadV
Copy link
Contributor

Added nan_euclidean distance metric to pairwise_distances to calculate euclidean distance on data with missing values.

  • Added Test cases for nan_euclidean_distance functions

Time taken to calculate:
#Data Points | Sklearn | Cuml
10000 402 us 2.54 ms
100k 23 ms 3.8 ms
1M 760 ms 16 ms

GPU specifications:

  • Tesla T4 15109MiB

CPU specifications:

  • 11th gen intel i7, 8 cores, 16 Logical processors, 32 GB Memory
  • Sklearn njobs as default

@SreekiranprasadV SreekiranprasadV requested a review from a team as a code owner July 1, 2022 20:25
@GPUtester
Copy link
Contributor

Can one of the admins verify this patch?

@github-actions github-actions bot added the Cython / Python Cython or Python issue label Jul 1, 2022
@beckernick
Copy link
Member

ok to test

@beckernick beckernick added feature request New feature or request non-breaking Non-breaking change labels Jul 1, 2022
@cjnolet
Copy link
Member

cjnolet commented Jul 1, 2022

add to allowlist

@SreekiranprasadV
Copy link
Contributor Author

rerun tests

1 similar comment
@SreekiranprasadV
Copy link
Contributor Author

rerun tests

Copy link
Member

@cjnolet cjnolet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! Overall, I think this will be a great addition to cuML. Since the memory footprint of the pairwise distances already grows quadratically w/ the number of data points, we can do some tricks to save memory on the GPU.

python/cuml/metrics/pairwise_distances.pyx Outdated Show resolved Hide resolved
missing_Y = missing_X if Y is X else _get_mask(Y, missing_values)

# set missing values to zero
X[missing_X] = 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we strive to not modify the inputs unless we can't avoid doing so. When we do decide it's best to modify the inputs, we make sure to document that clearly in the pydocs and revert any changes back at the end of the algorithm. One of the reasons for this is that it can cause undefined and non-deterministic behavior when a user attempts to run multiple algorithms on the inputs asynchronously.

If you are to modify the inputs, I would also suggest adding a pytest assertion that the inputs X and Y haven't changed after the algorithm executes.

distances = cp.array(pairwise_distances(X, Y, metric="sqeuclidean"))

# Adjust distances for missing values
XX = X * X
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems really expensive- we're essentially copying each input just to compute a masked l2 norm that we can subtract from the pairwise distance matrix. In the case where X == Y, we're copying the input twice. It should be fairly straightforward to do this in place, even using a RawKernel if needed. You should be able to take the storage requirement down to X.shape[0] (or X.shape[0] + Y.shape[0] when Y is supplied).


Parameters
----------
X : array-like of shape (n_samples_X, n_features)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should follow the formats of the other cuML pydocs:

  X : Dense or sparse matrix (device or host) of shape
        (n_samples_x, n_features)
        Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
        ndarray, cuda array interface compliant array like CuPy, or
        cupyx.scipy.sparse for sparse input
    Y : array-like (device or host) of shape (n_samples_y, n_features),\
        optional
        Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
        ndarray, cuda array interface compliant array like CuPy

@pytest.mark.parametrize("metric", ["nan_euclidean"])
@pytest.mark.parametrize("matrix_size", [(5, 4), (1000, 3), (2, 10),
(500, 400), unit_param((1000, 100))])
def test_nan_euclidean_distances(metric: str, matrix_size):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like much of this is duplicated from the test_pairwise_distance pytest and this file is pretty large already. We should be able to combine this w/ the existing test and add nans to the input when the appropriate distance is invoked.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have made the other changes that were suggested, however in the test cases part to add nan_euclidean metric to test_pairwise_distance there should be a lot of the existing code which needs to be changed, because pairwise_distaces accept numpy arrays also as an input however nan_euclidean function accepts only cupy arrays.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Sreekiran096 I missed that detail when reviewing this PR. We should be accepting numpy arrays as well, which should be as easy as putting X = cp.asarray(X) at the beginning (it's a no-op when the data is already on the GPU).

@@ -136,6 +134,104 @@ def _determine_metric(metric_str, is_sparse=False):
return PAIRWISE_DISTANCE_METRICS[metric_str]


def nan_euclidean_distances(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add "nan_euclidean" to PAIRWISE_DISTANCE_METRICS

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Corey for all the changes suggested will do the relevant changes and commit again.

@cjnolet
Copy link
Member

cjnolet commented Jul 26, 2022

@Sreekiran096 it looks like this PR is just waiting on style fixes

@SreekiranprasadV
Copy link
Contributor Author

rerun tests

1 similar comment
@SreekiranprasadV
Copy link
Contributor Author

rerun tests

@SreekiranprasadV
Copy link
Contributor Author

rerun tests

1 similar comment
@SreekiranprasadV
Copy link
Contributor Author

rerun tests

@cjnolet cjnolet changed the base branch from branch-22.08 to branch-22.10 August 9, 2022 20:32
@codecov-commenter
Copy link

Codecov Report

Merging #4797 (6fb284c) into branch-22.10 (dc77d6b) will increase coverage by 0.39%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.10    #4797      +/-   ##
================================================
+ Coverage         77.62%   78.02%   +0.39%     
================================================
  Files               180      180              
  Lines             11384    11386       +2     
================================================
+ Hits               8837     8884      +47     
+ Misses             2547     2502      -45     
Flag Coverage Δ
dask 46.22% <100.00%> (+0.70%) ⬆️
non-dask 67.27% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
python/cuml/metrics/__init__.py 100.00% <100.00%> (ø)
python/cuml/feature_extraction/_vectorizers.py 89.93% <0.00%> (+0.37%) ⬆️
python/cuml/common/import_utils.py 59.82% <0.00%> (+0.85%) ⬆️
.../dask/extended/linear_model/logistic_regression.py 92.00% <0.00%> (+57.33%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@dantegd
Copy link
Member

dantegd commented Aug 30, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 1e697db into rapidsai:branch-22.10 Aug 30, 2022
@cjnolet
Copy link
Member

cjnolet commented Aug 30, 2022

@gpucibot merge

jakirkham pushed a commit to jakirkham/cuml that referenced this pull request Feb 27, 2023
…es (rapidsai#4797)

Added nan_euclidean distance metric to pairwise_distances to calculate euclidean distance on data with missing values.

- Added Test cases for nan_euclidean_distance functions

Time taken to calculate:
#Data Points | Sklearn | Cuml
      10000       402 us     2.54 ms
      100k         23 ms      3.8 ms
      1M            760 ms    16 ms

GPU specifications:
- Tesla T4 15109MiB

CPU specifications:
- 11th gen intel i7, 8 cores, 16 Logical processors, 32 GB Memory
- Sklearn njobs as default

Authors:
  - https://github.com/Sreekiran096

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4797
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants