#4783 Added nan_euclidean distance metric to pairwise_distances #4797

SreekiranprasadV · 2022-07-01T20:25:11Z

Added the nan_euclidean distance metric to pairwise_distances to compute Euclidean distances on data with missing values.

Added test cases for the nan_euclidean_distance functions.

Time taken to calculate:

| # Data points | Sklearn | cuML |
|---|---|---|
| 10000 | 402 us | 2.54 ms |
| 100k | 23 ms | 3.8 ms |
| 1M | 760 ms | 16 ms |

GPU specifications:

Tesla T4 15109MiB

CPU specifications:

11th-gen Intel i7, 8 cores, 16 logical processors, 32 GB memory
Sklearn n_jobs left at its default

GPUtester · 2022-07-01T20:25:14Z

Can one of the admins verify this patch?

beckernick · 2022-07-01T20:46:25Z

ok to test

cjnolet · 2022-07-01T21:03:54Z

add to allowlist

SreekiranprasadV · 2022-07-05T15:57:05Z

rerun tests

SreekiranprasadV · 2022-07-13T20:22:40Z

rerun tests

cjnolet

Thanks for the PR! Overall, I think this will be a great addition to cuML. Since the memory footprint of the pairwise distances already grows quadratically w/ the number of data points, we can do some tricks to save memory on the GPU.

cjnolet · 2022-07-14T21:19:55Z

python/cuml/metrics/pairwise_distances.pyx

+    missing_Y = missing_X if Y is X else _get_mask(Y, missing_values)
+
+    # set missing values to zero
+    X[missing_X] = 0


In general, we strive to not modify the inputs unless we can't avoid doing so. When we do decide it's best to modify the inputs, we make sure to document that clearly in the pydocs and revert any changes back at the end of the algorithm. One of the reasons for this is that it can cause undefined and non-deterministic behavior when a user attempts to run multiple algorithms on the inputs asynchronously.

If you are to modify the inputs, I would also suggest adding a pytest assertion that the inputs X and Y haven't changed after the algorithm executes.
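A check along these lines could look like the sketch below. It uses NumPy as a CPU stand-in so it runs without a GPU; `assert_inputs_unchanged`, `zero_fill_in_place`, and `zero_fill_copy` are hypothetical names for illustration and are not part of cuML or of this PR.

```python
import numpy as np


def assert_inputs_unchanged(fn, X, Y):
    """Call fn(X, Y) and fail if either input array was modified in place."""
    X_before, Y_before = X.copy(), Y.copy()
    result = fn(X, Y)
    # assert_array_equal treats NaNs in matching positions as equal,
    # which matters for inputs that contain missing values.
    np.testing.assert_array_equal(X, X_before)
    np.testing.assert_array_equal(Y, Y_before)
    return result


def zero_fill_in_place(X, Y):
    """Deliberately bad (hypothetical): overwrites NaNs in the caller's arrays."""
    X[np.isnan(X)] = 0
    Y[np.isnan(Y)] = 0
    return X @ Y.T


def zero_fill_copy(X, Y):
    """Hypothetical alternative: works on zero-filled copies, inputs untouched."""
    X0 = np.where(np.isnan(X), 0.0, X)
    Y0 = np.where(np.isnan(Y), 0.0, Y)
    return X0 @ Y0.T
```

Wrapping the call under test this way catches the `X[missing_X] = 0` pattern quoted above, because the NaN entries would come back as zeros.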

cjnolet · 2022-07-14T21:22:11Z

python/cuml/metrics/pairwise_distances.pyx

+    distances = cp.array(pairwise_distances(X, Y, metric="sqeuclidean"))
+
+    # Adjust distances for missing values
+    XX = X * X


This seems really expensive: we're essentially copying each input just to compute a masked l2 norm that we can subtract from the pairwise distance matrix. In the case where X == Y, we're copying the input twice. It should be fairly straightforward to do this in place, even using a RawKernel if needed. You should be able to take the storage requirement down to X.shape[0] (or X.shape[0] + Y.shape[0] when Y is supplied).
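For reference, the arithmetic behind the masked correction can be written down as a small NumPy sketch. This is not the in-place or RawKernel version suggested above (it still allocates zero-filled copies and masks); it only illustrates what the metric computes, under the assumption that missing values are NaN. The function name `nan_euclidean_reference` is hypothetical.

```python
import numpy as np


def nan_euclidean_reference(X, Y=None):
    """CPU reference: Euclidean distance over coordinates present in both
    rows, rescaled by n_features / (number of shared coordinates)."""
    X = np.asarray(X, dtype=np.float64)
    Y = X if Y is None else np.asarray(Y, dtype=np.float64)

    present_X = ~np.isnan(X)
    present_Y = ~np.isnan(Y)
    # Zero-filled copies; the caller's arrays are left untouched.
    X0 = np.where(present_X, X, 0.0)
    Y0 = np.where(present_Y, Y, 0.0)
    pX = present_X.astype(np.float64)
    pY = present_Y.astype(np.float64)

    # Sum over shared coordinates of (x - y)^2, expanded as
    # sum(x^2 * py) + sum(px * y^2) - 2 * sum(x * y).
    sq = (X0 * X0) @ pY.T + pX @ (Y0 * Y0).T - 2.0 * (X0 @ Y0.T)
    np.clip(sq, 0.0, None, out=sq)  # guard against tiny negative round-off

    n_shared = pX @ pY.T
    with np.errstate(divide="ignore", invalid="ignore"):
        # Pairs with no shared coordinate become NaN (0 * inf).
        sq = sq * (X.shape[1] / n_shared)
    return np.sqrt(sq)
```

A reference like this is mainly useful as a ground truth in tests for a memory-lean GPU implementation.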

cjnolet · 2022-07-14T21:40:25Z

python/cuml/metrics/pairwise_distances.pyx

+
+    Parameters
+    ----------
+    X : array-like of shape (n_samples_X, n_features)


This should follow the formats of the other cuML pydocs:

    X : Dense or sparse matrix (device or host) of shape
        (n_samples_x, n_features)
        Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
        ndarray, cuda array interface compliant array like CuPy, or
        cupyx.scipy.sparse for sparse input
    Y : array-like (device or host) of shape (n_samples_y, n_features),\
        optional
        Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
        ndarray, cuda array interface compliant array like CuPy

cjnolet · 2022-07-14T21:44:00Z

python/cuml/tests/test_metrics.py

+@pytest.mark.parametrize("metric", ["nan_euclidean"])
+@pytest.mark.parametrize("matrix_size", [(5, 4), (1000, 3), (2, 10),
+                                         (500, 400), unit_param((1000, 100))])
+def test_nan_euclidean_distances(metric: str, matrix_size):


It looks like much of this is duplicated from the test_pairwise_distance pytest and this file is pretty large already. We should be able to combine this w/ the existing test and add nans to the input when the appropriate distance is invoked.

SreekiranprasadV · 2022-07-15

I have made the other suggested changes. However, adding the nan_euclidean metric to test_pairwise_distance would require changing a lot of the existing test code, because pairwise_distances also accepts NumPy arrays as input, whereas the nan_euclidean function accepts only CuPy arrays.

cjnolet · 2022-07-15

@Sreekiran096 I missed that detail when reviewing this PR. We should be accepting numpy arrays as well, which should be as easy as putting X = cp.asarray(X) at the beginning (it's a no-op when the data is already on the GPU).
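A rough sketch of both ideas in this thread, coercing inputs up front and injecting NaNs only for the metric that needs them, might look like the following. It uses NumPy's `np.asarray` as a CPU stand-in for `cp.asarray` so it runs without a GPU; `coerce_input` and `make_test_inputs` are hypothetical helpers, not functions from cuML or its test suite.

```python
import numpy as np


def coerce_input(X):
    """Hypothetical helper: accept lists or arrays. np.asarray returns the
    same object when X is already a float64 ndarray, so no copy is made."""
    return np.asarray(X, dtype=np.float64)


def make_test_inputs(metric, shape, seed=0, nan_fraction=0.1):
    """Hypothetical helper for a shared test: random data, with NaNs
    injected only when the nan_euclidean metric is under test."""
    rng = np.random.default_rng(seed)
    X = rng.random(shape)
    if metric == "nan_euclidean":
        X[rng.random(shape) < nan_fraction] = np.nan
    return X
```

With a helper like `make_test_inputs`, the existing parametrized test could take "nan_euclidean" as one more metric value instead of living in a separate, largely duplicated test function.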

cjnolet · 2022-07-14T21:47:35Z

python/cuml/metrics/pairwise_distances.pyx

@@ -136,6 +134,104 @@ def _determine_metric(metric_str, is_sparse=False):
        return PAIRWISE_DISTANCE_METRICS[metric_str]


+def nan_euclidean_distances(


Add "nan_euclidean" to PAIRWISE_DISTANCE_METRICS

SreekiranprasadV · 2022-07-14

Thank you, Corey, for the suggested changes. I will make them and commit again.

cjnolet · 2022-07-26T14:47:31Z

@Sreekiran096 it looks like this PR is just waiting on style fixes

SreekiranprasadV · 2022-07-29T22:02:24Z

rerun tests

SreekiranprasadV · 2022-08-01T17:47:48Z

rerun tests

SreekiranprasadV · 2022-08-02T02:44:44Z

rerun tests

SreekiranprasadV · 2022-08-02T02:49:21Z

rerun tests

codecov-commenter · 2022-08-10T00:04:19Z

Codecov Report

Merging #4797 (6fb284c) into branch-22.10 (dc77d6b) will increase coverage by 0.39%.
The diff coverage is 100.00%.

@@               Coverage Diff                @@
##           branch-22.10    #4797      +/-   ##
================================================
+ Coverage         77.62%   78.02%   +0.39%     
================================================
  Files               180      180              
  Lines             11384    11386       +2     
================================================
+ Hits               8837     8884      +47     
+ Misses             2547     2502      -45

Flag	Coverage Δ
dask	`46.22% <100.00%> (+0.70%)`	⬆️
non-dask	`67.27% <100.00%> (+0.01%)`	⬆️

Impacted Files	Coverage Δ
python/cuml/metrics/__init__.py	`100.00% <100.00%> (ø)`
python/cuml/feature_extraction/_vectorizers.py	`89.93% <0.00%> (+0.37%)`	⬆️
python/cuml/common/import_utils.py	`59.82% <0.00%> (+0.85%)`	⬆️
.../dask/extended/linear_model/logistic_regression.py	`92.00% <0.00%> (+57.33%)`	⬆️

dantegd · 2022-08-30T17:31:43Z

@gpucibot merge

cjnolet · 2022-08-30T17:32:14Z

@gpucibot merge

…es (rapidsai#4797)

Added nan_euclidean distance metric to pairwise_distances to calculate euclidean distance on data with missing values.
- Added Test cases for nan_euclidean_distance functions

Time taken to calculate:

| # Data points | Sklearn | cuML |
|---|---|---|
| 10000 | 402 us | 2.54 ms |
| 100k | 23 ms | 3.8 ms |
| 1M | 760 ms | 16 ms |

GPU specifications:
- Tesla T4 15109MiB

CPU specifications:
- 11th gen intel i7, 8 cores, 16 Logical processors, 32 GB Memory
- Sklearn njobs as default

Authors:
- https://github.com/Sreekiran096

Approvers:
- Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4797

rapidsai#4783 Added nan_euclidean distance metric to pairwise_distances

89e13a2

SreekiranprasadV requested a review from a team as a code owner July 1, 2022 20:25

github-actions bot added the Cython / Python label Jul 1, 2022

beckernick added the feature request and non-breaking labels Jul 1, 2022

cjnolet requested changes Jul 14, 2022

SreekiranprasadV and others added 3 commits July 17, 2022 21:59

Merge branch 'rapidsai:branch-22.08' into nan_euclidean_distance

d90511a

resolved PR comments

2149627

updated function name

7663344

SreekiranprasadV and others added 5 commits July 28, 2022 18:00

Merge branch 'rapidsai:branch-22.08' into nan_euclidean_distance

3180ace

optimized distance metric

472440c

style additions

43ce8e4

style additions

4c3d0d0

resolving GPU CI issues

6fb284c

SreekiranprasadV mentioned this pull request Aug 2, 2022

Knn Imputer Class and dependency functionalities #4820

Open

cjnolet changed the base branch from branch-22.08 to branch-22.10 August 9, 2022 20:32

cjnolet approved these changes Aug 30, 2022

rapids-bot bot merged commit 1e697db into rapidsai:branch-22.10 Aug 30, 2022
