-
Notifications
You must be signed in to change notification settings - Fork 532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
#4783 Added nan_euclidean distance metric to pairwise_distances #4797
#4783 Added nan_euclidean distance metric to pairwise_distances #4797
Conversation
Can one of the admins verify this patch? |
ok to test |
add to allowlist |
rerun tests |
1 similar comment
rerun tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! Overall, I think this will be a great addition to cuML. Since the memory footprint of the pairwise distances already grows quadratically w/ the number of data points, we can do some tricks to save memory on the GPU.
missing_Y = missing_X if Y is X else _get_mask(Y, missing_values) | ||
|
||
# set missing values to zero | ||
X[missing_X] = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general, we strive to not modify the inputs unless we can't avoid doing so. When we do decide it's best to modify the inputs, we make sure to document that clearly in the pydocs and revert any changes back at the end of the algorithm. One of the reasons for this is that it can cause undefined and non-deterministic behavior when a user attempts to run multiple algorithms on the inputs asynchronously.
If you are to modify the inputs, I would also suggest adding a pytest assertion that the inputs X and Y haven't changed after the algorithm executes.
distances = cp.array(pairwise_distances(X, Y, metric="sqeuclidean")) | ||
|
||
# Adjust distances for missing values | ||
XX = X * X |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems really expensive- we're essentially copying each input just to compute a masked l2 norm that we can subtract from the pairwise distance matrix. In the case where X == Y, we're copying the input twice. It should be fairly straightforward to do this in place, even using a RawKernel
if needed. You should be able to take the storage requirement down to X.shape[0]
(or X.shape[0]
+ Y.shape[0]
when Y is supplied).
|
||
Parameters | ||
---------- | ||
X : array-like of shape (n_samples_X, n_features) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should follow the formats of the other cuML pydocs:
X : Dense or sparse matrix (device or host) of shape
(n_samples_x, n_features)
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy, or
cupyx.scipy.sparse for sparse input
Y : array-like (device or host) of shape (n_samples_y, n_features),\
optional
Acceptable formats: cuDF DataFrame, NumPy ndarray, Numba device
ndarray, cuda array interface compliant array like CuPy
python/cuml/tests/test_metrics.py
Outdated
@pytest.mark.parametrize("metric", ["nan_euclidean"]) | ||
@pytest.mark.parametrize("matrix_size", [(5, 4), (1000, 3), (2, 10), | ||
(500, 400), unit_param((1000, 100))]) | ||
def test_nan_euclidean_distances(metric: str, matrix_size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like much of this is duplicated from the test_pairwise_distance
pytest and this file is pretty large already. We should be able to combine this w/ the existing test and add nans to the input when the appropriate distance is invoked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made the other changes that were suggested, however in the test cases part to add nan_euclidean metric to test_pairwise_distance there should be a lot of the existing code which needs to be changed, because pairwise_distaces accept numpy arrays also as an input however nan_euclidean function accepts only cupy arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Sreekiran096 I missed that detail when reviewing this PR. We should be accepting numpy
arrays as well, which should be as easy as putting X = cp.asarray(X)
at the beginning (it's a no-op when the data is already on the GPU).
@@ -136,6 +134,104 @@ def _determine_metric(metric_str, is_sparse=False): | |||
return PAIRWISE_DISTANCE_METRICS[metric_str] | |||
|
|||
|
|||
def nan_euclidean_distances( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add "nan_euclidean"
to PAIRWISE_DISTANCE_METRICS
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you Corey for all the changes suggested will do the relevant changes and commit again.
@Sreekiran096 it looks like this PR is just waiting on style fixes |
rerun tests |
1 similar comment
rerun tests |
rerun tests |
1 similar comment
rerun tests |
Codecov Report
@@ Coverage Diff @@
## branch-22.10 #4797 +/- ##
================================================
+ Coverage 77.62% 78.02% +0.39%
================================================
Files 180 180
Lines 11384 11386 +2
================================================
+ Hits 8837 8884 +47
+ Misses 2547 2502 -45
Flags with carried forward coverage won't be shown. Click here to find out more.
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
@gpucibot merge |
@gpucibot merge |
…es (rapidsai#4797) Added nan_euclidean distance metric to pairwise_distances to calculate euclidean distance on data with missing values. - Added Test cases for nan_euclidean_distance functions Time taken to calculate: #Data Points | Sklearn | Cuml 10000 402 us 2.54 ms 100k 23 ms 3.8 ms 1M 760 ms 16 ms GPU specifications: - Tesla T4 15109MiB CPU specifications: - 11th gen intel i7, 8 cores, 16 Logical processors, 32 GB Memory - Sklearn njobs as default Authors: - https://github.com/Sreekiran096 Approvers: - Corey J. Nolet (https://github.com/cjnolet) URL: rapidsai#4797
Added nan_euclidean distance metric to pairwise_distances to calculate euclidean distance on data with missing values.
Time taken to calculate:
#Data Points | Sklearn | Cuml
10000 402 us 2.54 ms
100k 23 ms 3.8 ms
1M 760 ms 16 ms
GPU specifications:
CPU specifications: