Knn Imputer Class and dependency functionalities #4820

Open
wants to merge 10,000 commits into
base: branch-23.02

Contributor

@SreekiranprasadV SreekiranprasadV commented Jul 18, 2022

Merge PR #4797 before merging this one; the functionality this PR requires is in #4797.

This draft PR adds a KNN imputer class, and the functionality it depends on, for imputing missing values.

Supported inputs: NumPy arrays, pandas DataFrames, CuPy arrays, cuDF DataFrames

Tested on: a single Tesla T4 GPU
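
For readers unfamiliar with KNN imputation, here is a minimal NumPy sketch of the general technique, not the code in this PR. It assumes a NaN-aware Euclidean distance (computed over the coordinates two rows share and rescaled by the fraction of shared coordinates) and a plain mean over the nearest donor rows. The function names `nan_euclidean_distances` and `knn_impute` are illustrative, and the sketch leaves a value as NaN when no donor row exists.

```python
import numpy as np


def nan_euclidean_distances(X):
    """Pairwise Euclidean distance using only coordinates present in both rows.

    The squared distance is rescaled by n_features / n_shared so that rows
    sharing few coordinates are not treated as artificially close.
    """
    present = ~np.isnan(X)
    p = present.astype(float)
    filled = np.where(present, X, 0.0)
    sq = filled ** 2
    shared = p @ p.T  # number of coordinates present in both rows
    # Sum of (x - y)^2 restricted to the shared coordinates.
    d2 = sq @ p.T + p @ sq.T - 2.0 * filled @ filled.T
    d2 = np.clip(d2, 0.0, None)
    with np.errstate(divide="ignore", invalid="ignore"):
        d2 = d2 * X.shape[1] / shared
    d2[shared == 0] = np.nan  # no shared coordinates: distance undefined
    return np.sqrt(d2)


def knn_impute(X, n_neighbors=2):
    """Fill each NaN with the mean of that column over the nearest donor rows."""
    X = np.asarray(X, dtype=float)
    dist = nan_euclidean_distances(X)
    np.fill_diagonal(dist, np.inf)  # a row is never its own donor
    out = X.copy()
    for i, j in zip(*np.nonzero(np.isnan(X))):
        # Donors must have column j present and a defined distance to row i.
        donors = np.flatnonzero(~np.isnan(X[:, j]) & np.isfinite(dist[i]))
        if donors.size == 0:
            continue  # this sketch leaves the value missing
        k = min(n_neighbors, donors.size)
        nearest = donors[np.argsort(dist[i, donors])[:k]]
        out[i, j] = X[nearest, j].mean()
    return out


X_demo = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])
imputed = knn_impute(X_demo, n_neighbors=2)
```

The per-cell Python loop keeps the sketch readable; a GPU implementation would batch the neighbor selection instead.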

Timings:

Tested on NumPy arrays with 25% of the data masked, the distance metric averaged, and the number of columns set to 100.

| Data points | cuML | scikit-learn |
|---|---|---|
| 100,000 | 0.513 s | 0.383 s |
| 1M | 10.5 s | 36.1 s |
| 10M | 105 s | 373 s |
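
The PR does not say how the masked benchmark inputs were produced. One plausible way to build them, assuming standard-normal values and independent, uniformly random masking (both are assumptions, as is the helper name `make_masked_data`), is sketched below.

```python
import numpy as np


def make_masked_data(n_rows, n_cols=100, missing_fraction=0.25, seed=0):
    """Return an (n_rows, n_cols) float array with roughly
    `missing_fraction` of its entries replaced by NaN."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_rows, n_cols))
    # Each entry is masked independently with the requested probability.
    mask = rng.random((n_rows, n_cols)) < missing_fraction
    X[mask] = np.nan
    return X


X_small = make_masked_data(1_000, missing_fraction=0.25)
observed_missing = float(np.isnan(X_small).mean())
```

Passing `missing_fraction=0.01` gives the lighter masking level used in the other benchmark run.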

Tested on NumPy arrays with 1% of the data masked, the distance metric averaged, and the number of columns set to 100.

| Data points | cuML | scikit-learn |
|---|---|---|
| 100,000 | 0.217 s | 0.208 s |
| 1M | 2.86 s | 7.73 s |
| 10M | 10.2 s | 122 s |
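
To make the comparison easier to read, the timings reported above can be turned into speedup ratios (scikit-learn time divided by cuML time). The numbers below are copied from the two benchmark runs; the dictionary layout and the `speedup` helper are only illustrative.

```python
# Timings in seconds as reported above: (cuML, scikit-learn).
timings = {
    "25% masked": {"100,000": (0.513, 0.383), "1M": (10.5, 36.1), "10M": (105.0, 373.0)},
    "1% masked": {"100,000": (0.217, 0.208), "1M": (2.86, 7.73), "10M": (10.2, 122.0)},
}


def speedup(cuml_seconds, sklearn_seconds):
    """Ratio above 1 means cuML was faster; below 1 means it was slower."""
    return sklearn_seconds / cuml_seconds


speedups = {
    scenario: {size: round(speedup(*pair), 2) for size, pair in rows.items()}
    for scenario, rows in timings.items()
}

for scenario, rows in speedups.items():
    for size, ratio in rows.items():
        print(f"{scenario:>10} | {size:>7} rows | {ratio:.2f}x")
```

By this measure cuML is slightly slower at 100,000 rows in both runs, roughly 3.4x to 3.6x faster at 1M and 10M rows with 25% masking, and about 12x faster at 10M rows with 1% masking.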

Profiling on 1 million records:

    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      100    0.491    0.005    0.570    0.006 {method 'argpartition' of 'cupy._core.core.ndarray' objects}
     3561    0.197    0.000    0.213    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/rmm/rmm.py:212(rmm_cupy_allocator)
        1    0.149    0.149    1.078    1.078 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/_thirdparty/sklearn/preprocessing/_imputation.py:951(transform)
      2/1    0.087    0.044    0.161    0.161 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/internals/api_decorators.py:453(inner_with_getters)
        3    0.056    0.019    0.064    0.021 {method 'dot' of 'cupy._core.core.ndarray' objects}
      201    0.024    0.000    0.039    0.000 {method 'nonzero' of 'cupy._core.core.ndarray' objects}
      100    0.014    0.000    0.621    0.006 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/_thirdparty/sklearn/preprocessing/_imputation.py:863(_calc_impute)
      200    0.005    0.000    0.009    0.000 {built-in method cupy._core._routines_math._nansum}
     3562    0.005    0.000    0.010    0.000 cuda/cudart.pyx:10521(cudaGetDevice)
      101    0.004    0.000    0.009    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cupy/_creation/ranges.py:9(arange)
     3562    0.004    0.000    0.014    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/rmm/_cuda/gpu.py:53(getDevice)
      200    0.004    0.000    0.008    0.000 {method 'take' of 'cupy._core.core.ndarray' objects}
      100    0.003    0.000    0.006    0.000 {method 'all' of 'cupy._core.core.ndarray' objects}
     3562    0.003    0.000    0.005    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/enum.py:358(__call__)
      103    0.003    0.000    0.005    0.000 {method 'any' of 'cupy._core.core.ndarray' objects}
      616    0.002    0.000    0.014    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cupy/_creation/basic.py:7(empty)
     3561    0.002    0.000    0.002    0.000 {built-in method cupy.cuda.stream.get_current_stream}
     3562    0.002    0.000    0.002    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/enum.py:670(__new__)
     2107    0.002    0.000    0.004    0.000 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/numpy/core/numeric.py:1858(isscalar)
      7/1    0.002    0.000    1.080    1.080 /nvme/1/svadaga/miniconda3/envs/cuml_dev/lib/python3.9/site-packages/cuml/internals/api_decorators.py:357(inner)
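
The listing above has the column layout of Python's `cProfile`/`pstats` output sorted by total time. The PR does not show how the profile was collected, so the following is only a sketch of one way to produce such a listing; `profile_call` and `workload` are hypothetical names, and `workload` stands in for the imputer's `transform` call.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=20, **kwargs):
    """Run func under cProfile and return (result, report sorted by tottime)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # "tottime" is time spent in the function itself, excluding callees.
    stats.sort_stats("tottime").print_stats(top)
    return result, buffer.getvalue()


def workload(n):
    # Placeholder for the code being profiled.
    return sum(i * i for i in range(n))


result, report = profile_call(workload, 10_000)
print(report)
```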

CuPy built-in functions take the largest share of the time, with `argpartition` alone accounting for 0.491 s of total time over 100 calls.
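
For context on why `argpartition` shows up in a KNN imputer at all: it is a common way to pick the k nearest neighbors without fully sorting each row of a distance matrix. The sketch below uses NumPy so that it runs without a GPU; whether the PR uses `argpartition` in exactly this way is an assumption, and `k_smallest_indices` is an illustrative name.

```python
import numpy as np


def k_smallest_indices(distances, k):
    """Per row, return the indices of the k smallest distances, nearest first."""
    distances = np.asarray(distances)
    # After partitioning around position k - 1, the first k positions of each
    # row hold the k smallest entries, in no particular order.
    part = np.argpartition(distances, k - 1, axis=1)[:, :k]
    # Sort only those k candidates by their actual distance.
    order = np.argsort(np.take_along_axis(distances, part, axis=1), axis=1)
    return np.take_along_axis(part, order, axis=1)


dist_demo = np.array([
    [0.9, 0.1, 0.5, 0.3],
    [0.2, 0.8, 0.4, 0.6],
])
nearest_two = k_smallest_indices(dist_demo, k=2)
```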

ajschmidt8 and others added 30 commits November 17, 2021 13:41
Implementing LinearSVM using the existing QN solvers.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Robert Maynard (https://github.com/robertmaynard)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4268
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Closes rapidsai#3846 

Adds support for exogenous variables to ARIMA.
All series in the batch must have the same number of exogenous variables, and exogenous variables are not shared across the batch (`exog` therefore has `n_exog * batch_size` columns).

Example:
```python
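from cuml.tsa.arima import ARIMA

# df_endog, df_exog_past and df_exog_future are assumed to be defined already.
model = ARIMA(endog=df_endog, exog=df_exog_past, order=(1,0,1),
              seasonal_order=(1,1,1,12), fit_intercept=True,
              simple_differencing=False)
model.fit()
# Forecast 40 steps ahead and request bounds at the 0.95 level.
fc, lower, upper = model.forecast(40, exog=df_exog_future, level=0.95)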
```

![2021-09-22_exog_fc](https://user-images.githubusercontent.com/17441062/134339807-f815a7a3-98dc-49e5-8599-9607e660597a.png)
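
The shape requirement on `exog` (`n_exog * batch_size` columns, nothing shared between series) can be illustrated with plain NumPy. The column ordering used here, with all exogenous variables of series 0 first, then those of series 1, and so on, is an assumption made for the sketch and is not stated in the commit message; `exog_for_series` is a hypothetical helper.

```python
import numpy as np

n_obs, batch_size, n_exog = 48, 3, 2
rng = np.random.default_rng(0)

# One (n_obs, n_exog) block of exogenous variables per series in the batch.
per_series = [rng.standard_normal((n_obs, n_exog)) for _ in range(batch_size)]

# Assumed layout: blocks placed side by side, series by series.
exog = np.hstack(per_series)


def exog_for_series(exog, series_index, n_exog):
    """Slice out the columns belonging to one series under the assumed layout."""
    start = series_index * n_exog
    return exog[:, start:start + n_exog]


print(exog.shape)
```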

Authors:
  - Louis Sugy (https://github.com/Nyrio)
  - Tamas Bela Feher (https://github.com/tfeher)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Tamas Bela Feher (https://github.com/tfeher)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4221
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Addresses rapidsai#4110

This is an experimental prototype. For now, it supports:
* XGBoost models with numerical splits
* cuML RF regressors with numerical splits

cuML RF classifiers are not supported.

Authors:
  - Philip Hyunsu Cho (https://github.com/hcho3)

Approvers:
  - Rory Mitchell (https://github.com/RAMitchell)
  - William Hicks (https://github.com/wphicks)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4351
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
This upgrade is required to be in line with: rapidsai/cudf#9716

Depends on: rapidsai/integration#390

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Ray Douglass (https://github.com/raydouglass)

URL: rapidsai#4372
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
Fix Changelog Merge Conflicts for `branch-21.12`
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
…idsai#4400)

This PR uses Project Flash to build the cuML Python package, mirroring the C++ flow.

Note: This currently changes only the CUDA 11.0 GPU test, since that one uses Python 3.7. Covering the other jobs would require building the Python package twice in the CPU job.
[gpuCI] Forward-merge branch-21.12 to branch-22.02 [skip gpuci]
…#4382)

Suggest using LinearSVM when the user chooses to use the linear kernel in SVM. The reason is that LinearSVM uses a specialized faster solver.

Closes rapidsai#1664
Also partially addresses rapidsai#2857

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Tamas Bela Feher (https://github.com/tfeher)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4382
…ai#4405)

There were actually 2 minor issues that prevented `UMAPAlgo::Optimize::find_params_ab()` from being ASAN-clean:

- One is the memory leaks, of course
- The other is the `malloc()`-`delete` mismatch: only memory allocated using `new` or equivalent should be freed with operator `delete` or `delete[]`

Another issue also addressed here is exception safety (i.e., by using `make_unique` from C++14).

Signed-off-by: Yitao Li <[email protected]>

Authors:
  - Yitao Li (https://github.com/yitao-li)

Approvers:
  - Zach Bjornson (https://github.com/zbjornson)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4405
P_sum is equal to n. See rapidsai#2622 where I made this change once before. rapidsai#4208 changed it back while consolidating code.

Authors:
  - Zach Bjornson (https://github.com/zbjornson)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)

URL: rapidsai#4425
@beckernick beckernick added the non-breaking Non-breaking change label Jul 26, 2022
@SreekiranprasadV
Contributor Author

rerun tests

SreekiranprasadV and others added 8 commits July 26, 2022 11:25
Pass the `NVTX` option to raft in a way more consistent with the other arguments, and make sure the `RAFT_NVTX` option is present in the installed `raft-config.cmake`.

Authors:
  - Artem M. Chirkin (https://github.com/achirkin)

Approvers:
  - Corey J. Nolet (https://github.com/cjnolet)
  - Robert Maynard (https://github.com/robertmaynard)

URL: rapidsai#4825
The conda recipe was updated to UCX 1.13.0 in rapidsai#4809, but the conda environment files were not updated there.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Jordan Jacobelli (https://github.com/Ethyling)

URL: rapidsai#4813
Allows cuML to be installed with CuPy 11.

xref: rapidsai/integration#508

Authors:
  - https://github.com/jakirkham

Approvers:
  - Sevag H (https://github.com/sevagh)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: rapidsai#4837
@SreekiranprasadV
Contributor Author

rerun tests


@dantegd dantegd changed the base branch from branch-22.08 to branch-22.10 August 31, 2022 17:59
@codecov-commenter

Codecov Report

Base: 77.62% // Head: 78.24% // Increases project coverage by +0.61% 🎉

Coverage data is based on head (e629e77) compared to base (dc77d6b).
Patch coverage: 81.81% of modified lines in pull request are covered.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.10    #4820      +/-   ##
================================================
+ Coverage         77.62%   78.24%   +0.61%     
================================================
  Files               180      181       +1     
  Lines             11384    11610     +226     
================================================
+ Hits               8837     9084     +247     
+ Misses             2547     2526      -21     
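
The percentages in the report can be cross-checked from the hit and line counts in the diff above. This sketch assumes coverage is hits divided by lines and truncates to two decimals; the truncation is an assumption that happens to reproduce the reported figures, not a statement about how Codecov rounds.

```python
import math


def coverage_percent(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


def truncate2(value):
    """Drop everything past the second decimal place."""
    return math.floor(value * 100) / 100


base = coverage_percent(8837, 11384)   # base branch
head = coverage_percent(9084, 11610)   # this PR
delta = head - base

print(truncate2(base), truncate2(head), truncate2(delta))
```
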
| Flag | Coverage Δ | |
|---|---|---|
| dask | 46.27% <14.39%> (+0.75%) | ⬆️ |
| non-dask | 67.70% <81.81%> (+0.43%) | ⬆️ |


| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cuml/_thirdparty/sklearn/neighbors/_base.py | 66.66% <66.66%> (ø) | |
| ...l/_thirdparty/sklearn/preprocessing/_imputation.py | 85.71% <84.90%> (-0.62%) | ⬇️ |
| ...cuml/_thirdparty/sklearn/preprocessing/__init__.py | 100.00% <100.00%> (ø) | |
| python/cuml/metrics/__init__.py | 100.00% <100.00%> (ø) | |
| python/cuml/common/array.py | 97.21% <0.00%> (-0.78%) | ⬇️ |
| python/cuml/cluster/__init__.py | 100.00% <0.00%> (ø) | |
| python/cuml/feature_extraction/_vectorizers.py | 89.93% <0.00%> (+0.37%) | ⬆️ |
| python/cuml/common/import_utils.py | 59.82% <0.00%> (+0.85%) | ⬆️ |
| python/cuml/thirdparty_adapters/adapters.py | 92.99% <0.00%> (+1.50%) | ⬆️ |
| .../dask/extended/linear_model/logistic_regression.py | 92.00% <0.00%> (+57.33%) | ⬆️ |


@github-actions

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

@dantegd dantegd changed the base branch from branch-22.10 to branch-23.02 December 8, 2022 11:35
@ajschmidt8 ajschmidt8 requested review from a team as code owners February 13, 2023 18:56
Labels
Cython / Python (Cython or Python issue), feature request (New feature or request), gpuCI (gpuCI issue), inactive-30d, non-breaking (Non-breaking change)