Sync with upstream #20

`#include <cuml/manifold/umap.hpp>` works now.
Co-authored-by: Corey J. Nolet <[email protected]>

* Moving conftest.py files around and adding quick_run plugin
* Adding PR to CHANGELOG
* Incorporating feedback from code review

* Initial cython test commit
* Update changelog
* Style fixes
Co-authored-by: Nanthini Balasubramanian <[email protected]>
Co-authored-by: Dante Gama Dessavre <[email protected]>

…precation warnings (#3155)
* Get rid of warnings in random projections test
* Update changelog
* Fix style
* Update other deprecated make_blob imports

* FIX Force local install by specifying exact build string
* DOC Update changelog
* Update ci/gpu/build.sh
Co-authored-by: AJ Schmidt <[email protected]>
Co-authored-by: AJ Schmidt <[email protected]>

* Update flake8 config to join python/cython configuration and improve setup to check __init__.py files
* Fixing linting issues in previously ignored __init__.py files
* Update flake8 config to join python/cython configuration and improve setup to check __init__.py files
* Fixing linting issues in previously ignored __init__.py files
* Adding PR to CHANGELOG
* Incorporating feedback from code review
* Fixing style issues after merge with branch-0.17
Co-authored-by: Corey J. Nolet <[email protected]>
Co-authored-by: Dante Gama Dessavre <[email protected]>

…kip-ci] (#3144)
* Adding ability to set arbitrary cmake flags in ./build.sh via the $CUML_ADDL_CMAKE_ARGS variable
* Adding PR to CHANGELOG
* Adding more help info requested from code review.
Co-authored-by: John Zedlewski <[email protected]>

* Adding brute force knn shell to sparse
* Stubbing out algorithm flow
* Adding initial headers to wrapper
* Performing idx batching
* Starting to fill in cusparse calls
* Checking in
* Beginning to add selection kernel
* Finished header
* Updates. Need to finish populating merge buffer
* Using block select for selecting k and using 3-partition merge buffer
* Logic is just about done.
* Checking in changes. Need to swap out cuda 11 cusparse calls for cuda 10.2 version
* Everything is building. Need end-to-end test
* Running clang format
* Updating changelog
* Using raft's cusparse_wrappers.h instead of cuml
* Removing cuda11-required GEMM calls (commenting them out for now, will swap them out shortly)
* Fixing clang style
* Separating distance computation from selection from general brute force algorithm to make pieces more reusable
* Updating clang style
* Adding batcher to help ease batch state management
* Fixing clang style
* More clang fixes
* IP distance is computed using search * index.T.
* Making type template for value_t all the way through knn_merge_parts
* Adding simple googletest for sparse pairwise dists. The transpose conversion seems super expensive, but maybe it's necessary.
* Completing test for basic inner product distances
* Removing prints from test
* Cleaning up batching for knn. Ready to gtest
* KNN w/ max inner product is working
* Adding guts of expanded l2 computation.
* Cleaning up some debug prints
* Fixing clang format
* More cleanup and clang style fix
* Fixing style for sparse knn prim test
* Hoping I've captured all the clang updates
* Updating per include_checker
* I feel like I'm bouncing back and forth between clang and include checker
* Refactoring sparse pairwise dists to return dense outputs
* Beginning python layer
* Adding python layer for sparse inputs to nearest neighbors
* End to end sparse knn works. Need to finish norms for expanded euclidean and expose it.
* Removing unused file
* Adding gtest for expanded l2.
* Sparse l2 matches sklearn
* Fixing clang format style
* Fixing style in gtests
* Lots of changes and cleanup. Still need to flip the batching
* Progress on tiling. Still a failure when tile sizes don't match up.
* Tiling w/ uneven batch sizes works! Now just need to figure out what to do when the leftover values are <k
* Some further optimizations are necessary, but this works for now.
* Ready for cleanup
* Parametrizing sparse knn tests
* More cleanup.
* Fixing clang format
* Fixing clang format style
* Fixing flake8 for sparse nn tests
* Fixing googletests
* More cleanup of sparse knn
* Adding sparse support to UMAP by abstracting the inputs
* Everything's building. Have one template issue to fix in the sparse knn
* Updates to API
* Using a struct to manage the knn graph output state
* C++ side is largely done. Still need to figure out what to do w/ the separate int64_t type in the sparse knn
* Removing examples/comms, which seems to have gotten re-checked in by mistake
* Fixing c++ style
* Fixing include checks
* This darn style checker is going to kill me.....
* Adding template type params for output
* UMAP is officially accepting sparse inputs
* More cleanup
* Cleaning up gtests and making them easier to write
* Fixing up and parametrizing tests
* Fixing style
* Fixing python style
* More clang format style fixes
* Pulled umap inputs classes to more shared location so tsne can use them. Added kselection gtest
* Updating clang format
* Fixing bad ide refactor
* Updating changelog
* Fixing more clang format
* Fixing flake8 style. Not sure why these didn't show up locally
* Decomposing sparse knn into a class.
* Review feedback
* Better umap sparse test
* More testing updates
* Adding docs to some of the remaining prims in csr.cuh
* Adding gtests for transpose and row slice. Need to add one for todense
* GTest for csr to dense
* Fixing style
* Removing debug logging from new gtests
* Fixing flake8 style
* Getting build to pass
* Running clang-tidy
* Fixing format for sparse gtests
* Adding 'algo_params' to get_param_names()
* Removing cumlarray output in kneighbors
* Finishing review feedback
* Fixing style
* Fixing format
* clang-format
* Style changes
* More review updates
* Style updates
* Running clang format on distance.cuh
* Running clang format on tests
* Fixing cython style
* Updating RAFT commit
* Updating neighbors from bad merge
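
The tiled flow described in the commits above (inner products of each query row against batches of index rows, then a merge that keeps the best k) can be sketched in plain Python. This is only an illustration under stated assumptions, not the CUDA/cuSPARSE code from this change: sparse rows are modelled as `{column: value}` dicts, and the names `sparse_dot`, `merge_top_k`, and `sparse_knn_inner_product` are hypothetical.

```python
import heapq


def sparse_dot(a, b):
    """Inner product of two sparse rows stored as {column: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter row
    return sum(value * b[col] for col, value in a.items() if col in b)


def merge_top_k(best, candidates, k):
    """Merge two lists of (score, index) pairs, keeping the k largest scores."""
    return heapq.nlargest(k, best + candidates)


def sparse_knn_inner_product(index_rows, query_rows, k, tile_size=2):
    """For each query row, return the k index rows with the largest inner product.

    The index is processed in tiles of `tile_size` rows. The last tile may be
    smaller than the others (and smaller than k); the merge step handles that.
    """
    results = []
    for query in query_rows:
        best = []
        for start in range(0, len(index_rows), tile_size):
            tile = index_rows[start:start + tile_size]
            scored = [(sparse_dot(query, row), start + offset)
                      for offset, row in enumerate(tile)]
            best = merge_top_k(best, scored, k)
        results.append([(idx, score) for score, idx in best])
    return results


# Five index rows with a tile size of 2 gives tiles of 2, 2, and 1 rows.
index_rows = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 0.5, 1: 1.0}, {2: 4.0}, {0: 2.0}]
query_rows = [{0: 1.0, 2: 1.0}]
neighbors = sparse_knn_inner_product(index_rows, query_rows, k=2, tile_size=2)
print(neighbors)
```

Because the running best list is merged after every tile, the result does not depend on the tile size, which is the property the "uneven batch sizes" commits were working toward.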

…mples_leaf (#3132)
* Enforce min_rows_per_node in experimental RF backend
* Add min_samples_split hyperparameter
* Use correct definition of min_samples_split
* Rename range_len -> n_samples
* Add min_samples_split to Dask docstring
* Rename min_rows_per_node -> min_samples_leaf
* Update docstring for min_samples_leaf
* Correctly apply min_samples_split in new RF backend
* Address reviewer's comment
* Fix broken tests in BatchedLevelAlgo/DtRegTestF.Test
* Adjust accuracy requirement in test RFBatchedRegTests/RFBatchedRegTestF.Fit/5
* Add unit tests for min_samples_split, min_samples_leaf
* Add descriptive comments for compound literals
* Fix formatting
* Add changelog
* Organize unit tests under prefix BatchedLevelAlgoUnitTest
* Change default value for min_samples_leaf to 1
* Deprecate min_rows_per_node; guide users to use min_samples_leaf
* Fix style error
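
As a rough illustration of how the two hyperparameters interact, the sketch below follows the scikit-learn definitions: a node needs at least `min_samples_split` samples to be split at all, and a candidate split is only valid if each child keeps at least `min_samples_leaf` samples. The helper `can_split` is hypothetical and is not taken from the RF backend; the default of 2 for `min_samples_split` is scikit-learn's default and is an assumption here, while the default of 1 for `min_samples_leaf` matches the commit above.

```python
def can_split(n_samples, n_left, n_right, min_samples_split=2, min_samples_leaf=1):
    """Return True if a node with n_samples rows may be split into the given children."""
    if n_left + n_right != n_samples:
        raise ValueError("children must partition the node's samples")
    # The node itself must be large enough to be considered for splitting.
    if n_samples < min_samples_split:
        return False
    # Each resulting leaf must keep at least min_samples_leaf samples.
    return n_left >= min_samples_leaf and n_right >= min_samples_leaf


print(can_split(10, 5, 5))                      # balanced split, defaults
print(can_split(10, 9, 1, min_samples_leaf=2))  # right child too small
print(can_split(3, 2, 1, min_samples_split=4))  # node too small to split
```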

…ors (#3113)
* FEA Add preferred_order class parameter to linear models
* ENH adopt tags from scikit-learn API to support preferred order attribute
* DOC remove attribute docstrings
* FIX Change straggling classes
* FIX Change straggling classes
* FIX Add missing self
* FIX straggling attribute
* ENH Add device data tag for proposal
* FEA Add all scikit-learn API tags to base and improve gpu input types tag
* FEA Add preferred_order tag to cluster models
* FEA Add preferred_order tag to most models
* ENH Improvements and PR review feedback
* DOC add tag documentation to estimator guide
* DOC add scikit link
* Update wiki/python/ESTIMATOR_GUIDE.md Co-authored-by: Corey J. Nolet <[email protected]>
* Update wiki/python/ESTIMATOR_GUIDE.md Co-authored-by: Corey J. Nolet <[email protected]>
* Update wiki/python/ESTIMATOR_GUIDE.md Co-authored-by: Corey J. Nolet <[email protected]>
* Update wiki/python/ESTIMATOR_GUIDE.md Co-authored-by: Corey J. Nolet <[email protected]>
* Update wiki/python/ESTIMATOR_GUIDE.md Co-authored-by: Corey J. Nolet <[email protected]>
* ENH Rename test_fit to test_api and add tags tests
* FIX fixes from PR review
* DOC Added entry to changelog
* FIX PEP8 fixes
Co-authored-by: Corey J. Nolet <[email protected]>

* Removing extra unneeded file
* Updating changelog

…#3152)
* FIX Access to attributes of individual NB objects in dask NB
* DOC Added entry to changelog
* ENH Add pytest
* FIX PEP8 fixes
Co-authored-by: John Zedlewski <[email protected]>

…the tiniest models (#3032)
* just control block count
* blocks_per_sm can now be passed through treelite_params_t or forest_params_t
* changelog
* made blocks_per_sm mandatory; added tests; fixed a bug
* changelog
* added tests, moved __syncthreads() to common for all acc's, removed most blockIdx.x uses
* removed blocks_per_sm from python API, to avoid a longer discussion on best set
* simplified output loops
* addressed other review comments
* fixed bad merge conflict resolution
* comment for blocks_per_sm in fil.pyx
* style

* binary reduction: half way there
* quaternary reduction
* changelog
* remove accidental files
* generalize the multireduction
* adding dedicated tests for multireduction; style
* change trap; into setting an atomic.
* split into n tests, one per size
* ?
* tried thrust + rmm, no rmm dependency in tests it seems
* no rmm, sync allocations
* style
* fixed some testing bugs; expanded test to all block sizes; better documentation
* fixed wrong test
* simplify comparison
* member -> non-member function pointer as test template argument
* style
* replaced reduction with simpler code; tuned radix towards fewer classes
* fixed compile dependency and runtime discrepancy
* long comment line
* fix build issues
* Apply suggestions from code review Co-authored-by: Andy Adinets <[email protected]>
* addressed review comments
Co-authored-by: Andy Adinets <[email protected]>

* add dask-glm demo link
* add to changelog
Co-authored-by: Corey J. Nolet <[email protected]>
Co-authored-by: Dante Gama Dessavre <[email protected]>

Updated with 0.15 and 0.16 release dates.
Co-authored-by: Corey J. Nolet <[email protected]>

* Remove outdated, extraneous file
* Update changelog

* Expose silhouette score in Python
* Style fix
* Correct dtypes used in silhouette_score
* Update changelog
* Fix style
* Update linebreaks
* Add copyright headers
* Collapse Python silhouette_score to single file
* Restructure silhouette_score for consistency
* Fix style
* Loosen silhouette score test tolerance

…[skip-ci] (#3175)
* FIX Fix gtest pinned cmake version for build from source option
* DOC Added entry to changelog

…3176)
* Add probabilistic SVM tests with various input array types
* DOC update changelog

* Fix a bug in MSE metric calculation
* Style fix
* Add changelog
* Try smaller grid dimensions

* blocks_per_sm FIL parameter in Python.
* Updated CHANGELOG.md.
* Fixed style errors.
* Reduced the number of parameter combinations in the Python test.

)
* Enable pipeline usage for OneHotEncoder and LabelEncoder
* Changelog update

* Adding simple dask estimator notebook to demonstrate saving/loading
* Renaming and updating cells
* Updating source.rst
* Updating changelog
* Updating pickling notebook
* Review updates
* More review feedback
Co-authored-by: John Zedlewski <[email protected]>

* Fix + multiple improvements
* Update changelog
* Update model output and testing
* Check style update
* Update comments
* Test one query partition
* Check style

…dically-failing FIL test [skip ci] (#3196)
* Disable ascending=false path for sortColumnsPerRow
* DOC Update changelog
* Disable flaky FIL test
Co-authored-by: John Zedlewski <[email protected]>
Co-authored-by: John Zedlewski <[email protected]>

* FIX Fix EXITCODE override in test_notebooks script
* DOC Changelog update
* FIX Move bash trap to after the GTests so they fail immediately
* FIX Move codecov block to gpu build

* Fix cuDF to cuPy conversion (missing value)
* Changelog update
* Introducing fail_on_nan parameter
* Adding test with fail_on_nan=True
* Updating conversion
* Rename fail_on_nan to fail_on_null

This PR fixes the attribute error reported in #3183, along with two additional bugs: one in the input type handling of PCA (the `sparse_scipy_to_cp()` function call was missing an argument) and one in the shape of `self.singular_values_`. It also adds tests covering the bugs fixed here.
Authors:
  - Mickael Ide <[email protected]>
  - John Zedlewski <[email protected]>
Approvers:
  - Divye Gala
  - John Zedlewski
URL: #3190

…3214)
Add atol parameter to silhouette_score test to ensure consistent test behavior
Authors:
  - William Hicks <[email protected]>
Approvers:
  - John Zedlewski
URL: #3214

I found it helpful when debugging the MSE metric calculation in random forest.
Gain = Change in the metric (MSE / MAE / Gini / Entropy) that's attributed directly to each internal node (split).
Authors:
  - Hyunsu Cho <[email protected]>
Approvers:
  - John Zedlewski
URL: #3186
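
To make the definition of gain concrete, here is a minimal sketch for the MSE case. It assumes the common convention that a split's gain is the parent's metric minus the sample-weighted average of the children's metrics; the PR text does not spell out the exact formula used in the random forest code, so treat the weighting as an assumption. The helper names `mse` and `split_gain` are hypothetical.

```python
def mse(values):
    """Mean squared error of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def split_gain(left, right, metric=mse):
    """Reduction in the metric attributed to splitting a node into left and right."""
    parent = left + right
    n = len(parent)
    # Assumed convention: children are weighted by their share of the samples.
    weighted_children = (len(left) / n) * metric(left) + (len(right) / n) * metric(right)
    return metric(parent) - weighted_children


# A split that separates the targets perfectly removes all of the parent's MSE.
perfect = split_gain([1.0, 1.0], [3.0, 3.0])
# A split that leaves both children as mixed as the parent gains nothing.
useless = split_gain([1.0, 3.0], [1.0, 3.0])
print(perfect, useless)
```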

This PR fixes #3173. With this change, the page renders locally for me as shown below.
![image](https://user-images.githubusercontent.com/4837571/100154855-22761e00-2e5b-11eb-9be3-173e1a53ad08.png)
Authors:
  - Vibhu Jawa <[email protected]>
  - Vibhu Jawa <[email protected]>
  - Corey J. Nolet <[email protected]>
Approvers:
  - John Zedlewski
URL: #3185