Provide method for auto-optimization of FIL parameters #5368
Conversation
Provide a method to automatically find the optimal chunk_size and layout for a FIL model
Very nice! I have some comments and suggestions, but no major concerns.
python/cuml/experimental/fil/fil.pyx
Outdated
data
    Example data of shape iterations x batch size x features or None.
    If None, random data will be generated instead.
batch_size : int
    If example data is not provided, random data with this many rows
    per batch will be used.
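For illustration, the two parameterizations documented above might be exercised as follows. This is a hypothetical sketch: `model` stands in for an already-loaded experimental FIL estimator, and the shapes are arbitrary.

```python
import numpy as np

# Hypothetical sketch of the two parameterizations documented above;
# `model` is assumed to be an already-loaded experimental FIL estimator.

# 1. Explicit example data, shaped iterations x batch size x features.
example_data = np.random.rand(10, 1024, 32).astype(np.float32)
model.optimize(data=example_data)

# 2. No example data: random rows are generated at the given batch size.
model.optimize(batch_size=1024)
```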
Alternatively, we could just accept an int as the data parameter, which would imply "please generate data for me." This would avoid ambiguous parameterization.
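A rough sketch of that alternative, purely for illustration (this is not the signature the PR adopts, and `n_features_` is a hypothetical attribute):

```python
import numpy as np

def optimize(self, data=1024, iterations=10):
    # Hypothetical dispatch: an int is read as "generate random data
    # with this many rows per batch" rather than as example data.
    if isinstance(data, int):
        data = np.random.rand(
            iterations, data, self.n_features_  # n_features_ is assumed
        ).astype(np.float32)
    # ... proceed to benchmark chunk_size / layout combinations on `data`
```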
I ultimately did not go with this because the same argument applies to the (newly-renamed) unique_batches argument. We could say that data can accept an int or a tuple to cover both of those, but then it starts to get cumbersome to determine the actual intent of the caller. Did they, e.g., expect that tuple to be converted to an array and passed in as data?
I think the explicitness of keeping these as separate parameters is worthwhile, but it's definitely a balance of considerations. Thoughts?
I see. I think the cleanest solution would then be to provide a generator function that can be used as input if the default random_data shapes are not acceptable:
estimator.optimize(data=fil.random_data_generator(batch_size=1024, unique_batches=10))
This would also allow the user to provide their own custom generator function that more closely models their data if needed.
Just providing example data could be equivalent to:
estimator.optimize(data=KFold().split(example_data))
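For illustration, a user-supplied generator along these lines might look like the following. This is hypothetical: `random_data_generator` above is a proposed name rather than an existing cuML API, and the distribution here is arbitrary.

```python
import numpy as np

def my_batch_generator(unique_batches=10, batch_size=1024, n_features=32):
    # Hypothetical custom generator: yields batches drawn from a
    # distribution meant to resemble the user's real data more closely
    # than uniform random values would.
    for _ in range(unique_batches):
        yield np.random.exponential(
            scale=0.5, size=(batch_size, n_features)
        ).astype(np.float32)

# estimator.optimize(data=my_batch_generator())
```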
Co-authored-by: Carl Simon Adorf <[email protected]>
Converted back to draft until I refactor based on @csadorf's excellent suggestions.
This reverts commit cc1b14a.
/merge
Provide an `optimize` method for experimental FIL models that will automatically select the optimal chunk size and layout based on the indicated batch size or example data.
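A hedged usage sketch, assuming a forest model serialized to a placeholder file `model.json`; the loader call follows cuML's experimental FIL API, and the `optimize` parameters follow the docstring excerpts above rather than a confirmed final signature:

```python
import numpy as np
from cuml.experimental.fil import ForestInference

fm = ForestInference.load("model.json")  # placeholder path to a saved model

# Search over chunk_size and layout using generated random data of the
# indicated batch size...
fm.optimize(batch_size=1024)

# ...or using example data shaped iterations x batch size x features.
fm.optimize(data=np.random.rand(10, 1024, 32).astype(np.float32))

# Subsequent predictions use the selected chunk_size and layout.
preds = fm.predict(np.random.rand(1024, 32).astype(np.float32))
```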