Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide method for auto-optimization of FIL parameters #5368

Merged
merged 23 commits into from
May 31, 2023

Conversation

wphicks
Copy link
Contributor

@wphicks wphicks commented Apr 17, 2023

Provide an optimize method for experimental FIL models that will automatically select the optimal chunk size and layout based on the indicated batch size or example data.

@wphicks wphicks added feature request New feature or request 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Apr 17, 2023
@wphicks wphicks requested review from a team as code owners April 17, 2023 15:38
@github-actions github-actions bot added CUDA/C++ Cython / Python Cython or Python issue labels Apr 17, 2023
Copy link
Contributor

@csadorf csadorf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! I have some comments and suggestions, but no major concerns.

python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
Comment on lines 1294 to 1299
data
Example data of shape iterations x batch size x features or None.
If None, random data will be generated instead.
batch_size : int
If example data is not provided, random data with this many rows
per batch will be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alternative, we could just accept an int as data parameter, which would imply "please generate data for me." This would avoid ambiguous parameterization.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ultimately did not go with this because the same argument applies to the (newly-renamed) unique_batches argument. We could say that data can accept an int or a tuple to cover both of those, but then it starts to get cumbersome to determine the actual intent of the caller. Did they e.g. expect that tuple to be converted to an array and passed in as data?

I think the explicitness of keeping these as separate parameters is worthwhile, but it's definitely a balance of considerations. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think the cleanest solution would then be to provide a generator function that can be used as input if the default random_data shapes are not acceptable:

estimator.optimize(data=fil.random_data_generator(batch_size=1024, unique_batches=10))

This would also allow the user to provide their own custom generator function that more closely models their data if needed.

Just providing example data could be equivalent to:

estimator.optimize(data=KFold().get_n_splits(example_data))

python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
Co-authored-by: Carl Simon Adorf <[email protected]>
@wphicks wphicks added 4 - Waiting on Author Waiting for author to respond to review and removed 3 - Ready for Review Ready for review by team labels Apr 24, 2023
@wphicks wphicks marked this pull request as draft April 25, 2023 14:22
@wphicks
Copy link
Contributor Author

wphicks commented Apr 25, 2023

Converted back to draft until I refactor based on @csadorf's excellent suggestions.

@wphicks wphicks added 4 - Waiting on Reviewer Waiting for reviewer to review or respond and removed 4 - Waiting on Author Waiting for author to respond to review labels May 22, 2023
@wphicks wphicks marked this pull request as ready for review May 22, 2023 16:56
python/cuml/experimental/fil/fil.pyx Outdated Show resolved Hide resolved
Comment on lines 1294 to 1299
data
Example data of shape iterations x batch size x features or None.
If None, random data will be generated instead.
batch_size : int
If example data is not provided, random data with this many rows
per batch will be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think the cleanest solution would then be to provide a generator function that can be used as input if the default random_data shapes are not acceptable:

estimator.optimize(data=fil.random_data_generator(batch_size=1024, unique_batches=10))

This would also allow the user to provide their own custom generator function that more closely models their data if needed.

Just providing example data could be equivalent to:

estimator.optimize(data=KFold().get_n_splits(example_data))

@github-actions github-actions bot removed the CUDA/C++ label May 26, 2023
@dantegd
Copy link
Member

dantegd commented May 31, 2023

/merge

@rapids-bot rapids-bot bot merged commit 91eea36 into rapidsai:branch-23.06 May 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4 - Waiting on Reviewer Waiting for reviewer to review or respond Cython / Python Cython or Python issue feature request New feature or request non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants