Provide method for auto-optimization of FIL parameters #5368
Conversation
Provide a method to automatically find the optimal chunk_size and layout for a FIL model
Very nice! I have some comments and suggestions, but no major concerns.
python/cuml/experimental/fil/fil.pyx
Outdated
data
    Example data of shape iterations x batch size x features or None.
    If None, random data will be generated instead.
batch_size : int
    If example data is not provided, random data with this many rows
    per batch will be used.
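For illustration, the two parameterizations documented above might be exercised as follows. This is a hypothetical sketch: `model` stands in for an already-loaded experimental FIL estimator, and the shapes are arbitrary.

```python
import numpy as np

# Hypothetical sketch of the two parameterizations documented above;
# `model` is assumed to be an already-loaded experimental FIL estimator.

# 1. Explicit example data, shaped iterations x batch size x features.
example_data = np.random.rand(10, 1024, 32).astype(np.float32)
model.optimize(data=example_data)

# 2. No example data: random rows are generated at the given batch size.
model.optimize(batch_size=1024)
```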
Alternatively, we could just accept an int as the data parameter, which would imply "please generate data for me." This would avoid ambiguous parameterization.
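A rough sketch of that alternative, purely for illustration (this is not the signature the PR adopts, and `n_features_` is a hypothetical attribute):

```python
import numpy as np

def optimize(self, data=1024, iterations=10):
    # Hypothetical dispatch: an int is read as "generate random data
    # with this many rows per batch" rather than as example data.
    if isinstance(data, int):
        data = np.random.rand(
            iterations, data, self.n_features_  # n_features_ is assumed
        ).astype(np.float32)
    # ... proceed to benchmark chunk_size / layout combinations on `data`
```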
I ultimately did not go with this because the same argument applies to the (newly-renamed) unique_batches argument. We could say that data can accept an int or a tuple to cover both of those, but then it starts to get cumbersome to determine the actual intent of the caller. Did they, e.g., expect that tuple to be converted to an array and passed in as data?
I think the explicitness of keeping these as separate parameters is worthwhile, but it's definitely a balance of considerations. Thoughts?
I see. I think the cleanest solution would then be to provide a generator function that can be used as input if the default random_data shapes are not acceptable:
estimator.optimize(data=fil.random_data_generator(batch_size=1024, unique_batches=10))
This would also allow the user to provide their own custom generator function that more closely models their data if needed.
Just providing example data could be equivalent to:
estimator.optimize(data=KFold().split(example_data))
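For illustration, a user-supplied generator along these lines might look like the following. This is hypothetical: `random_data_generator` above is a proposed name rather than an existing cuML API, and the distribution here is arbitrary.

```python
import numpy as np

def my_batch_generator(unique_batches=10, batch_size=1024, n_features=32):
    # Hypothetical custom generator: yields batches drawn from a
    # distribution meant to resemble the user's real data more closely
    # than uniform random values would.
    for _ in range(unique_batches):
        yield np.random.exponential(
            scale=0.5, size=(batch_size, n_features)
        ).astype(np.float32)

# estimator.optimize(data=my_batch_generator())
```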
Co-authored-by: Carl Simon Adorf <[email protected]>
Converted back to draft until I refactor based on @csadorf's excellent suggestions.
This reverts commit cc1b14a.
/merge
Provide an `optimize` method for experimental FIL models that will automatically select the optimal chunk size and layout based on the indicated batch size or example data.
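A hedged usage sketch, assuming a forest model serialized to a placeholder file `model.json`; the loader call follows cuML's experimental FIL API, and the `optimize` parameters follow the docstring excerpts above rather than a confirmed final signature:

```python
import numpy as np
from cuml.experimental.fil import ForestInference

fm = ForestInference.load("model.json")  # placeholder path to a saved model

# Search over chunk_size and layout using generated random data of the
# indicated batch size...
fm.optimize(batch_size=1024)

# ...or using example data shaped iterations x batch size x features.
fm.optimize(data=np.random.rand(10, 1024, 32).astype(np.float32))

# Subsequent predictions use the selected chunk_size and layout.
preds = fm.predict(np.random.rand(1024, 32).astype(np.float32))
```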