Support iter_epochs for Datasets #19217
PTAL
@ericl, just to confirm, is the behavior the same even if the pipeline structure is changed before/after the repeat call (for example on a Also, do we want to not even allow
That's right, (re)window does not change epochs (though it is possible for re-windowed windows to span epochs; in that case the max epoch of the window is used as the epoch number).
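To make the max-epoch rule concrete, here is a minimal pure-Python sketch. It does not use Ray and is not the PR's implementation; `rewindow`, `TaggedWindow`, and `windows_per_group` are hypothetical names, and windows are modeled as plain lists tagged with an epoch number.

```python
from typing import List, Tuple

# Hypothetical model: a window is a list of rows tagged with its epoch number.
TaggedWindow = Tuple[int, List[int]]


def rewindow(tagged: List[TaggedWindow], windows_per_group: int) -> List[TaggedWindow]:
    """Merge consecutive windows; a merged window takes the max epoch it contains."""
    merged = []
    for start in range(0, len(tagged), windows_per_group):
        group = tagged[start:start + windows_per_group]
        epoch = max(e for e, _ in group)  # window spanning epochs -> max epoch
        rows = [row for _, window in group for row in window]
        merged.append((epoch, rows))
    return merged


# Two epochs of three windows each, as a repeat of the data would produce.
tagged = [(0, [0, 1]), (0, [2, 3]), (0, [4, 5]),
          (1, [0, 1]), (1, [2, 3]), (1, [4, 5])]
result = rewindow(tagged, windows_per_group=2)
# The middle merged window spans epochs 0 and 1, so it is assigned epoch 1.
```

In this sketch the number of epochs is unchanged by re-windowing; only the boundary window is attributed to the later epoch.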
I tried adding this earlier, but it was a bit tricky. Also, I think it's better if the user can use iter_epochs() safely even if there isn't a repeat (e.g., so SGDv2 can always use iter_epochs(), regardless of whether the user repeated the data).
@clarkzinzow good catch on the more_itertools dependency, I didn't realize that was not built in. Removed it and replaced it with a custom class.
Why are these changes needed?
This PR adds support for splitting a DatasetPipeline into epochs to iterate over, a common ML task. An epoch is represented by a DatasetPipeline limited to data from that epoch only. Epochs are implicitly defined by calls to .repeat(), where an epoch is one repetition of the data, though the mechanism is extensible so we could support customizable epochs in the future.
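The idea of epochs defined by repetition can be sketched in plain Python. This is an illustration only, not the DatasetPipeline code from this PR: the `repeat` and `iter_epochs` functions below are stand-ins operating on lists, and the use of `itertools.groupby` is an assumption made for brevity.

```python
from itertools import groupby
from typing import Iterable, Iterator, List, Tuple

Window = List[int]


def repeat(windows: List[Window], times: int) -> Iterator[Tuple[int, Window]]:
    """Yield (epoch, window) pairs; each full pass over the data is one epoch."""
    for epoch in range(times):
        for window in windows:
            yield epoch, window


def iter_epochs(tagged: Iterable[Tuple[int, Window]]) -> Iterator[List[Window]]:
    """Group consecutive windows that share an epoch number into one epoch."""
    for _, group in groupby(tagged, key=lambda pair: pair[0]):
        yield [window for _, window in group]


data = [[0, 1], [2, 3]]

# With a repeat, each repetition becomes its own epoch.
epochs = list(iter_epochs(repeat(data, times=2)))

# Without a repeat, every window carries epoch 0, so there is a single epoch.
no_repeat = list(iter_epochs((0, window) for window in data))
```

Tagging every window with an epoch number, even when the data is never repeated, is what lets a caller loop over epochs unconditionally in this sketch.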