[data] Stability & accuracy improvements for Data+Train benchmark #42027
Conversation
```python
# Transient errors that can occur during longer reads. Trigger retry when these occur.
READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]
READ_FILE_MAX_ATTEMPTS = 10
READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
```
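For context, entries in `READ_FILE_RETRY_ON_ERRORS` are matched as substrings of the raised exception's message to decide whether a retry is safe. A minimal sketch of that check (the helper name is illustrative, not from this PR):

```python
# Sketch only: substring-match an exception message against the retryable
# error list, the matching style implied by READ_FILE_RETRY_ON_ERRORS.
READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]

def is_retryable(exc: Exception) -> bool:
    message = str(exc)
    return any(pattern in message for pattern in READ_FILE_RETRY_ON_ERRORS)

# Example: an error whose message embeds the S3 error code would be retried.
assert is_retryable(OSError("AWS Error NETWORK_CONNECTION during GetObject"))
```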
This has been repeated multiple times in the codebase. Can we consolidate them with a util function?
@raulchen I intended to keep these constants separate from the retry errors for opening files, e.g.
ray/python/ray/data/datasource/file_based_datasource.py
Lines 67 to 74 in bfc1f78
```python
# The errors to retry for opening file.
OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]
# The max retry backoff in seconds for opening file.
OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
# The max number of attempts for opening file.
OPEN_FILE_MAX_ATTEMPTS = 10
```
I'm using the common util function call_with_retry with these constants. Do you mean consolidating these sets of constants into their own file?
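For readers following the thread: below is a hedged sketch of how a call_with_retry-style helper pairs with these constants. Ray's actual call_with_retry lives in its internal utils and may differ in signature and behavior; the filesystem usage at the end is also illustrative.

```python
import random
import time

OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]
OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
OPEN_FILE_MAX_ATTEMPTS = 10

def call_with_retry_sketch(f, description, match, max_attempts, max_backoff_s):
    """Retry f() on matching transient errors with capped, jittered
    exponential backoff. A sketch, not Ray's implementation."""
    for attempt in range(max_attempts):
        try:
            return f()
        except Exception as e:
            retryable = any(pattern in str(e) for pattern in match)
            if not retryable or attempt + 1 == max_attempts:
                raise
            # Cap the exponential backoff and add jitter before retrying.
            backoff = min(2 ** attempt, max_backoff_s) * random.random()
            print(f"Retrying {description} (attempt {attempt + 1}): {e}")
            time.sleep(backoff)

# Illustrative usage for opening a file:
# stream = call_with_retry_sketch(
#     lambda: filesystem.open_input_stream(path),
#     description="open file",
#     match=OPEN_FILE_RETRY_ON_ERRORS,
#     max_attempts=OPEN_FILE_MAX_ATTEMPTS,
#     max_backoff_s=OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS,
# )
```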
OK. It might be better to also define those constants in a unified place, but it's not a big deal.
I couldn't find a centralized constants.py or other similar file for Ray Data constants. Should we expose these as parameters from DataContext?
Okay to keep it the current way. Let's refactor later if needed.
I just realized that we already have write_file_retry_on_errors in `DataContext`. Let's also follow this pattern?
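For reference, a sketch of what following that pattern could look like. Only write_file_retry_on_errors is the existing DataContext attribute mentioned above; the read-side field names are hypothetical:

```python
import ray

ctx = ray.data.DataContext.get_current()

# Existing pattern referenced in this thread: write-retry errors are
# configured on DataContext rather than via module-level constants.
ctx.write_file_retry_on_errors = ["AWS Error SLOW_DOWN"]

# The analogous read-side fields would look like this (hypothetical names,
# not an actual API at the time of this PR):
# ctx.read_file_retry_on_errors = [
#     "AWS Error NETWORK_CONNECTION",
#     "AWS Error ACCESS_DENIED",
# ]
# ctx.read_file_max_attempts = 10
# ctx.read_file_retry_max_backoff_seconds = 32
```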
```python
for read_task in blocks:
    yield from read_task()
    if read_task._metadata.input_files is not None:
        read_files_name = ",".join(read_task._metadata.input_files)
```
The input files list may be very large, which will make the error message too verbose.
Good point. Updated to just use the read fn name by default in all cases (since we have no control over the length of even one file name).
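A sketch of the resulting shape as described in this thread: the retry wrapper labels failures with the short read fn name rather than the input file list, and materializes each task's output so a retry never re-yields partial results. Names and structure are illustrative, not the PR's exact code:

```python
import random
import time

READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]
READ_FILE_MAX_ATTEMPTS = 10
READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32

def execute_read_task(read_task, read_fn_name):
    """Run a read task, retrying the whole task on transient errors."""
    for attempt in range(READ_FILE_MAX_ATTEMPTS):
        try:
            # Materialize the output so a retry never duplicates blocks that
            # were already yielded before the failure.
            return list(read_task())
        except Exception as e:
            if not any(p in str(e) for p in READ_FILE_RETRY_ON_ERRORS):
                raise
            if attempt + 1 == READ_FILE_MAX_ATTEMPTS:
                # Keep the message short: the read fn name, not the
                # potentially huge list of input files.
                raise RuntimeError(
                    f"Read task {read_fn_name!r} failed after "
                    f"{READ_FILE_MAX_ATTEMPTS} attempts"
                ) from e
            time.sleep(
                min(2 ** attempt, READ_FILE_RETRY_MAX_BACKOFF_SECONDS)
                * random.random()
            )
```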
[data] Stability & accuracy improvements for Data+Train benchmark (ray-project#42027)

Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue with accuracy going to 0. Adds AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED as exception types to retry during reads, since these can be transient errors that are fine upon retry. Other small fixes for optional parameters in the benchmark file, used for debugging purposes.

Results of a sample release test run:
- read_images_train_4_gpu: Result of case cache-none: {'time': 11964.644112934, 'tput': 429.22158930338344, 'accuracy': 0.4667895757295709, 'extra_metrics': {}}
- read_images_train_16_gpu: Result of case cache-none: {'time': 5400.357632072, 'tput': 1593.6668981608586, 'accuracy': 0.5293150227295434, 'extra_metrics': {}}
- read_images_train_16_gpu_preserve_order: Result of case cache-none: {'time': 5566.524269388, 'tput': 1571.1312653719967, 'accuracy': 0.5295374787691078, 'extra_metrics': {}}

(The difference in accuracy is because the 4-worker test only runs for 3 epochs, while the 16-worker test runs for 5 epochs, using the entire dataset per epoch.)
Cherry-pick #42027, which adds stability for read tasks.
Why are these changes needed?

- Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue with accuracy going to 0.
- Adds AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED as exception types to retry during reads, since these can be transient errors that are fine upon retry.

Results of a sample release test run are listed in the commit message above. (The difference in accuracy is because the 4-worker test only runs for 3 epochs, while the 16-worker test runs for 5 epochs, using the entire dataset per epoch.)