[data] Stability & accuracy improvements for Data+Train benchmark #42027

Merged: 13 commits, Jan 9, 2024

Conversation

@Zandew (Contributor) commented Dec 19, 2023

Why are these changes needed?

  • Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue of accuracy dropping to 0.
  • Adds AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED to the errors that trigger a retry during reads, since these can be transient and succeed on retry.
  • Other small fixes for optional parameters in the benchmark file, used for debugging purposes.
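
As a rough illustration of the first bullet: when image files are listed in directory order they are often grouped by class, so early training batches can contain a single label. The sketch below shuffles a file list with a seeded RNG before it is handed to a reader. The paths and the `shuffle_paths` helper are hypothetical and are not taken from the benchmark code.

```python
import random
from typing import List, Optional


def shuffle_paths(paths: List[str], seed: Optional[int] = None) -> List[str]:
    """Return a shuffled copy of `paths`; the input list is left untouched."""
    rng = random.Random(seed)  # private RNG, so global random state is not affected
    shuffled = list(paths)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical paths, grouped by class the way a sorted directory listing returns them.
paths = [
    f"s3://example-bucket/train/class_{c}/img_{i}.jpg"
    for c in range(3)
    for i in range(4)
]
shuffled = shuffle_paths(paths, seed=0)
```

Passing a fixed seed keeps the order reproducible across runs, which helps when comparing accuracy between benchmark runs.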

Results of sample release test run:

  • read_images_train_4_gpu:
Result of case cache-none: {'time': 11964.644112934, 'tput': 429.22158930338344, 'accuracy': 0.4667895757295709, 'extra_metrics': {}}
  • read_images_train_16_gpu:
Result of case cache-none: {'time': 5400.357632072, 'tput': 1593.6668981608586, 'accuracy': 0.5293150227295434, 'extra_metrics': {}}
  • read_images_train_16_gpu_preserve_order:
Result of case cache-none: {'time': 5566.524269388, 'tput': 1571.1312653719967, 'accuracy': 0.5295374787691078, 'extra_metrics': {}}

(The difference in accuracy is because the 4-worker test runs for only 3 epochs, while the 16-worker test runs for 5 epochs; both use the entire dataset per epoch.)

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Andrew Xue <[email protected]>
@Zandew changed the title from "test" to "[data] shuffle input files for train benchmark" on Dec 21, 2023
@scottjlee marked this pull request as ready for review on January 3, 2024 01:25
@scottjlee changed the title from "[data] shuffle input files for train benchmark" to "[data] Stability & accuracy improvements for Data+Train benchmark" on Jan 3, 2024
# Transient errors that can occur during longer reads. Trigger retry when these occur.
READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]
READ_FILE_MAX_ATTEMPTS = 10
READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32

This has been repeated multiple times in the code base. can we consolidate them with a util function?

@raulchen i intended to keep these constants separate from the retry errors for opening files, e.g.

# The errors to retry for opening file.
OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]
# The max retry backoff in seconds for opening file.
OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
# The max number of attempts for opening file.
OPEN_FILE_MAX_ATTEMPTS = 10

i'm using the common util function call_with_retry with these constants. do you mean consolidate these sets of constants into their own file?
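
For readers unfamiliar with the pattern, here is a minimal sketch of how constants like these can drive a retry loop with capped exponential backoff. `retry_on_matching_errors` is a hypothetical helper written for this illustration; it is not the `call_with_retry` utility mentioned above, whose signature may differ. The constant values are copied from the snippet under review.

```python
import random
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")

# Values copied from the constants quoted in this thread.
READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]
READ_FILE_MAX_ATTEMPTS = 10
READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32


def retry_on_matching_errors(
    fn: Callable[[], T],
    match: List[str],
    max_attempts: int,
    max_backoff_s: float,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Hypothetical helper: call `fn`, retrying when the error text contains a pattern."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as e:
            retryable = any(pattern in str(e) for pattern in match)
            if not retryable or attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at max_backoff_s, scaled by random jitter.
            backoff = min(2 ** (attempt + 1), max_backoff_s) * random.random()
            sleep(backoff)
    raise ValueError("max_attempts must be at least 1")


# Usage: a fake read that fails twice with a transient error, then succeeds.
calls = {"n": 0}


def flaky_read() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("AWS Error NETWORK_CONNECTION during read")
    return b"ok"


result = retry_on_matching_errors(
    flaky_read,
    match=READ_FILE_RETRY_ON_ERRORS,
    max_attempts=READ_FILE_MAX_ATTEMPTS,
    max_backoff_s=READ_FILE_RETRY_MAX_BACKOFF_SECONDS,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
```

Errors whose text matches none of the patterns are re-raised immediately, so only the listed transient failures pay the backoff cost.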

ok. might be better to also define those constants in a unified place. but not a big deal.

i couldn't find a centralized constants.py or other similar file for ray data constants. should we expose these as parameters from DataContext?

okay to keep it the current way. let's refactor later if needed.

I just realized that we already have write_file_retry_on_errors in DataContext. Let's also follow this pattern?
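
One way to read this suggestion, sketched under assumptions: keep the retryable error list as a field on a process-wide settings object instead of a module-level constant, so users can extend it at runtime. `ExampleContext` and `get_current` below are hypothetical stand-ins written for this illustration, and the field name `read_file_retry_on_errors` simply mirrors the naming of `write_file_retry_on_errors`; none of this is Ray's actual `DataContext` code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleContext:
    """Hypothetical stand-in for a process-wide settings object."""

    # Defaults copied from the constants under review.
    read_file_retry_on_errors: List[str] = field(
        default_factory=lambda: [
            "AWS Error NETWORK_CONNECTION",
            "AWS Error ACCESS_DENIED",
        ]
    )
    read_file_max_attempts: int = 10
    read_file_retry_max_backoff_seconds: int = 32


_current: Optional[ExampleContext] = None


def get_current() -> ExampleContext:
    """Return the shared context, creating it on first use."""
    global _current
    if _current is None:
        _current = ExampleContext()
    return _current


# Usage: a caller opts in to retrying one more error string.
ctx = get_current()
ctx.read_file_retry_on_errors.append("AWS Error SLOW_DOWN")
```

Using `default_factory` gives every new context its own list, so mutating the shared instance does not leak into freshly constructed ones.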

for read_task in blocks:
    yield from read_task()
    if read_task._metadata.input_files is not None:
        read_files_name = ",".join(read_task._metadata.input_files)

the input files list may be very large, and will make the error message too verbose.

good point, updated to just use read fn name by default for all cases (since we have no control over the length of even one file name)
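
A minimal sketch of that idea, assuming the goal is a short, bounded label in the error message: describe the failing read by the name of the callable rather than by its input files. `read_fn_name` and `run_read` are hypothetical helpers for illustration only, not the code that was merged.

```python
import functools
from typing import Any, Callable


def read_fn_name(read_fn: Callable[..., Any]) -> str:
    """Short label for a callable; falls back to the type name (e.g. for partials)."""
    return getattr(read_fn, "__name__", type(read_fn).__name__)


def run_read(read_fn: Callable[[], Any]) -> Any:
    """Run a read and wrap any failure with the callable's name, not its file list."""
    try:
        return read_fn()
    except Exception as e:
        raise RuntimeError(f"Read task failed in {read_fn_name(read_fn)}") from e


def read_image_chunk() -> bytes:
    # Stand-in for a read that hits a transient storage error.
    raise OSError("AWS Error NETWORK_CONNECTION")


try:
    run_read(read_image_chunk)
except RuntimeError as err:
    message = str(err)

# functools.partial objects have no __name__, so the type name is used instead.
partial_label = read_fn_name(functools.partial(read_image_chunk))
```

The original exception stays attached through `raise ... from e`, so the underlying cause remains available in the traceback even though the message itself is short.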

@stephanie-wang stephanie-wang merged commit b87ed2c into ray-project:master Jan 9, 2024
9 checks passed
vickytsang pushed a commit to ROCm/ray that referenced this pull request Jan 12, 2024
…y-project#42027)



Signed-off-by: Andrew Xue <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
scottjlee added a commit to scottjlee/ray that referenced this pull request Jan 27, 2024
…y-project#42027)


Signed-off-by: Andrew Xue <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
Co-authored-by: Scott Lee <[email protected]>
@scottjlee scottjlee mentioned this pull request Jan 27, 2024
architkulkarni pushed a commit that referenced this pull request Jan 29, 2024
Cherry-pick #42027, which adds stability for read tasks.

---------

Signed-off-by: Andrew Xue <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Co-authored-by: Andrew Xue <[email protected]>
4 participants