[data] Stability & accuracy improvements for Data+Train benchmark #42027
Conversation
```python
# Transient errors that can occur during longer reads. Trigger retry when these occur.
READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]
READ_FILE_MAX_ATTEMPTS = 10
READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
```
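For context, entries in `READ_FILE_RETRY_ON_ERRORS` are matched as substrings of the raised exception's message to decide whether a retry is safe. A minimal sketch of that check (the helper name is illustrative, not from this PR):

```python
# Sketch only: substring-match an exception message against the retryable
# error list, the matching style implied by READ_FILE_RETRY_ON_ERRORS.
READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]

def is_retryable(exc: Exception) -> bool:
    message = str(exc)
    return any(pattern in message for pattern in READ_FILE_RETRY_ON_ERRORS)

# Example: an error whose message embeds the S3 error code would be retried.
assert is_retryable(OSError("AWS Error NETWORK_CONNECTION during GetObject"))
```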
This has been repeated multiple times in the codebase. Can we consolidate them with a util function?
@raulchen I intended to keep these constants separate from the retry errors for opening files, e.g.
ray/python/ray/data/datasource/file_based_datasource.py
Lines 67 to 74 in bfc1f78
```python
# The errors to retry for opening file.
OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]
# The max retry backoff in seconds for opening file.
OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
# The max number of attempts for opening file.
OPEN_FILE_MAX_ATTEMPTS = 10
```
I'm using the common util function call_with_retry with these constants. Do you mean consolidating these sets of constants into their own file?
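For readers following the thread: below is a hedged sketch of how a call_with_retry-style helper pairs with these constants. Ray's actual call_with_retry lives in its internal utils and may differ in signature and behavior; the filesystem usage at the end is also illustrative.

```python
import random
import time

OPEN_FILE_RETRY_ON_ERRORS = ["AWS Error SLOW_DOWN"]
OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS = 32
OPEN_FILE_MAX_ATTEMPTS = 10

def call_with_retry_sketch(f, description, match, max_attempts, max_backoff_s):
    """Retry f() on matching transient errors with capped, jittered
    exponential backoff. A sketch, not Ray's implementation."""
    for attempt in range(max_attempts):
        try:
            return f()
        except Exception as e:
            retryable = any(pattern in str(e) for pattern in match)
            if not retryable or attempt + 1 == max_attempts:
                raise
            # Cap the exponential backoff and add jitter before retrying.
            backoff = min(2 ** attempt, max_backoff_s) * random.random()
            print(f"Retrying {description} (attempt {attempt + 1}): {e}")
            time.sleep(backoff)

# Illustrative usage for opening a file:
# stream = call_with_retry_sketch(
#     lambda: filesystem.open_input_stream(path),
#     description="open file",
#     match=OPEN_FILE_RETRY_ON_ERRORS,
#     max_attempts=OPEN_FILE_MAX_ATTEMPTS,
#     max_backoff_s=OPEN_FILE_RETRY_MAX_BACKOFF_SECONDS,
# )
```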
OK. It might be better to also define those constants in a unified place, but it's not a big deal.
I couldn't find a centralized constants.py or other similar file for Ray Data constants. Should we expose these as parameters from DataContext?
Okay to keep it the current way. Let's refactor later if needed.
I just realized that we already have write_file_retry_on_errors in `DataContext`. Let's also follow this pattern?
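For reference, a sketch of what following that pattern could look like. Only write_file_retry_on_errors is the existing DataContext attribute mentioned above; the read-side field names are hypothetical:

```python
import ray

ctx = ray.data.DataContext.get_current()

# Existing pattern referenced in this thread: write-retry errors are
# configured on DataContext rather than via module-level constants.
ctx.write_file_retry_on_errors = ["AWS Error SLOW_DOWN"]

# The analogous read-side fields would look like this (hypothetical names,
# not an actual API at the time of this PR):
# ctx.read_file_retry_on_errors = [
#     "AWS Error NETWORK_CONNECTION",
#     "AWS Error ACCESS_DENIED",
# ]
# ctx.read_file_max_attempts = 10
# ctx.read_file_retry_max_backoff_seconds = 32
```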
```python
for read_task in blocks:
    yield from read_task()
    if read_task._metadata.input_files is not None:
        read_files_name = ",".join(read_task._metadata.input_files)
```
The input files list may be very large, which will make the error message too verbose.
Good point. Updated to just use the read fn name by default in all cases (since we have no control over the length of even one file name).
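A sketch of the resulting shape as described in this thread: the retry wrapper labels failures with the short read fn name rather than the input file list, and materializes each task's output so a retry never re-yields partial results. Names and structure are illustrative, not the PR's exact code:

```python
import random
import time

READ_FILE_RETRY_ON_ERRORS = ["AWS Error NETWORK_CONNECTION", "AWS Error ACCESS_DENIED"]
READ_FILE_MAX_ATTEMPTS = 10
READ_FILE_RETRY_MAX_BACKOFF_SECONDS = 32

def execute_read_task(read_task, read_fn_name):
    """Run a read task, retrying the whole task on transient errors."""
    for attempt in range(READ_FILE_MAX_ATTEMPTS):
        try:
            # Materialize the output so a retry never duplicates blocks that
            # were already yielded before the failure.
            return list(read_task())
        except Exception as e:
            if not any(p in str(e) for p in READ_FILE_RETRY_ON_ERRORS):
                raise
            if attempt + 1 == READ_FILE_MAX_ATTEMPTS:
                # Keep the message short: the read fn name, not the
                # potentially huge list of input files.
                raise RuntimeError(
                    f"Read task {read_fn_name!r} failed after "
                    f"{READ_FILE_MAX_ATTEMPTS} attempts"
                ) from e
            time.sleep(
                min(2 ** attempt, READ_FILE_RETRY_MAX_BACKOFF_SECONDS)
                * random.random()
            )
```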
[data] Stability & accuracy improvements for Data+Train benchmark (ray-project#42027)

Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue with accuracy going to 0. Adds AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED as exception types to retry during reads, since these can be transient errors that are fine upon retry. Other small fixes for optional parameters in the benchmark file, used for debugging purposes.

Results of a sample release test run:
- read_images_train_4_gpu: Result of case cache-none: {'time': 11964.644112934, 'tput': 429.22158930338344, 'accuracy': 0.4667895757295709, 'extra_metrics': {}}
- read_images_train_16_gpu: Result of case cache-none: {'time': 5400.357632072, 'tput': 1593.6668981608586, 'accuracy': 0.5293150227295434, 'extra_metrics': {}}
- read_images_train_16_gpu_preserve_order: Result of case cache-none: {'time': 5566.524269388, 'tput': 1571.1312653719967, 'accuracy': 0.5295374787691078, 'extra_metrics': {}}

(The difference in accuracy is because the 4-worker test only runs for 3 epochs, while the 16-worker test runs for 5 epochs, using the entire dataset per epoch.)
Cherry-pick #42027, which adds stability for read tasks.
Why are these changes needed?

- Shuffles input images for the read_images_train_4_gpu release test, which fixes the issue with accuracy going to 0.
- Adds AWS Error NETWORK_CONNECTION and AWS Error ACCESS_DENIED as exception types to retry during reads, since these can be transient errors that are fine upon retry.

Results of a sample release test run are listed in the commit message above. (The difference in accuracy is because the 4-worker test only runs for 3 epochs, while the 16-worker test runs for 5 epochs, using the entire dataset per epoch.)