Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Data] Add retry for _sample_fragment during ParquetDatasource._estimate_files_encoding_ratio() #42759

Merged
merged 3 commits into from
Jan 27, 2024

Conversation

scottjlee
Copy link
Contributor

Why are these changes needed?

Sometimes, there can be transient errors during the fragment sampling phase of read_parquet(), when Ray Data estimates the file encoding ratio prior to execution. We include this fragment sampling phase into a retry loop so that we automatically retry upon failure.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
Signed-off-by: Scott Lee <[email protected]>
@scottjlee scottjlee marked this pull request as ready for review January 27, 2024 00:38
sample_fragment.options(
scheduling_strategy=scheduling,
# Retry in case of transient errors during sampling.
retry_exceptions=[OSError],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's extract this as a variable similar to RETRY_EXCEPTIONS_FOR_META_FETCH_TASK, so we can tune later.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

discussed with @scottjlee , we'll make this configurable in DataContext. Since other retry errors are not configurable either. we'll use a follow-up PR to update all.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it's okay to not put into DataContext. I don't think they are user-facing, and mostly it only should be tuned by ourselves, when debugging user job. Let's only add config to DataContext if they are expected to be set by users.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But agreed to merge this PR ASAP to unblock users.

@c21 c21 self-assigned this Jan 27, 2024
@raulchen raulchen merged commit 3627e94 into ray-project:master Jan 27, 2024
9 checks passed
scottjlee added a commit to scottjlee/ray that referenced this pull request Jan 27, 2024
@scottjlee scottjlee mentioned this pull request Jan 27, 2024
8 tasks
architkulkarni pushed a commit that referenced this pull request Jan 30, 2024
…timate_files_encoding_ratio()` (#42759) (#42774)

Cherry pick #42759, which improves stability during the file sampling phase of read_parquet()
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants