[Datasets] Workaround for unserializable Arrow JSON ReadOptions. #25821

clarkzinzow
Contributor

pyarrow.json.ReadOptions are not picklable until Arrow 8.0.0, which we do not yet support. This PR adds a custom serializer for this type and ensures that said serializer is registered before each Ray task submission.
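A minimal sketch of the idea, using only the standard library: reduce the options object to a tuple of plain values that pickle can handle, then rebuild an equivalent object on the receiving side. `FakeReadOptions` is a hypothetical stand-in for `pyarrow.json.ReadOptions` (it only mirrors the `use_threads` and `block_size` fields), and the function names are illustrative, not the ones used in this PR.

```python
import pickle


class FakeReadOptions:
    """Hypothetical stand-in for pyarrow.json.ReadOptions; not the real class."""

    def __init__(self, use_threads=True, block_size=1 << 20):
        self.use_threads = use_threads
        self.block_size = block_size

    def __reduce_ex__(self, protocol):
        # Mimic an extension type that ships without pickle support.
        raise TypeError("cannot pickle 'FakeReadOptions' object")


def serialize_read_options(opts):
    """Reduce the options object to a tuple of plain, picklable values."""
    return (opts.use_threads, opts.block_size)


def deserialize_read_options(payload):
    """Rebuild an equivalent options object from the tuple."""
    use_threads, block_size = payload
    return FakeReadOptions(use_threads=use_threads, block_size=block_size)


def round_trip(opts):
    """Send only the plain payload through pickle, then rebuild the object."""
    wire = pickle.dumps(serialize_read_options(opts))
    return deserialize_read_options(pickle.loads(wire))
```

With Ray, a pair like this could be passed to `ray.util.register_serializer(cls, serializer=..., deserializer=...)`. Whether the PR wires it up exactly this way is an assumption, since the diff is not shown in this conversation.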

Related issue number

Closes #24966

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@clarkzinzow clarkzinzow force-pushed the datasets/fix/unserializable-json-readoptions branch from b974c72 to ef8d8d6 Compare June 16, 2022 15:49
@clarkzinzow clarkzinzow force-pushed the datasets/fix/unserializable-json-readoptions branch from ef8d8d6 to f07441d Compare June 16, 2022 17:06

try:
    import pyarrow.json as pajson
except ModuleNotFoundError:
Contributor

Should we do this import guard in other files too?

Contributor Author

Only needed here, since this runs at ray.data import time; all other such imports should fail, since Arrow is a hard requirement for those code paths.
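The guard pattern discussed here can be sketched roughly as follows: code that runs at package import time treats the optional dependency as absent rather than failing. The helper names `try_import` and `register_if_available` are hypothetical and do not come from the PR.

```python
import importlib


def try_import(module_name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        return None


def register_if_available(module_name, register):
    """Call register(module) only when the optional module can be imported.

    Returns True if registration ran and False if the module was missing,
    so that importing the surrounding package never fails on its account.
    """
    module = try_import(module_name)
    if module is None:
        return False
    register(module)
    return True
```

Code paths that actually need Arrow would skip a guard like this and import the module directly, so a missing dependency surfaces as an error there.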

Contributor

@ericl ericl left a comment

One question on import safety

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jun 16, 2022
@clarkzinzow clarkzinzow merged commit e111b17 into ray-project:master Jun 17, 2022
clarkzinzow pushed a commit that referenced this pull request Aug 18, 2022
…json (#27911)

This PR adds a custom serializer for Arrow JSON ParseOptions in read_json. A user wanted to read JSON files with ParseOptions, but this currently fails because of a pickling issue (details in the post). As a workaround for now, we add a custom serializer for ParseOptions, similar to #25821.

Signed-off-by: Cheng Su <[email protected]>
Successfully merging this pull request may close these issues.

[Datasets] arrow.json.ReadOptions is not serializable