[Datasets] Add writer for TFRecords. #29448

Merged
merged 32 commits into from
Nov 2, 2022

Conversation

Contributor

@xcharleslin xcharleslin commented Oct 19, 2022

Why are these changes needed?

This PR enables users to write TFRecords from datasets.

In particular, the master branch already includes an API for reading TFRecords from datasets. Users have requested the ability to write these datasets back to TFRecords.
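The description does not show how records are laid out on disk. For orientation, here is a minimal stdlib-only sketch of TFRecord framing as described in TensorFlow's documentation: each record is a little-endian uint64 length, a masked CRC-32C of that length, the payload bytes, and a masked CRC-32C of the payload. The function names are illustrative and are not taken from this PR. In practice the payload would be a serialized tf.train.Example; plain bytes are used here as a stand-in.

```python
import struct

# CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
_CRC32C_TABLE = []
for _byte in range(256):
    _crc = _byte
    for _ in range(8):
        _crc = (_crc >> 1) ^ 0x82F63B78 if _crc & 1 else _crc >> 1
    _CRC32C_TABLE.append(_crc)


def crc32c(data: bytes) -> int:
    """Plain CRC-32C of `data` (slow; for illustration only)."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _CRC32C_TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def masked_crc(data: bytes) -> int:
    """CRC-32C with the rotate-and-add mask used by the TFRecord format."""
    crc = crc32c(data)
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def frame_record(payload: bytes) -> bytes:
    """Wrap one serialized record in TFRecord framing."""
    length = struct.pack("<Q", len(payload))
    return (
        length
        + struct.pack("<I", masked_crc(length))
        + payload
        + struct.pack("<I", masked_crc(payload))
    )


def read_records(buf: bytes):
    """Yield payloads from concatenated framed records, checking both CRCs."""
    offset = 0
    while offset < len(buf):
        length_bytes = buf[offset:offset + 8]
        (length,) = struct.unpack("<Q", length_bytes)
        (length_crc,) = struct.unpack("<I", buf[offset + 8:offset + 12])
        payload = buf[offset + 12:offset + 12 + length]
        (payload_crc,) = struct.unpack(
            "<I", buf[offset + 12 + length:offset + 16 + length]
        )
        if length_crc != masked_crc(length_bytes):
            raise ValueError("corrupt length field")
        if payload_crc != masked_crc(payload):
            raise ValueError("corrupt payload")
        yield payload
        offset += 16 + length
```

Framing adds a fixed 16 bytes per record, so concatenating the output of frame_record for each row gives a file body that read_records can walk back through.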

Related issue number

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@xcharleslin xcharleslin force-pushed the xcharleslin/tfrecordwriter branch 4 times, most recently from e99ce12 to 975f9f5 Compare October 21, 2022 23:33
Xiayue Charles Lin added 13 commits October 21, 2022 16:33
Signed-off-by: Xiayue Charles Lin <[email protected]>
Contributor

@c21 c21 left a comment

Thanks @xcharleslin! This looks great, and the overall approach looks good to me. I just have some minor comments. I also double-checked other implementations such as Hugging Face datasets; we are using similar logic here, so we should be good.

python/ray/data/dataset.py (outdated, resolved)
python/ray/data/tests/test_dataset_formats.py (outdated, resolved)
python/ray/data/dataset.py (outdated, resolved)
@c21 c21 assigned c21 and jianoaix Oct 26, 2022
Xiayue Charles Lin added 9 commits October 27, 2022 10:39
Signed-off-by: Xiayue Charles Lin <[email protected]>
Contributor

c21 commented Oct 27, 2022

Looks great! There seems to be a CI documentation failure: https://buildkite.com/ray-project/oss-ci-build-pr/builds/3424#01841ba4-6cc5-461c-9e14-7a0cbaaf1d8c. Otherwise the PR looks good to me.

/ray/doc/source/data/consuming-datasets.rst:120: WARNING: start-after pattern not found: __write_tfrecords_begin__
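A warning like this usually means the file pulled in by a literalinclude directive does not contain the marker string named in its start-after option. The sketch below imitates that extraction in plain Python to show what the marker comments are for. It is a simplified stand-in, not Sphinx's code: the end marker `__write_tfrecords_end__`, the sample snippet, and the function name are assumptions, and this version raises where Sphinx emits a warning.

```python
# Hypothetical doc-code file: the marker comments delimit the part shown in the docs.
EXAMPLE = '''\
import ray
# __write_tfrecords_begin__
ds.write_tfrecords("/tmp/tfrecords")
# __write_tfrecords_end__
'''


def extract_between(source: str, start_after: str, end_before: str) -> str:
    """Return the lines strictly between the first lines containing each marker."""
    lines = source.splitlines()
    start = next((i for i, line in enumerate(lines) if start_after in line), None)
    if start is None:
        raise ValueError(f"start-after pattern not found: {start_after}")
    body = lines[start + 1:]
    end = next((i for i, line in enumerate(body) if end_before in line), None)
    if end is None:
        raise ValueError(f"end-before pattern not found: {end_before}")
    return "\n".join(body[:end])
```

With both markers present, only the line between them is returned; removing the begin marker from the file reproduces the "pattern not found" failure mode.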

Signed-off-by: Xiayue Charles Lin <[email protected]>
Args:
    path: The path to the destination root directory, where tfrecords
        files will be written to.
    filesystem: The filesystem implementation to write to.
Contributor

The filesystem is nullable. Can you document what the behavior is if it's None?

Contributor Author

@xcharleslin xcharleslin Oct 28, 2022

Let me investigate what the behaviour would be. The existing methods write_csv, write_parquet, write_json, and write_numpy already have this same signature and docstring for filesystem.

Contributor Author

@xcharleslin xcharleslin Oct 28, 2022

Looks like all these file based datasources will resolve it using pyarrow.fs._resolve_filesystem_and_path.

I don't have the full context to understand under what conditions a None filesystem is resolvable here - @clarkzinzow would you know?
If you'd like, I can add a comment to write_csv, write_parquet, write_json, write_numpy, write_tfrecords that a None filesystem will be resolved by pyarrow.fs.

Contributor

Yeah, it's inferred from provided files.

Right, documenting it here seems a stretch, since none of the other APIs mention what to expect if it's None. Thank you Charles!
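To illustrate what "inferred from provided files" can mean, here is a stdlib-only sketch that picks a filesystem from the path's URI scheme when none is passed. This is not pyarrow's resolution logic: the function name, the scheme table, and the string stand-ins for filesystem objects are all assumptions made for the example.

```python
from urllib.parse import urlparse

# Hypothetical scheme-to-filesystem table; real resolvers support more schemes.
_KNOWN_SCHEMES = {"s3": "s3", "gs": "gcs", "hdfs": "hdfs"}


def resolve_filesystem(path: str, filesystem=None):
    """Return (filesystem, path), inferring the filesystem when it is None."""
    if filesystem is not None:
        # An explicit filesystem always wins; the path is passed through as-is.
        return filesystem, path
    parsed = urlparse(path)
    if parsed.scheme in ("", "file"):
        # Bare paths and file:// URIs are treated as local.
        return "local", parsed.path
    if parsed.scheme not in _KNOWN_SCHEMES:
        raise ValueError(f"cannot infer a filesystem for scheme {parsed.scheme!r}")
    # For remote schemes, keep the bucket/host as part of the returned path.
    return _KNOWN_SCHEMES[parsed.scheme], parsed.netloc + parsed.path
```

Under these assumptions, a caller who passes only a path such as "s3://bucket/out" gets a remote filesystem without naming one, while a bare "/tmp/out" falls back to the local disk.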

Contributor

@c21 c21 left a comment

LGTM, thanks @xcharleslin!

Contributor

c21 commented Oct 31, 2022

LGTM, cc @clarkzinzow do you have any more comments? Thanks.
@xcharleslin - please also rebase onto the latest master; there's a file conflict with master.

Contributor

@clarkzinzow clarkzinzow left a comment

LGTM, awesome work @xcharleslin!

Contributor

All failing tests are unrelated and are already flaky in master, merging!

@clarkzinzow clarkzinzow merged commit 9fab504 into master Nov 2, 2022
@clarkzinzow clarkzinzow deleted the xcharleslin/tfrecordwriter branch November 2, 2022 04:04
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Signed-off-by: Weichen Xu <[email protected]>