-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Add writer for TFRecords. #29448
Conversation
e99ce12
to
975f9f5
Compare
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
975f9f5
to
ba1cfb7
Compare
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @xcharleslin! This looks great, the overall approach looks good to me. Just have some minor comments. Also double checked other implementation like Huggingface dataset, we are using similar logic here. So we should be good.
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Signed-off-by: Xiayue Charles Lin <[email protected]>
Looks great! Seems some CI documentation failure - https://buildkite.com/ray-project/oss-ci-build-pr/builds/3424#01841ba4-6cc5-461c-9e14-7a0cbaaf1d8c, o.w. the PR looks good to me.
|
Signed-off-by: Xiayue Charles Lin <[email protected]>
Args: | ||
path: The path to the destination root directory, where tfrecords | ||
files will be written to. | ||
filesystem: The filesystem implementation to write to. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The filesystem
is null-able. Can you document what's the behavior if it's None?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me investigate what the behaviour would be. The existing methods write_csv, _parquet, _json, _numpy already have this same signature and docstring for filesystem
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like all these file based datasources will resolve it using pyarrow.fs._resolve_filesystem_and_path
.
I don't have the full context to understand under what conditions a None filesystem is resolvable here - @clarkzinzow would you know?
If you'd like, I can add a comment to write_csv, write_parquet, write_json, write_numpy, write_tfrecords that a None filesystem will be resolved by pyarrow.fs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's inferred from provided files.
Right, this seems a stretch since all other APIs do not have a mention about what to expect if it's None. Thank you Charles!
Signed-off-by: Xiayue Charles Lin <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks @xcharleslin!
LGTM, cc @clarkzinzow do you have any more comments? Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, awesome work @xcharleslin!
All failing tests are unrelated and are already flaky in master, merging! |
This PR enables users to write TFRecords from datasets. In particular, the master branch already includes an API for reading TFRecords from datasets. Users have requested the ability to write these datasets back to TFRecords. Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
This PR enables users to write TFRecords from datasets.
In particular, the master branch already includes an API for reading TFRecords from datasets. Users have requested the ability to write these datasets back to TFRecords.
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.