[Datasets] Update `read_tfrecords` example #28743

bveeramani · 2022-09-23T19:14:03Z

Signed-off-by: Balaji Veeramani [email protected]

Why are these changes needed?

See #28430 (comment).

Related issue number

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

clarkzinzow

Nice! Although am I reading that right, that the iris.tfrecords file is 21 KB?

bveeramani · 2022-09-23T19:22:32Z

> Nice! Although am I reading that right, that the iris.tfrecords file is 21 KB?

Yeah, you're right. Didn't check.

❯ ls python/ray/data/examples/data -la
.rw-r--r--  88k bveeramani 19 Sep 14:12 dow_jones.csv
drwxr-xr-x    - bveeramani 19 Sep 20:01 image-folders
.rw-r--r-- 4.0k bveeramani 19 Sep 14:12 iris.csv
.rw-r--r--  15k bveeramani 19 Sep 14:12 iris.json
.rw-r--r-- 3.1k bveeramani 19 Sep 14:12 iris.parquet
.rw-r--r--  22k bveeramani 23 Sep 14:07 iris.tfrecords
.rw-r--r-- 3.9k bveeramani 19 Sep 14:12 iris.tsv
.rw-r--r-- 2.5k bveeramani 19 Sep 14:12 mnist_subset.npy
drwxr-xr-x    - bveeramani 19 Sep 14:12 mnist_subset_partitioned
drwxr-xr-x    - bveeramani 19 Sep 14:12 parquet_images_mini
.rw-r--r-- 1.2k bveeramani 19 Sep 14:12 sms_spam_collection_subset.txt

To get the dataset to under 1KB, I'd have to truncate the dataset to 6 samples. Should I do that, or is it okay as-is?

clarkzinzow · 2022-09-23T19:24:39Z

I think it's ok as is, but does the current iris.tfrecords file have the same number of samples as e.g. the iris.csv file? I'd be surprised if so, I'd assume that the binary format would be smaller than the plain-text format.

bveeramani · 2022-09-23T19:27:00Z

> I think it's ok as is, but does the current iris.tfrecords file have the same number of samples as e.g. the iris.csv file? I'd be surprised if so, I'd assume that the binary format would be smaller than the plain-text format.

Yeah, iris.tfrecords is created from iris.csv. Both contain 150 samples.
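
For reference, the 6-sample estimate from the earlier comment is consistent with these numbers. A rough check, assuming the `22k` and `4.0k` in the listing mean roughly 22,000 and 4,000 bytes and that all records are about the same size (both are assumptions, since the listing is rounded):

```python
# Back-of-the-envelope sizes taken from the rounded directory listing.
# Assumption: "k" means 1000 bytes and records are roughly uniform in size.
TFRECORDS_BYTES = 22_000  # iris.tfrecords, listed as 22k
CSV_BYTES = 4_000         # iris.csv, listed as 4.0k
NUM_SAMPLES = 150         # both files contain 150 samples
BUDGET_BYTES = 1_000      # the "under 1KB" target

tfrecord_bytes_per_sample = TFRECORDS_BYTES / NUM_SAMPLES
csv_bytes_per_sample = CSV_BYTES / NUM_SAMPLES

# Largest whole number of TFRecord samples that fits in the budget.
max_samples_in_budget = int(BUDGET_BYTES // tfrecord_bytes_per_sample)

print(f"TFRecord: ~{tfrecord_bytes_per_sample:.0f} bytes per sample")
print(f"CSV:      ~{csv_bytes_per_sample:.0f} bytes per sample")
print(f"Samples that fit in {BUDGET_BYTES} bytes: {max_samples_in_budget}")
```

So each sample costs several times more bytes in the TFRecords file than in the CSV, and only about 6 samples fit in 1 KB.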

clarkzinzow · 2022-09-23T19:48:45Z

Weird 🤔 Ok I think keeping it as-is sounds good! 22 KB isn't so bad.
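
One likely contributor to the gap (not verified against this particular file) is per-record overhead. The TFRecord format frames every record with an 8-byte length, a 4-byte checksum of the length, and a 4-byte checksum of the payload, and a `tf.train.Example` payload repeats the feature names in every record. A minimal sketch of the framing cost alone, where `frame_record` is a hypothetical helper and the checksums are zeroed placeholders (the real format uses masked CRC32C values, so this output is not a valid TFRecords file):

```python
import struct

NUM_SAMPLES = 150  # number of samples reported in this thread


def frame_record(payload: bytes) -> bytes:
    """Wrap a payload in TFRecord-style framing with placeholder checksums."""
    length = struct.pack("<Q", len(payload))  # 8-byte little-endian payload length
    length_crc = struct.pack("<I", 0)         # placeholder for the length checksum
    payload_crc = struct.pack("<I", 0)        # placeholder for the payload checksum
    return length + length_crc + payload + payload_crc


payload = b"x" * 100  # stand-in for one serialized example
framed = frame_record(payload)

overhead_per_record = len(framed) - len(payload)
total_framing_overhead = overhead_per_record * NUM_SAMPLES

print(f"Framing overhead per record: {overhead_per_record} bytes")
print(f"Framing overhead for {NUM_SAMPLES} records: {total_framing_overhead} bytes")
```

Framing accounts for only a few kilobytes of the 22 KB, so under these assumptions most of the remaining size would come from the payload encoding itself, such as the repeated feature names.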

…#28743) Adds a small TFRecords file to repo and updates read_tfrecords example to read it using our example:// protocol. Signed-off-by: Weichen Xu <[email protected]>

Initial commit

5af7c30

bveeramani requested review from ericl, scv119, clarkzinzow, jjyao, jianoaix, maxpumperla, c21 and a team as code owners September 23, 2022 19:14

bveeramani assigned clarkzinzow Sep 23, 2022

clarkzinzow approved these changes Sep 23, 2022


clarkzinzow merged commit 7bc265c into ray-project:master Sep 27, 2022
