[Datasets] Add FAQ to Datasets docs. #24932

clarkzinzow · 2022-05-18T19:45:02Z

This PR adds a FAQ to Datasets docs.

Docs preview: https://ray--24932.org.readthedocs.build/en/24932/

Checks

- [x] I've run scripts/format.sh to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

clarkzinzow · 2022-05-19T00:07:41Z

@ericl @jianoaix I added a first pass at each of the FAQ entries. I haven't proofread yet or added external links, but I thought I'd ping y'all for an early review.

ericl

I made some minor edits; there also seem to be some broken link refs.

Main comment is that Torch / TF datasets need to be trimmed down. Keep it focused on the top 2-3 bullets to avoid overwhelming the reader.

Rest of sections look good at first pass.

clarkzinzow · 2022-05-19T00:56:37Z

@ericl Please don't push to this PR while I'm actively working on it without letting me know in advance. I just accidentally overwrote your changes. Can you rebase your changes and push them back up?

ericl · 2022-05-19T01:03:27Z

I just used the GitHub edit feature so that's lost forever... consider not force pushing in the future.

ericl · 2022-05-19T01:07:21Z

Here are the edits I had previously. Most are correcting for a more neutral tone suitable for technical documentation / marketing material:

We're happy to say that we know of several Ray Datasets users that are running their
Datasets integrations in production, both via OSS Ray and Anyscale. We give a few
notable integrations below:

Rewrite for tone: "To give an idea of Datasets use cases, we list a few notable users running Dataset integrations in production below:"

Ray Datasets has great integration
with these frameworks, allowing for efficient exchange of distributed data partitions
often involving no data movement or even data copying!

"Datasets integrates with these frameworks, allowing for efficient exchange of distributed data partitions often with zero-copy."

Below, we briefly summarize where we think Ray Datasets is different
and/or adds value.

"Below, we summarize some advantages Datasets offers over these more specific ingest frameworks."

Ray Datasets doesn't (yet) have a featureful query compiler, so some manual performance

"Datasets doesn't perform query optimization, so some manual performance"

clarkzinzow · 2022-05-19T01:16:25Z

@ericl I made those tone changes.

As for trimming down the Torch/TensorFlow datasets comparisons, do you have any opinions on which bullets should be eliminated or merged? IMO each is an important differentiator; most are pain points we've heard from users.

ericl · 2022-05-19T03:32:59Z

@clarkzinzow regarding the bullets, I'd think of it as separating "key advantages" vs detailed feature scorecard. For example, for the TF datasets you can sum up the key advantages as follows (maybe with a bit more linking/elaboration as you have):

- Datasets is framework-agnostic and portable between different distributed training frameworks.
- Datasets unifies single and multi-node training. In comparison, TensorFlow datasets presents different concepts and prevents code from being seamlessly scaled to larger clusters.
- Datasets is more general: it can handle general distributed operations, including global per-epoch shuffling, which would otherwise have to be implemented by stitching together two separate systems.
- Datasets is lower overhead: it supports zero-copy exchange between processes, in contrast to the multi-processing based pipelines of TensorFlow datasets.

And then have a separate table that does the detailed comparison. This will be much easier for the reader to grok. Even without a table you could have a separate expandable section for the more detailed comparison points that aren't "key advantages".
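
To make the "global per-epoch shuffling" point above concrete, here is a minimal, framework-free sketch of the difference between shuffling within each shard and shuffling across all shards. It is an illustration only, not Ray Datasets code: the function names `per_shard_shuffle` and `global_shuffle` and the list-of-lists "shards" are assumptions made up for this example.

```python
import random


def per_shard_shuffle(shards, seed):
    """Shuffle rows only within each shard; rows never cross shard boundaries."""
    rng = random.Random(seed)
    out = []
    for shard in shards:
        rows = list(shard)
        rng.shuffle(rows)
        out.append(rows)
    return out


def global_shuffle(shards, seed):
    """Shuffle all rows together, then re-split into shards of the original sizes."""
    rng = random.Random(seed)
    rows = [row for shard in shards for row in shard]
    rng.shuffle(rows)
    out, start = [], 0
    for shard in shards:
        out.append(rows[start:start + len(shard)])
        start += len(shard)
    return out


# Hypothetical data: two shards of four rows each.
shards = [[0, 1, 2, 3], [4, 5, 6, 7]]
for epoch in range(2):
    # Using the epoch number as the seed gives a fresh global ordering per epoch.
    print(epoch, global_shuffle(shards, seed=epoch))
```

With a per-shard shuffle, a row that starts in the first shard always stays there; with a global shuffle, any row can land in any shard, which is the property the bullet above refers to.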

clarkzinzow · 2022-05-19T04:24:03Z

@ericl I was making this kind of categorization when I saw your comment. I like your Datasets-oriented categories, so I aligned it to that. Lmk what you think.

maxpumperla

Looking really good! Just a couple of nits.

clarkzinzow · 2022-05-19T14:36:55Z

@maxpumperla Implemented your feedback, PTAL (once the docs build finishes in the next few minutes)

clarkzinzow · 2022-05-19T15:56:36Z

@jianoaix @ericl Ping for review

jianoaix

Mostly lgtm

jianoaix · 2022-05-19T19:31:05Z

doc/source/data/faq.rst

+:ref:`dataset creation feature guide <dataset_from_in_memory_data_distributed>` to learn
+more about these integrations.
+
+Datasets is specifically targeting


nit: unintended line break?

Yeah I was refactoring and didn't feel like rebalancing all lines to ~88 characters. 😛 It doesn't change the rendered text. But I can fix this while making the other changes.
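
As a rough illustration of the point above (re-wrapping a paragraph to about 88 characters changes the source lines but not the words a reader sees), here is a small sketch using Python's standard `textwrap` module. The helper names `rewrap` and `same_rendered_text` and the sample paragraph are hypothetical; this is not the tooling used in the PR.

```python
import textwrap


def rewrap(paragraph, width=88):
    """Re-fill a paragraph so lines fit in `width` characters where possible."""
    # Collapse existing line breaks first, then wrap. Long tokens such as URLs
    # are left intact rather than split, so such a line may exceed `width`.
    flat = " ".join(paragraph.split())
    return textwrap.fill(
        flat, width=width, break_long_words=False, break_on_hyphens=False
    )


def same_rendered_text(a, b):
    """Compare two paragraphs while ignoring where their line breaks fall."""
    return a.split() == b.split()


# Hypothetical paragraph with uneven line lengths, as left behind by a refactor.
original = (
    "This sentence was edited\n"
    "during a refactor and now has lines of uneven length, some short and\n"
    "some\n"
    "long, even though the words themselves are unchanged."
)

wrapped = rewrap(original)
print(wrapped)
print(same_rendered_text(original, wrapped))
```

The comparison ignores whitespace differences only, which matches how a reStructuredText paragraph treats line breaks inside ordinary running text.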

jianoaix · 2022-05-19T19:40:35Z

doc/source/data/faq.rst

+Torch datasets (and data loaders)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* **Framework-agnostic:** Datasets is framework-agnostic and portable between different


It looks like we are mostly comparing the same set of dimensions. How about groupby(comp_dimension) and then comment on each, or even better put them in a table?

I was going to put that off for the sake of not having to refactor this again, but I'll take a stab at it.

I have a WIP table locally but I won't get it done before my laptop dies, let's do this as a follow-up.

clarkzinzow · 2022-05-19T21:25:35Z

@jianoaix @ericl I think that all must-have feedback has been addressed, can we merge this? I won't be able to add to this PR anymore today so I think that we should merge this and do any remaining tweaks as nice-to-have follow-ups.

ericl · 2022-05-19T22:40:55Z

@jianoaix @ericl I think that all must-have feedback has been addressed, can we merge this? I won't be able to add to this PR anymore today so I think that we should merge this and do any remaining tweaks as nice-to-have follow-ups.

Merging. Please remove the author-action-required label in the future, otherwise it's easy to miss PRs that are mergeable.

ericl

Fixed a typo

This PR adds a FAQ to Datasets docs.

Docs preview: https://ray--24932.org.readthedocs.build/en/24932/

- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - [x] Unit tests
  - [ ] Release tests
  - [ ] This PR is not tested :(

Co-authored-by: Eric Liang

…A. (#25010)

* [Datasets] Add `from_huggingface` for Hugging Face datasets integration (#24464)
  Adds a from_huggingface method to Datasets, which allows the conversion of a Hugging Face Dataset to a Ray Dataset. As a Hugging Face Dataset is backed by an Arrow table, the conversion is trivial.
* Test the CSV read with column types specified (#24398)
  Make sure users can read csv with columns types specified. Users may want to do this because sometimes PyArrow's type inference doesn't work as intended, in which case users can step in and work around the type inference.
* [Datasets] [Docs] Add a warning about from_huggingface (#24608)
  Adds a warning to docs about the intended use of from_huggingface.
* [data] Expose `drop_last` in `to_tf` (#24666)
* [data] More informative exceptions in block impl (#24665)
* Add a classic yet small-sized ML dataset for demo/documentation/testing (#24592)
  To facilitate easy demo/documentation/testing with realistic, small-sized yet ML-familiar data. Have it as a source file with code will make it self-contained, i.e. after user "pip install" Ray, they are all set to run it. IRIS is a great fit: super classic ML dataset, simple schema, only 150 rows.
* [Datasets] Add more example data. (#24795)
  This PR adds more example data for ongoing feature guide work. In addition to adding the new datasets, this also puts all example data under examples/data in order to separate it from the example code.
* [Datasets] Add example protocol for reading canned in-package example data. (#24800)
  Providing easy-access datasets is table stakes for a good Getting Started UX, but even with good in-package data, it can be difficult to make these paths accessible to the user. This PR adds an "example://" protocol that will resolve passed paths directly to our canned in-package example data.
* [minor] Use np.searchsorted to speed up random access dataset (#24825)
* [Datasets] Change `range_arrow()` API to `range_table()` (#24704)
  This PR changes the ray.data.range_arrow() to ray.data.range_table(), making the Arrow representation an implementation detail.
* [Datasets] Support tensor columns in `to_tf` and `to_torch`. (#24752)
  This PR adds support for tensor columns in the to_tf() and to_torch() APIs. For Torch, this involves an explicit extension array check and (zero-copy) conversion of the tensor column to a NumPy array before converting the column to a Torch tensor. For TensorFlow, this involves bypassing df.values when converting tensor feature columns to NumPy arrays, instead manually creating a single NumPy array from the column Series. In both cases, I think that the UX around heterogeneous feature columns and squeezing the column dimension could be improved, but I'm saving that for a future PR.
* Implement random_sample() (#24492)
* Map progress bar title; pretty repr for rows. (#24672)
* [Datasets] [CI] fix CI of dataset test (#24883)
  CI test is broken by f61caa3. This PR fixes it.
* [Datasets] Add explicit resource allocation option via a top-level scheduling strategy (#24438)
  Instead of letting Datasets implicitly use cluster resources in the margins of explicit allocations of other libraries, such as Tune, Datasets should provide an option for explicitly allocating resources for a Datasets workload for users that want to box Datasets in. This PR adds such an explicit resource allocation option, via exposing a top-level scheduling strategy on the DatasetContext with which a placement group can be given.
* [Datasets] Add example of using `map_batches` to filter (#24202)
  The documentation says "Consider using .map_batches() for better performance (you can implement filter by dropping records)." but there aren't any examples of how to do so.
* [doc] Add docs for push-based shuffle in Datasets (#24486)
  Adds recommendations, example, and brief benchmark results for push-based shuffle in Datasets.
* [Doc][Data] fix big-data-ingestion broken links (#24631)
  The links were broken. Fixed it.
* [docs] Fix import error in Ray Data "getting started" (#24424)
  We did `import pandas as pd` but here we are using it as `pandas`
* [Datasets] Overhaul of "Creating Datasets" feature guide. (#24831)
  This PR is a general overhaul of the "Creating Datasets" feature guide, providing complete coverage of all (public) dataset creation APIs and highlighting features and quirks of the individual APIs, data modalities, storage backends, etc. In order to keep the page from getting too long and keeping it easy to navigate, tabbed views are used heavily.
* [Datasets] Add basic data ecosystem overview, user guide links, other data processing options card. (#23346)
* Revamp the Getting Started page for Dataset (#24860)
  This is part of the Dataset GA doc fix effort to update/improve the documentation. This PR revamps the Getting Started page. What are the changes:
  - Focus on basic/core features that are bread-and-butter for users, leave the advanced features out
  - Focus on high level introduction, leave the detailed spec out (e.g. what are possible batch_types for map_batches() API)
  - Use more realistic (yet still simple) data example that's familiar to people (IRIS dataset in this case)
  - Use the same data example throughout to make it context-switch free
  - Use runnable code rather than faked
  - Reference to the code from doc, instead of inlining them in the doc

  Co-authored-by: Ubuntu
  Co-authored-by: Eric Liang
* [Datasets] Miscellaneous GA docs P0s. (#24891)
  This PR knocks off a few miscellaneous GA docs P0s given in our docs tracker. Namely:
  - Documents Datasets resource allocation model.
  - De-emphasizes global/windowed shuffling.
  - Documents lazy execution mode, and expands our execution model docs in general.
* [docs] After careful consideration, choose the lesser of two evils and set white-space: pre-wrap #24873
* [Datasets] [Tensor Story - 1/2] Automatically provide tensor views to UDFs and infer tensor blocks for pure-tensor datasets. (#24812)
  This PR makes several improvements to the Datasets' tensor story. See the issues for each item for more details.
  - Automatically infer tensor blocks (single-column tables representing a single tensor) when returning NumPy ndarrays from map_batches(), map(), and flat_map().
  - Automatically infer tensor columns when building tabular blocks in general.
  - Fixes shuffling and sorting for tensor columns

  This should improve the UX/efficiency of the following:
  - Working with pure-tensor datasets in general.
  - Mapping tensor UDFs over pure-tensor, a better foundation for tensor-native preprocessing for end-users and AIR.
* [Datasets] Overhaul "Accessing Datasets" feature guide. (#24963)
  This PR overhauls the "Accessing Datasets", adding proper coverage of each data consuming methods, including the ML framework exchange APIs (to_torch() and to_tf()).
* [Datasets] Add FAQ to Datasets docs. (#24932)
  This PR adds a FAQ to Datasets docs. Docs preview: https://ray--24932.org.readthedocs.build/en/24932/
  - [x] I've run `scripts/format.sh` to lint the changes in this PR.
  - [x] I've included any doc changes needed for https://docs.ray.io/en/master/.
  - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  - Testing Strategy
    - [x] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(

  Co-authored-by: Eric Liang
* [Datasets] Add basic e2e Datasets example on NYC taxi dataset (#24874)
  This PR adds a dedicated docs page for examples, and adds a basic e2e tabular data processing example on the NYC taxi dataset. The goal of this example is to demonstrate basic data reading, inspection, transformations, and shuffling, along with ingestion into dummy model trainers and doing dummy batch inference, for tabular (Parquet) data.
* Revamp the Datasets API docstrings (#24949)
* Revamp the Saving Datasets user guide (#24987)
* Fix AIR references in Datasets FAQ.
* [Datasets] Skip flaky pipelining memory release test (#25009)
  This pipelining memory release test is flaky; it was skipped in this Polars PR, which was then reverted.
* Note that explicit resource allocation is experimental, fix typos (#25038)
* fix the notebook test failure
* no-op indent fix
* fix notebooks test #2
* Revamp the Transforming Datasets user guide (#25033)
* Fix range_arrow(), which is replaced by range_table() (#25036)
* indent
* allow empty
* Proofread the some datasets docs (#25068)
  Co-authored-by: Ubuntu
* [Data] Add partitioning classes to Data API reference (#24203)

Co-authored-by: Antoni Baum
Co-authored-by: Jian Xiao
Co-authored-by: Eric Liang
Co-authored-by: Robert
Co-authored-by: Balaji Veeramani
Co-authored-by: Stephanie Wang
Co-authored-by: Chen Shen
Co-authored-by: Zhe Zhang
Co-authored-by: Ubuntu
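
One entry in the commit log above mentions using `np.searchsorted` to speed up a random access dataset. The general idea behind such a change is binary search over sorted keys. The sketch below shows that idea with the standard-library `bisect` module as a stand-in for `np.searchsorted`; the `lookup` helper and the sample keys are hypothetical and this is not Ray's implementation.

```python
import bisect


def lookup(sorted_keys, values, key):
    """Return the value stored for `key`, or None if the key is absent.

    `sorted_keys` must be in ascending order; `values[i]` belongs to
    `sorted_keys[i]`. The search takes O(log n) comparisons.
    """
    i = bisect.bisect_left(sorted_keys, key)
    if i < len(sorted_keys) and sorted_keys[i] == key:
        return values[i]
    return None


# Hypothetical sorted keys with one value per key.
keys = [2, 3, 5, 7, 11]
values = ["a", "b", "c", "d", "e"]

print(lookup(keys, values, 7))
print(lookup(keys, values, 4))
```

`bisect.bisect_left` returns the leftmost insertion point that keeps the list sorted, which corresponds to the default `side="left"` behavior of `np.searchsorted` for a single query value.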

clarkzinzow force-pushed the datasets/docs/faq branch from c7810f7 to 301a243 May 19, 2022 00:04

clarkzinzow marked this pull request as ready for review May 19, 2022 00:04

clarkzinzow requested review from ericl, scv119, jjyao and maxpumperla as code owners May 19, 2022 00:04

clarkzinzow assigned ericl and jianoaix May 19, 2022

clarkzinzow force-pushed the datasets/docs/faq branch from 301a243 to 6a579b5 May 19, 2022 00:07

ericl reviewed May 19, 2022

ericl added the @author-action-required label ("The PR author is responsible for the next step. Remove tag to send back to the reviewer.") May 19, 2022

clarkzinzow force-pushed the datasets/docs/faq branch from fb9fb10 to b805c2a May 19, 2022 00:54

jianoaix reviewed May 19, 2022

ray-project deleted a comment from jianoaix May 19, 2022