
[docs] Editing pass over Dataset docs #26935

Merged · 19 commits · Jul 25, 2022

Conversation

@ericl (Contributor) commented Jul 23, 2022

Why are these changes needed?

This PR makes a general editing pass over the Dataset docs. In particular:

  • Move the scheduling-with-Tune section out of key concepts and into a user guide
  • Combine the scheduling and memory management material into a single guide
  • Consolidate pipeline usage into one guide and move the advanced examples into the examples section
  • Rearrange section order and titles for a more natural flow
  • Remove outdated and unnecessary details from various sections

Depends on: #26934

Review threads (resolved): doc/source/data/pipelining-compute.rst (outdated), doc/source/data/consuming-datasets.rst, doc/source/data/consuming-datasets.rst (outdated)
@ericl added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Jul 24, 2022

In order to reduce memory usage and task overheads, Datasets will automatically fuse together
lazy operations that are compatible:
Contributor:

Also the UDFs need to be compatible (e.g., the constructors of class-based UDFs).

Contributor Author:

This is already covered by the compute strategy (you can't specify actors for a non-class UDF), right?
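
For context, a minimal sketch of the fusion rule being discussed. This is illustrative only: the lazy(), map_batches(), and compute="actors" calls are assumed from the Ray Datasets API of that era, and exact fusion behavior may differ across versions.

import ray

# Two task-based map_batches stages with compatible compute strategies can be
# fused into a single stage when the lazy plan executes.
ds = ray.data.range(1000).lazy()
ds = ds.map_batches(lambda batch: [x * 2 for x in batch])
ds = ds.map_batches(lambda batch: [x + 1 for x in batch])

class Adder:
    def __call__(self, batch):
        return [x + 1 for x in batch]

# A class-based UDF on an actor pool uses a different compute strategy, so it
# is generally not fused with the task-based stages above.
ds = ds.map_batches(Adder, compute="actors")
print(ds.take(5))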

Review thread (resolved): doc/source/data/dataset-internals.rst (outdated)
@@ -202,3 +201,88 @@ You can also specify the size of each window using ``bytes_per_window``. In this
ray.data.read_binary_files("s3://bucket/image-dir") \
    .window(bytes_per_window=10e9)
# -> INFO -- Created DatasetPipeline with 73 windows: 9120MiB min, 9431MiB max, 9287MiB mean
# -> INFO -- Blocks per window: 10 min, 16 max, 14 mean
# -> INFO -- ✔️ This pipeline's per-window parallelism is high enough to fully utilize the cluster.
# -> INFO -- ✔️ This pipeline's windows can each fit in object store memory without spilling.
Contributor:

This might be tricky when there are multiple stages in the pipeline, since there can be multiple in-flight windows.

Contributor Author:

Yeah, though the common case is 1-2 stages due to stage fusion. We could detect multi-stage cases in the future, but so far they seem pretty rare.
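
For reference, a standalone version of the windowed-read pattern quoted above; the S3 path is a placeholder, and actual window counts and log output depend on the data and cluster.

import ray

# Read a large binary dataset as a pipeline of roughly 10 GiB windows, so only
# the in-flight windows need to fit in object store memory at once.
pipe = ray.data.read_binary_files("s3://bucket/image-dir") \
    .window(bytes_per_window=10e9)

# Downstream stages are applied window by window as the pipeline executes.
pipe = pipe.map_batches(lambda batch: batch)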


It's common in ML training to want to divide data ingest into epochs, or repetitions over the original source dataset.
DatasetPipeline provides a convenient ``.iter_epochs()`` method that can be used to split up the pipeline into epoch-delimited pipeline segments.
Epochs are defined by the last call to ``.repeat()`` in a pipeline, for example:
Contributor:

Not sure what this means, since an epoch is independent of .repeat() or .window() calls: it's one repetition of the original source dataset (as said above).

Contributor Author:

If a dataset is repeated twice, the "source dataset" is defined by the second repeat, not the first.
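
To make the semantics concrete, a small sketch of epoch-delimited ingest using the APIs referenced above; this is illustrative, and the exact iteration APIs may differ by Ray version.

import ray

# Repeat the source dataset 3 times; each repetition is one epoch.
pipe = ray.data.range(8).repeat(3)

# iter_epochs() splits the pipeline into per-epoch sub-pipelines.
for i, epoch_pipe in enumerate(pipe.iter_epochs()):
    rows = list(epoch_pipe.iter_rows())
    print(f"epoch {i}: {len(rows)} rows")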

In addition, we collect statistics about iterator timings (time spent waiting, processing, and in user code).
Here's a sample output of getting stats in one of the most advanced use cases, namely iterating over a split of a dataset pipeline in a remote task.
These stats can be used to understand the performance of your Dataset workload and can help you debug problematic bottlenecks. Note that both execution and iterator statistics are available:
Contributor:

Can we briefly explain how to read the stats? For example, "Remote wall time": is it the time to complete all tasks in a stage, or just one of them? Similarly, "Remote CPU time": is it the sum of CPU costs across all tasks, or just a single one?

Contributor Author:

Sure, I think that would be a good follow-up.
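
As a rough illustration of the stats being discussed, a minimal sketch using Dataset.stats(); the exact fields in the report, such as remote wall time and remote CPU time, depend on the Ray version.

import ray

ds = ray.data.range(1000).map_batches(lambda batch: batch)

# Consume the dataset so iterator timings are recorded alongside the
# per-stage execution stats.
for batch in ds.iter_batches():
    pass

# Prints a human-readable report of execution and iterator statistics.
print(ds.stats())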

@richardliaw (Contributor) left a comment:

stamping

@ericl removed the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Jul 25, 2022
@ericl (Contributor Author) commented Jul 25, 2022

I'm going to try to merge this for now. We still need a lot of improvements to the Datasets docs, but this lays the groundwork for further fine-tuning that we can parallelize.

@ericl merged commit 1ac2a87 into ray-project:master on Jul 25, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
@zhe-thoughts added the docs label ("An issue or change related to documentation") on Jul 30, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
ericl pushed a commit that referenced this pull request Oct 19, 2022
#26935 removed `accessing-datasets.rst`. Now, most of the code in `doc_code/accessing-datasets.py` is unused.

I've cleaned up the file accordingly.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Labels: docs (An issue or change related to documentation)
Projects: None yet
8 participants