
[docs] Editing pass over Dataset docs #26935

Merged · 19 commits · Jul 25, 2022

Conversation

@ericl (Contributor) commented Jul 23, 2022

Why are these changes needed?

This PR makes a general editing pass over the Dataset docs. In particular:

  • Move the scheduling-with-Tune section out of key concepts and into a user guide
  • Combine the scheduling and memory management material into a single guide
  • Consolidate pipeline usage into one guide and move the advanced examples into the examples section
  • Rearrange section order and titles for a more natural flow
  • Remove outdated and unnecessary details from various sections

Depends on: #26934

Review threads (resolved): doc/source/data/pipelining-compute.rst (outdated), doc/source/data/consuming-datasets.rst, doc/source/data/consuming-datasets.rst (outdated)
@ericl added the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Jul 24, 2022

In order to reduce memory usage and task overheads, Datasets will automatically fuse together
lazy operations that are compatible:
Contributor:

Also the UDFs need to be compatible (e.g., the constructors of class-based UDFs).

Contributor Author:

This is already covered by the compute strategy (you can't specify actors for a non-class UDF), right?
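
For context, a minimal sketch of the fusion rule being discussed. This is illustrative only: the lazy(), map_batches(), and compute="actors" calls are assumed from the Ray Datasets API of that era, and exact fusion behavior may differ across versions.

import ray

# Two task-based map_batches stages with compatible compute strategies can be
# fused into a single stage when the lazy plan executes.
ds = ray.data.range(1000).lazy()
ds = ds.map_batches(lambda batch: [x * 2 for x in batch])
ds = ds.map_batches(lambda batch: [x + 1 for x in batch])

class Adder:
    def __call__(self, batch):
        return [x + 1 for x in batch]

# A class-based UDF on an actor pool uses a different compute strategy, so it
# is generally not fused with the task-based stages above.
ds = ds.map_batches(Adder, compute="actors")
print(ds.take(5))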

Review thread (resolved): doc/source/data/dataset-internals.rst (outdated)
@@ -202,3 +201,88 @@ You can also specify the size of each window using ``bytes_per_window``. In this
ray.data.read_binary_files("s3://bucket/image-dir") \
    .window(bytes_per_window=10e9)
# -> INFO -- Created DatasetPipeline with 73 windows: 9120MiB min, 9431MiB max, 9287MiB mean
# -> INFO -- Blocks per window: 10 min, 16 max, 14 mean
# -> INFO -- ✔️ This pipeline's per-window parallelism is high enough to fully utilize the cluster.
# -> INFO -- ✔️ This pipeline's windows can each fit in object store memory without spilling.
Contributor:

This might be tricky when there are multiple stages in the pipeline, since there can be multiple in-flight windows.

Contributor Author:

Yeah, though the common case is 1-2 stages due to stage fusion. We could detect multi-stage cases in the future, but so far they seem pretty rare.
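
For reference, a standalone version of the windowed-read pattern quoted above; the S3 path is a placeholder, and actual window counts and log output depend on the data and cluster.

import ray

# Read a large binary dataset as a pipeline of roughly 10 GiB windows, so only
# the in-flight windows need to fit in object store memory at once.
pipe = ray.data.read_binary_files("s3://bucket/image-dir") \
    .window(bytes_per_window=10e9)

# Downstream stages are applied window by window as the pipeline executes.
pipe = pipe.map_batches(lambda batch: batch)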


It's common in ML training to want to divide data ingest into epochs, or repetitions over the original source dataset.
DatasetPipeline provides a convenient ``.iter_epochs()`` method that can be used to split up the pipeline into epoch-delimited pipeline segments.
Epochs are defined by the last call to ``.repeat()`` in a pipeline, for example:
Contributor:

Not sure what this means, since an epoch is independent of .repeat() or .window() calls: it's one repetition of the original source dataset (as said above).

Contributor Author:

If a dataset is repeated twice, the "source dataset" is defined by the second repeat, not the first.
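
To make the semantics concrete, a small sketch of epoch-delimited ingest using the APIs referenced above; this is illustrative, and the exact iteration APIs may differ by Ray version.

import ray

# Repeat the source dataset 3 times; each repetition is one epoch.
pipe = ray.data.range(8).repeat(3)

# iter_epochs() splits the pipeline into per-epoch sub-pipelines.
for i, epoch_pipe in enumerate(pipe.iter_epochs()):
    rows = list(epoch_pipe.iter_rows())
    print(f"epoch {i}: {len(rows)} rows")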

In addition, we collect statistics about iterator timings (time spent waiting, processing, and in user code).
Here's a sample output of getting stats in one of the most advanced use cases, namely iterating over a split of a dataset pipeline in a remote task.
These stats can be used to understand the performance of your Dataset workload and can help you debug problematic bottlenecks. Note that both execution and iterator statistics are available:
Contributor:

Can we briefly explain how to read the stats? For example, "Remote wall time": is it the time to complete all tasks in a stage, or just one of them? Similarly, "Remote CPU time": is it the sum of CPU costs across all tasks, or just a single one?

Contributor Author:

Sure, I think that would be a good follow-up.
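
As a rough illustration of the stats being discussed, a minimal sketch using Dataset.stats(); the exact fields in the report, such as remote wall time and remote CPU time, depend on the Ray version.

import ray

ds = ray.data.range(1000).map_batches(lambda batch: batch)

# Consume the dataset so iterator timings are recorded alongside the
# per-stage execution stats.
for batch in ds.iter_batches():
    pass

# Prints a human-readable report of execution and iterator statistics.
print(ds.stats())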

@richardliaw (Contributor) left a comment:

stamping

@ericl removed the tests-ok label ("The tagger certifies test failures are unrelated and assumes personal liability.") on Jul 25, 2022
@ericl (Contributor Author) commented Jul 25, 2022

I'm going to try to merge this for now. We still need a lot of improvements to the Datasets docs, but this lays the groundwork for further fine-tuning that we can parallelize.

@ericl merged commit 1ac2a87 into ray-project:master on Jul 25, 2022
Rohan138 pushed a commit to Rohan138/ray that referenced this pull request Jul 28, 2022
@zhe-thoughts added the docs label ("An issue or change related to documentation") on Jul 30, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
ericl pushed a commit that referenced this pull request Oct 19, 2022
#26935 removed `accessing-datasets.rst`. Now, most of the code in `doc_code/accessing-datasets.py` is unused.

I've cleaned up the file accordingly.
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
Labels: docs (An issue or change related to documentation)
Projects: None yet
8 participants