[docs] Editing pass over Dataset docs #26935
Conversation
In order to reduce memory usage and task overheads, Datasets will automatically fuse together
lazy operations that are compatible:
Also, the UDFs need to be compatible (e.g., the constructors of class UDFs).
This is already covered by `compute` (you can't specify actors for a non-class UDF), right?
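For illustration, a minimal sketch of that interaction (assuming the Ray 2.x Datasets API, where `map_batches` accepts `compute="actors"`): task-based UDFs with compatible compute strategies can be fused into one stage, while a class UDF has to run on actors and therefore forms its own stage.

```python
import ray


class Model:
    # Class UDF: it has a constructor, which is why it must run with actor compute.
    def __init__(self):
        self.state = "loaded"

    def __call__(self, batch):
        return batch  # identity transform, format-agnostic


pipe = ray.data.range(1000).window(blocks_per_window=10)

# Task-based UDF with the default compute strategy: eligible for fusion
# with the upstream read stage.
pipe = pipe.map_batches(lambda batch: batch)

# Class UDF on actors: incompatible compute strategy, so it becomes a
# separate (non-fused) stage.
pipe = pipe.map_batches(Model, compute="actors")

for batch in pipe.iter_batches():
    pass
```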
@@ -202,3 +201,88 @@ You can also specify the size of each window using ``bytes_per_window``. In this
    .read_binary_files("s3://bucket/image-dir") \
    .window(bytes_per_window=10e9)
# -> INFO -- Created DatasetPipeline with 73 windows: 9120MiB min, 9431MiB max, 9287MiB mean
# -> INFO -- Blocks per window: 10 min, 16 max, 14 mean
# -> INFO -- ✔️ This pipeline's per-window parallelism is high enough to fully utilize the cluster.
# -> INFO -- ✔️ This pipeline's windows can each fit in object store memory without spilling.
This might be tricky when there are multiple stages in the pipeline, since then there are multiple in-flight windows.
Yeah, though the common case is 1-2 stages due to stage fusion. We could detect multi-stage cases in the future, but so far they seem pretty rare.
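For context, a hedged sketch of the surrounding pattern (the bucket path is illustrative, taken from the excerpt above): create the windowed pipeline, then consume it window by window, so only roughly one ~10 GB window needs to fit in object store memory at a time.

```python
import ray

# Each window targets ~10 GB of input data from the (illustrative) bucket.
pipe = ray.data.read_binary_files("s3://bucket/image-dir") \
    .window(bytes_per_window=10e9)

# Windows are materialized one at a time as the pipeline is consumed,
# so peak object store usage is bounded by roughly one window.
for batch in pipe.iter_batches(batch_size=1024):
    pass  # feed the batch to preprocessing / training
```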
It's common in ML training to want to divide data ingest into epochs, or repetitions over the original source dataset.
DatasetPipeline provides a convenient ``.iter_epochs()`` method that can be used to split up the pipeline into epoch-delimited pipeline segments.
Epochs are defined by the last call to ``.repeat()`` in a pipeline, for example:
Not sure what this means, since an epoch is independent of `.repeat()` or `.window()` calls; it's one repetition of the original source dataset (as said above).
If a dataset is repeated twice, the "source dataset" is defined by the second repeat, not the first.
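A small sketch of the behavior being described, assuming the Ray 2.x `DatasetPipeline` API: the last `.repeat()` call defines the epoch boundaries, and `.iter_epochs()` yields one pipeline segment per repetition of the source dataset.

```python
import ray

# Source dataset of 8 rows, repeated twice -> 2 epochs.
pipe = ray.data.range(8).repeat(2)

for i, epoch_pipe in enumerate(pipe.iter_epochs()):
    # Each epoch_pipe is itself a DatasetPipeline over one repetition
    # of the source dataset.
    rows = list(epoch_pipe.iter_rows())
    print(f"epoch {i}: {len(rows)} rows")
```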
In addition, we also collect statistics about iterator timings (time spent waiting / processing / in user code).
Here's a sample output of getting stats in one of the most advanced use cases,
namely iterating over a split of a dataset pipeline in a remote task:
These stats can be used to understand the performance of your Dataset workload and can help you debug problematic bottlenecks. Note that both execution and iterator statistics are available:
Can we briefly talk about how to read the stats?
For example, "Remote wall time": is it the time to complete all tasks in a stage, or just one of them? "Remote CPU time": similarly, is it the sum of all CPU costs across tasks or just a single one?
Sure, I think that would be a good followup.
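As a starting point for that follow-up, a hedged sketch of pulling the stats report after executing and iterating over a dataset (names are from the Ray 2.x Datasets API; the exact fields in the report may differ by version):

```python
import ray

ds = ray.data.range(10_000).map_batches(lambda batch: batch)

# Iterate so that iterator timing stats (wait / processing / user code)
# are recorded in addition to the per-stage execution stats.
for _ in ds.iter_batches(batch_size=1024):
    pass

# Prints per-stage execution stats (e.g. wall time and CPU time across tasks)
# followed by the iterator timing breakdown.
print(ds.stats())
```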
stamping
I'm going to try to merge this for now. We still need a lot of improvements to the Datasets docs, but this will lay the groundwork for further fine-tuning that we can parallelize.
#26935 removed `accessing-datasets.rst`. Now, most of the code in `doc_code/accessing-datasets.py` is unused. I've cleaned up the file accordingly.
Why are these changes needed?
This PR makes a general editing pass over the Dataset docs. In particular:
Depends on: #26934