Proofread some of the datasets docs (ray-project#25068)
Co-authored-by: Ubuntu <[email protected]>
jianoaix and Ubuntu committed May 23, 2022
1 parent 3c97861 commit 71443f8
Showing 3 changed files with 17 additions and 4 deletions.
2 changes: 1 addition & 1 deletion doc/source/data/faq.rst
@@ -276,7 +276,7 @@ out of such a gradient rut. In the distributed data-parallel training case, the
status quo solution is typically to have a per-shard in-memory shuffle buffer that you
fill up and pop random batches from, without mixing data across shards between epochs.
Ray Datasets also offers fully global random shuffling via
-:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()`, and doing so on an
+:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()>`, and doing so on an
epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as
``ray.data.read().repeat().random_shuffle_each_window()``. But when should you opt for
global per-epoch shuffling instead of local shuffle buffer shuffling?
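For instance, a global per-epoch shuffle pipeline can be sketched as follows (the
``range`` source, epoch count, and batch size are placeholders for your own dataset
and training loop):

.. code-block:: python

    import ray

    # Read once, repeat for three epochs, and fully shuffle each epoch (window)
    # before it is consumed.
    pipe = (
        ray.data.range(10000)          # stand-in for ray.data.read_*(...)
        .repeat(3)
        .random_shuffle_each_window()
    )

    for batch in pipe.iter_batches(batch_size=256):
        ...  # feed the shuffled batch to the trainer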
17 changes: 15 additions & 2 deletions doc/source/data/performance-tips.rst
@@ -90,8 +90,21 @@ Parquet Column Pruning
~~~~~~~~~~~~~~~~~~~~~~

Currently, Datasets will read all Parquet columns into memory.
-If you only need a subset of the columns, make sure to specify the list of columns explicitly when
-calling ``ray.data.read_parquet()`` to avoid loading unnecessary data.
+If you only need a subset of the columns, make sure to specify the list of columns
+explicitly when calling ``ray.data.read_parquet()`` to avoid loading unnecessary
+data (projection pushdown).
+For example, use ``ray.data.read_parquet("example://iris.parquet", columns=["sepal.length", "variety"])`` to read
+just two of the five columns of the Iris dataset.
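A minimal end-to-end sketch of column pruning with the bundled Iris example file:

.. code-block:: python

    import ray

    # Only the two requested columns are read from the Parquet file
    # (projection pushdown); the remaining columns are never loaded.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
    )
    print(ds.schema())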

+Parquet Row Pruning
+~~~~~~~~~~~~~~~~~~~

+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown),
+which will be applied at the file scan so that only rows matching the filter predicate
+are returned.
+For example, use ``ray.data.read_parquet("example://iris.parquet", filter=pa.dataset.field("sepal.length") > 5.0)``
+to read rows with sepal.length greater than 5.0.
+This can be used in conjunction with column pruning when appropriate to get the benefits of both.
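A minimal sketch combining row pruning with column pruning (``pyarrow.dataset`` is
imported here under the ``pa_ds`` alias):

.. code-block:: python

    import ray
    import pyarrow.dataset as pa_ds

    # Filter pushdown keeps only rows with sepal.length > 5.0, and the column
    # selection keeps only the two columns that are actually needed.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
        filter=pa_ds.field("sepal.length") > 5.0,
    )
    print(ds.count())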

Tuning Read Parallelism
~~~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion doc/source/data/transforming-datasets.rst
@@ -74,7 +74,7 @@ Compute Strategy
Datasets transformations are executed by either :ref:`Ray tasks <ray-remote-functions>`
or :ref:`Ray actors <actor-guide>` across a Ray cluster. By default, Ray tasks are
used (with ``compute="tasks"``). For transformations that require expensive setup,
-it's preferrable to use Ray actors, which are stateful and allows setup to be reused
+it's preferable to use Ray actors, which are stateful and allow setup to be reused
for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and
Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
transforms. For a fixed-size actor pool, just specify ``ActorPoolStrategy(n, n)``.
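A minimal sketch of the actor-based path (the ``CachedModel`` class and its setup are
illustrative placeholders, and batch formats can vary across Ray versions):

.. code-block:: python

    import ray

    class CachedModel:
        def __init__(self):
            # Expensive one-time setup (e.g., loading a model) runs once per
            # actor and is then reused for every batch that actor processes.
            self.offset = 100

        def __call__(self, batch):
            # For a simple (non-tabular) dataset, a batch arrives as a list.
            return [record + self.offset for record in batch]

    ds = ray.data.range(1000)
    # An autoscaling pool of 2 to 8 actors executes the transform.
    ds = ds.map_batches(CachedModel, compute=ray.data.ActorPoolStrategy(2, 8))
    print(ds.take(5))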
