Proofread some of the datasets docs (ray-project#25068)
Co-authored-by: Ubuntu <[email protected]>
jianoaix and Ubuntu committed May 23, 2022
1 parent 3c97861 commit 71443f8
Showing 3 changed files with 17 additions and 4 deletions.
2 changes: 1 addition & 1 deletion doc/source/data/faq.rst
@@ -276,7 +276,7 @@ out of such a gradient rut. In the distributed data-parallel training case, the
status quo solution is typically to have a per-shard in-memory shuffle buffer that you
fill up and pop random batches from, without mixing data across shards between epochs.
Ray Datasets also offers fully global random shuffling via
-:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()`, and doing so on an
+:meth:`ds.random_shuffle() <ray.data.Dataset.random_shuffle()>`, and doing so on an
epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as
``ray.data.read().repeat().random_shuffle_each_window()``. But when should you opt for
global per-epoch shuffling instead of local shuffle buffer shuffling?
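For instance, a global per-epoch shuffle pipeline can be sketched as follows (the
``range`` source, epoch count, and batch size are placeholders for your own dataset
and training loop):

.. code-block:: python

    import ray

    # Read once, repeat for three epochs, and fully shuffle each epoch (window)
    # before it is consumed.
    pipe = (
        ray.data.range(10000)          # stand-in for ray.data.read_*(...)
        .repeat(3)
        .random_shuffle_each_window()
    )

    for batch in pipe.iter_batches(batch_size=256):
        ...  # feed the shuffled batch to the trainer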
17 changes: 15 additions & 2 deletions doc/source/data/performance-tips.rst
@@ -90,8 +90,21 @@ Parquet Column Pruning
~~~~~~~~~~~~~~~~~~~~~~

Currently, Datasets will read all Parquet columns into memory.
-If you only need a subset of the columns, make sure to specify the list of columns explicitly when
-calling ``ray.data.read_parquet()`` to avoid loading unnecessary data.
+If you only need a subset of the columns, make sure to specify the list of columns
+explicitly when calling ``ray.data.read_parquet()`` to avoid loading unnecessary
+data (projection pushdown).
+For example, use ``ray.data.read_parquet("example://iris.parquet", columns=["sepal.length", "variety"])`` to read
+just two of the five columns of the Iris dataset.
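A minimal end-to-end sketch of column pruning with the bundled Iris example file:

.. code-block:: python

    import ray

    # Only the two requested columns are read from the Parquet file
    # (projection pushdown); the remaining columns are never loaded.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
    )
    print(ds.schema())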

+Parquet Row Pruning
+~~~~~~~~~~~~~~~~~~~

+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown),
+which will be applied at the file scan so that only rows matching the filter predicate
+are returned.
+For example, use ``ray.data.read_parquet("example://iris.parquet", filter=pa.dataset.field("sepal.length") > 5.0)``
+to read rows with sepal.length greater than 5.0.
+This can be used in conjunction with column pruning when appropriate to get the benefits of both.
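A minimal sketch combining row pruning with column pruning (``pyarrow.dataset`` is
imported here under the ``pa_ds`` alias):

.. code-block:: python

    import ray
    import pyarrow.dataset as pa_ds

    # Filter pushdown keeps only rows with sepal.length > 5.0, and the column
    # selection keeps only the two columns that are actually needed.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
        filter=pa_ds.field("sepal.length") > 5.0,
    )
    print(ds.count())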

Tuning Read Parallelism
~~~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion doc/source/data/transforming-datasets.rst
@@ -74,7 +74,7 @@ Compute Strategy
Datasets transformations are executed by either :ref:`Ray tasks <ray-remote-functions>`
or :ref:`Ray actors <actor-guide>` across a Ray cluster. By default, Ray tasks are
used (with ``compute="tasks"``). For transformations that require expensive setup,
-it's preferrable to use Ray actors, which are stateful and allows setup to be reused
+it's preferable to use Ray actors, which are stateful and allow setup to be reused
for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and
Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
transforms. For a fixed-size actor pool, just specify ``ActorPoolStrategy(n, n)``.
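A minimal sketch of the actor-based path (the ``CachedModel`` class and its setup are
illustrative placeholders, and batch formats can vary across Ray versions):

.. code-block:: python

    import ray

    class CachedModel:
        def __init__(self):
            # Expensive one-time setup (e.g., loading a model) runs once per
            # actor and is then reused for every batch that actor processes.
            self.offset = 100

        def __call__(self, batch):
            # For a simple (non-tabular) dataset, a batch arrives as a list.
            return [record + self.offset for record in batch]

    ds = ray.data.range(1000)
    # An autoscaling pool of 2 to 8 actors executes the transform.
    ds = ds.map_batches(CachedModel, compute=ray.data.ActorPoolStrategy(2, 8))
    print(ds.take(5))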
