diff --git a/doc/source/data/faq.rst b/doc/source/data/faq.rst
index aa36d08c46fa..f2fd5d5d1aa8 100644
--- a/doc/source/data/faq.rst
+++ b/doc/source/data/faq.rst
@@ -276,7 +276,7 @@
 out of such a gradient rut. In the distributed data-parallel training case, the status quo
 solution is typically to have a per-shard in-memory shuffle buffer that you fill up and pop
 random batches from, without mixing data across shards between epochs. Ray Datasets also
 offers fully global random shuffling via
-:meth:`ds.random_shuffle() `, and doing so on an epoch-repeated dataset pipeline to provide global per-epoch shuffling is as simple as ``ray.data.read().repeat().random_shuffle_each_window()``.
 But when should you opt for global per-epoch shuffling instead of local shuffle buffer
 shuffling?
diff --git a/doc/source/data/performance-tips.rst b/doc/source/data/performance-tips.rst
index 06abb9dab38c..ba51447742ba 100644
--- a/doc/source/data/performance-tips.rst
+++ b/doc/source/data/performance-tips.rst
@@ -90,8 +90,21 @@ Parquet Column Pruning
 ~~~~~~~~~~~~~~~~~~~~~~

 Current Datasets will read all Parquet columns into memory.
-If you only need a subset of the columns, make sure to specify the list of columns explicitly when
-calling ``ray.data.read_parquet()`` to avoid loading unnecessary data.
+If you only need a subset of the columns, make sure to specify the list of columns
+explicitly when calling ``ray.data.read_parquet()`` to avoid loading unnecessary
+data (projection pushdown).
+For example, use ``ray.data.read_parquet("example://iris.parquet", columns=["sepal.length", "variety"])`` to read
+just two of the five columns of the Iris dataset.
+
+Parquet Row Pruning
+~~~~~~~~~~~~~~~~~~~
+
+Similarly, you can pass in a filter to ``ray.data.read_parquet()`` (selection pushdown),
+which will be applied at the file scan so only rows that match the filter predicate
+will be returned.
+For example, use ``ray.data.read_parquet("example://iris.parquet", filter=pa.dataset.field("sepal.length") > 5.0)``
+to read only rows with sepal.length greater than 5.0.
+This can be used in conjunction with column pruning when appropriate to get the benefits of both.

 Tuning Read Parallelism
 ~~~~~~~~~~~~~~~~~~~~~~~
diff --git a/doc/source/data/transforming-datasets.rst b/doc/source/data/transforming-datasets.rst
index 6f4261880c32..ff6e4a8b52ff 100644
--- a/doc/source/data/transforming-datasets.rst
+++ b/doc/source/data/transforming-datasets.rst
@@ -74,7 +74,7 @@ Compute Strategy
 Datasets transformations are executed by either :ref:`Ray tasks ` or :ref:`Ray actors `
 across a Ray cluster. By default, Ray tasks are used (with ``compute="tasks"``).
 For transformations that require expensive setup,
-it's preferrable to use Ray actors, which are stateful and allows setup to be reused
+it's preferable to use Ray actors, which are stateful and allow setup to be reused
 for efficiency. You can specify ``compute=ray.data.ActorPoolStrategy(min, max)`` and
 Ray will use an autoscaling actor pool of ``min`` to ``max`` actors to execute your
 transforms. For a fixed-size actor pool, just specify ``ActorPoolStrategy(n, n)``.
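The two pushdowns documented above can also be combined in a single read call. The snippet below is a minimal sketch, assuming ``pyarrow`` and Ray are installed and that the bundled ``example://iris.parquet`` file used in the examples above is available.

.. code-block:: python

    import ray
    import pyarrow.dataset as pds

    # Projection pushdown: read only two of the five Iris columns.
    # Selection pushdown: keep only rows whose sepal.length exceeds 5.0.
    # Both are applied during the Parquet file scan rather than after loading.
    ds = ray.data.read_parquet(
        "example://iris.parquet",
        columns=["sepal.length", "variety"],
        filter=pds.field("sepal.length") > 5.0,
    )
    print(ds.take(3))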
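For the actor compute strategy, the usual pattern is to pass a callable class to the transform so that its expensive setup runs once per actor and is reused across batches. The sketch below illustrates this under stated assumptions: the ``SlowSetupModel`` class, the ``map_batches`` call, and the sample data are illustrative and not part of the docs changes above.

.. code-block:: python

    import ray
    from ray.data import ActorPoolStrategy

    class SlowSetupModel:
        def __init__(self):
            # Expensive one-time setup (e.g. loading a model) runs once per actor
            # and is reused for every batch that actor processes.
            self.factor = 2

        def __call__(self, batch):
            # Called per batch; with batch_format="pandas", `batch` is a DataFrame.
            return batch * self.factor

    ds = ray.data.range_table(1000)
    # Autoscaling pool of 2 to 8 actors; ActorPoolStrategy(n, n) gives a fixed size.
    out = ds.map_batches(
        SlowSetupModel,
        compute=ActorPoolStrategy(2, 8),
        batch_format="pandas",
    )
    print(out.take(2))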