[Data] [Docs] Update Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages #44093
Conversation
Docs Pages from PR:
LGTM, small nits
doc/source/data/loading-data.rst
Outdated
:class:`~ray.data.from_huggingface` only supports parallel reads in certain
instances, namely for untransformed public 🤗 Datasets. For those datasets,
`hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
will be used to perform a distributed read, otherwise a single node read will be used.
Suggested change:
- will be used to perform a distributed read, otherwise a single node read will be used.
+ will be used to perform a distributed read; otherwise, a single node read will be used.
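For context, a minimal sketch of loading a public 🤗 dataset with :func:`~ray.data.from_huggingface` might look like the following (assuming the `datasets` package is installed; the dataset name is illustrative and matches the sample output quoted later in this diff, not something specified in the PR):

```python
# Minimal sketch: load a public 🤗 dataset and convert it to a Ray Dataset.
import ray
from datasets import load_dataset

hf_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
ds = ray.data.from_huggingface(hf_dataset)
print(ds.take(2))  # e.g. [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]
```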
doc/source/data/loading-data.rst
Outdated
`hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
will be used to perform a distributed read, otherwise a single node read will be used.
This shouldn't be an issue with in-memory 🤗 Datasets, but may fail with
large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``
can we include links to HF docs for `DatasetDict` and `IteraableDatasetDict`?
doc/source/data/loading-data.rst
Outdated
`hosted parquet files <https://huggingface.co/docs/datasets-server/parquet#list-parquet-files>`_
will be used to perform a distributed read, otherwise a single node read will be used.
This shouldn't be an issue with in-memory 🤗 Datasets, but may fail with
large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``
Suggested change:
- large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` or ``IteraableDatasetDict``
+ large memory-mapped 🤗 Datasets. Additionally, 🤗 ``DatasetDict`` and ``IteraableDatasetDict``
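As a hedged illustration of the ``DatasetDict`` point above: calling ``load_dataset`` without a ``split`` argument returns a 🤗 ``DatasetDict``, so (assuming ``DatasetDict`` inputs aren't accepted directly, as the quoted text implies) you'd select a single split before converting:

```python
# Sketch: pick one split out of a 🤗 DatasetDict before calling from_huggingface.
import ray
from datasets import load_dataset

hf_splits = load_dataset("wikitext", "wikitext-2-raw-v1")  # returns a DatasetDict
ds = ray.data.from_huggingface(hf_splits["train"])         # convert a single split
```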
doc/source/data/loading-data.rst
Outdated
@@ -603,6 +611,31 @@ Ray Data interoperates with HuggingFace and TensorFlow datasets.

    [{'text': ''}, {'text': ' = Valkyria Chronicles III = \n'}]

.. tab-item:: PyTorch Dataset

    To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`.
Suggested change:
- To convert a PyTorch dataset to a Ray Dataset, call ::func:`~ray.data.from_torch`.
+ To convert a PyTorch dataset to a Ray Dataset, call :func:`~ray.data.from_torch`.
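A minimal sketch of that PyTorch path, assuming torchvision is installed (MNIST is just an illustrative dataset, not part of the PR):

```python
# Sketch: wrap a map-style torch.utils.data.Dataset as a Ray Dataset with from_torch.
import ray
from torchvision import datasets

torch_ds = datasets.MNIST("data", download=True)
ds = ray.data.from_torch(torch_ds)
print(ds.take(1))
```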
doc/source/data/loading-data.rst
Outdated
datasource and pass it to :func:`~ray.data.read_datasource`.
datasource and pass it to :func:`~ray.data.read_datasource`. To write results, you might
also need to subclass :class:`ray.data.Datasink`. Then, create an instance of your custom
datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide
Suggested change:
- datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details see the guide
+ datasink and pass it to :func:`~ray.data.Dataset.write_datasink`. For more details, see
@@ -145,6 +176,34 @@ To configure the batch type, specify ``batch_format`` in

        .map_batches(drop_nas, batch_format="pandas")
    )
The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches |
Suggested change:
- The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As atches
+ The user defined function passed to :meth:`~ray.data.Dataset.map_batches` is more flexible. As batches
can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type
``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In
Suggested change:
- can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), so the function should be of type
- ``Callable[DataBatch, DataBatch]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In
+ can be represented in multiple ways (more on this in :ref:`Configuring batch format <configure_batch_format>`), the function should be of type
+ ``Callable[DataBatch, DataBatch]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``. In
other words your function should input and output a batch of data which can be represented as a
pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need
Suggested change:
- other words your function should input and output a batch of data which can be represented as a
- pandas dataframe or a dictionary with string keys and NumPy ndarrays values. Your function does not need
+ other words, your function should take as input and output a batch of data which can be represented as a
+ pandas DataFrame or a dictionary with string keys and NumPy ndarrays values. Your function does not need
to return a batch in the same format as it is input, so you could input a pandas dataframe and output a
dictionary of NumPy ndarrays. For example your function might look like:
Suggested change:
- to return a batch in the same format as it is input, so you could input a pandas dataframe and output a
- dictionary of NumPy ndarrays. For example your function might look like:
+ to return a batch in the same format as its input, so you could input a pandas DataFrame and output a
+ dictionary of NumPy ndarrays. For example, your function might look like:
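To make the suggested wording concrete, here's a hedged sketch of such a UDF (the column name and data are placeholders, not taken from the PR): it takes a pandas DataFrame batch and returns a dict of NumPy ndarrays.

```python
# Sketch: a map_batches UDF that takes a pandas DataFrame batch and returns
# a Dict[str, np.ndarray]; the "value" column is a placeholder.
from typing import Dict

import numpy as np
import pandas as pd
import ray

def to_numpy_batch(batch: pd.DataFrame) -> Dict[str, np.ndarray]:
    return {"value": batch["value"].to_numpy() * 2}

ds = ray.data.from_pandas(pd.DataFrame({"value": [1, 2, 3]}))
ds = ds.map_batches(to_numpy_batch, batch_format="pandas")
print(ds.take_all())
```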
The user defined function can also return an iterator that yields batches, so the function can also
be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
In this case your function would look like:
Suggested change:
- The user defined function can also return an iterator that yields batches, so the function can also
- be of type ``Callable[DataBatch, Iterator[[DataBatch]]`` where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
- In this case your function would look like:
+ The user defined function can also be a Python generator that yields batches, so the function can also
+ be of type ``Callable[DataBatch, Iterator[[DataBatch]]``, where ``DataBatch = Union[pd.DataFrame, Dict[str, np.ndarray]]``.
+ In this case, your function would look like:
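And a hedged sketch of that generator variant (chunk size and column name are arbitrary placeholders): the UDF yields several smaller batches per input batch.

```python
# Sketch: a generator UDF that yields multiple Dict[str, np.ndarray] batches
# per input batch; here each batch is split into chunks of two rows.
from typing import Dict, Iterator

import numpy as np
import ray

def split_batch(batch: Dict[str, np.ndarray]) -> Iterator[Dict[str, np.ndarray]]:
    n = len(batch["value"])
    for start in range(0, n, 2):
        yield {"value": batch["value"][start:start + 2]}

ds = ray.data.from_items([{"value": i} for i in range(8)])
ds = ds.map_batches(split_batch, batch_format="numpy")
print(ds.take_all())
```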
Signed-off-by: Matthew Owen <[email protected]>
Force-pushed from bb0802a to e55ca12.
Signed-off-by: Matthew Owen <[email protected]>
Just some nits. Excuse any mangling in the suggestions when I tried to change passive voice to active voice. Please correct as needed. Very nice job overall. Consider using Vale to catch some of these copy edits I made. (go/vale)
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Matthew Owen <[email protected]>
@@ -83,7 +83,7 @@ the appropriate scheme. URI can point to buckets or folders.

    filesystem = gcsfs.GCSFileSystem(project="my-google-project")
    ds.write_parquet("gcs://my-bucket/my-folder", filesystem=filesystem)

.. tab-item:: ABL
.. tab-item:: ABS

    To save data to Azure Blob Storage, install the
    `Filesystem interface to Azure-Datalake Gen1 and Gen2 Storage <https://pypi.org/project/adlfs/>`_
Same as read; also add a tip on how to tune configs for write failure retries.
Discussed offline, will add the tip on configs later.
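For reference, a hedged sketch of the ABS write path discussed above, following the same pattern as the GCS example in the hunk (account name, container, and folder are placeholders; assumes `adlfs` is installed):

```python
# Sketch: write a Ray Dataset to Azure Blob Storage via an adlfs filesystem.
import adlfs
import ray

ds = ray.data.range(100)
filesystem = adlfs.AzureBlobFileSystem(account_name="my-account")
ds.write_parquet("az://my-container/my-folder", filesystem=filesystem)
```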
Co-authored-by: angelinalg <[email protected]> Signed-off-by: Matthew Owen <[email protected]>
Signed-off-by: Matthew Owen <[email protected]>
Force-pushed from 3fb69d0 to 94ac4b4.
This breaks the data doc test; I'm putting up a revert to double-check (https://buildkite.com/ray-project/postmerge/builds/3645).
#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that. Signed-off-by: Matthew Owen <[email protected]>
…r, and Saving Data pages (ray-project#44093) --------- Signed-off-by: Matthew Owen <[email protected]> Signed-off-by: Matthew Owen <[email protected]> Co-authored-by: angelinalg <[email protected]>
ray-project#44093 broke one of the data doc tests that was not run on premerge (example here: https://buildkite.com/ray-project/postmerge/builds/3645). This adds missing imports to fix that. Signed-off-by: Matthew Owen <[email protected]>
…r, and Saving Data pages (#44093) (#44221) Docs-only cherry pick for release. Note: this cherry-pick includes four commits, all related to changing the Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages. They're rolled together to reduce cherry-picking overhead and are all part of the logical update to these pages. The PRs included in this cherry-pick: the main overhaul of the listed pages, two fixes to doc tests that were broken by the overhaul (fix 1, fix 2), and an additional small change, added after the initial merge of the main overhaul, to explain how to use credentials. --------- Signed-off-by: Matthew Owen <[email protected]> Signed-off-by: Matthew Owen <[email protected]> Co-authored-by: angelinalg <[email protected]>
Why are these changes needed?
This PR is to update Ray Data documentation for Loading, Transforming, Inspecting, Iterating Over, and Saving Data pages as discussed offline.
Related issue number
Checks
- I've signed off every commit (`git commit -s`) in this PR.
- I've run `scripts/format.sh` to lint the changes in this PR.
- If I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file.