Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Datasets] Refactor Ray Data API documentation #31204

Merged
merged 12 commits into from
Jan 6, 2023

Conversation

c21
Copy link
Contributor

@c21 c21 commented Dec 19, 2022

Signed-off-by: Cheng Su [email protected]

Why are these changes needed?

Please review the new documentation in https://ray--31204.org.readthedocs.build/en/31204/data/api/api.html .

This PR is to refactor Ray Data API documentation as specified in #30692 . The motivation is to have a better layout and view of all API references (same as other systems - Pandas, NumPy and Spark).

The new structure has 3 pages:

1.index page (same as before, link)
2.summary page for each group of APIs (newly added, link)
3.doc page for individual APIs (newly added, link)

In addition, the side bar also has list of APIs per category:

Screen Shot 2022-12-20 at 8 49 54 PM

Related issue number

Closes #30692

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@c21 c21 added the do-not-merge Do not merge this PR! label Dec 19, 2022
@c21 c21 changed the title [WIP][Datasets] Ray Data API documentation change [Datasets] Refactor Ray Data API documentation Dec 21, 2022
Comment on lines +248 to +254
/* Remove row spacing on the left */
padding-left: 0;
}

.table thead th {
/* Remove row spacing on the left */
padding-left: 0;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required to make the summary page readable. Thanks @sijieamoy for suggesting this fix!

Before this change, the summary table looks bad as below (characters are packed together):

Screen Shot 2022-12-20 at 11 59 13 AM

@@ -3,29 +3,54 @@
GroupedDataset API
==================

.. autoclass:: ray.data.grouped_dataset.GroupedDataset
:members:
.. currentmodule:: ray.data
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This page is not only refactored, but also restructured.

Reference:
Pandas groupby - https://pandas.pydata.org/docs/reference/groupby.html
PySpark groupby - https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html

.. autoclass:: ray.data.block.BlockAccessor
:members:
.. autosummary::
:toctree: doc/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This :toctree: doc/ forces the individual API to be generated in separate page. with .../doc/<API-name>.html link.

Example:

https://ray--31204.org.readthedocs.build/en/31204/data/api/doc/ray.data.Dataset.map_batches.html

Copy link
Contributor

@jianoaix jianoaix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks nice and it's consistent with other major OSS data libraries, so LG for just data API itself.

I think we will also need to take care of the consistency with other Ray libraries. This will make data API docs different than the rest of Ray, so we may need to reach a consensus with the ray-docs group that this is the structure we want for Ray overall.

.. automethod:: ray.data.Dataset.limit
.. autosummary::
:toctree: doc/
:nosignatures:
Copy link
Contributor

@clarkzinzow clarkzinzow Dec 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like not having the function signatures really hurts understanding the one-liner API description, and the Pandas and PySpark docs do include the signatures; can you say more about why you decided to not have the signatures displayed?

I think including the signatures also makes extra sense now that we have the super-concise bare API list in the sidebar, so we can afford to and would benefit from adding a bit more detail in the page body items.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am on the other side, that feels the signature is not that useful in one-liner description, given it's also truncated in UI. But I agree this is what other systems are doing. I can add it if this is what people want. WDYT? @jianoaix

Copy link
Contributor

@clarkzinzow clarkzinzow Dec 21, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, I get that perspective as well! My opinion is that, for most of the Datasets APIs, the first argument or two are typically much more important than the remaining arguments, and even with the signature being truncated, the names of the first few arguments add a lot of valuable context.

For example:

read_parquet(paths[, filesystem, parallelism, ...])

map_batches(fn[, batch_size, batch_format, ...])

+1 on getting @jianoaix's perspective, we can do a quick 3-person majority rule here, if needed!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I personally like signatures (and type annotations in general, although may not fit here) and they do convey useful information to the readers (especially when with good namings :).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, let me add signature then.

@c21
Copy link
Contributor Author

c21 commented Dec 21, 2022

I think we will also need to take care of the consistency with other Ray libraries. This will make data API docs different than the rest of Ray, so we may need to reach a consensus with the ray-docs group that this is the structure we want for Ray overall.

Yes this is fair. WDYT? @maxpumperla, @ericl and @zhe-thoughts, thanks.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple thoughts:

  1. https://ray--31204.org.readthedocs.build/en/31204/data/api/doc/ray.data.Dataset.map_batches.html --- if you look at this, it's very hard to read. There are too many args, the type signature is way too verbose, and there are 5 different Tip blocks.

  2. After this PR, every method now has a sidebar entry. This seems a bit much: could we hide these entries from the sidebar, or not use headings?

@@ -3,54 +3,53 @@
Data Representations
====================
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these internal APIs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't these internal APIs?

They are internal. I feel we should remove them (this PR is mostly refactoring stuff). cc @matthewdeng.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of these APIs are Public or Developer APIs. cc @clarkzinzow

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The block/batch types are DeveloperAPI, the TableRow type is PublicAPI, and the tensor extension types are PublicAPI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, then let's keep this file.

@ericl ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Dec 21, 2022
@c21
Copy link
Contributor Author

c21 commented Dec 21, 2022

https://ray--31204.org.readthedocs.build/en/31204/data/api/doc/ray.data.Dataset.map_batches.html --- if you look at this, it's very hard to read. There are too many args, the type signature is way too verbose, and there are 5 different Tip blocks.

Yes, and it's the same as current page. This PR is just moving the doc of map_batches to a separate page. IMO the current page is even harder to read. We should definitely improve the doc of individual API for sure separately.

After this PR, every method now has a sidebar entry. This seems a bit much: could we hide these entries from the sidebar, or not use headings?

I don't have a strong opinion on this. WDYT @clarkzinzow? Remember you think the sidebar is very useful.

@clarkzinzow
Copy link
Contributor

@ericl @c21 The sidebar entries are very useful for fast API doc scanning, I always use them for any libraries that have classes with many methods or modules with many functions, and is a somewhat common practice in the Python data library ecosystem:

This could be considered less of a must-have now that we've reduced the index pages for e.g. the Dataset class to be a table-list of single-line method summaries and link-outs.

@richardliaw richardliaw merged commit 955e756 into ray-project:master Jan 6, 2023
@c21 c21 deleted the api-doc branch January 6, 2023 19:26
tamohannes pushed a commit to ju2ez/ray that referenced this pull request Jan 25, 2023
@jjyao jjyao mentioned this pull request Feb 6, 2023
7 tasks
jjyao added a commit that referenced this pull request Feb 14, 2023
Similar to #31204, refactor the core api reference for better layout and view.

Signed-off-by: Jiajun Yao <[email protected]>
krfricke pushed a commit that referenced this pull request Feb 14, 2023
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class.

This PR is a followup to #31204 and #32311, which made the same changes for Ray Data/Tune docs.

Signed-off-by: Justin Yu <[email protected]>
justinvyu added a commit to justinvyu/ray that referenced this pull request Feb 14, 2023
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class.

This PR is a followup to ray-project#31204 and ray-project#32311, which made the same changes for Ray Data/Tune docs.

Signed-off-by: Justin Yu <[email protected]>
justinvyu added a commit that referenced this pull request Feb 14, 2023
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class.

This PR is a followup to #31204 and #32311, which made the same changes for Ray Data/Tune docs.

Signed-off-by: Justin Yu <[email protected]>
jjyao added a commit that referenced this pull request Feb 14, 2023
Similar to #31204, refactor the core api reference for better layout and view.

Signed-off-by: Jiajun Yao <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Similar to ray-project#31204, refactor the core api reference for better layout and view.

Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class.

This PR is a followup to ray-project#31204 and ray-project#32311, which made the same changes for Ray Data/Tune docs.

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: Edward Oakes <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
Similar to ray-project#31204, refactor the core api reference for better layout and view.

Signed-off-by: Jiajun Yao <[email protected]>
Signed-off-by: elliottower <[email protected]>
elliottower pushed a commit to elliottower/ray that referenced this pull request Apr 22, 2023
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class.

This PR is a followup to ray-project#31204 and ray-project#32311, which made the same changes for Ray Data/Tune docs.

Signed-off-by: Justin Yu <[email protected]>
Signed-off-by: elliottower <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Datasets] Make Ray Data API reference page more easier to read
7 participants