-
Notifications
You must be signed in to change notification settings - Fork 5.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Datasets] Refactor Ray Data API documentation #31204
Conversation
/* Remove row spacing on the left */ | ||
padding-left: 0; | ||
} | ||
|
||
.table thead th { | ||
/* Remove row spacing on the left */ | ||
padding-left: 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required to make the summary page readable. Thanks @sijieamoy for suggesting this fix!
Before this change, the summary table looks bad as below (characters are packed together):
@@ -3,29 +3,54 @@ | |||
GroupedDataset API | |||
================== | |||
|
|||
.. autoclass:: ray.data.grouped_dataset.GroupedDataset | |||
:members: | |||
.. currentmodule:: ray.data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This page is not only refactored, but also restructured.
Reference:
Pandas groupby - https://pandas.pydata.org/docs/reference/groupby.html
PySpark groupby - https://spark.apache.org/docs/latest/api/python/reference/pyspark.pandas/groupby.html
.. autoclass:: ray.data.block.BlockAccessor | ||
:members: | ||
.. autosummary:: | ||
:toctree: doc/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This :toctree: doc/
forces the individual API to be generated in separate page. with .../doc/<API-name>.html
link.
Example:
https://ray--31204.org.readthedocs.build/en/31204/data/api/doc/ray.data.Dataset.map_batches.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks nice and it's consistent with other major OSS data libraries, so LG for just data API itself.
I think we will also need to take care of the consistency with other Ray libraries. This will make data API docs different than the rest of Ray, so we may need to reach a consensus with the ray-docs group that this is the structure we want for Ray overall.
doc/source/data/api/dataset.rst
Outdated
.. automethod:: ray.data.Dataset.limit | ||
.. autosummary:: | ||
:toctree: doc/ | ||
:nosignatures: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like not having the function signatures really hurts understanding the one-liner API description, and the Pandas and PySpark docs do include the signatures; can you say more about why you decided to not have the signatures displayed?
I think including the signatures also makes extra sense now that we have the super-concise bare API list in the sidebar, so we can afford to and would benefit from adding a bit more detail in the page body items.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess I am on the other side, that feels the signature is not that useful in one-liner description, given it's also truncated in UI. But I agree this is what other systems are doing. I can add it if this is what people want. WDYT? @jianoaix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, I get that perspective as well! My opinion is that, for most of the Datasets APIs, the first argument or two are typically much more important than the remaining arguments, and even with the signature being truncated, the names of the first few arguments add a lot of valuable context.
For example:
read_parquet(paths[, filesystem, parallelism, ...])
map_batches(fn[, batch_size, batch_format, ...])
+1 on getting @jianoaix's perspective, we can do a quick 3-person majority rule here, if needed!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I personally like signatures (and type annotations in general, although may not fit here) and they do convey useful information to the readers (especially when with good namings :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, let me add signature then.
Yes this is fair. WDYT? @maxpumperla, @ericl and @zhe-thoughts, thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couple thoughts:
-
https://ray--31204.org.readthedocs.build/en/31204/data/api/doc/ray.data.Dataset.map_batches.html --- if you look at this, it's very hard to read. There are too many args, the type signature is way too verbose, and there are 5 different Tip blocks.
-
After this PR, every method now has a sidebar entry. This seems a bit much: could we hide these entries from the sidebar, or not use headings?
@@ -3,54 +3,53 @@ | |||
Data Representations | |||
==================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these internal APIs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aren't these internal APIs?
They are internal. I feel we should remove them (this PR is mostly refactoring stuff). cc @matthewdeng.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of these APIs are Public or Developer APIs. cc @clarkzinzow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The block/batch types are DeveloperAPI
, the TableRow
type is PublicAPI
, and the tensor extension types are PublicAPI
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, then let's keep this file.
Yes, and it's the same as current page. This PR is just moving the doc of
I don't have a strong opinion on this. WDYT @clarkzinzow? Remember you think the sidebar is very useful. |
@ericl @c21 The sidebar entries are very useful for fast API doc scanning, I always use them for any libraries that have classes with many methods or modules with many functions, and is a somewhat common practice in the Python data library ecosystem:
This could be considered less of a must-have now that we've reduced the index pages for e.g. the |
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Co-authored-by: Sijie <[email protected]> Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: Cheng Su <[email protected]>
Signed-off-by: tmynn <[email protected]>
Similar to #31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <[email protected]>
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class. This PR is a followup to #31204 and #32311, which made the same changes for Ray Data/Tune docs. Signed-off-by: Justin Yu <[email protected]>
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class. This PR is a followup to ray-project#31204 and ray-project#32311, which made the same changes for Ray Data/Tune docs. Signed-off-by: Justin Yu <[email protected]>
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class. This PR is a followup to #31204 and #32311, which made the same changes for Ray Data/Tune docs. Signed-off-by: Justin Yu <[email protected]>
Similar to #31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <[email protected]>
Similar to ray-project#31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: Edward Oakes <[email protected]>
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class. This PR is a followup to ray-project#31204 and ray-project#32311, which made the same changes for Ray Data/Tune docs. Signed-off-by: Justin Yu <[email protected]> Signed-off-by: Edward Oakes <[email protected]>
Similar to ray-project#31204, refactor the core api reference for better layout and view. Signed-off-by: Jiajun Yao <[email protected]> Signed-off-by: elliottower <[email protected]>
This PR splits up long API refs in AIR and Train into individual pages, one dedicated to each method/class. This PR is a followup to ray-project#31204 and ray-project#32311, which made the same changes for Ray Data/Tune docs. Signed-off-by: Justin Yu <[email protected]> Signed-off-by: elliottower <[email protected]>
Signed-off-by: Cheng Su [email protected]
Why are these changes needed?
Please review the new documentation in https://ray--31204.org.readthedocs.build/en/31204/data/api/api.html .
This PR is to refactor Ray Data API documentation as specified in #30692 . The motivation is to have a better layout and view of all API references (same as other systems - Pandas, NumPy and Spark).
The new structure has 3 pages:
1.index page (same as before, link)
2.summary page for each group of APIs (newly added, link)
3.doc page for individual APIs (newly added, link)
In addition, the side bar also has list of APIs per category:
Related issue number
Closes #30692
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.