Improve DataFrame Users Guide #11324

Merged
merged 3 commits into from
Jul 8, 2024

@alamb alamb commented Jul 7, 2024

Which issue does this PR close?

Part of #3058

Rationale for this change

While responding to comments from @efredine on #11290, I noticed some other ways the DataFrame docs could be improved

Specifically this page: https://datafusion.apache.org/user-guide/dataframe.html

Among other things, the examples are incomplete (and they are not run in CI), and the documentation of methods is also incomplete.

What changes are included in this PR?

  1. Run the examples as part of the doctests
  2. Remove the duplicate API documentation, and instead point people at the API docs
  3. Add a link to the library user guide docs at https://datafusion.apache.org/library-user-guide/using-the-dataframe-api.html (improved in #11290, "Improve and test dataframe API examples in docs")
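
To illustrate what item 1 implies, here is a rough sketch of how a doctest harness might find the examples in a Markdown page: it scans for fenced code blocks tagged `rust` and collects their bodies so each can be compiled and run. This is not rustdoc's actual implementation; the function name `extract_rust_blocks` is hypothetical, and the sketch ignores details that rustdoc handles (untagged blocks, attributes such as `ignore` or `no_run`, nested or indented fences).

```rust
/// Collect the bodies of fenced code blocks tagged `rust` from Markdown text.
/// Simplified assumption: fences are triple backticks at the start of a line
/// (after optional whitespace), and only the exact tag `rust` counts.
fn extract_rust_blocks(markdown: &str) -> Vec<String> {
    let mut blocks = Vec::new();
    let mut body = String::new();
    let mut in_rust = false; // currently inside a block tagged `rust`
    let mut in_other = false; // currently inside a block with another tag

    for line in markdown.lines() {
        if let Some(info) = line.trim().strip_prefix("```") {
            if in_rust {
                // Closing fence of a rust block: emit the collected body.
                blocks.push(std::mem::take(&mut body));
                in_rust = false;
            } else if in_other {
                // Closing fence of a non-rust block: nothing to emit.
                in_other = false;
            } else if info == "rust" {
                in_rust = true;
            } else {
                in_other = true;
            }
        } else if in_rust {
            body.push_str(line);
            body.push('\n');
        }
    }
    blocks
}
```

Each returned string corresponds to one example that a test run would compile and execute, which is why incomplete examples in the guide now fail CI instead of going unnoticed.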

Are these changes tested?

The examples are now tested as part of CI.

I also built the docs locally and I think they look better:

(Screenshot of the locally built docs, 2024-07-07 at 5:31:36 PM)

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jul 7, 2024
@@ -626,6 +626,12 @@ doc_comment::doctest!(
user_guide_configs
);

#[cfg(doctest)]
doc_comment::doctest!(
Review comment by @alamb (PR author):

This runs the examples in the guide as doctests (under `cargo test --doc`).


## DataFrame Transformations

These methods create a new DataFrame after applying a transformation to the logical plan that the DataFrame represents.
Review comment by @alamb (PR author):

These tables are duplicates of what is in the API docs.

(Screenshot, 2024-07-07 at 5:36:09 PM)

I think it is better to send people there (and invest in keeping it up to date / with examples).

The only thing that is lost is a summary table that breaks the functions down into Transformations, Actions, and other.

If reviewers feel this content is valuable, I can move the tables to the API docs

@alamb alamb marked this pull request as ready for review July 7, 2024 21:40
@alamb alamb added the documentation Improvements or additions to documentation label Jul 7, 2024

efredine commented Jul 8, 2024

Reviewing this sparks a lot of broader thoughts for me.

First of all, I'm not sure we need the distinction between "user guide" and "library user guide" when it comes to data frames. The only way you can use a data frame is by using DataFusion as a library, so it's unclear to me why I should be reading one section rather than the other.

Second, I think you lose a lot of context by removing the table. The SessionContext and DataFrame structs both expose large API surfaces. I think they become much easier to digest once you understand that they expose a fairly small number of categories of things. However, the API documentation doesn't provide any way of seeing this structure. Ideally, there would be a way to tag the methods into different categories.

But I think the important part is simply to note that there are transformations, methods that execute the frame, and administrative methods. I might further break down the methods that execute the frame into those that return a new frame in some way and those that write to a data sink. That is, I'm not sure it's necessary to list every method in each of these categories, but it is helpful to identify the categories. That being said, I think a table, perhaps more granular, with links to the API documentation for each method and possibly even links to the SQL equivalent where appropriate, would be a good long-term goal. Is there some tooling / macros we could build to support this in a sustainable way?
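
As a minimal sketch of the tagging idea above (not existing DataFusion tooling), the categories could be kept in a small hand-maintained table and rendered into a summary. The method names are real `DataFrame` methods, but the `Category` enum, the `METHODS` table, the `summary_table` function, and the assignment of methods to categories are all hypothetical and only illustrative.

```rust
use std::collections::BTreeMap;

/// Hypothetical categories, following the breakdown suggested in the comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum Category {
    Transformation,
    ExecuteToResults,
    ExecuteToSink,
    Administrative,
}

impl Category {
    fn label(self) -> &'static str {
        match self {
            Category::Transformation => "Transformation",
            Category::ExecuteToResults => "Execute (return results)",
            Category::ExecuteToSink => "Execute (write to sink)",
            Category::Administrative => "Administrative",
        }
    }
}

/// Illustrative, hand-maintained tagging of a few DataFrame methods.
/// The categorization here is an assumption, not taken from the API docs.
const METHODS: &[(&str, Category)] = &[
    ("select", Category::Transformation),
    ("filter", Category::Transformation),
    ("aggregate", Category::Transformation),
    ("join", Category::Transformation),
    ("collect", Category::ExecuteToResults),
    ("show", Category::ExecuteToResults),
    ("write_csv", Category::ExecuteToSink),
    ("write_parquet", Category::ExecuteToSink),
    ("schema", Category::Administrative),
    ("logical_plan", Category::Administrative),
];

/// Render a Markdown summary table with one row per category.
fn summary_table(methods: &[(&str, Category)]) -> String {
    // BTreeMap keeps categories in enum declaration order (via derived Ord).
    let mut grouped: BTreeMap<Category, Vec<&str>> = BTreeMap::new();
    for &(name, category) in methods {
        grouped.entry(category).or_default().push(name);
    }

    let mut out = String::from("| Category | Methods |\n| --- | --- |\n");
    for (category, names) in &grouped {
        let cells: Vec<String> = names.iter().map(|n| format!("`{n}`")).collect();
        out.push_str(&format!("| {} | {} |\n", category.label(), cells.join(", ")));
    }
    out
}
```

Calling `summary_table(METHODS)` yields a four-row table that a docs build could include. Keeping the table as data rather than prose would make it easier to later attach API links or SQL equivalents per method, though it would still have to be kept in sync with the API by hand.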

Also, is it the case that I can only create a data frame via SessionContext? The word "typically" in the introduction suggests there are other ways of doing it. I wonder if it would be better to be more precise and just enumerate the different ways you can create a data frame. I think it's something like: read from a file, read from a table (which really covers a lot of possibilities), or execute a SQL statement.

So, I suppose, to make this actionable within the context of this PR: perhaps reduce the tables to more of a summary? But I'm also curious to hear from others.

Finally, not for this PR, I wonder if SessionContext warrants its own section. As with DataFrame, I think it would benefit from a discussion of the different categories of things it can be used for. Relatedly, it's becoming clear to me from poking around the documentation and methods that there is a great deal of flexibility in mixing and matching SQL and data frames if you want to, but I'm not sure that's coming across in the guides. When I have time I can try drafting something to see how it might fit.

@comphead comphead left a comment

lgtm, thanks @alamb

@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Jul 8, 2024
@comphead comphead merged commit 8ae56fc into apache:main Jul 8, 2024
24 checks passed
@alamb alamb deleted the alamb/df_user_guide branch July 10, 2024 11:19

alamb commented Jul 10, 2024

Thanks @efredine and @comphead

I have not forgotten about @efredine's feedback in #11324 (comment). I filed #11388 to track it.

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Improve `DataFrame` Users Guide

* typo

* Update docs/source/user-guide/dataframe.md

Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>
xinlifoobar pushed a commit to xinlifoobar/datafusion that referenced this pull request Jul 18, 2024
* Improve `DataFrame` Users Guide

* typo

* Update docs/source/user-guide/dataframe.md

Co-authored-by: Oleks V <[email protected]>

---------

Co-authored-by: Oleks V <[email protected]>
Labels
core Core DataFusion crate
3 participants