Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add best practices page to Dask cuDF docs #16821

Merged
merged 31 commits into from
Sep 20, 2024

Conversation

rjzamora
Copy link
Member

Description

Adds a much-needed "best practices" page to the Dask cuDF documentation.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress doc Documentation non-breaking Non-breaking change labels Sep 17, 2024
@rjzamora rjzamora self-assigned this Sep 17, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Sep 17, 2024
@rjzamora rjzamora marked this pull request as ready for review September 17, 2024 19:05
@rjzamora rjzamora requested a review from a team as a code owner September 17, 2024 19:05
@rjzamora
Copy link
Member Author

@jacobtomlinson @quasiben @pentschev - Interested in your feedback on the specific "best practices" guidelines I added here. Happy to revise.

Deployment and Configuration
----------------------------

Use Dask DataFrame Directly
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what "section" this belongs to.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just move it to the end of the Deployment and Configuration section? The rationale is this at the top may look like a suggestion, which is actually the opposite of what this subsection says. No strong opinions though.

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good to me, although I'm far from experienced with Dask cuDF best practices. I've left a few suggestions that I hope may be useful to improve quality a bit, but just as is this looks great! Thanks @rjzamora .

docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
Comment on lines 81 to 82
The ideal partition size is typically between 2-10% of the memory capacity
of a single GPU. Increasing the partition size will typically reduce the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we provide here a rule-of-thumb as to whether users should initially target more to the 2% or the 10% range, and how/when to increase/decrease that? Or is this too difficult to provide a good rule-of-thumb and the 2-10% phrasing is the best we can do? I understand it can be quite difficult to give more details for general purpose docs, so it's fine if you think the current phrasing is sufficient/best.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I attempted to turn this into the more-explicit "rule of thumb" I personally use: 1/16 or less if the workflow is memory-intensive (i.e. shuffle intensive), and 1/8 otherwise. The "best" partition size is definitely difficult to know a priori.


``blocksize``: Use this argument to specify the maximum partition size.
The default is `"256 MiB"`, but larger values are usually more performant
(e.g. `1 GiB` is usually safe). Dask will use the ``blocksize`` value to map
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe provide a guideline for when 1 GiB is safe. I imagine this is safe for large devices that we're usually used to work with, but given the recent NO-OOM effort I don't think a small laptop GPU will be capable of handling 1GiB safely.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I decided to remove the 1 GiB comment since we already discuss the 1/16-1/8 "rule of thumb" above.

Comment on lines 12 to 14
many of the details discussed in the `Dask DataFrames Best Practices
<https://docs.dask.org/en/stable/dataframe-best-practices.html>`__
documentation also apply to Dask cuDF.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any notable ones that are known NOT to apply to Dask cuDF and we should let users know here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the wording to say the guidelines that are not pandas-specific also apply to Dask cuDF.

Deployment and Configuration
----------------------------

Use Dask DataFrame Directly
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe just move it to the end of the Deployment and Configuration section? The rationale is this at the top may look like a suggestion, which is actually the opposite of what this subsection says. No strong opinions though.

Comment on lines 147 to 148
Sorting, joining and grouping operations all have the potential to
require the global shuffling of data between distinct partitions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This also applies to repartition, no? Maybe we should mention the same arguments here or point https://github.com/rapidsai/cudf/pull/16821/files#diff-1c3b287013ea5f3b56726f0b7e7538bd0242e8cadcbd89e966a82d3d36719317R93-R95 to here as well to make it more explicit where users are dealing with in those cases.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. Repartition does not really require data "shuffling". Data shuffling requires "all-to-all", while repartitioning is usually limited to data movement between neighboring partitions.

Comment on lines 157 to 158
* Use a distributed cluster with Dask-CUDA workers
* Use native cuDF spilling whenever possible
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we link to the sections above in here that deal with these two suggestions, for the benefit of the reader who may come directly to this section? Not sure if easy/possible, if not skipping is also fine.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good idea. Will need to figure out how to do that :)

@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Sep 18, 2024
@rjzamora
Copy link
Member Author

@VibhuJawa @ayushdg @randerzander - I would consider you all dask-cudf power users. Let me know if these "best practices" seem reasonable to you.

Copy link
Member

@ayushdg ayushdg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this! I found the page very helpful overall. Left a few comments and nits

docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
``False``, but ``aggregate_files=True`` is usually more performant when
the dataset contains many files that are smaller than half of ``blocksize``.

.. note::
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this note. Once we have more cloud IO specific optimizations it might make sense to add it to best practices or create a new one for cloud IO to discuss tips/tricks for those environments.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that we need a lot more remote-IO information. However, it doesn't feel like there is much to say yet :/

docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
Copy link
Member

@jacobtomlinson jacobtomlinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I added a few general thoughts.

docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
Copy link
Member

@VibhuJawa VibhuJawa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc mostly looks great to me, thanks for adding it. Have left some small notes

docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Show resolved Hide resolved
@rjzamora
Copy link
Member Author

I will plan to merge this in a few hours if there aren't any more comments.

Copy link
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, I only had a few tiny wording nits

docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
docs/dask_cudf/source/best_practices.rst Outdated Show resolved Hide resolved
Co-authored-by: Lawrence Mitchell <[email protected]>
@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Sep 20, 2024
@rjzamora
Copy link
Member Author

/merge

@rapids-bot rapids-bot bot merged commit b165210 into rapidsai:branch-24.10 Sep 20, 2024
96 checks passed
@rjzamora rjzamora deleted the dask-cudf-best-practices branch September 21, 2024 03:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge doc Documentation non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants