Add best practices page to Dask cuDF docs #16821
Conversation
@jacobtomlinson @quasiben @pentschev - Interested in your feedback on the specific "best practices" guidelines I added here. Happy to revise.
Deployment and Configuration
----------------------------

Use Dask DataFrame Directly
Not sure what "section" this belongs to.
Maybe just move it to the end of the Deployment and Configuration section? The rationale is that, sitting at the top, it may read like a recommendation, when the subsection actually says the opposite. No strong opinions though.
Looks very good to me, although I'm far from experienced with Dask cuDF best practices. I've left a few suggestions that I hope may be useful for improving the quality a bit, but as-is this looks great! Thanks @rjzamora.
The ideal partition size is typically between 2-10% of the memory capacity
of a single GPU. Increasing the partition size will typically reduce the
Should we provide a rule of thumb here as to whether users should initially target the 2% or the 10% end of that range, and how/when to increase or decrease it? Or is this too difficult to capture in a good rule of thumb, and the 2-10% phrasing is the best we can do? I understand it can be quite difficult to give more details for general-purpose docs, so it's fine if you think the current phrasing is sufficient/best.
Okay, I attempted to turn this into the more-explicit "rule of thumb" I personally use: 1/16 or less if the workflow is memory-intensive (i.e. shuffle intensive), and 1/8 otherwise. The "best" partition size is definitely difficult to know a priori.
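To make the arithmetic behind that rule of thumb concrete, here is a small illustrative sketch (the helper function and the 40 GiB capacity are made up for this example):

```python
# Illustrative helper: translate the "1/16 if shuffle-intensive, 1/8 otherwise"
# rule of thumb into a rough partition-size target for a given GPU.
def target_partition_size(gpu_memory_gib: float, shuffle_intensive: bool) -> str:
    fraction = 1 / 16 if shuffle_intensive else 1 / 8
    return f"{gpu_memory_gib * fraction:g} GiB"

# Example: a hypothetical 40 GiB GPU
print(target_partition_size(40, shuffle_intensive=True))   # 2.5 GiB
print(target_partition_size(40, shuffle_intensive=False))  # 5 GiB
```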
``blocksize``: Use this argument to specify the maximum partition size.
The default is `"256 MiB"`, but larger values are usually more performant
(e.g. `1 GiB` is usually safe). Dask will use the ``blocksize`` value to map
Maybe provide a guideline for when 1 GiB is safe. I imagine this is safe for the large devices we usually work with, but given the recent NO-OOM effort I don't think a small laptop GPU will be capable of handling 1 GiB safely.
I decided to remove the 1 GiB comment since we already discuss the 1/16-1/8 "rule of thumb" above.
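For readers of this thread, a minimal sketch of how ``blocksize`` is passed in practice (the path is a placeholder, and the 512 MiB value is only illustrative; in the doc the choice is driven by the rule of thumb above rather than a fixed 1 GiB suggestion):

```python
import dask_cudf

# Placeholder dataset path; blocksize caps the partition size each read
# task targets. The default is "256 MiB"; larger values usually mean
# fewer, larger partitions and less task overhead.
df = dask_cudf.read_parquet(
    "/datasets/example.parquet",  # hypothetical path
    blocksize="512 MiB",          # illustrative value only
)
```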
many of the details discussed in the `Dask DataFrames Best Practices
<https://docs.dask.org/en/stable/dataframe-best-practices.html>`__
documentation also apply to Dask cuDF.
Are there any notable ones that are known NOT to apply to Dask cuDF that we should let users know about here?
I updated the wording to say that the guidelines that are not pandas-specific also apply to Dask cuDF.
Sorting, joining and grouping operations all have the potential to
require the global shuffling of data between distinct partitions.
This also applies to repartition, no? Maybe we should mention the same arguments here, or point https://github.com/rapidsai/cudf/pull/16821/files#diff-1c3b287013ea5f3b56726f0b7e7538bd0242e8cadcbd89e966a82d3d36719317R93-R95 to here as well, to make it more explicit what users are dealing with in those cases.
Hmmm. Repartition does not really require data "shuffling". Data shuffling requires "all-to-all", while repartitioning is usually limited to data movement between neighboring partitions.
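A quick sketch of the distinction being made here (column names and sizes are made up; this assumes cuDF and Dask cuDF are available on a GPU):

```python
import cudf
import dask_cudf

# Toy data; column names and sizes are illustrative only.
gdf = cudf.DataFrame({"key": [3, 1, 2, 4] * 250, "value": list(range(1000))})
df = dask_cudf.from_cudf(gdf, npartitions=8)

# Repartitioning mostly combines or splits neighboring partitions;
# it does not require an all-to-all shuffle.
df_coarse = df.repartition(npartitions=2)

# Sorting (like joining or grouping on many keys) may require a global,
# all-to-all shuffle of data between partitions.
df_sorted = df.sort_values("key")
```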
* Use a distributed cluster with Dask-CUDA workers
* Use native cuDF spilling whenever possible
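For readers landing directly on this excerpt, a minimal sketch of what the two bullets look like together, assuming dask-cuda is installed (``enable_cudf_spill`` is the Dask-CUDA option for native cuDF spilling in recent releases; check your version's docs):

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

# One Dask-CUDA worker per visible GPU, with native cuDF spilling
# (device -> host) enabled on each worker. The option name may differ
# between Dask-CUDA versions.
cluster = LocalCUDACluster(enable_cudf_spill=True)
client = Client(cluster)
```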
Should we link here to the sections above that deal with these two suggestions, for the benefit of readers who may come directly to this section? Not sure if that's easy/possible; if not, skipping is also fine.
Yeah, good idea. Will need to figure out how to do that :)
@VibhuJawa @ayushdg @randerzander - I would consider you all dask-cudf power users. Let me know if these "best practices" seem reasonable to you.
Thanks a lot for this! I found the page very helpful overall. Left a few comments and nits
``False``, but ``aggregate_files=True`` is usually more performant when
the dataset contains many files that are smaller than half of ``blocksize``.

.. note::
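As a sketch of the suggestion in this excerpt (the directory path is a placeholder): when a dataset consists of many files smaller than about half of ``blocksize``, file aggregation lets several files map to the same output partition:

```python
import dask_cudf

# Placeholder path to a dataset made up of many small Parquet files.
# With aggregate_files=True, files smaller than roughly half of
# blocksize can be combined into a single output partition.
df = dask_cudf.read_parquet(
    "/datasets/many-small-files/",  # hypothetical path
    blocksize="256 MiB",
    aggregate_files=True,
)
```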
I like this note. Once we have more cloud-IO-specific optimizations, it might make sense to add them to best practices or create a new page for cloud IO to discuss tips and tricks for those environments.
Agree that we need a lot more remote-IO information. However, it doesn't feel like there is much to say yet :/
This looks great. I added a few general thoughts.
Doc mostly looks great to me, thanks for adding it. Have left some small notes
I will plan to merge this in a few hours if there aren't any more comments.
Looks great, I only had a few tiny wording nits
Co-authored-by: Lawrence Mitchell <[email protected]>
/merge
Description
Adds a much-needed "best practices" page to the Dask cuDF documentation.
Checklist