Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DOCS] Partitioning user guide and small doc fixes #2717

Merged
merged 7 commits into from
Aug 23, 2024

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Aug 23, 2024

Closes: #840

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Aug 23, 2024
Copy link

codspeed-hq bot commented Aug 23, 2024

CodSpeed Performance Report

Merging #2717 will not alter performance

Comparing jay/partitioning-guide (670e564) with main (9df6beb)

Summary

✅ 10 untouched benchmarks

Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! just had two small comments on the wording


1. **Have Enough Partitions**: our general recommendation for high throughput and maximal resource utilization is to have *at least* ``2 x TOTAL_NUM_CPUS`` partitions, which allows Daft to fully saturate your CPUs.
2. **More Partitions**: if you are observing memory issues (excessive spilling or out-of-memory (OOM) issues) then you may choose to increase the number of partitions. This increases the amount of overhead in your system, but improves overall memory stability (since each partition will be smaller).
3. **Fewer Partitions**: if you are observing a large amount of overhead (especially during shuffle operations such as joins and sorts), then you may choose to decrease the number of partitions. This decreases the amount of overhead in the system, at the cost of using more memory (since each partition will be larger).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe some description of how to measure overhead (vs maybe just an operation that is expensive)?

---------------------------

Daft will automatically use certain heuristics to determine the number of partitions for you when you create a DataFrame. When reading data from files (e.g. Parquet, CSV or JSON),
each file is by default one partition on its own, but Daft will also perform splitting of partitions (for files that are egregiously large) and coalescing of partitions (for small files)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "by default" is a little misleading here as Daft by default does scan task splitting and merging.

@jaychia jaychia enabled auto-merge (squash) August 23, 2024 23:22
@jaychia jaychia merged commit 3647b26 into main Aug 23, 2024
46 checks passed
@jaychia jaychia deleted the jay/partitioning-guide branch August 23, 2024 23:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Write a guide on partitioning
2 participants