docs(blog-post): pydata performance part 2; polars and datafusion #7703

Merged: 3 commits, Dec 12, 2023

@cpcloud commented Dec 10, 2023

This PR adds a new blog post as a follow up to part 1.

The purpose of the post is to continue comparing the performance of the same
workload from part 1 against Polars and DataFusion.

TL;DR: Polars and DataFusion take about the same time to complete this
workload, somewhere between 8 and 10 minutes.

DuckDB completes this workload in about 40 to 60 seconds.

Details

Polars

Polars ran out of memory with default settings, so I passed `streaming=True`
to every `collect` call, in both the Ibis implementation and the Polars native
API implementation.

I am aware that streaming execution is in an alpha state, but I'm not aware of
any other way to complete the workload without getting a machine with more
memory.
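
For readers unfamiliar with why streaming execution can help here, the following is a toy, pure-Python sketch of the general idea: aggregate batch by batch so that only one batch plus the running totals is held in memory. It is not how Polars implements streaming; the function names and the tiny data set are made up for illustration.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def batches(rows: Iterable[tuple[str, int]], size: int) -> Iterator[list[tuple[str, int]]]:
    """Yield fixed-size batches so only one batch is held in memory at a time."""
    batch: list[tuple[str, int]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def streaming_sum(rows: Iterable[tuple[str, int]], size: int = 2) -> dict[str, int]:
    """Group-by sum that keeps only per-key running totals, not the full input."""
    totals: dict[str, int] = defaultdict(int)
    for batch in batches(rows, size):
        for key, value in batch:
            totals[key] += value
    return dict(totals)


# Hypothetical (key, value) rows standing in for a much larger input.
rows = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]
result = streaming_sum(iter(rows))
print(result)
```

The trade-off is that operations needing the whole input at once are harder to express in this style, which may be one reason streaming and in-memory engines behave so differently.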

Memory use was still pretty high (mid-30s of GB), and the workload segfaulted
when run on a machine with 64GB of RAM (a different machine from the one I
benchmarked on, which has about 94GB), but I was able to reliably complete the
workload on my larger cloud host.

Note that I compare the Ibis version with the Polars native API version to rule
out any large differences in performance caused by Ibis. Both methods have the
same performance characteristics.
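
A comparison like this can be sketched with a small timing harness. The one below is a minimal illustration, not the code used in the blog post: `time_workload` and the two stand-in workload functions are hypothetical names, and the stand-ins only do trivial arithmetic.

```python
import time
from typing import Callable


def time_workload(fn: Callable[[], object], repeats: int = 3) -> list[float]:
    """Run fn `repeats` times and return wall-clock seconds for each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings


# Hypothetical stand-ins for the Ibis and native implementations.
def run_ibis_version() -> int:
    return sum(range(100_000))


def run_native_version() -> int:
    return sum(range(100_000))


ibis_times = time_workload(run_ibis_version)
native_times = time_workload(run_native_version)
print(f"ibis:   best of {len(ibis_times)} = {min(ibis_times):.4f}s")
print(f"native: best of {len(native_times)} = {min(native_times):.4f}s")
```

Running both variants through the same harness keeps the measurement method identical, so any remaining gap is more likely to come from the implementations themselves.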

I took a look at perf top while the Polars native workload was running and I saw
a few expected things and some surprising-to-me things, especially the getenv calls:

[Screenshot: perf top output during the Polars native workload]

DataFusion

DataFusion never ran out of memory and had a memory profile similar to DuckDB:
single-digit GBs of peak memory.
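
The PR does not say how peak memory was measured. As one possible approach on Unix-like systems, the standard library's `resource` module reports the peak resident set size of the current process. The helper name `peak_rss_gib` is made up for this sketch.

```python
import resource
import sys


def peak_rss_gib() -> float:
    """Peak resident set size of this process so far, in GiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / 1024**3


data = [0] * 1_000_000  # allocate something so the peak is non-trivial
peak = peak_rss_gib()
print(f"peak RSS: {peak:.3f} GiB")
```

Calling the helper at the end of a workload gives the high-water mark for the whole run, which is the figure that decides whether a job fits on a given machine.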

However, it was still extremely slow compared to DuckDB, about 9-10 minutes to
run the whole workload.

As with Polars, I compared both the Ibis implementation and a hand-written
SQL version (built from the generated Ibis code). Both had the same performance.

I also looked at perf top while the DataFusion workload was running and saw this:

[Screenshot: perf top output during the DataFusion workload]

Next steps

I'd like to work with the community to see if I can do something to improve the
performance in either or both of Polars and DataFusion.

cc @ritchie46 @alamb

Would love y'all to take a look at what I'm doing and let me know how to get
better performance.

Note: all of the version and system information is in the preview links below, but here it is for ease of use:

[Screenshot: version and system information]

@cpcloud added this to the 7.2 milestone Dec 10, 2023
@cpcloud added the docs, performance, datafusion, duckdb, and polars labels Dec 10, 2023

@cpcloud commented Dec 10, 2023

/preview

@cpcloud commented Dec 10, 2023

The Polars backend test failures come from the change to set `streaming=True` and are related to:

  1. Checking the size of streamed Arrow record batches. That's an Ibis testing issue.
  2. Calling `polars.scan_ndjson` with a glob, which panics:
     FAILED ibis/backends/tests/test_register.py::test_read_json_glob[polars] - pyo3_runtime.PanicException: not yet implemented
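
One possible way around the glob panic, offered only as a sketch and not as the fix adopted in this PR, is to expand the glob in Python first and hand the reader concrete file paths. The helper `expand_glob` and the demo file names are hypothetical.

```python
import glob
import tempfile
from pathlib import Path


def expand_glob(pattern: str) -> list[str]:
    """Return the files matching a glob pattern in a stable, sorted order."""
    return sorted(glob.glob(pattern))


# Demo with throwaway files; real input paths would replace these.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.ndjson", "a.ndjson", "notes.txt"):
        Path(tmp, name).write_text("{}\n")
    matched = [Path(p).name for p in expand_glob(f"{tmp}/*.ndjson")]

print(matched)
```

Each returned path could then be passed to `polars.scan_ndjson` individually and the resulting lazy frames combined with `polars.concat`. That step is left out here and has not been checked against the Polars version used in this PR.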

@ibis-docs-bot commented Dec 10, 2023

@cpcloud commented Dec 10, 2023

Found a difference between the Polars implementation and the Ibis implementation, but it's unlikely to explain the 10x delta: I am calling `drop_nulls` before the aggregation in the Polars code. Fixing it now.
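
To see why this kind of difference matters for correctness as well as timing, here is a toy pure-Python sketch. It assumes rows are dropped when any column is null, which is an assumption about the query, not something stated in the PR. The data and the `sum_by_key` helper are invented for illustration.

```python
from collections import defaultdict

# Toy rows of (key, x, y); None stands in for a null value.
rows = [
    ("a", 1, 10),
    ("a", None, 20),
    ("b", None, 30),
]


def sum_by_key(rows, drop_null_rows: bool) -> dict:
    """Sum y per key, optionally discarding rows with a null in any column first."""
    totals = defaultdict(int)
    for key, x, y in rows:
        if drop_null_rows and (x is None or y is None):
            continue  # mimics dropping null rows before aggregating
        if y is not None:
            totals[key] += y  # the aggregation itself skips null y values
    return dict(totals)


with_drop = sum_by_key(rows, drop_null_rows=True)
without_drop = sum_by_key(rows, drop_null_rows=False)
print(with_drop)
print(without_drop)
```

Dropping rows up front removes group `b` entirely and shrinks the total for `a`, so the two variants are not doing the same work and should not be compared directly.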

@cpcloud commented Dec 10, 2023

/preview

@ibis-docs-bot commented Dec 10, 2023

@alamb commented Dec 11, 2023

Thank you for the ping @cpcloud -- we will check this out shortly

Tracking in apache/datafusion#8492

@lostmygithubaccount

I see some interesting behavior with Polars on a Mac: running the native Polars code with `streaming=False`, the time is relatively close to DuckDB (~72s for Polars vs ~53s for DuckDB). DuckDB fully saturates all cores, while Polars seems to use about 80-90% throughout the run (perhaps corresponding to the difference in runtime).

[Screenshot]

However, when I run with `streaming=True`, the run takes upwards of 5-7 minutes. I'm on the latest PyPI release.

@cpcloud commented Dec 11, 2023

Interesting!

I can't run the Polars version with streaming=False on Linux without running out of memory, even with ~94GB of RAM.

@cpcloud force-pushed the pydata-performance-part2 branch 3 times, most recently from 20e90b6 to 96ff79d, December 11, 2023 18:07

@cpcloud commented Dec 11, 2023

/preview

@ibis-docs-bot commented Dec 11, 2023

@cpcloud marked this pull request as ready for review December 11, 2023 23:02

@gforsyth left a comment

This is great! Three small comments, but I'd say this is ready to go.

docs/posts/pydata-performance-part2/index.qmd (resolved)
docs/posts/pydata-performance-part2/index.qmd (resolved)
docs/posts/pydata-performance-part2/index.qmd (outdated, resolved)

@cpcloud commented Dec 12, 2023

Rendering now

@cpcloud commented Dec 12, 2023

/preview

@ibis-docs-bot commented Dec 12, 2023

@cpcloud added and removed the docs-preview label Dec 12, 2023

@gforsyth left a comment

Two small things I noticed, neither blocking

docs/posts/pydata-performance-part2/index.qmd (outdated, resolved)
docs/posts/pydata-performance-part2/index.qmd (outdated, resolved)

@cpcloud added and removed the docs-preview label Dec 12, 2023
@cpcloud added and removed the docs-preview label Dec 12, 2023

@ibis-docs-bot commented Dec 12, 2023

df.head()
```

### DataFusion and Polars

This is not showing up in the rendered HTML for some reason (which is very annoying).

@cpcloud force-pushed the pydata-performance-part2 branch 2 times, most recently from 2b54b98 to bf0b7ec, December 12, 2023 22:58
@cpcloud added and removed the docs-preview label Dec 12, 2023
@ibis-docs-bot removed the docs-preview label Dec 12, 2023

@cpcloud commented Dec 12, 2023

Alright, once the preview finishes and looks good I will merge this!

Thanks all!

@ibis-docs-bot commented Dec 12, 2023

@cpcloud merged commit 9f034dc into ibis-project:master Dec 12, 2023
90 checks passed
@cpcloud deleted the pydata-performance-part2 branch December 12, 2023 23:27