Replacing pandas with polars #1926

langestefan · 2023-12-03T15:53:21Z

Is your feature request related to a problem? Please describe.
Pandas is used extensively in pvlib all over the place. However after using it for many years I think it is now outdated and needs to be replaced. It is greedy, slow, inconsistent and the API is awkward and difficult to use.

Describe the solution you'd like
I would like to propose replacing pandas with polars. Polars is a DataFrame interface built in Rust with many, many performance benefits. Too many to list them here, so I suggest visiting their page to learn more.

I know this is a very big change, but I believe it will lead to some serious performance and usability benefits. So there is no real plan yet, I would like to just start the discussion and see what others think. I think a good starting point would be to benchmark some performance-critical pvlib code in pandas and polars so we can quantify what the performance benefits for pvlib would be.

mikofski · 2023-12-03T17:51:58Z

It is greedy, slow, inconsistent and the API is awkward and difficult to use.

I ❤️ this! 🤣

a good starting point would be to benchmark some performance-critical code in pandas and polars so we can quantify what the performance benefits would be.

I 💯 agree. Maybe change issue title to “benchmark polars v pandas”? And could broaden it to include xarray and numpy structs to be unbiased?

one thing not seeing is how polars behaves with vectorization. I see SIMD but does that require or utilize MKL or specific CPU arch? Polars send great at SQL operations, but we don’t use it much in pvlib, eg sorting, selecting, etc.

also, important to note where/how pvlib api changes bc of pandas extensive use

kandersolar · 2023-12-04T13:52:54Z

@langestefan can you say more about what "replace pandas with polars" would mean in this context? Not supporting pandas inputs, or changing our docs to use polars instead of pandas, or what?

Off the top of my head, there aren't a whole lot of places that pvlib actually uses pandas directly (iotools, spectrum, a few model functions here and there). Many of the files that import pandas only do it so that functions can be sure to return pandas objects when the inputs are pandas objects. It's hard to know for sure, but I'd guess that the majority of pvlib users have pandas-based workflows, so removing support for passing pandas objects to (and getting pandas objects back from) pvlib would be an enormously breaking change. I doubt any performance comparison would convince me personally that dropping support for pandas in favor of polars would be a net positive for the pvlib community.

If polars support is valuable, what seems like a better end result to me is to try to support both polars and pandas rather than dropping one for the other. I'm not eager to proliferate the if isinstance(input, polars.Whatever): return polars.Whatever(output) code pattern that I suspect supporting polars would require, but if the value is big enough then I could see it being a good tradeoff. For this, I agree quantifying the performance impact would be helpful.

jules-ch · 2023-12-05T10:07:24Z

Like @kevinsa5

I'd say it's way to early to support polars instead of pandas. The API is still unstable, version 0.X.
I'm a big user of polars on personal projects because of performance lazyloading, async collection, integration with the Apache Arrow Ecosystem.

But I really don't see any improvement in using polars here, apart from using the new shiny library, an effort is being made to use numpy array as inputs in functions where pandas were used before see #1455.
Performance-wise, benchmarks need to be made first and since we're not processing GB of data, we probably won't see any difference in performance, especially since we can use the new interface with the Arrow ecosystem with pandas 2.0.

Even when processing 20 years of weather data for a single location it's not even close to several millions of rows.
IMO I don't think it will make a significant gain in perf but I'm eager to see some benchmarks.

Moving forward using Numpy types instead of Dataframe or Series will make integrating with Polars or any Dataframe library easy.

We can probably use polars already if that's what you want with a bunch of PVLIB API.

AdamRJensen closed this as completed May 24, 2024

MarcoGorelli mentioned this issue May 24, 2024

potential use cases narwhals-dev/narwhals#62

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Replacing pandas with polars #1926

Replacing pandas with polars #1926

langestefan commented Dec 3, 2023 •

edited

Loading

mikofski commented Dec 3, 2023 •

edited

Loading

kandersolar commented Dec 4, 2023

jules-ch commented Dec 5, 2023

Replacing pandas with polars #1926

Replacing pandas with polars #1926

Comments

langestefan commented Dec 3, 2023 • edited Loading

mikofski commented Dec 3, 2023 • edited Loading

kandersolar commented Dec 4, 2023

jules-ch commented Dec 5, 2023

langestefan commented Dec 3, 2023 •

edited

Loading

mikofski commented Dec 3, 2023 •

edited

Loading