Improve performance by switching from pandas to numpy #321

maurerle opened this issue Mar 12, 2024 · 0 comments

During my recent studies I found ASSUME to be very slow when simulating a whole year.
One way to improve this is switching to daily market clearing instead of hourly, but the simulation still takes a while.
When looking into the code that takes most of the time, I found pandas to often be the cause:

Act 1: Profiling Benchmarks

For various reasons, cProfile does not give good timings when running async code.
More accurate timings can be obtained with yappi (`pip install yappi`) - https://github.com/sumerc/yappi
So one can run `yappi -o "out.profile" cli.py` and then use tuna (`pip install tuna`) to visualize the profiling result:
`tuna out.profile`
This produces visual charts like the ones shown below.
The results are therefore equivalent to running `assume -s example01a -c base`. This run takes 88 s on my laptop:

- probably ~20 s are spent organizing asyncio-related work
- ~60 s are spent in pandas
- ~3 s on imports
- the rest on other things
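
For completeness, here is a minimal sketch of collecting the same kind of profile programmatically; the `main()` coroutine is just a placeholder for the actual simulation entry point:

```python
import asyncio
import yappi

async def main():
    # stand-in for the actual simulation run, i.e. whatever
    # `assume -s example01a -c base` ends up executing
    await asyncio.sleep(0.1)

yappi.set_clock_type("wall")  # wall-clock time makes awaited coroutines visible
yappi.start()
asyncio.run(main())
yappi.stop()

# save in pstats-compatible format so `tuna out.profile` can open it
yappi.get_func_stats().save("out.profile", type="pstat")
```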

`calculate_bids` essentially boils down to time spent in pandas:
[profiling chart]

Handling market feedback spends a lot of time in pandas too - nearly all the time attributed to site-packages is spent in pandas:
[profiling chart]

Writing outputs spends most of its time in pandas as well:
[profiling chart]

Admittedly, the charts are hard to read because of the long labels - I could not find a way to remove the absolute paths from the pictures.

Act 2: Alternatives

So I thought about how one could replace pandas.
Our requirements include slicing, indexing by datetime, and holding multiple series.
After experimenting with modin and dask,
I found that modin does not work as a drop-in replacement, and dask did not seem like a good fit either, as we spend a lot of time initializing dataframes rather than in the heavy lifting.

I came up with good old numpy, which supports slicing but only allows arrays of a single dtype,
so a datetime index is not possible directly.

I thought about having a convenience wrapper - something like this (`start` and `freq` being the simulation start and time resolution):

```python
from datetime import datetime, timedelta

# simulation start and time resolution; assumed to be known globally
start = datetime(2024, 1, 1)
freq = timedelta(hours=1)

def idx_from_date(date):
    # map a datetime to its integer position in the array
    return (date - start) // freq

def numpy_dt_indexer(data, fr, to):
    # slice a numpy array by datetime, analogous to df.loc[fr:to]
    return data[idx_from_date(fr):idx_from_date(to)]
```

In the end, it turns out that this kind of numpy indexing is at least 40x faster than the equivalent pandas lookup.
I really hope that this also holds once the main parts of the simulation are switched over.
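
For reference, a minimal micro-benchmark sketch of the kind of comparison meant here; the array size, slice bounds, and repetition count are illustrative, not the exact setup measured:

```python
import timeit
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

start = datetime(2024, 1, 1)
freq = timedelta(hours=1)

# one year of hourly values, as both a pandas Series and a plain numpy array
index = pd.date_range(start, periods=8760, freq="h")
series = pd.Series(np.random.rand(len(index)), index=index)
array = series.to_numpy()

fr, to = datetime(2024, 6, 1), datetime(2024, 6, 2)

def pandas_slice():
    # label-based slicing on a DatetimeIndex
    return series.loc[fr:to]

def numpy_slice():
    # integer slicing after converting the dates to positions
    return array[(fr - start) // freq:(to - start) // freq]

print("pandas:", timeit.timeit(pandas_slice, number=10_000))
print("numpy: ", timeit.timeit(numpy_slice, number=10_000))
```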

Act 3: Implementation

TBD
