Improve performance by switching from pandas to numpy #321

maurerle opened this issue Mar 12, 2024 · 0 comments

During my recent studies I found ASSUME to be very slow when simulating a whole year.
One way to improve this is switching to daily market clearing instead of hourly, but the simulation still takes a while.
When looking into the code that takes most of the time, I found pandas to often be the cause:

Act 1: Profiling Benchmarks

For various reasons, cProfile does not give good timings when running async code.
More accurate timings can be obtained with yappi (`pip install yappi`) - https://github.com/sumerc/yappi
So one can run `yappi -o "out.profile" cli.py` and then use tuna (`pip install tuna`) to visualize the profiling result:
`tuna out.profile`
This produces visual charts like the ones shown below.
The results are therefore equivalent to running `assume -s example01a -c base`. This run takes 88 s on my laptop:

- probably ~20 s are spent organizing asyncio-related work
- ~60 s are spent in pandas
- ~3 s on imports
- the rest on other things
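
For completeness, here is a minimal sketch of collecting the same kind of profile programmatically; the `main()` coroutine is just a placeholder for the actual simulation entry point:

```python
import asyncio
import yappi

async def main():
    # stand-in for the actual simulation run, i.e. whatever
    # `assume -s example01a -c base` ends up executing
    await asyncio.sleep(0.1)

yappi.set_clock_type("wall")  # wall-clock time makes awaited coroutines visible
yappi.start()
asyncio.run(main())
yappi.stop()

# save in pstats-compatible format so `tuna out.profile` can open it
yappi.get_func_stats().save("out.profile", type="pstat")
```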

`calculate_bids` essentially boils down to time spent in pandas:
[profiling chart]

Handling market feedback spends a lot of time in pandas too - nearly all the time attributed to site-packages is spent in pandas:
[profiling chart]

Writing outputs spends most of its time in pandas as well:
[profiling chart]

Admittedly, the charts are hard to read because of the long labels - I could not find a way to remove the absolute paths from the pictures.

Act 2: Alternatives

So I thought about how one could replace pandas.
Our requirements include slicing, indexing by datetime, and holding multiple series.
After experimenting with modin and dask,
I found that modin does not work as a drop-in replacement, and dask did not seem like a good fit either, as we spend a lot of time initializing dataframes rather than in the heavy lifting.

I came up with good old numpy, which supports slicing but only allows arrays of a single dtype,
so a datetime index is not possible directly.

I thought about having a convenience wrapper - something like this (`start` and `freq` being the simulation start and time resolution):

```python
from datetime import datetime, timedelta

# simulation start and time resolution; assumed to be known globally
start = datetime(2024, 1, 1)
freq = timedelta(hours=1)

def idx_from_date(date):
    # map a datetime to its integer position in the array
    return (date - start) // freq

def numpy_dt_indexer(data, fr, to):
    # slice a numpy array by datetime, analogous to df.loc[fr:to]
    return data[idx_from_date(fr):idx_from_date(to)]
```

In the end, it turns out that this kind of numpy indexing is at least 40x faster than the equivalent pandas lookup.
I really hope that this also holds once the main parts of the simulation are switched over.
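
For reference, a minimal micro-benchmark sketch of the kind of comparison meant here; the array size, slice bounds, and repetition count are illustrative, not the exact setup measured:

```python
import timeit
from datetime import datetime, timedelta

import numpy as np
import pandas as pd

start = datetime(2024, 1, 1)
freq = timedelta(hours=1)

# one year of hourly values, as both a pandas Series and a plain numpy array
index = pd.date_range(start, periods=8760, freq="h")
series = pd.Series(np.random.rand(len(index)), index=index)
array = series.to_numpy()

fr, to = datetime(2024, 6, 1), datetime(2024, 6, 2)

def pandas_slice():
    # label-based slicing on a DatetimeIndex
    return series.loc[fr:to]

def numpy_slice():
    # integer slicing after converting the dates to positions
    return array[(fr - start) // freq:(to - start) // freq]

print("pandas:", timeit.timeit(pandas_slice, number=10_000))
print("numpy: ", timeit.timeit(numpy_slice, number=10_000))
```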

Act 3: Implementation

TBD
