initial attempt at a fast init #726
Conversation
Hmm, some odd plotting test failures:
    return extra_cols, time_col, melt_cols

def fast_format_data(df, index=DEFAULT_META_INDEX):
Note that this function is mostly copied/pasted from `format_data()` with a few optimisations. We could in principle break out the copied-over parts into separate methods that are called by both.
As you noted, the requirements on `df` are too loose for this to really be a zero-copy check, i.e. you found that you need `melt` and `set_index`, which are both expensive.
I'd suggest starting `fast_format_data` from the premise that `set(IAMC_IDX) | set(index)` are already in `df.index` and that `df.columns` is the time column, i.e. you have a timeseries-based view. (I'd really like this to also work with everything as a series directly, but let's do this common case first.)
Then the only operations that need to happen are:
extra_cols = list(set(df.index.names).difference(set(IAMC_IDX) | set(index)))
time_col = intuit_time_col(df.columns)
idx = IAMC_IDX + list(set(index + extra_cols) - set(IAMC_IDX))
df = df.reorder_levels(idx)  # cheap: only permutes the index levels
df = df.rename_axis(columns=time_col)
df = df.stack()  # reshape into a long-format Series
df = df.dropna()  # but I would skip this for the fast path and add it to the requirements
Everything here is fast, including the stack, which is only a reshape and an extension of the multiindex.
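For concreteness, a minimal self-contained sketch of this stack-based path (the toy data, and the `"year"` label standing in for `intuit_time_col`, are my assumptions, not part of the PR):

```python
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]

# toy timeseries view: IAMC levels already in the index, years as columns
df = pd.DataFrame(
    [[1.0, 2.0], [3.0, None]],
    index=pd.MultiIndex.from_tuples(
        [
            ("m", "s1", "World", "Primary Energy", "EJ/yr"),
            ("m", "s2", "World", "Primary Energy", "EJ/yr"),
        ],
        names=IAMC_IDX,
    ),
    columns=[2005, 2010],
)

index = ["model", "scenario"]
extra_cols = list(set(df.index.names).difference(set(IAMC_IDX) | set(index)))
idx = IAMC_IDX + list(set(index + extra_cols) - set(IAMC_IDX))

series = (
    df.reorder_levels(idx)        # cheap: permutes index levels only
    .rename_axis(columns="year")  # stand-in for intuit_time_col(df.columns)
    .stack()                      # long-format Series; year becomes an index level
    .dropna()                     # optional on the fast path
)
```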
df.drop(columns=empty_cols, inplace=True)
df.dropna(axis=0, how="all", inplace=True)
return df
`inplace=True` seldom gives you a speed-up with pandas (see https://stackoverflow.com/a/60020384/2873952). `reset_index`, `fillna`, and `clip` are exceptions, but there is no good list.
In general, if the size of the data and the data type stay the same, but the data would need to be copied when not operating inplace, then there might be a sizable speed-up. In other situations it more likely does not change anything (other than forcing your users to be careful).
`drop` along columns can be done on a view, so it doesn't matter; `dropna` might actually be faster, but you probably have to copy the data in both cases.
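A quick sanity check of this claim (a sketch; absolute numbers depend on data size and dtypes, and the `copy()` inside the inplace lambda is needed to keep the runs comparable):

```python
import numpy as np
import pandas as pd
from timeit import timeit

df = pd.DataFrame(np.random.rand(1_000_000, 5), columns=list("abcde"))

# returning a new frame
t_copy = timeit(lambda: df.dropna(axis=0, how="all"), number=10)
# mutating inplace; work on a fresh copy each run so the input stays the same
t_inplace = timeit(lambda: df.copy().dropna(axis=0, how="all", inplace=True), number=10)
print(f"copy: {t_copy:.2f}s, inplace: {t_inplace:.2f}s")
```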
Riffing off this comment by @coroa: the expensive operations in the I/O chain are converting between long and wide format. I think that the biggest performance gain would come from using a file format that supports long format... Also, Line 186 in a9bb3c3.
The most effective performance boost may be to simply add a direct processing route for a Series.
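For illustration, a round-trip that stays in long format and never touches `melt`/`pivot` (a sketch using parquet as one example of such a format; needs `pyarrow` installed):

```python
import pandas as pd

# long-format timeseries: identifier columns plus a single value column
long = pd.DataFrame(
    {
        "model": ["m", "m"],
        "scenario": ["s", "s"],
        "region": ["World", "World"],
        "variable": ["Primary Energy", "Primary Energy"],
        "unit": ["EJ/yr", "EJ/yr"],
        "year": [2005, 2010],
        "value": [1.0, 2.0],
    }
)
long.to_parquet("data.parquet")  # written as-is, no wide/long conversion

# read back and index directly into a Series - the "direct route" idea
series = (
    pd.read_parquet("data.parquet")
    .set_index(["model", "scenario", "region", "variable", "unit", "year"])["value"]
)
```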
Thanks @coroa and @danielhuppmann - I've updated the PR as you suggested and added a series of requirements. I haven't updated the profile for this yet, nor have I checked whether there are other tests where this can be added as a parameter. I suspect we should probably do more of that, so please let me know if you already know of cases where we could add this.
data["type"].append(type)
data["time"].append(time)
data["label"].append("autogenerated")
except:
E722 do not use bare 'except'
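One way to satisfy E722 is to name the exceptions the block can actually raise (a sketch; `KeyError`/`TypeError` are placeholders, the right types depend on the surrounding code):

```python
try:
    data["type"].append(type)
    data["time"].append(time)
    data["label"].append("autogenerated")
except (KeyError, TypeError):  # catch only expected failures, not a bare except
    pass  # handle or re-raise as appropriate
```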
I'm wondering a bit about the use case and where the performance boost could come from... I assume that the main use case is reading (large) files created by pyam, i.e. wide IAMC format, where we know that we have the "correct" ordering of columns. So the expensive operations are (based on my limited understanding):
So the main issue is
Refactored and compared
Thanks @gidden - sweet! In light of that, I think it would be prudent to not have an extra "fast" option - it's only a small performance boost compared to a lot of extra overhead to carry around...
I'm indifferent here - @coroa any opinions?
Description of PR
This PR attempts to make initialization of dataframes faster if possible. The goal of this PR is to start a conversation about how to support faster initialization and provide tools to that effect.
It adds a method to initialize a dataframe with minimal checking, called `fast_format_data()`, and a `fast` kwarg to the `__init__()` method. I see a ~30% speed-up on a real dataset (AR6), and show profiling for increasing data sizes (though I randomly generate the data, so we won't see the effects of faster sorting with common model/scenario/variable names etc.). In the graph, `N` denotes the number of rows in the original wide-format dataframe. The image is generated by `profile_init.py`, and the largest datapoint comes from reading in the AR6 database. Note that for this last point, I am guessing the speed-ups come from less heterogeneous data.
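A usage sketch under the assumption that the `fast` kwarg is adopted as described (the toy wide-format dataframe is my own example; note the discussion above about possibly dropping the option):

```python
import pandas as pd
import pyam

wide_df = pd.DataFrame(
    [["m", "s", "World", "Primary Energy", "EJ/yr", 1.0, 2.0]],
    columns=["model", "scenario", "region", "variable", "unit", 2005, 2010],
)
# proposed fast path: skip most of the validation done by the default init
df = pyam.IamDataFrame(wide_df, fast=True)
```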
New version showing the refactor to stack (which is now called `slow` in the figure), which goes back to a 30% speed-up for 10**7 rows, or a 35% speed-up for reading in AR6, between the new implementations. Both are significantly faster than melt (e.g., `fast` is 70% faster than `old` for AR6).