
initial attempt at a fast init #726

Closed
wants to merge 12 commits
Conversation


@gidden gidden commented Feb 17, 2023

Please confirm that this PR has done the following:

  • Tests Added
  • Documentation Added
  • Name of contributors Added to AUTHORS.rst
  • Description in RELEASE_NOTES.md Added

Description of PR

This PR attempts to make initialization of dataframes faster if possible. The goal of this PR is to start a conversation about how to support faster initialization and provide tools to that effect.

It adds a method to initialize a dataframe with minimal checking, called fast_format_data(), and a fast kwarg to the __init__() method. I see a ~30% speed-up on a real dataset (AR6), and show profiling over increasing data sizes (though I randomly generate the data, so we won't see the effects of faster sorting with common model/scenario/variable names etc.). In the graph, N denotes the number of rows in the original wide-format dataframe. The image is generated by profile_init.py, and the largest datapoint comes from reading in the AR6 database.
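For reference, a minimal sketch of how such random wide-format profiling input could be generated (this is not the actual profile_init.py; the helper name and column layout are illustrative):

```python
# Hypothetical sketch of generating random wide-format IAMC-style input for
# profiling; profile_init.py's actual setup may differ.
import numpy as np
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]


def random_wide_df(n_rows, years=range(2005, 2101, 5), seed=0):
    """Build n_rows of random wide-format timeseries data (N in the graph)."""
    rng = np.random.default_rng(seed)
    meta = pd.DataFrame(
        {
            "model": [f"model_{i}" for i in rng.integers(0, 10, n_rows)],
            "scenario": [f"scen_{i}" for i in rng.integers(0, 10, n_rows)],
            "region": "World",
            "variable": [f"var_{i}" for i in range(n_rows)],
            "unit": "EJ/yr",
        }
    )
    values = pd.DataFrame(
        rng.random((n_rows, len(list(years)))), columns=list(years)
    )
    return pd.concat([meta, values], axis=1)


df = random_wide_df(1000)  # 5 metadata columns plus one column per year
```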

[figure: profile_init — initialization time vs. number of rows N]

Note that the last point is from reading in AR6 data, where I am guessing the speed-ups come from less heterogeneous data.

New version showing the refactor to use stack (now labeled slow in the figure), which gets back to a ~30% speedup for 10**7 rows, or a ~35% speedup for reading in AR6, between the new implementations. Both are significantly faster than melt (e.g., fast is ~70% faster than old for AR6).

[figure: profile_init — updated comparison including the stack refactor]

@gidden

gidden commented Feb 17, 2023

Hmm, some odd plotting test failures:

  • Windows: py3.9 (py3.10 looks auth related)
  • Mac: py3.9
  • Ubuntu: py3.9

return extra_cols, time_col, melt_cols


def fast_format_data(df, index=DEFAULT_META_INDEX):
@gidden (Member Author):

Note that this function is mostly copied/pasted from format_data() with a few optimisations. We could in principle break out the copied over parts to separate methods that are called by both.

@coroa (Collaborator) left a comment:

As you noted, the requirements on df are too loose for this to really be a zero-copy check. I.e., you found that you need melt and set_index, which are both expensive.

I'd suggest starting fast_format_data from the premise that set(IAMC_IDX) | set(index) are already in df.index, and that df.columns is the time column, i.e. you have a timeseries-based view. (I'd really like for this to also work with everything as a Series directly, but let's do this common case first):

Then the only operations that need to happen are:

extra_cols = list(set(df.index.names).difference((set(IAMC_IDX) | set(index))))
time_col = intuit_time_col(df.columns)
idx = IAMC_IDX + list(set(index + extra_cols) - set(IAMC_IDX))

df = df.reorder_levels(idx)
df = df.rename_axis(columns=time_col)
df = df.stack()

df = df.dropna()  # but I would skip this for the fast path and add it to the requirements

Everything here is fast, including the stack, which is only a reshape and an extension of the multiindex.
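A runnable toy version of this pipeline (assuming, as above, that the IAMC columns are already index levels and the columns are years; the data here is illustrative):

```python
# Toy demonstration of the stack-based path: the IAMC columns are already
# index levels and df.columns holds the time axis, so no melt is needed.
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]

idx = pd.MultiIndex.from_tuples(
    [
        ("m", "s1", "World", "Primary Energy", "EJ/yr"),
        ("m", "s2", "World", "Primary Energy", "EJ/yr"),
    ],
    names=IAMC_IDX,
)
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=idx, columns=[2005, 2010])

# rename_axis names the column axis; stack turns it into a new index level --
# both are cheap reshapes compared to melt + set_index
long = wide.rename_axis(columns="year").stack()
```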

Comment on lines +172 to +174
df.drop(columns=empty_cols, inplace=True)
df.dropna(axis=0, how="all", inplace=True)
return df
@coroa (Collaborator) commented Feb 17, 2023:

inplace=True seldom gives you a speed-up with pandas.

https://stackoverflow.com/a/60020384/2873952 .

reset_index, fillna and clip are exceptions, but there is no good list.
In general, if the size and dtype of the data stay the same but the data would have to be copied when not operating in place, then there might be a sizable speed-up. In other situations it most likely does not change anything. (Other than that you are forcing your users to be careful.)

drop along columns can be done on a view, so it doesn't matter. dropna might actually be faster, but you probably have to copy the data in both cases.
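A small illustration of the same two operations without inplace=True (toy data, not pyam code):

```python
# Dropping all-NaN columns and rows via method chaining instead of
# inplace=True; the returned objects are new, and performance is
# typically the same.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})
empty_cols = [c for c in df.columns if df[c].isna().all()]

cleaned = df.drop(columns=empty_cols).dropna(axis=0, how="all")
```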

@danielhuppmann

Riffing off this comment by @coroa

A Series is pyam's backend format. I'd really like to see Series supported by the fast case.

The expensive operations in the i/o chain are converting between long and wide format. I think that the biggest performance gain would be using a file format that supports long format...

Also, format_data() always turns a Series into a DataFrame, see

if isinstance(df, pd.Series):

The most effective performance boost may simply be to add a direct processing route for a Series.
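A hedged sketch of what such a direct Series route could look like (the function name, the "year" level name and the requirements are illustrative, not pyam's actual internals):

```python
# Hypothetical direct processing route for a long-format Series whose
# MultiIndex already carries the IAMC levels plus a time level.
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]


def format_series(s, index=("model", "scenario"), time_col="year"):
    """Validate and normalize a long-format Series with minimal copying."""
    required = set(IAMC_IDX) | set(index) | {time_col}
    missing = required - set(s.index.names)
    if missing:
        raise ValueError(f"missing index levels: {missing}")
    # only cheap operations: reorder the levels and sort when necessary
    order = IAMC_IDX + sorted(set(s.index.names) - set(IAMC_IDX))
    s = s.reorder_levels(order)
    if not s.index.is_monotonic_increasing:
        s = s.sort_index()
    return s
```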

@gidden

gidden commented Feb 18, 2023

Thanks @coroa and @danielhuppmann - I've updated the PR as you suggested: there is now a set of requirements, and fast_format_data assumes either a Series or a multi-index wide-format dataframe.

I haven't updated the profile for this, nor have I checked if there are other tests where this can be added as a parameter. I suspect we should probably do more of that, so please let me know if you already know of cases where we could add this.

data["type"].append(type)
data["time"].append(time)
data["label"].append("autogenerated")
except:

E722 do not use bare 'except'

@danielhuppmann

I'm wondering a bit about the use case and where the performance boost could come from...

I assume that the main use case is reading (large) files created by pyam, i.e. wide IAMC format, where we know that we have the "correct" ordering of columns.

So the expensive operations are (based on my limited understanding):

  • set the index: not really a way around that...
  • melt from wide to long
  • sort the index: given that Index.is_monotonic_increasing is fast, I doubt that sorting would be a performance drag if the input data is already sorted

So the main issue is melt, where @coroa suggested to use stack instead. But... Why not simply refactor format_data() to use stack instead of implementing a parallel method...?
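To see why the two routes are interchangeable, a self-contained comparison of melt and stack on toy wide-format data (this is not pyam's actual format_data; column names follow the IAMC convention used above):

```python
# stack produces the same long-format series as melt + set_index, but it
# only reshapes values and extends the MultiIndex instead of copying the
# id columns for every year.
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]
wide = pd.DataFrame(
    {
        "model": ["m"],
        "scenario": ["s"],
        "region": ["World"],
        "variable": ["Primary Energy"],
        "unit": ["EJ/yr"],
        2005: [1.0],
        2010: [2.0],
    }
)

via_melt = (
    wide.melt(id_vars=IAMC_IDX, var_name="year", value_name="value")
    .astype({"year": int})
    .set_index(IAMC_IDX + ["year"])["value"]
    .sort_index()
)

ts = wide.set_index(IAMC_IDX)
ts.columns = ts.columns.astype(int)  # mixed str/int headers are object dtype
via_stack = ts.rename_axis(columns="year").stack().rename("value").sort_index()

assert via_melt.equals(via_stack)
```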

@gidden

gidden commented Feb 20, 2023

Refactored and compared

@danielhuppmann

danielhuppmann commented Feb 20, 2023

Thanks @gidden - sweet that stack() really gives such a performance boost!

In light of that, I think it would be prudent to not have an extra "fast" option - it's only a small performance boost compared to a lot of extra overhead to carry around...

@gidden

gidden commented Feb 20, 2023

I'm indifferent here - @coroa any opinions?

@gidden

gidden commented Feb 22, 2023

closing in favor of #729 and #727

@gidden gidden closed this Feb 22, 2023
@danielhuppmann danielhuppmann mentioned this pull request Feb 27, 2023