
initial attempt at a fast init #726

Closed
wants to merge 12 commits
Conversation


@gidden gidden commented Feb 17, 2023

Please confirm that this PR has done the following:

  • Tests Added
  • Documentation Added
  • Name of contributors Added to AUTHORS.rst
  • Description in RELEASE_NOTES.md Added

Description of PR

This PR attempts to make initialization of dataframes faster if possible. The goal of this PR is to start a conversation about how to support faster initialization and provide tools to that effect.

It adds a method to initialize a dataframe with minimal checking, called fast_format_data(), and a fast kwarg to the __init__() method. I see a ~30% speed-up on a real dataset (AR6), and show profiling over increasing data sizes (though I randomly generate the data, so we won't see the effects of faster sorting with common model/scenario/variable names etc.). In the graph, N denotes the number of rows in the original wide-format dataframe. The image is generated by profile_init.py, and the largest datapoint comes from reading in the AR6 database.
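For reference, a minimal sketch of how such random wide-format profiling input could be generated (this is not the actual profile_init.py; the helper name and column layout are illustrative):

```python
# Hypothetical sketch of generating random wide-format IAMC-style input for
# profiling; profile_init.py's actual setup may differ.
import numpy as np
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]


def random_wide_df(n_rows, years=range(2005, 2101, 5), seed=0):
    """Build n_rows of random wide-format timeseries data (N in the graph)."""
    rng = np.random.default_rng(seed)
    meta = pd.DataFrame(
        {
            "model": [f"model_{i}" for i in rng.integers(0, 10, n_rows)],
            "scenario": [f"scen_{i}" for i in rng.integers(0, 10, n_rows)],
            "region": "World",
            "variable": [f"var_{i}" for i in range(n_rows)],
            "unit": "EJ/yr",
        }
    )
    values = pd.DataFrame(
        rng.random((n_rows, len(list(years)))), columns=list(years)
    )
    return pd.concat([meta, values], axis=1)


df = random_wide_df(1000)  # 5 metadata columns plus one column per year
```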

[figure: profile_init — initialization time vs. number of rows N]

Note that the last point is from reading in AR6 data, where I am guessing the speed-ups come from less heterogeneous data.

New version showing the refactor to use stack (now labeled slow in the figure), which gets back to a ~30% speedup for 10**7 rows, or a ~35% speedup for reading in AR6, between the new implementations. Both are significantly faster than melt (e.g., fast is ~70% faster than old for AR6).

[figure: profile_init — updated comparison including the stack refactor]

@gidden

gidden commented Feb 17, 2023

Hmm, some odd plotting test failures:

  • Windows: py3.9 (py3.10 looks auth related)
  • Mac: py3.9
  • Ubuntu: py3.9

return extra_cols, time_col, melt_cols


def fast_format_data(df, index=DEFAULT_META_INDEX):
@gidden (Member Author):

Note that this function is mostly copied/pasted from format_data() with a few optimisations. We could in principle break out the copied over parts to separate methods that are called by both.

@coroa (Collaborator) left a comment:

As you noted, the requirements on df are too loose for this to really be a zero-copy check. I.e., you found that you need melt and set_index, which are both expensive.

I'd suggest starting fast_format_data from the premise that set(IAMC_IDX) | set(index) are already in df.index, and that df.columns is the time column, i.e. you have a timeseries-based view. (I'd really like for this to also work with everything as a Series directly, but let's do this common case first):

Then the only operations that need to happen are:

extra_cols = list(set(df.index.names).difference((set(IAMC_IDX) | set(index))))
time_col = intuit_time_col(df.columns)
idx = IAMC_IDX + list(set(index + extra_cols) - set(IAMC_IDX))

df = df.reorder_levels(idx)
df = df.rename_axis(columns=time_col)
df = df.stack()

df = df.dropna()  # but I would skip this for the fast path and add it to the requirements

Everything here is fast, including the stack, which is only a reshape and an extension of the multiindex.
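A runnable toy version of this pipeline (assuming, as above, that the IAMC columns are already index levels and the columns are years; the data here is illustrative):

```python
# Toy demonstration of the stack-based path: the IAMC columns are already
# index levels and df.columns holds the time axis, so no melt is needed.
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]

idx = pd.MultiIndex.from_tuples(
    [
        ("m", "s1", "World", "Primary Energy", "EJ/yr"),
        ("m", "s2", "World", "Primary Energy", "EJ/yr"),
    ],
    names=IAMC_IDX,
)
wide = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], index=idx, columns=[2005, 2010])

# rename_axis names the column axis; stack turns it into a new index level --
# both are cheap reshapes compared to melt + set_index
long = wide.rename_axis(columns="year").stack()
```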

Comment on lines +172 to +174
df.drop(columns=empty_cols, inplace=True)
df.dropna(axis=0, how="all", inplace=True)
return df
@coroa (Collaborator) commented Feb 17, 2023:

inplace=True seldom gives you a speed-up with pandas.

https://stackoverflow.com/a/60020384/2873952 .

reset_index, fillna and clip are exceptions, but there is no good list.
In general, if the size and dtype of the data stay the same but the data would have to be copied when not operating in place, then there might be a sizable speed-up. In other situations it most likely does not change anything. (Other than that you are forcing your users to be careful.)

drop along columns can be done on a view, so it doesn't matter. dropna might actually be faster, but you probably have to copy the data in both cases.
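A small illustration of the same two operations without inplace=True (toy data, not pyam code):

```python
# Dropping all-NaN columns and rows via method chaining instead of
# inplace=True; the returned objects are new, and performance is
# typically the same.
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]})
empty_cols = [c for c in df.columns if df[c].isna().all()]

cleaned = df.drop(columns=empty_cols).dropna(axis=0, how="all")
```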

@danielhuppmann

Riffing off this comment by @coroa

A Series is pyam's backend format. I'd really like to see Series supported by the fast case.

The expensive operations in the i/o chain are converting between long and wide format. I think that the biggest performance gain would be using a file format that supports long format...

Also, format_data() always turns a Series into a DataFrame, see

if isinstance(df, pd.Series):

The most effective performance boost may simply be to add a direct processing route for a Series.
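A hedged sketch of what such a direct Series route could look like (the function name, the "year" level name and the requirements are illustrative, not pyam's actual internals):

```python
# Hypothetical direct processing route for a long-format Series whose
# MultiIndex already carries the IAMC levels plus a time level.
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]


def format_series(s, index=("model", "scenario"), time_col="year"):
    """Validate and normalize a long-format Series with minimal copying."""
    required = set(IAMC_IDX) | set(index) | {time_col}
    missing = required - set(s.index.names)
    if missing:
        raise ValueError(f"missing index levels: {missing}")
    # only cheap operations: reorder the levels and sort when necessary
    order = IAMC_IDX + sorted(set(s.index.names) - set(IAMC_IDX))
    s = s.reorder_levels(order)
    if not s.index.is_monotonic_increasing:
        s = s.sort_index()
    return s
```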

@gidden

gidden commented Feb 18, 2023

Thanks @coroa and @danielhuppmann - I've updated the PR as you suggested: there is now a set of requirements, and fast_format_data assumes either a Series or a multi-index wide-format dataframe.

I haven't updated the profile for this, nor have I checked if there are other tests where this can be added as a parameter. I suspect we should probably do more of that, so please let me know if you already know of cases where we could add this.

data["type"].append(type)
data["time"].append(time)
data["label"].append("autogenerated")
except:

E722 do not use bare 'except'

@danielhuppmann

I'm wondering a bit about the use case and where the performance boost could come from...

I assume that the main use case is reading (large) files created by pyam, i.e. wide IAMC format, where we know that we have the "correct" ordering of columns.

So the expensive operations are (based on my limited understanding):

  • set the index: not really a way around that...
  • melt from wide to long
  • sort the index: given that Index.is_monotonic_increasing is fast, I doubt that sorting would be a performance drag if the input data is already sorted

So the main issue is melt, where @coroa suggested to use stack instead. But... Why not simply refactor format_data() to use stack instead of implementing a parallel method...?
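To see why the two routes are interchangeable, a self-contained comparison of melt and stack on toy wide-format data (this is not pyam's actual format_data; column names follow the IAMC convention used above):

```python
# stack produces the same long-format series as melt + set_index, but it
# only reshapes values and extends the MultiIndex instead of copying the
# id columns for every year.
import pandas as pd

IAMC_IDX = ["model", "scenario", "region", "variable", "unit"]
wide = pd.DataFrame(
    {
        "model": ["m"],
        "scenario": ["s"],
        "region": ["World"],
        "variable": ["Primary Energy"],
        "unit": ["EJ/yr"],
        2005: [1.0],
        2010: [2.0],
    }
)

via_melt = (
    wide.melt(id_vars=IAMC_IDX, var_name="year", value_name="value")
    .astype({"year": int})
    .set_index(IAMC_IDX + ["year"])["value"]
    .sort_index()
)

ts = wide.set_index(IAMC_IDX)
ts.columns = ts.columns.astype(int)  # mixed str/int headers are object dtype
via_stack = ts.rename_axis(columns="year").stack().rename("value").sort_index()

assert via_melt.equals(via_stack)
```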

@gidden

gidden commented Feb 20, 2023

Refactored and compared

@danielhuppmann

danielhuppmann commented Feb 20, 2023

Thanks @gidden - sweet that stack() really gives such a performance boost!

In light of that, I think it would be prudent to not have an extra "fast" option - it's only a small performance boost compared to a lot of extra overhead to carry around...

@gidden

gidden commented Feb 20, 2023

I'm indifferent here - @coroa any opinions?

@gidden

gidden commented Feb 22, 2023

closing in favor of #729 and #727

@gidden gidden closed this Feb 22, 2023
@danielhuppmann danielhuppmann mentioned this pull request Feb 27, 2023