Improve performance of `format_data()` #729

danielhuppmann · 2023-02-22T05:34:28Z

Please confirm that this PR has done the following:

~~Tests Added~~
~~Documentation Added~~
~~Name of contributors Added to AUTHORS.rst~~
Description in RELEASE_NOTES.md Added

Description of PR

This PR is an alternative to #726 and #727 that focuses only on performance improvements in `format_data()`, achieved by switching from `pd.melt()` to `stack()`.
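For readers unfamiliar with the two approaches, here is a minimal sketch of the idea on a toy wide-format table. It is not the code in `pyam/utils.py`; the column list `IDX` and the sample rows are assumptions made up for illustration. Both routes should yield the same long-format series.

```python
import pandas as pd

# Assumed index columns for a wide-format timeseries table (illustrative only)
IDX = ["model", "scenario", "region", "variable", "unit"]

wide = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 1.0, 6.0],
        ["model_a", "scen_a", "World", "Primary Energy|Coal", "EJ/yr", 0.5, 3.0],
    ],
    columns=IDX + [2005, 2010],
)

# melt-based route: unpivot the year columns into rows, then build the index
via_melt = (
    pd.melt(wide, id_vars=IDX, var_name="year", value_name="value")
    .set_index(IDX + ["year"])["value"]
    .sort_index()
)

# stack-based route: set the index first, then move the year columns
# into an additional index level
via_stack = (
    wide.set_index(IDX)
    .rename_axis(columns="year")
    .stack()
    .rename("value")
    .sort_index()
)

# Note: depending on the pandas version, stack() may drop missing values
# while melt() keeps them; the sample data contains none.
print(via_stack)
```

The stack-based route goes straight from the indexed wide table to an indexed series, without first building an intermediate long dataframe with ordinary columns.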

Performance improvements

Current implementation

      ITEM_PATH      ITEM_VARIANT  TOTAL_TIME  CPU_USAGE    MEM_USAGE
0  test_profile  test_init[data0]    0.673276   0.203255    -1.433594
1  test_profile  test_init[data1]    9.140325   0.931676  1389.832031
2  test_profile  test_init[data2]   29.698939   0.932599  3407.539062

New implementation

      ITEM_PATH      ITEM_VARIANT  TOTAL_TIME  CPU_USAGE    MEM_USAGE
0  test_profile  test_init[data0]    0.786510   0.167053    -0.167969
1  test_profile  test_init[data1]    4.425207   0.850163   631.160156
2  test_profile  test_init[data2]   12.178936   0.942462  1482.765625
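As a quick reading aid, the ratios between the two tables can be computed directly from the `TOTAL_TIME` column. The values below are copied from the tables; the units are not stated in the profiler output, so only the unit-free ratio is shown.

```python
# (current, new) TOTAL_TIME values copied from the two tables above
timings = {
    "test_init[data0]": (0.673276, 0.786510),
    "test_init[data1]": (9.140325, 4.425207),
    "test_init[data2]": (29.698939, 12.178936),
}

# Ratio > 1 means the new implementation is faster
speedup = {name: old / new for name, (old, new) in timings.items()}

for name, ratio in speedup.items():
    print(f"{name}: {ratio:.2f}x")
```

The smallest case is slightly slower, while the two larger cases run a bit more than twice as fast.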

codecov · 2023-02-22T05:44:32Z

Codecov Report

Merging #729 (4b6207c) into main (8c56dc3) will increase coverage by 0.0%.
The diff coverage is 100.0%.

❗ Current head 4b6207c differs from pull request most recent head 3fb9bd6. Consider uploading reports for the commit 3fb9bd6 to get more accurate results

@@          Coverage Diff          @@
##            main    #729   +/-   ##
=====================================
  Coverage   95.0%   95.0%           
=====================================
  Files         59      59           
  Lines       6004    6015   +11     
=====================================
+ Hits        5707    5718   +11     
  Misses       297     297

Impacted Files	Coverage Δ
pyam/utils.py	`91.9% <100.0%> (+0.2%)`	⬆️
tests/test_core.py	`100.0% <100.0%> (ø)`

coroa

Sure, looks good!

My qualms, which the other PR #727 is addressing:

format_data is a maintenance burden: it is too long and difficult to parse. #727 (Next iteration at a fast format_data) splits it into four functions: convert_r_columns, knead_data (not the best of names), intuit_column_groups and transform_to_series.
format_data here relies on creating a copy first. #727 avoids the copy in core.
The fast case avoids a reset_index followed by a set_index for input that is already correctly formatted (timeseries()-like or internal _data-like representations). Such input is used all over the place within pyam itself, and also whenever you need to do calculations that bypass pyam.
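To illustrate the last point, here is a minimal sketch of what such a fast case could look like. The helper `to_long_series` and the `REQUIRED` level list are hypothetical names for illustration and are not taken from #727 or from pyam.

```python
import pandas as pd

# Assumed index levels of the internal long-format representation
REQUIRED = ["model", "scenario", "region", "variable", "unit", "year"]


def to_long_series(data):
    """Hypothetical helper, not pyam's actual implementation."""
    # Fast case: a Series that already carries exactly the required
    # index levels is returned as-is, with no reset_index/set_index
    if isinstance(data, pd.Series) and list(data.index.names) == REQUIRED:
        return data

    # General case: flatten whatever index exists, then rebuild it
    df = data.reset_index()
    return df.set_index(REQUIRED)["value"]


long_df = pd.DataFrame(
    [
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 2005, 1.0],
        ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 2010, 6.0],
    ],
    columns=REQUIRED + ["value"],
)

series = to_long_series(long_df)   # general case: index is built once
again = to_long_series(series)     # fast case: same object comes back
print(again is series)
```

A real implementation would also have to decide whether the fast case can skip validation and copying, which this sketch does not address.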

I'd gladly add two more PRs on top of this one:

Refactor format_data into 5 different functions (above 4 + _validate_complete_index) -> Refactor initialization for simpler maintenance #730
Re-add multi-index fast-case -> Add fast-path to format data #731

gidden · 2023-02-22T08:25:20Z

Thanks @danielhuppmann - I'll let @coroa finish off the review. In general, I am happy with this, especially if we minimize the read-in times of large AR6-like files as much as possible.

danielhuppmann · 2023-02-22T08:36:56Z

Thanks @coroa - I fully agree with your qualms and am happy to take this further and implement more of your suggestions! (Or even happier to review/merge this PR and have you implement your suggestions on top of my changes.) I was just worried that the path-dependency of the existing PRs made them quite hard to review, and might make format_data() even more complicated and harder to maintain going forward.

Co-authored-by: Jonas Hörsch <[email protected]>

coroa · 2023-02-22T11:57:14Z

I'm fine with merging this PR. (That's why I clicked Approve :))

The announced PRs are #730 and #731.

danielhuppmann added 4 commits February 21, 2023 22:28

Refactor format_data() to use stack

ac80d40

Fix validation steps

25de43c

Add match to testing for nan in data index

f371e07

Use pandas to check for complete index

9283e84

danielhuppmann requested review from gidden and coroa February 22, 2023 05:34

danielhuppmann self-assigned this Feb 22, 2023

Make black

5eb1764

Add to release notes

8602ef7

danielhuppmann marked this pull request as ready for review February 22, 2023 05:59

coroa approved these changes Feb 22, 2023

gidden mentioned this pull request Feb 22, 2023

initial attempt at a fast init #726

Closed

4 tasks

danielhuppmann and others added 3 commits February 22, 2023 09:38

Implement suggestions by @coroa

4b6207c

Co-authored-by: Jonas Hörsch <[email protected]>

Remove superfluous deletion

5877b30

Update a comment

3fb9bd6

This was referenced Feb 22, 2023

Refactor initialization for simpler maintenance #730

Merged

Next iteration at a fast format_data #727

Closed

danielhuppmann merged commit 7a97516 into IAMconsortium:main Feb 22, 2023

danielhuppmann deleted the performance/format-data branch February 22, 2023 12:40

This was referenced Feb 28, 2023

Validation for illegal column names in data #734

Merged

Improve performance of IamDataFrame initialization (phase 2) #580

Closed
