-
Notifications
You must be signed in to change notification settings - Fork 118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve performance of format_data()
#729
Improve performance of format_data()
#729
Conversation
Codecov Report
@@ Coverage Diff @@
## main #729 +/- ##
=====================================
Coverage 95.0% 95.0%
=====================================
Files 59 59
Lines 6004 6015 +11
=====================================
+ Hits 5707 5718 +11
Misses 297 297
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, looks good!
My qualms, which the other PR #727 is addressing:
format_data
is a maintenance burden, it's too long and difficult to parse. Next iteration at a fast format_data #727 split it out into four different functions:convert_r_columns
,knead_data
(not the best of names),intuit_column_groups
andtransform_to_series
.format_data
here relies on creating a copy first. Next iteration at a fast format_data #727 avoids the copy in core.- The fast-case avoids reset_index with subsequent set_index, for correctly formatted timeseries()-like or inner _data-like representations, which is used all over the place from within
pyam
itself, but also for all the times, when you need to do calculations bypassing pyam.
I'd gladly add two more PRs on top of this one:
- Refactor format_data into 5 different functions (above 4 +
_validate_complete_index
) -> Refactor initialization for simpler maintenance #730 - Re-add multi-index fast-case -> Add fast-path to format data #731
Thanks @danielhuppmann - I'll let @coroa finish off the review. In general I am happy especially if we minimize the read-in times of large AR6-like files as much as possible. |
Thanks @coroa - fully agree with your qualms, happy to take this further and implement more of your suggestions! (Or even happier to review/merge this PR and you implement your suggestions on top of my changes). I was just worried that the current path-dependency of the existing PRs made it quite hard to review - and might make |
Co-authored-by: Jonas Hörsch <[email protected]>
Please confirm that this PR has done the following:
Tests AddedDocumentation AddedName of contributors Added to AUTHORS.rstDescription of PR
This PR is an alternative approach to #726 & #727 only focusing on performance improvements in
format_data()
by switching frompd.melt()
tostack()
.Performance improvements
Current implementation
New implementation