Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add xlsxwriter to improve to_excel performance #701

Merged
merged 9 commits into from
Sep 14, 2022

Conversation

phackstock
Copy link
Contributor

@phackstock phackstock commented Sep 12, 2022

Please confirm that this PR has done the following:

  • Tests Added (none needed except for possibly benchmarks)
  • Documentation Added
  • Name of contributors Added to AUTHORS.rst
  • Description in RELEASE_NOTES.md Added

Description of PR

Added xlsxwriter to the dependencies of pyam. pandas uses xlsxwriter over openpyxl if it's found. This means that if xlsxwriter is found on the system it is used without any changes required.
According to benchmarks (https://exchangetuts.com/python-fastest-way-to-write-pandas-dataframe-to-excel-on-multiple-sheets-1640154784194443), xlsxwriter is significantly faster than openpyxl.
Should I set up some benchmarks of our own to test it for pyam?

@codecov
Copy link

codecov bot commented Sep 12, 2022

Codecov Report

Merging #701 (a8c77af) into main (759120f) will increase coverage by 0.0%.
The diff coverage is 60.0%.

@@          Coverage Diff          @@
##            main    #701   +/-   ##
=====================================
  Coverage   94.8%   94.9%           
=====================================
  Files         58      58           
  Lines       5856    5853    -3     
=====================================
- Hits        5557    5555    -2     
+ Misses       299     298    -1     
Impacted Files Coverage Δ
pyam/core.py 94.7% <33.3%> (-0.2%) ⬇️
pyam/utils.py 91.7% <100.0%> (+0.5%) ⬆️
tests/test_io.py 100.0% <100.0%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@danielhuppmann
Copy link
Member

I just pip-installed pyam in a clean environment and xlsxwriter was installed automatically already. Not sure if it's necessary to add it as an explicit dependency?

@phackstock
Copy link
Contributor Author

Interesting, it's not a dependency of pandas (https://github.com/pandas-dev/pandas/blob/main/setup.cfg#L33).
Maybe it was installed as a dependency by some other library.
I just tried the same, install pyam from a clean virtual environment and it did not install it for me. Maybe it's platform dependent.

@danielhuppmann
Copy link
Member

You're right, looks like I was confused between two environments. Running some profiling now.

Copy link
Member

@danielhuppmann danielhuppmann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that openpyxl is hard-coded in line

excel_file = pd.ExcelFile(data, engine="openpyxl")

Also, I guess it's possible to remove openpyxl as a dependency?

I just ran some tests and can confirm that

  • removing the hard-coded engine works whether or not xlsxwriter is installed
  • performance is significantly better if xlsxwriter is installed, no need to specify it a engine explicitly (on a 30MB xlsx file, the resulting file is smaller by 50%, time is -45%, CPU is -80%)

RELEASE_NOTES.md Outdated Show resolved Hide resolved
@danielhuppmann
Copy link
Member

There is another usage of pd.ExcelWriter where the engine is not explicitly specified. Please make the two usages consistent.

pyam/pyam/core.py

Line 2377 in 759120f

excel_writer, close = pd.ExcelWriter(excel_writer), True

pyam/core.py Outdated Show resolved Hide resolved
Co-authored-by: Daniel Huppmann <[email protected]>
@phackstock
Copy link
Contributor Author

Updated the usage of xlsxwriter in pd.ExcelWriter.
There's one more thing that I found:

https://github.com/IAMconsortium/pyam/blob/main/pyam/utils.py#L135

is there a test for utils.write_sheet?

@danielhuppmann
Copy link
Member

is there a test for utils.write_sheet?

It's tested implicitly via

def test_io_xlsx(test_df, meta_args, tmpdir):

The line you found was a hacky attempt by me to make xlsx files look "nice" by having useful column widths.

@phackstock
Copy link
Contributor Author

Ok, should I check if using xlsxwriter triggered the error?

@danielhuppmann
Copy link
Member

Let me give this a quick try, I think you have more urgent things on your to-do list...

@phackstock
Copy link
Contributor Author

True, just thought this might be a quick one to get off the list ...

@danielhuppmann danielhuppmann merged commit fde3690 into IAMconsortium:main Sep 14, 2022
@phackstock phackstock deleted the feature/add-xlsxwriter branch September 14, 2022 16:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants