Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLD/DISC: Decreasing build/distribution size #52654

Open
jbrockmendel opened this issue Apr 13, 2023 · 9 comments
Open

BLD/DISC: Decreasing build/distribution size #52654

jbrockmendel opened this issue Apr 13, 2023 · 9 comments
Labels
Build Library building on various platforms Enhancement

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Apr 13, 2023

This came up indirectly in #52509 and I think merits some brainstorming. In no particular order:

  1. Circa 2018 there was discussion of stripping some (debug?) symbols from our C files. No idea if that went anywhere. cc @WillAyd

  2. In the last couple years we have improved perf in some groupby reductions by using fused types in libgroupby to support more dtypes directly without casts. I think this significantly increased the size of libgroupby. We did something similar in libalgos and libhashtable. I think avoiding the casting is worth it, but we should acknowledge the tradeoffs.

  3. Some stuff in _libs could plausibly live outside of cython without a ton of downside. ops_dispatch and reduction come to mind, though these are both quite small. More could move if we learn to live with circular dependencies.

  4. This would be a PITA, but we could distribute some dtype-specific stuff separately e.g. pip install pandas[sparse] pandas[interval] pandas[period] and potentially see some big savings that way. This would really be a PITA, but would make a big dent.

  5. IIUC moving cython code back to plain C might get some mileage cc @WillAyd again? This wo

  6. Avoid the numpy dependency. (grep finds 1105 "import numpy"s in pandas/, some of them in eg doctests. 33 "cimport numpy"s)

  7. Avoid pytz dependency (xref DEPR: deprecate pytz support #46463 coming up shortly once we drop py38)

  8. Avoid dateutil dependency

  9. There was a discussion [citation needed] of distributing pandas without the tests. I guess that was a "no".

  10. related DEV: reduce the size of the dev environment.yml #49998

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 13, 2023
@MarcoGorelli
Copy link
Member

Avoid dateutil dependency

Yeah, the only part of dateutil which I think we should be using is in guess_datetime_format. It's been on the back of my mind that, aside from inferring formats, we shouldn't be using it to parse, it's just too unreliable

@lithomas1
Copy link
Member

lithomas1 commented Apr 13, 2023

1 is done already (for cibuildwheel), and has been for a couple years now (with the old build system).

I think one other thing to consider is the size of the tests, IIUC some people were complaining about that a little while back.
EDIT: Nvm, didn't see the bullet about tests.

@lithomas1 lithomas1 added Enhancement Build Library building on various platforms and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 13, 2023
@mroeschke
Copy link
Member

  1. I would be +1 for making dateutil optional. The timezones could definitely be optionally supported, and the parser could be floated up as generically provided by the user. For format guessing could we vendor that part?
  2. Suggestion: remove tests from the distribution #30741, this is "hard"/annoying due to pd.test and the EA tests as publicly advertised

@jbrockmendel
Copy link
Member Author

I would be +1 for making dateutil optional. The timezones could definitely be optionally supported, and the parser could be floated up as generically provided by the user. For format guessing could we vendor that part?

There are three things from dateutil we use: parser, tz, relativedelta. parser I pretty much agree with Marco we can/should move away from. tz IIUC is going to be subsumed by zoneinfo anyway (i.e. the private attributes we rely on will go away) eventually, so getting rid of our special support for that makes sense. relativedelta we use in a few places (some of which are unnecessary or broken, xref #52569); i havent given much thought to how we could avoid that

@WillAyd
Copy link
Member

WillAyd commented Apr 13, 2023

Always a good conversation but I don't think we should change the way we think about C / Cython and fused types to account for this

@jorisvandenbossche
Copy link
Member

  1. Suggestion: remove tests from the distribution #30741, this is "hard"/annoying due to pd.test and the EA tests as publicly advertised

It seems that the tests make up about 1/3rd of the distributed package size, so that might be worth reconsidering

$ du -a --exclude=__pycache__ pandas | sort -n -r | head -n 20
44124	pandas
22156	pandas/_libs
14136	pandas/tests
6076	pandas/_libs/tslibs
5044	pandas/core
2336	pandas/_libs/join.cpython-311-x86_64-linux-gnu.so
2316	pandas/tests/io
2012	pandas/_libs/algos.cpython-311-x86_64-linux-gnu.so
1984	pandas/_libs/groupby.cpython-311-x86_64-linux-gnu.so
1948	pandas/io
1768	pandas/tests/frame
1768	pandas/_libs/hashtable.cpython-311-x86_64-linux-gnu.so
1660	pandas/tests/indexes
1348	pandas/_libs/interval.cpython-311-x86_64-linux-gnu.so
984	pandas/tests/frame/methods
932	pandas/_libs/tslibs/offsets.cpython-311-x86_64-linux-gnu.so
928	pandas/tests/series
856	pandas/_libs/sparse.cpython-311-x86_64-linux-gnu.so
852	pandas/_libs/index.cpython-311-x86_64-linux-gnu.so
816	pandas/core/arrays

@lithomas1
Copy link
Member

I can have a look at removing tests (moving them to a separate pandas-tests package). This is probably going to cause some friction, for developers, though (I will comment more on the other issue soon).

R.e. points 7 & 8, it is worth noting that pytz and dateutil hardly take up any space, as they are pure Python. (both are < 1MB). I would not worry about those too much.

One thing that might be interesting to try PGO/LTO on our C extensions.
This article
https://documentation.suse.com/sbp/all/html/SBP-GCC-10/index.html#sec-gcc10-spec (see section 7.1 Figure 6)
seems to suggest that it could result in a pretty nice decrease in size, but even if doesn't result in a size reduction it might be worth enabling for the other perf benefits.

While I don't think any other projects are doing this, I think Python itself is built with PGO/LTO, and there is an issue in one of the Python repos suggesting using BOLT on module .so libraries
(faster-cpython/ideas#449).

(One thing to note, though, is that it would dramatically increase the compile time, and maybe OOM the GHA runners used to build our wheels. There's also the question of what to use as profiling data for PGO)

(@WillAyd Do you think this is worth pursuing?)

@WillAyd
Copy link
Member

WillAyd commented Apr 15, 2023

Looks interesting. A little out of my wheelhouse but if you have time/interest I say go for it. PGO looks particularly interesting, though I guess we'd have to decide how we want to best train the program for optimization

@jbrockmendel
Copy link
Member Author

On the pytz one, we would also need to roll a replacement for pytz.AmbiguousTimeError

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Enhancement
Projects
None yet
Development

No branches or pull requests

6 participants