ARROW-5082: [Python] Substantially reduce Python wheel package and install size #7334
@kszucs can you help me get this across the finish line? |
It's a bad idea to add side effects to a simple inquiry function. This should IMHO be in a separate function (e.g.
I'd like to keep them. It's helpful to test if an installation works correctly. |
Couldn't we solve this problem another way? If a user wants to run the whole test suite locally they need more than just the pyarrow/tests directory. I don't think it's worth bloating the installs for an infrequent use case. |
At first I didn't like that the tests are shipped with the packages, but later on I found it useful. It's also worth mentioning that many of our packaging builds and CI tests run the pyarrow unittests using … If we decide to remove the tests from the packages (I'd rather keep them though), please defer it to another pull request because we need to update more CI scripts. |
Agree with Antoine. |
Yes, I'll ensure that the wheel packaging builds work properly. |
@github-actions crossbow submit -g wheel |
Revision: 208bd8c Submitted crossbow builds: ursa-labs/crossbow @ actions-283 |
@github-actions crossbow submit -g wheel |
Revision: 2d89b44 Submitted crossbow builds: ursa-labs/crossbow @ actions-284 |
I'm fine with debating whether to ship the tests separately. The people who benefit from being able to do |
Hard to disagree with that argument :) Either way I'm deferring it to a follow-up because it'll involve quite some CI and packaging updates. |
It's between the burden for the user of two additional megabytes installed, vs. the burden for us of "implement [and maintain] a function that downloads the tests along with the test dependencies (e.g. the testing data repos) and then executes them". For me it's a no-brainer to ship the tests with the wheels, but YMMV. |
Well, the point is that we don't have to do it at all. At no point in the 52 months since Apache Arrow started do I recall a user running the test suite out of a wheel or asking about doing so. If this is truly something that people need to be able to do, maybe an interested party can contribute it to the project? We've already expressed that we are going to limit our investment of time in maintaining wheels, and, from what I can tell, smaller wheels -> fewer complaints (here a 3% savings isn't that compelling but given that people are trying to squeeze pyarrow into deployments in AWS lambda that have to be < 250 MB including all dependencies, every megabyte does indeed count). |
OR (ARROW_PARQUET
AND CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "4.9"))
OR ARROW_PARQUET)
In file included from /usr/local/include/thrift/TApplicationException.h:23,
from /arrow/cpp/src/parquet/thrift_internal.h:36,
from /arrow/cpp/src/parquet/column_reader.cc:47:
/usr/local/include/thrift/Thrift.h:45:10: fatal error: boost/utility/enable_if.hpp: No such file or directory
#include <boost/utility/enable_if.hpp>
Seems like the parquet headers require boost headers as a transitive dependency.
@github-actions crossbow submit -g wheel |
Revision: 00d9c86 Submitted crossbow builds: ursa-labs/crossbow @ actions-285 |
@github-actions crossbow submit wheel-osx-* |
Revision: 8dacdc9 Submitted crossbow builds: ursa-labs/crossbow @ actions-286
Well, I won't argue too much about it. But at some point we had decided that wheels were too much of a burden for us, and now it seems we're going out of our way to please people. I'm not sure I understand the strategy. |
8dacdc9 to 271b25b
@github-actions crossbow submit -g wheel |
Revision: 271b25b Submitted crossbow builds: ursa-labs/crossbow @ actions-287 |
Well, the wheels are being installed at least 6.5 million times per month (for point of reference, for pandas it's 22.7M) and so wheel use has an impact on the health and success of the open source project. My attitude is that we shouldn't feel too bad about "taking things away" from the wheels absent more enthusiastic maintainers. With a half day's labor I was able to shrink the wheels by 4x -- the tests / no tests thing wasn't the most significant change but I definitely don't want to be going out of our way to put things in the wheels or maintain special code to cater to wheel users under the present circumstances. I think at least this will stymie some of the pain for some period of time until perhaps more maintainers come out of the woodwork (or I can afford to recruit and hire them). |
I made the |
I would say that whoever depends on that may want to add the required unit test. |
Sounds fine to me, we can open a JIRA. |
FYI @fjetter |
@github-actions crossbow submit -g wheel |
Revision: 599660e Submitted crossbow builds: ursa-labs/crossbow @ actions-288 |
@github-actions crossbow submit wheel-manylinux* |
Revision: 83b0c5f Submitted crossbow builds: ursa-labs/crossbow @ actions-289 |
@xhochy could you please give it a review? |
%PYTHON_INTERPRETER% -c "import pyarrow.dataset" || exit /B

@rem %PYTHON_INTERPRETER% -c "import pyarrow.gandiva" || exit /B
Just delete this line
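The batch lines under review smoke-test a built wheel by importing optional submodules one at a time and aborting on the first failure. A rough Python equivalent that collects all failures instead of stopping at the first is sketched below; `check_imports` is a hypothetical helper, and the list of modules to check is left to the caller:

```python
import importlib


def check_imports(module_names):
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:
            # ModuleNotFoundError is a subclass of ImportError, so a missing
            # optional component is recorded here rather than raised.
            failures[name] = str(exc)
    return failures
```

A packaging script could call `check_imports(["pyarrow.dataset"])` and exit non-zero if the returned dictionary is not empty.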
try 2 failure start try 3 Add post-install symlinking code for wheels, add option to not install tests
…an explicit opt-in, update documentation
… in macos wheel builds
83b0c5f to 3267d57
Merging. Thanks all. |
How can I download this ~15MB .whl version of PyArrow? |
@francisco-hoo For now this is only available in nightly builds: Soon we will release a 1.0.0 with those improvements included, though. |
Current manylinux wheel packages on master:
This patch
That's more than a 4x size reduction. There are several things in this patch:
- … -larrow -lparquet etc. won't work as is), I added a function pyarrow.create_library_symlinks() that adds symlinks to the versioned shared libraries. This function has to be run once under user permissions that can write to the site-packages/pyarrow/ directory.
- … pyarrow_gandiva package per ARROW-8518.
- … pyarrow.tests, which is about 2.3MB uncompressed. I don't think we need to ship the tests in the wheels. Currently we are still shipping the tests.
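To illustrate the symlinking idea described above, here is a minimal standard-library sketch of an explicit, opt-in step that adds unversioned links next to versioned shared libraries. It is not pyarrow's implementation: the function names and the file-naming patterns (`libfoo.so.N`, `libfoo.N.dylib`) are assumptions made for the example.

```python
import os
import re

# Assumed naming: "libarrow.so.100" (Linux) or "libarrow.100.dylib" (macOS).
_VERSIONED = re.compile(r"^(lib\w+?)(?:\.so\.(\d+)|\.(\d+)\.dylib)$")


def unversioned_name(filename):
    """Map a versioned library file name to its unversioned name, else None."""
    match = _VERSIONED.match(filename)
    if match is None:
        return None
    suffix = ".so" if match.group(2) else ".dylib"
    return match.group(1) + suffix


def create_symlinks(lib_dir):
    """Create missing unversioned symlinks in lib_dir; return the names created.

    Needs write permission on lib_dir, so it is kept separate from any
    read-only query function and must be called explicitly.
    """
    created = []
    for filename in sorted(os.listdir(lib_dir)):
        link_name = unversioned_name(filename)
        if link_name is None:
            continue
        link_path = os.path.join(lib_dir, link_name)
        if not os.path.lexists(link_path):
            # Relative target, so the link survives moving the directory.
            os.symlink(filename, link_path)
            created.append(link_name)
    return created
```

Calling the function a second time is a no-op, which matches the "run once" usage described in the patch notes.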