Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove docs from packages #38846

Merged
merged 35 commits into from
Apr 9, 2021
Merged

remove docs from packages #38846

merged 35 commits into from
Apr 9, 2021

Conversation

jameslamb
Copy link
Contributor

@jameslamb jameslamb commented Dec 31, 2020

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

#30741 discusses removing documentation and tests from package distributions. This is valuable because it makes pandas easier to use in storage-sensitive environments such as AWS Lambda (#30741 (comment), #30741 (comment)).

This PR does not close that issue, but it's a first step. This proposes removing the content of doc/ from package distributions. This PR trims about 1MB (compressed) and 6MB (uncompressed) out of the sdist package.

master this PR
sdist (compressed) 5.7M 4.5M
sdist (uncompressed) 25M 19M
how I checked these sizes
rm -rf pandas.egg-info
rm -rf dist/
rm -rf __pycache__

echo ""
echo "building source distribution"
echo ""
python setup.py sdist > /dev/null
cp pandas.egg-info/SOURCES.txt ~/PANDAS-SOURCES.txt
pushd dist/
    echo ""
    echo "sdist compressed size"
    echo ""
    du -a -h .
    tar -xf pandas*.tar.gz
    rm pandas*.tar.gz
    ls .
    echo ""
    echo "sdist uncompressed size"
    echo ""
    du -sh .
popd

To confirm that the documentation files were removed correctly by these changes, I ran the following

python setup.py sdist
cat pandas.egg-info/SOURCES.txt | grep -E "^doc"

Thanks for your time and consideration.

@jreback jreback added Docs Build Library building on various platforms labels Dec 31, 2020
@jreback jreback modified the milestones: 1.3, 1.2.1 Dec 31, 2020
@jreback
Copy link
Contributor

jreback commented Dec 31, 2020

@jameslamb can you add a note in 1.3. Other API changes section (make a separete entry, call it Build)

@jreback
Copy link
Contributor

jreback commented Dec 31, 2020

cc @simonjayhawkins if you can have a test on this to make sure build looks ok.

@jreback
Copy link
Contributor

jreback commented Dec 31, 2020

@jameslamb we might also want to have that check in https://github.com/pandas-dev/pandas-release, if you can make an issue / PR there as well.

@jameslamb
Copy link
Contributor Author

@jameslamb can you add a note in 1.3. Other API changes section (make a separete entry, call it Build)

sure! Added in 3aee9b9, let me know if that wasn't what you meant.

@jameslamb
Copy link
Contributor Author

@jameslamb we might also want to have that check in https://github.com/pandas-dev/pandas-release, if you can make an issue / PR there as well.

Sorry, I don't understand what you'd like me to add to pandas-release. Could you elaborate?

@jreback
Copy link
Contributor

jreback commented Jan 5, 2021

@jameslamb we might also want to have that check in https://github.com/pandas-dev/pandas-release, if you can make an issue / PR there as well.

Sorry, I don't understand what you'd like me to add to pandas-release. Could you elaborate?

an issue / PR that makes a test like you did above. that repo actually builds the releases and should check that no docs are present via a test.

@jameslamb
Copy link
Contributor Author

ooooo I see! Yep ok, I can do that

@jameslamb
Copy link
Contributor Author

Ok I've attempted a pandas-release PR at pandas-dev/pandas-release#35. I've also updated this PR to the latest master.

@WillAyd
Copy link
Member

WillAyd commented Feb 16, 2021

Rather than outright removing would we consider a no-doc specification as outlined in PEP 508 ? By default I think including documentation is a good thing

@jameslamb
Copy link
Contributor Author

Rather than outright removing would we consider a no-doc specification as outlined in PEP 508 ? By default I think including documentation is a good thing

could you provide specific reasons why you think it's valuable to include documentation like these Powerpoint files in package distributions?

There was already substantial discussion on #30741 about the disadvantages of doing that (for example, #30741 (comment) and #30741 (comment)), and @jreback agreed with the proposal in this PR.

@WillAyd
Copy link
Member

WillAyd commented Feb 26, 2021

Sure - if you are ever without internet and try to develop anything with pandas having the documentation available locally is invaluable. I also think it is relatively standard to include. At least other libraries like numpy and scikit-learn keep their doc folders in the sdist.

The powerpoint is a little different and I think fine to exclude regardless, but I do think PEP 508 would be a better solution for all

@jameslamb
Copy link
Contributor Author

jameslamb commented Feb 26, 2021

Thanks for clarifying. In my opinion, that use case is one that describes a very small minority of all the users (machine and human) that pip install pandas.

Because I believe that use case describes a small minority of all users, I think that if you wanted to provide support for it, it would be more appropriate to make that the optional thing you have to opt in to (e.g. pip install pandas[include-docs]).

I think the possible benefit of removing around 1MB of data transfer and around 6MB of uncompressed size on disk per install for users who have slow internet connections or who are storage-conscious (like AWS Lambda users) far outweighs the downside of not supporting by default the use case where someone has no internet access and wants to read the long-form pandas docs (they can always help() on any object in the library to get its dosctring).

One more point I'd like to make in favor of this change. A conservative reading of the last few months of data available at https://pypistats.org/packages/pandas suggests that pandas is installed from PyPi 500,000 times a day. That means that the changes in this PR might cut 500GB of daily data transfer from PyPi, without breaking anyone's existing code. I'd like you and other maintainers who comment on this to consider whether it is worth forgoing that benefit and the others mentioned above and in #30741 to preserve the behavior "if you've installed pandas successfully, all of the long-form documentation is available to you locally".

@jameslamb
Copy link
Contributor Author

Is there some additional information I could provide or changes reviewers would like me to make?

I've been merging the latest master into this PR every few days for the last few months, and I'd be happy to do whatever I can to move this to resolution (even if that resolution is maintainers deciding to reject this change).

Thanks.

@WillAyd
Copy link
Member

WillAyd commented Mar 25, 2021

I think the pdf / ppts are fine to exclude regardless, but for more general documentation I would still prefer to go the PEP 508 route. This has an additional advantage that you can provide further optimizations like the -OO switch if you are really trying to minimize package size

@jbrockmendel
Copy link
Member

I'm fine with either this or @WillAyd's method. If we go with this, the whatsnew should have a "to pip install with the documents do XYZ"

@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

yeah let's start with excluding the pdfs / ppts and see how that goes. @jameslamb if you'd update.

@jameslamb
Copy link
Contributor Author

Alright. Updated in 648187a.

After that change, here's an updated version of the table from the original description.

master this PR
sdist (compressed) 5.4M 4.6M
sdist (uncompressed) 40M 39M

@jreback jreback merged commit 1ad8d5d into pandas-dev:master Apr 9, 2021
@jreback
Copy link
Contributor

jreback commented Apr 9, 2021

thanks @jameslamb

@jameslamb jameslamb deleted the misc/remove-docs branch April 9, 2021 00:10
@simonjayhawkins simonjayhawkins mentioned this pull request Jun 12, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants