
Change cpuonly to a conda mutex #488

Closed · wants to merge 7 commits

Conversation

@scopatz (Contributor) commented Aug 11, 2020

This is in an effort to address pytorch/pytorch#40213

But I think I have ended up with more questions than solutions.

  1. Where does the main, templated meta.yaml actually live?
  2. The build_pytorch.sh seems like it would be better suited to being in Python or some kind of cross-platform language that actually has string formatting. I am curious as to why it is in Bash...
  3. How does this actually work? I have tried following the instructions in the README several times now locally on master, and I have gotten them to fail in a variety of ways, but they seem out of date in some ways. E.g. there is no pytorch-vX.Y.Z/ directory.
  4. Does the current way of templating the meta.yaml allow for multiple outputs? It seems like it may not, which is more or less required for moving to a mutex (see the sketch below).
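
For what it's worth, here is a rough sketch of what a multi-output recipe with a mutex could look like. This is purely illustrative; the pytorch-proc name and the variant layout are guesses, not this repo's actual recipe:

    outputs:
      - name: pytorch
        requirements:
          run:
            - pytorch-proc * cuda      # a cpu build would pin "pytorch-proc * cpu"
      - name: pytorch-proc
        version: "1.0"
        build:
          string: cpu                  # one output per variant: "cpu" or "cuda"
          # track_features on the cpu variant deprioritizes it, so the
          # solver prefers the cuda variant when both are installable
          track_features:
            - pytorch-proc-cpu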

CC @ezyang @rgommers & Thanks in advance!

@rgommers (Contributor)

Yep, I've gotten lost here a few times. I don't know the answers, so let's try to ping @seemethere and @malfet as the experts here. If one of you can help answer these questions, then @scopatz can return the favour by updating the README with the info he's currently missing.

@scopatz (Contributor, Author) commented Aug 11, 2020

It looks like everything is based on what is in nightly?

@seemethere (Member)

Yes this repository is a mess unfortunately :(

  1. The main meta.yaml for pytorch packages is here: https://github.com/pytorch/builder/blob/master/conda/pytorch-nightly/meta.yaml

  2. I don't disagree; we actually have an item on our roadmap to refactor these scripts in general to make them easier to understand and easier to run.

  3. I haven't actually built a PyTorch conda package locally in a long time, so I can't vouch for building one using our README instructions; the last command in my zsh history that did this looked something like:

docker run --rm -it \
    -e PACKAGE_TYPE=conda \
    -e DESIRED_CUDA=cu101 \
    -e DESIRED_PYTHON=3.8 \
    -e PYTORCH_BUILD_VERSION=1.5.0 \
    -e PYTORCH_BUILD_NUMBER=1 \
    -e OVERRIDE_PACKAGE_VERSION=1.5.0 \
    -e TORCH_CONDA_BUILD_FOLDER='pytorch-nightly' \
    -v ${path_to_pytorch}:/pytorch \
    -v ${path_to_build}:/builder \
    -v "$(pwd):/final_pkgs" \
    pytorch/conda-cuda \
    /builder/conda/build_pytorch.sh |& tee build.log

Which is admittedly pretty awful.

  4. I don't believe we support multiple outputs, so to answer your question, most likely no.

@scopatz (Contributor, Author) commented Aug 28, 2020

Hi @seemethere - thanks! This was extremely helpful. I was able to get it successfully building locally.

I have refactored the recipe to build with a gpu mutex package now, and also simplified some of the logic. I have tested this with cuda 10.1 and cpu-only at this point. Building locally, though, does leave some artifacts on the system that end up owned by root even when run as a regular user, which is not great.
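
To sketch what "cpuonly as a mutex" means here (illustrative, not the literal recipe): cpuonly stops carrying any logic of its own and just pins one side of the mutex, so installing it becomes mutually exclusive with anything that pins the gpu side:

    - name: cpuonly
      requirements:
        run:
          - pytorch-proc * cpu     # conflicts with anything requiring pytorch-proc * cuda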

There is clearly more work to do here in terms of cleanup and documentation, which I will be getting to next week. I wanted to mention this now, just as a status update and give the opportunity for comments and feedback.

@scopatz (Contributor, Author) commented Sep 16, 2020

Hi folks, I believe this is working now. I have tried it locally in conjunction with the nonroot branch (#518), and the mutex packages are built and docker run completes successfully. There are now three packages created: pytorch itself, a cudaXY convenience package, and the gpu mutex.
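
For reference, selecting between the variants should then look roughly like this from the user side (cuda101 stands in for a cudaXY package; exact names may still change):

    # cpu-only build
    conda install -c pytorch pytorch cpuonly
    # gpu build for CUDA 10.1, via the convenience package
    conda install -c pytorch pytorch cuda101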

@rgommers (Contributor) left a review comment

That sounds good! This PR LGTM as far as I can judge.

What would be the steps to deploy this? E.g., since this changes the current cpuonly, would it require a coordinated release of cpuonly, pytorch, torchvision, etc.?

Recipe excerpt under review:

    - pytorch-proc * cpu

    outputs:
    # A meta-package to select CPU or GPU build for faiss.
@rgommers (Contributor), on the diff:

This comment is a little confusing - why specifically for faiss? I assume it's for pytorch and possibly other packages too?

@scopatz (Contributor, Author):

Oops, this change shouldn't have made it in.

@ezyang (Contributor) commented Sep 16, 2020

The first step would be to get the nightly conda packages onto the new mutex scheme, so we can iron out bugs. This is going to need coordination with a release manager, e.g., @seemethere.

@scopatz (Contributor, Author) commented Oct 21, 2020

BTW I have fixed the conflicts here and this is good to go, as far as I am concerned. Let me know if you want to sync on it @seemethere!

@seemethere (Member) left a review comment

PR looks mostly good, going to hold off on merging until after the 1.7 release

@facebook-github-bot (Contributor)

Hi @scopatz!

Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention.

You currently have a record in our system, but we do not have a signature on file.

In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@kev-zheng

Hey! Quick bump: we'd really like to use this!

@rgommers (Contributor) commented Jan 5, 2021

@kev-zheng thanks for the ping. Just curious, where are you planning to use this?

@rgommers (Contributor) commented Jan 5, 2021

I see the CLA check turned red, @scopatz would you mind signing it (takes about 10 seconds)?

@kev-zheng

> @kev-zheng thanks for the ping. Just curious, where are you planning to use this?

We have a large conda environment that includes pytorch, and cpuonly or not depending on GPU availability. Specifically, we're building environment specs with conda-lock, and mamba seems to massively speed up generating lockfiles. Not a huge issue, but a decent QOL improvement.
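
For context, the workflow is roughly the following, assuming an environment.yml that lists pytorch (and cpuonly on cpu-only machines); the --mamba flag is what gives the solver speedup mentioned above:

    # resolve the environment and write platform-specific lockfiles
    conda-lock --mamba -f environment.yml -p linux-64 -p osx-64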

@scopatz (Contributor, Author) commented Jan 18, 2021

Hi all, on the CLA issue: all code here was written on behalf of Quansight. Is this for some reason not covered by Quansight's corporate CLA (https://code.facebook.com/cla/corporate)?

@rgommers (Contributor) commented Jan 18, 2021

> Hi All, On the CLA issue, all code here was written on behalf of Quansight. Is this for some reason not covered by Quansight's corporate CLA (https://code.facebook.com/cla/corporate)?

We filled out both personal and corporate CLAs for each team member. You're not on the corporate CLA, because it was signed in early Nov just after you left (it's not retroactive).

@scopatz (Contributor, Author) commented Jan 27, 2021

Alright, in order to move this along, I have fully read through and signed the CLA. Sorry it took so long.

@rgommers (Contributor)

Thanks @scopatz!

@seemethere (Member) commented Feb 1, 2021

So I'm in the process of trying to merge this and I'm running into a weird issue when attempting to build:

$ docker run --rm \
    -it \
    -e DESIRED_PYTHON=3.8 \
    -e DESIRED_CUDA="110" \
    -e CUDA_VERSION=11.0 \
    -e PYTORCH_BUILD_VERSION=1.0.0 \
    -e PYTORCH_BUILD_NUMBER=1 \
    -v "$(pwd)":/builder \
    -v /raid/eliuriegas/pytorch:/pytorch \
    pytorch/conda-builder:cuda11.0 \
    /builder/conda/build_pytorch.sh
....
Traceback (most recent call last):
  File "/opt/conda/bin/conda-build", line 11, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/conda_build/cli/main_build.py", line 481, in main
    execute(sys.argv[1:])
  File "/opt/conda/lib/python3.8/site-packages/conda_build/cli/main_build.py", line 470, in execute
    outputs = api.build(args.recipe, post=args.post, test_run_post=args.test_run_post,
  File "/opt/conda/lib/python3.8/site-packages/conda_build/api.py", line 186, in build
    return build_tree(
  File "/opt/conda/lib/python3.8/site-packages/conda_build/build.py", line 3068, in build_tree
    packages_from_this = build(metadata, stats,
  File "/opt/conda/lib/python3.8/site-packages/conda_build/build.py", line 2031, in build
    output_metas = expand_outputs([(m, need_source_download, need_reparse_in_env)])
  File "/opt/conda/lib/python3.8/site-packages/conda_build/render.py", line 789, in expand_outputs
    for (output_dict, m) in deepcopy(_m).get_output_metadata_set(permit_unsatisfiable_variants=False):
  File "/opt/conda/lib/python3.8/site-packages/conda_build/metadata.py", line 2120, in get_output_metadata_set
    ensure_matching_hashes(conda_packages)
  File "/opt/conda/lib/python3.8/site-packages/conda_build/metadata.py", line 325, in ensure_matching_hashes
    dep.split(' ')[-1] != m.build_id() and _variants_equal(m, om)):
  File "/opt/conda/lib/python3.8/site-packages/conda_build/metadata.py", line 1364, in build_id
    check_bad_chrs(manual_build_string, 'build/string')
  File "/opt/conda/lib/python3.8/site-packages/conda_build/metadata.py", line 579, in check_bad_chrs
    if c in s:
TypeError: argument of type 'int' is not iterable

When I added a print statement to see what string it was getting caught up on, I found it was interpreting DESIRED_CUDA=110 as an integer value instead of a string. I tried appending | string() to the corresponding meta.yaml line, and it appears that doesn't work either.

@rgommers or @scopatz do you guys have any idea what might be happening?
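
For reference, this matches plain YAML scalar typing, where quoting is the usual way to force a string; whether conda-build's renderer preserves that after Jinja templating is exactly the open question:

    desired_cuda: 110          # YAML parses an unquoted 110 as the integer 110
    desired_cuda_str: "110"    # quoting yields the string "110"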

@ezyang (Contributor) commented Feb 2, 2021

Maybe @mattip might also have some ideas.

@mattip (Contributor) commented Feb 2, 2021

Digging around in the sources, it seems the metadata parser will always prefer to convert to int?

@wolfv commented Mar 16, 2021

Would be great to have this!

@facebook-github-bot (Contributor)

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

@jph00 commented Jul 13, 2021

Thanks so much for your work on this @scopatz, @rgommers, @seemethere and @ezyang! :) FYI, this issue has just started blocking PyTorch GPU installation with mamba from working. @wolfv tells me that this PR is required in order to get it working again.

It looks like it was just about ready to merge AFAIK, but perhaps it got forgotten?

@scopatz (Contributor, Author) commented Jul 26, 2021

Seems like it did. Also, I would like to remove my facebook CLA at this time. Can this please be merged / forked / etc ASAP?

@rgommers (Contributor)

> Seems like it did. Also, I would like to remove my facebook CLA at this time. Can this please be merged / forked / etc ASAP?

I'll follow up on this today.

@rgommers (Contributor)

pytorch/pytorch#54900 showed that this PR no longer worked as expected three months ago; something had changed in the meantime. It needs more work. Trying to hash out a plan together with @seemethere and @malfet. Probably we'll close and resubmit; we should have it figured out by tomorrow.

@rgommers changed the title from "Mutex" to "Change cpuonly to a conda mutex" on Jul 28, 2021
@rgommers (Contributor)

Okay, closing; we'll resubmit together with a more extended and documented test plan. I will comment once that is done, so everyone who is interested knows where to look for updates.

> Also, I would like to remove my facebook CLA at this time.

Please go ahead @scopatz, should be fine now. Thanks for the work on this!
