Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distribution dependencies not showing up in packaged distribution #17353

Closed
engnatha opened this issue Oct 26, 2022 · 19 comments
Closed

Distribution dependencies not showing up in packaged distribution #17353

engnatha opened this issue Oct 26, 2022 · 19 comments
Assignees
Labels

Comments

@engnatha
Copy link

Describe the bug
A pants distribution can be constructed in such a way that dependencies shows relevant package data in the graph. Upon building the distribution, one will find that those data files have been (silently) omitted from the built distribution. They are left to discover these missing files likely at run time. This issue has been reproduced in https://github.com/engnatha/pants-distribution-bug. The steps to reproduce are

  • Verify secret_data.txt is a dependency with ./pants dependencies --transitive //:test_distribution
  • Build the distribution with ./pants package //:test_distribution
  • Confirm the file is not present in the packaged distribution unzip -l dist/test_dist-0-py3-none-any.whl

The silent dropping is coming from this line. The data file cannot be associated with a package and is dropped from package_data in the constructed setup.py. If one places a dummy file in helloworld/, sets up a BUILD file, and associates that module with the distribution target, then the data file is included again. Note that the "dummy" file can't be just __init__.py since tailor was not picking that up to make a new BUILD file (though that's likely a separate issue).

This does not seem to conform to pep-420 standards for namespace packages that intend to remove the need for python modules within every directory.

Pants version
2.14.0rc5 (though this was also the case on 2.13.0 and likely earlier versions)

OS
Ubuntu 20.04

Additional info
To be fair, sphinx also doesn't gracefully handle namespace packages (at least for the version we use). We have had to put __init__.py files around our repo to make documentation building work properly.

@engnatha engnatha added the bug label Oct 26, 2022
@benjyw benjyw self-assigned this Oct 27, 2022
@benjyw
Copy link
Contributor

benjyw commented Oct 27, 2022

Yeah, this does seem like the wrong behavior. To fix it, I think we need to ask what package we should be assigning the data file to in this example.

For instance, should the file helloworld/external/data/top_secret.txt be considered as

  1. the resource external/data/top_secret.txt in the package helloworld
  2. the resource data/top_secret.txt in the package helloworld/external
  3. the resource top_secret.txt in the package helloworld/external/data

@benjyw
Copy link
Contributor

benjyw commented Oct 27, 2022

Probably the right thing is to assume the location of the BUILD file that owns the file is the package?

@engnatha
Copy link
Author

Yes, I'd agree with that assumption. I mentioned this on slack, but when we're sharing this repo, we utilize a MANIFEST.in file to specify which data files we'd also like to include. This works great and keeps all the files in the same place relative to the .py modules that are needed.

@benjyw
Copy link
Contributor

benjyw commented Oct 27, 2022

Interestingly, MANIFEST.in doesn't divide files up by package, it just lists files to include. I'm not sure if setup() then deduces the packages, or if it simply ignores that aspect.

@engnatha
Copy link
Author

It seems like it ignores that aspect. Personally, I'm not sure what the relevance is in associating data files with particularly packages.

@benjyw
Copy link
Contributor

benjyw commented Oct 28, 2022

Yeah, I don't really know TBH.

@benjyw
Copy link
Contributor

benjyw commented Oct 28, 2022

I dug into this a little further, and at least wrt to PEP-420, the behavior is correct as-is. Neither pkgutil nor importlib.resources will load the file relative to a package unless there are __init__.py to denote the package.

So I assume in your real-world use case you're loading the file some other way? I don't think we can call it package_data if it's not in a package, so we may have to add it to MANIFEST.in or something.

@benjyw
Copy link
Contributor

benjyw commented Oct 28, 2022

Also, if I put the file in MANIFEST.in it does end up in the sdist, but not in the wheel. And in your confirmation step above, you're checking for it in the wheel. In your pre-Pants workflow, how do you get that file into the wheel?

@engnatha
Copy link
Author

When we share repos internally, we're using a version control entry to grab a specific hash of our repo. So it's possible MANIFEST.in only works for source distributions. I'm interested in the wheel scenario because that's how we're building some separate tooling. If this is working as intended by the Python gods, then I can proceed with package-ifying these directories so it shows up. I'd just have to take a long 100 yard stare to accept that we have to carry around no-op files just for the sake of building distributions.

@benjyw
Copy link
Contributor

benjyw commented Oct 28, 2022

Yeah, from my readings, MANIFEST.in is for source distributions. When you build a wheel from an sdist, it only takes data files specified by package_data. There is also a data_files kwarg, but those files end up moved entirely outside of the source tree, into, e.g., test_dist-0.data/. They're intended for files that go

And from my experiments (just using Python, both with pkgutil and importlib.resources, entirely outside of Pants!) it looks like loading data from a package requires an __init__.py.

So I think this is a Python wrinkle, not a Pants wrinkle...

@benjyw
Copy link
Contributor

benjyw commented Oct 28, 2022

TBH there are so many tools and use-cases that still assume the presence of __init__.py that PEP-420 doesn't really work in the real world. E.g., it completely breaks Django.

@engnatha
Copy link
Author

Yup, agreed. I think resolution of this can be as simple as agreeing this is working as expected and that it is desirable to not so silently hide that items went missing as part of the packaging. I'd like to handle that just to see the contribution process through if possible. Any thoughts on what's preferable? I'm not sure how much work it is, but I was thinking something like how other subsystems have an ignore/warn/error option. That seems like it would be useful here.

@benjyw
Copy link
Contributor

benjyw commented Oct 31, 2022

I agree that adding a least a warning makes sense. In fact, I'm not sure why we did that skipping silently in the first place...

We could, in increasing order of work do:

  1. A hardcoded warning
  2. A message gated by an error/warning/ignore option, as you suggest
  3. Configure this per python_distribution target

Maybe even just #1 seems reasonable?

@matan129
Copy link

Any progress on this matter? Is there a way to include non-Python files in pants?
cc @TalAmuyal

@benjyw
Copy link
Contributor

benjyw commented Jan 26, 2023

Hi @matan129 , can you clarify your issue? You can include non-Python asset files in built distribution using files() or resources() targets. I think this ticket is around the specific problem of asset files that aren't in a python package (as defined by __init__.py or some other python file). And I think we've established that this doesn't work even outside of Pants.

At least, that is my understanding of re-reading the comments above.

@engnatha
Copy link
Author

Yes, that is my understanding as well. The latest summary of this was to provide a better notification to the user when resources in the dependency graph did not show up in the packaged output. Currently, the behavior more silently skips over them.

In my original investigation, once I moved the files to a place that was part of the package tree, my resources were included properly.

@matan129
Copy link

matan129 commented Jan 27, 2023

@benjyw My bad, I didn't know that resource() work with python_distribution(). Thanks 🙏!

@TalAmuyal
Copy link
Contributor

@benjyw, maybe this thread should be marked as closed with a comment that links to the summary?

@benjyw
Copy link
Contributor

benjyw commented Jan 29, 2023

Closing - it looks like this works as intended, and the wrinkle is a Python issue, not a Pants issue.

@benjyw benjyw closed this as completed Jan 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants