
Modify Upload API to be asynchronous #7730

Open
brainwane opened this issue Apr 3, 2020 · 4 comments
Labels
APIs/feeds · feature request · to user test (Issues that should be investigated further via user testing)

Comments

@brainwane (Contributor)

What's the problem this feature will solve?
Right now, uploading to Warehouse is synchronous. This is a pain when we want to implement upload gating like #5420 or other checks, and cramps our style regarding TUF (@ewdurbin can go into that further).

Describe the solution you'd like
We would change the Warehouse API (or add a new version) to make uploads asynchronous.

Additional context
From IRC today:

dstufft: the hardest part about that issue is figuring out how we deal with the existing uploaders
dstufft: because we'd want some mechanism where twine could get an identifier back, and then poll for completion (and any errors / warnings)
dstufft: and I'm not sure we could just YOLO change the existing API to do that
dstufft: so it'd probably be some thing where we stand up a second API with the new semantics, then deprecate the old one and start pushing people to it
dstufft: etc etc

@brainwane added the "feature request", "APIs/feeds", "to user test", and "python" labels Apr 3, 2020
@brainwane (Contributor, Author)

Continuing the IRC transcript:

dstufft: but maybe spending more than 30s thinking about it would come up with something clever
dstufft: It would be interesting to talk to the TUF folks when doing it
dstufft: because if we do PEP 480 then that's going to change the upload API too
dstufft: so we'd want to make sure whatever we do is compatible with that

cc @woodruffw

@dstufft (Member) commented Apr 4, 2020

Alright, so here's the rough idea I have for the workflow for an asynchronous upload API that is also extensible in the future to support PEP 480 (assuming we ever accept and implement PEP 480). This should itself be a PEP, since it's defining a standards-based API, but just to get the very rough idea out there, here's a quick summary of what I'm thinking:

The basic idea here is the Upload API effectively becomes 3 endpoints, and an upload request "flows" through them. This is basically modeled after the YouTube upload API and gives a few extra nice properties that we didn't have before.


  1. A request is made to an upload endpoint to start an upload request; it includes a JSON payload, but does NOT include any sort of actual file uploads. The exact specifics of this metadata remain TBD, but you could imagine something like:
{
    "files": {
        "foo-1.0.tar.gz": {... Python metadata},
        "foo-1.0-py3-none-any.whl": {... Python metadata}
    },
    // This doesn't exist yet, but in a hypothetical future where we implement PEP 480,
    // this could work.
    "tuf": {... TUF metadata}
}

This gives us all of the metadata up front, and allows us to check permissions, check filenames, validate the metadata, and so on (basically anything we can do without the actual file contents) before the user has ever attempted to upload the file, and to return an appropriate error if any of those checks fail.
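As a rough sketch of the kind of up-front check step (1) enables (the function name, filename rules, and metadata keys here are illustrative assumptions, not Warehouse's actual validation logic):

```python
# Hypothetical pre-validation for step (1): everything here runs before any
# file bytes are accepted. The rules are deliberately simplified.
import re

# Very rough sdist/wheel filename shapes; real validation would be stricter.
_FILENAME_RE = re.compile(r"^[\w.]+-[\w.]+(\.tar\.gz|-py\d-none-any\.whl)$")

def prevalidate(payload: dict) -> list[str]:
    """Return a list of error strings; an empty list means the request may proceed."""
    errors = []
    files = payload.get("files", {})
    if not files:
        errors.append("no files declared")
    for filename, metadata in files.items():
        if not _FILENAME_RE.match(filename):
            errors.append(f"invalid filename: {filename}")
        # Require at least the core metadata fields before issuing upload URLs.
        if not metadata.get("name") or not metadata.get("version"):
            errors.append(f"missing core metadata for {filename}")
    return errors
```

Anything that fails here can be rejected in the initial response, before the client spends any bandwidth on the files themselves.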

Assuming all of the checks above pass, this would return a 200 OK with a JSON body that looks something like:

{
    "status": "an url to check on the status of this upload",
    "files": {
        "foo-1.0.tar.gz": "An URL",
        "foo-1.0-py3-none-any-whl": "Another URL",
    }
}
  2. The client would take the URLs given to them in the above files dictionary, and would do a simple PUT request to each one with the associated file. This could be done in parallel, and, using a method similar to that of the YouTube API, could also support resumption of an interrupted upload (assuming our entire stack supports that).

Once the server has received the data for all of the files present in this upload, it will then finish processing the request (including any async checks it wishes to do now that the full contents of each file is available) and then publish the files.

  3. At any point after (1), but presumably not until after (2), the client may use the URL indicated in the status key to query the status of the upload, which should give an overall success/fail/pending status as well as any errors or warnings that have been generated on a per-file basis.
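Putting the three steps together, the client side might look roughly like this (the endpoint path, JSON keys, and polling interval are assumptions based on the summary above, and the HTTP client is abstracted as a callable so the flow is easy to see):

```python
# Sketch of the three-step client flow: initiate, upload, poll.
# Not a real twine/Warehouse API; shapes are assumed from the proposal text.
import time

def upload(http, metadata: dict, file_bytes: dict) -> dict:
    """http(method, url, **kw) -> parsed JSON dict; any HTTP client could back this."""
    # Step 1: send metadata only; the server validates before any file bytes move.
    session = http("POST", "/upload/", json=metadata)
    # Step 2: PUT each file to its pre-issued URL (could be parallelized).
    for filename, url in session["files"].items():
        http("PUT", url, data=file_bytes[filename])
    # Step 3: poll the status URL until the server reports a final state.
    while True:
        status = http("GET", session["status"])
        if status["state"] in ("success", "failed"):
            return status
        time.sleep(1)
```

The per-file warnings/errors would then be read out of the final status document, which is what lets the server report problems that only surface once the full file contents are available.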

This change has a number of positives:

  • A lot of work can be done quickly before any file bytes touch the network, reducing bandwidth costs for files that will ultimately fail some kind of validation or authorization check.
  • Multiple files can be uploaded in a single "transaction", which gives us some of the benefits of #726 (a draft release feature on the main archive to allow testing a release before it goes live), though not all of them, and solves a lot of the consistency problems people hit when trying to upload many separate wheels.
  • Allows us to take an async approach to processing a file upload, which allows us to shorten the request/response time back down to something reasonable on these routes, while being able to do pretty much any processing we want (within reason, something that takes hours to process is probably still not the right call).
  • Gives us the ability to add new functionality to the API in terms of warnings and failures, and to know that clients using this API will actually support them (since it's a whole new API).
  • If our stack supports it (we will effectively need to ensure that, for the file upload portion, nothing buffers the entire request before passing it on to the backend), we can implement upload resumption using a method similar to what YouTube does.
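A rough sketch of that resumption handshake, borrowing the Content-Range convention from YouTube's resumable upload protocol (the exact status/header exchange is simplified here and is an assumption, not a settled design):

```python
# Hypothetical resumable PUT: after an interruption, ask the server how many
# bytes it already has, then send only the remainder. The Content-Range
# headers mirror YouTube's resumable upload protocol.
def resume_upload(http, url: str, data: bytes) -> None:
    """http(method, url, **kw) -> dict with an optional 'range' key, e.g. 'bytes=0-4194303'."""
    total = len(data)
    # Probe: an empty PUT with "bytes */total" asks the server what it has received.
    resp = http("PUT", url, headers={"Content-Range": f"bytes */{total}"})
    # "bytes=0-N" means bytes 0..N arrived, so resume from byte N + 1.
    received = int(resp["range"].rsplit("-", 1)[-1]) + 1 if resp.get("range") else 0
    # Send only the missing tail, labeled with its position in the whole file.
    http("PUT", url, data=data[received:],
         headers={"Content-Range": f"bytes {received}-{total - 1}/{total}"})
```

This is exactly the part that depends on nothing in front of the backend buffering whole requests, since partial uploads must survive as partial state on the server.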

It is not without its downsides, however:

  • The upload API is more complicated, involving more total steps, which places an additional burden on clients.
  • It puts Warehouse in the position of holding data that is in an inconsistent state (e.g. between steps 1 and 2), where we'll want the records to exist in the database but not be available anywhere.
    • This opens up the question of how we handle uniqueness constraints. Off the cuff, my instinct is that you can only have one pending "release" for a specific Project + Release at a time, and if your upload fails in the meantime you have to either resume somehow OR delete the release (so we'd also need an API for deleting a pending release). However, another option is that we don't enforce uniqueness constraints between unfinished releases until the final steps (we can of course check for uniqueness against things that have already been published).
  • It also means that releases can sit in an inconsistent state indefinitely (if someone starts a release and then walks away from it). We will likely need a process that periodically expires and removes any pending release that is older than some age (24h?).
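Such a cleanup process could be as simple as a periodic task along these lines (the PendingRelease record and the 24h cutoff are hypothetical):

```python
# Hypothetical periodic cleanup for abandoned upload sessions. In a real
# deployment this would be a scheduled task deleting rows; here it just
# selects which pending releases have exceeded the cutoff.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class PendingRelease:  # stand-in for a DB record of an in-flight upload
    name: str
    started: datetime

def expire_stale(pending: list[PendingRelease], now: datetime,
                 max_age: timedelta = timedelta(hours=24)) -> list[PendingRelease]:
    """Return the pending releases that should be expired and removed."""
    return [p for p in pending if now - p.started > max_age]
```

The cutoff interacts with the uniqueness question above: once a stale session expires, its Project + Release slot frees up for a fresh attempt.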

Overall, I think this API for uploads gives us a lot more flexibility and a much nicer UX, at the cost of some additional implementation complexity (most of which is on the Warehouse side, but some is on the client side).

This would not change anything in the Simple API or anything else besides the actual act of uploading.

I'm going to start writing this up into a proper PEP, but I wanted to give a sort of brain dump on my thoughts here to see if anyone else had any thoughts on it.


I had a quick call with @trishankkarthik just to make sure that, in any hypothetical PEP 480 world, the above API wouldn't lock us into place. It appears we're perfectly fine to stick the TUF metadata in a sub-key of the overall JSON object that gets sent in (1).

For plans even beyond that: if we ever implement in-toto (who knows if we will), we would need the ability to upload multiple files in a single transaction, since in-toto uses additional files beyond the actual file payloads. One of the problems with the existing API is its lack of multi-file atomicity, which led to this proposal's ability to upload multiple files in a single "transaction"; we could then also utilize that for something like in-toto if we so desired. If/when PEP 480 gets implemented, we would also have the option of uploading the TUF metadata as an additional file instead of baking it into the JSON object in the initial publish (there is some benefit to doing that in terms of the total size of that initial request, but a downside in that we couldn't verify the TUF metadata is acceptable before accepting file uploads).

@ewdurbin ewdurbin changed the title Modify API to be asynchronous Modify Upload API to be asynchronous Apr 4, 2020
@brainwane (Contributor, Author)

@dstufft You mentioned

I'm going to start writing this up into a proper PEP, but I wanted to give a sort of brain dump on my thoughts here to see if anyone else had any thoughts on it.

Have you heard any thoughts outside this issue? And how is the PEP going?

@di removed the "python" label May 12, 2020
brainwane added a commit to psf/fundable-packaging-improvements that referenced this issue Jun 10, 2020
Per pypi/warehouse#7730 ,
overhauling Warehouse's upload API requires a new
PEP. The fundable improvement guidelines say we
shouldn't ask for money for something till we have
consensus for the idea, meaning that any PEPs are
finished and approved. Thus, this commit removes
the upload work from the list of features in the
PyPI API revamp task.

Signed-off-by: Sumana Harihareswara <[email protected]>
@dstufft (Member) commented Jun 28, 2022

There is now PEP 694: Upload 2.0 API for Python Package Repositories, which has a discussion thread on discuss.python.org relevant to this issue.
