Modify Upload API to be asynchronous #7730

brainwane · 2020-04-03T22:28:16Z

What's the problem this feature will solve?
Right now, uploading to Warehouse is synchronous. This is a pain when we want to implement upload gating like #5420 or other checks, and cramps our style regarding TUF (@ewdurbin can go into that further).

Describe the solution you'd like
We would change the Warehouse API (or add a new version) to make uploads asynchronous.

Additional context
From IRC today:

@dstufft: the hardest part about that issue is figuring out how we deal with the existing uploaders
dstufft: because we'd want some mechanism where twine could get an identifier back, and then poll for completion (and any errors / warnings)
dstufft: and I'm not sure we could just YOLO change the existing API to do that
dstufft: so it'd probably be something where we stand up a second API with the new semantics, then deprecate the old one and start pushing people to it
dstufft: etc etc

brainwane · 2020-04-03T22:29:06Z

Continuing the IRC transcript:

dstufft: but maybe spending more than 30s thinking about it would come up with something clever
dstufft: It would be interesting to talk to the TUF folks when doing it
dstufft: because if we do PEP 480 then that's going to change the upload API too
dstufft: so we'd want to make sure whatever we do is compatible with that

cc @woodruffw

dstufft · 2020-04-04T20:46:24Z

Alright, so here's the rough idea I have for the workflow for an asynchronous upload API that is also extensible in the future to support PEP 480 (assuming we ever accept and implement PEP 480). This should itself be a PEP since it's defining a standards-based API, but just to get the very rough idea out there, here's a quick summary of what I'm thinking:

The basic idea here is the Upload API effectively becomes 3 endpoints, and an upload request "flows" through them. This is basically modeled after the YouTube upload API and gives a few extra nice properties that we didn't have before.

A request is made to an upload endpoint to start an upload. It includes a JSON payload, but does NOT include any actual file contents. The exact specifics of this metadata remain TBD, but you could imagine something like:

{
    "files": {
        "foo-1.0.tar.gz": {... Python metadata},
        "foo-1.0-py3-none-any.whl": {... Python metadata},
    },
    // This doesn't exist yet, but in a hypothetical future where we implement PEP 480,
    // this could work.
    "tuf": {... TUF metadata}
}

This gives us all of the metadata up front and allows us to check permissions, check filenames, validate the metadata, and so on: basically anything that we can do without the actual file, before the user has ever attempted to upload it. If any of that fails, we return an appropriate error.
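
To make the upfront checks concrete, here is a minimal sketch of what a server-side precheck could look like. Everything in it is an assumption for illustration: the function name `precheck`, the `name`/`version` metadata keys, the `owned_projects` set standing in for a real permission check, and the naive filename rule (real filename normalization is more involved). It is not Warehouse code.

```python
ALLOWED_SUFFIXES = (".tar.gz", ".whl")  # assumption: only sdists and wheels


def precheck(payload, owned_projects):
    """Return a list of error strings; an empty list means the request may proceed."""
    files = payload.get("files")
    if not isinstance(files, dict) or not files:
        return ["'files' must be a non-empty object"]

    errors = []
    for filename, meta in files.items():
        if not filename.endswith(ALLOWED_SUFFIXES):
            errors.append(f"{filename}: unsupported file type")
        if not isinstance(meta, dict) or not meta.get("name") or not meta.get("version"):
            errors.append(f"{filename}: metadata needs 'name' and 'version'")
            continue
        name, version = meta["name"], meta["version"]
        # Hypothetical stand-in for a real authorization check.
        if name not in owned_projects:
            errors.append(f"{filename}: not permitted to upload to {name!r}")
        # Naive consistency check; ignores project name normalization.
        if not filename.startswith(f"{name}-{version}"):
            errors.append(f"{filename}: filename does not match metadata")
    return errors
```

All of this runs before a single file byte is transferred, which is the point of splitting the metadata request from the file uploads.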

Assuming all of the checks above pass, this would return a 200 OK with a JSON body that looks something like:

{
    "status": "a URL to check on the status of this upload",
    "files": {
        "foo-1.0.tar.gz": "A URL",
        "foo-1.0-py3-none-any.whl": "Another URL",
    }
}

The client would take the URLs given to them in the above files dictionary, and would do a simple PUT request to each one for the associated file. This could be done in parallel, and using a method similar to that of the YouTube API, could also support resumption of an interrupted upload (assuming our entire stack supports that at least).
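
A client-side sketch of this step, assuming the response shape shown above: each file is sent with a plain PUT to its URL, and the uploads run in parallel on a thread pool. The names `put_file` and `upload_all` are hypothetical, and the `put` parameter exists only so the transport can be swapped out (for example in tests). Resumption is not handled here.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def put_file(url, data):
    """PUT raw bytes to a URL and return the HTTP status code."""
    req = urllib.request.Request(url, data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.status


def upload_all(file_urls, contents, put=put_file, max_workers=4):
    """Upload every file to its assigned URL in parallel.

    file_urls: the "files" mapping from the server response (filename -> URL).
    contents:  filename -> bytes to upload.
    Returns filename -> HTTP status; any exception from a PUT propagates.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(put, url, contents[name])
            for name, url in file_urls.items()
        }
        return {name: fut.result() for name, fut in futures.items()}
```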

Once the server has received the data for all of the files present in this upload, it will then finish processing the request (including any async checks it wishes to do now that the full contents of each file are available) and then publish the files.

At any point after the initial request (step 1), but presumably not until after the file uploads (step 2), the client may use the URL indicated in the status key to query the status of the upload. The response should give an overall success/fail/pending status as well as any errors or warnings that were generated on a per-file basis.
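
The polling side could be sketched roughly as follows. The status document shape (`{"status": "pending" | "success" | "failed", ...}`) is an assumption, since the exact format is still TBD, and `fetch_status` is a hypothetical callable that GETs the status URL and returns the decoded JSON.

```python
import time


def poll_until_done(fetch_status, interval=2.0, timeout=300.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call fetch_status() until the upload leaves the 'pending' state.

    Returns the final status document, or raises TimeoutError if the
    upload is still pending once `timeout` seconds have elapsed.
    """
    deadline = clock() + timeout
    while True:
        doc = fetch_status()
        if doc["status"] != "pending":
            return doc
        if clock() >= deadline:
            raise TimeoutError("upload still pending after timeout")
        sleep(interval)
```

A client like twine would then inspect the returned document for per-file errors and warnings and report them to the user.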

This change has a number of positives:

- A lot of work can be done quickly before any file bytes touch the network, reducing bandwidth costs for files that will ultimately fail some kind of validation or authorization check.
- Multiple files can be uploaded in a single "transaction", which gives us some of the benefits of #726 (Draft release feature on main archive to allow testing a release before it goes live), but not all of them, and solves a lot of consistency problems people have when trying to upload a disparate set of wheels.
- It allows us to take an async approach to processing a file upload, which lets us shorten the request/response time back down to something reasonable on these routes, while being able to do pretty much any processing we want (within reason; something that takes hours to process is probably still not the right call).
- It gives us the ability to add new functionality to the API in terms of warnings and failures and know that clients using this API will actually support them (since it's a whole new API).
- If our stack supports it (we will effectively need to ensure that, for the file upload portion, nothing buffers the entire request before passing it on to the backend), we can implement upload resumes using a method similar to what YouTube does.
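
As a rough illustration of the resumption idea, the sketch below computes the headers a client might send to resume an interrupted PUT, loosely following the `Range` / `Content-Range` convention used by YouTube-style resumable uploads. The function name, and the assumption that the server reports the bytes it already holds as `bytes=0-N`, are illustrative only; nothing here is a settled part of the proposed API.

```python
from typing import Dict, Optional, Tuple


def resume_headers(range_header: Optional[str], total_size: int) -> Tuple[int, Dict[str, str]]:
    """Return (offset, headers) for the next PUT of a partially uploaded file.

    range_header: what the server says it already has, e.g. "bytes=0-999",
                  or None if it has received nothing yet.
    """
    if range_header is None:
        offset = 0
    else:
        # "bytes=0-999" means bytes 0 through 999 are stored server-side.
        offset = int(range_header.rsplit("-", 1)[1]) + 1
    if offset >= total_size:
        return offset, {}  # nothing left to send
    headers = {
        "Content-Length": str(total_size - offset),
        "Content-Range": f"bytes {offset}-{total_size - 1}/{total_size}",
    }
    return offset, headers
```

The client would then seek to `offset` in the local file and send only the remaining bytes with those headers.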

It is not without its downsides, however:

- The upload API is more complicated, involving more total steps, which puts additional burden on clients.
- It puts Warehouse in the position of having to hold data that is in an inconsistent state (e.g. between steps 1 and 2), where we'll want the records to exist in the database, but not be available anywhere.
  - This opens up a question of how we handle uniqueness constraints. Off the cuff, my instinct is that you can only have one pending "release" for a specific Project + Release at a time, and if your thing fails in the meantime you have to either resume somehow, OR delete the release (so we'd also need an API for deleting a pending release). However, another option is that we don't enforce uniqueness constraints between unfinished releases until the final steps (we can of course check for uniqueness against things that have already been published).
- It also means that we can have releases in an inconsistent state indefinitely (if someone started a release and then walked away from it). We will likely need a process that periodically expires and removes any pending release that is older than some age (24h?).
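
To illustrate the first uniqueness option together with the expiry process, here is a small in-memory sketch. It assumes one pending upload per (project, version) and a 24-hour cutoff, as floated above; the class and method names are hypothetical, and a real implementation would live in the database rather than in a dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


class PendingUploadExists(Exception):
    """Raised when a pending upload already exists for this project and version."""


@dataclass
class PendingUploads:
    max_age: timedelta = timedelta(hours=24)  # assumption: the "24h?" above
    _started: dict = field(default_factory=dict)

    def start(self, project, version, now):
        """Register a pending upload, enforcing one per (project, version)."""
        key = (project, version)
        if key in self._started:
            raise PendingUploadExists(f"{project} {version} already has a pending upload")
        self._started[key] = now

    def discard(self, project, version):
        """Explicitly delete a pending upload (the 'delete' API mentioned above)."""
        self._started.pop((project, version), None)

    def expire(self, now):
        """Remove and return the keys of all pending uploads older than max_age."""
        stale = [key for key, t in self._started.items() if now - t > self.max_age]
        for key in stale:
            del self._started[key]
        return stale
```

A periodic task would call `expire(datetime.now(timezone.utc))` and clean up any stored files for the returned keys.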

Overall, I think this API for uploads allows us a lot more flexibility and gives us a much nicer UX, at the cost of some additional complexity in implementation (most of which is on the Warehouse side, but some is on the client side).

This would not change anything in the Simple API or anything else besides the actual act of uploading.

I'm going to start writing this up into a proper PEP, but I wanted to give a sort of brain dump on my thoughts here to see if anyone else had any thoughts on it.

I had a quick call with @trishankkarthik just to make sure that for any hypothetical PEP 480 world, that the above API wouldn't lock us into place. It appears that we're perfectly fine to stick the TUF metadata as a sub key under the overall JSON object that gets sent in (1).

For plans even beyond that: if we ever implement in-toto (which, who knows if we will), we would need the ability to upload multiple files in a single transaction, since in-toto uses additional files beyond the actual file payloads. One of the problems with the existing API is the lack of multi-file atomicity, which is what led to the ability to upload multiple files in a single "transaction" here, and we could then use the same mechanism for something like in-toto if we so desired. If/when PEP 480 gets implemented, we would also have the option of uploading the TUF metadata as an additional file instead of baking it into the JSON object in the initial request. There is some benefit to doing that in terms of the total size of that initial request, but the downside is that we can't verify the TUF metadata is acceptable before accepting file uploads.

brainwane · 2020-05-05T16:20:05Z

@dstufft You mentioned

I'm going to start writing this up into a proper PEP, but I wanted to give a sort of brain dump on my thoughts here to see if anyone else had any thoughts on it.

Have you heard any thoughts outside this issue? And how is the PEP going?

Per pypi/warehouse#7730 , overhauling Warehouse's upload API requires a new PEP. The fundable improvement guidelines say we shouldn't ask for money for something till we have consensus for the idea, meaning that any PEPs are finished and approved. Thus, this commit removes the upload work from the list of features in the PyPI API revamp task. Signed-off-by: Sumana Harihareswara <[email protected]>

dstufft · 2022-06-28T00:58:35Z

There is now PEP 694: Upload 2.0 API for Python Package Repositories, with discussion on discuss.python.org, which is relevant to this issue.

brainwane added the feature request, APIs/feeds, to user test, and python labels Apr 3, 2020

brainwane added this to the Post Legacy Shutdown priorities milestone Apr 3, 2020

brainwane mentioned this issue Apr 3, 2020

Run auditwheel on new manylinux uploads, reject if it fails #5420

Open

ewdurbin changed the title ~~Modify API to be asynchronous~~ Modify Upload API to be asynchronous Apr 4, 2020

brainwane mentioned this issue May 1, 2020

Add project releases RSS feed #7013

Merged

di removed the python label May 12, 2020

brainwane mentioned this issue Jun 10, 2020

Python package index upload API spec pypa/packaging-problems#128

Open

brainwane mentioned this issue Jun 10, 2020

Remove upload work from PyPI API revamp task psf/fundable-packaging-improvements#6

Merged

brainwane mentioned this issue Jun 10, 2020

Fix how Warehouse stores metadata (per-file, not per-release) #8090

Open

6 tasks

dstufft mentioned this issue Jul 13, 2020

Expose the METADATA file of wheels in the simple API #8254

Closed
