This repository has been archived by the owner on Jul 21, 2022. It is now read-only.

[BT-536] file api refactor multipart and tcp connection pooling #290

Merged
jonathanrcross merged 14 commits into master from BT-536-file-api-refactor-multipart on Jun 7, 2020

Conversation


@jonathanrcross jonathanrcross commented Jun 1, 2020

Related to: https://github.com/AtomicConductor/conductor_ae/pull/715, https://github.com/AtomicConductor/file-api/pull/24

Uploader

  • Update the request payload for the v2 signing URL (path, hash, size); see the payload sketch after this list.
  • Update the tuple pushed to the output queue for the FileStatWorker (normal presigned URL or multipart).
  • Fix AWS upload-progress reporting to the Conductor API/dashboard. The chunked-reader method cannot be used because it needs a len() function, which the generator does not provide. Since the default prepared_request streams the body, the entire file is not loaded into memory.
  • The multipart upload part size is controlled by the File-API, allowing the backend to adjust sizes as needed.
  • The FileStatWorker uses the file_size from the HttpBatchWorker response instead of an additional os.stat call, since this is already done earlier in the MD5Worker.
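
For illustration, the per-file payload for the v2 signing request might look like the sketch below. The three field names (path, hash, size) come from the bullet above; the helper name and the hash encoding are assumptions, not confirmed by this PR.

import hashlib
import os

def build_v2_signing_payload(filepath):
    # Compute the file's MD5 in 1 MiB chunks; in the real pipeline this
    # is done earlier by the MD5Worker.
    md5 = hashlib.md5()
    with open(filepath, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            md5.update(chunk)
    return {
        "path": filepath,                   # file path on disk
        "hash": md5.hexdigest(),            # hex here; the API may expect base64
        "size": os.path.getsize(filepath),  # byte size, reused by the FileStatWorker
    }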

API Client

  • Use a requests.Session object so that TCP connections are reused when the same destination host and port are accessed repeatedly. This should speed up uploads, since the three-way handshake no longer occurs on every request.
  • Add a prepared_request method: S3 returns a 501 when the Transfer-Encoding header is present, so a prepared request is needed to keep the header from being added (the default requests.Request adds it for streamed bodies). It applies the same retry logic as make_request. A sketch of this approach follows this list.
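
As a rough illustration of the two points above, here is a minimal sketch, not the actual api_client.py code; the function name, retry count, and return convention are assumptions:

import requests

session = requests.Session()  # TCP connections are pooled per host:port

def s3_prepared_put(url, data, content_length, retries=3):
    req = requests.Request("PUT", url, data=data)
    prepared = session.prepare_request(req)
    # requests adds "Transfer-Encoding: chunked" for bodies without a
    # known length (e.g. generators); S3 answers that with a 501, so
    # strip the header and send an explicit Content-Length instead.
    prepared.headers.pop("Transfer-Encoding", None)
    prepared.headers["Content-Length"] = str(content_length)
    for attempt in range(retries):
        try:
            response = session.send(prepared)
            response.raise_for_status()
            # Return the headers rather than the response object, so the
            # JobWorker doesn't accumulate response bodies in memory.
            return response.headers
        except requests.RequestException:
            # NOTE: a real retry would also need to rewind/recreate the
            # body stream before resending.
            if attempt == retries - 1:
                raise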

…ader being added. Update descriptions and v2 endpoints. Add Content-Length for s3 calls. Don't return response object on s3 calls which can cause a build up of memory due to JobWorker, return headers or None.
…se chunked reader since generator does not have len function.

@flebel flebel left a comment


This is great, thanks a bunch for making those improvements!

@jonathanrcross jonathanrcross marked this pull request as ready for review June 2, 2020 21:39

@lawschlosser lawschlosser left a comment


Looking good; a few questions/concerns.
I'm still reading through the PR, but figured I'd get you some feedback sooner rather than later.


@lawschlosser lawschlosser left a comment


A few more here. Thanks for bearing with me.

)

# report upload progress
self.metric_store.increment('bytes_uploaded', content_length, filename)

It looks like we're now only updating progress when a part (of a multipart transfer) completes (e.g. every 5 GB). In lieu of being able to use the chunked reader, this compromise is fine for now, but we should come up with a better solution for the longer term (ideally something more granular).
Another issue to consider is how to reset the progress data upon failure/retry of a part. This isn't a problem specific to your changes; I don't believe it is handled anywhere in the uploader. Just food for thought.
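
One possible longer-term direction for the granularity issue (purely illustrative, not part of this PR): wrap each part's file handle so every read() reports progress, and expose __len__ so requests can set Content-Length instead of falling back to chunked encoding.

class ProgressFileReader(object):
    """File-like wrapper that reports bytes as they are read."""

    def __init__(self, fh, length, callback):
        self._fh = fh              # handle already seek()'d to the part start
        self._total = length       # byte length of this part
        self._remaining = length
        self._callback = callback  # e.g. wraps metric_store.increment

    def __len__(self):
        return self._total         # lets requests compute Content-Length

    def read(self, size=-1):
        if size < 0 or size > self._remaining:
            size = self._remaining
        if size == 0:
            return b""
        data = self._fh.read(size)
        self._remaining -= len(data)
        self._callback(len(data))  # per-chunk progress instead of per-part
        return data

Note that this still would not address resetting the progress data when a part fails and is retried.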

Comment on lines +421 to +422
start = (part_number - 1) * part_size
fh.seek(start)

Regarding https://github.com/AtomicConductor/conductor_client/pull/290/files#r435959647,
yeah, I'm wondering about the part-size logic/math. Though efficient and minimal, I'm wondering if we should be more explicit.
For example, the code in this PR currently derives where in the file to seek by multiplying the partSize by the partNumber it's been given (minus 1). This is an efficient and elegantly minimal implementation, but it relies on a couple of implementation decisions.

In contrast, I'm wondering if we should follow a more descriptive approach that more explicitly (yet generically) describes to the client what is needed.

So instead of:

"multiPartURLs": [
    {
        ...
        "partSize": 1073741824,
        "parts: [
            {
                "partNumber": 1,
                "url: "https://www.signedurlexample.com/signature1"
            },
            {
                "partNumber": 2,
                "url: "https://www.signedurlexample.com/signature1"
            }
        ]
    }
]

We use:

"multiPartURLs": [
    {
        ...
        "size": 2147483648,  # the size of the entire file  (replaces the partSize field)
        "parts: [
            {
                "partNumber": 1,
                "url: "https://www.signedurlexample.com/signature1",
                "range": "0-1073741823"
            },
            {
                "partNumber": 2,
                "url: "https://www.signedurlexample.com/signature1",
                "range": "1073741824-2147483648"
            }
        ]
    }
]

There are a few benefits to this:

  1. The explicit range makes it clear what to upload. There's no need to derive this information from the partNumber and partSize (a consumption sketch follows this list).

  2. It inherently provides the size of each part.

  3. It remedies the misleading situation where the partSize doesn't actually reflect the size of the part being uploaded (which is often the case for the last part of a multipart transfer).

  4. By designating an explicit range for each part, it opens up the possibility of dynamic part sizes (i.e. the backend can adjust part sizes rather than always targeting 1 GB, etc.).

  5. (Arguably/superficially) Using a range for each part more closely aligns with traditional HTTP convention (i.e. the Range header).
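
For what consuming the proposed schema might look like on the client, here is a sketch (the helper name is illustrative) where each part's explicit byte range replaces the partNumber/partSize arithmetic:

def read_part(filepath, part):
    # "range" is inclusive of both endpoints, e.g. "0-1073741823".
    start, end = (int(n) for n in part["range"].split("-"))
    with open(filepath, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start + 1)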


@jonathanrcross jonathanrcross Jun 5, 2020


Personally, I don't see many major reasons to switch to the latter, but since this is a new handler/feature/implementation, we probably want to get it right and agreed upon now rather than regret it later.

On point 3: technically, if someone wrote a different implementation of the uploader and received a multipart upload response, there is nothing stopping them from NOT using all the preSignedURLs. For example, for a 6 GB file the current implementation would respond with 6 preSignedURLs based on a 1 GB partSize, but a client could use only 2 of them for a 5 GB + 1 GB split (as long as every part is under 5 GB and, except for the last, at least 5 MB), send a completeMultiPart with those ETags, and all would be well.

Backblaze, for example, calls this a recommendedPartSize for large uploads; would that be preferable/less misleading?

Point 4 is interesting, but that could always be a calculated value on the backend; there is nothing stopping us from making the partSize completely dynamic based on fileSize or other considerations. I do see your point that we could have different ranges within the given parts, but I can't really see why we would change those values per part.

These are mostly my personal opinions; if you and/or @flebel feel strongly about changing it, I have no problem making the necessary changes on the file-api side to reflect that.



If the current implementation feels good to you, carry on.

@lawschlosser lawschlosser self-requested a review June 6, 2020 01:12

@lawschlosser lawschlosser left a comment


Thanks for making those changes.

@jonathanrcross jonathanrcross merged commit dea7bb0 into master Jun 7, 2020
@delete-merged-branch delete-merged-branch bot deleted the BT-536-file-api-refactor-multipart branch June 7, 2020 15:13