This repository has been archived by the owner on Jul 21, 2022. It is now read-only.

[BT-536] file api refactor multipart and tcp connection pooling #290

Merged
jonathanrcross merged 14 commits into master from BT-536-file-api-refactor-multipart on Jun 7, 2020

Conversation


@jonathanrcross jonathanrcross commented Jun 1, 2020

Related to: https://github.com/AtomicConductor/conductor_ae/pull/715, https://github.com/AtomicConductor/file-api/pull/24

Uploader

  • Update the request payload for the v2 signing URL (path, hash, size); see the payload sketch after this list.
  • Update the tuple pushed to the output queue for the FileStatWorker (normal presigned URL or multipart).
  • Fix AWS upload-progress reporting to the Conductor API/dashboard. The chunked-reader method cannot be used because it needs a len() function, which the generator does not provide. Since the default prepared_request streams the body, the entire file is not loaded into memory.
  • The multipart upload part size is controlled by the File-API, allowing the backend to adjust sizes as needed.
  • The FileStatWorker uses the file_size from the HttpBatchWorker response instead of an additional os.stat call, since this is already done earlier in the MD5Worker.
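
For illustration, the per-file payload for the v2 signing request might look like the sketch below. The three field names (path, hash, size) come from the bullet above; the helper name and the hash encoding are assumptions, not confirmed by this PR.

import hashlib
import os

def build_v2_signing_payload(filepath):
    # Compute the file's MD5 in 1 MiB chunks; in the real pipeline this
    # is done earlier by the MD5Worker.
    md5 = hashlib.md5()
    with open(filepath, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            md5.update(chunk)
    return {
        "path": filepath,                   # file path on disk
        "hash": md5.hexdigest(),            # hex here; the API may expect base64
        "size": os.path.getsize(filepath),  # byte size, reused by the FileStatWorker
    }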

API Client

  • Use a requests.Session object so that TCP connections are reused when the same destination host and port are accessed repeatedly. This should speed up uploads, since the three-way handshake no longer occurs on every request.
  • Add a prepared_request method: S3 returns a 501 when the Transfer-Encoding header is present, so a prepared request is needed to keep the header from being added (the default requests.Request adds it for streamed bodies). It applies the same retry logic as make_request. A sketch of this approach follows this list.
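
As a rough illustration of the two points above, here is a minimal sketch, not the actual api_client.py code; the function name, retry count, and return convention are assumptions:

import requests

session = requests.Session()  # TCP connections are pooled per host:port

def s3_prepared_put(url, data, content_length, retries=3):
    req = requests.Request("PUT", url, data=data)
    prepared = session.prepare_request(req)
    # requests adds "Transfer-Encoding: chunked" for bodies without a
    # known length (e.g. generators); S3 answers that with a 501, so
    # strip the header and send an explicit Content-Length instead.
    prepared.headers.pop("Transfer-Encoding", None)
    prepared.headers["Content-Length"] = str(content_length)
    for attempt in range(retries):
        try:
            response = session.send(prepared)
            response.raise_for_status()
            # Return the headers rather than the response object, so the
            # JobWorker doesn't accumulate response bodies in memory.
            return response.headers
        except requests.RequestException:
            # NOTE: a real retry would also need to rewind/recreate the
            # body stream before resending.
            if attempt == retries - 1:
                raise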

…ader being added. Update descriptions and v2 endpoints. Add Content-Length for s3 calls. Don't return response object on s3 calls which can cause a build up of memory due to JobWorker, return headers or None.
…se chunked reader since generator does not have len function.

@flebel flebel left a comment


This is great, thanks a bunch for making those improvements!

@jonathanrcross jonathanrcross marked this pull request as ready for review June 2, 2020 21:39

@lawschlosser lawschlosser left a comment


Looking good; a few questions/concerns.
I'm still reading through the PR, but figured I'd get you some feedback sooner rather than later.


@lawschlosser lawschlosser left a comment


A few more here. Thanks for bearing with me.

)

# report upload progress
self.metric_store.increment('bytes_uploaded', content_length, filename)

It looks like we're now only updating progress when a part (of a multipart transfer) completes (e.g. every 5 GB). In lieu of being able to use the chunked reader, this compromise is fine for now, but we should come up with a better solution for the longer term (ideally something more granular).
Another issue to consider is how to reset the progress data upon failure/retry of a part. This isn't a problem specific to your changes; I don't believe it is handled anywhere in the uploader. Just food for thought.
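
One possible longer-term direction for the granularity issue (purely illustrative, not part of this PR): wrap each part's file handle so every read() reports progress, and expose __len__ so requests can set Content-Length instead of falling back to chunked encoding.

class ProgressFileReader(object):
    """File-like wrapper that reports bytes as they are read."""

    def __init__(self, fh, length, callback):
        self._fh = fh              # handle already seek()'d to the part start
        self._total = length       # byte length of this part
        self._remaining = length
        self._callback = callback  # e.g. wraps metric_store.increment

    def __len__(self):
        return self._total         # lets requests compute Content-Length

    def read(self, size=-1):
        if size < 0 or size > self._remaining:
            size = self._remaining
        if size == 0:
            return b""
        data = self._fh.read(size)
        self._remaining -= len(data)
        self._callback(len(data))  # per-chunk progress instead of per-part
        return data

Note that this still would not address resetting the progress data when a part fails and is retried.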

Comment on lines +421 to +422
start = (part_number - 1) * part_size
fh.seek(start)

Regarding https://github.com/AtomicConductor/conductor_client/pull/290/files#r435959647,
yeah, I'm wondering about the part-size logic/math. Though efficient and minimal, I'm wondering if we should be more explicit.
For example, the code in this PR currently derives where in the file to seek by multiplying the partSize by the partNumber it's been given (minus 1). This is an efficient and elegantly minimal implementation, but it relies on a couple of implementation decisions.

In contrast, I'm wondering if we should follow a more descriptive approach that more explicitly (yet generically) describes to the client what is needed.

So instead of:

"multiPartURLs": [
    {
        ...
        "partSize": 1073741824,
        "parts: [
            {
                "partNumber": 1,
                "url: "https://www.signedurlexample.com/signature1"
            },
            {
                "partNumber": 2,
                "url: "https://www.signedurlexample.com/signature1"
            }
        ]
    }
]

We use:

"multiPartURLs": [
    {
        ...
        "size": 2147483648,  # the size of the entire file  (replaces the partSize field)
        "parts: [
            {
                "partNumber": 1,
                "url: "https://www.signedurlexample.com/signature1",
                "range": "0-1073741823"
            },
            {
                "partNumber": 2,
                "url: "https://www.signedurlexample.com/signature1",
                "range": "1073741824-2147483648"
            }
        ]
    }
]

There are a few benefits to this:

  1. The explicit range makes it clear what to upload. There's no need to derive this information from the partNumber and partSize (a consumption sketch follows this list).

  2. It inherently provides the size of each part.

  3. It remedies the misleading situation where the partSize doesn't actually reflect the size of the part being uploaded (which is often the case for the last part of a multipart transfer).

  4. By designating an explicit range for each part, it opens up the possibility of dynamic part sizes (i.e. the backend can adjust part sizes rather than always targeting 1 GB, etc.).

  5. (Arguably/superficially) Using a range for each part more closely aligns with traditional HTTP convention (i.e. the Range header).
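
For what consuming the proposed schema might look like on the client, here is a sketch (the helper name is illustrative) where each part's explicit byte range replaces the partNumber/partSize arithmetic:

def read_part(filepath, part):
    # "range" is inclusive of both endpoints, e.g. "0-1073741823".
    start, end = (int(n) for n in part["range"].split("-"))
    with open(filepath, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start + 1)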


@jonathanrcross jonathanrcross Jun 5, 2020


Personally, I don't see many major reasons to switch to the latter, but since this is a new handler/feature/implementation, we probably want to get it right and agreed upon now rather than regret it later.

On point 3: technically, if someone wrote a different implementation of the uploader and received a multipart upload response, there is nothing stopping them from NOT using all the preSignedURLs. For example, for a 6 GB file the current implementation would respond with 6 preSignedURLs based on a 1 GB partSize, but a client could use only 2 of them for a 5 GB + 1 GB split (as long as every part is under 5 GB and, except for the last, at least 5 MB), send a completeMultiPart with those ETags, and all would be well.

Backblaze, for example, calls this a recommendedPartSize for large uploads; would that be preferable/less misleading?

Point 4 is interesting, but that could always be a calculated value on the backend; there is nothing stopping us from making the partSize completely dynamic based on fileSize or other considerations. I do see your point that we could have different ranges within the given parts, but I can't really see why we would change those values per part.

These are mostly my personal opinions; if you and/or @flebel feel strongly about changing it, I have no problem making the necessary changes on the file-api side to reflect that.



If the current implementation feels good to you, carry on.

@lawschlosser lawschlosser self-requested a review June 6, 2020 01:12

@lawschlosser lawschlosser left a comment


Thanks for making those changes.

@jonathanrcross jonathanrcross merged commit dea7bb0 into master Jun 7, 2020
@delete-merged-branch delete-merged-branch bot deleted the BT-536-file-api-refactor-multipart branch June 7, 2020 15:13