[BT-536] file api refactor multipart and tcp connection pooling #290
Conversation
…ader being added. Update descriptions and v2 endpoints. Add Content-Length for S3 calls. Don't return the response object on S3 calls, which can cause a build-up of memory due to JobWorker; return headers or None instead.
…se chunked reader, since a generator does not have a len() function.
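As a minimal sketch of the shape those commits describe (hypothetical helper name; assuming the client uses the `requests` library), the S3 call would return only the headers rather than the whole response:

```python
import requests

# A Session reuses pooled TCP connections across calls (the connection
# pooling half of this PR); the helper below reflects the "return headers
# or None" change described in the commit message.
session = requests.Session()

def put_part_to_s3(signed_url, data):
    """Hypothetical helper: PUT one chunk of bytes to a signed S3 URL.

    Returning a plain dict of headers (or None) instead of the
    requests.Response object lets the response be garbage-collected, so
    long-lived JobWorker threads don't accumulate memory.
    """
    resp = session.put(
        signed_url,
        data=data,
        # explicit Content-Length, per the commit message
        headers={"Content-Length": str(len(data))},
    )
    resp.raise_for_status()
    return dict(resp.headers)  # the part's ETag lives in here
```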
This is great, thanks a bunch for making those improvements!
Looking good. A few questions/concerns.
I'm still reading through the PR, but figured I would get you some feedback sooner rather than later.
A few more here. Thanks for bearing with me.
```python
# report upload progress
self.metric_store.increment('bytes_uploaded', content_length, filename)
```
It looks like we're now only updating progress when a part (of a multipart transfer) completes (e.g. every 5 GB). In lieu of being able to use the chunked reader, this compromise is fine for now, but we should come up with a better solution for the longer term (ideally something more granular).
Another issue to consider is how to reset progress data upon failure/retry of a part. This isn't a problem specific to your changes; I don't believe there is consideration for this anywhere in the uploader. Just food for thought.
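For the longer term, one hedged option (a hypothetical class built around the `metric_store` call above) is a sized file-like wrapper: unlike a bare generator it has a length, so the HTTP client can still set Content-Length, and it reports progress on every read rather than once per completed part:

```python
class ProgressFileReader(object):
    """Sketch of a more granular progress reporter (hypothetical name).

    Wraps an open file handle limited to a single part. Exposes __len__
    so HTTP clients that need a sized body (unlike a generator) can use
    it directly, while reporting each read() to the metric store.
    """

    def __init__(self, fh, part_size, metric_store, filename):
        self._fh = fh
        self._remaining = part_size
        self._size = part_size
        self._metric_store = metric_store
        self._filename = filename

    def __len__(self):
        return self._size

    def read(self, size=-1):
        if self._remaining <= 0:
            return b""
        if size < 0 or size > self._remaining:
            size = self._remaining
        data = self._fh.read(size)
        self._remaining -= len(data)
        # report progress per chunk instead of per completed part
        self._metric_store.increment('bytes_uploaded', len(data), self._filename)
        return data
```

A wrapper like this would also be a natural place to roll `bytes_uploaded` back when a part fails and is retried, per the note above.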
```python
start = (part_number - 1) * part_size
fh.seek(start)
```
Regarding https://github.com/AtomicConductor/conductor_client/pull/290/files#r435959647,
yeah, I'm wondering about the part-size logic/math. Though efficient/minimal, I'm wondering if we should be more explicit.
For example, the code in this PR currently derives where in the file to seek by multiplying the `partSize` by the `partNumber` it's been given (minus 1). This is an efficient and elegantly minimal implementation, but it relies on a couple of implementation decisions.
In contrast, I'm wondering if we should follow a more descriptive approach that more explicitly (yet generically) describes to the client what is needed.
So instead of:

```json
"multiPartURLs": [
    {
        ...
        "partSize": 1073741824,
        "parts": [
            {
                "partNumber": 1,
                "url": "https://www.signedurlexample.com/signature1"
            },
            {
                "partNumber": 2,
                "url": "https://www.signedurlexample.com/signature2"
            }
        ]
    }
]
```
We use:

```json
"multiPartURLs": [
    {
        ...
        "size": 2147483648,  # the size of the entire file (replaces the partSize field)
        "parts": [
            {
                "partNumber": 1,
                "url": "https://www.signedurlexample.com/signature1",
                "range": "0-1073741823"
            },
            {
                "partNumber": 2,
                "url": "https://www.signedurlexample.com/signature2",
                "range": "1073741824-2147483647"
            }
        ]
    }
]
```
There are a few benefits to this:

1. The explicit range makes it clear what to upload. There's no need to derive this information from the `partNumber` and `partSize`.
2. It inherently provides the size of each part.
3. It remedies the misleading situation where the `partSize` doesn't actually reflect the size of the part being uploaded (which is often the case for the last part of a multipart transfer).
4. By designating an explicit range for each part, it opens up the possibility of dynamic part sizes (i.e. the backend can potentially adjust part sizes rather than always targeting 1 GB, etc.).
5. (Arguably/superficially) Using a range for each part more closely aligns with traditional HTTP convention (i.e. using a Range header).
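To illustrate, a client consuming the proposed shape could upload a part without any `partNumber`/`partSize` math (a sketch only; the helper name and the inclusive-range convention are assumptions):

```python
import requests

def upload_part(session, filepath, part):
    """Upload one part described by the proposed {partNumber, url, range} object."""
    start, end = (int(x) for x in part["range"].split("-"))
    length = end - start + 1  # treating ranges as inclusive, like an HTTP Range header
    with open(filepath, "rb") as fh:
        fh.seek(start)
        data = fh.read(length)
    resp = session.put(part["url"], data=data,
                       headers={"Content-Length": str(len(data))})
    resp.raise_for_status()
    return part["partNumber"], resp.headers["ETag"]
```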
Personally, I don't see many major reasons to switch to the latter, but since this is a new handler/feature/implementation, we probably want to get it right and agreed upon now rather than regret it later.
Point 3: technically, if someone were to write a different implementation of the uploader and received a multipart upload response, there is nothing stopping them from NOT using all the preSignedURLs. For example, for a 6 GB file the current implementation would create and respond with six preSignedURLs based on a 1 GB partSize, but the client could use only two of them for a 5 GB + 1 GB split (as long as all parts are under 5 GB and, except for the last, over 5 MB), send a completeMultiPart with those ETags, and all would be well.
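To make that concrete, here is a small sketch of the check such an alternate uploader would have to satisfy (pure arithmetic against S3's documented multipart limits; no API calls, and the function name is hypothetical):

```python
MIN_PART = 5 * 1024 ** 2   # 5 MiB floor for every part except the last
MAX_PART = 5 * 1024 ** 3   # 5 GiB ceiling for every part

def plan_is_valid(part_sizes):
    """True if a custom split satisfies S3's multipart size constraints."""
    if not part_sizes:
        return False
    body, last = part_sizes[:-1], part_sizes[-1]
    return (all(MIN_PART <= p <= MAX_PART for p in body)
            and 0 < last <= MAX_PART)

# a 6 GiB file as 5 GiB + 1 GiB instead of six 1 GiB parts
assert plan_is_valid([5 * 1024 ** 3, 1 * 1024 ** 3])
```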
Backblaze, for example, calls it a recommendedPartSize when dealing with large uploads; would that be preferable/less misleading?
Point 4 is interesting, but that could always just be a calculated value on the backend; there is nothing stopping us from making the partSize completely dynamic based on fileSize or other considerations. I do see your point that we could have different ranges within the given parts, but I can't really see why we would change those values per part.
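For example (a sketch of the idea, not the file-api's actual logic), the backend could compute partSize from fileSize while keeping the existing response shape:

```python
import math

MIN_PART = 5 * 1024 ** 2    # S3 floor for non-final parts
MAX_PART = 5 * 1024 ** 3    # S3 ceiling per part
MAX_PARTS = 10000           # S3 cap on the number of parts
TARGET = 1024 ** 3          # the current 1 GB default

def choose_part_size(file_size):
    """Keep the 1 GB target when possible, but grow the part size for
    very large files so the plan never needs more than 10,000 parts."""
    needed = int(math.ceil(file_size / float(MAX_PARTS)))
    return min(max(TARGET, needed, MIN_PART), MAX_PART)
```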
These are mostly just my personal opinions; if you and/or @flebel feel strongly about changing it, I have no problem making the necessary changes on the file-api side to reflect that.
If the current implementation feels good to you, carry on.
Thanks for making those changes.
Related to: https://github.com/AtomicConductor/conductor_ae/pull/715, https://github.com/AtomicConductor/file-api/pull/24