File upload fails for file sizes larger than 200-500MB #18002
Comments
Thank you for reporting this issue. Could you please paste the stack trace or describe the error you are encountering? The upload method should be able to handle large uploads by chunking the data you provide.
Hi @gapra-msft
Hi @gapra-msft The Content-MD5 value is mandatory in our case. So for uploads of files larger than 256MB, how do we get the getContentMd5() value from the blob properties metadata? Can you please help us with this issue?
@harisingh-highq and @rakhmvi Thank you for clarifying the issue. According to the REST docs, Get Blob Properties only returns the Content-MD5 in the following situation: "If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity." So the service will compute the MD5 only if the blob is small enough to fit in a single Put Blob request (which I think used to be around 256MB). It looks like you will have to compute the MD5 yourself if the file is any larger and you need to get the MD5 back.
Hi @gapra-msft, thanks for the quick response on this thread. @harisingh-highq, @rakhmvi, and I are all working on the same project, and we are stuck at this stage of fetching the MD5 for larger files. One question about your last comment: you suggest that for large files (i.e. >256 MB) we should calculate the MD5 ourselves and pass it in the upload call, right? That seems strange, because MD5 is generally used to verify that data has not been corrupted on storage. If I pass the MD5 at upload time, and some corruption happens on storage later, then when I check the MD5 in the metadata I will just get back the same MD5 I sent in the past. So how can we detect whether the file is corrupted or not?
Hi @gapra-msft, thank you for your efforts. Azure calculates the MD5 value for a file whose size is less than the maxSingleUploadSize value of ParallelTransferOptions (the default is 256MB). If the size of the data is less than or equal to this value, it is uploaded in a single put rather than broken up into chunks. For that single put, Azure compares the MD5 value passed with the request against the one it calculates on the service side; if the values differ, the file is not stored. @gapra-msft Could you confirm or clarify the above?
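For context, a minimal sketch of how that maxSingleUploadSize threshold and the chunked path are configured on upload. This assumes a recent 12.x SDK where the fluent Long setters exist (older releases such as 12.7.0 expose the same settings through ParallelTransferOptions constructor arguments), and the client, stream, and length are placeholders taken from the earlier snippets in the thread:

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.models.ParallelTransferOptions;
import com.azure.storage.blob.options.BlobParallelUploadOptions;
import java.io.InputStream;

public static void uploadChunked(BlobClient blobClient, InputStream inputStream, long length) {
    // Anything at or below maxSingleUploadSize goes up as one Put Blob (the service computes
    // the Content-MD5 in that case); anything larger is split into blocks of blockSize and
    // committed with Put Block List (no service-side Content-MD5).
    ParallelTransferOptions transferOptions = new ParallelTransferOptions()
            .setMaxSingleUploadSizeLong(4L * 1024 * 1024)   // 4 MB single-put threshold
            .setBlockSizeLong(8L * 1024 * 1024)             // 8 MB blocks for the chunked path
            .setMaxConcurrency(4);                          // stage up to 4 blocks in parallel

    blobClient.uploadWithResponse(
            new BlobParallelUploadOptions(inputStream, length)
                    .setParallelTransferOptions(transferOptions),
            null,   // no timeout
            null);  // no Context
}
```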
Hi @gapra-msft Apart from @niravravalhighq's comment, we are also stuck on the MD5 calculation logic on our side. To calculate the MD5 of an input stream, we need to read that input stream while computing the digest. Our MD5 calculation method has the signature below:

public String getFileChecksumForInputStream(InputStream is, byte[] byteArray) throws IOException {

After executing the above code, we pass the same input stream to the Azure upload API:

blobClient.upload(is, length, true);

This throws an exception because the stream is already closed and we are trying to read it a second time.

Note: we have a strict requirement not to store the input stream in a local temp file; we work directly with the request input stream. Can you please help us solve this issue, i.e. how we can use the same input stream for both the MD5 calculation and the upload? Thank you, your help would be appreciated.
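The body of that method is not included in the comment; a minimal sketch of what a checksum helper with that signature typically looks like (only the signature comes from the comment, the body is an assumption):

```java
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public String getFileChecksumForInputStream(InputStream is, byte[] byteArray) throws IOException {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        int read;
        // Reading the stream here is exactly what consumes it, which is why the
        // later blobClient.upload(is, length, true) call finds nothing left to read.
        while ((read = is.read(byteArray)) != -1) {
            md.update(byteArray, 0, read);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    } catch (NoSuchAlgorithmException e) {
        throw new IOException("MD5 algorithm not available", e);
    }
}
```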
Ah, so storage has two concepts of MD5: the transactional Content-MD5 that is validated on each individual request, and the blob-level Content-MD5 property that is simply stored with the blob and returned on Get Blob Properties.
@harisingh-highq I think the problem in your code is that you are closing the input stream before passing it to the client. I think you need to keep the InputStream open and reset it before passing it to the SDK.
@gapra-msft FYI, even if I remove the close() call and reset() the stream before passing it to the SDK, it still does not work. Setting mark(0) does not help either, because once the stream has been read we cannot read it again just by resetting its pointer.

2nd way: I also tried copying the input stream to a ByteArrayOutputStream for a later re-read, but that loads the full file into memory; in that code, baos.toByteArray() loads the whole stream into JVM memory.

3rd way:
Now can you please provide us a solution for calculating the MD5 and uploading the file simultaneously?
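One way to meet the "single pass, no temp file" constraint is to wrap the request stream in java.security.DigestInputStream, so the digest is updated as the SDK consumes the stream during upload; a minimal sketch (blobClient, requestInputStream, and length stand for the objects used earlier in the thread):

```java
import com.azure.storage.blob.BlobClient;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;

public static String uploadAndComputeMd5(BlobClient blobClient, InputStream requestInputStream,
        long length) throws NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("MD5");
    // Every byte the SDK reads for the upload also passes through the digest,
    // so the stream is consumed exactly once and nothing is buffered to disk or memory.
    DigestInputStream digestStream = new DigestInputStream(requestInputStream, md);

    blobClient.upload(digestStream, length, true);   // same call as in the earlier snippet

    // After the upload finishes, the digest covers everything that was sent.
    return Base64.getEncoder().encodeToString(md.digest());
}
```

The returned value can then be stored in metadata or compared against a known checksum; whether this fits depends on whether the digest must be known before the upload starts.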
Hi @gapra-msft ,
As BlobParallelUploadOptions has no setComputeMd5 in the 12.7 version of the SDK, does the 12.7 SDK guarantee that data cannot be corrupted during the upload process?
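Later 12.x releases of azure-storage-blob added a computeMd5 flag to BlobParallelUploadOptions, which asks the client to compute a transactional MD5 for each request so the service can validate it in transit; a minimal sketch, assuming an SDK version where this setter exists (it is not available in 12.7.0, as noted above):

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.options.BlobParallelUploadOptions;
import java.io.InputStream;

public static void uploadWithTransactionalMd5(BlobClient blobClient, InputStream data, long length) {
    blobClient.uploadWithResponse(
            new BlobParallelUploadOptions(data, length)
                    .setComputeMd5(true),   // client computes an MD5 per request for transport validation
            null, null);
}
```

Note that this protects each request in transit; it does not make the service store or maintain a blob-level Content-MD5 for chunked uploads.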
@gapra-msft Please take note of the important questions below:
Thanks for your questions,
As for your code snippets that fail, I think you need to call mark with a large enough value that you will still be able to reset after reading to the end of the stream. When you say mark(0), I think you mean you want to come back to position 0. How mark actually works is that the mark is set at the stream's current position, and the int passed is how many bytes you can read before you lose the ability to reset back to the mark. I hope that makes sense.
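A small illustration of that read-limit semantics (file name and sizes are hypothetical; only streams that support marking, such as BufferedInputStream, honor this):

```java
import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public static void markResetExample() throws IOException {
    long length = 300L * 1024 * 1024;   // hypothetical ~300 MB payload
    try (InputStream is = new BufferedInputStream(new FileInputStream("large-file.bin"))) {
        // The argument is a read limit, not a position: up to this many bytes may be
        // read before the mark becomes invalid. mark(0) allows essentially no re-read.
        is.mark((int) Math.min(length, Integer.MAX_VALUE));

        // first pass: read the stream to compute the MD5 ...

        is.reset();   // back to the marked position (the start, in this case)
        // second pass: hand the stream to the SDK for upload ...
    }
}
```

The catch, as the earlier comments point out, is that a BufferedInputStream must hold the marked bytes in memory to make reset possible, so for multi-hundred-MB payloads this costs roughly as much memory as copying the stream, which is why a single-pass approach such as a DigestInputStream wrapper tends to be preferable here.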
@gapra-msft Finally, we would like to update you that we are able to calculate the Content-MD5 during file upload another way, and also set it in the metadata successfully. Now, as per your previous comment, we are worried about a data integrity issue in the Azure service: it works fine for smaller files, but we found a limitation for larger files.
(For smaller files < 256MB, the Content-MD5 value is in sync with the actual data of the blob. But for larger files > 256MB, the Content-MD5 value is not kept in sync with the actual data of the blob.) I also have a query regarding our final solution, as below: as you can see in the highlighted area, we have used BlobAsyncClient for the Azure blob client connection. So is there any difference between BlobAsyncClient and BlobClient in terms of performance, concurrency, network glitches, multithreading, async behavior, upload time, etc.? Can you please clarify this? I look forward to hearing from you.
Great to hear that you got your solution working!

**_Now as per your previous comment, we're worried about a data integrity issue in the Azure service: it works fine for smaller files, but we found a limitation for larger files. (For smaller files < 256MB, the Content-MD5 value is in sync with the actual data of the blob. But for larger files > 256MB, the Content-MD5 value is not in sync with the actual data of the blob.)_**

The BlobHttpHeaders Content-MD5 value is just a convenient place to store your data's expected MD5 value. The value stored on the service provides no guarantee that it is the MD5 of the data in storage. I would like to clarify the service behavior: the service will automatically compute the MD5 of data that is uploaded in a single Put Blob request (around 256MB, though this limit has increased recently), since in that case the service can be sure it has the entire content of the blob and the blob is essentially immutable. However, for larger blobs, when a user calls multiple Stage Block operations followed by Commit Block List, the content of the blob can change, which would require recalculating the MD5. This is expensive, and the storage service does not support that functionality.

**_So is there any difference between BlobAsyncClient vs BlobClient in terms of performance, concurrency, network glitches, multithreading, async behavior, upload time, etc.?_**

The BlobOutputStream API is inherently a sync API that uses async buffered upload under the hood, whether you get there from a BlobClient or a BlobAsyncClient. The following code snippets get the same BlobOutputStream (the second snippet is the recommended way to get a BlobOutputStream, only because OutputStream is a sync operation).
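The referenced snippets are not preserved in this copy of the thread; a minimal sketch of the sync path described above (getBlobOutputStream with no arguments fails if the blob already exists; sourceStream is a placeholder):

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.specialized.BlobOutputStream;
import com.azure.storage.blob.specialized.BlockBlobClient;
import java.io.IOException;
import java.io.InputStream;

public static void writeThroughOutputStream(BlobClient blobClient, InputStream sourceStream)
        throws IOException {
    // Sync path: BlobClient -> BlockBlobClient -> BlobOutputStream.
    BlockBlobClient blockBlobClient = blobClient.getBlockBlobClient();
    try (BlobOutputStream out = blockBlobClient.getBlobOutputStream()) {
        byte[] buffer = new byte[8 * 1024];
        int read;
        while ((read = sourceStream.read(buffer)) != -1) {
            out.write(buffer, 0, read);   // buffered and staged as blocks under the hood
        }
    }   // closing the stream commits the block list
}
```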
Hi @gapra-msft Now we are facing a new issue in file upload with the Apache HTTP server, only in the case of large file sizes. Please help us sort out this issue.
Thanks for posting this new issue. I can take a look at it and respond on that thread. Since this issue seems to have been resolved, could we close this one?
Hi @gapra-msft |
azure-storage-blob SDK version=12.7.0
Java version: OpenJDK 11
To upload a file to Azure Storage, we are using the code below, but we found that it fails for files larger than about 200-500MB.
BlobClient blobClient = AzureHelper.getBlobContainerClient(AzureHelper.getBlobServiceClient(serviceEndpoint,
account, key), container).getBlobClient(destFile);
I think this is caused by the asynchronous behavior of the blobClient.upload method.
Can anyone help me sort out this large file upload issue?
Also, please note that after uploading a file, we read getContentMd5() of the uploaded file from the blob properties metadata.
So if we adopt another solution, such as BlobRequestOptions, how do we get getContentMd5() for a file uploaded in chunks, and confirm that it returns the same MD5 value as a single upload?
So we want a solution that handles the large file size issue and also provides the Content-MD5 of that file.
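For reference, a minimal sketch of the read-back described above (standard azure-storage-blob calls; whether getContentMd5() returns a value depends on whether the blob was small enough for a single put or the client set the header itself):

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.models.BlobProperties;
import java.util.Base64;

public static void printContentMd5(BlobClient blobClient) {
    BlobProperties properties = blobClient.getProperties();
    byte[] contentMd5 = properties.getContentMd5();   // may be null for chunked uploads
    if (contentMd5 != null) {
        System.out.println("Content-MD5: " + Base64.getEncoder().encodeToString(contentMd5));
    } else {
        // The service only computes this for blobs that fit in a single Put Blob request;
        // for block-list uploads it is present only if the client set it explicitly.
        System.out.println("No Content-MD5 stored for this blob.");
    }
}
```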