File upload getting failed for file size larger than 200-500MB #18002

Closed
harisingh-highq opened this issue Dec 7, 2020 · 17 comments
Assignees
gapra-msft

Labels
Client · customer-reported · question · Storage

Comments

@harisingh-highq

harisingh-highq commented Dec 7, 2020

azure-storage-blob SDK version: 12.7.0
Java version: OpenJDK 11

To upload a file to Azure Storage we use the code below, but we have found that it fails for files larger than roughly 200-500 MB.
BlobClient blobClient = AzureHelper.getBlobContainerClient(
        AzureHelper.getBlobServiceClient(serviceEndpoint, account, key), container)
    .getBlobClient(destFile);

bin = prepareEncryptStream(encryptionFlag, is, parametersMap);

blobClient.upload(bin, length, true);

BlobProperties prop = blobClient.getProperties();
responseMap.put(HybridConstants.PARAM_FILE_SIZE, String.valueOf(prop.getBlobSize()));

byte[] contentMd5 = prop.getContentMd5();

String md5Value = RepositoryHelper.getInstance().convertMD5ByteToHexEncodedString(contentMd5);

I think this is caused by the asynchronous behavior of the blobClient.upload method.
[screenshot]

Can anyone help me sort out this large-file upload issue?
Also, please note that after uploading a file we read getContentMd5() of the uploaded file from the blob properties.
So if we switch to another approach such as BlobRequestOptions, how do we get the Content-MD5 of a file uploaded in chunks, and can we confirm it returns the same MD5 value as a single-shot upload?

We need a solution that handles the large file size issue and also gives us the Content-MD5 of that file.

@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Dec 7, 2020
@alzimmermsft alzimmermsft added Client This issue points to a problem in the data-plane of the library. Storage Storage Service (Queues, Blobs, Files) labels Dec 7, 2020
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Dec 7, 2020
@gapra-msft
Member

@harisingh-highq

Thank you for reporting this issue.

Could you please paste the stack trace or describe the error you are encountering? The upload method should be able to handle large uploads by chunking the data you provide.

@gapra-msft gapra-msft self-assigned this Dec 7, 2020
@rakhmvi

rakhmvi commented Dec 8, 2020

Hi @gapra-msft
in my case I use code similar to @harisingh-highq's to upload buffered stream data to Azure. I've noticed that files larger than 300 MB have an empty MD5 value in their properties. This value is mandatory in my case. Could you suggest how to get the MD5 populated?
Kind regards

@harisingh-highq
Author

harisingh-highq commented Dec 8, 2020

Hi @gapra-msft
FYI, the larger file is uploaded successfully with the code above, but as @rakhmvi commented, getContentMd5() returns null for files larger than approximately 255 MB, with no exception thrown.
[screenshots]

This content-MD5 value is mandatory in our case.
Also, I checked that if I upload a file in the Azure portal and edit it after upload, its Content-MD5 value changes as well, which is the correct and expected behavior.

So for large file (> 256 MB) uploads, how do we get the getContentMd5() value from the blob's metadata?

Can you please help us with this issue?
Thanks

@gapra-msft
Member

@harisingh-highq and @rakhmvi Thank you for clarifying the issue.

According to the REST docs, Get Blob Properties only returns the Content-MD5 in the following situations:

If the Content-MD5 header has been set for the blob, this response header is returned so that the client can check for message content integrity.
In version 2012-02-12 and newer, Put Blob sets a block blob’s MD5 value even when the Put Blob request doesn’t include an MD5 header.

So the service will compute the md5 if the blob is small enough to fit in a single put blob request (which I think used to be around 256MB). It looks like you will have to compute the md5 yourself if the file is any larger and you need to get the md5 back.

@niravravalhighq

niravravalhighq commented Dec 9, 2020

Hi @gapra-msft, thanks for the quick response on this thread. @harisingh-highq, @rakhmvi, and I are all working on the same project, and we are stuck at this stage: fetching the MD5 for large files.

One question about your last comment: you are suggesting that we should calculate the MD5 ourselves and pass it in the upload call for large files, i.e. > 256 MB, right?

That seems strange, because MD5 is generally used to make sure the data has not been corrupted in storage. If I pass an MD5 at upload time and some corruption happens in storage later, then when I check the MD5 in the metadata I will just get back the same MD5 I sent in the past, so how can we detect whether the file is corrupted?

@rakhmvi

rakhmvi commented Dec 9, 2020

Hi @gapra-msft , thank you for your efforts.

Azure calculates the MD5 value for a file whose size is less than the maxSingleUploadSize value (default 256 MB) of ParallelTransferOptions. If the size of the data is less than or equal to this value, it is uploaded in a single put rather than broken up into chunks, and for that single put Azure compares the MD5 value passed in the request with the one it calculates on the service side; if the values differ, the file is not stored.
However, any value (even a dummy one) can be passed for a file whose size is larger than the maxSingleUploadSize value (default 256 MB) of ParallelTransferOptions, and the file is successfully stored with that (even dummy) MD5.

@gapra-msft Could you confirm or clarify above?
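For reference, here is a minimal sketch of where that threshold lives in the SDK, assuming a recent azure-storage-blob 12.x release; the sizes shown and the blobClient/inputStream/length names are illustrative placeholders, not values from this thread:

import com.azure.storage.blob.models.ParallelTransferOptions;
import com.azure.storage.blob.options.BlobParallelUploadOptions;

// Data at or below maxSingleUploadSize is sent as a single Put Blob, so the
// service computes and stores Content-MD5 itself; anything larger is split
// into blocks of blockSize and committed with Put Block List, with no
// service-computed whole-blob MD5.
ParallelTransferOptions transferOptions = new ParallelTransferOptions()
        .setMaxSingleUploadSizeLong(64L * 1024 * 1024) // single-put threshold
        .setBlockSizeLong(8L * 1024 * 1024)            // block size for chunked uploads
        .setMaxConcurrency(4);                         // parallel block uploads

blobClient.uploadWithResponse(
        new BlobParallelUploadOptions(inputStream, length)
                .setParallelTransferOptions(transferOptions),
        null, null);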

@harisingh-highq
Author

harisingh-highq commented Dec 9, 2020

Hi @gapra-msft
Thank you for your quick response.

Apart from @niravravalhighq's comment, we are also stuck on the MD5 calculation logic on our side.

To calculate the MD5 of an input stream we need to read that stream while computing the MD5, and as far as I know, once an input stream has been read it cannot be re-read; it is exhausted.

This is our MD5 calculation logic:

public String getFileChecksumForInputStream(InputStream is, byte[] byteArray) throws IOException {
    MessageDigest digest = null;
    StringBuilder sb = new StringBuilder();
    try {
        digest = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
        logger.error(ExceptionUtilityHelper.accessExceptionStackTrace(e));
    }
    if (digest != null) {
        int bytesCount = 0;
        // Read the input stream and update the message digest
        while ((bytesCount = is.read(byteArray)) != -1) {
            digest.update(byteArray, 0, bytesCount);
        }
        is.close();
        byte[] bytes = digest.digest();
        for (int i = 0; i < bytes.length; i++) {
            sb.append(Integer.toString((bytes[i] & 0xff) + 0x100, 16).substring(1));
        }
    }
    String sbString = sb.toString();
    sb.setLength(0);
    return sbString;
}

After executing the code above, I pass the same input stream to the Azure upload API as below:

blobClient.upload(is, length, true);

It throws the exception below, because the stream is already closed and we are trying to read it a second time.
[screenshot]

Note: we have a strict requirement not to store the input stream in a local temp file. We work directly with the request input stream.

Can you please help us solve this issue, i.e. how can we use the same input stream
both for the MD5 calculation and as the upload API parameter?

Thank you. Your help would be appreciated.

@gapra-msft
Member

Ah, so Storage has two concepts of MD5:

  1. transactionalMd5 - this is the MD5 of the data sent in a request (so for a chunked upload it is the MD5 of each chunk). The service validates this MD5 to make sure there were no corruptions over the network, but it does not store it. The SDK will actually compute this for you if you set BlobParallelUploadOptions.computeMd5.

  2. BlobHttpHeaders.contentMd5 - this is blob based (just stored with the blob and not actually checked by the service) and returned when you call getProperties. This is the dummy MD5 that @rakhmvi pointed out.

@harisingh-highq I think the problem in your code is that you are closing the input stream before passing it to the client. I think you need to keep the IS open and reset it before passing it to the SDK.
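As an illustration of point 1 above, here is a minimal sketch of the transactional MD5 option, assuming an SDK version that exposes it (as noted further down, it is not available in 12.7); blobClient, inputStream and length are placeholders:

import com.azure.storage.blob.options.BlobParallelUploadOptions;

// The SDK computes an MD5 for each request it sends (each block of a chunked
// upload); the service verifies it on receipt and then discards it. It is a
// wire-integrity check only and is never returned by getProperties().
blobClient.uploadWithResponse(
        new BlobParallelUploadOptions(inputStream, length)
                .setComputeMd5(true),
        null, null);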

@harisingh-highq
Author

harisingh-highq commented Dec 10, 2020

@gapra-msft
Thank you for your valuable response.

FYI, even if I remove the close() call and reset() the stream before passing it to the SDK, it still does not work.

[screenshot]

It throws the exception below.
[screenshot]

Setting the mark position with mark(0) does not work either, because once the stream has been read it cannot be read again just by resetting its pointer.

2nd way:

I also tried copying the input stream into a ByteArrayOutputStream for a later re-read, but that loads the whole file into memory, so for a large file stream it causes memory issues.

[screenshot]

In the code above, baos.toByteArray() loads the whole stream into JVM memory.

3rd way:
I tried PipedOutputStream and PipedInputStream, which read and write in parallel, but with this solution the code gets stuck at the pout.write(byteBuffer.array(), 0, len); line after reading some bytes.
BlobClient blobClient = AzureHelper.getBlobContainerClient(
        AzureHelper.getBlobServiceClient(serviceEndpoint, account, key), container)
    .getBlobClient(destFile);

PipedOutputStream pout = new PipedOutputStream();
PipedInputStream pin = new PipedInputStream(pout, 1024 * 4 * 6);

bin = prepareEncryptStream(encryptionFlag, is, parametersMap, pin);

ReadableByteChannel channel = Channels.newChannel(bin);

MessageDigest plainDigest = MessageDigest.getInstance("MD5");
int readBufferSize = 1024 * 4; // read 4096 bytes per channel.read call
java.nio.ByteBuffer byteBuffer = java.nio.ByteBuffer.allocate(readBufferSize); // 4 KB buffer
int len;
while ((len = channel.read(byteBuffer)) >= 0) {
    if (len != 0) {
        // plain input stream hash value calculation
        plainDigest.update(byteBuffer.array(), 0, len);
        pout.write(byteBuffer.array(), 0, len); // <-- gets stuck here once the pipe buffer fills
    }
    byteBuffer.clear();
}
if (plainDigest != null) {
    byte[] plainBytes = plainDigest.digest();
    StringBuilder plainHash = new StringBuilder();
    for (int i = 0; i < plainBytes.length; i++) {
        plainHash.append(Integer.toString((plainBytes[i] & 0xff) + 0x100, 16).substring(1));
    }
    responseMap.put(HybridConstants.PARAM_ENCRYPTED_FILE_HASH_VALUE, plainHash.toString());
}

BufferedInputStream bis = new BufferedInputStream(pin, readBufferSize);

blobClient.upload(bis, length, true);

Can you please suggest a solution that lets us calculate the MD5 and upload the file simultaneously?
Thank you
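One way to do both in a single pass, without a temp file and without re-reading the stream, is to wrap the source stream in java.security.DigestInputStream so the digest is updated as the SDK consumes the bytes. This is only a sketch of that idea, not the approach later confirmed in this thread, and it assumes the InputStream-based upload reads the stream sequentially exactly once; bin, length and blobClient are the variables from the snippet at the top of the issue:

import java.security.DigestInputStream;
import java.security.MessageDigest;
import com.azure.storage.blob.models.BlobHttpHeaders;

MessageDigest md5 = MessageDigest.getInstance("MD5");

// Every byte the upload pulls from digestStream also updates md5, so the hash
// is computed during the upload itself - no second read of the stream needed.
try (DigestInputStream digestStream = new DigestInputStream(bin, md5)) {
    blobClient.upload(digestStream, length, true);
}

byte[] contentMd5 = md5.digest(); // MD5 of exactly the bytes that were uploaded

// Persist it as the blob's Content-MD5 header (note: setHttpHeaders replaces
// all HTTP headers, so set any other headers you need in the same call).
blobClient.setHttpHeaders(new BlobHttpHeaders().setContentMd5(contentMd5));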

@rakhmvi

rakhmvi commented Dec 11, 2020

Hi @gapra-msft ,

transactionalMd5 - this is the md5 of the data sent in the request (so for a chunked upload, it would be the md5 of the chunk) and the service will validate this md5 to make sure there were no corruptions over the network, but the service will not store this md5. The SDK will actually compute this for you if you set BlobParallelUploadOptions.computeMd5.

Since BlobParallelUploadOptions has no setComputeMd5 in version 12.7 of the SDK, does the 12.7 SDK guarantee that data cannot be corrupted during the upload process?
Kind regards

@harisingh-highq
Author

@gapra-msft
I have a question regarding setting the Content-MD5 value in the blob's HTTP headers using setHttpHeadersWithResponse, as below:

blobClient.setHttpHeadersWithResponse(new BlobHttpHeaders()
        .setContentMd5(Hex.decodeHex(dummyString))
        .setContentType("application/octet-stream"), requestConditions)
    .subscribe(response ->
        System.out.printf("Set HTTP headers completed with status %d%n", response.getStatusCode()));

Please take note of the important questions below:

  • Suppose I am able to set the Content-MD5 value successfully in the blob properties with the code above; if I then edit the file manually from Azure Storage Explorer, will the Content-MD5 value in the blob properties change accordingly? (For a large file > 256 MB.)

  • If I pass the Content-MD5 at upload time and some corruption happens in storage later, then checking the MD5 in the metadata will just return the same MD5 I sent in the past, so how can we detect whether the file is corrupted?
    (We need a real-time MD5 value to ensure data integrity.)

  • Also, as @rakhmvi asked above: does the service validate the supplied Content-MD5 against the actual content MD5 to ensure there is no corruption during the upload process?

@gapra-msft
Member

Thanks for your questions,
@rakhmvi

  • Since BlobParallelUploadOptions has no setComputeMd5 in version 12.7 of the SDK, does the 12.7 SDK guarantee that data cannot be corrupted during the upload process?
    No, the 12.7 SDK did not guarantee there were no corruptions during a network transfer; we exposed this feature so customers can be sure their data was transferred successfully over the network.

@harisingh-highq

  • Suppose I am able to set the Content-MD5 value successfully in the blob properties with the code above; if I then edit the file manually from Azure Storage Explorer, will the Content-MD5 value in the blob properties change accordingly? (For a large file > 256 MB.)
    If your large file has some Content-MD5 set and you modify the file, the service will not update the Content-MD5 accordingly. It is up to whoever updates the file to call setHttpHeaders and update the Content-MD5. Alternatively, you can set BlobHttpHeaders.contentMd5 when calling upload or commitBlockList, so it doesn't have to be a separate network request (see the sketch after this list).

  • If I pass the Content-MD5 at upload time and some corruption happens in storage later, then checking the MD5 in the metadata will just return the same MD5 I sent in the past, so how can we detect whether the file is corrupted?
    (We need a real-time MD5 value to ensure data integrity.)

    The BlobHttpHeaders.content-md5 value is just a convenient place to store your data's expected md5 value. The value stored on the service provides no guarantees that it is the md5 of the data in storage. To detect if the file is corrupted, you will have to download the file, compute the md5 and compare it to the expected md5 of the data.

  • Also, as @rakhmvi asked above: does the service validate the supplied Content-MD5 against the actual content MD5 to ensure there is no corruption during the upload process?
    I think I answered above. The computeMd5 parameter is to ensure successful transfer over the wire and is not stored by the service.
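Referring back to the first answer above, here is a minimal sketch of setting the stored Content-MD5 at upload time rather than in a separate setHttpHeaders call afterwards, assuming an SDK version with BlobParallelUploadOptions; precomputedMd5 stands for whatever MD5 bytes the caller has already calculated:

import com.azure.storage.blob.models.BlobHttpHeaders;
import com.azure.storage.blob.options.BlobParallelUploadOptions;

// For blobs larger than the single-put threshold the service only stores this
// value; it does not verify it against the blob content.
blobClient.uploadWithResponse(
        new BlobParallelUploadOptions(inputStream, length)
                .setHeaders(new BlobHttpHeaders().setContentMd5(precomputedMd5)),
        null, null);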

As for your code snippets that fail, I think you need to call mark with a read limit large enough that you can still reset after reading to the end of the stream. When you say mark(0), I think you mean you want to come back to position 0. That is not how mark works: the mark is set at the stream's current position, and the int you pass is how many bytes you can read before the mark becomes invalid and you can no longer reset back to it. I hope that makes sense.
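A tiny illustration of those mark/reset semantics (plain JDK, unrelated to the storage SDK):

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;

BufferedInputStream in = new BufferedInputStream(
        new ByteArrayInputStream("hello world".getBytes()));

in.mark(1024);         // remember the current position; 1024 is how many bytes
                       // may be read before the mark becomes invalid
in.read(new byte[5]);  // consume "hello"
in.reset();            // back to the marked position, not to position 0

For a request stream that is hundreds of megabytes long, a read limit that large effectively means buffering everything read since the mark, which is likely why mark/reset alone did not help here.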

@harisingh-highq
Author

@gapra-msft
Thank you for your continuous response.

Finally, we would like to update you that we are now able to calculate the Content-MD5 while uploading the file (using a different approach) and set it in the blob properties successfully.

Based on your previous comment, we are concerned about a data-integrity gap in the Azure service: things work fine for smaller files, but there is a limitation for larger files.

The BlobHttpHeaders.content-md5 value is just a convenient place to store your data's expected md5 value. The value stored on the service provides no guarantees that it is the md5 of the data in storage.

(For smaller files < 256 MB, the Content-MD5 value is in sync with the actual blob data. For larger files > 256 MB, it is not.)

I also have a question about our final solution, shown below:
[screenshot]

As you can see in the highlighted area, we used BlobAsyncClient for the Azure blob client connection
and BlobOutputStream for writing the uploaded file.

So is there any difference between BlobAsyncClient and BlobClient in terms of performance, concurrency, resilience to network glitches, multithreading, async behavior, upload time, etc.?

Can you please clarify this? I look forward to hearing from you.
Thank you.

@gapra-msft
Member

@harisingh-highq

Great to hear that you got your solution working!

**_Based on your previous comment, we are concerned about a data-integrity gap in the Azure service: things work fine for smaller files, but there is a limitation for larger files.

The BlobHttpHeaders.content-md5 value is just a convenient place to store your data's expected md5 value. The value stored on the service provides no guarantees that it is the md5 of the data in storage.

(For smaller files < 256 MB, the Content-MD5 value is in sync with the actual blob data. For larger files > 256 MB, it is not.)_**

I would like to clarify the service behavior. The service automatically computes the MD5 of data uploaded in a single put request (around 256 MB, though this limit has increased recently), because in that case the service knows it has the entire content of the blob and the blob is essentially immutable. However, for larger blobs, where the client makes multiple Stage Block calls followed by Commit Block List, the content of the blob can change, which would require recalculating the MD5; this is expensive, and the storage service does not support it.
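To make that distinction concrete, here is a rough sketch of the block-based path the service uses for large blobs, assuming a BlockBlobClient obtained from the same blobClient; the block contents and IDs are placeholders:

import com.azure.storage.blob.specialized.BlockBlobClient;
import java.io.ByteArrayInputStream;
import java.util.Arrays;
import java.util.Base64;

BlockBlobClient blockClient = blobClient.getBlockBlobClient();

byte[] part1 = new byte[4 * 1024 * 1024];
byte[] part2 = new byte[4 * 1024 * 1024];
String id1 = Base64.getEncoder().encodeToString("block-0001".getBytes());
String id2 = Base64.getEncoder().encodeToString("block-0002".getBytes());

// Each Put Block request can carry (and have verified) a transactional MD5
// for that block only.
blockClient.stageBlock(id1, new ByteArrayInputStream(part1), part1.length);
blockClient.stageBlock(id2, new ByteArrayInputStream(part2), part2.length);

// Put Block List assembles the blob from the staged blocks; the service does
// not compute a whole-blob Content-MD5 here, so getProperties().getContentMd5()
// stays null unless the caller sets it.
blockClient.commitBlockList(Arrays.asList(id1, id2));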

So is there any difference between BlobAsyncClient and BlobClient in terms of performance, concurrency, resilience to network glitches, multithreading, async behavior, upload time, etc.?
In general, BlobClient simply wraps BlobAsyncClient and blocks on the corresponding call to make it synchronous. In Reactor (async operations) there are nice ways to handle concurrency and multithreaded scenarios, and you can use it if you are familiar with, or comfortable learning, that style of programming.

But the BlobOutputStream API is inherently a sync API that uses async buffered upload under the hood, whether you reach it from a BlobClient or a BlobAsyncClient. The following two snippets produce the same BlobOutputStream (the second is the recommended way to get one, simply because OutputStream is a synchronous construct):
BlobOutputStream.blockBlobOutputStream(blobAsyncClient, blobOptions, Context.NONE);
blobClient.getBlockBlobClient().getBlobOutputStream(blobOptions);
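For completeness, a rough sketch of that OutputStream-based path combined with an on-the-fly digest, in the spirit of the approach described above; the exact code from this thread is only shown as a screenshot, so sourceStream, the overwrite flag and the buffer size here are assumptions:

import com.azure.storage.blob.models.BlobHttpHeaders;
import com.azure.storage.blob.specialized.BlobOutputStream;
import com.azure.storage.blob.specialized.BlockBlobClient;
import java.security.MessageDigest;

BlockBlobClient blockClient = blobClient.getBlockBlobClient();
MessageDigest md5 = MessageDigest.getInstance("MD5");

// Copy the source stream into the blob while updating the digest, so the MD5
// is computed in the same single pass as the upload.
try (BlobOutputStream out = blockClient.getBlobOutputStream(true)) {
    byte[] buffer = new byte[4 * 1024 * 1024];
    int read;
    while ((read = sourceStream.read(buffer)) != -1) {
        md5.update(buffer, 0, read);
        out.write(buffer, 0, read);
    }
} // close() flushes the remaining data and commits the block list

// Store the computed value as the blob's Content-MD5 header afterwards.
blockClient.setHttpHeaders(new BlobHttpHeaders().setContentMd5(md5.digest()));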

@harisingh-highq
Author

Hi @gapra-msft
As per the conversation above, we are now able to upload large files with the MD5 calculation successfully.

Now we are facing a new issue with file upload through an Apache HTTP server, only for large files.
Please refer to #18700.

Please help us sort out that issue.

@gapra-msft
Member

Hi @harisingh-highq

Thanks for posting the new issue. I can take a look at it and respond on that thread. Since this issue seems to have been resolved, could we close it?

@harisingh-highq
Author

harisingh-highq commented Feb 5, 2021

Hi @gapra-msft
Yes, you can close this issue; it was specific to the Content-MD5 calculation for large file uploads and is resolved now.
Thanks a lot

azure-sdk pushed a commit to azure-sdk/azure-sdk-for-java that referenced this issue Apr 20, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Apr 12, 2023