
OOME during GCP upload resulted in creation of empty blob #72018

Closed
DaveCTurner opened this issue Apr 21, 2021 · 4 comments · Fixed by #72051
Assignees: original-brownbear
Labels: >bug · :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) · Team:Distributed (Meta label for distributed team)

Comments

@DaveCTurner (Contributor)

Elasticsearch version (bin/elasticsearch --version): 7.9.1

Plugins installed: Cloud

JVM version (java -version): Bundled

OS version (uname -a if on a Unix-like system): Cloud

Description of the problem including expected versus actual behavior:

The master node reported an OOME while writing the RepositoryData blob at 2021-03-17T04:30:56.129. However, it apparently created an empty blob rather than failing outright:

[screenshot showing the empty blob]

This rendered the repository unusable: an empty blob is not valid at this location.

Provide logs (if relevant):

The stack trace from an OOME often does not indicate the cause of the OOME, but in this case it does show that we were writing the repository data at the time.

[instance-0000000001] fatal error in thread [elasticsearch[instance-0000000001][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3536) ~[?:?]
	at com.google.cloud.BaseWriteChannel.write(BaseWriteChannel.java:135) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.lambda$write$0(GoogleCloudStorageBlobStore.java:280) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1$$Lambda$7716/0x000000080213a840.run(Unknown Source) ~[?:?]
	at java.security.AccessController.executePrivileged(AccessController.java:784) ~[?:?]
	at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]
	at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.write(GoogleCloudStorageBlobStore.java:280) ~[?:?]
	at java.nio.channels.Channels.writeFullyImpl(Channels.java:74) ~[?:?]
	at java.nio.channels.Channels.writeFully(Channels.java:97) ~[?:?]
	at java.nio.channels.Channels$1.write(Channels.java:172) ~[?:?]
	at org.elasticsearch.common.io.Streams.doCopy(Streams.java:100) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.io.Streams.copy(Streams.java:92) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlobResumable(GoogleCloudStorageBlobStore.java:275) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:238) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:82) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlobAtomic(GoogleCloudStorageBlobContainer.java:87) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeAtomic(BlobStoreRepository.java:1786) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$writeIndexGen$54(BlobStoreRepository.java:1580) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$$Lambda$6993/0x0000000802035440.accept(Unknown Source) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:226) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture$$Lambda$4550/0x0000000801b1a840.accept(Unknown Source) ~[?:?]
	at java.util.ArrayList.forEach(ArrayList.java:1510) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:127) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:62) ~[elasticsearch-7.9.1.jar:7.9.1]
DaveCTurner added the >bug, needs:triage (Requires assignment of a team area label), and :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels on Apr 21, 2021
elasticmachine added the Team:Distributed (Meta label for distributed team) label on Apr 21, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear self-assigned this on Apr 21, 2021
DaveCTurner removed the needs:triage (Requires assignment of a team area label) label on Apr 21, 2021
@original-brownbear (Member)

Thanks for tracking this down David ... this is actually a pretty obvious failure given the information you provided. Since we close the writer channel to GCS in a finally block, and that's where we actually create the blob, this makes a lot of sense. I've got to think about a fix for a bit; it's not obvious to me how we can easily prevent this from happening.
Also, technically I think we could run into the exact same issue with the AWS SDK.
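
For illustration, a minimal sketch of the failure pattern just described (hypothetical names and buffer size, not the actual Elasticsearch code):

```java
import com.google.cloud.WriteChannel;

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch of the failure pattern, not the actual Elasticsearch
// code. WriteChannel#close() finalizes the resumable upload, so if write()
// throws (an OOME is an Error and propagates straight through) the finally
// block still commits whatever was flushed -- possibly nothing.
class GcsWriteSketch {
    static void writeBlob(WriteChannel channel, InputStream data) throws IOException {
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = data.read(buffer)) != -1) {
                channel.write(ByteBuffer.wrap(buffer, 0, read)); // OOME struck here
            }
        } finally {
            channel.close(); // creates the blob even though the write failed
        }
    }
}
```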

@DaveCTurner (Contributor, Author)

Ohh yikes, yes, I see we use a com.google.cloud.WriteChannel to stream the data, and closing that finalizes the upload as if it had succeeded; there's no option to abort it. We should use something like Storage.BlobWriteOption#md5Match.
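
A rough sketch of what that suggestion could look like with the GCS client (the bucket and blob names here are placeholders):

```java
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;

// Hypothetical sketch of the suggested mitigation (bucket and blob names are
// made up): attach the expected MD5 to the BlobInfo and pass md5Match() so
// GCS verifies the hash server-side and rejects a truncated or empty upload.
class CheckedWriterSketch {
    static WriteChannel openWriter(Storage storage, String expectedMd5Base64) {
        BlobInfo blobInfo = BlobInfo.newBuilder("my-bucket", "index-N")
                .setMd5(expectedMd5Base64) // base64-encoded MD5 of the full payload
                .build();
        return storage.writer(blobInfo, Storage.BlobWriteOption.md5Match());
    }
}
```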

@DaveCTurner (Contributor, Author)

I don't see the same issue with S3 though: at least there we set the blob length explicitly. The API is different too; we pass an InputStream into the SDK rather than writing the data in chunks to an interface that doesn't support aborting.
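
For comparison, a rough sketch of the S3 upload path as described (AWS SDK v1, placeholder names):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.InputStream;

// Hypothetical sketch of the S3 contrast (AWS SDK v1, made-up names): the SDK
// gets the whole stream plus an explicit Content-Length up front, so a short
// or failed read fails the request instead of committing a truncated object.
class S3PutSketch {
    static void putBlob(AmazonS3 s3, InputStream data, long length) {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(length); // declared length; a mismatch fails the upload
        s3.putObject(new PutObjectRequest("my-bucket", "index-N", data, metadata));
    }
}
```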

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 21, 2021
In the corner case of uploading a large (>5MB) metadata blob we did not set a content-validation requirement on the upload request (we get it automatically for smaller requests, which are not resumable uploads). This change sets the relevant request option to enforce a CRC32C hash check when writing `BytesReference` to GCS (as is the case with all but data blob writes). The custom CRC32C implementation here can be removed after backporting to 7.x. I copied over the Guava version of CRC32C, with slight adjustments for consuming `BytesReference`, instead of pulling the Guava dependency into the compile path, so that we can use the JDK's implementation of CRC32C in 8.x without having different dependencies in 8.x and 7.x.

closes elastic#72018
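
A rough sketch of the checksum computation this commit message describes, assuming the JDK CRC32C it mentions for 8.x (the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.CRC32C;

// Hypothetical sketch of the content-validation idea, using the JDK's CRC32C
// (available since Java 9): GCS expects the checksum as the base64 encoding
// of the big-endian 32-bit value, which can then be set via
// BlobInfo#setCrc32c and enforced with Storage.BlobWriteOption.crc32cMatch().
class Crc32cSketch {
    static String gcsCrc32c(byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
        return Base64.getEncoder().encodeToString(bigEndian);
    }
}
```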
original-brownbear added a commit that referenced this issue Apr 22, 2021
In the corner case of uploading a large (>5MB) metadata blob we did not set a content-validation requirement on the upload request (we get it automatically for smaller requests, which are not resumable uploads). This change sets the relevant request option to enforce an MD5 hash check when writing `BytesReference` to GCS (as is the case with all but data blob writes)

closes #72018