
OOME during GCP upload resulted in creation of empty blob #72018

Closed
DaveCTurner opened this issue Apr 21, 2021 · 4 comments · Fixed by #72051
Assignees: original-brownbear
Labels: >bug · :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) · Team:Distributed (Meta label for distributed team)

Comments

@DaveCTurner (Contributor)

Elasticsearch version (bin/elasticsearch --version): 7.9.1

Plugins installed: Cloud

JVM version (java -version): Bundled

OS version (uname -a if on a Unix-like system): Cloud

Description of the problem including expected versus actual behavior:

The master node reported an OOME while writing the RepositoryData blob at 2021-03-17T04:30:56.129. However, it apparently created an empty blob rather than failing outright:

[screenshot showing the empty blob]

This rendered the repository unusable: an empty blob is not valid at this location.

Provide logs (if relevant):

The stack trace from an OOME often does not indicate the cause of the OOME, but in this case it does show that we were writing the repository data at the time.

[instance-0000000001] fatal error in thread [elasticsearch[instance-0000000001][snapshot][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3536) ~[?:?]
	at com.google.cloud.BaseWriteChannel.write(BaseWriteChannel.java:135) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.lambda$write$0(GoogleCloudStorageBlobStore.java:280) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1$$Lambda$7716/0x000000080213a840.run(Unknown Source) ~[?:?]
	at java.security.AccessController.executePrivileged(AccessController.java:784) ~[?:?]
	at java.security.AccessController.doPrivileged(AccessController.java:554) ~[?:?]
	at org.elasticsearch.repositories.gcs.SocketAccess.doPrivilegedIOException(SocketAccess.java:44) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore$1.write(GoogleCloudStorageBlobStore.java:280) ~[?:?]
	at java.nio.channels.Channels.writeFullyImpl(Channels.java:74) ~[?:?]
	at java.nio.channels.Channels.writeFully(Channels.java:97) ~[?:?]
	at java.nio.channels.Channels$1.write(Channels.java:172) ~[?:?]
	at org.elasticsearch.common.io.Streams.doCopy(Streams.java:100) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.io.Streams.copy(Streams.java:92) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlobResumable(GoogleCloudStorageBlobStore.java:275) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobStore.writeBlob(GoogleCloudStorageBlobStore.java:238) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlob(GoogleCloudStorageBlobContainer.java:82) ~[?:?]
	at org.elasticsearch.repositories.gcs.GoogleCloudStorageBlobContainer.writeBlobAtomic(GoogleCloudStorageBlobContainer.java:87) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.writeAtomic(BlobStoreRepository.java:1786) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$writeIndexGen$54(BlobStoreRepository.java:1580) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$$Lambda$6993/0x0000000802035440.accept(Unknown Source) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:112) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:226) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:106) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:98) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture$$Lambda$4550/0x0000000801b1a840.accept(Unknown Source) ~[?:?]
	at java.util.ArrayList.forEach(ArrayList.java:1510) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:98) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:127) ~[elasticsearch-7.9.1.jar:7.9.1]
	at org.elasticsearch.action.StepListener.innerOnResponse(StepListener.java:62) ~[elasticsearch-7.9.1.jar:7.9.1]
DaveCTurner added the >bug, needs:triage (Requires assignment of a team area label), and :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels on Apr 21, 2021
elasticmachine added the Team:Distributed (Meta label for distributed team) label on Apr 21, 2021
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (Team:Distributed)

original-brownbear self-assigned this on Apr 21, 2021
DaveCTurner removed the needs:triage (Requires assignment of a team area label) label on Apr 21, 2021
@original-brownbear (Member)

Thanks for tracking this down David ... this is actually a pretty obvious failure given the information you provided. Since we close the writer channel to GCS in a finally block, and that's where we actually create the blob, this makes a lot of sense. I've got to think about a fix for a bit; it's not obvious to me how we can easily prevent this from happening.
Also, technically I think we could run into the exact same issue with the AWS SDK.
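
For illustration, a minimal sketch of the failure pattern just described (hypothetical names and buffer size, not the actual Elasticsearch code):

```java
import com.google.cloud.WriteChannel;

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Hypothetical sketch of the failure pattern, not the actual Elasticsearch
// code. WriteChannel#close() finalizes the resumable upload, so if write()
// throws (an OOME is an Error and propagates straight through) the finally
// block still commits whatever was flushed -- possibly nothing.
class GcsWriteSketch {
    static void writeBlob(WriteChannel channel, InputStream data) throws IOException {
        try {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = data.read(buffer)) != -1) {
                channel.write(ByteBuffer.wrap(buffer, 0, read)); // OOME struck here
            }
        } finally {
            channel.close(); // creates the blob even though the write failed
        }
    }
}
```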

@DaveCTurner (Contributor, Author)

Ohh yikes, yes, I see we use a com.google.cloud.WriteChannel to stream the data, and closing that finalizes the upload as if it had succeeded; there's no option to abort it. We should use something like Storage.BlobWriteOption#md5Match.
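
A rough sketch of what that suggestion could look like with the GCS client (the bucket and blob names here are placeholders):

```java
import com.google.cloud.WriteChannel;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;

// Hypothetical sketch of the suggested mitigation (bucket and blob names are
// made up): attach the expected MD5 to the BlobInfo and pass md5Match() so
// GCS verifies the hash server-side and rejects a truncated or empty upload.
class CheckedWriterSketch {
    static WriteChannel openWriter(Storage storage, String expectedMd5Base64) {
        BlobInfo blobInfo = BlobInfo.newBuilder("my-bucket", "index-N")
                .setMd5(expectedMd5Base64) // base64-encoded MD5 of the full payload
                .build();
        return storage.writer(blobInfo, Storage.BlobWriteOption.md5Match());
    }
}
```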

@DaveCTurner (Contributor, Author)

I don't see the same issue with S3 though: at least there we set the blob length explicitly. The API is different too; we pass an InputStream into the SDK rather than writing the data in chunks to an interface that doesn't support aborting.
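
For comparison, a rough sketch of the S3 upload path as described (AWS SDK v1, placeholder names):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.ObjectMetadata;
import com.amazonaws.services.s3.model.PutObjectRequest;

import java.io.InputStream;

// Hypothetical sketch of the S3 contrast (AWS SDK v1, made-up names): the SDK
// gets the whole stream plus an explicit Content-Length up front, so a short
// or failed read fails the request instead of committing a truncated object.
class S3PutSketch {
    static void putBlob(AmazonS3 s3, InputStream data, long length) {
        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setContentLength(length); // declared length; a mismatch fails the upload
        s3.putObject(new PutObjectRequest("my-bucket", "index-N", data, metadata));
    }
}
```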

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Apr 21, 2021
In the corner case of uploading a large (>5MB) metadata blob we did not set a content-validation requirement on the upload request (we get it automatically for smaller requests, which are not resumable uploads). This change sets the relevant request option to enforce a CRC32C hash check when writing `BytesReference` to GCS (as is the case with all but data blob writes). The custom CRC32C implementation here can be removed after backporting to 7.x. I copied over the Guava version of CRC32C, with slight adjustments for consuming `BytesReference`, instead of pulling the Guava dependency into the compile path, so that we can use the JDK's implementation of CRC32C in 8.x without having different dependencies in 8.x and 7.x.

closes elastic#72018
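
A rough sketch of the checksum computation this commit message describes, assuming the JDK CRC32C it mentions for 8.x (the class and method names here are illustrative):

```java
import java.nio.ByteBuffer;
import java.util.Base64;
import java.util.zip.CRC32C;

// Hypothetical sketch of the content-validation idea, using the JDK's CRC32C
// (available since Java 9): GCS expects the checksum as the base64 encoding
// of the big-endian 32-bit value, which can then be set via
// BlobInfo#setCrc32c and enforced with Storage.BlobWriteOption.crc32cMatch().
class Crc32cSketch {
    static String gcsCrc32c(byte[] payload) {
        CRC32C crc = new CRC32C();
        crc.update(payload, 0, payload.length);
        byte[] bigEndian = ByteBuffer.allocate(4).putInt((int) crc.getValue()).array();
        return Base64.getEncoder().encodeToString(bigEndian);
    }
}
```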
original-brownbear added a commit that referenced this issue Apr 22, 2021
In the corner case of uploading a large (>5MB) metadata blob we did not set a content-validation requirement on the upload request (we get it automatically for smaller requests, which are not resumable uploads). This change sets the relevant request option to enforce an MD5 hash check when writing `BytesReference` to GCS (as is the case with all but data blob writes)

closes #72018