[storage] "stream error: stream ID x; INTERNAL_ERROR" #784

Closed
JeanMertz opened this issue Oct 10, 2017 · 27 comments
Assignees
Labels
api: storage - Issues related to the Cloud Storage API.
priority: p1 - Important issue which blocks shipping the next release. Will be fixed prior to next release.
🚨 - This issue needs some love.
type: bug - Error or flaw in code with unintended results or allowing sub-optimal usage patterns.

Comments

@JeanMertz

JeanMertz commented Oct 10, 2017

We are reading tens of millions of objects from GCS and seem to be hitting an issue where an error "stream error: stream ID 4163; INTERNAL_ERROR" is returned after processing files for a while.

It's pretty hard to debug, as it takes several hours before the error occurs, but we've now hit it on two consecutive runs.

We are using version eaddaf6dd7ee35fd3c2420c8d27478db176b0485 of the storage package.

Here's the pseudocode of what we are doing:

cs, err := storage.NewClient(ctx)
if err != nil {
	// handle error
}
defer cs.Close()
b := cs.Bucket(...)
q := &storage.Query{Prefix: ...}
it := b.Objects(ctx, q)
for {
	a, err := it.Next()
	if err == iterator.Done {
		break
	}
	if err != nil {
		// non-Done iteration errors (including the stream error) surface here
	}

	handleObject(...)
}

We have retry logic built into the handleObject function, but even retrying doesn't help. Also, once the error shows up it doesn't go away: every subsequent read, of any line or file, returns the same error.

We're thinking of building some retry logic around the client itself (closing it and opening a new one) to see if that helps. We're still digging deeper, but I wanted to report this in case anyone else has run into it.
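
A rough sketch of that client-level retry idea (a hypothetical helper, untested; handleObject would need to be idempotent):

package gcsscan

import (
	"context"
	"strings"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// listWithClientRetry restarts the whole listing with a fresh client when the
// http2 stream error shows up. handleObject is assumed to be idempotent.
func listWithClientRetry(ctx context.Context, bucket, prefix string, handleObject func(*storage.ObjectAttrs) error) error {
	const maxAttempts = 3
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		cs, err := storage.NewClient(ctx)
		if err != nil {
			return err
		}
		lastErr = nil
		it := cs.Bucket(bucket).Objects(ctx, &storage.Query{Prefix: prefix})
		for {
			attrs, err := it.Next()
			if err == iterator.Done {
				break
			}
			if err != nil {
				lastErr = err
				break
			}
			if err := handleObject(attrs); err != nil {
				lastErr = err
				break
			}
		}
		cs.Close()
		if lastErr == nil || !strings.Contains(lastErr.Error(), "INTERNAL_ERROR") {
			return lastErr // done, or an error we don't want to mask
		}
		// Stream error: loop around and retry with a brand-new client.
	}
	return lastErr
}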

@jba
Contributor

jba commented Oct 11, 2017

It's unlikely to be an error in this client. I've reported it to other teams internally.

@jba
Contributor

jba commented Oct 12, 2017

What version of Go are you using?

Also, what commit is your google.golang.org/api/storage/v1 package at?

@JeanMertz
Author

google.golang.org/api/storage/v1 at 0aaeb37e5bf8ee5d75973b162ffd0f9f6cfdd91d

$ go version
go version go1.9 darwin/amd64

We've seen it happen another couple of times; no real insights yet, other than that it does keep happening.

@jba
Contributor

jba commented Oct 12, 2017

I filed a bug against Go. The link should be above this comment.

@JeanMertz
Author

JeanMertz commented Oct 17, 2017

Thanks. I've since rebuilt the service that gave this issue, but am still seeing it occur.

One thing I noticed: the files we are fetching are gzipped, so we wrap the reader from ObjectHandle.NewReader in gzip.NewReader. Not sure if this is relevant, but as explained before the code works 99% of the time (and it doesn't fail on the same file every time), so I can't see anything we're doing wrong here.
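
Roughly, the read path looks like this (a sketch with placeholder names, not the exact code):

package gcsscan

import (
	"bufio"
	"compress/gzip"
	"context"

	"cloud.google.com/go/storage"
)

// readGzippedObject wraps the GCS reader in gzip.NewReader and scans lines.
func readGzippedObject(ctx context.Context, cs *storage.Client, bucket, name string, handleLine func(string) error) error {
	r, err := cs.Bucket(bucket).Object(name).NewReader(ctx)
	if err != nil {
		return err
	}
	defer r.Close()

	gz, err := gzip.NewReader(r)
	if err != nil {
		return err
	}
	defer gz.Close()

	sc := bufio.NewScanner(gz)
	for sc.Scan() {
		if err := handleLine(sc.Text()); err != nil {
			return err
		}
	}
	return sc.Err() // the stream error surfaces here, mid-file
}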

@anthmgoogle

This issue seems to have been clarified as a golang/go issue. I suggest continuing any required discussion there.

@Bankq
Contributor

Bankq commented Nov 21, 2017

@JeanMertz have you been able to work around this? I've seen the same issue quite often. Thanks!

@dansiemon

I also see this frequently. I have a job that pulls a bunch of 250MB files from cloud storage for processing, and about 1 in 4 runs dies with this error.

Latest gcloud lib rev.

@Bankq
Contributor

Bankq commented Dec 7, 2017

@dansiemon FWIW, I don't see this error anymore when I retry the operation. By retry I mean creating a new storage.Object.

@maddyblue

maddyblue commented Dec 21, 2017

Also seeing this when pulling lots of files from GCS (cockroachdb/cockroach#20917). It's moderately difficult to retry because the error occurs somewhere in the middle of reading a file, which means that any function that needs to retry now has to be idempotent. My code was built to stream data to various processing stages, making retrying difficult. Furthermore, the files I'm reading can sometimes be > 1GB, which makes retrying even harder, since that's a lot of work and data to buffer just in case we have to reread due to this bug.

@anthmgoogle As bradfitz said (golang/go#22235 (comment)), the http2 package is just passing on an error it saw; it is not the thing producing the error. So I don't think it's correct to close this bug as an issue in another project. Could you reopen it and look into this again?

@jba jba reopened this Dec 21, 2017
@jba jba self-assigned this Dec 21, 2017
@jba jba added api: storage Issues related to the Cloud Storage API. priority: p1 Important issue which blocks shipping the next release. Will be fixed prior to next release. type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. labels Dec 21, 2017
@jseeley78

I too am hitting this error when pulling down files from gcloud storage, and I have 3 auto retries (with exponential backoff). Was anyone able to find anything that helped?

@bradfitz
Contributor

bradfitz commented Feb 1, 2018

@jba, related to a discussion I had with @ mikeyu83 the other day: the Go Cloud Storage library should probably cut large/failed (or just failed) transfers up into multiple HTTP requests, stitching together an io.Reader for the user from the responses to multiple HTTP Range requests.
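
A sketch of that idea, resuming with NewRangeReader from the last good offset (an illustration of the suggestion, using a crude string match for the error; not the library's eventual implementation):

package gcsread

import (
	"context"
	"io"
	"strings"

	"cloud.google.com/go/storage"
)

// resumingReader re-opens the object at the current offset when the http2
// stream error appears, stitching the pieces into one io.Reader.
type resumingReader struct {
	ctx    context.Context
	obj    *storage.ObjectHandle
	r      *storage.Reader
	offset int64
}

func newResumingReader(ctx context.Context, obj *storage.ObjectHandle) (*resumingReader, error) {
	r, err := obj.NewRangeReader(ctx, 0, -1) // length -1 means "to the end"
	if err != nil {
		return nil, err
	}
	return &resumingReader{ctx: ctx, obj: obj, r: r}, nil
}

func (rr *resumingReader) Read(p []byte) (int, error) {
	n, err := rr.r.Read(p)
	rr.offset += int64(n)
	if err == nil || err == io.EOF || !strings.Contains(err.Error(), "INTERNAL_ERROR") {
		return n, err
	}
	// Stream error: re-open a Range request starting where we stopped.
	rr.r.Close()
	nr, rerr := rr.obj.NewRangeReader(rr.ctx, rr.offset, -1)
	if rerr != nil {
		return n, rerr
	}
	rr.r = nr
	if n > 0 {
		return n, nil // hand back what we have; the next Read uses the new stream
	}
	return rr.r.Read(p)
}

func (rr *resumingReader) Close() error { return rr.r.Close() }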

@joe94

joe94 commented Feb 15, 2018

We're also seeing this in our project (https://github.com/Comcast/tr1d1um/pull/58/files#diff-111de0df2ea714c4ceaac0c8defba0cfR86, running this PR locally). We are building with Go 1.9.3.

Specifically, client.Do() returns an error like the following: Post https://[server-path]:443/api/v2: stream error: stream ID 1; INTERNAL_ERROR

Interestingly, we are not sending large chunks of data (as in some previous comments), yet we still see this issue while sending only a short, well-formatted JSON payload.

Disabling HTTP/2 on the server side seems to "solve" the issue, but I am not sure that is optimal: xmidt-org/webpa-common@41dd674

We thought HTTP/2 would work transparently in the standard net/http package, but that does not seem to be the case. It is also possible that we are misusing it, in which case we would appreciate some help.
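
For reference, the documented way to disable HTTP/2 on a Go http.Server is to give it a non-nil, empty TLSNextProto map; a minimal sketch (whether giving up HTTP/2 is acceptable is a separate question):

package main

import (
	"crypto/tls"
	"log"
	"net/http"
)

func main() {
	srv := &http.Server{
		Addr:    ":8443",
		Handler: http.DefaultServeMux,
		// A non-nil, empty TLSNextProto map disables HTTP/2 upgrades,
		// so clients fall back to HTTP/1.1 over TLS.
		TLSNextProto: map[string]func(*http.Server, *tls.Conn, http.Handler){},
	}
	log.Fatal(srv.ListenAndServeTLS("cert.pem", "key.pem")) // cert paths are placeholders
}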

@joe94

joe94 commented Feb 17, 2018

Just an update to my post above: After further investigation, we found that it is very likely we were seeing these issues due to overly aggressive http.Server timeouts such as writeTimeout.

@jba
Contributor

jba commented Mar 6, 2018

I've had no luck reproducing this. I'm using Go 1.10 and the storage client at head (3401587). I've got 10,000 goroutines reading fifty 250M files (each file read by many goroutines) and I don't see any errors.

I still plan to do what Brad suggested above, but I sure would like to be able to reproduce this so I'm not flying blind.
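
For anyone who wants to attempt a similar reproduction, the shape of such a load test is roughly the following (bucket and object names are placeholders, not the harness actually used above):

package main

import (
	"context"
	"fmt"
	"io"
	"io/ioutil"
	"log"
	"sync"

	"cloud.google.com/go/storage"
)

// Many goroutines each read large objects end to end; any stream error
// should surface from io.Copy.
func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	bucket := client.Bucket("my-test-bucket") // placeholder bucket
	var wg sync.WaitGroup
	for g := 0; g < 10000; g++ {
		wg.Add(1)
		go func(g int) {
			defer wg.Done()
			name := fmt.Sprintf("large-object-%02d", g%50) // placeholder naming scheme
			r, err := bucket.Object(name).NewReader(ctx)
			if err != nil {
				log.Printf("open %s: %v", name, err)
				return
			}
			defer r.Close()
			if _, err := io.Copy(ioutil.Discard, r); err != nil {
				log.Printf("read %s: %v", name, err)
			}
		}(g)
	}
	wg.Wait()
}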

@maddyblue

maddyblue commented Mar 20, 2018

You may need larger files? This reproduced for us again last night. We have a test that reads some large (12GB) files from GCS (not in parallel with any other reads, only one goroutine reading), and we got this error again.

gopherbot pushed a commit that referenced this issue Apr 17, 2018
If we get a retryable error when reading an object, issue
another request for the object's data, using a range that
begins where we stopped reading.

For now, the only retryable error is the internal stream error
that has been seen in #784.

Change-Id: Idb10b1859bb276b301c3ccb93b0b8bfc84510354
Reviewed-on: https://code-review.googlesource.com/25630
Reviewed-by: Jean de Klerk <[email protected]>
Reviewed-by: Brad Fitzpatrick <[email protected]>
@jba
Contributor

jba commented Apr 17, 2018

The commit mentioned above will try to read the remaining content when it sees INTERNAL_ERROR. Since I can't test it IRL, I'm hoping this thread's participants will try it out.

craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Apr 18, 2018
24880: storageccl: pull in fix for reading GCS files r=mjibson a=mjibson

The linked issue lists a commit attempting to fix occasional problems
we see during large GCS reads. They haven't been able to repro it
but think this will fix it. Pull in that update so we can run it for
a few months for testing.

See: googleapis/google-cloud-go#784

Release note: None

24896: engine: find split keys in the first range of a partition r=tschottdorf,a-robinson,bdarnell,danhhz a=benesch

This deserves a roachtest—roachmart should make sure it has the correct number of ranges after inserting data—but I wanted to get this out for review ASAP.

---

MVCCFindSplitKey would previously fail to find any split keys in the
first range of a partition. As a result, partitioned tables have been
observed with multi-gigabyte ranges. This commit fixes the bug.

Specifically, MVCCFindSplitKey was assuming that the start key of a
range within a table was also the row prefix for the first row of data
in the range. This does not hold true for the first range of a table or
a partition of a table--that range begins at, for example, /Table/51,
while the row begins at /Table/51/1/aardvark. The old code had a special
case for the first range in a table, but not for the first range in a
partition. (It predates partitioning.)

Remove the need for special casing by actually looking in RocksDB to
determine the row prefix for the first row of data rather than
attempting to derive it from the range start key. This properly handles
partitioning and is robust against future changes to range split
boundaries.

See the commit within for more details on the approach.

Release note (bug fix): Ranges in partitioned tables now properly split
to respect their configured maximum size.

Co-authored-by: Matt Jibson <[email protected]>
Co-authored-by: Nikhil Benesch <[email protected]>
@maddyblue

We've pulled that commit into cockroach. We run some large nightly tests that were failing a few times per week because of this bug. I'll report back with new results in a while.

@apimation

Hopefully this helps: we solved this on the server side by changing the ReadTimeout property on http.Server.
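
A minimal sketch of that kind of server-side change; the values are only illustrative, the point being that timeouts shorter than the slowest expected request can cut streams off mid-transfer, which HTTP/2 clients then see as INTERNAL_ERROR:

package main

import (
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:    ":8080",
		Handler: http.DefaultServeMux,
		// Generous timeouts so slow request/response bodies are not reset mid-stream.
		ReadTimeout:  5 * time.Minute,
		WriteTimeout: 5 * time.Minute,
		IdleTimeout:  2 * time.Minute,
	}
	log.Fatal(srv.ListenAndServe())
}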

@ghost

ghost commented Mar 31, 2022

I see this issue when downloading files from the IPFS web client (ipfs.io) with the Go HTTP client.
IPFS is p2p software written in Go, and download speeds fluctuate frequently because it's a p2p network.
I think that may be a cause of this problem.

rhu713 pushed a commit to rhu713/cockroach that referenced this issue Jul 29, 2022
…AL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#84162

Release note: None
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Jul 29, 2022
…AL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
craig bot pushed a commit to cockroachdb/cockroach that referenced this issue Jul 29, 2022
…85329

84975: storage: add `MVCCRangeKeyStack` for range keys r=nicktrav,jbowens a=erikgrinaker

**storage: add `MVCCRangeKeyStack` for range keys**

This patch adds `MVCCRangeKeyStack` and `MVCCRangeKeyVersion`, a new
range key representation that will be returned by `SimpleMVCCIterator`.
It is more compact, for efficiency, and comes with a set of convenience
methods to simplify common range key processing.

Resolves #83895.

Release note: None
  
**storage: return `MVCCRangeKeyStack` from `SimpleMVCCIterator`**

This patch changes `SimpleMVCCIterator.RangeKeys()` to return
`MVCCRangeKeyStack` instead of `[]MVCCRangeKeyValue`. Callers have not
been migrated to properly make use of this -- instead, they call
`AsRangeKeyValues()` and construct and use the old data structure.

The MVCC range tombstones tech note is also updated to reflect this.

Release note: None
  
**storage: migrate MVCC code to `MVCCRangeKeyStack`**

Release note: None
  
***: migrate higher-level code to `MVCCRangeKeyStack`**

Release note: None
  
**kvserver/gc: partially migrate to `MVCCRangeKeyStack`**

Some parts require invasive changes to MVCC stats helpers. These will
shortly be consolidated with other MVCC stats logic elsewhere, so the
existing logic is retained for now by using `AsRangeKeyValues()`.

Release note: None
  
**storage: remove `FirstRangeKeyAbove()` and `HasRangeKeyBetween()`**

Release note: None

85017: Revert "sql: Add database ID to sampled query log" r=THardy98 a=THardy98

Reverts: #84195
This reverts commit 307817e.

Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure. Protobuf field not reserved as 
no official build was released with these changes yet.

Release note (sql change): Removes the DatabaseID field from the
`SampledQuery` telemetry log due to the potential of indefinite blocking
in the case of a lease acquisition failure.

85024: cloud/gcp: add custom retryer for gcs storage, retry on stream INTERNAL_ERROR r=rhu713 a=rhu713

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: #85217, #85216, #85204, #84162

Release note: None


85069: optbuilder: handle unnest returning a tuple r=DrewKimball a=DrewKimball

Currently, the return types of SRFs that return multiple columns are
represented as tuples with labels. The tuple labels are used to decide
whether or not to create a single output column for the SRF, or multiple.
The `unnest` function can return a single column if it has a single argument,
and the type of that column can be a tuple with labels. This could cause the
old logic to mistakenly create multiple output columns for `unnest`, which
could lead to panics down the line and incorrect behavior otherwise.

This commit adds a special case for `unnest` in the `optbuilder` to only expand
tuple return types if there is more than one argument (implying more than one
output column). Other SRFs do not have the same problem because they either
always return the same number of columns, cannot return tuples, or both.

Fixes #58438

Release note (bug fix): Fixed a bug existing since release 20.1 that could
cause a panic in rare cases when the unnest function was used with a
tuple return type.

85100: opt: perf improvements for large queries r=DrewKimball a=DrewKimball

**opt: add bench test for slow queries**

This commit adds two slow-planning queries pulled from #64793 to be used
in benchmarking the optimizer. In addition, the `ReorderJoinsLimit` has been
set to the default 8 for benchmarking tests.

**opt: add struct for tracking column equivalence sets**

Previously, the `JoinOrderBuilder` would construct a `FuncDepSet` from
scratch on each call to `addJoins` in order to eliminate redundant join
filters. This led to unnecessary large allocations because `addJoins` is
called an exponential number of times in query size.

This commit adds a struct `EquivSet` that efficiently stores equivalence
relations as `ColSets` in a slice. Rather than being constructed on each
call to `addJoins`, a `Reset` method is called that maintains slice memory.

In the future, `EquivSet` can be used to handle equivalencies within `FuncDepSet`
structs as well. This will avoid a significant number of allocations in cases with
many equivalent columns, as outlined in #83963.

**opt: avoid usage of FastIntMap in optimizer hot paths**

Previously, `computeHashJoinCost` would use a `FastIntMap` to represent join
equality filters to pass to `computeFiltersCost`. In addition,
`GenerateMergeJoins` used a `FastIntMap` to look up columns among its join
equality columns. This led to unnecessary allocations since column IDs are
often large enough to exceed the small field of `FastIntMap`.

This commit modifies `computeFiltersCost` to take an anonymous function
that is used to decide whether to skip an equality condition, removing the
need for a mapping between columns.

This commit also refactors `GenerateMergeJoins` to simply perform a linear
scan of its equality columns; this avoids the allocation issue, and should be
fast in practice because the number of equalities will not generally be large.

Release note: None

85146: [backupccl] Use Expr for backup's Detached and Revision History options r=benbardin a=benbardin

This will allow us to set them to null, which will be helpful for ALTER commands.

Release note: None

85234: dev: add rewritable paths for norm tests r=mgartner a=mgartner

Tests in `pkg/sql/opt/norm` are similar to tests in `pkg/sql/opt/xform`
and `pkg/sql/opt/memo` in that they rely on fixtures in
`pkg/sql/opt/testutils/opttester/testfixtures`. This commit adds these
fixtures as rewritable paths for norm tests so that
`./dev test pkg/sql/opt/xform --rewrite` does not fail with errors like:

    open pkg/sql/opt/testutils/opttester/testfixtures/tpcc_schema: operation not permitted

Release note: None

85325: sql: fix explain gist output to show number of scan span constraints r=cucaroach a=cucaroach

If there were span constraints we would always print 1; we need to actually
append them to get the count right.

Fixes: #85324

Release note: None


85327: sql: fix udf logic test r=chengxiong-ruan a=chengxiong-ruan

Fixes: #85303

Release note: None

85329: colexec: fix recent concat fix r=yuzefovich a=yuzefovich

The recent fix of the Concat operator in the vectorized engine doesn't
handle the array concatenation correctly and this is now fixed.

Fixes: #85295.

Release note: None

Co-authored-by: Erik Grinaker <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Rui Hu <[email protected]>
Co-authored-by: DrewKimball <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tommy Reilly <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 3, 2022
Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Retry these errors as suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2022
…o v1.21.0

This commit bumps the `cloud.google.com/go/storage` vendor to
include the ability to inject custom retry functions when reading
and writing from the underlying SDK -
https://pkg.go.dev/cloud.google.com/go/storage#Client.SetRetry.

The motivation for this change is to combat the high rate of
restores we are seeing fail due to an internal http2 stream error
that is being surfaced by the SDK in our roachtests. As seen in
cockroachdb#85024 we would like
to wrap the default retry logic with our custom retry handling for this
particular error. This is the recommended solution as per:
googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Note, the dependencies have been bumped to the version that we have
been running on master since the 22.1 branch was cut.

Release note (general change): bump cloud.google.com/go/storage from
v1.18.2 to v1.21.0 to allow for injection of custom retry logic in the
SDK
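
With the SetRetry hook linked above, the custom predicate looks roughly like this (a sketch; the string match mirrors the cockroach retryer, and WithErrorFunc replaces the library's default retry predicate, so real code would keep the default checks as well):

package gcsretry

import (
	"context"
	"strings"

	"cloud.google.com/go/storage"
)

// newClientRetryingStreamErrors configures the client to also retry the
// http2 "INTERNAL_ERROR" stream errors discussed in this issue.
func newClientRetryingStreamErrors(ctx context.Context) (*storage.Client, error) {
	client, err := storage.NewClient(ctx)
	if err != nil {
		return nil, err
	}
	client.SetRetry(
		storage.WithPolicy(storage.RetryAlways), // assumes the surrounding operations are idempotent
		storage.WithErrorFunc(func(err error) bool {
			if err == nil {
				return false
			}
			// Crude string match on the http2 stream error; a production
			// predicate should also keep the library's default transient checks.
			msg := err.Error()
			return strings.Contains(msg, "stream error:") &&
				strings.Contains(msg, "INTERNAL_ERROR")
		}),
	)
	return client, nil
}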
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 8, 2022
…o v1.21.0

This commit bumps the `cloud.google.com/go/storage` vendor to
include the ability to inject custom retry functions when reading
and writing from the underlying SDK -
https://pkg.go.dev/cloud.google.com/go/storage#Client.SetRetry.

The motivation for this change is to combat the high rate of
restores we are seeing fail due to an internal http2 stream error
that is being surfaced by the SDK in our roachtests. As seen in cockroachdb#85024
we would like to wrap the default retry logic with our custom retry
handling for this particular error. This is the recommended solution as per:
googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Note, the dependencies have been bumped to the version that we have
been running on master since the 22.1 branch was cut.

Release note (general change): bump cloud.google.com/go/storage from
v1.18.2 to v1.21.0 to allow for injection of custom retry logic in the
SDK
rhu713 pushed a commit to rhu713/cockroach that referenced this issue Aug 9, 2022
…AL_ERROR

Currently, errors like
`stream error: stream ID <x>; INTERNAL_ERROR; received from peer`
are not being retried. Create a custom retryer to retry these errors as
suggested by:

googleapis/google-cloud-go#3735
googleapis/google-cloud-go#784

Fixes: cockroachdb#85217, cockroachdb#85216, cockroachdb#85204, cockroachdb#84162

Release note: None

Release justification: add retries for temporary errors that were causing
roachtests to fail.
@garan82

garan82 commented Feb 16, 2023

We got a lot of these errors with Go SDK 1.20/1.20.1, and did not have this problem with Go 1.95. Is anyone experiencing the same issue?

@Festlandtommy

Festlandtommy commented May 17, 2023

We are occasionally getting this error with Go 1.20.
We are reading binary audio (<10MB/req) using http.Server and have run into this error a couple of times.

Edit: Downgrading to Go 1.19 did not fix the issue, and bumping http.Server.ReadTimeout had no effect either.
