
Fix gRPC Streaming after Cold Start #3239

Closed · Fixed by #3257
tanzeeb opened this issue Feb 15, 2019 · 4 comments

Assignees: tanzeeb
Labels: area/autoscale, area/networking, area/test-and-release, kind/bug
Milestone: Serving 0.5

Comments

tanzeeb (Contributor) commented Feb 15, 2019

In what area(s)?

/area autoscale
/area networking
/area test-and-release

What version of Knative?

HEAD

Expected Behavior

After a service has scaled to zero, a streaming gRPC request should complete successfully.

Actual Behavior

After a service has scaled to zero, a streaming gRPC request times out.

Steps to Reproduce the Problem

See the commented-out test case in #3205. A minimal sketch of the scenario appears below.

/assign
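
For context, here is a minimal Go sketch of the failing scenario (not the actual test from #3205): after the service has scaled to zero, open a streaming RPC and expect a reply before the deadline. The hostname, the proto import path, and the `PingService` client and message types are hypothetical placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"google.golang.org/grpc"

	pb "example.com/ping/proto" // hypothetical generated gRPC stubs
)

func main() {
	// Dial the (illustrative) Knative service hostname via the ingress.
	conn, err := grpc.Dial("ping.default.example.com:80", grpc.WithInsecure())
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()

	// A 60s deadline: the observed failure mode was the stream consuming
	// nearly the whole deadline during cold start, then failing.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	stream, err := pb.NewPingServiceClient(conn).PingStream(ctx)
	if err != nil {
		log.Fatalf("open stream: %v", err)
	}
	if err := stream.Send(&pb.Request{Msg: "hello"}); err != nil {
		log.Fatalf("send: %v", err)
	}
	// After a cold start this Recv times out with DeadlineExceeded
	// instead of returning the echoed message.
	resp, err := stream.Recv()
	if err != nil {
		log.Fatalf("recv: %v", err)
	}
	log.Printf("got: %q", resp.Msg)
}
```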

@tanzeeb tanzeeb added the kind/bug label Feb 15, 2019
@knative-prow-robot knative-prow-robot added the area/autoscale, area/networking, and area/test-and-release labels Feb 15, 2019
@tanzeeb tanzeeb changed the title Support gRPC Streaming after Cold Start Fix gRPC Streaming after Cold Start Feb 15, 2019
@tanzeeb tanzeeb mentioned this issue Feb 15, 2019
5 tasks
mattmoor (Member) commented
After poking at this for a bit and instrumenting things, I think I'm convinced that this is a problem with the retry logic in the activator.

On my cluster, it usually takes 4-5 attempts before it actually connects to the user-container, but that connection always lasts for roughly 55 seconds (whatever is left of 60) and then gets a deadline exceeded.

When I added a time.Sleep(5 * time.Second) between Revision readiness and the attempts to forward the request, the test passes.

I think that, in general, trying to "rewind" the user's request is creating more problems than it's worth [1]. We should probably pursue a simple probe endpoint that is intercepted and filtered by the queue-proxy (which we can also use for positive hand-off in deactivation). We'd essentially probe this endpoint for queue-proxy readiness, and once it's ready, send the request on to that Pod without the fancy retry logic (sketched below).

[1] - An example problem I observed while reading this code: the Rewinder effectively records the entire upstream request in memory until the request (stream) is complete. While that is fixable, such fixes are band-aids over the real problem.
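
To make the proposal concrete, here is a rough Go sketch of the probe-then-forward flow. This is not the activator's actual code: `maxProbeRetries`, `probeInterval`, the probe header value, and the bare reverse proxy are all illustrative assumptions.

```go
package main

import (
	"errors"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

// Assumed knobs, not actual activator configuration names.
const (
	maxProbeRetries = 60
	probeInterval   = 100 * time.Millisecond
)

// probeThenForward polls the pod's queue-proxy with a probe header until it
// answers 200, then forwards the real request exactly once.
func probeThenForward(w http.ResponseWriter, r *http.Request, podAddr string) error {
	for i := 0; i < maxProbeRetries; i++ {
		probe, err := http.NewRequest(http.MethodGet, "http://"+podAddr+"/", nil)
		if err != nil {
			return err
		}
		probe.Header.Set("k-network-probe", "probe") // header value is an assumption

		resp, err := http.DefaultClient.Do(probe)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				// The queue-proxy answered the probe itself, so the pod is
				// routable; hand off the (possibly streaming) request once,
				// with no rewind/retry of its body.
				proxy := httputil.NewSingleHostReverseProxy(&url.URL{Scheme: "http", Host: podAddr})
				proxy.ServeHTTP(w, r)
				return nil
			}
		}
		time.Sleep(probeInterval)
	}
	return errors.New("queue-proxy never became ready")
}
```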

mattmoor (Member) commented
This is essentially what I'm proposing, but done in a manner that we could later reuse for this.

@mattmoor mattmoor added this to the Serving 0.5 milestone Feb 17, 2019
mattmoor (Member) commented
Here's some commentary from Matt Klein on this topic (thanks to @tcnghia for the pointer): envoyproxy/envoy#3594 (comment)

I think this basically confirms that we shouldn't be trying to send & retry, but should instead check network readiness by other means (e.g. the probe).

knative-prow-robot pushed a commit that referenced this issue Feb 17, 2019
This adds a header `k-network-probe`, to which the Knative networking elements
respond without forwarding the requests.  They also identify themselves in their
response, so that we know what component is handling the probe.

This is related to: #2856, #2849, #3239
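
A minimal sketch of the behavior that commit message describes, assuming a wrapping-handler shape (the function name and wiring are illustrative, not the actual code):

```go
package main

import (
	"io"
	"net/http"
)

// probeHandler wraps the normal request path: if the k-network-probe header
// is present, respond directly and identify this component; otherwise pass
// the request through to the user container untouched.
func probeHandler(componentName string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("k-network-probe") != "" {
			// Identify ourselves so the prober knows the probe reached,
			// e.g., the queue-proxy rather than some intermediate proxy.
			w.WriteHeader(http.StatusOK)
			io.WriteString(w, componentName)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```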
mattmoor added a commit to mattmoor/serving that referenced this issue Feb 17, 2019
When the flag `-enable-network-probing` is passed, the activator replaces its retrying transport logic with a simple network probe based on knative#3256, using a similar number of retries to what the retrying transport was previously configured for. Enabling this allows the GRPC test with streaming and cold-start to pass fairly reliably on my cluster (and likewise the GRPC ping sample in knative/docs, with my fixes).

This change also refactors the GRPC test into 4 tests, one for each of the logical things tested, which should dramatically reduce the time this adds to e2e once we switch to `t.Parallel()`, since that will parallelize the two waits for scale-to-zero.

Still TODO:
 - Unit testing for the get probes
 - Disable by default, and `t.Skip()` the streaming GRPC test

These will become `Fixes:` when this is enabled by default.
Related: knative#3239
Related: knative#2856
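
A sketch of the test layout that commit message describes (the subtest and helper names are placeholders, not the actual test code):

```go
package e2e

import "testing"

// Placeholder helpers standing in for the four logical scenarios.
func testUnary(t *testing.T)              {}
func testStreaming(t *testing.T)          {}
func testUnaryColdStart(t *testing.T)     {} // waits for scale-to-zero first
func testStreamingColdStart(t *testing.T) {} // waits for scale-to-zero first

func TestGRPC(t *testing.T) {
	tests := map[string]func(*testing.T){
		"unary":                testUnary,
		"streaming":            testStreaming,
		"unary-cold-start":     testUnaryColdStart,
		"streaming-cold-start": testStreamingColdStart,
	}
	for name, tc := range tests {
		tc := tc // capture the range variable (pre-Go 1.22 idiom)
		t.Run(name, func(t *testing.T) {
			t.Parallel() // lets the two scale-to-zero waits overlap
			tc(t)
		})
	}
}
```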
tanzeeb (Contributor, Author) commented Feb 19, 2019

@mattmoor yeah, the retrying in the Activator is full of problems; +1 to avoiding it altogether.

knative-prow-robot pushed a commit that referenced this issue Feb 22, 2019
When the flag `-enable-network-probing` is passed (on by default), the activator replaces its retrying transport logic with a simple network probe based on #3256, using a similar number of retries to what the retrying transport was previously configured for. Enabling this allows the GRPC test with streaming and cold-start to pass fairly reliably on my cluster (and likewise the GRPC ping sample in knative/docs, with my fixes).

This change also refactors the GRPC test into 4 tests, one for each of the logical things tested, which should dramatically reduce the time this adds to e2e once we switch to `t.Parallel()`, since that will parallelize the two waits for scale-to-zero.

Fixes: #3239
Fixes: #2856