
[v0.3.1] Seeing intermittent "internal server error" in Dashboard #580

Closed

bourquep opened this issue Mar 16, 2018 · 29 comments
Labels
priority/P1 Planned for Release

Comments

@bourquep
Contributor

This is on my test cluster, with almost no traffic. Just leaving a deployment page open in my browser, I see this red banner appear/disappear randomly. All my services are behaving properly though.

[Screenshot: red "internal server error" banner in the dashboard]

I have checked the logs of each container in the controller pod, nothing unusual there.

I'll be happy to provide more info if you tell me where to look!

@seanmonstar
Contributor

The controller now has the conduit-proxy injected as a sidecar for itself too, so it's possible the proxy is seeing an error, since it returns a 500 for internal errors. Any time the proxy turns an internal error into a 500, it now logs a better description of what it is turning into a 500 (of course, if the other side returned a 500, the proxy just passes that along rather than treating it as an internal proxy error...).

@bourquep
Contributor Author

So if I wanted to get to the root of this error, where would I look?

@hawkw
Contributor

hawkw commented Mar 16, 2018

@bourquep Try

kubectl logs -n conduit web-$XXXX conduit-proxy

where web-$XXXX is the name of the web pod in the conduit namespace.

@bourquep
Contributor Author

Those are the only lines in this log. All conduit pods have a restart count of 0 since installing 0.3.1.

INFO conduit_proxy using controller at HostAndPort { host: DnsName("proxy-api.conduit.svc.cluster.local"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("proxy-api.conduit.svc.cluster.local") to 10.35.241.37
ERR! conduit_proxy::control "controller-client", controller error: Error attempting to establish underlying session layer: operation timed out after Duration { secs: 3, nanos: 0 }
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("proxy-api.conduit.svc.cluster.local") to 10.35.241.37

@hawkw
Contributor

hawkw commented Mar 16, 2018

Okay, let's look at the logs from the conduit-proxy container in the public-api pod in the conduit namespace.

@bourquep
Contributor Author

bourquep commented Mar 16, 2018

There is no public-api pod, but there is a public-api container in the controller pod.

Here are the logs of all containers in the controller pod:

$ k logs -n conduit controller-598b867959-5qw44 public-api                                                                                                               
time="2018-03-16T18:01:16Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:16Z" level=info msg="serving scrapable metrics on :9995"
time="2018-03-16T18:01:16Z" level=info msg="starting HTTP server on :8085"

$ k logs -n conduit controller-598b867959-5qw44 proxy-api                                                                                                                                             
time="2018-03-16T18:01:17Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:17Z" level=info msg="serving scrapable metrics on :9996"
time="2018-03-16T18:01:17Z" level=info msg="starting gRPC server on :8086"

$ k logs -n conduit controller-598b867959-5qw44 conduit-proxy                                                                                                                                         
INFO conduit_proxy using controller at HostAndPort { host: DnsName("localhost"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("localhost") to 127.0.0.1

$ k logs -n conduit controller-598b867959-5qw44 telemetry                                                                                                                                             
time="2018-03-16T18:01:18Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:19Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:20Z" level=info msg="[pods watcher] initialized"
time="2018-03-16T18:01:20Z" level=info msg="[replicasets watcher] initialized"
time="2018-03-16T18:01:20Z" level=info msg="serving scrapable metrics on:9997"
time="2018-03-16T18:01:20Z" level=info msg="starting gRPC server on 127.0.0.1:8087"

$ k logs -n conduit controller-598b867959-5qw44 destination                                                                                                                                           
time="2018-03-16T18:01:17Z" level=info msg="running conduit version v0.3.1"
E0316 18:01:17.246299       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
E0316 18:01:18.248587       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
E0316 18:01:19.249303       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
time="2018-03-16T18:01:20Z" level=info msg="[endpoints watcher] initialized"
time="2018-03-16T18:01:21Z" level=info msg="[endpoints watcher] initialized"
serving scrapable metrics on :9999
time="2018-03-16T18:01:21Z" level=info msg="starting gRPC server on 127.0.0.1:8089"
time="2018-03-16T18:01:53Z" level=info msg="Establishing watch on endpoint conduit/prometheus:9090"
time="2018-03-16T18:01:53Z" level=info msg="Establishing watch on endpoint conduit/api:8085"
time="2018-03-16T18:05:44Z" level=info msg="Establishing watch on endpoint studyo/generator:5000"
[....]

@hawkw
Contributor

hawkw commented Mar 16, 2018

There is no public-api pod, but there is a public-api container in the controller pod.
Sorry, that's what I meant, my bad.

Looking at the logs you posted, it's possible the 500 error results from the failure to watch pods in the telemetry service, though I'm not too familiar with this part of Conduit, so I'd have to take a look at the code to confirm. I suspect @siggy or @kl might have more insight...

@bourquep
Contributor Author

Keep in mind that this does not seem to affect the proxies or proxied services; it's just an intermittent glitch in the telemetry/dashboard. Absolutely not a big deal from my (user's) point of view, but probably worth investigating nonetheless to make sure it's not a symptom of something worse lurking somewhere. :)

I'll be happy to provide whatever logs/info you might need.

@hawkw
Contributor

hawkw commented Mar 16, 2018

Yeah, I suspect that what's going on here is that the telemetry service is intermittently erroring because it gets a Connection Refused response from the k8s API when it tries to list the pods to associate metrics with, and so briefly fails the frontend's requests for new metrics. Then the next request goes through, and the banner goes away.

@klingerf
Contributor

Thanks for sharing these logs. The telemetry service's logs look normal to me, and they don't indicate that the service has failed to serve any requests. This message is expected:

time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"

It's only logged at startup, when the service is attempting to read from the kubernetes API for the first time. It's expected that the first few requests will fail due to the changes from #365, so the process just logs the failure and retries.
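
For context, the startup behavior described here is essentially a log-and-retry loop around the initial list call. Below is a minimal sketch of that pattern; the package and function names are illustrative, not the actual Conduit watcher code.

package watcher

import (
	"log"
	"time"
)

// initPodWatch keeps retrying the initial pod list until it succeeds,
// logging each failure. This mirrors the repeated "error establishing
// watch" lines followed by "[pods watcher] initialized" in the logs above.
func initPodWatch(listPods func() error) {
	for {
		if err := listPods(); err != nil {
			log.Printf("[pods watcher] error establishing watch: %v", err)
			time.Sleep(time.Second) // brief pause before retrying
			continue
		}
		log.Println("[pods watcher] initialized")
		return
	}
}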

If you see this happen again, could you try finding the error response in your web inspector's network tab? I suspect that the response body would give us some indication of where the failure is happening.

@olix0r olix0r added this to the 0.4.0 milestone Mar 19, 2018
@olix0r olix0r added the priority/P1 Planned for Release label Mar 19, 2018
@olix0r
Member

olix0r commented Apr 16, 2018

The telemetry system has been replaced entirely in v0.4.0. I'd be surprised if this behavior persists, but it would be good to confirm it does not once the release is out.

@olix0r olix0r modified the milestones: 0.4.0, 0.4.1 Apr 16, 2018
@klingerf klingerf removed their assignment Apr 16, 2018
@siggy siggy modified the milestones: 0.4.1, 0.5.0 Apr 25, 2018
@klingerf klingerf self-assigned this May 29, 2018
@klingerf
Contributor

I've been seeing this issue as well recently. It appears that the XHR request to /api/stat is returning a 500, which triggers the banner. The 500 response body is:

{"error":"Post http://api.conduit.svc.cluster.local:8085/api/v1/StatSummary: http: server closed idle connection"}

So that indicates that the request from the web pod to the controller pod is failing. Sure enough, I see the same error printed in the web container's log:

$ kubectl -n conduit logs web-657b744c64-mzkvz web
...
http://api.conduit.svc.cluster.local:8085/api/v1/StatSummary: http: server closed idle connection"

But I don't see any other relevant info in the web pod's outbound proxy logs, or the controller pod's inbound proxy logs, or the public-api logs.

@klingerf
Contributor

It's also worth noting that the web pod's inbound proxy stats reflect a decreased success rate:

$ conduit stat -n conduit deploy/web -t 1h 
NAME   MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   SECURED
web       1/1    99.54%   0.3rps           5ms          10ms          10ms        0%

But the outbound stats from the web to the controller pod do not:

$ conduit stat -n conduit deploy --from deploy/web -t 1h 
NAME         MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   SECURED
controller      1/1   100.00%   0.2rps           0ms           0ms           0ms        0%

@christopherhein

This seems to still exist in the new linkerd UI as well. Didn't seem to affect anything, just strange :).
[Screenshot, 2018-07-18: error banner in the linkerd dashboard]

@klingerf
Contributor

Based on the error message, it seems like we could set the public API client's IdleConnTimeout to 0 (unlimited), and that might fix this issue. The default IdleConnTimeout is 90 seconds.

	// IdleConnTimeout is the maximum amount of time an idle
	// (keep-alive) connection will remain idle before closing
	// itself.
	// Zero means no limit.
	IdleConnTimeout time.Duration
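
For reference, that change would look roughly like the sketch below, assuming the public API client wraps a standard net/http Transport; this is illustrative only, not the actual Conduit client code.

package apiclient

import (
	"net/http"
	"time"
)

// newClient builds an HTTP client whose idle (keep-alive) connections are
// never closed by the client itself; an IdleConnTimeout of zero means no limit.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 0,
		},
		Timeout: 10 * time.Second, // illustrative per-request timeout
	}
}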

@christopherhein

That would actually make a lot of sense. If you'd like, @klingerf, I can open the dashboard and set a timer to see if that would solve it…

@klingerf
Contributor

@christopherhein In the process of removing the client's idle connection timeout, I realized that the error message indicates that the server is closing the connection, not the client. So I think what we're actually seeing is an instance of golang/go#19943 (comment). All of our public API requests are POSTs, which means that the Go client won't retry on server connection close, making us susceptible to this behavior. Will dig around a bit more to see if there's a different fix we can implement on the client side.
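
One possible client-side mitigation, sketched under the assumption that the caller can rebuild the request body, is to retry a POST once when the error indicates the server closed an idle connection. The function and error-matching details here are illustrative; the actual change in kl/retry-server-close may differ.

package apiclient

import (
	"bytes"
	"net/http"
	"strings"
)

// postWithRetry retries a POST once if the first attempt fails because the
// server closed an idle keep-alive connection, since the Go client will not
// retry non-idempotent requests on its own.
func postWithRetry(client *http.Client, url, contentType string, body []byte) (*http.Response, error) {
	resp, err := client.Post(url, contentType, bytes.NewReader(body))
	if err != nil && strings.Contains(err.Error(), "server closed idle connection") {
		return client.Post(url, contentType, bytes.NewReader(body))
	}
	return resp, err
}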

@christopherhein

christopherhein commented Jul 20, 2018

Ah, that is very interesting. It seems to recover itself the next time around; in my mind that would mean it could be something we catch and retry, but that is without fully understanding the context, @klingerf.

@klingerf
Contributor

@christopherhein Ok cool, I added a client retry in kl/retry-server-close, and I've published linkerd images from that branch. Mind giving it a try in your environment?

linkerd install --linkerd-version git-4470c8fb | kubectl apply -f -

I've never had a lot of luck reproducing this failure in my env. If it seems to work for you, I'll put this change up for review.

@christopherhein

christopherhein commented Jul 20, 2018

Sounds great, let me bring the env back up. Just to be clear: when I use --linkerd-version, that will grab that version from GitHub, so I don't need to update my local binary?

@klingerf
Contributor

That's right -- the --linkerd-version flag will install that version in the linkerd namespace in your Kubernetes cluster. It won't affect your CLI version, but the change that I made only impacts the web server, so it should be no problem to install a new controller using the older CLI.

@christopherhein

Awesome, so I got that running, but it still seems to error. I'm getting it on both /tps-reports and /api/pods. Is there any chance this is trying to use metrics from the API server's /metrics endpoint?

And has anyone experienced this on anything other than EKS…

@klingerf
Contributor

@christopherhein Thanks for testing it out! Too bad this doesn't fix it, but will keep investigating. Fwiw I think there may be two separate issues that trigger the red error banner. Sometimes I see that error immediately on initial page load, and other times I see it pop up sporadically after the page has been running for a while. I was expecting this branch to fix the latter error, and I'm still not sure what causes the former.

@christopherhein

christopherhein commented Jul 21, 2018

Hmmm, interesting. For me it comes up at some interval; I don't really see it on first boot. I'll check it out and set a timer to see if it correlates to the 90-second timeout, even though I think that's what you removed.

@christopherhein

Second test: I just opened up the dashboard, and for the TPS Reports endpoint it might have been fixed. I'm still getting it on /api/pods. I also set a timer and it's every ~25 seconds.

@klingerf
Contributor

Hey @christopherhein, really sorry for the delay here. Thanks for testing out those builds last week!

I've done some testing in a GKE cluster, and it appears to me that the "server closed idle connection" error was fixed as part of the v18.7.2 release that we cut earlier this week. Specifically, linkerd/linkerd2-proxy#26 was included in the release, and that fixed an issue with HTTP connection reuse, which I suspect was triggering the "server closed idle connection" error.

Can you give the v18.7.2 release a shot to see if it fixes the issue for you as well? You can install the updated CLI with:

curl https://run.conduit.io/install? | sh

Note that we display the red error banner in the web UI whenever a 5xx error is returned from the web server, and the fix in v18.7.2 only addresses one source of 5xx errors. I've opened #1366 to track adding more information to the error banner, which will help us distinguish between different types of errors going forward. In the meantime, if you see a 5xx error and are able to paste the response body into this issue, that would be super helpful. Thanks!

@christopherhein

This is fantastic, thank you @klingerf! I've been running it silently in the background for the last 5 minutes and nothing has shown up in the console yet… 👏 great work!

@klingerf
Contributor

@christopherhein Great news, glad to hear it!

@klingerf
Contributor

I'm going to go ahead and close this. Thanks again @christopherhein for your help tracking it down. We can open separate issues if the banner reappears, hopefully with more specificity once #1366 is fixed.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021