
[v0.3.1] Seeing intermittent "internal server error" in Dashboard #580

Closed

bourquep opened this issue Mar 16, 2018 · 29 comments
Labels
priority/P1 Planned for Release

Comments

@bourquep
Contributor

This is on my test cluster, with almost no traffic. Just leaving a deployment page open in my browser, I see this red banner appear/disappear randomly. All my services are behaving properly though.

[Screenshot: red "internal server error" banner in the dashboard]

I have checked the logs of each container in the controller pod, nothing unusual there.

I'll be happy to provide more info if you tell me where to look!

@seanmonstar
Contributor

The controller now has the conduit-proxy injected as a sidecar for itself too, so it's possible the proxy is seeing an error, since it returns a 500 for internal errors. Any time the proxy turns an internal error into a 500, it now logs a better description of what it is turning into a 500 (of course, if the other side returned a 500, the proxy just passes that along rather than treating it as an internal proxy error...).

@bourquep
Contributor Author

So if I wanted to get to the root of this error, where would I look?

@hawkw
Contributor

hawkw commented Mar 16, 2018

@bourquep Try

kubectl logs -n conduit web-$XXXX conduit-proxy

where web-$XXXX is the name of the web pod in the conduit namespace.

@bourquep
Contributor Author

Those are the only lines in this log. All conduit pods have a restart count of 0 since installing 0.3.1.

INFO conduit_proxy using controller at HostAndPort { host: DnsName("proxy-api.conduit.svc.cluster.local"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("proxy-api.conduit.svc.cluster.local") to 10.35.241.37
ERR! conduit_proxy::control "controller-client", controller error: Error attempting to establish underlying session layer: operation timed out after Duration { secs: 3, nanos: 0 }
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("proxy-api.conduit.svc.cluster.local") to 10.35.241.37

@hawkw
Contributor

hawkw commented Mar 16, 2018

Okay, let's look at the logs from the conduit-proxy container in the public-api pod in the conduit namespace.

@bourquep
Contributor Author

bourquep commented Mar 16, 2018

There is no public-api pod, but there is a public-api container in the controller pod.

Here are the logs of all containers in the controller pod:

$ k logs -n conduit controller-598b867959-5qw44 public-api                                                                                                               
time="2018-03-16T18:01:16Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:16Z" level=info msg="serving scrapable metrics on :9995"
time="2018-03-16T18:01:16Z" level=info msg="starting HTTP server on :8085"

$ k logs -n conduit controller-598b867959-5qw44 proxy-api                                                                                                                                             
time="2018-03-16T18:01:17Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:17Z" level=info msg="serving scrapable metrics on :9996"
time="2018-03-16T18:01:17Z" level=info msg="starting gRPC server on :8086"

$ k logs -n conduit controller-598b867959-5qw44 conduit-proxy                                                                                                                                         
INFO conduit_proxy using controller at HostAndPort { host: DnsName("localhost"), port: 8086 }
INFO conduit_proxy routing on V4(127.0.0.1:4140)
INFO conduit_proxy proxying on V4(0.0.0.0:4143) to None
INFO conduit_proxy::transport::connect "controller-client", DNS resolved DnsName("localhost") to 127.0.0.1

$ k logs -n conduit controller-598b867959-5qw44 telemetry                                                                                                                                             
time="2018-03-16T18:01:18Z" level=info msg="running conduit version v0.3.1"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:19Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"
time="2018-03-16T18:01:20Z" level=info msg="[pods watcher] initialized"
time="2018-03-16T18:01:20Z" level=info msg="[replicasets watcher] initialized"
time="2018-03-16T18:01:20Z" level=info msg="serving scrapable metrics on:9997"
time="2018-03-16T18:01:20Z" level=info msg="starting gRPC server on 127.0.0.1:8087"

$ k logs -n conduit controller-598b867959-5qw44 destination                                                                                                                                           
time="2018-03-16T18:01:17Z" level=info msg="running conduit version v0.3.1"
E0316 18:01:17.246299       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
E0316 18:01:18.248587       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
E0316 18:01:19.249303       1 reflector.go:205] github.com/runconduit/conduit/controller/k8s/endpoints.go:140: Failed to list *v1.Endpoints: Get https://10.35.240.1:443/api/v1/endpoints?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused
time="2018-03-16T18:01:20Z" level=info msg="[endpoints watcher] initialized"
time="2018-03-16T18:01:21Z" level=info msg="[endpoints watcher] initialized"
serving scrapable metrics on :9999
time="2018-03-16T18:01:21Z" level=info msg="starting gRPC server on 127.0.0.1:8089"
time="2018-03-16T18:01:53Z" level=info msg="Establishing watch on endpoint conduit/prometheus:9090"
time="2018-03-16T18:01:53Z" level=info msg="Establishing watch on endpoint conduit/api:8085"
time="2018-03-16T18:05:44Z" level=info msg="Establishing watch on endpoint studyo/generator:5000"
[....]

@hawkw
Contributor

hawkw commented Mar 16, 2018

There is no public-api pod, but there is a public-api container in the controller pod.
Sorry, that's what I meant, my bad.

Looking at the logs you posted, it's possible the 500 error results from the failure to watch pods in the telemetry service, though I'm not too familiar with this part of Conduit, so I'd have to take a look at the code to confirm. I suspect @siggy or @kl might have more insight...

@bourquep
Contributor Author

Keep in mind that this does not seem to affect the proxies or proxied services; it's just an intermittent glitch in the telemetry/dashboard. Absolutely not a big deal from my (user's) point of view, but probably worth investigating nonetheless to make sure it's not a symptom of something worse lurking somewhere. :)

I'll be happy to provide whatever logs/info you might need.

@hawkw
Contributor

hawkw commented Mar 16, 2018

Yeah, I suspect that what's going on here is that the telemetry service is intermittently erroring because it gets a Connection Refused response from the k8s API when it tries to list the pods to associate metrics with, and so briefly fails the frontend's requests for new metrics. Then the next request goes through, and the banner goes away.

@klingerf
Contributor

Thanks for sharing these logs. The telemetry service's logs look normal to me, and they don't indicate that the service has failed to serve any requests. This message is expected:

time="2018-03-16T18:01:18Z" level=error msg="[pods watcher] error establishing watch: github.com/runconduit/conduit/controller/k8s/pods.go:31: Failed to list *v1.Pod: Get https://10.35.240.1:443/api/v1/pods?limit=500&resourceVersion=0: dial tcp 10.35.240.1:443: connect: connection refused"

It's only logged at startup, when the service is attempting to read from the kubernetes API for the first time. It's expected that the first few requests will fail due to the changes from #365, so the process just logs the failure and retries.
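
For context, the startup behavior described here is essentially a log-and-retry loop around the initial list call. Below is a minimal sketch of that pattern; the package and function names are illustrative, not the actual Conduit watcher code.

package watcher

import (
	"log"
	"time"
)

// initPodWatch keeps retrying the initial pod list until it succeeds,
// logging each failure. This mirrors the repeated "error establishing
// watch" lines followed by "[pods watcher] initialized" in the logs above.
func initPodWatch(listPods func() error) {
	for {
		if err := listPods(); err != nil {
			log.Printf("[pods watcher] error establishing watch: %v", err)
			time.Sleep(time.Second) // brief pause before retrying
			continue
		}
		log.Println("[pods watcher] initialized")
		return
	}
}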

If you see this happen again, could you try finding the error response in your web inspector's network tab? I suspect that the response body would give us some indication of where the failure is happening.

@olix0r olix0r added this to the 0.4.0 milestone Mar 19, 2018
@olix0r olix0r added the priority/P1 Planned for Release label Mar 19, 2018
@olix0r
Member

olix0r commented Apr 16, 2018

The telemetry system has been replaced entirely in v0.4.0. I'd be surprised if this behavior persists, but it would be good to confirm it does not once the release is out.

@olix0r olix0r modified the milestones: 0.4.0, 0.4.1 Apr 16, 2018
@klingerf klingerf removed their assignment Apr 16, 2018
@siggy siggy modified the milestones: 0.4.1, 0.5.0 Apr 25, 2018
@klingerf klingerf self-assigned this May 29, 2018
@klingerf
Contributor

I've been seeing this issue as well recently. It appears that the XHR request to /api/stat is returning a 500, which triggers the banner. The 500 response body is:

{"error":"Post http://api.conduit.svc.cluster.local:8085/api/v1/StatSummary: http: server closed idle connection"}

So that indicates that the request from the web pod to the controller pod is failing. Sure enough, I see the same error printed in the web container's log:

$ kubectl -n conduit logs web-657b744c64-mzkvz web
...
http://api.conduit.svc.cluster.local:8085/api/v1/StatSummary: http: server closed idle connection"

But I don't see any other relevant info in the web pod's outbound proxy logs, or the controller pod's inbound proxy logs, or the public-api logs.

@klingerf
Contributor

It's also worth noting that the web pod's inbound proxy stats reflect a decreased success rate:

$ conduit stat -n conduit deploy/web -t 1h 
NAME   MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   SECURED
web       1/1    99.54%   0.3rps           5ms          10ms          10ms        0%

But the outbound stats from the web to the controller pod do not:

$ conduit stat -n conduit deploy --from deploy/web -t 1h 
NAME         MESHED   SUCCESS      RPS   LATENCY_P50   LATENCY_P95   LATENCY_P99   SECURED
controller      1/1   100.00%   0.2rps           0ms           0ms           0ms        0%

@christopherhein

This seems to still exist in the new linkerd UI as well. Didn't seem to affect anything, just strange :).
[Screenshot, 2018-07-18: error banner in the linkerd dashboard]

@klingerf
Contributor

Based on the error message, it seems like we could set the public API client's IdleConnTimeout to 0 (unlimited), and that might fix this issue. The default IdleConnTimeout is 90 seconds.

	// IdleConnTimeout is the maximum amount of time an idle
	// (keep-alive) connection will remain idle before closing
	// itself.
	// Zero means no limit.
	IdleConnTimeout time.Duration
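
For reference, that change would look roughly like the sketch below, assuming the public API client wraps a standard net/http Transport; this is illustrative only, not the actual Conduit client code.

package apiclient

import (
	"net/http"
	"time"
)

// newClient builds an HTTP client whose idle (keep-alive) connections are
// never closed by the client itself; an IdleConnTimeout of zero means no limit.
func newClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			IdleConnTimeout: 0,
		},
		Timeout: 10 * time.Second, // illustrative per-request timeout
	}
}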

@christopherhein

That would actually make a lot of sense. If you'd like, @klingerf, I can open the dashboard and set a timer to see if that would solve it…

@klingerf
Contributor

@christopherhein In the process of removing the client's idle connection timeout, I realized that the error message indicates that the server is closing the connection, not the client. So I think what we're actually seeing is an instance of golang/go#19943 (comment). All of our public API requests are POSTs, which means that the Go client won't retry on server connection close, making us susceptible to this behavior. Will dig around a bit more to see if there's a different fix we can implement on the client side.
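
One possible client-side mitigation, sketched under the assumption that the caller can rebuild the request body, is to retry a POST once when the error indicates the server closed an idle connection. The function and error-matching details here are illustrative; the actual change in kl/retry-server-close may differ.

package apiclient

import (
	"bytes"
	"net/http"
	"strings"
)

// postWithRetry retries a POST once if the first attempt fails because the
// server closed an idle keep-alive connection, since the Go client will not
// retry non-idempotent requests on its own.
func postWithRetry(client *http.Client, url, contentType string, body []byte) (*http.Response, error) {
	resp, err := client.Post(url, contentType, bytes.NewReader(body))
	if err != nil && strings.Contains(err.Error(), "server closed idle connection") {
		return client.Post(url, contentType, bytes.NewReader(body))
	}
	return resp, err
}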

@christopherhein

christopherhein commented Jul 20, 2018

Ah, that is very interesting. It seems to recover itself the next time around; in my mind that would mean it could be something we catch and retry, but that is without fully understanding the context, @klingerf.

@klingerf
Contributor

@christopherhein Ok cool, I added a client retry in kl/retry-server-close, and I've published linkerd images from that branch. Mind giving it a try in your environment?

linkerd install --linkerd-version git-4470c8fb | kubectl apply -f -

I've never had a lot of luck reproducing this failure in my env. If it seems to work for you, I'll put this change up for review.

@christopherhein

christopherhein commented Jul 20, 2018

Sounds great, let me bring the env back up. Just to be clear: when I use --linkerd-version, that will grab that version from GitHub, so I don't need to update my local binary?

@klingerf
Contributor

That's right -- the --linkerd-version flag will install that version in the linkerd namespace in your Kubernetes cluster. It won't affect your CLI version, but the change that I made only impacts the web server, so it should be no problem to install a new controller using the older CLI.

@christopherhein

Awesome, so I got that running, but it still seems to error. I'm getting it on both /tps-reports and /api/pods. Is there any chance this is trying to use metrics from the API server's /metrics endpoint?

And has anyone experienced this on anything other than EKS…

@klingerf
Contributor

@christopherhein Thanks for testing it out! Too bad this doesn't fix it, but will keep investigating. Fwiw I think there may be two separate issues that trigger the red error banner. Sometimes I see that error immediately on initial page load, and other times I see it pop up sporadically after the page has been running for a while. I was expecting this branch to fix the latter error, and I'm still not sure what causes the former.

@christopherhein

christopherhein commented Jul 21, 2018

Hmmm, interesting. For me it comes up at some interval; I don't really see it on first boot. I'll check it out and set a timer to see if it correlates to the 90-second timeout, even though I think that's what you removed.

@christopherhein

Second test: I just opened up the dashboard, and for the TPS Reports endpoint it might have been fixed. I'm still getting it on /api/pods. I also set a timer and it's every ~25 seconds.

@klingerf
Contributor

Hey @christopherhein, really sorry for the delay here. Thanks for testing out those builds last week!

I've done some testing in a GKE cluster, and it appears to me that the "server closed idle connection" error was fixed as part of the v18.7.2 release that we cut earlier this week. Specifically, linkerd/linkerd2-proxy#26 was included in the release, and that fixed an issue with HTTP connection reuse, which I suspect was triggering the "server closed idle connection" error.

Can you give the v18.7.2 release a shot to see if it fixes the issue for you as well? You can install the updated CLI with:

curl https://run.conduit.io/install? | sh

Note that we display the red error banner in the web UI whenever a 5xx error is returned from the web server, and the fix in v18.7.2 only addresses one source of 5xx errors. I've opened #1366 to track adding more information to the error banner, which will help us distinguish between different types of errors going forward. In the meantime, if you see a 5xx error and are able to paste the response body into this issue, that would be super helpful. Thanks!

@christopherhein

This is fantastic, thank you @klingerf! I've been running it silently in the background for the last 5 minutes and nothing has shown up in the console yet… 👏 great work!

@klingerf
Contributor

@christopherhein Great news, glad to hear it!

@klingerf
Contributor

I'm going to go ahead and close this. Thanks again @christopherhein for your help tracking it down. We can open separate issues if the banner reappears, hopefully with more specificity once #1366 is fixed.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 18, 2021