[v0.3.1] Seeing intermittent "internal server error" in Dashboard #580
Comments
The controller now has the conduit-proxy injected as a sidecar for itself too, so it's possible the proxy is seeing an error, since it returns a 500 for internal errors. Any time the proxy turns an internal error into a 500, it now logs a better description of what it is turning into a 500 (of course, if the other side returned a 500, the proxy just passes that along, since it's not an internal proxy error...).
So if I wanted to get to the root of this error, where would I look?
@bourquep Try
where
Those are the only lines in this log. All conduit pods have a restart count of 0 since installing 0.3.1.
Okay, let's look at the logs from the
There is no
Here are the logs of all containers in the
Looking at the logs you posted, it's possible the 500 error results from the failure to watch pods in the
Keep in mind that this does not seem to affect the proxies and proxied services, just some intermittent glitch in the telemetry/dashboard. Absolutely not a big deal from my (user's) point of view, but probably worth investigating nonetheless to make sure it's not a symptom of something worse lurking somewhere. :) I'll be happy to provide whatever logs/info you might need.
Yeah, I suspect that what's going on here is that the telemetry service is intermittently erroring because it's getting a Connection Refused response from the k8s API when it tries to get a list of pods to associate metrics with, and briefly fails requests for new metrics states from the frontend. Then the request goes through again, and the banner goes away.
Thanks for sharing these logs. The telemetry service's logs look normal to me, and they don't indicate that the service has failed to serve any requests. This message is expected:
It's only logged at startup, when the service is attempting to read from the kubernetes API for the first time. It's expected that the first few requests will fail due to the changes from #365, so the process just logs the failure and retries. If you see this happen again, could you try finding the error response in your web inspector's network tab? I suspect that the response body would give us some indication of where the failure is happening.
The telemetry system has been replaced entirely in v0.4.0. I'd be surprised if this behavior persists, but it would be good to confirm it does not once the release is out.
I've been seeing this issue as well recently. It appears that the XHR request to
So that indicates that the request from the web pod to the controller pod is failing. Sure enough I see the same error printed in the web container's log:
But I don't see any other relevant info in the web pod's outbound proxy logs, or the controller pod's inbound proxy logs, or the public-api logs.
It's also worth noting that the web pod's inbound proxy stats reflect decreased success rate:
But the outbound stats from the web to the controller pod do not:
Based on the error message, it seems like we could set the public API client's
That would actually make a lot of sense. I can open the dashboard and set a timer, if you'd like, @klingerf, to see if that would solve it…
@christopherhein In the process of removing the client's idle connection timeout, I realized that the error message indicates that the server is closing the connection, not the client. So I think what we're actually seeing is an instance of golang/go#19943 (comment). All of our public API requests are POSTs, which means that the go client won't retry on server connection close, making us susceptible to this behavior. Will dig around a bit more to see if there's a different fix we can implement on the client side.
Ah, that is very interesting. It seems to recover itself the next time around; in my mind that would mean it could be something we catch and retry, but that is without fully understanding the context, @klingerf.
@christopherhein Ok cool, I added a client retry in kl/retry-server-close, and I've published linkerd images from that branch. Mind giving it a try in your environment?
I've never had a lot of luck reproducing this failure in my env. If it seems to work for you, I'll put this change up for review.
Sounds great, let me bring the env back up. Just to be clear, when I use the
That's right -- the
Awesome, so I got that running, but it still seems to error. I'm getting it on both
And has anyone experienced this on anything other than EKS…
@christopherhein Thanks for testing it out! Too bad this doesn't fix it, but will keep investigating. Fwiw I think there may be two separate issues that trigger the red error banner. Sometimes I see that error immediately on initial page load, and other times I see it pop up sporadically after the page has been running for a while. I was expecting this branch to fix the latter error, and I'm still not sure what causes the former.
Hmmm, interesting. For me it comes up at some interval; I don't really see it on first boot. I'll check it out and see if I can set a timer to see whether it correlates with the 90-second timeout, even though I think that's what you removed.
Second test: I just opened up the dashboard, and for the TPS Reports endpoint it might have been fixed; I'm still getting it on
Hey @christopherhein, really sorry for the delay here. Thanks for testing out those builds last week! I've done some testing in a GKE cluster, and it appears to me that the "server closed idle connection" error was fixed as part of the v18.7.2 release that we cut earlier this week. Specifically, linkerd/linkerd2-proxy#26 was included in the release, and that fixed an issue with HTTP connection reuse, which I suspect was triggering the "server closed idle connection" error. Can you give the v18.7.2 release a shot to see if it fixes the issue for you as well? You can install the updated CLI with:
Note that we display the red error banner in the web UI whenever a 5xx error is returned from the web server, and the fix in v18.7.2 only addresses one source of 5xx errors. I've opened #1366 to track adding more information to the error banner, which will help us distinguish between different types of errors going forward. In the meantime, if you see a 5xx error and are able to paste the response body into this issue, that would be super helpful. Thanks!
This is fantastic, thank you @klingerf! I've been running it silently in the background for the last 5 minutes and nothing has shown up in the console yet… 👏 great work!
@christopherhein Great news, glad to hear it! |
I'm going to go ahead and close this. Thanks again @christopherhein for your help tracking it down. We can open separate issues if the banner reappears, hopefully with more specificity once #1366 is fixed. |
This is on my test cluster, with almost no traffic. Just leaving a deployment page open in my browser, I see this red banner appear/disappear randomly. All my services are behaving properly though.
I have checked the logs of each container in the controller pod, nothing unusual there. I'll be happy to provide more info if you tell me where to look!