Worker Segmentation Fault (v0.44.0) #6896
Comments
I also have a similar issue. Recently several nodes got NodePressure status and pods got Evicted because the ingress controller generates so many core dumps; the total is almost 200GB. Using gdb, here is what I got.
After cleaning up, it takes only a few hours for the node to get DiskPressure again. |
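For anyone else hitting DiskPressure from the dumps, a rough stop-gap sketch; the core location depends on the node's kernel.core_pattern, so the path below is an assumption:

```shell
# See where the kernel writes core dumps on this node.
cat /proc/sys/kernel/core_pattern

# Assuming cores land in /var/crash (adjust to your core_pattern),
# remove dumps older than a day to relieve disk pressure.
find /var/crash -type f -name 'core*' -mtime +1 -delete

# Optional: discard new cores entirely until the underlying bug is fixed.
sysctl -w kernel.core_pattern=/dev/null
```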
@alfianabdi Well, first of all, run the following to install the musl debug symbols prior to running that gdb command:
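On the Alpine-based controller image this is typically something like the following (exact package names may differ):

```shell
# Install gdb and the musl debug symbols inside the Alpine-based controller container.
apk add --no-cache gdb musl-dbg
```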
|
Thanks, finally got it; somehow it does not work on an arm64 node.
|
@alfianabdi That's the same stacktrace I posted in my original comment and unfortunately it doesn't give us (or the ingress-nginx maintainers) any new information to help debug the issue. It does confirm that you are experiencing the same issue as I was at least (and presumably the same as the rest of the people in this issue even though they haven't posted backtraces to confirm). I just checked the latest changelogs for Alpine 3.13.x, musl, and the newest nginx version and nothing in them looks like it could be helpful. I would not expect this to be resolved with an upcoming ingress-nginx image (unless the issue was caused by something transient in the build). |
Pinging the following (mentioned in the owners file) for visibility. |
Same issue here with 0.44.0. More heavily loaded clusters are affected more often. Is this resolved with 0.45.0 (I don't think so, according to the changelog)? |
Yeah, we're quite loaded, each instance doing around 400 ops/sec. We never saw the segfault but observed almost double the CPU load for the same ops, until we rolled back. |
Maybe slightly off topic, as I don't know if the CPU spikes we saw were 100% caused by something in Alpine, but would it be worth providing a Debian-based image as an option? From what I gathered from #6527, the motivations were:
But would the same goals still be achievable with a trimmed-down version of Debian, like distroless? Although I still love Alpine for a lot of things, I have also moved away from it for many projects due to some well-known issues like the performance hit (mainly because of musl libc, I think) and networking/DNS problems (e.g. https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/#known-issues). |
I haven't tried 0.45.0, but I just compared the image with the 0.44.0 one, and the musl library, the nginx binary, and the libluajit library all hash to the same values as the 0.44.0 ones. I would be very surprised if the issue was resolved, given that the issue is most likely in one of those. Unfortunately, I don't think I can be of much more assistance in helping debug this issue. We ended up switching to Traefik for our ingress controller because of this issue (and also because Traefik doesn't close TCP connections when it reloads its config). We no longer have any ingress-nginx deployments running at all and have no plans to switch back even if this issue is fixed. |
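For reference, that kind of comparison can be done with something along these lines; the registry, tags, and search patterns are assumptions and may differ from what was actually used:

```shell
# Rough comparison of the binaries most likely involved, across two image tags.
for tag in v0.44.0 v0.45.0; do
  echo "== controller:$tag =="
  docker run --rm --entrypoint sh "k8s.gcr.io/ingress-nginx/controller:$tag" -c \
    'find / -name "ld-musl-*" -o -name "libluajit*" -o -path "*/sbin/nginx" 2>/dev/null | xargs sha256sum'
done
```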
Can the coredump be reproduced at will? (For example, by sending traffic to the controller on a kinD or minikube cluster on a laptop.) |
Follow-up: got a coredump with v0.44.0 and v0.45.0
|
@longwuyuan I don't know that anyone has reproduced this in kinD or minikube. If this issue is happening for someone, though, it is fairly consistent. We have multiple k8s clusters that are fairly identical and the issue was present in all of them (the segfaults just happen at a reduced rate on clusters that don't process as much traffic). @LuckySB Can I ask why you're using that kernel? 4.20 has been end-of-life since March 2019, and it is very insecure to be using it now. I see that 5.4 is in elrepo; if I were you I'd just use that, as it's an LTS kernel and is supported until Dec 2025. |
Any chance of anyone posting a reasonably detailed step-by-step process to reproduce this problem? I'm particularly hoping for OS, networking, and similar information and specifications nailed down for reproduction, because I am not able to reproduce it. /remove-kind bug |
/triage needs-information |
@longwuyuan Multiple people have provided stacktraces and additionally have full nginx worker coredumps that they can provide to ingress-nginx core developers (obviously they are sensitive files). I suppose I'm curious as to why that is not sufficient? |
Hi @ReillyBrogan, sorry to hear that you are no longer using ingress-nginx. There is a topic in the upcoming sig-network meeting to figure out how this project can get the appropriate amount of bandwidth from the community. |
@ReillyBrogan I am not able to reproduce. The available info hints at a combination of the node's kernel version and a certain volume of traffic, so I am guessing available CPU/memory, etc. |
I think the issue is related to the amount of traffic combined with frequent reloads of nginx, not necessarily to available CPU/memory. The cluster where we saw the issue very frequently (~50 restarts in 3 days across all 4 ingress pods) was a dev cluster with very frequent config changes and quite a lot of traffic through the ingress controller. Our kubelets in this cluster (55 in total) all have 24 CPUs and 64GB of memory, with an average usage of around 60% (CPU/mem). The OS is RHEL 7.9 with kernel 5.10.15-1.el7.elrepo.x86_64. |
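Based on that theory (traffic plus frequent reloads), a rough reproduction sketch might look like the following; the hostname, namespace, ingress name, and load tool are placeholders, not anything taken from this thread:

```shell
# Steady traffic against the controller (any HTTP load generator works; hey is just an example).
hey -z 30m -c 50 https://app.example.com/ &

# In parallel, force frequent nginx reloads by toggling an annotation that
# changes the rendered config (proxy-body-size ends up in nginx.conf).
while true; do
  for size in 8m 16m; do
    kubectl -n test annotate ingress demo-ingress \
      nginx.ingress.kubernetes.io/proxy-body-size="$size" --overwrite
    sleep 10
  done
done
```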
Do you have metrics for inodes, filehandles, conntrack, and similar resources on the node where the pod was running at the time of the segfault? |
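For reference, those numbers can be pulled on the node with standard tools; the conntrack files assume the nf_conntrack module is loaded:

```shell
# Inode usage per filesystem.
df -i

# Open file handles: allocated, free, and the system-wide maximum.
cat /proc/sys/fs/file-nr

# Conntrack table usage vs. limit (requires the nf_conntrack module).
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
```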
Hey, as the segfaults are relatively infrequent and difficult to reproduce, shouldn't we be working with data that's more readily accessible? As I demonstrated above, we observe roughly double the CPU usage between 0.43 and 0.44, and it's not a huge leap to say that whatever is causing that additional load is only going to exacerbate config reloads (already a high-CPU event). The CPU increase should be relatively trivial to reproduce. In the above example we're running 6 pods with 500m CPU requests (no limits), with each pod doing around 250-300 ops/sec. |
I can confirm we saw the same issue on AKS with version 0.45.0. Issue went away when we downgraded to 0.43.0. |
I can confirm we saw the same issue on GCP/GKE with version 0.45.0. The issue also went away with 0.43.0. From our Compute Engine instances, we also found that:
On this cluster, we have a lot of ingresses (~200). We didn't see this issue on a similar cluster with a quite similar ingress volume. |
Just learned about this issue the hard way! I confirm the issue is present on 0.46.0 as well. Planning a downgrade till the issue is fixed. |
We have released v0.49.2 and v1.0.2 with the fix. Can you all please test and give us some feedback? Thank you so much, especially @doujiang24 and @tao12345666333, for the help! |
Thank you for the prompt fix! I've tested
This is without |
Hi @sepich, question: are you using TLS/SSL for mutual auth, or just for a secure endpoint? What's your current load, and how many ingress objects do you have? Thanks! |
We do not verify client certs, so it is only TLS for server certs (provided by cert-manager). |
@sepich Also, has anyone else who got the segfault tried the new release? Feedback welcome, thanks! |
Running 4 replicas of v0.49.2 for 24h now, so far without a crash! There is also one replica of an old v0.44.0 image running in the same cluster for the same 24h, which so far also had no crash. The strange thing is that back in April we had something like a dozen crashes per day with v0.44.x and v0.45.x, and currently it seems to be not reproducible. The cluster is quite large, with ~1400 ingress objects and around 200 req/s. From an infra perspective the only significant change was the switch from RHEL 7 (with docker) to Debian 10 (with containerd), but I think this is not related to the problem disappearing. I think it's probably related to some user who previously used a special ingress configuration (e.g. some special annotations), but that's also just a guess... |
I also think that's reasonable. Thanks for your feedback! |
Hi @sepich, I need some help from you, thanks!
Could you please confirm that the openssl-dbg package is installed and that its symbols are loaded properly by gdb?
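For example, something along these lines inside the controller container (a sketch, assuming the gdb session runs in the same container):

```shell
# Check that the debug-symbol package is present in the container.
apk info -e openssl-dbg && echo "openssl-dbg is installed"

# Then, inside gdb, check whether symbols were read for the SSL libraries:
#   (gdb) info sharedlibrary ssl
# Entries marked with "(*)" are missing debugging information.
```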
Again, a core file would be very much appreciated. It's painful to debug such a bug without one. Thanks very much! |
Sorry for the delay; I was discussing sharing the coredump internally and unfortunately I am not allowed to. We can arrange a call with an interactive gdb session if you wish.
Another thing I've tried is to build a Debian-based image with
Please drop me an email if you have time for the call, or provide some instructions I can help you with. Thank you. |
A bit more than 48h into testing v0.49.2 (4 replicas), no SIGSEGV so far. In the same time there were 361 SIGSEGVs on v0.44.0 (1 replica), so at least for me the issue seems to be gone and the fix in the lua-resty-balancer module helped. I took a look at nearly two dozen coredumps and all of them crashed in Lua code, not a single one in SSL-related code, so I think I can't help any further at the moment. |
Hi @ghouscht, thanks very much. The |
Hi @sepich, thanks for your help.
Oh no, it seems I got something wrong somewhere. I will take a deeper look.
Oh, that's bad news, but I understand it.
Sure, that would be more helpful. I will contact you by email when I have a further plan and more free time, maybe one or two days later. Thanks for your patience. |
@doujiang24 some possibly related, pertinent issues: |
And #7080 |
OK, @doujiang24 @sepich, can we move the OpenSSL issue to another issue (like #7647) and keep it there? This way we can say that the specific issue of this case is solved and start digging more into OpenSSL. Thanks! |
Hi @rikatz, totally agreed. |
/close This one has been solved, thank you very much @doujiang24. I'm moving the discussion about the endless SSL coredumps to #7080. |
@rikatz: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Upgrading to 0.49.3 helped here too for those symptoms (no more segfaults, no more restarts).
|
Was there a way to replicate this issue on 0.45.0 by modifying ingresses? We're observing this issue when ingresses are removed from a namespace, but when I tried replicating this same behavior on the same version, no coredumps were logged even when the backend reloads. |
NGINX Ingress controller version: 0.44.0
Kubernetes version (use kubectl version): 1.18.3
Environment:
What happened:
We encountered a major production outage a few days ago that was traced back to ingress-nginx. ingress-nginx pod logs were filled with messages like the following:
We discovered that the following message was being printed to the system log as well (timed with the worker exits):
I ultimately identified that whatever was occurring was linked to the version of ingress-nginx we were using and reverted production to 0.43.0 until we could identify the underlying issue.
We have a few other lower-load ingress-nginx deployments that have remained at 0.44.0 and have observed apparently random worker crashes; however, there are always enough running workers and the crashes are infrequent enough that things seemingly remain stable.
I was able to get a worker coredump from one of those infrequent crashes and the backtrace is as follows:
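The backtrace itself is not reproduced in this copy. For reference, a backtrace like that can typically be extracted from a worker core with something like the following; the binary and core paths are examples only:

```shell
# Print a full backtrace from an nginx worker core dump (paths are examples).
gdb -batch -ex 'bt full' /usr/local/nginx/sbin/nginx /tmp/core.nginx-worker
```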
One of the major differences between 0.43.0 and 0.44.0 is the update to Alpine 3.13. Perhaps the version of musl in use is the issue, and it would be appropriate to revert that change until Alpine has released a fixed version?
/kind bug