Bump go-dqlite to v1.5.1 #29
Force-pushed from 2f9968e to a12a000
@ibuildthecloud hey! Could we get this merged and bumped in k3s so that k3s-io/k3s#1639 can (maybe) be solved? Awesome work @freeekanayaka, @ibuildthecloud, and the Rancher teams; love k3s and dqlite.
@freeekanayaka Still getting segfaults and panics in a multi-master HA setup of k3s with dqlite 1.4.1: k3s-io/k3s#1639 (comment)
The go-dqlite 1.4.1 release won't be enough. You need to use the latest master commits in the canonical/dqlite and canonical/raft C libraries. Have you built them too?
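(For reference, not from the thread itself: a minimal sketch of what building those two C libraries from master might look like, assuming the standard autotools flow both projects use and default install prefixes; package names and paths may differ on your distro.)

```sh
# Sketch: build and install the raft and dqlite C libraries from master.
# Assumes autoconf, automake, libtool, make, and the libuv/sqlite3
# development headers are already installed.
git clone https://github.com/canonical/raft.git
(cd raft && autoreconf -i && ./configure && make && sudo make install)

git clone https://github.com/canonical/dqlite.git
(cd dqlite && autoreconf -i && ./configure && make && sudo make install)

# Refresh the shared-library cache so the new libraft/libdqlite are found.
sudo ldconfig
```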
Seems I had not; I had only replaced go-dqlite in go.mod and thought it would pull in the rest of the dependencies on its own. I dug into the build system of k3s and it's a bit odd, but it should be good now!
I didn't cut a release yet, so there's currently no version tag. If using a commit hash (the one from HEAD of master) does not work, please let me know and I'll do a release, since it's about time anyway.
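(For reference: with Go modules you can pin a dependency to an untagged commit by passing the hash to go get; the hash below is a hypothetical placeholder, not an actual commit from this conversation.)

```sh
# <commit-sha> is a placeholder for the hash of HEAD on master.
go get github.com/canonical/go-dqlite@<commit-sha>
# go.mod then records a pseudo-version derived from the commit date and hash.
go mod tidy
```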
For the sake of clarity I went ahead and cut 3 releases: go-dqlite v1.5.0, dqlite v1.4.1, and raft v0.9.18. Those should be the ones to test. Note that for kine, go-dqlite v1.5.0 is basically the same as v1.4.1; it just adds more helpers, which kine does not use. On the other hand, dqlite v1.4.1 and raft v0.9.18 do have fixes that should be relevant for kine.
@freeekanayaka Did just what you said, observing stability now! Thanks.
@freeekanayaka Been running for 8 hours now with several services installed on top of the cluster, and no dqlite-related issues that I can see. I did hit some other issues, but I don't think they're about dqlite.
Great! Thanks for the feedback, keep me posted if you notice anything.
@freeekanayaka As a side note, maybe get dqlite into https://github.com/google/oss-fuzz?
@freeekanayaka Ah... speaking of which, it just crashed again:
@leolb-aphp can you provide more details about your setup? How many k3s nodes? And ideally provide me with the exact yaml to reproduce your deployment (with kubectl apply -f). I have seen "panic: runtime error: index out of range" myself too at the beginning, but not anymore with the latest fixes.
I have 3 masters, and the deployments are varied: JupyterHub k8s 0.9 beta.4, Longhorn 0.8 for the default storage class, Rancher 2.4.2 (with built-in monitoring enabled), and Harbor 1.3.2, all from Helm. Default configuration for everything besides JupyterHub, which gets:
Rancher can be installed like this:
Rancher has default configuration templates for services such as Harbor or Longhorn; just take those. By the way, I have to wait 8-10 hours for the crash to happen, and I'm actively fiddling with the cluster trying things out. Is there a way to be absolutely sure that k3s is built with the correct version of dqlite etc.? Can it write versions to disk somehow? Do you have a repo you built it from, so we can be sure we get the same results? Can you describe your build process?
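(A sketch, not from the thread, of one way to answer the version question: binaries built with Go module support embed their dependency list, which `go version -m` can print. The binary path here is an assumption; adjust for your install.)

```sh
# Inspect the module versions embedded in a Go binary (works since Go 1.12).
# /usr/local/bin/k3s is an assumed install path.
go version -m /usr/local/bin/k3s | grep -E 'go-dqlite|kine'
```

Note this only covers the Go module; the dqlite and raft C libraries are linked separately and are not listed in the module metadata.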
@freeekanayaka Also, I'm thinking this could be the cause. I used go-dqlite 1.4.1 because 1.5.0 did not build:
Here's my build process. Clone and cd into https://github.com/leolb-aphp/dqlite-build (branch master), then run:
(or adjust it if you aren't behind a corporate proxy). Then clone and cd into https://github.com/leolb-aphp/k3s.git (branch release-1.17) and run:
(This is a workaround because things did not work using the Makefile; it's probably my corporate proxy, but I hardcoded modifications in my branch, so you still need to do this.) (You need to be part of the "docker" group so you can use the Docker socket, or install podman-docker if it works.) The latest commits on each of these repos use go-dqlite 1.5.0 because I just changed them, but my current tests use go-dqlite 1.4.1 because 1.5.0 does not build.
I'm thinking the build failure may be due to the Docker build cache; re-trying now.
Sorry, the build error is a mistake on my part. I've pushed a branch to fix it and will release v1.5.1 shortly. In any case, for k8s, v1.5.0 and v1.5.1 should be exactly like v1.4.1.
@leolb-aphp fyi I just pushed the v1.5.1 tag, so your build should work now.
@freeekanayaka Cool! Thanks a lot. By the way, the cluster got back on its feet on its own, with automatic service restarts and Kubernetes' resilience mechanisms. Interesting!
Still getting:
Is it on my side now?
I'm thinking the panic may be facilitated by lossy or bad networking conditions.
Got these:
So the first node is the one used as cluster-init (presumably k3s handles multi-master poorly by making later-joined nodes connect to the first node, which probably makes the whole cluster dependent on the cluster-init node, I think?). The cluster-init node sometimes gets panics and the rest of the nodes get segfaults; not at the same time, but that's the behavior.
Please see my case in k3s-io/k3s#1633. I have the same thoughts about HA...
@SteelCrow As far as I could observe myself, the latest versions of the dqlite dependencies have fixed most of the issues. I'd suggest the k3s team make a new build with those dependencies upgraded (or do it yourself) and then retry.
Yes, this is a bit on "your" side. It's a new dependency, so the go.mod configuration of k3s needs to be updated. I'll do this in this PR and bump it from 1.4.1 to 1.5.0.
@leolb-aphp, actually, looking at this traceback more carefully, I noticed that it can't have been generated by v1.4.1 or v1.5.0, since the line numbers and function names in the traceback you pasted could only have been produced by v1.4.0. So for some reason you haven't tested v1.4.1 or v1.5.0, but v1.4.0 or lower.
Sorry about that, I linked the wrong issue; I meant this one: k3s-io/k3s#1639
I think it's a problem with the direct dependency in the k3s build, so you need to update it there too. When I first compiled it, I initially missed that.
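(A hedged sketch of what bumping that direct dependency looks like in a Go module checkout such as k3s; the version to require comes from the comments above.)

```sh
# Point the direct requirement at the new go-dqlite release, then rebuild.
go mod edit -require=github.com/canonical/go-dqlite@v1.5.1
go mod tidy    # resolve the new version and update go.sum
go build ./...
```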
Force-pushed from a12a000 to 2250cd0
@ibuildthecloud any plans to upgrade go-dqlite, libdqlite, and raft in k3s?
I did all of that; I posted my build process here: #29 (comment)
I'm not familiar with the k3s build process. But again, the traceback you pasted does come from go-dqlite v1.4.0 or earlier. I'm not sure about the versions of the libdqlite and libraft shared libraries, but you need to make sure those are up to date too. Too bad that the k3s devs seem unresponsive.
Sorry we haven't been answering, @freeekanayaka; please feel free to ping me or @cjellick in the future, on GitHub or Slack. If we can update the PR to use 1.5.1, hopefully that is good enough to move forward (it might also help to have a PR for dqlite-build). Thanks for testing, @leolb-aphp.
As mentioned in this PR's description, I've changed the branch to use go-dqlite 1.5.0, so everything should be ready for merging. On top of this change, you'll have to make sure you also update the dqlite and raft C libraries in the k3s build process (dqlite v1.4.1 and raft v0.9.18).
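(Not from the thread: assuming both C libraries are installed system-wide and ship pkg-config metadata, a quick sanity check that the build environment has the expected releases might look like this.)

```sh
# Verify the installed C library versions match the releases named above.
pkg-config --modversion dqlite   # expect 1.4.1
pkg-config --modversion raft     # expect 0.9.18
```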
Thanks. Is there a reason for not going to go-dqlite 1.5.1?
Force-pushed from 2250cd0 to 66bb240
Ugh, no reason. Sorry, I updated this PR to point to 1.5.1 now, not 1.5.0.
Thanks @freeekanayaka!
Thanks a lot! I deployed a new build.
For your information, no stability issues since then! |
I have no hard evidence, but anecdotally some testing that I have done suggests that v1.4.1 has a few fixes that might help with canonical/dqlite#190 and improve stability in general, especially for the kind of workload k8s produces.
EDIT: since v1.5.1 has been released, I've updated this PR to use v1.5.1 instead of v1.4.1. It shouldn't make any difference for kine, but just to have the latest.