Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reboot only within time window specified on commandline #66

Merged
merged 8 commits into from
Oct 24, 2019

Conversation

jjjordanmsft
Copy link
Contributor

Similar to #20 but allows us to specify, e.g., business hours so we don't get awoken in the middle of the night :-).

@jjjordanmsft jjjordanmsft changed the title Allow reboot to be scheduled Reboot only within time window specified on commandline Apr 4, 2019
@patrickmslatteryvt
Copy link

👍

@jjjordanmsft
Copy link
Contributor Author

I've been picking at this a bit since opening, but I think that should be everything. Open to code review feedback.

@patrickmslatteryvt
Copy link

It would be really great if you added days to exclude reboots on, such as what chaoskube supports, see: https://github.com/linki/chaoskube
Not many of us want reboots on Christmas day or July 4th :-)

@jjjordanmsft
Copy link
Contributor Author

@patrickmslatteryvt It's not a bad idea, but I'd prefer to put that into a separate PR at this point. My team probably wouldn't make use of it. I'll see if I can adopt this into my change when I get a few more free cycles.

@sylr
Copy link

sylr commented May 16, 2019

@awh can you review this please ?

@unsafecode
Copy link

Is there any chance to have this merged? It would be incredibly useful!

@tkaepp
Copy link

tkaepp commented Jun 5, 2019

@unsafecode i applied this patch on our test-cluster today. I cloned everything into $GOPATH and pushed an image to our private registry. I also rebased the pull request onto the tag 1.2.0 to make sure I don't have too many changes.

Here are the Steps I did:

  • Getting a branch of the PR git fetch origin pull/66/head:pull66
  • Rebase onto 1.2.0 git rebase 1.2.0
  • Creating the local docker image GO111MODULE=on make image
  • Retagging for our registry & pushing it.
  • Fixing RBAC apigroup from extensions to apps for the daemonset.

I would also like this to be in the next release.

@alexeldeib
Copy link

alexeldeib commented Aug 9, 2019

bump? @awh

@alexeldeib
Copy link

@dholbach @hiddeco @rade can anyone look at this PR? Seems useful to a lot of folks, including ourselves in production. Would love to get some eyes on this...been quite a while with silence.

@dholbach dholbach requested a review from awh August 20, 2019 06:39
@stealthybox
Copy link
Contributor

I would merge this scheduling feature -- there a bunch of reasons you wouldn't want your cluster churning when nobody is around.

You can't anticipate something like the Monzo etcd/Linkerd outage. It's best if that doesn't wake up your on-call at 2am.

If my apps are going down when kured is rebooting Nodes, paging an SRE in the middle of the night isn't going to encourage fixing anything. A team with high operational toil will just choose to stop running kured.

Its better to be able to page the dev-team during the day when everybody is awake.
Thanks for taking the time to contribute this patch.

@stealthybox
Copy link
Contributor

stealthybox commented Sep 19, 2019

When merged with master, this patch builds kured successfully and tests pass :

cd github.com/weaveworks/kured
hub checkout https://github.com/weaveworks/kured/pull/66

git checkout master
git merge git merge jjjordanmsft-master

dep ensure
make cmd/kured/kured
go test ./... -v -count 1
# github.com/weaveworks/kured/cmd/prom-active-alerts
cmd/prom-active-alerts/main.go:17:16: undefined: alerts.PrometheusCountActive
?   	github.com/weaveworks/kured/cmd/kured	[no test files]
=== RUN   TestParseWeekdays
--- PASS: TestParseWeekdays (0.00s)
=== RUN   TestParseWeekdaysErrors
--- PASS: TestParseWeekdaysErrors (0.00s)
=== RUN   TestTimeWindows
--- PASS: TestTimeWindows (0.00s)
PASS
ok  	github.com/weaveworks/kured/pkg/timewindow	0.001s

@Thraganux
Copy link

Can we merge this if all checks are passing? This is a real need for my team and I prefer to use master builds for stable releases, rather than one-off builds from random branches.
Thanks

@stealthybox
Copy link
Contributor

@tkaepp appears to have successfully built and used this on a real cluster.
@markine merged this patch into this fork: https://github.com/AltSchool/kured.
I presume @jjjordanmsft and @alexeldeib are also using it.

@Thraganux, can you confirm that you've built this and it's working as intended?

@stealthybox stealthybox merged commit 4beddb5 into kubereboot:master Oct 24, 2019
@stealthybox
Copy link
Contributor

I'm merging this so we get a master CI build -- @Thraganux could you check that it works with and without the feature enabled? Cheers

@jjjordanmsft Thank you so much for this high-quality patch.

@alexeldeib
Copy link

@stealthybox for transparency the trio of @jjjordanmsft @Thraganux and I are colleagues :)

I rebuilt a fresh image off 4d2b876 (previously based on 1.2.0) and have been soaking that in several of our clusters for the last ~week with the feature enabled. I'm happy to test the same with the CI image (pending circleci issues?). It's been pretty uneventful, fortunately.

@stealthybox
Copy link
Contributor

Thanks so much @alexeldeib 🎊
Do you happen to be running any clusters without the time constraints?

This new feature's behavior is likely superior for most user's clusters, but I'm just looking for an anecdote that the old behavior still works.

@stealthybox
Copy link
Contributor

Oops, CI failed to push the build:
https://circleci.com/gh/weaveworks/kured/132

@stealthybox
Copy link
Contributor

stealthybox commented Oct 24, 2019

Overrode CI /w my personal credentials (momentarily)

The master-4beddb5 tag is up now:
https://hub.docker.com/r/weaveworks/kured/tags?page=1&name=master
https://circleci.com/gh/weaveworks/kured/133

We'll figure out why the dedicated CI docker account is being denied access soon.

@alexeldeib
Copy link

We aren't now, but I can take the new image and test with and without. Not likely to get to it today at the rate I'm going, but tomorrow I'll grab the latest CI image and (de)configure the feature in a few places.

@alexeldeib
Copy link

FYI, I rolled out a small test (~25 node clusters, 1 each with this feature on/off) based on that tag and things have been looking good 👍 will be promoting the same image to more envs, but not seeing any cause for concern.

@stealthybox
Copy link
Contributor

Brilliant -- that's as good of a user report as we could have asked for.
Thanks for taking the time to do that @alexeldeib

@easkay-castlight
Copy link

easkay-castlight commented Nov 4, 2019

Is there a particular release or version that this is planned for? Would love to get this functionality into use. :)

@stealthybox
Copy link
Contributor

@easkay-castlight At this point in time, I recommend using weaveworks/kured:master-4beddb5 and considering it a release.

Our CI infra needs some work and we(Weaveworks) do not have much pre-KubeCon bandwidth.

We would love and welcome users volunteering to maintain kured more formally.

@easkay-castlight
Copy link

@stealthybox Thanks for the info, that's really helpful. I would be interested in helping out for sure, not quite sure how/where to get started. Will take a look later this evening (GMT).

@DaveAurionix
Copy link

DaveAurionix commented Nov 13, 2019

@alexeldeib I'm doing something silly, can you sanity-check this? I'm using tag master-4beddb5 mentioned above but the pod logs always show unknown flag for any of the new flags, for example:

Error: unknown flag: --reboot-days wed if that is the first new flag in the pod template, or Error: unknown flag: --start-time 05:00:00 if I try to set only start-time.

Example pod template snippet:

          command:
            - /usr/bin/kured
            - --ds-name=kured
            - --ds-namespace=kured
            - --end-time "05:00:00"
            - --reboot-days wed
            - --start-time 3am
            - --time-zone Europe/London

You've been using that tag for a while, what am I doing wrong?

This is the simplest config I've tried:-

          command:
            - /usr/bin/kured
            - --ds-name=kured
            - --ds-namespace=kured
            - --start-time 03:00:00

@DaveAurionix
Copy link

@alexeldeib Sorry ignore me. I've realised where I'm being an idiot (I added the new params without an equals sign in the above snippets).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.