Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add an st2canary pod that validates st2.packs.volumes #323

Merged
merged 27 commits into from
Feb 4, 2023

Conversation

cognifloyd
Copy link
Member

@cognifloyd cognifloyd commented Jul 12, 2022

Closes #321

We often get questions about getting the volumes working, but the errors are very hard to debug by the time st2 is running. So, I'm trying to prevent people from installing this chart if their values for volumes do not lead to a valid setup. And, if someone tries to upgrade from an install w/o volumes enabled to an install with it enabled, then I want the upgrade to fail early if the volumes settings are not working.

There are several sources for volumes issues:

  • the values are wrong
  • the k8s cluster is configured incorrectly
  • the underlying storage is configured incorrectly

I really want a way to push people to figure that out before the chart finishes installing something that is broken.

A key component of this is that it runs BEFORE the chart gets installed or updated. Especially with production installs, volumes issues should be found and fixed as early as possible, which is before the rest of the components get installed or upgraded.

@cognifloyd cognifloyd self-assigned this Jul 12, 2022
@pull-request-size pull-request-size bot added the size/L PR that changes 100-499 lines. Requires some effort to review. label Jul 12, 2022
@pull-request-size pull-request-size bot added size/XL PR that changes 500-999 lines. Consider splitting work into several ones that easier to review. and removed size/L PR that changes 100-499 lines. Requires some effort to review. labels Jul 29, 2022
@cognifloyd cognifloyd marked this pull request as ready for review July 29, 2022 22:32
@cognifloyd cognifloyd removed the WIP label Jul 29, 2022
@cognifloyd cognifloyd requested a review from arm4b July 29, 2022 22:41
@cognifloyd cognifloyd modified the milestones: v1.0.0, v0.100.0 Jul 30, 2022
@cognifloyd
Copy link
Member Author

@armab I want to get this in the next release as it should help isolate issues with st2.packs.volumes config and make it easier to identify and help people with that.

Copy link
Member

@arm4b arm4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Considering what this Pod does, verifying write filesystem access, can we move the logic to the original tests dir, maybe even possible under the BATS tests?

It can be the same pod
https://github.com/StackStorm/stackstorm-k8s/blob/master/templates/tests/st2tests-pod.yaml
or perhaps a new one dedicated to volume testing?

And so it could be run with the same testing mechanism helm test exactly when it's needed for troubleshooting, instead of running on every Helm deployment which feels a bit off for production.

@cognifloyd
Copy link
Member Author

I've never thought of using helm test as a debugging tool. Our current test script is designed for running in CI. It would have to be a lot more robust if we want people to run it on their customized st2 clusters. For example, assuming that logging in with the st2admin user/pass is often not valid (eg w/ ldap).

@cognifloyd
Copy link
Member Author

@mamercad What do you think of this pod?

We often get questions about getting the volumes working, but the errors are very hard to debug by the time st2 is running. So, I'm trying to prevent people from installing this chart if their values for volumes do not lead to a valid setup. And, if someone tries to upgrade from an install w/o volumes enabled to an install with it enabled, then I want the upgrade to fail early if the volumes settings are not working.

There are several sources for volumes issues:

  • the values are wrong
  • the k8s cluster is configured incorrectly
  • the underlying storage is configured incorrectly

I really want a way to push people to figure that out before the chart finishes installing something that is broken.

A key component of this is that it runs BEFORE the chart gets installed or updated. helm test can only run AFTER the chart is installed, so it does not meet my requirements (plus that has all of the CI assumptions baked in, so it is not useful for end users to debug with). I suppose we could have a values flag to opt-out of running this pod, but I want to make sure people run it before installing the chart.

@armab doesn't like the idea of running this with production clusters, but that's explicitly one of my goals. Especially with production installs, volumes issues should be found and fixed as early as possible, which is before the rest of the components get installed or upgraded.

Thoughts?

@mamercad
Copy link
Contributor

So, I'm trying to prevent people from installing this chart if their values for volumes do not lead to a valid setup. And, if someone tries to upgrade from an install w/o volumes enabled to an install with it enabled, then I want the upgrade to fail early if the volumes settings are not working.

I'm very much a fan of this.

@mamercad
Copy link
Contributor

@armab doesn't like the idea of running this with production clusters, but that's explicitly one of my goals. Especially with production installs, volumes issues should be found and fixed as early as possible, which is before the rest of the components get installed or upgraded.

I'm not overly familiar with StackStorm HA at this point in time (I'm not running it in production yet), but, I will say that as an operator of other Kubernetes applications, I'd personally rather have an application never get off of the ground (versus being broken while getting off of the ground) if that's a possibility. Adding simple RW testing to a Pod whose purpose and name implies that it's for testing (heh, maybe it should be renamed to st2canary) seems like a fine idea to me.

In short, I'm in the "fail as early as possible" camp as well.

@cognifloyd
Copy link
Member Author

Ooh. I like the idea of calling it st2canary. Very nice. I'm going to use that.

@cognifloyd cognifloyd changed the title add an st2tests pod that validates st2.packs.volumes add an st2canary pod that validates st2.packs.volumes Dec 21, 2022
Copy link
Member

@arm4b arm4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few more comments.

Can we make it a Job, same as others that we already have and call more explicitly what it does like st2-verify-pack-write-access (or somewhat close to that)?

The jobs we already have:
https://github.com/StackStorm/stackstorm-k8s/blob/master/templates/jobs.yaml
and as I remember if a job like register-content failed, the entire helm deployment will report a failure too. So logically that's following what you want to do.

Screenshot 2022-12-21 at 21-22-31 add an st2canary pod that validates st2 packs volumes · StackStorm_stackstorm-k8s@21a961c

That would be a bit more consistent with the existing mechanics.

templates/hooks/pod_packs-volumes.yaml Outdated Show resolved Hide resolved
values.yaml Show resolved Hide resolved
@cognifloyd
Copy link
Member Author

cognifloyd commented Dec 22, 2022

Yeah, that makes sense. I'll work on:

  • turning it into a job
  • adding more detailed error messaging with:
    • a link to the docs
    • what to do in case of failure

@cognifloyd cognifloyd removed this from the v0.110.0 milestone Jan 28, 2023
@pull-request-size pull-request-size bot added size/L PR that changes 100-499 lines. Requires some effort to review. and removed size/XL PR that changes 500-999 lines. Consider splitting work into several ones that easier to review. labels Jan 29, 2023
@cognifloyd cognifloyd marked this pull request as ready for review January 29, 2023 20:42
@cognifloyd cognifloyd requested a review from arm4b January 29, 2023 20:43
Copy link
Member

@arm4b arm4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for moving it to the jobs, looks more consistent.

Left a few more comments and observations.

templates/jobs.yaml Outdated Show resolved Hide resolved
templates/jobs.yaml Outdated Show resolved Hide resolved
values.yaml Outdated Show resolved Hide resolved
@CLAassistant
Copy link

CLAassistant commented Feb 3, 2023

CLA assistant check
All committers have signed the CLA.

Copy link
Member

@arm4b arm4b left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot! 👍

@cognifloyd cognifloyd merged commit d55e668 into StackStorm:master Feb 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Helm K8s size/L PR that changes 100-499 lines. Requires some effort to review.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a pre-install,pre-upgrade,pre-rollback hook that validates st2.packs.volumes
4 participants