Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding specific logic for dealing with validation jobs #599

Closed

Conversation

fabianbaier
Copy link
Member

@fabianbaier fabianbaier commented Jul 18, 2019

What type of PR is this?

/component operator
/kind bug

What this PR does / why we need it:

This PR is intended to short term fix some business logic in the planexecution_controller.go. In particular, when dealing with jobs in general. It looks like we don't have right now any Operator that actually uses the Delete boolean, nor do we actually have a test that makes sure a validation job was successfully created and completed. Also, I think the zookeeper test in the current kudobuilder/operators repo seems to cover all the test cases but we might want to break it down into individual tests, e.g. one for just the validation phase to finish and one for just a new validation job will come up after parameter changes are applied.

Which issue(s) this PR fixes:

Fixes #586

Special notes for your reviewer:

Run a clean Minikube and install KUDO from this branch:

  • minikube start --vm-driver=hyperkit --cpus=6 --memory=9216 --disk-size=10g
  • In this branch root: kubectl apply -f config/crds && kubectl apply -f config/rbac

Make sure you have the most recent kudobuilder/operator repo cloned. In my case I can find it under /Users/fabianbaier/go/src/github.com/kudobuilder/operators.

Make sure you have the most recent Zookeeper operator installed ( The one with a validation phase and Delete bool set to true - at the moment this should be as easy as kubectl kudo install zookeeper --skip-instance )

Now you can try it out via (make sure you adjust your path to your operator repo):

go run ./cmd/kubectl-kudo/main.go test --start-control-plane=false /Users/fabianbaier/go/src/github.com/kudobuilder/operators/repository/zookeeper/tests

You can see it in action by watching the output of your go run ./cmd/kubectl-kudo/main.go test terminal from above but also having a new terminal open that runs: kubectl get pods -w --all-namespaces. The output for the test in the go run ./cmd/kubectl-kudo/main.go test terminal should look like:

--- PASS: kudo (116.17s)
    harness.go:155: Running tests using configured kubeconfig.
    --- PASS: kudo/harness (0.00s)
        --- PASS: kudo/harness/zookeeper-upgrade-test (115.30s)
            logger.go:37: 17:03:15 | zookeeper-upgrade-test | Creating namespace: kudo-test-concise-crab
            logger.go:37: 17:03:15 | zookeeper-upgrade-test/0-install | starting test step 0-install
            logger.go:37: 17:03:15 | zookeeper-upgrade-test/0-install | Instance:kudo-test-concise-crab/zk created
            logger.go:37: 17:05:06 | zookeeper-upgrade-test/0-install | test step completed 0-install
            logger.go:37: 17:05:06 | zookeeper-upgrade-test/1-resize | starting test step 1-resize
            logger.go:37: 17:05:06 | zookeeper-upgrade-test/1-resize | Instance:kudo-test-concise-crab/zk updated
            logger.go:37: 17:05:10 | zookeeper-upgrade-test/1-resize | test step completed 1-resize
            logger.go:37: 17:05:10 | zookeeper-upgrade-test | Deleting namespace: kudo-test-concise-crab
PASS

Which means that not just a validation job was running but changing it parameters and applying it also worked. The flow of it should be in general:

  1. Create all Zookeeper instances (zk-zookeeper-0, zk-zookeeper-1, zk-zookeeper-2)
  2. Once they are running, start the validation phase and create a validation job
  3. Wait for the job to finish (which tries to resolve those zookeeper instances)
  4. Have the job deleted if the bool in this phase/step was set to true
  5. Apply some new parameters (here new cpu value of 200m) to the Zookeeper operator
  6. Create a new job with those new parameters as well and validate
  7. Finish if everything looks good in the assertion

The output in the 2nd terminal that watches what happens to our pods looks promising:

kudo-test-concise-crab      zk-zookeeper-0                     0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Pending             0          2s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Pending             0          2s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     Pending             0          2s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     ContainerCreating   0          2s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     ContainerCreating   0          2s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     ContainerCreating   0          2s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     Running             0          4s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Running             0          5s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Running             0          5s
kudo-test-concise-crab      zk-zookeeper-1                     1/1     Running             0          16s
kudo-test-concise-crab      zk-zookeeper-0                     1/1     Running             0          18s
kudo-test-concise-crab      zk-zookeeper-2                     1/1     Running             0          22s
kudo-test-concise-crab      zk-validation-7lzp2                0/1     Pending             0          0s
kudo-test-concise-crab      zk-validation-7lzp2                0/1     Pending             0          0s
kudo-test-concise-crab      zk-validation-7lzp2                0/1     ContainerCreating   0          0s
kudo-test-concise-crab      zk-validation-7lzp2                1/1     Running             0          2s
kudo-test-concise-crab      zk-validation-7lzp2                0/1     Completed           0          87s
kudo-test-concise-crab      zk-validation-7lzp2                0/1     Terminating         0          87s
kudo-test-concise-crab      zk-validation-7lzp2                0/1     Terminating         0          87s
kudo-test-concise-crab      zk-validation-x276p                0/1     Pending             0          0s
kudo-test-concise-crab      zk-zookeeper-2                     1/1     Terminating         0          110s
kudo-test-concise-crab      zk-validation-x276p                0/1     Pending             0          0s
kudo-test-concise-crab      zk-validation-x276p                0/1     ContainerCreating   0          0s
kudo-test-concise-crab      zk-validation-x276p                0/1     Completed           0          2s
kudo-test-concise-crab      zk-validation-x276p                0/1     Terminating         0          3s
kudo-test-concise-crab      zk-validation-x276p                0/1     Terminating         0          3s
kudo-test-concise-crab      zk-zookeeper-1                     1/1     Terminating         0          114s
kudo-test-concise-crab      zk-zookeeper-0                     1/1     Terminating         0          114s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Terminating         0          2m20s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Terminating         0          2m25s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Terminating         0          2m25s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     Terminating         0          2m25s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Terminating         0          2m26s
kudo-test-concise-crab      zk-zookeeper-2                     0/1     Terminating         0          2m26s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Terminating         0          2m36s
kudo-test-concise-crab      zk-zookeeper-1                     0/1     Terminating         0          2m36s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     Terminating         0          2m36s
kudo-test-concise-crab      zk-zookeeper-0                     0/1     Terminating         0          2m36s

As you can see there are actually two validation jobs created. The second one is right after the parameter change, which is intended. There is right now no test that actually checks if this job has the new parameter (so this could be added too) however manually checking showed the new parameter in this job were reflected. that should do it for now.

Overall I think there are a lot of anti patterns and nested if this logic that makes things very hard to debug and probably needs some refactoring.

Does this PR introduce a user-facing change?:

NONE

@jbarrick-mesosphere
Copy link
Member

jbarrick-mesosphere commented Jul 18, 2019

Is the issue that steps with delete: true get immediately deleted after the objects are created without waiting for them to complete?

If so, can we leverage the methods in here to simplify this logic?

@jbarrick-mesosphere
Copy link
Member

I filed #600 so that the harness can help us write a test for this. In the meantime, we should write one in Go.

@alenkacz
Copy link
Contributor

maybe I am just very confused about how delete should work but why don't we have that as two steps? I understand delete as in this step please delete this resurce while with the validation it feels to me more like run the validation job in the step, then delete the resource

so why do we have

steps:
          - name: connection
            tasks:
              - validation
            delete: true

and not two steps, one running the validation and the other running the delete?

@fabianbaier
Copy link
Member Author

maybe I am just very confused about how delete should work but why don't we have that as two steps? I understand delete as in this step please delete this resurce while with the validation it feels to me more like run the validation job in the step, then delete the resource

so why do we have

steps:
          - name: connection
            tasks:
              - validation
            delete: true

and not two steps, one running the validation and the other running the delete?

I hope this clarifies: https://www.youtube.com/watch?v=eW0qfhEVTTY&feature=youtu.be&t=1432

Copy link
Contributor

@alenkacz alenkacz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fabianbaier thanks, that helped a bit. What I heard there (correct me if I am wrong):

  • nobody really understands what is delete supposed to do and on which types of objects and what problems is it supposed to be solving (it should not be used instead of regular garbage collection on the cluster)
  • it should be two steps - one step runs the job, second deletes it
  • we definitely need more docs to address the first point I made here

I still don't feel like I have enough context to be able to approve it and we should probably address the points ^^^ as part of this or in other PRs

adding restartPolicty to Job template

creating integration-tests

creating e2e test and added events to the planexecution controller

added e2e test for step-delete with more verbose assertion
@jbarrick-mesosphere jbarrick-mesosphere force-pushed the fb/planexecution_controller-specific-job-logic branch from ac0e3ba to 32368a1 Compare August 12, 2019 22:39
@kudo-ci
Copy link

kudo-ci commented Aug 12, 2019

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fabianbaier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jbarrick-mesosphere
Copy link
Member

I've rebased this one as it solves an issue that I'm having with the MySQL operator. But I agree with @alenkacz that we need to think through the logic here a lot better. This will still be broken for all other non-Job types after merging.

We should likely document the desired behavior in a KEP to ensure we're all on the same page. I think @runyontr also has some ideas on a better fix.

@kudo-ci
Copy link

kudo-ci commented Sep 14, 2019

@fabianbaier: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@fabianbaier
Copy link
Member Author

@fabianbaier thanks, that helped a bit. What I heard there (correct me if I am wrong):

  • nobody really understands what is delete supposed to do and on which types of objects and what problems is it supposed to be solving (it should not be used instead of regular garbage collection on the cluster)
  • it should be two steps - one step runs the job, second deletes it
  • we definitely need more docs to address the first point I made here

I still don't feel like I have enough context to be able to approve it and we should probably address the points ^^^ as part of this or in other PRs

As this is stale for a while now, and as mentioned ( https://kubernetes.slack.com/archives/CG3HTFCMV/p1568656474021300 ) I am happy to close this out if it has been or will be addressed in other PRs or if there is a plan change the logic or planexecution_controller into an easier or maintainable state.

@alenkacz
Copy link
Contributor

alenkacz commented Oct 3, 2019

PE does not exist anymore and we're be changing how delete is implemented in Tasker -> closing

@alenkacz alenkacz closed this Oct 3, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unable to change Parameters in an Operator with a Job in its Deploy Phase
4 participants