
Support upgrades for connect refactor #509

Merged: 6 commits merged into master from upgrades on May 10, 2021
Conversation

@ndhanushkodi ndhanushkodi commented May 4, 2021

Before the connect refactor, service registration in Consul was managed by the lifecycle sidecar, which would re-register the service with Consul every 10s. Now, service registration is managed by Endpoints controller.

In order to support upgrades to the refactored Endpoints controller, we need Endpoints controller to NOT register or deregister any services managed by lifecycle sidecar. To do this, the label consul.hashicorp.com/connect-inject-managed-by is added to pods by the mutating webhook, so Endpoints controller will ignore older services managed by lifecycle sidecar (legacy services) for service registration/deregistration.

To make sure endpoints controller only deregisters services managed by endpoints controller, a meta key indicating the service is managed by endpoints controller is added to the Consul service registration.

To support health checks for legacy services, the Endpoints controller will always update the health check for any pod, whether it's managed by Endpoints controller or not. It does this only for the service health check, not the proxy health checks.
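As an illustration, the per-pod decision described above can be sketched as follows. All names here are hypothetical stand-ins for this write-up; the real constants and types live in consul-k8s's connect-inject package:

```go
package main

import "fmt"

// Hypothetical stand-ins for the managed-by label described above.
const (
	labelManagedBy        = "consul.hashicorp.com/connect-inject-managed-by"
	managedByEndpointsCtl = "endpoints-controller"
)

// pod is a minimal stand-in for corev1.Pod.
type pod struct {
	name   string
	labels map[string]string
}

// isManagedByEndpointsController reports whether the mutating webhook
// stamped this pod as managed by Endpoints controller. Legacy pods,
// injected before the refactor, lack the label.
func isManagedByEndpointsController(p pod) bool {
	return p.labels[labelManagedBy] == managedByEndpointsCtl
}

func main() {
	legacy := pod{name: "static-server-old", labels: map[string]string{}}
	managed := pod{name: "static-server-new", labels: map[string]string{labelManagedBy: managedByEndpointsCtl}}

	for _, p := range []pod{legacy, managed} {
		if isManagedByEndpointsController(p) {
			fmt.Printf("%s: register/deregister service (Consul meta key marks it as managed)\n", p.name)
		} else {
			fmt.Printf("%s: skip registration/deregistration (legacy, lifecycle sidecar owns it)\n", p.name)
		}
		// The service health check is upserted for both legacy and managed pods.
		fmt.Printf("%s: upsert service health check\n", p.name)
	}
}
```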

How I've tested this PR:

  • Helm install v0.31.1 with the following values
global:
  domain: consul
  datacenter: dc1
  image: hashicorp/consul-enterprise:1.9.4-ent
  enableConsulNamespaces: true
server:
  replicas: 1
  bootstrapExpect: 1

client:
  enabled: true
  grpc: true

ui:
  enabled: true

connectInject:
  enabled: true
  consulNamespaces:
    mirroringK8S: true

controller:
  enabled: true
  • k apply -f static-client.yaml
apiVersion: v1
kind: Service
metadata:
  name: static-client
  namespace: foo
spec:
  selector:
    app: static-client
  ports:
    - port: 80
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-client
  namespace: foo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-client
  namespace: foo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: static-client
  template:
    metadata:
      name: static-client
      namespace: foo
      labels:
        app: static-client
      annotations:
        "consul.hashicorp.com/connect-inject": "true"
        "consul.hashicorp.com/connect-service-upstreams": "static-server:1234"
    spec:
      containers:
        - name: static-client
          image: tutum/curl:latest
          command: [ "/bin/sh", "-c", "--" ]
          args: [ "while true; do sleep 30; done;" ]
      serviceAccountName: static-client
  • k apply -f static-server.yaml
apiVersion: v1
kind: Service
metadata:
  name: static-server
  namespace: foo
spec:
  selector:
    app: static-server
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8080
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: static-server
  namespace: foo
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: static-server
  namespace: foo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: static-server
  template:
    metadata:
      name: static-server
      namespace: foo
      labels:
        app: static-server
      annotations:
        "consul.hashicorp.com/connect-inject": "true"
    spec:
      containers:
        - name: static-server
          image: docker.mirror.hashicorp.services/kschoche/http-echo:latest
          args:
            - -text="hello world"
            - -listen=:8080
          ports:
            - containerPort: 8080
              name: http
          readinessProbe:
            exec:
              command: ['sh', '-c', 'test ! -f /tmp/unhealthy']
            initialDelaySeconds: 1
            failureThreshold: 1
            periodSeconds: 1
      serviceAccountName: static-server
      terminationGracePeriodSeconds: 0
  • Check that both are up:
 k -n foo exec deploy/static-client -c static-client -- curl -s localhost:1234
  • Helm upgrade to latest with the following values:
helm upgrade --install nitya ../../consul-helm/ -f consul-values.yml
global:
  domain: consul
  datacenter: dc1
  image: hashicorp/consul-enterprise:1.10.0-ent-beta1
  imageK8S: "gcr.io/nitya-293720/consul-k8s-dev:hc-upgrades5"
  enableConsulNamespaces: true
server:
  replicas: 1
  bootstrapExpect: 1

client:
  enabled: true
  grpc: true

ui:
  enabled: true

connectInject:
  enabled: true
  consulNamespaces:
    mirroringK8S: true

controller:
  enabled: true
  • Check that the services/checks still exist (check all agents if necessary):
k exec pod/nitya-consul-9hhn7 -- curl -s localhost:8500/v1/agent/services?ns=foo | jq .
k exec pod/nitya-consul-9hhn7 -- curl -s localhost:8500/v1/agent/checks?ns=foo | jq .
  • Mark the static-server unhealthy and see that the check is updated even though it's a legacy service:
k -n foo exec deploy/static-server -c static-server -- touch /tmp/unhealthy
# Check the agent that static-server is on and confirm its health check is critical
k exec pod/nitya-consul-9hhn7 -- curl -s localhost:8500/v1/agent/checks?ns=foo | jq .
  • Delete the pods of the static-client deployment
  • Check that the services/checks are there for the new static-client pods, and the services/checks have been removed for the old static client pods. (The new services will have metadata that they are managed by endpoints controller)

How I expect reviewers to test this PR:
Code review and, if you can, the steps above.

Checklist:

  • Tests added
  • CHANGELOG entry added (HashiCorp engineers only, community PRs should not add a changelog entry)

@ndhanushkodi ndhanushkodi marked this pull request as ready for review May 4, 2021 16:27
@ndhanushkodi ndhanushkodi marked this pull request as draft May 4, 2021 16:27
@ndhanushkodi ndhanushkodi requested a review from kschoche May 4, 2021 17:08
@ndhanushkodi ndhanushkodi marked this pull request as ready for review May 4, 2021 17:08
@ndhanushkodi ndhanushkodi requested review from a team and lkysow and removed request for a team May 4, 2021 17:08
@@ -925,11 +927,321 @@ func TestReconcileUpdateEndpoint(t *testing.T) {
expectedProxySvcInstances []*api.CatalogService
expectedAgentHealthChecks []*api.AgentCheck
}{
// Legacy services are not managed by endpoints controller, but endpoints controller
// will still add/update the legacy service's health checks.
{
ndhanushkodi (Contributor Author) commented:
To reviewers since EP controller tests are long:
The tests I added verify that, for a legacy service:

  • the health check is added when the pod is healthy
  • the health check is added when the pod is unhealthy
  • the health check is updated from healthy --> unhealthy
  • the health check is updated from unhealthy --> healthy

Since an agent loses its health checks when it is rolled, I wanted to test that they are added back for a legacy service, and that they are updated if they already exist.

@ndhanushkodi ndhanushkodi requested a review from kschoche May 4, 2021 21:29
@@ -210,6 +223,111 @@ func (r *EndpointsController) SetupWithManager(mgr ctrl.Manager) error {
).Complete(r)
}

// getServiceCheck will return the health check for this pod and service if it exists.
func getServiceCheck(client *api.Client, healthCheckID string) (*api.AgentCheck, error) {
	filter := fmt.Sprintf("CheckID == `%s`", healthCheckID)
Contributor commented:

I'm wondering if we may encounter a bug similar to the one where we weren't using %q when defining a filter in connect-init? I'm guessing the backticks around %s make it work?

@ndhanushkodi (Contributor Author) replied May 10, 2021:

Yup! I didn't notice any issues testing it end to end, and that code path is used for updating the health check, which I did test :)
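For context on the quoting question: Consul's filter grammar accepts both backtick-delimited and double-quoted string literals, so either form below should behave the same as long as the check ID contains neither delimiter. A small sketch (the helper names are made up for illustration; the real code inlines the fmt.Sprintf):

```go
package main

import "fmt"

// checkIDFilter builds the filter expression passed to the Consul agent
// checks API. Backticks delimit the string literal in the filter grammar,
// so a plain %s is safe as long as the check ID contains no backticks.
func checkIDFilter(healthCheckID string) string {
	return fmt.Sprintf("CheckID == `%s`", healthCheckID)
}

// checkIDFilterQuoted is the %q variant discussed above; it produces a
// double-quoted literal, which the filter grammar also accepts.
func checkIDFilterQuoted(healthCheckID string) string {
	return fmt.Sprintf("CheckID == %q", healthCheckID)
}

func main() {
	fmt.Println(checkIDFilter("foo/static-server-kubernetes-health-check"))
	fmt.Println(checkIDFilterQuoted("foo/static-server-kubernetes-health-check"))
}
```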

@kschoche (Contributor) left a comment:

This looks great!

@lkysow lkysow requested review from a team and ishustava and removed request for lkysow and a team May 6, 2021 23:45
@lkysow (Member) left a comment:

Sorry @ndhanushkodi. I couldn't complete this review. I removed myself from reviewers and added the plat team back and @ishustava was assigned.

I didn't get to finish, but overall my concern is that we're intertwining the code paths for endpoints-controller-managed pods and legacy pods, and I'm worried this is going to set us up for confusion.

I was looking to see if there was a clean way at the top of the for loop to deal with the legacy pod and then continue the loop without carrying the isRegisteredByEndpointsCtrl boolean throughout the rest of the loop code. I think that someone will add something at the bottom of the loop and be unaware that the code is dealing with legacy pods.

@@ -263,21 +379,13 @@ func (r *EndpointsController) createServiceRegistrations(pod corev1.Pod, service
Address: pod.Status.PodIP,
Meta: meta,
Namespace: r.consulNamespace(pod.Namespace),
Check: &api.AgentServiceCheck{
@lkysow (Member) commented:

Correct me if my understanding is wrong: we're not creating the check here because we're going to create it after the service register so the check also gets created for non-endpoints-ctrl pods?

If so, why don't we create it here and then after serviceRegister if the service isn't managed by endpoints controller we create the check?

@@ -247,6 +247,10 @@ func (h *Handler) Handle(_ context.Context, req admission.Request) admission.Res
}
pod.Labels[keyInjectStatus] = injected

// Add the managed-by label since services are now managed by endpoints controller. This is to support upgrading
// from consul-k8s without Endpoints controller to consul-k8s with Endpoints controller.
pod.Labels[keyManagedBy] = endpointsController
@lkysow (Member) commented:

Does this make more sense as an annotation?

ndhanushkodi (Contributor Author) replied:

I thought it fit well with the pattern for the keyInjectStatus label. So annotations are how users configure our software to run, and labels are things we put on pods as we manage them.

connect-inject/endpoints_controller.go Show resolved Hide resolved
@lkysow (Member) left a comment:

Approving for now to not block.

@ndhanushkodi (Contributor Author) commented:

Correct me if my understanding is wrong: we're not creating the check here because we're going to create it after the service register so the check also gets created for non-endpoints-ctrl pods?

Yes

If so, why don't we create it here and then after serviceRegister if the service isn't managed by endpoints controller we create the check?

Addressed below.

overall my concerns are that we're intertwining the code paths for endpoints-controller managed pods and legacy pods and that I'm worried that this is going to set us up for confusion.
I was looking to see if there was a clean way at the top of the for loop to deal with the legacy pod and then continue the loop without carrying the isRegisteredByEndpointsCtrl boolean throughout the rest of the loop code. I think that someone will add something at the bottom of the loop and be unaware that the code is dealing with legacy pods.

@lkysow I'm addressing these comments here:

The reason why I made the health check registration separate from the service registration was so we could have logic along the lines of:

// logic to create local client for agent
if managedByEndpointsController {
    // create the service registration (without the health check)
    // register the service
}
// idempotently upsert the health check (legacy and endpoints-controller-managed pods)

I could see how a future contributor might add logic meant for endpoints-controller managed pods to the end of that, and have that apply to legacy and endpoints controller managed pods.

One option to reduce confusion is a comment noting that the code after the if managedByEndpointsController block applies to both legacy and endpoints-controller-managed pods. I'll go ahead with that option in this PR, so it's less confusing when this is merged with the existing logic.

Then, in a future PR, we could make a refactor to have the logic look like this:

// logic to create local client for agent
if legacy {
    // upsert health check for legacy pod
} else {
    // create service registration with health check
    // register the service
}
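To make the shape of that proposal concrete, here is a compilable sketch of the split (everything here is a hypothetical illustration, not the merged code):

```go
package main

import "fmt"

// registerOrUpdate sketches the proposed split: legacy pods only get
// their health check upserted, while endpoints-controller-managed pods
// get a full service registration that carries the check. It returns the
// actions taken, purely for illustration.
func registerOrUpdate(legacy bool) []string {
	var actions []string
	// (logic to create a local client for the agent would go here)
	if legacy {
		actions = append(actions, "upsert health check")
	} else {
		actions = append(actions, "create service registration with health check")
		actions = append(actions, "register service")
	}
	return actions
}

func main() {
	fmt.Println(registerOrUpdate(true))
	fmt.Println(registerOrUpdate(false))
}
```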

@ishustava and @kschoche I'm looking for feedback on whether you think I should go forward with the future refactor option after merging this PR with some more documentation. I'm also looking for feedback on that refactor idea.

Before the connect refactor, service registration in Consul was managed
by the lifecycle sidecar, which would re-register the service with
Consul every 10s. Now, service registration is managed by Endpoints
controller.

In order to support upgrades to the refactored Endpoints controller, we
need Endpoints controller to NOT register or deregister any services
managed by lifecycle sidecar. To do this, the annotation consul.hashicorp.com/connect-inject-managed-by
is added to pods managed by endpoints controller, so endpoints
controller will ignore older services managed by lifecycle sidecar
(legacy services) for service registration/deregistration.

To support health checks for legacy services, the Endpoints controller
will always update the healthcheck for any pod, whether it's managed by
Endpoints controller or not.
Legacy services have the proxy healthcheck coupled to the service
registration, so that can remain in endpoints controller as well.
@ndhanushkodi ndhanushkodi merged commit 75b5d6a into master May 10, 2021
@ndhanushkodi ndhanushkodi deleted the upgrades branch May 10, 2021 19:17
@ishustava (Contributor) commented:

@ndhanushkodi

Then, in a future PR, we could make a refactor to have the logic look like this:

I like the refactor idea to make it more clear.

@kschoche (Contributor) commented:

@ndhanushkodi

Then, in a future PR, we could make a refactor to have the logic look like this:

I like the refactor idea to make it more clear.

Likewise!

@ndhanushkodi (Contributor Author) commented:

Thanks @ishustava @kschoche !
