Enable nodelocal dnscache on prow build clusters #1680

Merged: 2 commits, Feb 19, 2021

chaodaiG
Contributor

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters labels Feb 18, 2021
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. wg/k8s-infra size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 18, 2021
@ameukam
Member

ameukam commented Feb 18, 2021

@chaodaiG Out of curiosity, what's the reason for using this feature?

EDIT: Saw the discussion kubernetes/test-infra#20716

@ameukam
Member

ameukam commented Feb 18, 2021

/assign @spiffxp @BenTheElder

@chaodaiG
Contributor Author

@chaodaiG Out of curiosity, what's the reason for using this feature?

It was because of this thread: kubernetes/test-infra#20716 (comment)

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Feb 19, 2021
@BenTheElder
Member

/hold

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Feb 19, 2021
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2021
Member

@spiffxp spiffxp left a comment

/approve
/lgtm
This isn't actually updating k8s-infra-prow-build, but that's fine by me, I'll see what rollout looks like on k8s-infra-prow-build-trusted first

pod_namespace = "test-pods" // MUST match whatever prow is configured to use when it schedules to this cluster
cluster_sa_name = "prow-build" // Name of the GSA and KSA that pods use by default
boskos_janitor_sa_name = "boskos-janitor" // Name of the GSA and KSA used by boskos-janitor
enable_node_local_dns_cache = "true" // Enable NodeLocal DNSCache
Member

nit: this is unused right now

cluster_sa_name = "prow-build-trusted" // Name of the GSA and KSA that pods use by default
gcb_builder_sa_name = "gcb-builder" // Name of the GSA and KSA that pods use to be allowed to run GCB builds and push to GCS buckets
prow_deployer_sa_name = "prow-deployer" // Name of the GSA and KSA that pods use to be allowed to deploy to prow build clusters
enable_node_local_dns_cache = "true" // Enable NodeLocal DNSCache
Member

nit to self: I guess I meant to keep the locals block for "vars that are going to be reused by multiple resources" vs. "all configurable things go up here", but I didn't comment as such (or do so consistently, e.g. bigquery_location doesn't belong up here by such convention)

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 19, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: BenTheElder, chaodaiG, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 19, 2021
@spiffxp
Member

spiffxp commented Feb 19, 2021

spiffxp@spiffxp-macbookpro:prow-build-trusted (dns-cache %)$ terraform plan
#...
  # module.prow_build_cluster.google_container_cluster.prod_cluster[0] will be updated in-place
#...

@spiffxp
Member

spiffxp commented Feb 19, 2021

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 19, 2021
@k8s-ci-robot k8s-ci-robot merged commit f2a4f41 into kubernetes:main Feb 19, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 19, 2021
@spiffxp
Member

spiffxp commented Feb 19, 2021

Took ~26min to apply (seems long, maybe due to this being a regional cluster?)

spiffxp@spiffxp-macbookpro:prow-build-trusted (main)$ terraform apply
# ... 
module.prow_build_cluster.google_container_cluster.prod_cluster[0]: Modifications complete after 25m49s [id=projects/k8s-infra-prow-build-trusted/locations/us-central1/clusters/prow-build-trusted]

Meanwhile, from https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache#enabling

If you are using maintenance windows, your nodes will not be recreated until a maintenance window occurs. If you prefer not to wait, you can manually "upgrade" the node pool to the same version it is already using, by setting the --cluster-version flag to the same GKE version the control plane is already running. You must use the gcloud command if you use this workaround. See the caveats section of the maintenance window documentation for more information.

We are using maintenance windows

$ gcloud container clusters describe prow-build-trusted --project=k8s-infra-prow-build-trusted --region=us-central1 --format="value(maintenancePolicy)"
resourceVersion=44fb9d63;window={'dailyMaintenanceWindow': {'duration': 'PT4H0M0S', 'startTime': '11:00'}}

I'm probably going to force an upgrade to avoid waiting

@spiffxp
Member

spiffxp commented Feb 19, 2021

Per https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache#verifying_that_is_enabled, check that it's enabled:

spiffxp@cloudshell:~ (k8s-infra-prow-build-trusted)$ k --context=gke_k8s-infra-prow-build-trusted_us-central1_prow-build-trusted get pods -n kube-system -o wide | grep node-local-dns
spiffxp@cloudshell:~ (k8s-infra-prow-build-trusted)$

So, not running. Confirming that it's been configured for the cluster:

addonsConfig:
  dnsCacheConfig:
    enabled: true
  horizontalPodAutoscaling: {}
  httpLoadBalancing: {}
  kubernetesDashboard:
    disabled: true
  networkPolicyConfig: {}

So yeah, time to force nodes to upgrade

@spiffxp
Member

spiffxp commented Feb 19, 2021

Using gcloud to do this per https://cloud.google.com/kubernetes-engine/docs/concepts/maintenance-windows-and-exclusions#caveats_for_maintenance_windows

$ gcloud container node-pools list --project=k8s-infra-prow-build-trusted --cluster=prow-build-trusted --region=us-central1
NAME                                      MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
trusted-pool1-20200430235251092500000001  n1-standard-8  200           1.16.15-gke.6000
$ gcloud container clusters upgrade \
  --project=k8s-infra-prow-build-trusted \
  --region=us-central1 \
  prow-build-trusted \
  --node-pool=trusted-pool1-20200430235251092500000001 \
  --cluster-version=1.16.15-gke.6000
Upgrading prow-build-trusted... Updating trusted-pool1-20200430235251092500000001, done with 0 out of 3 nodes (0.0%): 1 being processed...

Just based on how long this has taken thus far, I suspect k8s-infra-prow-build won't get this completely until end of day (it has many more nodes to upgrade). May batch up with an upgrade to v1.17, we should have already gotten there as part of stable release channel?
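
For a cluster with more node pools, the same-version "upgrade" above could be generated per pool rather than typed by hand. The following is only a sketch: `print_same_version_upgrades` is a hypothetical helper, not part of gcloud, and it assumes each pool's NODE_VERSION already matches the control plane version, as it does in the listing above. It reads `gcloud container node-pools list` output on stdin and prints the commands without running them.

```shell
# print_same_version_upgrades PROJECT REGION CLUSTER
# Hypothetical helper: reads the table printed by
# `gcloud container node-pools list` on stdin and prints one
# `gcloud container clusters upgrade` command per node pool.
# Assumption: column 1 is NAME and column 4 is NODE_VERSION, and the
# node version equals the control plane version.
print_same_version_upgrades() {
  awk -v project="$1" -v region="$2" -v cluster="$3" '
    # Skip the header row; ignore short or blank lines.
    NR > 1 && NF >= 4 {
      printf "gcloud container clusters upgrade %s --project=%s --region=%s --node-pool=%s --cluster-version=%s\n",
        cluster, project, region, $1, $4
    }'
}
```

Piping the node pool listing into the function prints the commands for review, so nothing is upgraded until an operator runs them.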

@chaodaiG
Contributor Author

Just based on how long this has taken thus far, I suspect k8s-infra-prow-build won't get this completely until end of day (it has many more nodes to upgrade). May batch up with an upgrade to v1.17, we should have already gotten there as part of stable release channel?

That sounds about right to me. And yes I have seen v1.17 in stable channel, so batching up makes sense to me

@spiffxp
Member

spiffxp commented Feb 19, 2021

Deployed to k8s-infra-prow-build-trusted. Need to see some job traffic to verify things are still working as expected

$ k --context=gke_k8s-infra-prow-build-trusted_us-central1_prow-build-trusted get pods -n kube-system -o wide | grep node-local-dns
node-local-dns-6z877                                             1/1     Running   0          43m   10.128.0.33   gke-prow-build-trust-trusted-pool1-20-caef7902-a7r0   <none>           <none>
node-local-dns-7dcjp                                             1/1     Running   0          38m   10.128.0.34   gke-prow-build-trust-trusted-pool1-20-5e50b276-1wj9   <none>           <none>
node-local-dns-gk2lj                                             1/1     Running   0          46m   10.128.0.32   gke-prow-build-trust-trusted-pool1-20-fd38017e-4yal   <none>           <none>
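
The check above can be turned into a count to compare against the number of nodes (3 in this pool). A minimal sketch, assuming the default `kubectl get pods -o wide` column order where STATUS is the third column; `count_running_node_local_dns` is a hypothetical helper name.

```shell
# count_running_node_local_dns
# Hypothetical helper: reads `kubectl get pods -n kube-system -o wide`
# output on stdin and prints how many node-local-dns pods are Running.
# Assumption: column 1 is NAME and column 3 is STATUS.
count_running_node_local_dns() {
  awk '$1 ~ /^node-local-dns-/ && $3 == "Running" { n++ }
       END { print n + 0 }'   # n + 0 prints 0 when nothing matched
}
```

A count of 0, as in the earlier empty grep result, would indicate the nodes have not been recreated yet.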

Timestamps if anyone needs to correlate disruptive behavior during this time.

$ gcloud container operations list --project=k8s-infra-prow-build-trusted --filter="startTime>='2021-02-19'" --format="table(startTime,endTime,operationType,name,status)"
START_TIME                      END_TIME                        TYPE            NAME                              STATUS
2021-02-19T14:28:32.358854864Z  2021-02-19T14:28:32.602803386Z  UPDATE_CLUSTER  operation-1613744912358-e2104c61  DONE
2021-02-19T14:28:42.728906016Z  2021-02-19T14:54:08.93231119Z   UPDATE_CLUSTER  operation-1613744922728-74f4f253  DONE
2021-02-19T15:13:14.422720553Z  2021-02-19T15:25:56.940972811Z  UPGRADE_NODES   operation-1613747594422-88695ac3  DONE
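
To put numbers on the operations above, the start and end timestamps can be subtracted. This is a rough sketch, assuming a `date` that supports `-d` (GNU coreutils; BSD/macOS `date` uses different flags); `to_epoch` and `op_duration` are hypothetical helper names, and fractional seconds are truncated rather than rounded.

```shell
# to_epoch TIMESTAMP
# Convert a UTC timestamp such as 2021-02-19T15:13:14.422720553Z to
# whole seconds since the epoch. Runs in a subshell so ts stays local.
to_epoch() (
  ts="${1%Z}"                             # drop the trailing UTC designator
  ts="${ts%%.*}"                          # drop fractional seconds
  ts="$(printf '%s' "$ts" | tr 'T' ' ')"  # "YYYY-MM-DD hh:mm:ss"
  date -u -d "$ts" +%s                    # -u: interpret the input as UTC
)

# op_duration START END
# Print the difference between two timestamps in whole seconds.
op_duration() (
  start="$(to_epoch "$1")"
  end="$(to_epoch "$2")"
  echo "$(( end - start ))"
)
```

For the UPGRADE_NODES row, `op_duration 2021-02-19T15:13:14.422720553Z 2021-02-19T15:25:56.940972811Z` gives 762 seconds, about 12m42s; the second UPDATE_CLUSTER row gives 1526 seconds, about 25m26s.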

@spiffxp
Member

spiffxp commented Feb 19, 2021

Being able to filter by cluster on Deck would be nice; in the meantime:

curl 'https://prow.k8s.io/prowjobs.js?omit=pod_spec,decoration_config' >prowjobs.js
<prowjobs.js jq \
'.items
  | map(
      select(
        .spec.cluster == "k8s-infra-prow-build-trusted" 
        and .status.pendingTime >= "2021-02-19T14"
      )
    ) 
  | sort_by(.status.pendingTime) 
  | map(
      .status | {time: .pendingTime, state, url}
    )'
[
  {
    "time": "2021-02-19T14:16:35Z",
    "state": "failure",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-csi-driver-smb-push-images/1362768045499486208"
  },
  {
    "time": "2021-02-19T14:26:27Z",
    "state": "success",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/kops-postsubmit-push-to-staging/1362770527264968704"
  },
  {
    "time": "2021-02-19T14:38:13Z",
    "state": "success",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-gcr-prod-backup/1362773489949347840"
  },
  {
    "time": "2021-02-19T15:08:13Z",
    "state": "success",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-k8sio-image-promo/1362781036726980608"
  },
  {
    "time": "2021-02-19T15:10:26Z",
    "state": "success",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-k8sio-image-promo/1362781597782249472"
  },
  {
    "time": "2021-02-19T15:44:26Z",
    "state": "success",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-csi-driver-host-path-push-images/1362790152912506880"
  },
  {
    "time": "2021-02-19T15:54:28Z",
    "state": "success",
    "url": "https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/post-cluster-api-push-images/1362792678915313664"
  }
]

@spiffxp
Member

spiffxp commented Feb 19, 2021

I'm satisfied, will open followup PR to apply to k8s-infra-prow-build

@spiffxp
Member

spiffxp commented Feb 19, 2021

Opened #1686

@BenTheElder
Member

Thank you both!

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/prow Setting up or working with prow in general, prow.k8s.io, prow build clusters cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.