
jobs erroring out due to failure to fetch token #20816

Closed
BenTheElder opened this issue Feb 10, 2021 · 14 comments
Labels: area/prow, kind/bug, kind/flake, priority/important-soon, sig/testing
Milestone: v1.21

Comments

@BenTheElder
Member

What happened:

Seeing jobs fail early on with no logs more and more often lately, and none of the signs we've seen before like "failed scheduling" (need to increase nodes) or GitHub clone flakes (kubernetes/kubernetes is a huge clone, still need to mitigate that, #18226).

Instead we're seeing:

  - containerID: containerd://2372ffea321a2dec129d222dc4b43ee8774b4c0528a2e36d9271c6a51819b35e
    image: gcr.io/k8s-prow/initupload:v20210209-12c3370f77
    imageID: gcr.io/k8s-prow/initupload@sha256:4108bc2d7ddb6a2272d525b060d305edd5f9575430ca022ceef22475140ce713
    lastState: {}
    name: initupload
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: containerd://2372ffea321a2dec129d222dc4b43ee8774b4c0528a2e36d9271c6a51819b35e
        exitCode: 1
        finishedAt: "2021-02-10T07:02:42Z"
        message: |
          iter close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2Flatest-build.txt\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Fclone-log.txt\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout upload error: writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Fclone-records.json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Fbuild-log.txt\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Ffinished.json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout]","file":"prow/cmd/initupload/main.go:45","func":"main.main","level":"fatal","msg":"Failed to initialize job","severity":"fatal","time":"2021-02-10T07:02:42Z"}
        reason: Error
        startedAt: "2021-02-10T06:46:42Z"

None of this is really surfaced to the user unless they know to go digging into the recorded pod state. Instead you just see:

Test started today at 10:46 PM failed after 16m35s. (more info)

Setting aside the UX, we should fix this error; we're seeing it a lot, even on jobs that are running with guaranteed QoS.

What you expected to happen:

  • Ideally the failure mode should be surfaced in deck
  • Even more ideally this should not be causing jobs to fail / the root cause should be mitigated

How to reproduce it (as minimally and precisely as possible):

This is happening pretty often, I think, but it's hard to pin down since the job state is just "failed" and the runtime is ~16m36s (which seems oddly specific?).

With https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=presubmit&state=failure I see 28 hits for 16m36s.
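For anyone trying to pin these down programmatically rather than eyeballing deck, here is a rough Go sketch that counts recent kubernetes/kubernetes presubmits finishing in ~16m36s. It assumes deck still serves its job list as JSON at /prowjobs.js and that the ProwJob field names below are accurate; treat the endpoint, the field names, and the ±2s tolerance as assumptions, not a vetted tool.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// prowJobs mirrors only the ProwJob fields used below; the field names are an
// assumption based on the ProwJob API and may not match deck's payload exactly.
type prowJobs struct {
	Items []struct {
		Spec struct {
			Type string `json:"type"`
			Job  string `json:"job"`
			Refs *struct {
				Org  string `json:"org"`
				Repo string `json:"repo"`
			} `json:"refs"`
		} `json:"spec"`
		Status struct {
			State          string     `json:"state"`
			StartTime      time.Time  `json:"startTime"`
			CompletionTime *time.Time `json:"completionTime"`
		} `json:"status"`
	} `json:"items"`
}

func main() {
	// Assumption: deck serves the job list as JSON at /prowjobs.js.
	resp, err := http.Get("https://prow.k8s.io/prowjobs.js")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var jobs prowJobs
	if err := json.NewDecoder(resp.Body).Decode(&jobs); err != nil {
		log.Fatal(err)
	}

	// Count kubernetes/kubernetes presubmits that ended badly with a runtime
	// close to the suspicious 16m36s (996s) mark.
	const suspect = 996 * time.Second
	count := 0
	for _, pj := range jobs.Items {
		if pj.Spec.Type != "presubmit" || pj.Spec.Refs == nil ||
			pj.Spec.Refs.Org != "kubernetes" || pj.Spec.Refs.Repo != "kubernetes" {
			continue
		}
		// Jobs that die in initupload may be reported as "failure" or "error".
		if pj.Status.State != "failure" && pj.Status.State != "error" {
			continue
		}
		if pj.Status.CompletionTime == nil {
			continue
		}
		d := pj.Status.CompletionTime.Sub(pj.Status.StartTime)
		if d >= suspect-2*time.Second && d <= suspect+2*time.Second {
			count++
			fmt.Println(pj.Spec.Job, d)
		}
	}
	fmt.Printf("%d jobs finished in ~16m36s\n", count)
}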

Please provide links to example occurrences, if any:

kubernetes/kubernetes#98768
kubernetes/kubernetes#98948 (comment)

Anything else we need to know?:

/area prow

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2021
@k8s-ci-robot k8s-ci-robot added the area/prow Issues or PRs related to prow label Feb 10, 2021
@BenTheElder
Member Author

cc @chaodaiG per go.k8s.io/oncall

@BenTheElder
Member Author

I would move kubernetes/kubernetes#98768 here, but CI signal is probably tracking that one, and it predates this. Filing here with additional details and for visibility.

@chaodaiG
Contributor

The initupload logs above are only the tail of the log; I may need to see the full log to understand this better. Which cluster does this run in, and how can I get access to it? @BenTheElder

@chaodaiG
Contributor

Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout, I have searched a little bit, and there are a couple of possibilities:

  • https://oauth2.googleapis.com/token has intermittent connection issue. The is not very likely, as it could be mitigated by Podutils: Retry uploads #19884 .
  • Network issue on the cluster.
  • DNS issue. I have seen freshly created node having short period of DNS issue, does this build cluster happen to have node autoscaling turned on?
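To make the first bullet concrete, here is a minimal Go sketch of retrying an upload with exponential backoff, the kind of mitigation Podutils: Retry uploads #19884 provides against transient token-fetch timeouts. This is not the actual podutils code; the function names, attempt count, and backoff values are illustrative only.

package main

import (
	"context"
	"fmt"
	"time"
)

// uploadFn stands in for a single GCS object upload (what initupload does per
// artifact); the real podutils code is structured differently.
type uploadFn func(ctx context.Context) error

// withRetries retries an upload that fails transiently (e.g. "oauth2: cannot
// fetch token: ... dial tcp: i/o timeout") using exponential backoff.
func withRetries(ctx context.Context, attempts int, initial time.Duration, upload uploadFn) error {
	backoff := initial
	var err error
	for i := 0; i < attempts; i++ {
		if err = upload(ctx); err == nil {
			return nil
		}
		if i == attempts-1 {
			break
		}
		fmt.Printf("upload attempt %d failed: %v; retrying in %v\n", i+1, err, backoff)
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return ctx.Err()
		}
		backoff *= 2
	}
	return fmt.Errorf("all %d attempts failed, last error: %w", attempts, err)
}

func main() {
	// Simulate the failure mode from the initupload logs above.
	err := withRetries(context.Background(), 4, 2*time.Second, func(ctx context.Context) error {
		return fmt.Errorf("oauth2: cannot fetch token: dial tcp: i/o timeout")
	})
	fmt.Println(err)
}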

@spiffxp
Member

spiffxp commented Feb 10, 2021

@chaodaiG

The initupload logs above are only the tail of the log; I may need to see the full log to understand this better. Which cluster does this run in, and how can I get access to it?

If the job is using cluster: k8s-infra-prow-build, you'll need to look at the k8s-infra-prow-build project in the kubernetes.io GCP org.

Owner access to this project (and all other prow-related projects in kubernetes.io) is defined by https://github.com/kubernetes/k8s.io/blob/5b2304f875d8b877764a7bde5398eb6f4ffb0494/groups/groups.yaml#L645-L665

Viewer access is defined by https://github.com/kubernetes/k8s.io/blob/5b2304f875d8b877764a7bde5398eb6f4ffb0494/groups/groups.yaml#L667-L701

I'll open a PR to reconcile the #test-infra-oncall entries; they look stale to me.

@chaodaiG
Contributor

Thank you @spiffxp, please tag me on the PR.

@spiffxp
Member

spiffxp commented Feb 10, 2021

@chaodaiG opened kubernetes/k8s.io#1636 and cc'ed you

@spiffxp
Member

spiffxp commented Feb 10, 2021

/priority important-soon
/sig testing
/wg k8s-infra
/milestone v1.21
/kind flake
re "more and more often lately" in description

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Feb 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 10, 2021
@k8s-ci-robot k8s-ci-robot added wg/k8s-infra kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 10, 2021
chaodaiG added a commit to chaodaiG/test-infra that referenced this issue Feb 11, 2021
chaodaiG added a commit to chaodaiG/test-infra that referenced this issue Feb 11, 2021
@chaodaiG
Contributor

Some stats from scanning all prow jobs in the past 24 hours (~24,000) by downloading podinfo.json from GCS; it has only finished half of them so far (a rough sketch of this kind of scan follows the numbers below):

  • total: 12476
  • failed: 2874
  • initupload failure: 85
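For anyone who wants to reproduce these numbers, a minimal Go sketch of this kind of scan is below. It assumes podinfo.json stores the pod object under a "pod" key and that walking the pr-logs/pull/ prefix of the kubernetes-jenkins bucket finds the jobs of interest; treat the bucket, prefix, and JSON layout as assumptions, and note that a real scan would restrict the listing to a time window rather than the whole prefix.

package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"strings"

	"cloud.google.com/go/storage"
	"google.golang.org/api/iterator"
)

// podInfo mirrors only the fields needed here; the real podinfo.json contains
// the full pod object, and this layout (a "pod" key) is an assumption.
type podInfo struct {
	Pod struct {
		Status struct {
			ContainerStatuses []struct {
				Name  string `json:"name"`
				State struct {
					Terminated *struct {
						ExitCode int `json:"exitCode"`
					} `json:"terminated"`
				} `json:"state"`
			} `json:"containerStatuses"`
		} `json:"status"`
	} `json:"pod"`
}

func main() {
	ctx := context.Background()
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatal(err)
	}
	// Bucket and prefix are illustrative; restrict the listing in practice.
	bucket := client.Bucket("kubernetes-jenkins")
	it := bucket.Objects(ctx, &storage.Query{Prefix: "pr-logs/pull/"})

	var total, inituploadFailures int
	for {
		attrs, err := it.Next()
		if err == iterator.Done {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
		if !strings.HasSuffix(attrs.Name, "/podinfo.json") {
			continue
		}
		total++
		r, err := bucket.Object(attrs.Name).NewReader(ctx)
		if err != nil {
			continue // object may have been cleaned up; skip it
		}
		var info podInfo
		if json.NewDecoder(r).Decode(&info) == nil {
			for _, cs := range info.Pod.Status.ContainerStatuses {
				if cs.Name == "initupload" && cs.State.Terminated != nil && cs.State.Terminated.ExitCode != 0 {
					inituploadFailures++
				}
			}
		}
		r.Close()
	}
	fmt.Printf("scanned %d podinfo.json files, %d with a failed initupload container\n", total, inituploadFailures)
}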

@spiffxp
Member

spiffxp commented Feb 11, 2021

"oauth2: cannot fetch token" started showing up in logs for k8s-infra-prow-build sometime late Feb 2 / early Feb 3.

https://console.cloud.google.com/logs/query;query=%22oauth2:%20cannot%20fetch%20token%22%0Aresource.labels.container_name%3D%22initupload%22;timeRange=P14D?project=k8s-infra-prow-build
(screenshot: Cloud Logging results for the query above, captured 2021-02-11)

But nothing of the sort for k8s-infra-prow-build-trusted
https://console.cloud.google.com/logs/query;query=%22oauth2:%20cannot%20fetch%20token%22%0Aresource.labels.container_name%3D%22initupload%22;timeRange=2021-02-02T22:14:00.000Z%2F2021-02-11T08:00:00.000Z?project=k8s-infra-prow-build-trusted

Nor k8s-prow or k8s-prow-builds

What changed?

  • PRs merged within that time period were irrelevant (k8s.gcr.io manifests, kubernetes.io group updates)
  • So unless someone manually ran commands, the cluster resources and underlying infrastructure weren't intentionally changed
  • It may be possible this was our GKE maintenance window, and nodes were upgraded?
  • Would this correspond to a change in code in the initupload container?
  • (deferring looking at audit logs and trying to narrow down the time range to tomorrow)

@BenTheElder
Member Author

It may be possible this was our GKE maintenance window, and nodes were upgraded?

Nodes from Jan 22nd still exist in greenhouse's pool, but the autoscaled node pool has all nodes from today.

Checking the cluster operation logs: It looks like @spiffxp upgraded the cluster on the 10th. I don't see any cluster operations around February 2nd at all.
https://console.cloud.google.com/logs/viewer?project=k8s-infra-prow-build&organizationId=758905017065&minLogLevel=0&expandAll=false&timestamp=2021-02-12T04:11:49.687000000Z&customFacets=&limitCustomFacetWidth=false&dateRangeEnd=2021-02-12T04:09:56.858Z&interval=PT1H&resource=gke_cluster%2Fcluster_name%2Fprow-build%2Flocation%2Fus-central1&dateRangeUnbound=backwardInTime&scrollTimestamp=2021-02-11T05:51:11.338541094Z

I don't think this is the case.

@BenTheElder
Member Author

BenTheElder commented Feb 12, 2021

Nor k8s-prow or k8s-prow-builds

Is k8s-infra-prow-build the only one with autoscaling? (I think so?)

If so, I suspect this is an issue with autoscaling the in-cluster DNS. @chaodaiG mentioned a workaround for this that was applied to another prow cluster.

EDIT: this has been a fairly obvious suspect for some time IMHO, but I forgot to mention it.
We should probably try the workaround there (adjusting the DNS scaling parameters).
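For concreteness, here is a minimal client-go sketch of what inspecting and adjusting those DNS scaling parameters could look like, assuming the build cluster runs the standard kube-dns autoscaler with its kube-dns-autoscaler ConfigMap in kube-system. The ConfigMap name and the parameter values are assumptions for illustration, not the values actually applied to this cluster; in practice this would more likely be a kubectl edit or a config change in the k8s.io repo.

package main

import (
	"context"
	"fmt"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	// Build a client from the local kubeconfig (pointed at the build cluster).
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Inspect the current kube-dns autoscaling parameters. The ConfigMap name
	// assumes the standard cluster-proportional-autoscaler setup.
	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(ctx, "kube-dns-autoscaler", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Println("current linear params:", cm.Data["linear"])

	// Example adjustment: scale kube-dns up more aggressively as nodes are
	// added (nodesPerReplica down, min up). Values here are illustrative only.
	patch := []byte(`{"data":{"linear":"{\"coresPerReplica\":256,\"nodesPerReplica\":8,\"min\":2,\"preventSinglePointFailure\":true}"}}`)
	if _, err := client.CoreV1().ConfigMaps("kube-system").Patch(ctx, "kube-dns-autoscaler",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}
	fmt.Println("patched kube-dns-autoscaler ConfigMap")
}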

@spiffxp
Member

spiffxp commented Feb 22, 2021

/close
Same query link as above... seems to have stopped happening near 2021-02-19 20:00:00Z
(screenshot: Cloud Logging results for the query above, captured 2021-02-22)

How does this line up with updating k8s-infra-prow-build?

So the problem stopped due to one of:

  • node recreation
  • adding node local dns (which required node recreation to take effect)
  • upgrading nodes to v1.17 (which required node recreation to take effect)
  • something completely unrelated to updating the cluster

It took about 4.5h to fully recreate the nodes, so I opted not to try these out one-by-one to determine exactly which one solved it.

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
