kOps: Test timeouts are not diagnosable when run in parallel #20738

Closed
justinsb opened this issue Feb 4, 2021 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@justinsb
Member

justinsb commented Feb 4, 2021

We have some test timeouts, for example: https://testgrid.k8s.io/sig-cluster-lifecycle-kops#kops-grid-calico-amzn2-k20-docker

It's very difficult to see what the cause of the timeout is. It appears that ginkgo only logs tests after they complete: it doesn't log on a signal, and it doesn't write the junit output on a signal either. The ginkgo output also isn't self-descriptive enough to make scripting practical.

Passing --progress, --trace or -stream might help here.
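To make the "log on a signal" idea concrete, here is a minimal Python sketch of a runner that remembers which tests are in flight and prints them when it receives SIGTERM. It is a generic illustration only, not ginkgo's behavior or API; `InFlightTracker` and `install_signal_report` are hypothetical names, and the sketch assumes a single-threaded runner.

```python
import signal
import sys


class InFlightTracker:
    """Tracks tests that have started but not yet finished (single-threaded sketch)."""

    def __init__(self):
        self._running = set()

    def start(self, name):
        self._running.add(name)

    def finish(self, name):
        # discard() is a no-op if the test was never recorded as started.
        self._running.discard(name)

    def report(self):
        # Sorted so the output is stable and easy to compare between runs.
        return sorted(self._running)


def install_signal_report(tracker, signum=signal.SIGTERM, stream=sys.stderr):
    """On `signum`, print every test still in flight, then exit non-zero.

    Must be called from the main thread, as required by signal.signal().
    """

    def handler(received, frame):
        for name in tracker.report():
            print(f"STILL RUNNING: {name}", file=stream)
        sys.exit(1)

    signal.signal(signum, handler)
```

With something like this in place, a timed-out job would name the tests that were still running at the moment it was killed, instead of only listing the tests that had already completed.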

We can also try a serial test to see if we can find the problem (I'll probably try this!)

cc @rifelpet

@justinsb justinsb added the kind/bug Categorizes issue or PR as related to a bug. label Feb 4, 2021
@justinsb justinsb changed the title kOps: Test timeouts are not diagnosable in parallel kOps: Test timeouts are not diagnosable when run in parallel Feb 4, 2021
justinsb added a commit to justinsb/test-infra that referenced this issue Feb 4, 2021
Hoping to diagnose why tests are timing out.

Issue kubernetes#20738
@BenTheElder
Member

this seems like something to fix in kubetest2? 🙃

@rifelpet
Member

Thanks to e2e.test's -test.timeout= we now have artifacts on timed-out jobs. I've been investigating to try to pinpoint specific tests: a test whose status is no longer reported might be hung. Unfortunately there aren't any consistently missing test results. That suggests the missing tests vary from run to run because they had not yet been run on a hung ginkgo runner, and ginkgo randomizes the order of tests on each runner.
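The check described here amounts to a set difference across runs. A rough sketch, assuming the reported test names have already been extracted from each timed-out job's artifacts (the function name and inputs are hypothetical, not part of any existing tooling):

```python
def consistently_missing(expected, reported_per_run):
    """Return the tests that reported no status in ANY of the given runs.

    expected: iterable of every test name the suite should run.
    reported_per_run: list of iterables, one per timed-out run, each holding
    the test names that did report a status in that run.
    """
    missing = set(expected)
    for reported in reported_per_run:
        # A test that reported in at least one run is not consistently missing.
        missing -= set(reported)
    return sorted(missing)
```

An empty result matches the observation above: no single test is absent from every timed-out run, so the gaps are more likely a side effect of randomized ordering than of one hung test.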

One interesting clue is that all of the jobs that time out are missing their junit_01.xml artifact (example), whereas successful jobs have that file (kubetest2 example, kubetest 1 example). Because test suites are randomized across ginkgo nodes, this leads me to believe it's not a specific test that is at fault, but rather something in our configuration of ginkgo, e2e.test, or the prow job container.
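Spotting the missing file can be scripted against a job's artifact listing. The sketch below assumes one `junit_NN.xml` file per ginkgo node, numbered from 1 (inferred from the `junit_01.xml` name above), and that the artifact file names are already available as a list; the function name is hypothetical.

```python
import re

# Matches names such as junit_01.xml and captures the numeric index.
JUNIT_RE = re.compile(r"^junit_(\d+)\.xml$")


def missing_junit_indices(artifact_names, node_count):
    """Return the ginkgo node indices (1-based) that have no junit file."""
    present = set()
    for name in artifact_names:
        match = JUNIT_RE.match(name)
        if match:
            present.add(int(match.group(1)))
    return [i for i in range(1, node_count + 1) if i not in present]
```

For a timed-out job as described above, the result would include `1`; for a successful job it would be an empty list.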

Looking at e2e.test flags, besides --provider and --gce-zone which are being added in kubernetes/kops#10847, our kubetest2 setup is no longer passing these flags:

--gce-region=...
--gce-multizone=false
--cluster-tag=<clustername>
--repo-root=.
--num-nodes=0
--disable-log-dump=true

I may investigate the impact of these flags. I know that --num-nodes defaults to the number of ready nodes and is used to skip certain tests, so we may no longer be skipping them.
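A comparison like the one above can be automated by diffing the flag names of two e2e.test invocations. This is a sketch under the assumption that both argument lists use the `--name=value` form shown in the list; the helper names are hypothetical.

```python
def flag_names(args):
    """Extract flag names from arguments such as '--num-nodes=0' or '-v'."""
    names = set()
    for arg in args:
        if arg.startswith("-"):
            # Drop leading dashes and anything after the first '='.
            names.add(arg.lstrip("-").split("=", 1)[0])
    return names


def dropped_flags(old_args, new_args):
    """Return flag names present in old_args but absent from new_args."""
    return sorted(flag_names(old_args) - flag_names(new_args))
```

For example, with illustrative argument lists (the `aws` value is an assumption, not taken from the jobs):

```python
old = ["--provider=aws", "--num-nodes=0", "--repo-root=.", "--disable-log-dump=true"]
new = ["--provider=aws"]
dropped_flags(old, new)  # ['disable-log-dump', 'num-nodes', 'repo-root']
```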

Our serial job is timing out after 5 hours; we can consider extending that too.

@rifelpet
Member

rifelpet commented Mar 8, 2021

The timeouts were fixed by increasing the prow job's memory in #20931

We also have increased visibility with per-ginkgo-node logs via ginkgo's --debug (#20893) and more useful cluster artifacts when tests time out (#20850). I will remove our serial job soon, now that we're done troubleshooting.

/close

@k8s-ci-robot
Contributor

@rifelpet: Closing this issue.

