build dashboard triage log #52653

Closed
bcmills opened this issue May 2, 2022 · 110 comments
Labels
Builders, FrozenDueToAge, NeedsInvestigation, umbrella
Milestone

Comments

@bcmills
Contributor

bcmills commented May 2, 2022

We've been publishing minutes for various recurring discussions (proposal review, Go 2 review, compiler & runtime meeting notes). This issue is an attempt to apply the same pattern for builder triage.

We'll add a post here for the commands run throughout the week to triage failures on the Go build dashboard (https://build.golang.org).

For each day's triage, I first run fetchlogs to fetch the previous day's logs (and then some, because fetchlogs doesn't yet have a date flag). Then, I use greplogs to identify failures since the previous run, excluding known-bad commits and known-flaky builders.

greplogs --triage outputs Markdown containing GitHub task lists. Entries that have been triaged will be checked off in the corresponding post.

The commands to perform a typical triage run look like:

$ fetchlogs -branch=release-branch.go1.19,release-branch.go1.18
$ fetchlogs -n 1024 -repo all
$ greplogs --triage --since=$LAST_TRIAGE_DATE

fetchlogs may take several minutes to finish; greplogs should be faster.

If the greplogs output has too much noise (for example, from a large build break or a malfunctioning builder), use the --omit, --since, and/or --before flags to prune it down. When you've got it down to a manageable size, paste the Markdown output from greplogs into a new comment on this issue.
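As the commands later in this log show, --omit takes one pattern in which builder names and short commit hashes are separated by `|`. Here is a minimal sketch of assembling that pattern from a list, so that the quoting only has to be right in one place. `join_omit` is a hypothetical helper, not part of greplogs, and the example prints the command instead of running it:

```shell
# join_omit is a hypothetical helper (not part of greplogs): it joins its
# arguments with "|" to form a single alternation pattern for --omit.
# The function body runs in a subshell so the IFS change does not leak.
join_omit() (
  IFS='|'
  printf '%s\n' "$*"
)

# Builders and the commit hash are copied from the 2022-05-03 run in this log.
omit=$(join_omit loong64 plan9-386 plan9-amd64 linux-amd64-unified \
  freebsd-arm-paulzhol b6088cc)

# Print (rather than run) the command. Single quotes keep the shell from
# treating "|" as a pipe when the line is pasted back into a terminal.
printf "greplogs --triage -l -E . --omit='%s' --since=%s --before=%s\n" \
  "$omit" 2022-05-02 2022-05-03
```

Keeping the list as separate words makes it easy to append one more commit or builder when a run needs to be notched out further.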

Then, check off the failures from the list as you triage them. (It's ok to leave entries unchecked if you haven't gotten to them yet, but try to finish the last run before moving on to a new one.)
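To see how much of a run is still untriaged, the unchecked boxes in a saved copy of the output can be counted. This is a minimal sketch, assuming the Markdown was saved to a file and that each entry is a standard GitHub task-list item starting with `- [ ]` or `- [x]`. The exact layout that greplogs emits is not shown in this issue, so `count_unchecked` and the sample entries are hypothetical:

```shell
# count_unchecked FILE prints how many task-list items in FILE are still
# unchecked. Leading indentation is allowed so nested items are counted too.
count_unchecked() {
  grep -c -E '^[[:space:]]*[-*] \[ \]' "$1"
}

# Hypothetical sample standing in for a saved greplogs --triage comment.
sample=$(mktemp)
cat > "$sample" <<'EOF'
- [x] 2022-05-02 example-builder-a (hypothetical entry)
- [ ] 2022-05-02 example-builder-b (hypothetical entry)
- [ ] 2022-05-02 example-builder-c (hypothetical entry)
EOF

count_unchecked "$sample"
rm -f "$sample"
```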

@bcmills bcmills added the Builders and umbrella labels May 2, 2022
@bcmills bcmills added this to the Unreleased milestone May 2, 2022
@bcmills

This comment was marked as resolved.

@bcmills

This comment was marked as resolved.

@bcmills
Contributor Author

bcmills commented May 2, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|2278a51\|3ce203d\|e7c56fe\|a5dd684\|e7b0559 --since=2022-04-29 --before=2022-05-02

(29 matching logs)

@bcmills
Contributor Author

bcmills commented May 2, 2022

Notes for today:

@bcmills
Contributor Author

bcmills commented May 2, 2022

(The above commands are using a greplogs patched for triage: patches are in aclements/go-misc#11.)

@bcmills
Contributor Author

bcmills commented May 3, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-386\|plan9-amd64\|linux-amd64-unified\|freebsd-arm-paulzhol --since=2022-05-02 --before=2022-05-03 --details

(112 matching logs)

Lots of fallout from #52666. I'll notch that out by omitting the affected x/sys runs.

@bcmills
Contributor Author

bcmills commented May 3, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-386\|plan9-amd64\|linux-amd64-unified\|freebsd-arm-paulzhol\|b6088cc --since=2022-05-02 --before=2022-05-03

(48 matching logs)

@bcmills
Contributor Author

bcmills commented May 3, 2022

Notes for today:

@bcmills
Contributor Author

bcmills commented May 3, 2022

I see some major breakage on the dashboard for tomorrow, so I'm going to go ahead and run the logs from just before that.
(The tree is broken at CL 353989 and mostly-fixed at CL 397018, save for a few builders with remaining failures.)

@bcmills
Contributor Author

bcmills commented May 3, 2022

Notes:

@bcmills
Contributor Author

bcmills commented May 4, 2022

Notes:

@bcmills
Contributor Author

bcmills commented May 5, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-04 --before=2022-05-05 --details

(601 matching logs)

Clearly I need to notch out some of yesterday's breakage. 😅

@bcmills
Contributor Author

bcmills commented May 5, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|a5481fb\|78819d0\|f52b4ec --since=2022-05-04 --before=2022-05-05

(52 matching logs)

That's better, but still in bad shape. 😞

@bcmills
Contributor Author

bcmills commented May 5, 2022

Notes:

@bcmills
Contributor Author

bcmills commented May 6, 2022

Another rough day for the builders.

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-05 --before=2022-05-06 --details

(348 matching logs)

@bcmills
Contributor Author

bcmills commented May 6, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|5073c1c\|a7ab208\|983906f\|bb1f441\|7c74b0d --since=2022-05-05 --before=2022-05-06

(39 matching logs)

@bcmills
Contributor Author

bcmills commented May 6, 2022

Notes:

@bcmills
Contributor Author

bcmills commented May 9, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-06 --before=2022-05-09

(57 matching logs)
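The --since/--before pairs in these runs cover one calendar day at a time, or three days when a run follows a weekend, as in the command above. Here is a minimal sketch of computing such a window, assuming GNU date (the -d option used here is not available in BSD or macOS date). `triage_window` is a hypothetical helper and not a greplogs feature:

```shell
# triage_window DAY [DAYS] prints --since/--before flags for the DAYS days
# (default 1) that end at DAY. Assumes GNU date; -u keeps the arithmetic in
# UTC so daylight-saving changes cannot shift the result.
triage_window() {
  local day=$1 days=${2:-1} prev
  prev=$(date -u -d "$day $days days ago" +%F) || return 1
  printf '%s\n' "--since=$prev --before=$day"
}

triage_window 2022-05-03     # a normal weekday run
triage_window 2022-05-09 3   # a Monday run covering the weekend
```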

@bcmills
Contributor Author

bcmills commented May 9, 2022

Notes:

@heschi heschi added the NeedsInvestigation label May 9, 2022
@bcmills
Contributor Author

bcmills commented May 10, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-09 --before=2022-05-09T16:02:00

(4 matching logs)

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|ppc64\|riscv64\|darwin-arm64 --since=2022-05-09T16:02:00 --before=2022-05-10

(18 matching logs)

@bcmills
Contributor Author

bcmills commented May 10, 2022

Notes:

@bcmills
Contributor Author

bcmills commented May 11, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|ppc64\|riscv64 --since=2022-05-10 --before=2022-05-10T17:20:00

(14 matching logs)

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|riscv64 --since=2022-05-10T17:20:00 --before=2022-05-11

(13 matching logs)

@bcmills
Contributor Author

bcmills commented May 11, 2022

Notes:

  • The solaris-amd64-oraclerel builder is catching a lot of platform-independent invalid assumptions about timeouts in tests, but the high rate of other failures on that builder (such as cmd/compile,runtime: frequent test timeouts on solaris-amd64-oraclerel #51443) makes it tedious to diagnose. @rorth, I'm going to skip triaging that builder (and consider the port broken) until some of these issues are resolved.
  • ppc64 and ppc64le are fixed as of CL 405116; riscv64 is still broken at head.

@toothrot
Contributor

toothrot commented Jul 20, 2022

@toothrot
Contributor

@cherrymui
Member

cherrymui commented Jul 27, 2022

greplogs --triage --since=2022-07-20

@cherrymui
Member

cherrymui commented Jul 27, 2022

@dmitshur
Contributor

dmitshur commented Aug 5, 2022

greplogs --triage --since=2022-07-27

(127 matching logs)

@findleyr
Contributor

findleyr commented Aug 5, 2022

In the last triage batch, there were a couple duplicate issues filed for gopls flakes that had already been fixed (with closed issues).

That's fine, I don't mind de-duping, but is there a way that I can preempt the triage process by making sure the existing issues are associated with the flakes? Perhaps a label I can add, or a particular format to the issue I create?

@heschi
Contributor

heschi commented Aug 19, 2022

greplogs --triage --since=2022-08-04 --omit dragonfly-amd64-622 --omit android-arm.\*-corellium --omit linux-ppc64le-.\* --omit openbsd-arm.\*-jsing --omit linux-amd64-alpine --omit 40e737f\|04bbc27\|12ff722 --known-issue #54416=TestTestConn/UnixPipe '--known-issue=#54553=lock ordering' --known-issue=#54503=gopkg.in '--known-issue=#54555=connections still open after closing DB' --known-issue=#53456=TestDebugLines --known-issue=#51323=INTERNAL_ERROR --known-issue=#29951=TestNewIntAllocs --known-issue=#38111=TestLookup --known-issue=#53397=issue52788.go --known-issue=#54337=TestAppendOfMake --known-issue=#53702=issue53702.go --known-issue=#54557=wycheproof --known-issue=#22857=TestLookupLongTXT --known-issue=#54458=TestCgoTraceParser '--known-issue=#54411=newstack at runtime/internal/atomic' --omit js-wasm '--known-issue=#53722=Get "https://proxy.golang.com.cn.*lookup'

(500 matching logs)

@heschi
Contributor

heschi commented Aug 19, 2022

I added a feature to greplogs to automatically resolve logs that matched a passed regex, which helped me get the list down from 500 (after omitting half a dozen builders) to ~100. I give up.

@bcmills
Contributor Author

bcmills commented Aug 22, 2022

I went back through the entries that had been skipped, and found several issues that needed to be updated or filed:

There are still more failures from that batch waiting to be triaged, but it's clear to me that there is quite a bit of signal among the noise. The weekly triage hadn't happened the week before, either — there wouldn't be so much to sift through if we'd been keeping up with things as the tree reopened.

@gopherbot
Contributor

Change https://go.dev/cl/425075 mentions this issue: cmd/greplogs: associate known issues by regex

@bcmills
Contributor Author

bcmills commented Aug 22, 2022

That's all of the entries that were still unchecked (unless I missed one in the middle), but that 500 number is also suspiciously round. 🤔

@heschi
Contributor

heschi commented Aug 22, 2022

The 500 really is just coincidence.

gopherbot pushed a commit to golang/build that referenced this issue Aug 25, 2022
Add a flag --known-issue that associates issues with log entries based
on regexp matches. It's used like this:

--known-issue='#53456=TestDebugLines'

Which results in the check box for that log being pre-checked, and the
text '#53456' being added, which turns into a pretty link on GitHub.

It might be nice to group issues as well, but I didn't want to mess with
the chronological ordering.

For golang/go#52653.

Change-Id: If4615cd798ba72c1c1ee3cb43f1d1ad6d4319528
Reviewed-on: https://go-review.googlesource.com/c/build/+/425075
TryBot-Result: Gopher Robot <[email protected]>
Run-TryBot: Heschi Kreinick <[email protected]>
Auto-Submit: Heschi Kreinick <[email protected]>
Reviewed-by: Bryan Mills <[email protected]>
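The behaviour described in this commit message can be sketched in a few lines of shell. This is not the greplogs implementation, which is written in Go and uses Go regular expressions. It is an illustration that uses grep -E as a stand-in, assumes each spec is split at its first `=`, and invents the output line format. `tag_log` and the sample log text are hypothetical:

```shell
# tag_log LOGFILE SPEC... prints one task-list line for LOGFILE.
# Each SPEC has the form LABEL=REGEX. It is split at the first "=" so the
# regex itself may contain "=" characters. The first matching spec wins.
tag_log() {
  local log=$1 spec label re
  shift
  for spec in "$@"; do
    label=${spec%%=*}   # text before the first "="
    re=${spec#*=}       # text after the first "="
    if grep -q -E -e "$re" "$log"; then
      # Pre-checked box plus the label, which GitHub renders as a link
      # when the label is an issue reference such as #53456.
      printf '%s\n' "- [x] $log $label"
      return 0
    fi
  done
  printf '%s\n' "- [ ] $log"
}

# Hypothetical log content for demonstration.
log=$(mktemp)
printf '%s\n' '--- FAIL: TestDebugLines (0.01s)' > "$log"
tag_log "$log" '#38111=TestLookup' '#53456=TestDebugLines'
rm -f "$log"
```

Because grep -E and Go's regexp package differ in places (for example, inline flags such as `(?ms)`), patterns written for the real flag will not all behave the same way in this stand-in.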
@heschi
Contributor

heschi commented Aug 29, 2022

OK, trying to catch up on triage. After a couple hours' work of building --known-issue flags, I have the following. Most of the remainder are one-offs. Note that this excludes a number of very widespread failures; see the egrep -v.

greplogs --triage --since=2022-08-19 --omit freebsd-arm64-dmgk '--known-issue=tools broken by unified IR=without types was imported from' --known-issue=#54655=SuggestedFix/a4\|TestRunDespiteErrors --known-issue=#38111=TestLookup --known-issue=#54742=proxyconnect '--known-issue=go.dev/cl/424854 broke cgo=_Ctype_struct_sqlite3_context|_Ctype_struct___FILE|_Ctype_struct__IO_FILE' '--known-issue=#54553=lock ordering' '--known-issue=#53722=Get "https://proxy.golang.com.cn.*lookup' '--known-issue=go.dev/cl/419014 broke darwin-amd64=TestSignalForwardingGo' --known-issue=#54747=TestSubmit '--known-issue=go.dev/cl/425004 fixed non-hermetic vulndb test=TestScanModule' '--known-issue=go.dev/cl/425274 fixed nondeterministic golden tests=TestCommand/testdata' '--known-issue=#54749=unknown caller pc' --known-issue=#54750=TestServe/tcp --known-issue=#54751=TestIssue32814/Modules --known-issue=#54752=TestIdleTimeout | egrep -v 'unified IR|go.dev/cl/419014|#54655|go.dev/cl/424854|go.dev/cl/425274'

(1116 matching logs)

@heschi
Contributor

heschi commented Aug 29, 2022

@heschi
Contributor

heschi commented Sep 1, 2022

greplogs --triage --since=2022-08-28 --omit corellium --known-issue=#38111=TestLookup --known-issue=#54742=proxyconnect '--known-issue=#53722=Get "https://proxy.golang.com.cn.*lookup' --known-issue=#54747=TestSubmit '--known-issue=#54749=unknown caller pc' --known-issue=#54750=TestServe/tcp --known-issue=#54751=TestIssue32814/Modules --known-issue=#54752=TestIdleTimeout '--known-issue=go.dev/cl/426495=RecentTag not implemented' '--known-issue=#54729=unexpected stale targets' '--known-issue=go.dev/cl/426535=field links not found'

@mknyszek
Contributor

mknyszek commented Sep 6, 2022

greplogs --triage --since=2022-09-01 --omit corellium --known-issue 'go.dev/cl/421882=Incomplete not declared by package cgo' --known-issue '#54814=missing xsym in relocation' --known-issue 'go.dev/cl/426803=TestCommandLine\/' --known-issue 'go.dev/cl/426075=undefined\: atomic' --known-issue 'go.dev/cl/427585=TestCommand ' --known-issue '#54903=TestNormalTerms ' --known-issue '#54885=SignalInVDSO|maymorestack=mayMoreStackPreempt\n--- FAIL\: TestCgoPprofCallback' --known-issue='go.dev/cl/427135=TestDebugCallUnsafePoint' --known-issue='network flake=502 Bad Gateway' --known-issue='#49387=(?ms)connection refused.*FAIL\s+golang\.org/x/tools/internal/jsonrpc2_v2' --known-issue='#46047=(?m)FAIL: TestIdleTimeout.*\n.*goroutine leak detected' | grep -v 'go\.dev\/cl\/' | grep -v 'network flake' | grep -v '#54903' | grep -v '#54885'

(1509 matching logs)

Note that I excluded a bunch of noisy but presumably fixed failures (the grep -vs in the command). They're either CLs that have been reverted, issues that AFAICT were just network flakes, or map to issues that have since been fixed. The only filter that refers to an issue that's not fixed is #54903, because it's been failing very hard on all builders for some time now, so it produced a lot of noise. Everything else that isn't checked still needs triaging.
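The post-filtering described above amounts to dropping every output line that carries one of a set of labels. Here is a minimal sketch that collapses the chain of grep -v calls into one fixed-string filter. `drop_labeled` and the sample lines are hypothetical; the labels are the ones from the command in this comment:

```shell
# drop_labeled LABEL... reads lines on stdin and prints those that contain
# none of the labels. -F treats labels as fixed strings, so characters such
# as "." and "/" need no escaping.
drop_labeled() {
  local patterns status
  # With no labels there is nothing to drop; pass the input through.
  [ "$#" -gt 0 ] || { cat; return; }
  patterns=$(mktemp) || return 1
  printf '%s\n' "$@" > "$patterns"
  grep -v -F -f "$patterns"
  status=$?
  rm -f "$patterns"
  return "$status"
}

# Hypothetical task-list lines standing in for greplogs output.
printf '%s\n' \
  '- [x] log-a go.dev/cl/426803' \
  '- [x] log-b network flake' \
  '- [ ] log-c' \
  '- [x] log-d #49387' |
  drop_labeled 'go.dev/cl/' 'network flake' '#54903' '#54885'
```

Only the log-c and log-d lines survive, since neither carries one of the four labels.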

@mknyszek
Contributor

mknyszek commented Sep 6, 2022

There are a handful of unchecked boxes left but I'm going to sign off for today. I think they're almost all new issues, or existing issues that I haven't been able to find. I'll pick those up and update this post tomorrow.

Filed:

Updated:

@toothrot
Contributor

toothrot commented Sep 13, 2022

greplogs --triage --since=2022-09-06 --known-issue=#54891=TestServerCancelsReadHeaderTimeoutWhenIdle --known-issue=#46520=TestIdleTimeout --known-issue=#55050=TestScript/mod_proxy_errors --known-issue=#55051=TestTextHandlerAlloc --known-issue=#55051=TestAlloc --known-issue=#55052=TestTextHandlerSource --known-issue=#53972=buildrelease_test --known-issue=#54655=SuggestedFix/a4\\\|TestRunDespiteErrors --known-issue=#38111=TestLookup --known-issue=#54742=proxyconnect '--known-issue=go.dev/cl/424854 broke cgo=_Ctype_struct_sqlite3_context|_Ctype_struct___FILE|_Ctype_struct__IO_FILE' '--known-issue=#54553=lock ordering' '--known-issue=#53722=Get "https://proxy.golang.com.cn.*lookup' '--known-issue=go.dev/cl/419014 broke darwin-amd64=TestSignalForwardingGo' --known-issue=#54747=TestSubmit '--known-issue=go.dev/cl/425004 fixed non-hermetic vulndb test=TestScanModule' '--known-issue=now fixed nondeterministic golden tests=TestCommand/testdata' '--known-issue=#54749=unknown caller pc' --known-issue=#54750=TestServe/tcp --known-issue=#54751=TestIssue32814/Modules --known-issue=#54752=TestIdleTimeout --known-issue=#54903=TestNormalTerms '--known-issue=#55054=runtime error: index out of range.*with length 0' --known-issue=#55055=TestLogTextHandler --known-issue=#55055=TestConnections

@bcmills
Contributor Author

bcmills commented Sep 14, 2022

Filed:

Updated:

Rerouted to package owners:

Closed as already fixed:

Closed as duplicate:

@bcmills
Contributor Author

bcmills commented Nov 16, 2022

We're now using watchflakes instead of this tracking issue for build dashboard triage.

@bcmills bcmills closed this as not planned Nov 16, 2022
@golang golang locked and limited conversation to collaborators Nov 16, 2023
8 participants