Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Travis manifest_upload job occasionally fails to create merge_pr_* tags or upload manifest #10572

Closed
foolip opened this issue Apr 23, 2018 · 43 comments

Comments

@foolip
Copy link
Member

foolip commented Apr 23, 2018

Running a variation of the script in Ecosystem-Infra/ecosystem-infra-stats#6 (comment) to make sure that all PRs merged after 2017-07-01 have been tagged, in particular ones created well before that point, I noticed that some recently merged PRs have also not been tagged.

#10543 is the most recent. The PR and GitHub API says the PR was merged in commit 0628e2e, but that's #10124. There are two Travis jobs appear to have been run with that as checked out revision: https://travis-ci.org/w3c/web-platform-tests/jobs/369319256 and https://travis-ci.org/w3c/web-platform-tests/jobs/369365460. The second says "ERROR:__main__:Tag creation failed", but the first does not. At the very least something seems to have gone wrong on GitHub's side here.

#10445 was processed in https://travis-ci.org/w3c/web-platform-tests/jobs/365663222 but log says "ERROR:__main__:No PR found for master". Why?

For #10194 it seems like no build was triggered, in https://travis-ci.org/w3c/web-platform-tests/builds I see only the one before and after: https://travis-ci.org/w3c/web-platform-tests/builds/361106934 + https://travis-ci.org/w3c/web-platform-tests/builds/361263818

Those three are the only that are not tagged since #8005 was merged, the rest were merged before that, the latest of those being #4974.

@foolip foolip added the infra label Apr 23, 2018
@foolip
Copy link
Member Author

foolip commented Apr 23, 2018

For #10445, because no tag was created, untagged-9aa6616bd18782ef19ad was created instead.

tag_master.py would have generated the URL https://api.github.com/search/issues?q=type:pr+is:merged+repo:w3c/web-platform-tests+a5bb6d23ac757fcb5e8aef6fd7ed7587324982db which now returns one result. However, at the time, the "No PR found for master" line was reached, so it must have returned zero results.

Possibly there is a race here, that the GitHub API isn't guaranteed to return the results of recent changes, or there was an internal problem which manifested as no results instead of a non-OK HTTP status code.

@csnardi
Copy link
Member

csnardi commented Apr 23, 2018

The changes made in #10543 were already made in #10502 -- I assume GitHub when rebasing & merging created an empty commit and so that was swallowed up somehow (not sure how git actually deals with that, but it would make sense).

@foolip
Copy link
Member Author

foolip commented Apr 24, 2018

Oh, that would explain it indeed. @jgraham, in this situation, it seems like rather than treat the PR as merged, it would be better to close it? In any case it's appropriate that no merge_pr_* tag was created, because nothing on master actually happened, or at least commit 0628e2e is no better a place to put the tag than some of the previous commits.

@foolip
Copy link
Member Author

foolip commented Apr 26, 2018

I have created the merge_pr_10194 and merge_pr_10445 tags, and deleted the untagged-9aa6616bd18782ef19ad that was automatically created.

#10543 is still an odd case that will need @jgraham's input.

@jgraham
Copy link
Contributor

jgraham commented Apr 26, 2018

Oh, if we ended up with an empty set of changes after rebase, it seems like everything is fine and we don't need to worry. What am I missing?

@foolip
Copy link
Member Author

foolip commented Apr 26, 2018

@jgraham the problem is that the PR claims to be merged at commit 0628e2e, but that's just what master happened to be when the rebase happened. And https://github.com/w3c/web-platform-tests/pull/10543/files still lists changes that weren't made as part of that PR.

Seems to me that the rebase must have resulted in no commits on the branch (not an empty commit), and that case could instead be handled by abandoning the PR, that would be less confusing to understand after the fact.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

It just happened again. untagged-f360b6391e175d60f813 was created for commit 4f39716 which is from #10663, and the build was https://travis-ci.org/w3c/web-platform-tests/builds/371818852.

This time the log says:

++python tools/ci/tag_master.py
ERROR:__main__:Tag creation failed:
HTTP Error 422: Unprocessable Entity

Don't know why that is, maybe a GitHub API problem.

foolip added a commit that referenced this issue Apr 27, 2018
This way, the Travis job will fail, and ci_manifest.sh will not
continue to upload a manifest for an untagged-* tag.

Related to #10572 but
doesn't fix it, rather it'll make the Travis jobs fail when this
problem happens, making it easier to notice.
@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

#10673 is to make the Travis job fail if no merge_pr_* tag could be created, but it won't help with the problem just reported in #10572 (comment).

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

I restarted https://travis-ci.org/w3c/web-platform-tests/jobs/371818853 but that didn't fix the problem, now it says "ERROR:__main__:No PR found for master" instead.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

I have deleted untagged-f360b6391e175d60f813 and created merge_pr_10663 instead.

foolip added a commit that referenced this issue Apr 27, 2018
This way, the Travis job will fail, and ci_manifest.sh will not
continue to upload a manifest for an untagged-* tag.

Related to #10572 but
doesn't fix it, rather it'll make the Travis jobs fail when this
problem happens, making it easier to notice.

Drive-by: use the sha instead of "master" in log output
foolip added a commit that referenced this issue Apr 27, 2018
@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

#10673 worked, because we just had a broken build from master: https://travis-ci.org/w3c/web-platform-tests/jobs/372015780, which is for #10678.

That PR doesn't show up in https://api.github.com/search/issues?q=type:pr+is:merged+repo:w3c/web-platform-tests+438f2694a95cb9c7f329ef89c1d37fa7140106f0 even now, so restarting it can't help. @annevk, I presume you did nothing unusual when merging that, just used the buttons?

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

The PR merged immediately before it is #10675 and https://api.github.com/search/issues?q=type:pr+is:merged+repo:w3c/web-platform-tests+cc29a15b5c00d90149d8f60f96ca4f77a989a2d7 does find that, but I can't see what the difference could be.

The temptation to blame everything odd on GitHub is strong, should we call this "search indexing is broken"? dunno

@annevk
Copy link
Member

annevk commented Apr 27, 2018

Right, per https://github.com/w3c/web-platform-tests/commits/master it's been broken for a while.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

@annevk most of the red you see there will be because of #10653, https://travis-ci.org/w3c/web-platform-tests/builds is what I'm looking at to find broken Travis builds on master.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

But https://travis-ci.org/w3c/web-platform-tests/jobs/372023058 also failed. The run for #10673 and the four master pushes that followed passed, otherwise I'd strongly suspect that I just broke it with my changes. But https://api.github.com/search/issues?q=type:pr+is:merged+repo:w3c/web-platform-tests+420757ba95a3b59dbb5d66e23fa2f9380f20280b does return a PR, and now actually so does the search from #10572 (comment).

So it definitely looks like there's a race here. It takes some amount of time after the PR is merged (updating master) until the API will find PRs by searching for commit, and the Travis job might run before that.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

I'll try to restart those two jobs now to see if they succeed or if the tags end up in the wrong place.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

Note that travis_retry probably won't fix this problem unless it also has waiting built in. Otherwise it'll just try n times fast and fail anyway.

@foolip
Copy link
Member Author

foolip commented Apr 27, 2018

OK, job restart worked and created merge_pr_10678 + merge_pr_10671, I didn't do those manually.

moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue May 2, 2018
…y fails, a=testonly

Automatic update from web-platform-testsUse a non-zero exit code if tag_master.py fails (#10673)

This way, the Travis job will fail, and ci_manifest.sh will not
continue to upload a manifest for an untagged-* tag.

Related to web-platform-tests/wpt#10572 but
doesn't fix it, rather it'll make the Travis jobs fail when this
problem happens, making it easier to notice.

Drive-by: use the sha instead of "master" in log output

--

wpt-commits: 1e5d263997a3cd3f09471ab642774eb3bb150e84
wpt-pr: 10673
moz-v2v-gh pushed a commit to mozilla/gecko-dev that referenced this issue May 2, 2018
…s wrong in tag_master.py, a=testonly

Automatic update from web-platform-testsLog the response body when something goes wrong in tag_master.py (#10681)

Towards debugging web-platform-tests/wpt#10572.
--

wpt-commits: ae34fb2b39f8b2cf9c3989a566ab938cde45dec0
wpt-pr: 10681
@foolip
Copy link
Member Author

foolip commented Oct 9, 2018

However, it's run long enough that I can tell that the latest trouble was just 15 days ago with #13187. That was tagged but there's no release, because Travis failed due to Octokit::BadGateway:
https://travis-ci.org/web-platform-tests/wpt/jobs/432494382

There are more like that.

The most recent with no tags is #12824, where the Travis build associated with commit d7627ae (https://travis-ci.org/web-platform-tests/wpt/jobs/432494382) oddly thinks itself to be for commit 00c1fd9 and a PR build. This is weird, possibly the result of GitHub or Travis backend bugs.

#12854 also got no tag, but for network error reasons:
https://travis-ci.org/web-platform-tests/wpt/jobs/424950964

As far as my script has reached, there are problems with ~1.5% (36/2214) of merged PRs. That's too much, so I'll leave this open.

I'll also manually add some missing tags.

@foolip
Copy link
Member Author

foolip commented Oct 9, 2018

Ah, wait, I've made a mistake! #12824 wasn't targeting the master branch, so that explains the weirdness. My script should skip PRs that target anything but master.

@foolip foolip mentioned this issue Oct 9, 2018
@foolip
Copy link
Member Author

foolip commented Oct 9, 2018

#11452 is like #10572 (comment) so that case still happens.

@foolip
Copy link
Member Author

foolip commented Oct 12, 2018

I've run a tweaked version of my script over all PRs and have the output here:
https://gist.github.com/foolip/1146e08bd820b5ddb110a06b93b9a085

At total of merged 5377 PRs after 2017-07-01. Upload the manifest as MANIFEST-.json.gz was on 2017-10-30, so PRs before that don't have releases, just tags. There's a lot of manifest backfilling we'd could do, before and after that point.

Good news is that only two PRs show up as missing tags, and that's #10543 and #11452, already explained in this issue.

@foolip
Copy link
Member Author

foolip commented Oct 12, 2018

#10605 had no manifest, again because of a network error in the deploy step. It was recent, so I restarted https://travis-ci.org/web-platform-tests/wpt/jobs/440405376, but that then fails with "HTTP Error 422: Unprocessable Entity" because the tag already exists...

@foolip
Copy link
Member Author

foolip commented Oct 12, 2018

Note that there are some tags for which releases exist but the assets (manifest file) do not. Those are merge_pr_10605 and merge_pr_10247, and I'll just delete the releases to put them in the same category as 'no release'.

@foolip
Copy link
Member Author

foolip commented Oct 12, 2018

Alright. So far in 2018 it looks like we have 53 merge_pr_ tags that don't have the manifest assets. We could try to improve the reliability and maybe even backfill for all tagged commit. The question is if it matters. @lukebjerring, if this were perfectly reliable, would that allow us to do something cool on wpt.fyi that we currently cannot do?

@Hexcles
Copy link
Member

Hexcles commented Oct 16, 2018

Briefly discussed this with Philip today. We need some better deploy mechanisms. We can keep pushing the limits of the CI system, by adding more retries to each step, and making sure later steps only happen when earlier steps succeed (e.g. make sure we have "set -e"). Or rather, it'd be good if there's a better system to define such workflow or pipeline. GitHub just announced Actions which looks interesting and could be simpler than square pegging the workflow into Travis, but I haven't looked into the details yet.

@foolip
Copy link
Member Author

foolip commented Oct 22, 2018

There was a problem today again: #13638

@foolip
Copy link
Member Author

foolip commented Nov 13, 2018

A novel type of failure in https://travis-ci.org/web-platform-tests/wpt/jobs/454430391, where the deploy step failed with "No output has been received in the last 10m0s".

I wonder if anyone is actually using Travis to deploy stuff at any significant scale :/

As @Hexcles says, this seems like a thing we should eventually migrate to the new Actions.

@foolip
Copy link
Member Author

foolip commented Nov 13, 2018

I clicked some buttons as soon as Actions was announced, and the enrollment page now says "This account is already on the waitlist for the GitHub Actions Beta!"

@Hexcles
Copy link
Member

Hexcles commented Nov 13, 2018

@foolip that's new for this particular task, but not really novel for Travis in general. Travis often has large-scale task timeouts like this, which seems like transient infra issues. In fact, a bunch of stability checks failed yesterday for the same reason.

@mdittmer
Copy link
Contributor

friendly ping. If this is to stay priority:roadmap should we find an owner for it?

@foolip
Copy link
Member Author

foolip commented Jan 25, 2019

Yep, until this is fixed we won't be able to depend on having a manifest for wpt.fyi, which would be useful. @lukebjerring do you want to own this, or what's the priority of having the manifest available reliably?

@lukebjerring
Copy link
Contributor

No, I don't want to own this.
wpt.fyi falls back to the latest manifest when it 404's for a particular SHA, so it's not blocked on having this be reliable, it's just a lower-quality feature (since there can be manifest mismatch in some cases).

@foolip
Copy link
Member Author

foolip commented Feb 5, 2019

@lukebjerring glad to hear it's not a blocker for wpt.fyi :)

This is continuing to fail intermittently on Travis CI, https://github.com/web-platform-tests/wpt/runs/50028082 being a recent case.

I've played around with GitHub Actions enough to be confident we can do this using GitHub actions when it exits beta, but until then let's leave it alone.

@zcorpan
Copy link
Member

zcorpan commented Feb 13, 2019

https://travis-ci.com/web-platform-tests/wpt/jobs/177122640 might be another case of this issue.

++python tools/ci/tag_master.py
ERROR:__main__:Tag creation failed:
HTTP Error 422: Unprocessable Entity

I restarted the job but it errored again. I don't know how to access the log of the first attempt. The PR was green.

@foolip
Copy link
Member Author

foolip commented Feb 13, 2019

After a restart the logs of previous runs are gone I think.

I've seen Error 422 before when working on this, but I don't remember when it happens.

@foolip
Copy link
Member Author

foolip commented May 10, 2019

This was moved to GitHub Actions in #15781, so I'll close this.

GitHub Actions is probably not 100% reliable either, but I'll file another issue when it becomes a problem.

@foolip foolip closed this as completed May 10, 2019
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Oct 2, 2019
…y fails, a=testonly

Automatic update from web-platform-testsUse a non-zero exit code if tag_master.py fails (#10673)

This way, the Travis job will fail, and ci_manifest.sh will not
continue to upload a manifest for an untagged-* tag.

Related to web-platform-tests/wpt#10572 but
doesn't fix it, rather it'll make the Travis jobs fail when this
problem happens, making it easier to notice.

Drive-by: use the sha instead of "master" in log output

--

wpt-commits: 1e5d263997a3cd3f09471ab642774eb3bb150e84
wpt-pr: 10673

UltraBlame original commit: eacd312dd4c62c177e30f73d2e1c204cf8b7ce82
gecko-dev-updater pushed a commit to marco-c/gecko-dev-comments-removed that referenced this issue Oct 2, 2019
…s wrong in tag_master.py, a=testonly

Automatic update from web-platform-testsLog the response body when something goes wrong in tag_master.py (#10681)

Towards debugging web-platform-tests/wpt#10572.
--

wpt-commits: ae34fb2b39f8b2cf9c3989a566ab938cde45dec0
wpt-pr: 10681

UltraBlame original commit: a21d7c9772895dd9881ee9d88e63a04f88517b18
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Oct 3, 2019
…y fails, a=testonly

Automatic update from web-platform-testsUse a non-zero exit code if tag_master.py fails (#10673)

This way, the Travis job will fail, and ci_manifest.sh will not
continue to upload a manifest for an untagged-* tag.

Related to web-platform-tests/wpt#10572 but
doesn't fix it, rather it'll make the Travis jobs fail when this
problem happens, making it easier to notice.

Drive-by: use the sha instead of "master" in log output

--

wpt-commits: 1e5d263997a3cd3f09471ab642774eb3bb150e84
wpt-pr: 10673

UltraBlame original commit: eacd312dd4c62c177e30f73d2e1c204cf8b7ce82
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified that referenced this issue Oct 3, 2019
…s wrong in tag_master.py, a=testonly

Automatic update from web-platform-testsLog the response body when something goes wrong in tag_master.py (#10681)

Towards debugging web-platform-tests/wpt#10572.
--

wpt-commits: ae34fb2b39f8b2cf9c3989a566ab938cde45dec0
wpt-pr: 10681

UltraBlame original commit: a21d7c9772895dd9881ee9d88e63a04f88517b18
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Oct 3, 2019
…y fails, a=testonly

Automatic update from web-platform-testsUse a non-zero exit code if tag_master.py fails (#10673)

This way, the Travis job will fail, and ci_manifest.sh will not
continue to upload a manifest for an untagged-* tag.

Related to web-platform-tests/wpt#10572 but
doesn't fix it, rather it'll make the Travis jobs fail when this
problem happens, making it easier to notice.

Drive-by: use the sha instead of "master" in log output

--

wpt-commits: 1e5d263997a3cd3f09471ab642774eb3bb150e84
wpt-pr: 10673

UltraBlame original commit: eacd312dd4c62c177e30f73d2e1c204cf8b7ce82
gecko-dev-updater pushed a commit to marco-c/gecko-dev-wordified-and-comments-removed that referenced this issue Oct 3, 2019
…s wrong in tag_master.py, a=testonly

Automatic update from web-platform-testsLog the response body when something goes wrong in tag_master.py (#10681)

Towards debugging web-platform-tests/wpt#10572.
--

wpt-commits: ae34fb2b39f8b2cf9c3989a566ab938cde45dec0
wpt-pr: 10681

UltraBlame original commit: a21d7c9772895dd9881ee9d88e63a04f88517b18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

9 participants