Suggestion: pager/alerts/auto create issues for infra-related issues #2359

Closed
mmarchini opened this issue Jun 17, 2020 · 11 comments
Comments

@mmarchini
Contributor

mmarchini commented Jun 17, 2020

Sometimes jobs will fail for days on easily fixable problems, like a read-only filesystem, until someone notices. Ideally, when a job fails for an infra-related reason, collaborators ping the WG, but this sometimes doesn't happen, or the WG is overloaded with pings for multiple reasons (not only infra-related issues). On top of that, GitHub's notifications interface doesn't provide an easy way to look at all pings to a specific team, and since most of us are in multiple teams we get a lot of mixed "Team Mention" notifications.

What if we could identify (most) infra issues on Jenkins and let the WG know in a timely fashion? This would help us act on issues sooner, when we are available. This wouldn't imply an SLA for the WG; we're all still volunteers, and if no one is available the issue will remain unfixed until someone is, which is fine.

I'm not sure whether the best approach is to create an issue here, send a notification on IRC or email, or use something like PagerDuty. Regardless of how we get notified, I believe we can accomplish this with a small effort and without introducing too much maintenance burden. Jenkins is already hooked up to github-bot, and ncu-ci has some heuristics to identify infra and Jenkins issues (although today it lumps infra issues in with build issues, which is easily fixable). We could use similar heuristics on github-bot to identify potential infra issues and send alerts when they happen. Or github-bot could forward all failures to a separate service that does that (if we want to decouple but don't want to add a new hook to Jenkins). We could even add thresholds for certain errors (for example, "read-only fs" could trigger on the first occurrence, but flakier issues like a corrupted git directory could require X out of Y failures to trigger).
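
To make the threshold idea more concrete, something like the sketch below is what I have in mind (the categories and numbers are purely illustrative, not values we've agreed on):

```js
// Illustrative alert thresholds: deterministic infra problems alert on the
// first occurrence, flakier ones only after X hits in the last Y failures.
const ALERT_RULES = {
  'read-only-fs': { hits: 1, window: 1 },
  'corrupted-git-directory': { hits: 3, window: 10 },
};

// `recentFailures` is the list of failure types seen in the last runs,
// oldest first.
function shouldAlert(failureType, recentFailures) {
  const rule = ALERT_RULES[failureType];
  if (rule === undefined) return false;
  const window = recentFailures.slice(-rule.window);
  const hits = window.filter((type) => type === failureType).length;
  return hits >= rule.hits;
}
```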

What do y'all think? If folks are on board with this, I can implement a proof of concept.

mmarchini added a commit to mmarchini/node-core-utils that referenced this issue Jun 17, 2020
Some errors we see are caused by underlying infrastructure issues (most
commonly filesystem corruption). Correctly classifying can help when
collecting statistics, when pinging the build team, or even for
automated notification (see
nodejs/build#2359).
@AshCripps
Member

Is there some kind of Jenkins plugin that could do all this for us, rather than having to set up a separate thing?

@mmarchini
Contributor Author

I'm not familiar enough with Jenkins plugins to answer whether there is one, but a requirement would be for the plugin to be able to parse the output of failed jobs to determine whether a failure is a code-related issue, a Jenkins issue, or an issue with the agent.

@mmarchini
Contributor Author

I looked at some plugins, and the closest I could find is Build Failure Analyzer, but it only categorizes failures in the interface; it doesn't send notifications. That could still be a useful feature to have, and it could even improve the user experience for collaborators.

We don't necessarily need to use the github-bot for this; the main point of the issue is to notify the appropriate folks when there's an actionable failure happening on an agent. If I missed a Jenkins plugin that does that, and if using it would have a lower maintenance burden than implementing it ourselves, that works too.

@rvagg
Member

rvagg commented Jun 18, 2020

I'd be fine with something that pinged me via email or some other priority channel, as long as I can control the hours of the day I can be bothered. My GitHub notifications are shunted to a separate email folder that I don't read until after I've started my workday. Sometimes "critical" issues escape my notice until well into my workday (sometimes entirely over a weekend). If I get a direct email or someone pings me on IRC I'm usually on it during the hours my devices are allowed to bother me. Perhaps the easiest thing is just mapping out who's available to handle what kinds of problems during what times of day, and how they're best contacted.

@AshCripps
Member

I had a thought that if we were to create a solution for this ourselves, it might be a good opportunity for the new people looking to join the working group to help out on, as it requires no real access for testing.

@mmarchini
Contributor Author

mmarchini commented Jun 18, 2020

I'd be fine with something that pinged me via email or some other priority channel, as long as I can control the hours of the day I can be bothered.

This should be easy to do if we implement the workflow ourselves. For example, we could have an Action on this repository that responds to webhook calls, and when it receives a failure it creates an issue here and pings folks who are available (availability could be set up as a JSON/YAML/INI config file, easy to change via PR).

edit: to implement it as an Action:

1. create a `repository_dispatch` workflow in this repository;
2. change github-bot so it sends a POST to the appropriate GitHub API endpoint with the job name + job id;
3. check whether ncu-ci knows how to handle that job name (checking if it is a -pr or -commit job should be enough);
4. call `ncu-ci url` and filter for INFRA_FAILURE, JENKINS_FAILURE and GIT_FAILURE (those are the ones we might need to take action on);
5. if one of those failures is present, create an issue via API calls, check the member availability config file, and send notifications (email, IRC mentions, etc.) to the folks who are available.

I probably missed something, but if this works as I'm thinking, it should be fairly straightforward to implement and maintain. A rough sketch of the core step is below.
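
Very roughly, the Action's main step could look something like this (`classifyFailures` and `notify` are placeholders for whatever we end up wiring to ncu-ci and to email/IRC, and the availability format is just an example):

```js
'use strict';

// Failure categories we'd want to act on (same names ncu-ci uses today).
const ALERTABLE = new Set(['INFRA_FAILURE', 'JENKINS_FAILURE', 'GIT_FAILURE']);

// Example availability config; in practice this would be loaded from a
// JSON/YAML file in the repository so it can be changed via PR.
const availability = [
  { handle: 'someone', contact: 'irc', fromUTC: 8, toUTC: 18 },
];

// `octokit` is an authenticated @octokit/rest client; `classifyFailures` and
// `notify` are placeholder helpers, not existing APIs.
async function handleDispatch({ jobName, jobId }, { octokit, classifyFailures, notify }) {
  // Only handle job types ncu-ci knows how to parse (-pr / -commit jobs).
  if (!/-(pr|commit)$/.test(jobName)) return;

  // Ask ncu-ci (or equivalent) for the failure categories of this run.
  const failures = await classifyFailures(jobName, jobId);
  const actionable = failures.filter((f) => ALERTABLE.has(f.type));
  if (actionable.length === 0) return;

  // Open a tracking issue in this repository.
  await octokit.issues.create({
    owner: 'nodejs',
    repo: 'build',
    title: `Infra-related failure in ${jobName} (run ${jobId})`,
    body: actionable.map((f) => `- ${f.type}: ${f.reason}`).join('\n'),
  });

  // Ping only the folks whose config says they're currently available.
  const hour = new Date().getUTCHours();
  const available = availability.filter((m) => hour >= m.fromUTC && hour < m.toUTC);
  await Promise.all(available.map((m) => notify(m, actionable)));
}

module.exports = { handleDispatch };
```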

I had a thought that if we were to create a solution for this ourselves, it might be a good opportunity for the new people looking to join the working group to help out on, as it requires no real access for testing.

That sounds like something new folks could work on, yes. If no one picks it up I can do it, but I want to work on a commit queue for nodejs/node first.

@mhdawson
Member

I'm +1 for notifications. We used to have more of those, for machines going offline, failed jobs, etc., which were useful, but they have stopped working over time and we haven't had time to fix them.

@jbergstroem
Member

related: #2370

Grafana hooks into many of these types of communication media; we just need to monitor what we expect to work and alert when it doesn't. Once we have GitHub integration I will expand who has access (read: as many people as possible).

@jbergstroem
Member

jbergstroem commented Jul 2, 2020

So, we're currently collecting system information from ci, ci-release, www, gh-bot, metrics (the host that runs these alerts), and, in the foreseeable future, the backup host. We're also collecting InfluxDB statistics (to predict when the metrics host could become an issue) as well as basic data from all web [nginx] servers (ci, ci-release, and www). Finally, we will soon be collecting a bit of Jenkins information (build info, timings, etc.).

I'd like to shift the discussion from the medium (how and where to ping people; this seems relatively straightforward) to when and what we alert about. I have a few ideas, but since I've been AWOL for a while I'd prefer to hear from the community as part of the implementation design.

codebytere pushed a commit to nodejs/node-core-utils that referenced this issue Jul 21, 2020
Some errors we see are caused by underlying infrastructure issues (most
commonly filesystem corruption). Correctly classifying can help when
collecting statistics, when pinging the build team, or even for
automated notification (see
nodejs/build#2359).
@mmarchini
Contributor Author

I still think it would be good to use ncu for that: it already categorizes errors, so we just need to choose which categories to alert on. Some errors, like a full disk, are detectable by Grafana, but what about read-only filesystems (which happen somewhat frequently)?

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label May 18, 2021