Suggestion: pager/alerts/auto create issues for infra-related issues #2359
Comments
Some errors we see are caused by underlying infrastructure issues (most commonly filesystem corruption). Correctly classifying can help when collecting statistics, when pinging the build team, or even for automated notification (see nodejs/build#2359).
Is there some kind of Jenkins plugin that could do all this for us, rather than having to set up a separate thing?
Not familiar enough with Jenkins plugins to answer if there is one, but a requirement would be for the plugin to be able to parse the output of failed jobs to determine whether it is a code-related issue, a Jenkins issue, or an issue with the agent.
Looked at some plugins; the closest I could find is Build Failure Analyzer, but it only categorizes failures in the interface, it doesn't send notifications. That could also be a useful feature to have, and it could even improve the user experience of collaborators. We don't necessarily need to use the
I'd be fine with something that pinged me via email or some other priority channel, as long as I can control the hours of the day I can be bothered. My GitHub notifications are shunted to a separate email folder that I don't read until after I've started my workday. Sometimes "critical" issues escape my notice until well into my workday (sometimes entirely over a weekend). If I get a direct email or someone pings me in IRC, I'm usually on it during the hours my devices are allowed to bother me. Perhaps the easiest thing is just mapping out who's available to handle what kinds of problems during what times of day, and how they can best be contacted.
I had a thought that if we were to create a solution for this ourselves, it might be a good opportunity for the new people looking to join the working group to help out, as it requires no real access for testing.
This should be easy to do if we implement the workflow ourselves. For example, we could have an Action on this repository which responds to webhook calls, and when it receives a failure it creates an issue here + pings folks who are available (which could be set up as a JSON/YAML/INI config file, easy to change via PR). edit: to implement it as an Action: create a
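A minimal sketch of the availability-config idea from that comment (all names, hours, and the config shape are hypothetical, not an actual nodejs/build file): the config maps each volunteer to the UTC hours during which they are willing to be pinged, and a small helper picks whoever is on duty when a failure comes in.

```javascript
// Hypothetical availability config (would live as JSON/YAML in the repo,
// editable via PR). Hours are [start, end) in UTC.
const availability = {
  alice: { hours: [8, 18] },  // pingable 08:00–18:00 UTC
  bob: { hours: [16, 23] },   // pingable 16:00–23:00 UTC
};

// Return the names of everyone whose availability window covers utcHour.
function pickAvailable(config, utcHour) {
  return Object.entries(config)
    .filter(([, { hours: [start, end] }]) => utcHour >= start && utcHour < end)
    .map(([name]) => name);
}
```

The Action would then mention `pickAvailable(availability, new Date().getUTCHours())` in the issue it opens, so only people inside their stated hours get a direct ping.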
That sounds like something new folks could work on, yes. If no one works on it I can too, but I want to work on a commit queue for nodejs/node first.
I'm +1 for notifications. We used to have more alerts like those, for machines going offline, failed jobs, etc., which were useful, but those have stopped working over time and we've not had time to fix them.
related: #2370 Grafana hooks into many of these types of communication mediums; we just need to monitor what we expect to work and alert when it doesn't. Once we have GitHub integration I will expand who has access (read: as many as possible).
So, we're currently collecting system information from I'd like to shift the discussion from medium (how and where to ping people; this seems to be relatively straightforward) to when and what do we alert about? I have a few ideas, but since I've been AWOL for a while I prefer hearing from the community as part of the implementation design.
Some errors we see are caused by underlying infrastructure issues (most commonly filesystem corruption). Correctly classifying can help when collecting statistics, when pinging the build team, or even for automated notification (see nodejs/build#2359).
I still think it would be good to use
This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.
Sometimes jobs will fail on easily fixable problems, like a read-only fs, for days until someone notices. Ideally, when a job fails for an infra-related issue, collaborators will ping the WG, but this sometimes doesn't happen, or the WG is overloaded with pings for multiple reasons (not only for infra-related issues). On top of that, the GitHub notifications interface doesn't provide an easy way to look at all pings to a specific team, and since most of us are in multiple teams we have a lot of mixed "Team Mention" notifications.
What if we could identify (most) infra issues on Jenkins and let the WG know in a timely fashion? This would help us act on issues sooner when we are available. It wouldn't imply an SLA for the WG; we're still all volunteers, and if no one is available the issue will remain unfixed until someone is, which is fine.
I'm not sure if the best approach is to create an issue here, to send a notification on IRC or email, or to use something like PagerDuty. Regardless of how we get notified, I believe we can accomplish this with a small effort and without introducing too much maintenance burden. Jenkins is already hooked to `github-bot`, and ncu-ci has some heuristics to identify infra and Jenkins issues (although today it lumps infra issues together with build issues, which is easily fixable). We could use similar heuristics on `github-bot` to identify potential infra issues and send alerts when they happen. Or `github-bot` could forward all failures to a separate service which would do that (if we want to decouple but don't want to add a new hook to Jenkins). We could even add thresholds for certain errors (for example, "read-only fs" could trigger on the first occurrence, while flakier issues like a corrupted git directory could require X out of Y failures to trigger).

What do y'all think? If folks are on board with this, I can implement a proof of concept.
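The heuristics-plus-thresholds idea above can be sketched in a few lines (the patterns, names, and thresholds here are illustrative assumptions, not ncu-ci's actual rules): `classify()` tags a failed job's console log with an infra-error category, and `shouldAlert()` applies that category's threshold over the recent failure history.

```javascript
// Hypothetical infra-error patterns. threshold = how many recent failures
// of this kind are needed before we ping the WG ("read-only fs" alerts on
// the first hit; flakier errors need repeated occurrences).
const INFRA_PATTERNS = [
  { name: 'read-only-fs', regex: /read-only file system/i, threshold: 1 },
  { name: 'corrupt-git', regex: /fatal: (bad object|index file corrupt)/i, threshold: 3 },
];

// Return the name of the first matching infra category, or null if the
// failure looks code-related rather than infra-related.
function classify(log) {
  const match = INFRA_PATTERNS.find(({ regex }) => regex.test(log));
  return match ? match.name : null;
}

// history: classification names from recent failed jobs (the Y window).
// Alert once the category's occurrence count reaches its threshold.
function shouldAlert(history, name) {
  const { threshold } = INFRA_PATTERNS.find((p) => p.name === name);
  return history.filter((n) => n === name).length >= threshold;
}
```

Whether this logic lives inside `github-bot` or in a separate forwarding service, the pattern table stays small and reviewable, and adding a new error class is a one-line PR.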