Suggestion: pager/alerts/auto create issues for infra-related issues #2359

Closed
mmarchini opened this issue Jun 17, 2020 · 11 comments
Comments

@mmarchini
Contributor

mmarchini commented Jun 17, 2020

Sometimes jobs will fail for days on easily fixable problems, like a read-only filesystem, until someone notices. Ideally, when a job fails for an infra-related reason, collaborators ping the WG, but this sometimes doesn't happen, or the WG is overloaded with pings for multiple reasons (not only infra-related issues). On top of that, GitHub's notifications interface doesn't provide an easy way to look at all pings to a specific team, and since most of us are in multiple teams we get a lot of mixed "Team Mention" notifications.

What if we could identify (most) infra issues on Jenkins and let the WG know in a timely fashion? This would help us act on issues sooner, when we are available. This wouldn't imply an SLA for the WG; we're all still volunteers, and if no one is available the issue will remain unfixed until someone is, which is fine.

I'm not sure whether the best approach is to create an issue here, send a notification on IRC or email, or use something like PagerDuty. Regardless of how we get notified, I believe we can accomplish this with a small effort and without introducing too much maintenance burden. Jenkins is already hooked up to github-bot, and ncu-ci has some heuristics to identify infra and Jenkins issues (although today it lumps infra issues in with build issues, which is easily fixable). We could use similar heuristics on github-bot to identify potential infra issues and send alerts when they happen. Or github-bot could forward all failures to a separate service that does that (if we want to decouple but don't want to add a new hook to Jenkins). We could even add thresholds for certain errors (for example, "read-only fs" could trigger on the first occurrence, but flakier issues like a corrupted git directory could require X out of Y failures to trigger).
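
To make the threshold idea more concrete, something like the sketch below is what I have in mind (the categories and numbers are purely illustrative, not values we've agreed on):

```js
// Illustrative alert thresholds: deterministic infra problems alert on the
// first occurrence, flakier ones only after X hits in the last Y failures.
const ALERT_RULES = {
  'read-only-fs': { hits: 1, window: 1 },
  'corrupted-git-directory': { hits: 3, window: 10 },
};

// `recentFailures` is the list of failure types seen in the last runs,
// oldest first.
function shouldAlert(failureType, recentFailures) {
  const rule = ALERT_RULES[failureType];
  if (rule === undefined) return false;
  const window = recentFailures.slice(-rule.window);
  const hits = window.filter((type) => type === failureType).length;
  return hits >= rule.hits;
}
```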

What do y'all think? If folks are on board with this, I can implement a proof of concept.

mmarchini added a commit to mmarchini/node-core-utils that referenced this issue Jun 17, 2020
Some errors we see are caused by underlying infrastructure issues (most
commonly filesystem corruption). Correctly classifying can help when
collecting statistics, when pinging the build team, or even for
automated notification (see
nodejs/build#2359).
@AshCripps
Member

Is there some kind of Jenkins plugin that could do all this for us, rather than having to set up a separate thing?

@mmarchini
Contributor Author

I'm not familiar enough with Jenkins plugins to answer whether there is one, but a requirement would be for the plugin to be able to parse the output of failed jobs to determine whether a failure is a code-related issue, a Jenkins issue, or an issue with the agent.

@mmarchini
Contributor Author

I looked at some plugins, and the closest I could find is Build Failure Analyzer, but it only categorizes failures in the interface; it doesn't send notifications. That could still be a useful feature to have, and it could even improve the user experience for collaborators.

We don't necessarily need to use the github-bot for this; the main point of the issue is to notify the appropriate folks when there's an actionable failure happening on an agent. If I missed a Jenkins plugin that does that, and if using it would have a lower maintenance burden than implementing it ourselves, that works too.

@rvagg
Member

rvagg commented Jun 18, 2020

I'd be fine with something that pinged me via email or some other priority channel, as long as I can control the hours of the day I can be bothered. My GitHub notifications are shunted to a separate email folder that I don't read until after I've started my workday. Sometimes "critical" issues escape my notice until well into my workday (sometimes entirely over a weekend). If I get a direct email or someone pings me on IRC I'm usually on it during the hours my devices are allowed to bother me. Perhaps the easiest thing is just mapping out who's available to handle what kinds of problems during what times of day, and how they're best contacted.

@AshCripps
Member

I had a thought that if we were to create a solution for this ourselves, it might be a good opportunity for the new people looking to join the working group to help out on, as it requires no real access for testing.

@mmarchini
Contributor Author

mmarchini commented Jun 18, 2020

I'd be fine with something that pinged me via email or some other priority channel, as long as I can control the hours of the day I can be bothered.

This should be easy to do if we implement the workflow ourselves. For example, we could have an Action on this repository that responds to webhook calls, and when it receives a failure it creates an issue here and pings folks who are available (availability could be set up as a JSON/YAML/INI config file, easy to change via PR).

edit: to implement it as an Action:

1. create a `repository_dispatch` workflow in this repository;
2. change github-bot so it sends a POST to the appropriate GitHub API endpoint with the job name + job id;
3. check whether ncu-ci knows how to handle that job name (checking if it is a -pr or -commit job should be enough);
4. call `ncu-ci url` and filter for INFRA_FAILURE, JENKINS_FAILURE and GIT_FAILURE (those are the ones we might need to take action on);
5. if one of those failures is present, create an issue via API calls, check the member availability config file, and send notifications (email, IRC mentions, etc.) to the folks who are available.

I probably missed something, but if this works as I'm thinking, it should be fairly straightforward to implement and maintain. A rough sketch of the core step is below.
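
Very roughly, the Action's main step could look something like this (`classifyFailures` and `notify` are placeholders for whatever we end up wiring to ncu-ci and to email/IRC, and the availability format is just an example):

```js
'use strict';

// Failure categories we'd want to act on (same names ncu-ci uses today).
const ALERTABLE = new Set(['INFRA_FAILURE', 'JENKINS_FAILURE', 'GIT_FAILURE']);

// Example availability config; in practice this would be loaded from a
// JSON/YAML file in the repository so it can be changed via PR.
const availability = [
  { handle: 'someone', contact: 'irc', fromUTC: 8, toUTC: 18 },
];

// `octokit` is an authenticated @octokit/rest client; `classifyFailures` and
// `notify` are placeholder helpers, not existing APIs.
async function handleDispatch({ jobName, jobId }, { octokit, classifyFailures, notify }) {
  // Only handle job types ncu-ci knows how to parse (-pr / -commit jobs).
  if (!/-(pr|commit)$/.test(jobName)) return;

  // Ask ncu-ci (or equivalent) for the failure categories of this run.
  const failures = await classifyFailures(jobName, jobId);
  const actionable = failures.filter((f) => ALERTABLE.has(f.type));
  if (actionable.length === 0) return;

  // Open a tracking issue in this repository.
  await octokit.issues.create({
    owner: 'nodejs',
    repo: 'build',
    title: `Infra-related failure in ${jobName} (run ${jobId})`,
    body: actionable.map((f) => `- ${f.type}: ${f.reason}`).join('\n'),
  });

  // Ping only the folks whose config says they're currently available.
  const hour = new Date().getUTCHours();
  const available = availability.filter((m) => hour >= m.fromUTC && hour < m.toUTC);
  await Promise.all(available.map((m) => notify(m, actionable)));
}

module.exports = { handleDispatch };
```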

I had a thought that if we were to create a solution for this ourselves, it might be a good opportunity for the new people looking to join the working group to help out on, as it requires no real access for testing.

That sounds like something new folks could work on, yes. If no one picks it up I can do it, but I want to work on a commit queue for nodejs/node first.

@mhdawson
Member

I'm +1 for notifications. We used to have more of those, for machines going offline, failed jobs, etc., which were useful, but they have stopped working over time and we haven't had time to fix them.

@jbergstroem
Member

related: #2370

Grafana hooks into many of these types of communication media; we just need to monitor what we expect to work and alert when it doesn't. Once we have GitHub integration I will expand who has access (read: as many people as possible).

@jbergstroem
Member

jbergstroem commented Jul 2, 2020

So, we're currently collecting system information from ci, ci-release, www, gh-bot, metrics (the host that runs these alerts), and, in the foreseeable future, the backup host. We're also collecting InfluxDB statistics (to predict when the metrics host could become an issue) as well as basic data from all web [nginx] servers (ci, ci-release, and www). Finally, we will soon be collecting a bit of Jenkins information (build info, timings, etc.).

I'd like to shift the discussion from the medium (how and where to ping people; this seems relatively straightforward) to when and what we alert about. I have a few ideas, but since I've been AWOL for a while I'd prefer to hear from the community as part of the implementation design.

codebytere pushed a commit to nodejs/node-core-utils that referenced this issue Jul 21, 2020
Some errors we see are caused by underlying infrastructure issues (most
commonly filesystem corruption). Correctly classifying can help when
collecting statistics, when pinging the build team, or even for
automated notification (see
nodejs/build#2359).
@mmarchini
Contributor Author

I still think it would be good to use ncu for that: it already categorizes errors, so we just need to choose which categories to alert on. Some errors, like a full disk, are detectable by Grafana, but what about read-only filesystems (which happen somewhat frequently)?

@github-actions

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

@github-actions github-actions bot added the stale label May 18, 2021