grafana/monitoring: request for github client credentials #2370

jbergstroem · 2020-06-30T12:43:12Z

As part of improving monitoring, I've set up a grafana, influxdb and telegraf instance at Joyent, available at https://grafana.nodejs.org. I would like to set up a github oauth2 client to handle ACL, similar to how we do authorization with jenkins.

cc @mmarchini

jbergstroem · 2020-06-30T15:44:35Z

To elaborate, we will collect vm basics (cpu, ram, disk, net, ..) per host and use grafana both to make dashboards available (to build and likely a larger crowd) and to set up alerts that notify people when things are not working as intended.
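As a rough illustration of what "vm basics per host" can look like on the wire, here is a minimal sketch that reads disk usage with the Python standard library and formats it as an InfluxDB line-protocol record. It is not the telegraf implementation; the measurement name, tag names, and field names (`disk`, `host`, `path`, `used_percent`) are assumptions chosen for the example.

```python
import shutil
import socket
import time


def escape_tag(value: str) -> str:
    """Escape the characters that are special in line-protocol tag values."""
    return value.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def format_field(value) -> str:
    """Render a field value: booleans as true/false, ints with an 'i' suffix, the rest as floats."""
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    return repr(float(value))


def to_line(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one record: measurement,tag=value field=value timestamp."""
    tag_part = ",".join(f"{k}={escape_tag(str(v))}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={format_field(v)}" for k, v in fields.items())
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


def collect_disk(path: str = "/", host: str = None, now_ns: int = None) -> str:
    """Sample disk usage for one mount point (names here are illustrative only)."""
    usage = shutil.disk_usage(path)
    percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    fields = {
        "total": usage.total,
        "used": usage.used,
        "free": usage.free,
        "used_percent": percent,
    }
    tags = {"host": host or socket.gethostname(), "path": path}
    return to_line("disk", tags, fields, now_ns if now_ns is not None else time.time_ns())


if __name__ == "__main__":
    print(collect_disk("/"))
```

Tagging each record with the host is what lets a dashboard or alert be filtered per machine later on.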

AshCripps · 2020-06-30T15:49:58Z

Will this be targeting a subset of machines or do you plan to roll it out to all machines?

jbergstroem · 2020-06-30T15:53:18Z

> Will this be targeting a subset of machines or do you plan to roll it out to all machines?

Telegraf supports all kinds of architectures, so this basically boils down to incorporating the ansible role as well as getting it deployed across the fleet. Right now I did this as a "make it work"-work, but will add the automation bits at some point.

For now, I will focus on critical machines: www, ci, ci-release, backup, unencrypted and gh-bot (I'm almost done)

As you can imagine, we can do much, much more with this setup than monitoring, such as graphing jenkins build times over time or what have you. I think allowing interested users to create their own dashboards (and share them with the broader community) would be a great goal.

MylesBorins · 2020-06-30T16:36:02Z

If you are looking to get a +1 for using a GitHub app, you should open an issue on http://github.com/nodejs/admin to ask permission, with a link to the app you plan to install (assuming I understood the request correctly).

mmarchini · 2020-06-30T17:38:45Z

I suggested Johan open an issue here first to share more detailed context (and to let folks know this is being worked on), and then reference it on nodejs/admin.

FWIW I'm +1 on this effort; it's something I wanted to implement a while back but never got the time to do.

mhdawson · 2020-07-02T21:49:57Z

Thanks for the heads up. Once we can log in I'd be interested in getting access.

AshCripps · 2020-07-09T14:30:58Z

> Telegraf supports all kinds of architectures, so this basically boils down to incorporating the ansible role as well as getting it deployed across the fleet. Right now I did this as a "make it work"-work, but will add the automation bits at some point.

@jbergstroem I'd be happy to help with this. I also think it would be good to get the monitoring host itself into ansible so the machine can be recreated easily in the event of disaster.

mmarchini · 2020-07-21T01:18:22Z

Not sure if there's anything that can be configured on Grafana (don't know if GitHub allows this level of granularity), but the OAuth requests read-only permission to all orgs it can, not only to nodejs.
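For context on why the consent screen is not limited to one org, here is a minimal sketch of how a GitHub OAuth authorization URL is assembled. The scopes `user:email` and `read:org`, the client id, and the callback path are assumptions for illustration, not values taken from this deployment. Note that the request carries a list of permission types and has no parameter naming a specific organization.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

GITHUB_AUTHORIZE_URL = "https://github.com/login/oauth/authorize"


def build_authorize_url(client_id: str, redirect_uri: str, scopes: list, state: str) -> str:
    """Build the URL a user is redirected to when logging in via GitHub OAuth."""
    query = urlencode(
        {
            "client_id": client_id,
            "redirect_uri": redirect_uri,
            # Scopes are sent as one space-delimited string.
            "scope": " ".join(scopes),
            # Opaque value echoed back by GitHub, used to guard against CSRF.
            "state": state,
        }
    )
    return f"{GITHUB_AUTHORIZE_URL}?{query}"


if __name__ == "__main__":
    url = build_authorize_url(
        client_id="example-client-id",  # hypothetical
        redirect_uri="https://grafana.nodejs.org/login/github",  # assumed callback path
        scopes=["user:email", "read:org"],  # assumed scopes
        state="random-state-token",  # hypothetical
    )
    print(url)
    print(parse_qs(urlsplit(url).query)["scope"])
```

Under these assumptions, any narrowing to the nodejs org would have to happen after login, when the application checks which orgs or teams the user belongs to.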

jbergstroem · 2020-07-22T03:29:44Z

Hey all - just an update: it works, but for it to scale we really need the enterprise plugin to "sync" teams similarly to the jenkins plugin. I got help reaching out to the grafana team and they will help us out! I had a few days off but will be back in action from tomorrow and will finish setting this up. After it has been done I would really appreciate all the help we can get:

- Help getting telegraf installed on as many hosts as possible
- Create dashboards to provide visibility over service quality
- Add more ways to measure the quality of service (for instance, pulling data from Jenkins)
- Create alerts and make sure the proper people/teams get them

github-actions · 2021-05-19T00:57:22Z

This issue is stale because it has been open many days with no activity. It will be closed soon unless the stale label is removed or a comment is made.

AshCripps · 2021-05-19T08:18:12Z

Adding the never stale label as this will still be useful to have - especially the alerting for things like the rootfs filling up, as it did last night - #2592 (comment)

AshCripps · 2021-05-20T14:40:20Z

I wouldn't mind having a go at setting up the alerting if someone from @nodejs/build-infra wouldn't mind sharing the admin password to the grafana with me.

richardlau · 2021-05-20T14:51:11Z

AFAIK the grafana admin password wasn't added to secrets. cc @jbergstroem

jbergstroem · 2021-05-20T16:22:51Z

> AFAIK the grafana admin password wasn't added to secrets. cc @jbergstroem

Will add -- done!

jbergstroem · 2021-05-20T16:26:26Z

> I wouldn't mind having a go at setting up the alerting if someone from @nodejs/build-infra wouldn't mind sharing the admin password to the grafana with me.

The ACL for our grafana is inherited via the github groups btw.

AshCripps · 2021-05-20T16:39:03Z

@jbergstroem oh so does that mean we should have admin rights already? or is that for infra members only

jbergstroem · 2021-05-20T17:24:50Z

> @jbergstroem oh so does that mean we should have admin rights already? or is that for infra members only

I don't think admin necessarily, but you should be able to create/edit dashboards which also implies alerting. Let me know if that's not the case.

AshCripps · 2021-05-20T18:59:17Z

I seem to only have a view role - it doesn't let me edit dashboards at all, and dashboards is the only thing in my side menu

jbergstroem · 2021-05-20T20:51:56Z

> I seem to only have a view role - it doesn't let me edit dashboards at all, and dashboards is the only thing in my side menu

I can't quite figure out how the inheritance from the group works; I logged in as admin and gave your user admin rights. As we scale we can also assign "editor" roles.
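To make the intended behaviour concrete, here is a minimal sketch of mapping GitHub team membership to a Grafana role (Viewer, Editor, Admin). It is not how Grafana resolves roles internally, and the team-to-role table is a hypothetical example rather than the configuration of this instance.

```python
# Grafana's organization roles, ordered from least to most privileged.
ROLE_ORDER = ["Viewer", "Editor", "Admin"]

# Hypothetical mapping; the real teams and roles would be decided by the WG.
TEAM_ROLES = {
    "nodejs/build-infra": "Admin",
    "nodejs/build": "Editor",
}


def resolve_role(teams, team_roles=TEAM_ROLES, default="Viewer"):
    """Return the most privileged role granted by any of the user's teams."""
    best = default
    for team in teams:
        role = team_roles.get(team)
        if role is not None and ROLE_ORDER.index(role) > ROLE_ORDER.index(best):
            best = role
    return best


if __name__ == "__main__":
    print(resolve_role(["nodejs/build"]))
    print(resolve_role(["nodejs/collaborators"]))
```

Falling back to Viewer for unmatched teams would produce exactly the symptom described above, where a logged-in user can see dashboards but not edit them.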

AshCripps · 2021-05-20T22:00:04Z

Great, that works for me now, thanks!

AshCripps · 2021-05-21T16:47:41Z

Got a basic alert set up; it should trigger when a machine's disk hits 95% full (let me know if that's too high). Currently it will post a message to #nodejs-build-infra-alerts in the openjs slack (thanks to Brian for helping me set up the integration).

I made a new chart below the current disk usage chart to show usage as a percentage, and the alert is based on that.

I did this to stop it constantly triggering, because the other graph also plots the total, which would cause the alert to trigger.
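The reasoning behind alerting on a percentage can be sketched as follows. This is a minimal illustration, not the Grafana alert rule itself; the function names are hypothetical and the 95% threshold is the one mentioned above.

```python
import shutil

THRESHOLD_PERCENT = 95.0


def used_percent(used: int, total: int) -> float:
    """Convert raw byte counts into a percentage that is comparable across hosts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def should_alert(used: int, total: int, threshold: float = THRESHOLD_PERCENT) -> bool:
    """Fire only when the used share of the disk reaches the threshold."""
    return used_percent(used, total) >= threshold


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    print(f"{used_percent(usage.used, usage.total):.1f}% used")
    print("alert" if should_alert(usage.used, usage.total) else "ok")
```

A single percentage series avoids the problem with the byte-based graph, where the total series sits above any byte threshold that the used series is meant to cross.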

jbergstroem mentioned this issue Jun 30, 2020

Request to create new oauth app for grafana authentication nodejs/admin#522

Closed

jbergstroem mentioned this issue Jul 1, 2020

Suggestion: pager/alerts/auto create issues for infra-related issues #2359

Closed

github-actions bot added the stale label May 19, 2021

AshCripps added never stale and removed stale labels May 19, 2021

AshCripps mentioned this issue Jun 15, 2021

Request for oauth credential for AWX instance nodejs/admin#620

Closed
