Monitor NFS servers - critical diagnostics to understand issues #2242

consideRatio · 2023-02-22T20:22:16Z

Ideally we would be able to monitor the NFS servers we rely on directly in the Grafana instances, but if we can't do that, we need at least some way to tell whether the NFS servers are overloaded.

As I understand it, we rely on the cloud-provided NFS services GCP Filestore and AWS EFS. If we can't give the Grafana instances access to their data sources and import pre-defined dashboards for this, we should at least learn how to monitor them using the cloud console.

Cloud services

GCP's Filestore service has notes on monitoring
AWS EFS service has notes on monitoring
Azure Files has notes on monitoring

Action points

Explore the options to monitor NFS services performance and come up with refined action points

An idea to pay more to get a more performant EFS service: Move uwhackweeks to a faster EFS server #1236
An idea to suggest using /tmp for anything temp as that could help reduce load on the NFS server: Move uwhackweeks to a faster EFS server #1236 (comment)
An idea to provide a temp folder directly in the home directory to nudge users towards this: basehub: add custom.2i2c.temp_folder config for ~/temp ephemeral storage #2062

pnasrat · 2023-02-22T21:13:59Z

I believe @yuvipanda already has some graphs that could be added

abkfenris · 2023-02-22T21:20:21Z

I just encountered this kind of issue on EFS, and it took a lot of digging to understand what is going on.

EFS has 3 different throughput modes. Bursting is the default and AWS does some sneaky stuff to make sure it's initially fast, but if you don't put enough data on it right away you can hit a wall and have really variable and hard to diagnose performance.

The key metrics for EFS to look at are Burst Credit Balance, Permitted Throughput, and Throughput Utilization.
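
For anyone who wants to pull these numbers outside the console, here is a rough sketch of how the three metrics could be queried and combined. It assumes the CloudWatch metric names `BurstCreditBalance`, `PermittedThroughput` and `MeteredIOBytes` in the `AWS/EFS` namespace, and it assumes that utilization is metered bytes per second divided by permitted throughput. The function names and the file system ID are made up for illustration. The query list is shaped for boto3's `get_metric_data`, but nothing here actually calls AWS.

```python
# Hypothetical helpers, not part of any existing tooling in this repo.

def efs_metric_queries(file_system_id, period_seconds=60):
    """Build a MetricDataQueries list for the EFS metrics discussed above."""
    # Statistic per metric is a judgment call: the minimum credit balance shows
    # the worst case, and the sum of metered bytes is needed for a rate.
    stats = {
        "BurstCreditBalance": "Minimum",
        "PermittedThroughput": "Average",
        "MeteredIOBytes": "Sum",
    }
    return [
        {
            "Id": f"m{index}",  # CloudWatch query IDs must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EFS",
                    "MetricName": name,
                    "Dimensions": [
                        {"Name": "FileSystemId", "Value": file_system_id}
                    ],
                },
                "Period": period_seconds,
                "Stat": stat,
            },
            "ReturnData": True,
        }
        for index, (name, stat) in enumerate(stats.items())
    ]


def throughput_utilization_percent(
    metered_io_bytes, period_seconds, permitted_bytes_per_second
):
    """Return the share of permitted throughput used over one period, in percent."""
    if period_seconds <= 0 or permitted_bytes_per_second <= 0:
        raise ValueError("period and permitted throughput must be positive")
    used_bytes_per_second = metered_io_bytes / period_seconds
    return 100.0 * used_bytes_per_second / permitted_bytes_per_second


if __name__ == "__main__":
    # "fs-12345678" is a placeholder ID, not a real file system.
    for query in efs_metric_queries("fs-12345678"):
        print(query["Id"], query["MetricStat"]["Metric"]["MetricName"])
    # 600 bytes metered over 60 s against a permitted 100 bytes/s.
    print(throughput_utilization_percent(600, 60, 100))
```

With credentials configured, the list could be passed as `MetricDataQueries` to a boto3 CloudWatch client's `get_metric_data` call together with `StartTime` and `EndTime`. A utilization that stays near 100% while the burst credit balance keeps dropping would match the "hit a wall" behavior described above.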

If that's what you are encountering, I'd be happy to pull together some of the resources that I found while trying to diagnose it.

consideRatio · 2024-09-20T16:17:00Z

@abkfenris thanks for sharing that - sorry for super-late followup!

I verified that I could see such metrics via AWS CloudWatch right away - nice!!

Closing this issue as it's outdated and stale.

This was referenced Feb 22, 2023

Monitor node performance - network read/write speeds, ephemeral storage read/write speeds and capacity #2243 (Open)

Overview of grafana and prometheus related issues #2214 (Open)

consideRatio mentioned this issue Feb 24, 2023

Move uwhackweeks to a faster EFS server #1236 (Closed)

consideRatio added the tech:grafana label Sep 9, 2023

consideRatio closed this as completed Sep 20, 2024
