Add metrics for max/min/desired shards to queue manager. #5787
Conversation
I don't think this adds much as a Debug log since there is detailed information about the desired shards calculation just a couple lines up. Is it too noisy above the debug level?
@csmarchbanks yeah, I suppose this would make sense as a warning.
I think a warning is fine if this requires the operator's attention. In that case, perhaps it would make sense to have a metric for desired and max shards so that it is easy to create an alert for it?
I like the idea of using metrics rather than a warning log. The issue only needs to be brought to an operator's attention if desired shards is greater than max shards for a significant amount of time, so I think individual warnings could be a bit misleading.
I still think logging is useful to more easily see how many shards the calculation wanted to run. I'll add a metric to count the # of times the calculation has been > maxShards.
Is it important how often the calculation came up with that result? If you had a gauge "desired shards" (updated after each shard calculation) and a fixed-value metric with the configured "max shards", wouldn't that be sufficient? You could still look at it over time. (I'm not a remote-write expert. My suggestion might be completely off.)
If we were to alert on a metric for this it would likely be something like "desired shards is greater than max shards" over some period of time. For example, we only alert if remote write is behind by a minute or two for more than 15 minutes. Sounds like a gauge would be sufficient. I don't have a preference either way.
I would also add a metric for the configured max shards, even if that seems redundant. In some setups, it's hard to get this value from config management, but you want to have it easily available in alerts or even dashboards.
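As a rough sketch of the alert shape being discussed (illustrative only: the alert name, the desired-shards metric name, and the 15m duration are assumptions based on this thread, not the final rule), such a rule could look something like this in the mixin's jsonnet:

{
  alert: 'PrometheusRemoteWriteDesiredShards',
  // Sketch: fire only if the desired-shards gauge has exceeded the configured
  // max-shards gauge for a sustained period, per the discussion above.
  expr: |||
    prometheus_remote_storage_shards_desired{%(prometheusSelector)s}
    >
    prometheus_remote_storage_shards_max{%(prometheusSelector)s}
  ||| % $._config,
  'for': '15m',
},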
Added metrics, removed the logging. |
One small comment that doesn't need to be changed. These metrics look helpful to me!
Signed-off-by: Callum Styan <[email protected]>
> on(job, instance) group_right
prometheus_remote_storage_shards_max{%(prometheusSelector)s})
)
== 1
== 1 requires a bool operator, no? I don't think a simple > would work here.
Only for scalar/scalar comparisons. > should work here. The RHS, however, also needs a max_over_time.
👍 added
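For reference, a sketch of what the comparison could look like with both suggestions applied (the 5m range and the desired-shards metric name are assumptions on my part, not taken from the final diff):

(
  max_over_time(prometheus_remote_storage_shards_desired{%(prometheusSelector)s}[5m])
>
  max_over_time(prometheus_remote_storage_shards_max{%(prometheusSelector)s}[5m])
)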
already be going off, about desired shards being higher than max shards. Signed-off-by: Callum Styan <[email protected]>
I think you only need to update the description (copy-pasta from a different alert).
},
annotations: {
summary: 'Prometheus remote write desired shards calculation wants to run more than configured max shards.',
description: 'Prometheus %(prometheusName)s remote write is {{ printf "%%.1f" $value }}s behind for queue {{$labels.queue}}.' % $._config,
This description is from a different alert.
@beorn7 PTAL at the alert. I tested it locally by hardcoding the return value of the desired shards calculation.
},
annotations: {
summary: 'Prometheus remote write desired shards calculation wants to run more than configured max shards.',
description: 'Prometheus %(prometheusName)s remote write desired shards calculation wants to run {{ printf $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{%(prometheusSelector)s}` | query | first | value }}.' % $._config,
The query will just pick the first result, not necessarily the one for the Prometheus server affected. I believe it has to be written in the following way:
description: 'Prometheus %(prometheusName)s remote write desired shards calculation wants to run {{ printf $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%%s",%(prometheusSelector)s}` $labels.instance | query | first | value }}.' % $._config,
Hmmm okay, is that not what prometheusSelector is supposed to be doing? I assumed those would contain the label selectors for the instance that triggered the alert.
prometheusSelector selects all Prometheus servers to monitor.
In this case, it would intersect with the alerts firing, i.e. if only one Prometheus server fires this alert, all is good, but as soon as more than one of the monitored Prometheus servers is affected by this alert, you'll get multiple results in the query.
👍 got it, thank you!
shards in description. Signed-off-by: Callum Styan <[email protected]>
Add metrics for max shards, min shards, and desired shards to queue manager. This will help remote write users determine whether they need to raise the max shards in their config.
Signed-off-by: Callum Styan [email protected]
@csmarchbanks
cc @tomwilkie @gouthamve