Add metrics for max/min/desired shards to queue manager. #5787

Merged: 3 commits into prometheus:master on Sep 9, 2019

Conversation

@cstyan (Member) commented Jul 19, 2019

Add metrics for max shards, min shards, and desired shards to the queue manager. This will help remote write users determine whether they need to raise the max shards in their config.

Signed-off-by: Callum Styan [email protected]

@csmarchbanks cc @tomwilkie @gouthamve

@csmarchbanks (Member) commented:

I don't think this adds much as a Debug log since there is detailed information about the desired shards calculation just a couple lines up. Is it too noisy above the debug level?

@cstyan (Member, Author) commented Jul 23, 2019

@csmarchbanks yeah, I suppose this would make sense at warning level as well, especially since the existing debug logging already takes place prior to the check and assignment of desiredShards = maxShards.

@beorn7 (Member) left a comment:

I think a warning is fine if this requires an operator's attention. In that case, perhaps it would make sense to have a metric for desired and max shards so that it is easy to create an alert for it?

@csmarchbanks (Member) commented:

I like the idea of using metrics rather than a warning log. The issue only needs to be brought to an operator's attention if desired shards is greater than max shards for a significant amount of time, so I think individual warnings could be a bit misleading.

@cstyan (Member, Author) commented Jul 30, 2019

I still think logging is useful to more easily see how many shards the calculation wanted to run. I'll add a metric to count the # of times the calculation has been > maxShards.

@beorn7 (Member) commented Jul 30, 2019

Is it important how often the calculation came up with that result? If you had a gauge "desired shards" (updated after each shard calculation) and a fixed-value metric with the configured "max shards", wouldn't that be sufficient? You could still look at it over time.

(I'm not a remote-write expert. My suggestion might be completely off.)

@cstyan (Member, Author) commented Jul 30, 2019

If we were to alert on a metric for this, it would likely be something like "desired shards is greater than max shards" over some period of time. For example, we only alert if remote write is behind by a minute or two for more than 15 minutes.

Sounds like a gauge would be sufficient. I don't have a preference either way.
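To make that pattern concrete, here is a sketch of such a rule in the prometheus-mixin style used later in this PR. It is only an illustration, not the exact alert that was added: the alert name, the 15m "for" window, and the severity are assumptions, and the metric names follow the prometheus_remote_storage_shards_* gauges discussed in this thread.

    // Sketch only: alert name, 'for' window, and severity are assumptions.
    {
      alert: 'PrometheusRemoteWriteDesiredShards',
      expr: |||
        prometheus_remote_storage_shards_desired{%(prometheusSelector)s}
        >
        prometheus_remote_storage_shards_max{%(prometheusSelector)s}
      ||| % $._config,
      'for': '15m',
      labels: {
        severity: 'warning',
      },
      annotations: {
        summary: 'Remote write desired shards calculation wants to run more than configured max shards.',
      },
    }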

@beorn7 (Member) commented Jul 30, 2019

I would also add a metric for the configured max shards, even if that seems redundant. In some setups, it's hard to get this value from config management, but you want to have it easily available in alerts or even dashboards.
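As an illustration of that point (a hypothetical dashboard query, not something taken from this PR), exporting both gauges lets a panel show how close the shard calculation runs to the configured ceiling:

    # Hypothetical dashboard query: desired shards as a fraction of the
    # configured maximum, per queue. Values at or above 1 suggest that
    # max_shards may need to be raised.
      prometheus_remote_storage_shards_desired
    /
      prometheus_remote_storage_shards_max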

@cstyan changed the title from "Log if desiredShards > maxShards" to "Add metrics for max/min/desired shards to queue manager." on Aug 1, 2019
@cstyan (Member, Author) commented Aug 1, 2019

Added metrics, removed the logging.

[Resolved (outdated) review threads on storage/remote/queue_manager.go and storage/remote/seriesTest/main.go]
@csmarchbanks (Member) left a comment:

One small comment that doesn't need to be changed. These metrics look helpful to me!

[Resolved (outdated) review thread on storage/remote/queue_manager.go]
> on(job, instance) group_right
prometheus_remote_storage_shards_max{%(prometheusSelector)s})
)
== 1
A member commented:

== 1 requires a bool operator, no? I don't think a simple > would work here.

@brian-brazil (Contributor) commented Aug 8, 2019:

Only for scalar/scalar comparisons; > should work here. The RHS, however, also needs a max_over_time.
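For reference, a sketch of the shape that suggests, assuming both gauges carry matching label sets and using an assumed 5m window (the committed expression may differ):

    # bool is only required for scalar/scalar comparisons; between instant
    # vectors a plain > filters the left-hand side, so no == 1 is needed.
      max_over_time(prometheus_remote_storage_shards_desired{%(prometheusSelector)s}[5m])
    >
      max_over_time(prometheus_remote_storage_shards_max{%(prometheusSelector)s}[5m])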

@cstyan (Member, Author) replied:

👍 added

[Truncated commit message: "…already be going off, about desired shards being higher than max shards." Signed-off-by: Callum Styan <[email protected]>]
@cstyan (Member, Author) commented Aug 28, 2019

@beorn7 @codesome can we merge this?

@beorn7 (Member) left a comment:

I think you only need to update the description (copy-pasta from different alert).

      },
      annotations: {
        summary: 'Prometheus remote write desired shards calculation wants to run more than configured max shards.',
        description: 'Prometheus %(prometheusName)s remote write is {{ printf "%%.1f" $value }}s behind for queue {{$labels.queue}}.' % $._config,
A member commented:

This description is from a different alert.

@cstyan (Member, Author) commented Sep 4, 2019

@beorn7 PTAL at the alert. I tested it locally by hardcoding the return value of calculateDesiredShards and setting max_shards to 1 in my config, then checking the annotations on the alerts page. It looks correct to me.

      },
      annotations: {
        summary: 'Prometheus remote write desired shards calculation wants to run more than configured max shards.',
        description: 'Prometheus %(prometheusName)s remote write desired shards calculation wants to run {{ printf $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{%(prometheusSelector)s}` | query | first | value }}.' % $._config,
A member commented:

The query will just pick the first result, not necessarily the one for the Prometheus server affected. I believe it has to be written in the following way:

              description: 'Prometheus %(prometheusName)s remote write desired shards calculation wants to run {{ printf $value }} shards, which is more than the max of {{ printf `prometheus_remote_storage_shards_max{instance="%%s",%(prometheusSelector)s}` $labels.instance | query | first | value }}.' % $._config,

@cstyan (Member, Author) replied:

Hmmm okay, is that not what prometheusSelectors is supposed to be doing? I assumed those would contain the label selectors for the instance that triggered the alert.

A member replied:

prometheusSelector selects all Prometheus servers to monitor.

A member replied:

In this case, it would intersect with the alerts firing, i.e. if only one Prometheus server fires this alert, all is good, but as soon as more than one of the monitored Prometheus servers is affected by this alert, you'll get multiple results in the query.

@cstyan (Member, Author) replied:

👍 got it, thank you!

@beorn7 merged commit 3b3eaf3 into prometheus:master on Sep 9, 2019