Feature Request: change vtbackup_duration_by_phase to binary vtbackup_duration #12972

Closed
maxenglander opened this issue Apr 25, 2023 · 0 comments · Fixed by #12973
Labels
Component: Backup and Restore Type: Enhancement Logical improvement (somewhere between a bug and feature) Type: Feature


maxenglander commented Apr 25, 2023

Feature Description

After using vtbackup_duration_by_phase for a few weeks in production, I can confidently say that it is pretty awkward to use.

I recommend changing this metric to vtbackup_phase, a binary-valued gauge similar to K8s metrics like kube_pod_status_phase. Here's an example of what these metrics could look like:

# HELP vtbackup_phase Active phase.
# TYPE vtbackup_phase gauge
vtbackup_phase{phase="CatchUpReplication"} 0
vtbackup_phase{phase="InitialBackup"} 0
vtbackup_phase{phase="RestoreLastBackup"} 0
vtbackup_phase{phase="TakeNewBackup"} 1
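
As a rough illustration of the one-hot shape shown above, here is a minimal Python sketch that renders such a gauge in the Prometheus text exposition format. The function name `render_phase_gauge` and the `PHASES` list are hypothetical, chosen for this example only; this is not how vtbackup itself emits metrics.

```python
# Phase names taken from the example output above.
PHASES = ["CatchUpReplication", "InitialBackup", "RestoreLastBackup", "TakeNewBackup"]


def render_phase_gauge(active_phase, phases=PHASES):
    """Render a one-hot vtbackup_phase gauge as Prometheus text exposition lines.

    Exactly one phase gets value 1 (the active one); all others get 0.
    Passing None renders every phase as 0.
    """
    if active_phase is not None and active_phase not in phases:
        raise ValueError(f"unknown phase: {active_phase}")
    lines = [
        "# HELP vtbackup_phase Active phase.",
        "# TYPE vtbackup_phase gauge",
    ]
    for phase in phases:
        value = 1 if phase == active_phase else 0
        # Doubled braces produce literal { and } in an f-string.
        lines.append(f'vtbackup_phase{{phase="{phase}"}} {value}')
    return "\n".join(lines) + "\n"


print(render_phase_gauge("TakeNewBackup"), end="")
```

Emitting every phase on every scrape, including the zeros, keeps each series continuously present, so queries do not have to deal with series that appear and disappear.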

At any given moment, only one phase would be active. In order to calculate how long a phase has been active, you could do something like this:

sum_over_time(vtbackup_phase{phase="TakeNewBackup"}[<range>]) * <interval>

Where <range> is the lookback window to sum over (sum_over_time requires a range vector) and <interval> is the number of seconds between data points.
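
To make the arithmetic behind the sum_over_time expression concrete, here is a small Python sketch. The sample values and the 15-second interval are made up for illustration, and `phase_duration_seconds` is a hypothetical helper, not part of Prometheus or vtbackup.

```python
def phase_duration_seconds(samples, interval_seconds):
    """Estimate time spent in a phase from 0/1 gauge samples.

    Each sample with value 1 is treated as one full scrape interval spent
    in the phase, so the result is only accurate to about one interval.
    """
    return sum(samples) * interval_seconds


# Hypothetical scrapes of vtbackup_phase{phase="TakeNewBackup"}, one every 15 s.
samples = [0, 0, 1, 1, 1, 1, 0]
print(phase_duration_seconds(samples, 15))  # 60
```

Because the gauge is sampled rather than timed, a phase shorter than one scrape interval may be missed entirely or counted as a full interval.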

Use Case(s)

Some issues that would be resolved by the proposed change.

  1. vtbackup currently doesn't report that a phase is active. It only reports the phase duration once that phase completes. This means that there's no way to tell what phase vtbackup is currently in, unless you know enough about the internals of the program to infer the current state from other metrics and logs.
  2. If vtbackup exits before completing a phase, it won't report the time it spent in that phase.
  3. After completing the last phase (TakeNewBackup), vtbackup exits pretty much right away. This means that there might only be a few seconds between vtbackup reporting that phase for the first time and vtbackup exiting, which might not be enough time for the metric collector (e.g. Prometheus) to collect that metric. This necessitates using something awkward like --keep-alive-timeout to keep vtbackup alive long enough for the collector to do at least one scrape.

maxenglander added the Type: Feature, Needs Triage, Component: Backup and Restore, and Type: Enhancement labels on Apr 25, 2023
GuptaManan100 removed the Needs Triage label on Apr 27, 2023