Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ui: update changefeed metrics page #101790

Merged
merged 1 commit into from
Jun 5, 2023

Conversation

samiskin
Copy link
Contributor

@samiskin samiskin commented Apr 18, 2023

Resolves #98085
Resolves #98088
Resolves #99409
Resolves #100640
Resolves #97931
Resolves #98086

This PR does the following changes:

  • Added a scale parameter to Metrics so that I could support a duration metric that's being emitted in Seconds rather than Nanoseconds. Would like frontend feedback on whether this is ok.
  • Added support for minutes and hours in Duration graphs
  • There is now a "Changefeed Status" graph to show counts of Running/Paused/Failed
  • There is now a "Commit Latency" graph to show P50,P90, and P99 commit latencies
  • Sink Byte Traffic is now Emitted Bytes
  • Sink Timings has been removed because I don't believe either of the metrics exist anymore
  • Max Changefeed Latency is now Max Checkpoint Latency
  • There is now a Backfill Pending Ranges graph
  • There is now a Protected Timestamp Max Age graph
  • There is now a Schema Registry Registrations graph

Screenshot 2023-04-18 at 5 02 33 PM

Screenshot 2023-04-19 at 12 48 19 PM

Screenshot 2023-04-18 at 4 19 24 PM

Screenshot 2023-04-18 at 4 19 42 PM

Release note (ui change): The metrics page for changefeeds has been updated with new graphs to track backfill progress, protected timestamps age, and number of schema registry registrations.

@samiskin samiskin requested a review from a team April 18, 2023 20:21
@samiskin samiskin requested a review from a team as a code owner April 18, 2023 20:21
@samiskin samiskin requested review from HonoreDB and removed request for a team April 18, 2023 20:21
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@shermanCRL
Copy link
Contributor

So great! Should we take the opportunity to add some explanatory text next to each of the graphs?

@miretskiy
Copy link
Contributor

do we want to include number of currently running/paused changefeeds?

@jayshrivastava
Copy link
Contributor

We can close this one too! #97931

@samiskin
Copy link
Contributor Author

samiskin commented Apr 18, 2023

@shermanCRL I have tooltip text for "Max Checkpoint Latency", "Protected Timestamp Max Age", and "Backfill Pending Ranges", and I think the others are pretty self explanatory.

@miretskiy Added at the very top
Screenshot 2023-04-18 at 5 02 33 PM

@miretskiy
Copy link
Contributor

How difficult is it to change the scale on those graphs? Max protected age -- would be nice not to divide by 1e9;
I don't think we need more than a minute granularity.

@samiskin
Copy link
Contributor Author

Age gets shown as Seconds right now, I looked into adding minutes/hours and it's doable just needs new frontend changes since they currently only support up to seconds. I'll make it show "1h 2m 3s" and see what they think about it.

@shermanCRL
Copy link
Contributor

@samiskin Cool though I think a bit of text saying why you care about this, and how to interpret it -- does it represent a problem -- may help. Maybe a link to relevant docs? I’m imagining that it might prevent an escalation or get a user to a solution faster. @amruss what do you think?

@amruss
Copy link
Contributor

amruss commented Apr 18, 2023

I think that is a good idea, there is a public facing runbook that details why we care about certain changefeed metrics we can either copy from or link to: https://github.com/cockroachlabs/cockroachdb-runbook-template/blob/main/monitoring-alerts/monitoring-dashboard-custom.md#essential-monitoring-metrics-for-changefeeds

@samiskin
Copy link
Contributor Author

Added P50/P90/P99 histogram for Commit Latency as well as the 2nd graph right below "Changefeed Status"

@samiskin samiskin force-pushed the changefeed-metric-updates branch 5 times, most recently from fabe1e3 to 661ced3 Compare April 20, 2023 18:10
@samiskin
Copy link
Contributor Author

Added support for minutes and hours
Screenshot 2023-04-20 at 1 57 56 PM

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! The latest screenshot example looks great. my comments are just around some docs/tests.

Reviewed 6 of 7 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @HonoreDB and @samiskin)


pkg/ui/workspaces/cluster-ui/src/graphs/utils/domain.ts line 224 at r1 (raw file):

}

function ComputeDurationAxisDomain(extent: Extent): AxisDomain {

can you write a few tests for this?


pkg/ui/workspaces/db-console/src/views/shared/components/metricQuery/index.tsx line 92 at r1 (raw file):

  title?: string;
  rate?: boolean;
  scale?: number;

I know the existing code doesn't do this but would you mind adding documentation for this parameter? Seems like others might be motivated to use it.

Copy link
Contributor Author

@samiskin samiskin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @HonoreDB)


pkg/ui/workspaces/cluster-ui/src/graphs/utils/domain.ts line 224 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

can you write a few tests for this?

Added tests in a new pkg/ui/workspaces/cluster-ui/src/graphs/utils/domain.spec.ts file 👍


pkg/ui/workspaces/db-console/src/views/shared/components/metricQuery/index.tsx line 92 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

I know the existing code doesn't do this but would you mind adding documentation for this parameter? Seems like others might be motivated to use it.

Added comment 👍

@samiskin samiskin force-pushed the changefeed-metric-updates branch 2 times, most recently from 3197d11 to f018344 Compare April 25, 2023 16:38
Copy link
Contributor

@jayshrivastava jayshrivastava left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian, @HonoreDB, and @samiskin)

a discussion (no related file):
This is really great work :)



pkg/ccl/changefeedccl/metrics.go line 386 at r3 (raw file):

	metaChangefeedMaxBehindNanos = metric.Metadata{
		Name:        "changefeed.max_behind_nanos",
		Help:        "Time since the changefeed last checkpointed its progress",

The comment implies that it's the time since the most recent checkpoint. It shows the time since the oldest checkpoint. It's also calculated across all changefeeds.

edit: I just saw the tooltip text. We can probably use the first sentence of that here.


pkg/ui/workspaces/db-console/src/views/cluster/util/graphs.ts line 62 at r3 (raw file):

      const scaledValues = result.datapoints.map(v => ({
        ...v,
        value: v.value && v.value * (s.props.scale ?? 1),

The && is confusing. Is that right?

@shermanCRL
Copy link
Contributor

pkg/ui/workspaces/db-console/src/views/cluster/util/graphs.ts line 62 at r3 (raw file):

Previously, jayshrivastava (Jayant) wrote…

The && is confusing. Is that right?

It’s kinda idiomatic truthy javascript. Maybe some parens would make it clearer. Or make it more clearly booleans instead of truthy if it’s confusing.

@shermanCRL
Copy link
Contributor

Remind me @samiskin, is Commit Latency the commit-to-emit latency? Asking because, now that we have reduced latency, where is the best place to observe it?

@samiskin
Copy link
Contributor Author

@shermanCRL Yep, its the time that we confirmed emitting of the message minus the MVCC timestamp. We should be able to view the reduction with this metric, most directly by running without batching (which would add its own latency)

@samiskin
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented May 23, 2023

🕐 Waiting for PR status (GitHub check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@shermanCRL
Copy link
Contributor

Looks like this needs follow-up?

This PR does the following changes:
- Added a `scale` parameter to `Metrics` so that I could support a duration
  metric that's being emitted in Seconds rather than Nanoseconds.
- Added support for minutes and hours in Duration graphs
- There is now a "Changefeed Status" graph to show counts of
  Running/Paused/Failed
- There is now a "Commit Latency" graph to show P50,P90, and P99 commit
  latencies
- Sink Byte Traffic is now Emitted Bytes
- Sink Timings has been removed because I don't believe either of the metrics
  exist anymore
- Max Changefeed Latency is now Max Checkpoint Latency
- There is now a Backfill Pending Ranges graph
- There is now a Oldest Protected Timestamp graph
- There is now a Schema Registry Registrations graph

Release note (ui change): The metrics page for changefeeds has been updated
with new graphs to track backfill progress, protected timestamps age, and
number of schema registry registrations.
@samiskin
Copy link
Contributor Author

samiskin commented Jun 5, 2023

bors r+

@craig
Copy link
Contributor

craig bot commented Jun 5, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants