feat: add JMX metric for commandRunner status #4019

stevenpyzhang · 2019-12-02T21:42:58Z

Description

#3962 changes the commandRunner to never skip a command if it's gone through the transaction protocol. We need to expose some metric that can be used to alert if the commandRunner thread is stuck on a particular command.

This PR introduces a JMX metric for the commandRunner thread status.

Testing done

Cherry-picked #3962 to this branch for testing.
Put CREATE STREAM qwerqweq(age BIGINT) WITH (KAFKA_TOPIC='foo', VALUE_FORMAT='DELIMITED'); into the command topic

Deleted topic foo
Started server
Watched the metric value in JConsole; after 15 seconds it went from RUNNING to ERROR because the commandRunner was stuck processing the above command.
Created topic foo with zookeeper
The server completed start up and the metric value went back to RUNNING
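
The JConsole check above can also be scripted. The following is a minimal sketch, not the code from this PR: the object name `example:type=command-runner`, the `Status` attribute, and the `StatusHolder` class are placeholders, since the name the PR actually registers is not shown in this conversation. It registers a status bean on the platform MBean server and reads the attribute back the way a monitoring agent would.

```java
import java.lang.management.ManagementFactory;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;
import javax.management.StandardMBean;

class JmxStatusSketch {

  /** Management interface; JMX exposes getStatus() as the attribute "Status". */
  public interface StatusView {
    String getStatus();
  }

  /** Hypothetical holder whose status another thread would update. */
  static final class StatusHolder implements StatusView {
    private volatile String status = "RUNNING";

    @Override
    public String getStatus() {
      return status;
    }

    void setStatus(final String newStatus) {
      status = newStatus;
    }
  }

  /** Registers the holder under the given (placeholder) name and reads "Status" back. */
  static String registerAndRead(final StatusHolder holder, final String name) {
    final MBeanServer server = ManagementFactory.getPlatformMBeanServer();
    try {
      final ObjectName objectName = new ObjectName(name);
      server.registerMBean(new StandardMBean(holder, StatusView.class), objectName);
      try {
        return (String) server.getAttribute(objectName, "Status");
      } finally {
        // Clean up so the sketch can be run repeatedly in one JVM.
        server.unregisterMBean(objectName);
      }
    } catch (final JMException e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(final String[] args) {
    final StatusHolder holder = new StatusHolder();
    holder.setStatus("ERROR");
    System.out.println(registerAndRead(holder, "example:type=command-runner"));
  }
}
```

While the bean is registered, JConsole would list it under the `example` domain; an alerting tool would poll the same attribute.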

Reviewer checklist

Ensure docs are updated if necessary (e.g., if a user-visible feature is being added or changed).
Ensure relevant issues are linked (description should include text like "Fixes #")

rodesai

Thanks, @stevenpyzhang! Feedback inline.

ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java

rodesai · 2019-12-03T04:30:17Z

+    final String metricName = "liveness-indicator";
+    final String description =
+        "A metric indicating the status of the commandRunner. "
+            + "If value 1, the commandRunner is processing commands normally."


If you use a Gauge then you can use a Gauge<String>, and have the metric value be a string value that defines the current status, e.g. "RUNNING" vs "ERROR". With this approach, I'd use an enum to define the possible statuses, and then use the name() method to get the String.

If we define the enum you suggested, what's the advantage of emitting the metric as a string rather than an integer? I'm not sure how well datadog (and other similar tools) plays with string-valued metrics.

It should be ok I think. Wouldn't using strings be similar to how a query's status is tracked by string values?

The advantage is that it makes more sense if you're just dumping the JMX. The datadog agent lets you configure a mapping from a string to a numerical metric. The disadvantage there is that you have to keep the mapping in the dd config in sync with your code (so if you add a new status you have to update your config). I'm leaning toward emitting a numerical metric for that reason. I think we should still implement it as an enum though in our code.
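
The trade-off in this thread can be sketched as follows. This is an illustration under assumptions, not the merged implementation: the `Gauge<T>` interface here is a local stand-in for whatever gauge type the metrics library provides, and the numeric code `0` for `ERROR` is assumed (the description in the diff only says that value 1 means commands are being processed normally). The enum stays the single source of truth, and either a numeric or a string view can be derived from it.

```java
import java.util.function.Supplier;

class StatusGaugeSketch {

  /** Possible statuses; the code is what a numeric gauge would report. */
  enum CommandRunnerStatus {
    RUNNING(1),
    ERROR(0); // assumed value, not confirmed by the conversation

    private final int code;

    CommandRunnerStatus(final int code) {
      this.code = code;
    }

    int code() {
      return code;
    }
  }

  /** Local stand-in for a metrics library's gauge type. */
  interface Gauge<T> {
    T value();
  }

  /** Numeric view: no string-to-number mapping has to be kept in sync in the agent config. */
  static Gauge<Integer> numericGauge(final Supplier<CommandRunnerStatus> status) {
    return () -> status.get().code();
  }

  /** String view: easier to read when dumping JMX by hand. */
  static Gauge<String> stringGauge(final Supplier<CommandRunnerStatus> status) {
    return () -> status.get().name();
  }

  public static void main(final String[] args) {
    final Supplier<CommandRunnerStatus> current = () -> CommandRunnerStatus.ERROR;
    System.out.println(numericGauge(current).value());
    System.out.println(stringGauge(current).value());
  }
}
```

Adding a new status then means adding one enum constant with its code; the gauges need no changes.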

ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/computation/CommandRunner.java

rodesai · 2019-12-05T08:22:13Z

ksql-rest-app/src/main/java/io/confluent/ksql/rest/server/KsqlRestConfig.java

@@ -77,6 +77,13 @@
      "Minimum time between consecutive health check evaluations. Health check queries before "
          + "the interval has elapsed will receive cached responses.";

+  static final String KSQL_COMMAND_RUNNER_HEALTH_CHECK_MS =
+          KSQL_CONFIG_PREFIX + "server.command.runner.healthcheck.ms";


This implies some sort of health-checking interval. I'd name this something like: server.command.blocked.threshold.error

Changed it to server.command.blocked.threshold.error.ms

rodesai

Looking good. Couple more bits of feedback inline.

ksql-rest-app/src/test/java/io/confluent/ksql/rest/server/computation/CommandRunnerTest.java

rodesai · 2019-12-05T08:44:22Z

+    }
+  }
+
+  private void checkCommandRunnerStatus(


I think the spirit of this test is great - if it passes we're confident the code under test is doing what it's supposed to do. The problem is that because it relies on timing, it's prone to spurious test failures. I think we can make a couple of tweaks to make the test deterministic:
- pass a mock clock to command runner so we control the time changes
- instead of using a sleep to simulate delays, have the command runner wait on a condition or countdown latch.

so you get something like this:

    ...
    givenQueuedCommands(queuedCommand1);
    Supplier<Long> clock = mock(Supplier.class);
    CountDownLatch latch = new CountDownLatch(1);
    CommandRunner commandRunner = new CommandRunner(..., clock::get, ...);
    when(clock.get()).thenReturn(0L).thenReturn(500L).thenReturn(1000L).thenReturn(2000L);
    // block the executor until the test releases the latch
    when(statementExecutor.handleStatement()).thenAnswer(i -> { latch.await(); return null; });
    Thread t = new Thread(() -> { commandRunner.fetchAndRunCommands(); });
    t.start();
    assertThat(commandRunner.checkCommandRunnerStatus(), is(RUNNING));
    assertThat(commandRunner.checkCommandRunnerStatus(), is(ERROR));
    latch.countDown();
    t.join();
    assertThat(commandRunner.checkCommandRunnerStatus(), is(RUNNING));
    ...
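
The same idea can be shown without Mockito or the real CommandRunner. The sketch below is a self-contained illustration under assumptions: `Runner`, `simulate`, and the 1000 ms threshold are hypothetical names and values, not taken from the PR. A settable clock replaces wall-clock time, and two latches replace sleeps, so the observed statuses do not depend on thread scheduling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

class BlockedThresholdSketch {

  enum Status { RUNNING, ERROR }

  /** Hypothetical stand-in for the command runner. */
  static final class Runner {
    private static final long IDLE = -1L;
    private final LongSupplier clockMs;
    private final long thresholdMs;
    private final AtomicLong commandStartedMs = new AtomicLong(IDLE);

    Runner(final LongSupplier clockMs, final long thresholdMs) {
      this.clockMs = clockMs;
      this.thresholdMs = thresholdMs;
    }

    /** Records when the command started, runs it, then marks the runner idle. */
    void run(final Runnable command) {
      commandStartedMs.set(clockMs.getAsLong());
      try {
        command.run();
      } finally {
        commandStartedMs.set(IDLE);
      }
    }

    /** ERROR once the in-flight command has been running for at least the threshold. */
    Status status() {
      final long started = commandStartedMs.get();
      if (started == IDLE) {
        return Status.RUNNING;
      }
      return clockMs.getAsLong() - started >= thresholdMs ? Status.ERROR : Status.RUNNING;
    }
  }

  private static void await(final CountDownLatch latch) {
    try {
      latch.await();
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  /** Drives the runner with a fake clock and returns the three statuses observed. */
  static List<Status> simulate() {
    final AtomicLong now = new AtomicLong(0L);
    final CountDownLatch started = new CountDownLatch(1);
    final CountDownLatch release = new CountDownLatch(1);
    final Runner runner = new Runner(now::get, 1000L);
    final Thread worker = new Thread(() -> runner.run(() -> {
      started.countDown(); // signal that the command is in flight
      await(release);      // stay "stuck" until the test releases us
    }));
    final List<Status> seen = new ArrayList<>();
    worker.start();
    await(started);
    now.set(500L);
    seen.add(runner.status()); // below the threshold
    now.set(2000L);
    seen.add(runner.status()); // past the threshold
    release.countDown();
    try {
      worker.join();
    } catch (final InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    seen.add(runner.status()); // command finished, runner is idle again
    return seen;
  }

  public static void main(final String[] args) {
    System.out.println(simulate()); // prints [RUNNING, ERROR, RUNNING]
  }
}
```

The `started` latch matters: without it the first status check could run before the worker has recorded the command's start time.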

rodesai

LGTM!

stevenpyzhang requested a review from a team as a code owner December 2, 2019 21:42

stevenpyzhang force-pushed the command-runner-metric branch from 49e5d43 to 52bbdb1 Compare December 2, 2019 21:56

rodesai reviewed Dec 3, 2019

stevenpyzhang force-pushed the command-runner-metric branch from 52bbdb1 to be00fc8 Compare December 3, 2019 22:54

stevenpyzhang requested review from rodesai and vcrfxia December 4, 2019 19:32

stevenpyzhang changed the title from "feat: add metric for commandRunner status" to "feat: add JMX metric for commandRunner status" Dec 4, 2019

stevenpyzhang force-pushed the command-runner-metric branch from 11e943f to 2c09bc3 Compare December 4, 2019 19:33

feat: add metric for commandRunner status

e5bd35d

stevenpyzhang force-pushed the command-runner-metric branch from 2c09bc3 to e0d55ab Compare December 4, 2019 19:47

separate metric into separate class and added tests

119a5b4

stevenpyzhang force-pushed the command-runner-metric branch from e0d55ab to 119a5b4 Compare December 4, 2019 20:55

rodesai reviewed Dec 5, 2019

add clock

6a27b99

stevenpyzhang force-pushed the command-runner-metric branch from 3835391 to 6a27b99 Compare December 5, 2019 22:28

stevenpyzhang requested a review from rodesai December 5, 2019 22:30

rodesai approved these changes Dec 5, 2019

stevenpyzhang merged commit 55d75f2 into confluentinc:master Dec 5, 2019

stevenpyzhang mentioned this pull request Feb 26, 2020

CommandRunner metric doesn't accurately report if the thread is running or not #4652

Closed
