Add datadog tracking to record schema validation errors #13393

Merged
alovew merged 33 commits into master from anne/datadog-tracking-schema-validation
Aug 8, 2022

@alovew alovew commented Jun 1, 2022

Add Datadog tracking for record schema validation errors

@github-actions github-actions bot added area/platform issues related to the platform area/worker Related to worker labels Jun 1, 2022
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from 9188dc8 to c6a8f55 Compare June 1, 2022 21:39
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from df29a52 to 1e0bd00 Compare June 7, 2022 17:04
@alovew alovew requested a review from lmossman June 8, 2022 18:26
Contributor

@lmossman lmossman left a comment

Left some comments, main one about refactoring how the datadog-specific inputs are passed into this class to be more explicit, encapsulated, and less prone to NullPointerExceptions. Happy to discuss more over zoom/slack if you want!

Comment on lines 95 to 110
NUM_RECORD_SCHEMA_VALIDATION_ERRORS(MetricEmittingApps.WORKER,
"record_schema_validation_error",
"number of record schema validation errors");
Contributor

This won't actually be tracking the number of errors for a given stream, right? Because we are just emitting 1 for each stream that had any errors, and we are capping the number of validations we perform for a stream once we reach a certain number?

Maybe this should be called something like SOURCE_STREAMS_WITH_RECORD_SCHEMA_VALIDATION_ERRORS? Open to other suggestions on naming here

Contributor Author

I think I was trying to use NUM because it is a count metric, but maybe NUM_SOURCE_STREAMS_WITH_RECORD_SCHEMA_VALIDATION_ERRORS would be better?
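
To make the naming concern concrete, here is a minimal sketch of the difference between "streams with errors" and "number of errors". It assumes errors are collected in a map keyed by stream name; that shape and every name in it are hypothetical and not taken from the PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Airbyte code: the map shape and all names are assumptions.
final class StreamErrorMetricSketch {

  /** Number of streams with at least one recorded validation error (what the metric reports). */
  static int streamsWithErrors(final Map<String, List<String>> errorsByStream) {
    int streams = 0;
    for (final List<String> errors : errorsByStream.values()) {
      if (!errors.isEmpty()) {
        streams++;
      }
    }
    return streams;
  }

  /** Number of recorded error messages; only a lower bound if validation is capped per stream. */
  static int recordedErrors(final Map<String, List<String>> errorsByStream) {
    int total = 0;
    for (final List<String> errors : errorsByStream.values()) {
      total += errors.size();
    }
    return total;
  }

  public static void main(final String[] args) {
    final Map<String, List<String>> errorsByStream = new LinkedHashMap<>();
    errorsByStream.put("users", List.of("missing field: id", "wrong type: age"));
    errorsByStream.put("orders", List.of("wrong type: total"));
    errorsByStream.put("events", List.of());

    System.out.println("streams with errors: " + streamsWithErrors(errorsByStream));
    System.out.println("recorded errors: " + recordedErrors(errorsByStream));
  }
}
```

Emitting 1 per affected stream corresponds to the first method, which is why a name mentioning streams describes the metric more accurately than one implying an error count.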

Comment on lines 112 to 120
public DefaultReplicationWorker(final String jobId,
                                final int attempt,
                                final AirbyteSource source,
                                final AirbyteMapper mapper,
                                final AirbyteDestination destination,
                                final MessageTracker messageTracker,
                                final RecordSchemaValidator recordSchemaValidator) {
  this(jobId, attempt, source, mapper, destination, messageTracker, recordSchemaValidator, null, null);
}
Contributor

This constructor just means that the non-container-orchestrator deployments will be setting those values to null, right? I think that is okay, since we care more about tracking issues in cloud, which only uses container orchestrators. But I just wanted to confirm my understanding.

@@ -354,8 +377,18 @@ private static Runnable getReplicationRunnable(final AirbyteSource source,
}
LOGGER.info("Total records read: {} ({})", recordsRead, FileUtils.byteCountToDisplaySize(messageTracker.getTotalBytesEmitted()));
if (!validationErrors.isEmpty()) {
DogStatsDMetricSingleton.initialize(MetricEmittingApps.WORKER, new DatadogClientConfiguration(configs));
Contributor

Instead of passing the entire Configs object into this class just to be used here to create the DatadogClientConfiguration object, I think it would be better to just pass that DatadogClientConfiguration object into this class directly. That way we aren't passing in more than we need.

In fact, I think it would be better if we created a new class to contain all of the logic that you added here. Something like a DatadogSchemaValidationCounter class, with a constructor that takes in the DatadogClientConfiguration object and the sourceDockerImage string (and maybe the constructor can split out the repo and version from the image string into separate class variables), and a method called something like track() which contains all of the logic to initialize the DogStatsDMetricSingleton, create the validationErrorMetadata, and call the DogStatsDMetricSingleton.count() method.

Then, I think this class should be wrapped in an Optional<> when passed into this DefaultReplicationWorker class, to make it explicit that it is not always set. It should only be set to something in the ReplicationJobOrchestrator class if the DD_AGENT_HOST and DD_DOGSTATSD_PORT env vars are not empty, so that none of this is attempted for Kube OSS users that have not configured datadog. And the track() method on that class (or whatever you choose to call it) should only be called if the Optional is not empty.

The reason for all of this is that it encapsulates the logic related to datadog tracking into a separate class rather than further complicating the logic in this getReplicationRunnable() method. Also, without an approach like the above I think you would need to add more null checks to the logic you've added here, otherwise I believe this will break when someone hasn't set those DD_* variables, or if someone deploys this on a non-container-orchestrator deployment.

What do you think?
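
A rough sketch of the class shape suggested above, under stated assumptions: `CountSink` is a hypothetical stand-in for the Datadog client (real code would delegate to `DogStatsDMetricSingleton`), the tag names are invented for illustration, and the image-splitting rule is one plausible choice rather than what the PR ended up doing.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the Datadog client; real code would call DogStatsDMetricSingleton. */
interface CountSink {
  void count(String metricName, long amount, String... tags);
}

/** Sketch of the suggested class: owns the image parsing and the metric call. */
final class DatadogSchemaValidationCounter {

  private final CountSink sink;
  private final String sourceRepo;
  private final String sourceVersion;

  DatadogSchemaValidationCounter(final CountSink sink, final String sourceDockerImage) {
    this.sink = sink;
    // Treat the last ':' as the tag separator only if it comes after the last '/',
    // so a registry port such as "localhost:5000/img" is not mistaken for a version.
    final int colon = sourceDockerImage.lastIndexOf(':');
    final boolean hasTag = colon > sourceDockerImage.lastIndexOf('/');
    this.sourceRepo = hasTag ? sourceDockerImage.substring(0, colon) : sourceDockerImage;
    this.sourceVersion = hasTag ? sourceDockerImage.substring(colon + 1) : "unknown";
  }

  /** Emits a count of 1 for a stream that had validation errors. Tag names are assumptions. */
  void track(final String streamName) {
    sink.count("record_schema_validation_error", 1,
        "docker_repo:" + sourceRepo,
        "docker_version:" + sourceVersion,
        "stream:" + streamName);
  }
}

final class SchemaValidationCounterDemo {
  public static void main(final String[] args) {
    // A recording sink makes the emitted metric visible without a Datadog agent.
    final List<String> emitted = new ArrayList<>();
    final CountSink recordingSink =
        (name, amount, tags) -> emitted.add(name + " +" + amount + " " + String.join(",", tags));

    new DatadogSchemaValidationCounter(recordingSink, "airbyte/source-postgres:0.4.26").track("users");
    emitted.forEach(System.out::println);
  }
}
```

Keeping the client behind a small interface is a design choice of this sketch; it lets the tracking logic be exercised in a test without initializing any singleton.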

Contributor Author

yes that makes sense! when you say the class should be wrapped in Optional<> when it's passed in - is that instead of having multiple constructors?

Contributor

Yeah I think just having a single constructor would be cleaner
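
The single-constructor-plus-Optional idea from this thread could look roughly like the sketch below. `ValidationErrorTracker`, `TrackerFactory`, and `WorkerSketch` are hypothetical names; only the `DD_AGENT_HOST` and `DD_DOGSTATSD_PORT` variable names come from the discussion.

```java
import java.util.Map;
import java.util.Optional;

/** Hypothetical tracker; stands in for whatever class ends up owning the Datadog calls. */
final class ValidationErrorTracker {
  final String agentHost;
  final int statsdPort;
  int tracked = 0;

  ValidationErrorTracker(final String agentHost, final int statsdPort) {
    this.agentHost = agentHost;
    this.statsdPort = statsdPort;
  }

  /** A real implementation would emit a metric here; the sketch only counts calls. */
  void track() {
    tracked++;
  }
}

final class TrackerFactory {
  /** Returns a tracker only when both Datadog variables are set and non-blank. */
  static Optional<ValidationErrorTracker> fromEnv(final Map<String, String> env) {
    final String host = env.getOrDefault("DD_AGENT_HOST", "");
    final String port = env.getOrDefault("DD_DOGSTATSD_PORT", "");
    if (host.isBlank() || port.isBlank()) {
      return Optional.empty();
    }
    // Integer.parseInt throws NumberFormatException on a malformed port; a real factory might handle that.
    return Optional.of(new ValidationErrorTracker(host, Integer.parseInt(port.trim())));
  }
}

/** One constructor; the Optional makes "Datadog not configured" explicit instead of passing null. */
final class WorkerSketch {
  private final Optional<ValidationErrorTracker> tracker;

  WorkerSketch(final Optional<ValidationErrorTracker> tracker) {
    this.tracker = tracker;
  }

  void onSyncFinished(final boolean hadValidationErrors) {
    if (hadValidationErrors) {
      tracker.ifPresent(ValidationErrorTracker::track);
    }
  }
}
```

With this shape, a deployment without Datadog passes `Optional.empty()` and the worker needs no null checks around the tracking call.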

alovew commented Jun 9, 2022

@lmossman I updated this (obvs not a priority to look at right now). I made a separate class for datadog metrics, but made it more generic than you were suggesting - I was thinking we might want to track other kinds of metrics in the replication worker and could use that class for other metrics as well. let me know what you think.

@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from 79d4fc7 to 1e08c84 Compare June 27, 2022 19:38
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from f7a5e50 to b1d70a5 Compare June 27, 2022 22:31
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from 51f8fea to bf0a059 Compare June 30, 2022 17:30
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from bf0a059 to 164928f Compare July 5, 2022 16:43
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from 164928f to 0d9e30b Compare July 21, 2022 20:35
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from 0d9e30b to 55b8de7 Compare July 21, 2022 23:18
@alovew alovew force-pushed the anne/datadog-tracking-schema-validation branch from 5abb899 to 915fc4a Compare August 4, 2022 16:16
@@ -55,6 +57,7 @@
@Slf4j
public class ContainerOrchestratorApp {

private static final Logger LOGGER = LoggerFactory.getLogger(ContainerOrchestratorApp.class);
Contributor

You don't need that. There is a logger available in the log variable.

System.setProperty(envVar, envMap.get(envVar));
final String getEnvResult = System.getenv(envVar);
LOGGER.info("getting env " + envVar);
LOGGER.info(getEnvResult);
Contributor

Is this for debugging only? It will need to be removed, since we may have credentials in the env variables.
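
If some visibility into the environment is still wanted while debugging, one option is to log only variable names and whether each is set. The sketch below is an assumption about how that might be done, not code from the PR; `EnvLogSketch` and its method are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of the PR: reports which variables are set without printing values.
final class EnvLogSketch {

  /** Builds a log-safe summary: names sorted alphabetically, values replaced by a set/unset marker. */
  static String describe(final Map<String, String> env) {
    final StringBuilder sb = new StringBuilder();
    for (final Map.Entry<String, String> entry : new TreeMap<>(env).entrySet()) {
      if (sb.length() > 0) {
        sb.append(", ");
      }
      final String value = entry.getValue();
      final boolean isSet = value != null && !value.isEmpty();
      sb.append(entry.getKey()).append('=').append(isSet ? "<set>" : "<unset>");
    }
    return sb.toString();
  }

  public static void main(final String[] args) {
    // Values from the real environment are never written to the output.
    System.out.println(describe(System.getenv()));
  }
}
```

Because the values never reach the log line, a secret stored in an environment variable cannot leak through this path.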


@Slf4j
public class ReplicationJobOrchestrator implements JobOrchestrator<StandardSyncInput> {

private static final Logger LOGGER = LoggerFactory.getLogger(ReplicationJobOrchestrator.class);
Contributor

We also have a logger present in the log var

LOGGER.info("metric client in async orchestrator pod process");
LOGGER.info(metricClient);

envVars.add(new EnvVar(EnvConfigs.METRIC_CLIENT, metricClient, null));
Contributor

Don't you need the Datadog related env vars here?

Contributor Author

I thought so but apparently not since this is working without it

Contributor Author

'working' as in, the metric client is showing up as a properly initialized datadog metric client

alovew commented Aug 5, 2022

Tested this on dev and it's working now, env variables are all correctly being passed to the worker.

Contributor

@jdpgrailsdev jdpgrailsdev left a comment

:shipit:

@alovew alovew dismissed stale reviews from benmoriceau and lmossman August 8, 2022 17:14

dismissing bc comments were addressed

@alovew alovew merged commit 12270cc into master Aug 8, 2022
@alovew alovew deleted the anne/datadog-tracking-schema-validation branch August 8, 2022 17:15