[Bug] Metrics stopped working after the upgrade from 1.2.0 to 1.3.0 #340

Closed
petrkoutnycz opened this issue Sep 3, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@petrkoutnycz
Contributor

petrkoutnycz commented Sep 3, 2024

We configure the Temporal SDK to send metrics to New Relic via its OpenTelemetry endpoint. This stopped working after the upgrade to 1.3.0.

To investigate the issue, I enabled log forwarding from the client with the following configuration:

Logging = new(new(TelemetryFilterOptions.Level.Debug, TelemetryFilterOptions.Level.Debug))
{
    Forwarding = new LogForwardingOptions(
        serviceProvider.GetRequiredService<ILogger<TemporalRuntime>>())
}
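
For context, here is a minimal sketch of where a fragment like this could sit when the runtime is built by hand. It assumes the Temporalio and Microsoft.Extensions.Logging.Console packages. The console logger factory is an assumption that stands in for the DI-provided logger used above, so adapt it to your own setup.

```csharp
using Microsoft.Extensions.Logging;
using Temporalio.Runtime;

// Assumption: a console logger factory replaces the serviceProvider lookup
// from the snippet above.
using var loggerFactory = LoggerFactory.Create(builder =>
    builder.AddConsole().SetMinimumLevel(LogLevel.Debug));

var runtime = new TemporalRuntime(new TemporalRuntimeOptions
{
    Telemetry = new TelemetryOptions
    {
        // Debug for both filter levels, as in the snippet above, with the
        // resulting log records forwarded to an ILogger.
        Logging = new LoggingOptions(
            new TelemetryFilterOptions(
                TelemetryFilterOptions.Level.Debug,
                TelemetryFilterOptions.Level.Debug))
        {
            Forwarding = new LogForwardingOptions(
                loggerFactory.CreateLogger<TemporalRuntime>()),
        },
    },
});
```

The runtime would then typically be handed to the client through the `Runtime` property of the connect options.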

With version 1.2.0, the client produces the following output:

[10:12:57] [DBUG] [TemporalRuntime] [sdk_core::hyper::client::connect::dns] resolving host="otlp.eu01.nr-data.net" 
[10:12:57] [DBUG] [TemporalRuntime] [sdk_core::hyper::client::connect::http] connecting to 185.221.85.50:443 
[10:12:58] [DBUG] [TemporalRuntime] [sdk_core::hyper::client::connect::http] connected to 185.221.85.50:443 
[10:12:58] [DBUG] [TemporalRuntime] [sdk_core::h2::client] binding client connection 
[10:12:58] [DBUG] [TemporalRuntime] [sdk_core::h2::client] client connection bound

With version 1.3.0, we have this output:

[10:10:45] [DBUG] [TemporalRuntime] [sdk_core::hyper_util::client::legacy::connect::dns] resolving host="otlp.eu01.nr-data.net"
[10:10:46] [DBUG] [TemporalRuntime] [sdk_core::hyper_util::client::legacy::connect::http] connecting to 185.221.85.50:443 
[10:10:46] [DBUG] [TemporalRuntime] [sdk_core::hyper_util::client::legacy::connect::http] connected to 185.221.85.50:443 
[10:10:46] [DBUG] [TemporalRuntime] [sdk_core::tonic::transport::channel::service::reconnect] reconnect::poll_ready: ConnectError(HttpsUriWithoutTlsSupport(())) 
[10:10:46] [DBUG] [TemporalRuntime] [sdk_core::tower::buffer::worker] processing request service.ready=true
[10:10:46] [DBUG] [TemporalRuntime] [sdk_core::tonic::transport::channel::service::reconnect] error: Connecting to HTTPS without TLS enabled 
OpenTelemetry metrics error occurred. Metrics exporter otlp failed with the grpc server returns error (The service is currently unavailable): , detailed error message: Connecting to HTTPS without TLS enabled

Not to mention that the error is reported at DEBUG level, so we cannot catch it, because we never enable DEBUG in production.

Minimal Reproduction

We configure metrics as follows:

Metrics = new MetricsOptions
{
    AttachServiceName = false,
    MetricPrefix = "temporal.",
    GlobalTags =
    [
        new KeyValuePair<string, string>("service.name", "mews-app-name")
    ],
    OpenTelemetry = new OpenTelemetryOptions
    {
        Url = new Uri("https://otlp.eu01.nr-data.net/v1/metrics"),
        Headers = [new KeyValuePair<string, string>("api-key", "<THE API KEY>")],
        MetricTemporality = OpenTelemetryMetricTemporality.Delta,
        MetricsExportInterval = TimeSpan.FromSeconds(5)
    },
},

I also tried different URLs. According to the New Relic docs, the URL above should be correct. I also tried plain https://otlp.eu01.nr-data.net:4317 (i.e. with the gRPC port and without the v1/metrics suffix), also without success.

Environment/Versions

  • OS and processor: M1 Mac, Windows

petrkoutnycz added the bug label Sep 3, 2024
petrkoutnycz changed the title from "[Bug] Metrics stopped to work after the upgrade from 1.2.0 to 1.3.0" to "[Bug] Metrics stopped working after the upgrade from 1.2.0 to 1.3.0" Sep 3, 2024

@cretz
Member

cretz commented Sep 3, 2024

Yes, we inadvertently broke OpenTelemetry over TLS with the latest release (this happened in Python too). It is caused by a Rust issue with a dependency upgrade (see open-telemetry/opentelemetry-rust#2008). This was fixed in our Core layer at temporalio/sdk-core#801, and we will be issuing a patch release to fix it here.

@petrkoutnycz
Contributor Author

Thanks 👍

@petrkoutnycz
Contributor Author

And as for reporting the error as DEBUG...? I'd guess this problem was present before as well :-)

@cretz
Member

cretz commented Sep 3, 2024

It's actually not DEBUG; it's just a standalone stderr line. Notice that the last line of your log has no log prefix (not to be confused with the Tonic reconnect trace above it, which is not an error). This happens at https://github.com/open-telemetry/opentelemetry-rust/blob/976bc54dba564afdc8fd7de51a8feb123f44ecdb/opentelemetry/src/global/error_handler.rs#L62, and we will look into whether we can configure a global handler, though that is difficult because logs are configured per runtime.
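
To make the distinction concrete, here is a small sketch that separates forwarded log lines from bare stderr lines in captured output. The prefix pattern is an assumption taken from the log excerpts in this issue (`[hh:mm:ss] [LEVEL] [TemporalRuntime]`) and depends entirely on how your own logger formats its output.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed prefix shape, copied from the excerpts in this issue:
// "[10:10:46] [DBUG] [TemporalRuntime] ..."
var forwardedPrefix = new Regex(
    @"^\[\d{2}:\d{2}:\d{2}\] \[[A-Z]+\] \[TemporalRuntime\] ");

string[] lines =
{
    "[10:10:46] [DBUG] [TemporalRuntime] [sdk_core::tonic::transport::channel::service::reconnect] error: Connecting to HTTPS without TLS enabled",
    "OpenTelemetry metrics error occurred. Metrics exporter otlp failed with the grpc server returns error (The service is currently unavailable): , detailed error message: Connecting to HTTPS without TLS enabled",
};

foreach (var line in lines)
{
    // Lines without the logger prefix did not go through log forwarding.
    var kind = forwardedPrefix.IsMatch(line) ? "forwarded" : "raw stderr";
    Console.WriteLine(kind);
}
// Prints "forwarded", then "raw stderr".
```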

@petrkoutnycz
Contributor Author

By the way, the weird thing is that this problem occurs for us in one specific environment even with version 1.2.0.
[image]

@cretz
Member

cretz commented Sep 4, 2024

If you'd like, you can try to rebuild the SDK from the branch at the PR in #344 and see if it resolves your issue.

@petrkoutnycz
Contributor Author

When do you expect the OTel over TLS problem to be fixed, please? :-)

@cretz
Member

cretz commented Sep 6, 2024

It will be fixed in the next release, which should hopefully happen soon as a patch release (we are patching another issue).

@cretz
Member

cretz commented Sep 11, 2024

This should now be fixed with the 1.3.1 release.
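
For anyone who cannot upgrade right away, a startup check along these lines could flag the broken combination before metrics silently go missing. This is only a sketch based on what this thread reports (1.3.0 broken over TLS, 1.3.1 fixed). `IsAffected` is a hypothetical helper, not part of the SDK, and how you obtain the SDK version is left to the caller.

```csharp
using System;

// Hypothetical helper: true only for the combination this thread reports as
// broken, i.e. SDK 1.3.0 together with an https:// OTLP endpoint.
static bool IsAffected(Version sdkVersion, Uri otlpUrl) =>
    sdkVersion.Major == 1 && sdkVersion.Minor == 3 && sdkVersion.Build == 0
    && otlpUrl.Scheme == Uri.UriSchemeHttps;

var endpoint = new Uri("https://otlp.eu01.nr-data.net/v1/metrics");

Console.WriteLine(IsAffected(new Version(1, 3, 0), endpoint)); // True
Console.WriteLine(IsAffected(new Version(1, 3, 1), endpoint)); // False
Console.WriteLine(IsAffected(new Version(1, 3, 0), new Uri("http://localhost:4317"))); // False
```

It does not cover the case mentioned earlier where one environment fails even on 1.2.0, since the thread gives no cause for that.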

@cretz cretz closed this as completed Sep 11, 2024
2 participants