Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry integration misses the point: we need to propagate the span/context! #1389

Closed
hermanbanken opened this issue Aug 30, 2021 · 33 comments · Fixed by #1833
Closed
Assignees
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.

Comments

@hermanbanken
Copy link

hermanbanken commented Aug 30, 2021

The code where the Subscriber reads the span from the attributes just creates the span and then throws it away:

const span: Span | undefined = this._constructSpan(message);
if (this.isOpen) {
message.modAck(this.ackDeadline);
this._inventory.add(message);
} else {
message.nack();
}
if (span) {
span.end();
}

A good OpenTelemetry customer would then set the span to the current span in the on("message") handler, so that any subsequent processing is attached to the remote span.

I'll also file a Google support case.

Environment details

  • @google-cloud/pubsub version: master / v2.17.0

Steps to reproduce

  1. Enable OpenTelemetry in both publisher and subscriber using { enableOpenTelemetryTracing: true } options on topic and subscription.
  2. Perform some processing that creates spans, either manually or (for example) do a Datastore Query or Put.

It can be observed that actions are linked to the subscriber span, as can also be seen in this screenshot (spans take <1ms because nothing is done inside):
Screenshot 2021-08-30 at 16 47 23

Relates to #1078.

@product-auto-label product-auto-label bot added the api: pubsub Issues related to the googleapis/nodejs-pubsub API. label Aug 30, 2021
@hermanbanken
Copy link
Author

@sethmaxwl was the scope of the initial OpenTelemetry PR to just link the publish and subscribe side together? I think it missed the point for any consumer application if you want to link the publish to more downstream actions inside the subscriber.

@feywind
Copy link
Collaborator

feywind commented Aug 30, 2021

@hermanbanken Thanks for the comments! We are in the middle of trying to harmonize our treatment of OpenTelemetry for Pub/Sub across languages/libraries. The Node one was sort of a pilot project to see how it goes, and we're hoping to make that better. Feel free to add any comments you have (or anyone who wants to comment!) and we'll add it to the discussion.

@feywind feywind added priority: p2 Moderately-important priority. Fix may not be included in next release. type: enhancement labels Aug 30, 2021
@hermanbanken
Copy link
Author

One remark I have is that it should be documented a little bit more how it is working. For example, the (semantic) key name and the data format of the propagation key could be documented. I believe it was __googPubsubTrace-something and I think it would be nice if that could be not some magic internal constant. Also, why not use the standard w3c trace header name and format?

@weyert
Copy link
Contributor

weyert commented Sep 11, 2021

Good point, I think Aaron suggested the same regarding W3C trace format. Existing propagator should be reusable. I struggled to get it all working with the recent Otel-js. If you want you can raise a PR or I can have a stab at it during my next tech debt day at work

I am guess we should make the span we created the active one? Got to look a bit more into the workings of subscriber class

@weyert
Copy link
Contributor

weyert commented Oct 20, 2021

@hermanbanken Did you hear back from Google Support about your case?

@hermanbanken
Copy link
Author

At September 1st they said that they would working on it "next quarter".

Public tracker: https://issuetracker.google.com/issues/183608476

@hermanbanken
Copy link
Author

That tracker seems to be inaccessible now... weird.

@weyert
Copy link
Contributor

weyert commented Feb 20, 2022 via email

@naseemkullah
Copy link

naseemkullah commented Feb 21, 2022

I came up with this as a workaround (there is a comment in the event listener about what I think this library should provide):

import {Message, PubSub} from '@google-cloud/pubsub';
import {
  ROOT_CONTEXT,
  SpanContext,
  SpanStatusCode,
  trace,
} from '@opentelemetry/api';

const tracer = trace.getTracer('foo');

const pubsub = new PubSub({projectId: 'abc'});

const enableOpenTelemetryTracing = {
  enableOpenTelemetryTracing: true,
};

export async function subcribe() {
  const subscription = pubsub.subscription(
    'testSubscription',
    enableOpenTelemetryTracing
  );

  subscription
    .on('message', (message: Message) => {
      // it would be convenient if everything in this handler
      // apart from `handleMessage(message)` logic was provided by the GCP pub/sub library
      // when enableOpenTelemetryTracing is true
      const spanContext = getSpanContext(message);

      if (spanContext === undefined) {
        handleMessage(message);
      } else {
        const context = trace.setSpanContext(ROOT_CONTEXT, spanContext);

        tracer.startActiveSpan('handleMessage', {}, context, span => {
          try {
            handleMessage(message);
            span.setStatus({code: SpanStatusCode.OK});
          } catch (err) {
            span.setStatus({
              code: SpanStatusCode.ERROR,
              message: err instanceof Error ? err.message : undefined,
            });
            throw err;
          } finally {
            span.end();
          }
        });
      }
    })
    .on('error', (error: Error) => {
      console.error('Received error:', error);
    });
}

function handleMessage(message: Message) {
  console.log('Received message:', message.data.toString());
  message.ack();
}

function getSpanContext(message: Message): SpanContext | undefined {
  try {
    return JSON.parse(
      message.attributes['googclient_OpenTelemetrySpanContext']
    );
  } catch (error) {
    return undefined;
  }
}

worth noting, while the message acknowledgment and any processing leading up to it are part of the same trace, the /google.pubsub.v1.Subscriber/ModifyAckDeadline span is still disconnected

edit: so apparently "span links" are the proper way to link a messages consumption to its production (as @weyert menitons below), above code snippet should probably not be used

@weyert
Copy link
Contributor

weyert commented Feb 21, 2022

Thank you, that looks useful. I think we might be able to use span links to connect modifyackdeadline and the other cools. But to me having it not linked is already a great improvement

@feywind
Copy link
Collaborator

feywind commented Mar 24, 2022

The tracker page is still there for me, internally at least. Weird. Anyway, nothing has happened on this yet, but it's increasing in priority right now. Definitely not forgotten.

@feywind feywind added type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design. and removed priority: p2 Moderately-important priority. Fix may not be included in next release. type: enhancement labels Mar 24, 2022
@weyert
Copy link
Contributor

weyert commented Mar 24, 2022

Yeah, I hope to have a stab at it soon. So far more important stuff at work than this.

@feywind
Copy link
Collaborator

feywind commented Mar 28, 2022

You might not want to put too much into it, just because we've got a revamped design for OT that will be rolling out across all the languages, over time. Of note, it's going to propagate the span and context by way of the W3C spec, as well as doing better about "closing the loop" on the subscriber end. We've got a prototype implementation in Go, and that just needs to be expanded across all the languages.

googleapis/google-cloud-go#5034

If you see anything you want in the Node implementation that's missing from the Go one, please feel free to comment!

@weyert
Copy link
Contributor

weyert commented Mar 28, 2022

Yeah, the changes I wanted to propose via PR would be heavily based on the Go implementation :)

Let me know, if you prefer me to wait, that's fine, happy to wait for a Google implementation if it's coming in short term

@GraemeF
Copy link

GraemeF commented Aug 3, 2022

Great news that this is being fixed! Just spent a day scratching my head as to why my traces were disconnected...

@feywind
Copy link
Collaborator

feywind commented Aug 3, 2022

Sorry for the radio silence on this topic, but it's still a major objective for us. None of this is official until it's official (please don't take it as an announcement/promise), but the current thought is that we're planning to shift it to use the more standard mechanism (w3c headers), and to get it piped all the way through. A preview of the Go version produces a really nice graph in Cloud Trace, showing the time spent in each phase of the library (batching, ack leasing, etc). There isn't currently anything that will be traced in the server, but the end to end trace should at least allow you to see how long it spends in the server before being sent out.

@feywind
Copy link
Collaborator

feywind commented Dec 22, 2022

Mentioned in the PR, but mentioning here too: there's a suuuuper early preview branch now if anyone wants to play around with what's done. I don't recommend using it in any kind of production scenario or assuming any of the interfaces or trace formats are final.

That PR is here: #1659

And the branch in googleapis that anyone can make PRs against if you want: https://github.com/googleapis/nodejs-pubsub/tree/otel-2

I'll be out for a bit but plan to pick it up again in early 2023.

@0xdevalias
Copy link

@feywind Just wanted to check in and see how 2023 was treating this work, any updates yet?

@feywind
Copy link
Collaborator

feywind commented Jun 20, 2023

I've been posting updates on the PR, but I'm planning to make a preview version soon.

@feywind
Copy link
Collaborator

feywind commented Jun 23, 2023

Tiny sample size, I admit, but quick survey for those on this thread and wanting OTel support sooner rather than later: do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?

@weyert
Copy link
Contributor

weyert commented Jun 23, 2023

@feywind for me it’s a human debug tool:) I am personally not planning to generate metrics from the traces via like otel-collector

@ervin-pactum
Copy link

do you consider the trace structures and span names to be part of the public interface? Would it break your code if they changed? Or is it just a human debug tool?

Mainly it is human debug tool, but there might be some alerting around it, I would expect that changes to span names or changes in amount of spans emitted to be mentioned in changelog/release notes. Mainly because there is some little cost attached to each single span.

@0xdevalias
Copy link

Pretty much entirely human debug tool, but mentioning any significant changes in the changelog would be helpful for updating any internal process documentation/etc.

@SimenB
Copy link

SimenB commented Jun 26, 2023

Same here as others have said - only used for some alerting and saved searches in grafana, "production code" doesn't care

@danmichaelo
Copy link

Same here.

@feywind
Copy link
Collaborator

feywind commented Jun 27, 2023

Thanks everyone, I appreciate the input. Our main goal for the moment is more of a human debug solution, too, so I think we're all aligned. :) A sort of "soft interface" is an interesting idea, not needing a whole new major as long as updates are mentioned consistently somewhere. That'll be good to take into account.

@feywind
Copy link
Collaborator

feywind commented Sep 20, 2023

That #1833 is the version rebased onto main (4.x). I'm planning to get that released and then go back to do any final backporting to 3.x.

@feywind
Copy link
Collaborator

feywind commented Apr 4, 2024

Quick update, there are some beta testing instructions over here if anyone is interested.

@feywind
Copy link
Collaborator

feywind commented May 1, 2024

Has anyone had a chance to try this out? There are a few things that need to be finished up to get this released, but it'll probably be pretty soon.

@weyert
Copy link
Contributor

weyert commented May 1, 2024

I have tried it during local development and works pretty well for me!

@vesse
Copy link

vesse commented May 6, 2024

@feywind I have now been trying the beta in GCP but don't get the pubsub message handler requests to the same trace than from where it was triggered. We use @opentelemetry/auto-instrumentations-node and Pub/Sub push subscriptions with Cloud Run instance being triggered from the messages.

I can see that the incoming request body contains eg.

attributes: {
  googclient_traceparent: "00-4dccfee9342c29e33b8fcf484aa52ea9-6b72c4b7d37f4bd0-01"
}

where the trace ID matches the trace ID of the publisher, but the incoming HTTP request traceparent has different trace ID.

Note: I have very little prior OTEL experience but trying now to set up tracing so that incoming request trace would contain everything the request handling caused. I don't even know if I'm supposed to do something manually or are push subscriptions even supposed to be supported throughout the chain.

@feywind
Copy link
Collaborator

feywind commented May 6, 2024

@vesse It seems push subscriptions are slightly out of scope for this first pass. You should be able to pull that googclient_traceparent value and use it for further OpenTelemetry tracing.

I'm not sure exactly what is placing a new traceparent in your HTTP headers, but it's something we'll need to figure out for the push support, so thanks for the heads-up :)

@feywind
Copy link
Collaborator

feywind commented Aug 16, 2024

I forgot to post here too: the otel-beta tag is updated to .2 now, which is pretty close to the finished v1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the googleapis/nodejs-pubsub API. type: feature request ‘Nice-to-have’ improvement, new feature or different behavior or design.
Projects
None yet
10 participants