Skip to content
This repository has been archived by the owner on May 23, 2023. It is now read-only.

RFC / WIP: make "auto-finishing" optional and non-Tracer-specific #166

Closed
wants to merge 8 commits into from

Conversation

bhs
Copy link
Contributor

@bhs bhs commented Jul 17, 2017

Per opentracing/specification#80, there are concerns about the existing ActiveSpan/ActiveSpanSource concepts, what they imply for Tracer owners/maintainers, and what they imply about acceptable idioms for ActiveSpan instrumentation.

This PR is not intended for merge as-is (at the very least it needs more testing and documentation), but is intended to illustrate some relatively small API changes that decouple Tracers from the refcounting business.

There are two problems I'm hoping to address here.

  1. Inasmuch as ActiveSpan/Continuation refcounting might be desirable in some circumstances, there are definitely people who do not like it. Furthermore, they do not want every Tracer to reimplement that logic.
  2. It should be possible to register/activate/makeActive a Span with the ActiveSpanSource without getting into the refcounting stuff ^^^.

Again, this PR is best understood in the context of opentracing/specification#80. Comments welcome, but don't worry about nitpicks/etc as this is not meant to be merged in its current state.

bhs added 4 commits July 16, 2017 14:41
... while also moving the refcounting to the util package and creating a
cleaner path to something different should we want it.
@@ -36,7 +34,12 @@ public ThreadLocalActiveSpan activeSpan() {

@Override
public ActiveSpan makeActive(Span span) {
return new ThreadLocalActiveSpan(this, span, new AtomicInteger(1));
return new ThreadLocalActiveSpan(this, span, new AutoFinisher());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this AutoFinisher default makes this PR less-breaking for existing 0.30.x instrumentation... if we made the default Observer null, the PR would become more-breaking but perhaps "cleaner".

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the default fallback should be this way.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this behavior should be controlled by a configuration parameter in ThreadLocalActiveSpanSource. Otherwise it seems to defeat the purpose of this PR, which was to give people a way to turn off refcounting / auto-finishing behavior.

Another thing that's unclear to me - those people who asked for no refcounting / auto-finish behavior, are they implementing a Tracer or doing instrumentation of some framework? Because if it's the latter and those frameworks are mixed with other instrumentation that expects auto-finish, then the only way to make this work is to have makeActive(span) == makeActive(span, NoopObserver), i.e. if someone wants auto-finish behavior it must be requested explicitly, not via the Source configuration (it will be a breaking change from 0.30)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Its a difficult decision - less breaking is easier in the short term, but doesn't truly decouple the active span lifecycle from the ref counting concern. Although it will impact all instrumentations that have used active spans, I think I would prefer to have the instrumentations explicitly request ref counting via setting the AutoFinisher observer, else handle finishing themselves.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the refcounting should/must be a per-instrumentation-site decision. When it's appropriate, it's great. When it's not, it's not.

@yurishkuro

those people who asked for no refcounting / auto-finish behavior, are they implementing a Tracer or doing instrumentation of some framework?

Both, I think... though I'm more concerned with the latter all things being equal (as it's a larger population of people).

We will need to document at the API level what it means when no observer is provided (i.e., if it implies no-auto-finish or yes-auto-finish). I guess my current proposal is that the default should be yes-auto-finish, but that is definitely debatable and is probably the most interesting choice to make in this entire PR.

Very interested in the opinions of the folks I've cc'd on this PR (or any others).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivantopo @objectiser @yurishkuro

To be clear, I am not proposing the current default because it makes migrations from 0.30 easier, I'm proposing it because I still think it's a sensible default. Though I'm getting the sense that others disagree?

The most common case for instrumentation (IMO) will still be a synchronous try() block, and the most concise syntax should auto-finish() the (Active)Span there IMO. There are other ways to accomplish this, of course... we could have other variants of startXYZ() along with startActive() and startManual(), for instance.

But I did want to make it clear that my thinking here is that the refcounting is still a net-good default, but just something that can be opted out of in cases where it's the wrong thing.

Thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to agree with @bhs that defaulting no-arg startActive() to auto-finish/refcounting behavior is a better default for most users. I don't have any data to back that up since our use of Java is relatively limited. But according to this search, a number of opentracing-contrib integration libraries have been upgraded to 0.30 using default auto-finish mechanism, so apparently it is both useful and convenient to use in instrumentation. @pavolloffay - any comments on that?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @yurishkuro, thanks for the corrected search!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a difference between updating a version and using the active span api. There's also a difference between writing code based on discussions like these and production experience with them. I wouldn't jump to conclusions that everything is fine.

What I don't like is that the liability of auto-finishing and its implementation are not only still here, but also still the default. Any problems with this implementation (as noted laboriously in the prior issue and I won't repeat here) are not less likely now!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yurishkuro so far we haven't had any problems with refcounting in any instrumentation lib.

@mabn made a good comment #166 (comment). Postponing finish due to refcounting is nice in async situations but these use cases are mostly rare. Important is also that it can cause weird timing results if continuation of server span is passed to an async background task (but I haven't had that problem so far).

@@ -79,6 +84,8 @@
*/
Continuation capture();

Span wrapped();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this imply that an ActiveSpan should always wrap a Span instance?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylerbenson thanks for pointing this out... I meant to remove this when I changed ActiveSpan to be a subclass of Span. Though having slept on it I think there might be another (better) way: leave the class hierarchy as it was, then pass the actual Span instance into the Observer. Calling finish() on an ActiveSpan is a little weird since deactivate() seems like it should never happen after finish().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I removed this unused method, btw)


/**
* XXX: fix comment
* {@link AutoFinisher} is an {@link ActiveSpan} wrapper that automatically {@link Span#finish()}es the underlying
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could avoid conjugating method names, e.g. "wrapper that automatically calls finish() on the underlying...`

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No fun, I love conjugating method names...

*
* <p>
* Note that {@link AutoFinisher} works by counting the number of extant {@link ActiveSpan} or
* {@link ActiveSpan.Continuation} references to the underlying {@link ActiveSpan} provided at construction time.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

underlying Span? It says underlying ActiveSpan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

search/replace bug.

* <p>
* Use {@link AutoFinisher} like this:
* <pre><code>
* try (ActiveSpan span = new AutoFinisher(tracer.buildSpan("...").startActive())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

example doesn't match the no-arg constructor AutoFinisher().

The trick with AutoFinisher is that there can be only one per Span, otherwise ref-counting will be broken

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, syntax changed after writing that comment...

@@ -56,7 +56,7 @@ The common case starts an `ActiveSpan` that's automatically registered for intra
```java
io.opentracing.Tracer tracer = ...;
...
try (ActiveSpan activeSpan = tracer.buildSpan("someWork").startActive()) {
try (ActiveSpan activeSpan = tracer.buildSpan("someWork").startActive(new AutoFinisher())) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems the try() would still call close() and finish the span, whether the AutoFinisher is passed or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I reverted to the old BaseSpan/Span/ActiveSpan class hierarchy)

bhs added 2 commits July 17, 2017 21:14
In so doing, bring back the BaseSpan/Span/ActiveSpan relationships
(albeit with SpanFinisher sitting between BaseSpan and Span).
Copy link
Contributor Author

@bhs bhs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having slept on this for a few nights, I think this is a positive step. If others agree, I will clean it up, add better docs+tests, and hopefully get it merged.

I also welcome utterly different approaches if they're out there. IMO the ActiveSpan semantics at master are useful, but they do seem to couple too many things together at once and nobody likes that.

cc @pavolloffay @yurishkuro @objectiser @tedsuo @adriancole @jakerobb @felixbarny @ivantopo @tylerbenson who might have opinions about this PR and/or opentracing/specification#80 which motivated it.

interface Observer {
void onCapture(ActiveSpan captured, Continuation destination);
void onActivate(Continuation source, ActiveSpan justActivated);
void onDeactivate(ActiveSpan activeSpan, Finishable finisher);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

by factoring out the Finishable interface from Span proper, there's no longer a need to have ActiveSpan inherit from Span rather than BaseSpan... we're also back to a world where you can't call .finish() on an ActiveSpan, and IMO that's a good thing.

I am uneasy about the precise signatures for this Observer interface... it could be smaller/simpler and still satisfy the requirements here (to factor out refcounting).

@@ -36,7 +34,12 @@ public ThreadLocalActiveSpan activeSpan() {

@Override
public ActiveSpan makeActive(Span span) {
return new ThreadLocalActiveSpan(this, span, new AtomicInteger(1));
return new ThreadLocalActiveSpan(this, span, new AutoFinisher());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps this behavior should be controlled by a configuration parameter in ThreadLocalActiveSpanSource. Otherwise it seems to defeat the purpose of this PR, which was to give people a way to turn off refcounting / auto-finishing behavior.

Another thing that's unclear to me - those people who asked for no refcounting / auto-finish behavior, are they implementing a Tracer or doing instrumentation of some framework? Because if it's the latter and those frameworks are mixed with other instrumentation that expects auto-finish, then the only way to make this work is to have makeActive(span) == makeActive(span, NoopObserver), i.e. if someone wants auto-finish behavior it must be requested explicitly, not via the Source configuration (it will be a breaking change from 0.30)

@@ -39,12 +38,18 @@
ActiveSpan activeSpan();

/**
* @see #makeActive(Span, ActiveSpan.Observer)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ambiguous - is it a shortcut for makeActive(span, null) or something else? Or perhaps implementation-dependent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to maintain backward compatibility of behaviour, then I think the javadoc comment for all of the methods that have an Observer variant should say that they represent the case of using the AutoFinisher observer, e.g. SpanBuilder.startActive(), ActiveSpanSource.makeActive(Span), Tracer.makeActive(Span).

We may also want to consider deprecating these methods, so that for the 1.0 API the Observer has to explicitly be specified, e.g. whether null or AutoFinisher.

/**
* {@link Finishable} factors out the overloaded {@link #finish()} method(s).
*
* @see ActiveSpan.Observer#onDeactivate(ActiveSpan, Finishable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

afterDeactivate

@@ -39,12 +38,18 @@
ActiveSpan activeSpan();

/**
* @see #makeActive(Span, ActiveSpan.Observer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to maintain backward compatibility of behaviour, then I think the javadoc comment for all of the methods that have an Observer variant should say that they represent the case of using the AutoFinisher observer, e.g. SpanBuilder.startActive(), ActiveSpanSource.makeActive(Span), Tracer.makeActive(Span).

We may also want to consider deprecating these methods, so that for the 1.0 API the Observer has to explicitly be specified, e.g. whether null or AutoFinisher.

@ivantopo
Copy link

There is some additional information I would like to share here: After several discussions we had internally about our tracer implementation and feedback from developers that are not even concerned with tracing, just good developers who strive for the best abstraction to solve the problem at hand, we came across this question:

Why does a Span has to concern itself with whether it is "active" or not?

This clicked on all of us and we asked ourselves whether such thing as an "ActiveSpan" should exist or not.. our conclusion was simple: there should be just a Span, whether it is InScope/Active/Current, that's a totally different concern that has nothing to do with the Span itself. Then, if there is no such thing as an ActiveSpan then Continuations go away as well and we realize that, at a bare minimum, what is actually needed to properly do in-process context propagation is something that allows you to set/get whatever Span is considered to be Active/Current/InScope. We are trying to come up with an API the is simple, clean and truly flexible and reusable in Kamon and to achieve that, these are the very minimal constructs that really have to be exposed. Making a Span Active/Current/InScope might return an object that wraps the Span, maybe also having a reference to whatever was Active/Current/InScope before so that it can be restored later, but there is no reason for this scoped object to be a Span, just something that wraps a Span.

We had a simple utility to get the current context in Kamon 0.6.x series which we thought might be just too low level and that we could replace it with higher level abstractions like the ones provided by OT, but the reality is that even if we wanted to provide higher level abstractions, the basics have to be exposed. I'm sharing all of this because given the fact that OpenTracing aims to be the common denominator when it comes to tracing I guess it's in the best interest of the community to provide just the minimum APIs to stay flexible and achieve the task, while leaving the door open for more advance features to be built on top of the base APIs.

I would recommend everyone involved in this PR to take a look at what Census (using gRPCs context), Brave and even the old Kamon are doing for in-context propagation. Its just simple, it works, it's proven and many things can be built on top of that.

I know there has been quite an investment in ActiveSpanSource/ActiveSpan/Continuation and I actually liked these abstractions at the beginning but the more we exercise them, the more we realize that there is something that doesn't click and a simpler solution could solve the in-process context propagation challenge. Actually, a simpler solution already solved it for us before and in other tracing systems, there isn't a need to reinvent the wheel here. It is not my intention to ask you to throw away that work, but if we came to the conclusion that is in the best interest of our users to stick to a simpler solution then common sense dictates that we should share our point of view and reasoning on this issue, even if it will just be for the sake of sharing ideas and exposing a different angle to this concern. If there is any interest in discussing a "simpler" solution for in-process context propagation I would be glad to participate in the discussion, otherwise I'll still try to provide as much feedback and insights on this PR as possible.

@yurishkuro
Copy link
Member

@ivantopo what you're describing is where OpenTracing API has been prior to 0.30, when it had no opinion about in-process context propagation. It was a major roadblock to instrumentation because it required some 3rd party solution (like jaeger-context) to actually make different frameworks work together.

It is certainly possible to come up with context propagation API that is completely decoupled from tracing, since context propagation is a more general problem (cf. tracing plane by @JonathanMace). But in order to describe the shape of the computation (which is what OT is about), having separate propagation and Span APIs creates a lot of boiler-plate code for instrumentation.

What I like about active span is that it's still an abstraction. Granted, it prescribes certain semantics, but it should be possible to implement ActiveSpanSource on top of any other existing context propagation mechanism if so desired (but I don't know if anyone tried).

I do agree that mental complexity of ActiveSpanSource seems higher than a simple push/pop mechanics of jaeger-context and many other similar implementations.

@bhs
Copy link
Contributor Author

bhs commented Jul 24, 2017

@ivantopo getting the API right is way more important than worrying about existing instrumentation, so thank you for sharing your perspectives and those of your straw-polled team members.

Re existing solutions: the reality is that a normal Java process today does not have a reliable way to access the current/active/whatever-you-want-to-call-it Span. As such, no current solution "works", as that ^^^ is the ultimate design goal here. This is not an endorsement of the current OT approach, but just a reminder of the actual goal we all share.

For what it's worth, the API originally was more of a "Handle" approach with a span() accessor (much like what you suggested), but someone rightly pointed out that it led to extra typing, hence the subclassing relationship.

As far as API layering is concerned, average users are not supposed to think about Continuations... that abstraction is in place for the purposes of power users and those building things like the OT-Java-compatible https://github.com/opentracing-contrib/java-concurrent/blob/master/README.md ... in that model, the capture() and activate() happen behind the scenes, as they should.

I don't know... you're arguing for the basic building blocks. That is actually what the current API aims to provide... by explicitly modeling the back-and-forth handoff of context/continuation state and an active, write-able Span, higher-level and cognitively simpler abstractions can be built.

If I had to propose a different design that's probably more along the lines of what you're arguing for, we would:

  • get rid of BaseSpan and ActiveSpan and consolidate around Span proper
  • create a try()-friendly scoped activation mechanism and an ActiveSpanSource with which to register same
  • tell people to pass a Span around explicitly where they presently pass a Continuation
  • give up on the refcounting functionality entirely (at least as something that would work across library boundaries as part of OT proper)
  • somewhat misc: I might get rid of the "Tracer inherits from ActiveSpanSource" thing and instead create a setter/getter in Tracer for the ActiveSpanSource, thus making them truly pluggable (and testable) across Tracer impls

Is ^^^ more what you're hoping for? If not, please clarify either here or with some other design sketch RFC PR and link to same.

To reiterate, I definitely want to get this right and am happy to put in the time to have 0.30 code migrated accordingly.

@objectiser
Copy link
Contributor

@bhs I think my main concern is the fact that the AutoFinisher is used by default and therefore to disable it you need to use the alternate method and provide a null Observer. I think a simple change to make the default "no observer" would make it a lot cleaner.

The try() usage would remain the same - i.e. call Span.finish for manually started spans, and ActiveSpan.deactivate for active spans. The only difference is that the apps would need to be updated to either supply the AutoFinisher when activating the span, or explicitly call the ActiveSpan.finish.

So minimal change to the current APIs, except the app now explicitly determines how active spans should be finished.

@malafeev
Copy link

I tested it with RxJava instrumentation. No issues :)

@tedsuo
Copy link
Member

tedsuo commented Jul 26, 2017

Thinking about how important simplicity is in an API like this, I am inclined to agree with @objectiser that no Observer should be present by default. This allows us to clearly document the separate concerns, and to have the documented behavior of abstractions like spans also be their default behavior. I'm not saying that autofinishing couldn't be a better practical default (we have made use of a lot of it), but the bar should be high given how important clarity is, and I wonder if it actually makes the instrumentation more understandable by seeing the explicit addition of an AutoFinisher in the places where it is needed.

But, before we split too many hairs on that I'd love it if we could get some feedback from tracer implementors who have expressed some difficulty with reference counting and context management occurring outside their tracer, I notice they haven't weighed in yet.

@malafeev
Copy link

@tedsuo I have only one reason to use AutoFinisher by default: it will not break current instrumentations without inserting AutoFinisher in all makeActive/startActive calls.

@mabn
Copy link
Contributor

mabn commented Jul 26, 2017

Regular spans are perfectly fine to be closed explicitly , e.g. with try-catch block - when the start and finish happen in the same method.

So auto-finishing is useful mostly when jumping between execution contexts (e.g. threads) where it's hard to detect when all Continuations have finished. But - isn't such code very rare? IMHO it will be mostly used in instrumentation libraries. And if it's used in application code then most likely it will be in some centralized place dealing with scheduling tasks or passing messages across threads.

If my understanding is correct then the hassle of adding AutoFinisher will be minimal and the default of noop is much better.

@ivantopo
Copy link

Re existing solutions: the reality is that a normal Java process today does not have a reliable way to access the current/active/whatever-you-want-to-call-it Span. As such, no current solution "works", as that ^^^ is the ultimate design goal here. This is not an endorsement of the current OT approach, but just a reminder of the actual goal we all share.

@bhs saying that no current solution "works" is quite a broad and misleading statement. There are a number of projects succesfully tracing all sort of JVM applications since years ago, before OT and any of the proposed abstractions existed. The reason why there is no reliable way to access the current/active/whatever Span on the JVM doesn't have anything to do with the APIs that surfaced over the years, it has to do with the variety of execution models that are appearing lately with all the reactive, async and non-blocking wave that is covering the JVM. Those challenges are not going to get solved by OT in any way, nor by jaeger-context, tracing plane, gRPC context, my proposal or any other, there is no generic way to solve this problem for everything that runs on the JVM, those challenges can only be solved by developers writing instrumentation based on a deep understanding of how the toolkit they are instrumenting works under the hood so that the decision of when to propagate/activate/deactive can be taken appropriately, the APIs are just a means to that end.

In-process context propagation only requires two things:

  • A means of getting the currently active Span.
  • A means of temporarily set the currently active Span, and the key here is the "temporary" nature of this operation.

It doesn't matter if we are talking about instrumenting old boring servlets or reactive applications, these 2 operations are all we need to instrument. The only difference between these two types of applications is when should something set as active and for how long it should remain like that but all in all, the operations are the same: get and temporarily set. That is the very minimum API required, that is what many solutions out there boil down to and, IMHO, that's the bare minimum OT should expose. Moreover, that is the common denominator across many solutions that are working in production right now, including Jaeger, Brave, Census/gRPC and probably many more beyond my knowledge.

This is our current API:

trait Scope extends AutoCloseable {
  def close(): Unit
}

trait ActiveSpanSource {
  def activeSpan(): Span
  def activate(span: Span): Scope
}

Naming aside, this is all we need. With this we instrument everything: Scala, Twitter and Scalaz Futures, Akka, Play Framework and more to come, and this just works (a simpler version of this has been working for 4+ years for thousands of our users). All sorts of challenges arise when working with these tools, the way of thinking is quite different, deferred work is common, having different threads pools and future callbacks hoping across them is the rule, every single interaction between actors is asynchronous and still, those challenges are being solved with this super simple API.

The current approach with reference counting by default works nice in controlled environments, but it is a otherwise fragile approach for numerous reasons already exposed. The new approach in this PR is adding more API surface to opt-out of a feature that probably shouldn't have been in the scope of OT-api in the first place.

With regards to the boilerplate related to having separate APIs, that's debatable. I see how having clean cut responsibilities between abstractions has to be paid with extra typing, but whether one line more here and there is "a lot" of boilerplate or not seems a very subjective, endless discussion and this is probably not the place to have it.

I would like to hear opinions from the rest of the community and if there is enough interest I would be glad to propose an alternative PR exploring the constructs proposed above. Looking forward to comments.

@bhs
Copy link
Contributor Author

bhs commented Jul 26, 2017

@ivantopo

The reason why there is no reliable way to access the current/active/whatever Span on the JVM (...) has to do with the variety of execution models that are appearing lately with all the reactive, async and non-blocking wave that is covering the JVM. (...) There is no generic way to solve this problem for everything that runs on the JVM, those challenges can only be solved by developers writing instrumentation based on a deep understanding of how the toolkit they are instrumenting works under the hood so that the decision of when to propagate/activate/deactive can be taken appropriately, the APIs are just a means to that end.

Exactly exactly, that is basically what I'm trying to say with my "no current solution works" comment. Of course people have been trying to trace JVM (and every other environment) since the dawn of time. My point is merely that no API for "cross-package propagation" is sufficient in practice for any given "Application Process X" until that API has instrumentation in place for X's dependencies, esp the ones pertaining to control flow.

I really like most of what you said ^^^ and would absolutely welcome an alternative PR, FWIW.

PS: I do want to reemphasize one more critique of the current OT API at master... namely, having an ActiveSpanSource abstraction (which allows for bare TLS, MDC, etc as possible "storage layers") seems somewhat valuable... but unless the ActiveSpanSource can be swapped independent of the Tracer, it's not a meaningfully useful abstraction. So we should probably either just fold the methods into Tracer directly, or we should add some sort of capability like Tracer#setActiveSpanSource (with caveats that active spans prior to that call might be lost in the void... i.e., it's only to be used during initialization code).

@@ -33,6 +33,42 @@
* {@link ActiveSpan.Continuation}s, then re-{@link Continuation#activate()}d later.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this PR tries to distance from refcounting. There's a line that still talks about that which might be something to remove in this PR https://github.com/bhs/opentracing-java/blob/91cf7e5e500f635b9b765df3a107e0de32ad5f0e/opentracing-api/src/main/java/io/opentracing/ActiveSpan.java#L108

@@ -68,7 +68,7 @@ The above is semantically equivalent to the more explicit try-finally version:
```java
io.opentracing.Tracer tracer = ...;
...
ActiveSpan activeSpan = tracer.buildSpan("someWork").startActive();
ActiveSpan activeSpan = tracer.buildSpan("someWork").startActive(new AutoFinisher());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opinion: I prefer no ActiveSpan type vs adding more options to it. This is generally discussed in #80 so not repeating here

@@ -36,7 +34,12 @@ public ThreadLocalActiveSpan activeSpan() {

@Override
public ActiveSpan makeActive(Span span) {
return new ThreadLocalActiveSpan(this, span, new AtomicInteger(1));
return new ThreadLocalActiveSpan(this, span, new AutoFinisher());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a difference between updating a version and using the active span api. There's also a difference between writing code based on discussions like these and production experience with them. I wouldn't jump to conclusions that everything is fine.

What I don't like is that the liability of auto-finishing and its implementation are not only still here, but also still the default. Any problems with this implementation (as noted laboriously in the prior issue and I won't repeat here) are not less likely now!

@codefromthecrypt
Copy link
Contributor

The ActiveSpan api has a lot of weight, and even more now. Sometimes the cure is worse than the disease. The rest of this comment is on what @ivantopo proposed:

I completely agree with @ivantopo that abstraction is important. I've done a lot of different instrumentation, and supported many different types of people doing instrumentation. We have all been bitten by prior apis that punt calls to close a scope by design (ex the old thread binder in brave leaked spans and was really hard to debug problems like this)

I've gotten a lot of very good feedback about the new scoping apis which are essentially the same sketch as Ivan, and also the same as how census works.

@jakerobb on openzipkin-contrib/brave-opentracing#43 listed some concrete place where he might not know when something finishes before another thing, and that due to it being a dead library having no way to "fix it". There are some concerns about having a reference to something, and what you prefer to have a reference to.

I would prefer we use real scenarios like these to evaluate not just the chosen design, which ActiveSpan clearly is from insiders, but also evaluate alternative designs or the design of doing nothing. IOTW, we talk a lot and it is often really frustrating because there's nothing behind it. We talk in terms of what body of customers etc does X abstractly and rarely the style of "I am doing X" like jake says for example. The latter is far easier to pin down.

TL;DR; If something is important enough to be a api this wide, it should be important enough to catalog actual things and alternatives tried but not taken, and how end users feel about them (from their own mouths!)

@bhs
Copy link
Contributor Author

bhs commented Jul 29, 2017

@ivantopo sorry both for the slight delay and the brevity here – juggling a lot of things right now!

You wrote:

[...] The problem I see with [the refcounting] is that if there is a optional but official (OT-endorsed or even part of OT) way of enabling ref counting and those mechanisms are used in any publicly available instrumentation (which is very likely to happen in OT-contrib) then those instrumentations will only work nicely with those vendors that support reference counting. [...]

Very important to note that the PR under review moves the refcounting out of the vendor impls... so that concern would go away, I think? My hope is that the refcounting is possible but can be implemented 100% outside of the OT core API artifact (and also 100% outside of the Tracer impl). I.e., I'd like it to be possible but never required, esp for instrumentation code. (At master right now it is of course required, which I agree is a problem)

You also wrote:

[...] The whole purpose of baggage is to keep context that is attached to the Trace rather than to any platform-specific construct (like Threads on the JVM). I would argue that OT already provides a way to propagate domain-specific data and that users should use that abstraction rather than enabling alternative approaches. Talking about @mabn's points, the concern of not propagating certain keys to remote entities could be solved by customizing a Tracer's inject behavior without changes to the API. [...]

To do async work, the following things happen, and in this order:

  1. Literally allocate (not run, allocate) the callback and submit it to some executor queue
  2. Possibly in a different thread where the executor lives, dequeue the callback, then ...
  3. Run the callback and cleanup.

I would sleep better if the tracing API had a special affordance for step 1 above (capture() in OT at master, of course). It's useful for the sort of thing that @mabn has already brought up, but it also allows the code that's concerned with context management/propagation to be decoupled from the code that deals with the Span lifecycle (and adding tags/logs/blah). This is something I'd expect you to want, @ivantopo, since you are adamant (and I share this goal, btw) that a revised API tries to separate the concerns of context propagation and lifecycle. Does that make sense?

@jakerobb
Copy link

This conversation has gotten a bit over my head, but @adriancole asked me to chime in.

I did the work in openzipkin-contrib/brave-opentracing#43 to update the brave-opentracing bridge to work with OT-API 0.30. It was fairly straightforward at first; I introduced a dependency on opentracing-util and wired in ThreadLocalActiveSpanSource to handle my reference counting. This worked fine for my use (my company has a suite of ~40 Java services, all instrumented with pure OT-API, backed by Brave+Zipkin).

The initial feedback I got from that was that Brave already has its own context tracking mechanism, and that I should use that rather than run my own in parallel. There was also resistance to introducing a dependency on opentracing-util.

So, I copied TLASS and adapted for my needs. I came up with this: https://github.com/jakerobb/brave-opentracing/blob/a7854242af256035800a31fc692571be7e5df9f2/src/main/java/brave/opentracing/BraveActiveSpanSource.java

Note that BraveActiveSpanSource keeps a Map from SpanId to BraveActiveSpan, which among other things keeps a reference to Brave's SpanInScope, which Brave returns when you start a span, but doesn't expose at any other time. The SpanInScope is necessary in order to remove a span from current context, so I hold on to it in the map, and when I'm later told to close a span (and the refcount reaches zero), I pull it out of the map, close it, and purge from the map.

What I have works, and we have it in production using an internal build of brave-ot. That said, keeping the map around exposes me to race conditions and memory leaks in a scenario where instrumentation is done improperly.

I'd really like it if whatever mechanism OT-API eventually provides for context tracking (assuming that feature is retained) had some acknowledgement that there's a backing engine that needs to be kept in the loop, and that the backing engine probably has its own API which some instrumentations might use for which OT needs to be kept in the loop, and provided explicit hooks to enable that coordination.

@sjoerdtalsma
Copy link
Contributor

sjoerdtalsma commented Aug 1, 2017

I feel it's time to give my opinion as the author of the java-spanmanager contrib library. I'm sorry if my comment has too much focus on java, probably due to my limited exposure to other OT variants.

The discussion in this PR seems about making functionality optional that isn't wanted by all. It does so by adding on an AutoFinisher that may become default behaviour. There were very similar ideas (and discussions) with the initial spanmanager PR. The decision taken there was to not blur an already-complex topic by optional constructs, but rather add them separately as needed. In hindsight I'm glad we chose that route and am thankful for all the review effort spent by many (of which @adriancole, @yurishkuro and @wu-sheng were crucial).
Looking at it from that perspective, auto-finishing is simply a different problem than managing an active span.

Maybe we should revisit the core concepts and create separate API's for them that can be combined but each addressing their own concern?
i.e. ActiveSpanSource becomes only about managing a Span and returning some Continuation that can be closed again.

After all, that is all we need to be able to provide for instance thread pool implementations to automatically activate and deactivate spans in started threads.

For those that wish auto-finishing or have new spans automatically become child of active spans, we could add separate Tracer decorators. Although possible with the current OT-api it involves a lot of cumbersome wrapping of tracers, spanbuilders and spans.
I think adding lifecycle callbacks to the OT-api may make such decorators easier to add on while keeping the ActiveSpanSource a single-concept thing (albeit a rather complex concept).

About the ActiveSpanSource; In my opinion there can be implementations that are bound to explicit contexts or Thread-local variants as well as implementations that keep track of a single active span or maintain a stack of spans. All these decisions should be left up to the implementation. It's possible commonly used implementations find their way into the 'util' realm from the contrib repos (after they're vetted, bug-free and proven commonly used).

Feel free to disagree or try to convince me otherwise but this is my current opinion 😉 ..

@mabn
Copy link
Contributor

mabn commented Aug 3, 2017

I actually like the API with Scope mentioned by @ivantopo and it can be easily adapted to something similar to what exists in master and in this PR:

interface ActiveSpanSource {
    // no changes
    ActiveSpan makeActive(Span span);
}

interface Tracer extends ActiveSpanSource{
    // no changes
    SpanBuilder buildSpan(String operationName);
}

interface Span extends AutoCloseable {
    Span setTag(String key, String value);

    // Creates "an inactive ActiveSpan" which can be "activated" and turned into regular ActiveSpan. It
    // has only a single purpose - to carry some contextual information to another thread. Using
    // "capture" is not mandatory
    Continuation capture();

    /**
     * By default deactivates the span (if it was active) and finishes the span. The behaviour
     * is defined by the lifecycle used to create the span and another lifecycle can be provided
     * at span creation.
     */
    @Override
    void close();

    // yes! it's back!
    void finish();
}

interface SpanBuilder {
    // starts span and makes it active in the current thread
    Span startActive();

    // allows to alter span's lifecycle management - e.g. how Span.close() behaves
    Span startActive(SpanLifecycle lifecycle);

    // no changes, starts inactive span
    Span startManual();
    SpanBuilder childOf(Span span);
}

// represents the scope of span activity. provides access to the span that is active in this scope
// via span(). closed ActiveSpan is no longer valid - it's an error to call span() on it
interface ActiveSpan extends Closeable {
    // It no longer has Span methods! Instead - span get be obtained with span()
    Span span();

    /**
     * By default deactivates current ActiveSpan (but does not finish the span!). The behaviour is
     * defined by the lifecycle used to create the span and another lifecycle can be provided at
     * span creation.
     */
    @Override
    void close();
}

interface Continuation {
    ActiveSpan activate();
}

// interface allowing to slightly alter how span lifecycle behaves
interface SpanLifecycle {
    void onClose(Span span);
}

Some more details: https://gist.github.com/mabn/9246ebb03aa1534b33b98bd728158b11

The main changes:

  • BaseSpan is no more, there is only Span with all the regular methods. ActiveSpan no longer has Span interface and is just a wrapper. It could be called Scope as well.
    Span also got finish() back.

  • Span.close() now deactivates and finishes the span. One can achieve different behaviour by providing specific SpanLifecycle at span creation - e.g. auto finishing.

  • Continuation remains - as I pointed out without capture() good support for thread-locals is impossible. But it doesn't mean that it must be supported by the API. Actually, it could be added in the future without breaking backward-compatibility if there's a need for it... I don't recall anyone explicitly stating that they encountered a case where they needed capture() and the deficiencies around MDC support have workarounds. Maybe the API could survive without it...

btw - i'm referring to threads, but it should be valid for any kind of execution context supported by activeSpanSource

@bhs
Copy link
Contributor Author

bhs commented Aug 5, 2017

@mabn I have been playing around with API variations, too. I like a lot of things about your proposal. A few thoughts:

  • one of the motivators for capture() is MDC-capturing; another is that a refcount-based lifecycle needs to increment the reference count at capture() time, not at activate() time. IMO the MDC use case is not strong enough to justify the added API complexity, and I think (?) it's possible to handle the refcounting entirely external to OpenTracing and without an Observer/Lifecycle mechanism... I'm experimenting with that now. So my current feeling is that capture() can be removed from the core OpenTracing API (even if something similar crops up in contrib or another place that's out of harm's way).
  • startActive() ought to return an ActiveSpan
  • SpanLifecycle feels like it should either be a larger API or feels like it shouldn't exist at all... can it instead by accomplished via wrappers?
  • I'm not thrilled about Tracer inheriting from ActiveSpanSource... unless there's something like Tracer.setActiveSpanSource, there's really no "marketplace" for ActiveSpanSources or guarantee of decomposition there

Thoughts?

@bhs
Copy link
Contributor Author

bhs commented Aug 7, 2017

I have a sketch of a different approach here: https://github.com/bhs/opentracing-java/pull/5/files

Plot summary sans comments for brevity:

public interface Tracer {
    ScopeManager scopeManager();
    void setScopeManager(ScopeManager scopeManager);

    ... existing stuff ...
}

public interface ScopeManager {
    Scope activate(Span span);
    Scope active();
}

public interface Scope extends Closeable {
    void close();
    Span span();
}

More complete thoughts can be found at https://github.com/bhs/opentracing-java/pull/5/files.

Note that this PR ^^^ includes code that makes refcounting opt-in and decoupled from OT proper.

@sjoerdtalsma
Copy link
Contributor

@bhs Am I right that this latest sketch boils down to basically identical as a renamed SpanManager minus the deprecated method?

@bhs
Copy link
Contributor Author

bhs commented Aug 7, 2017

@sjoerdtalsma

Am I right that this latest sketch boils down to basically identical as a renamed SpanManager minus the deprecated method?

Very similar, yes, I should have mentioned that. Only real divergences are the lack of a clear() method, and then of course the fact that Tracer directly integrates with the new API... On that note, I'm not satisfied myself with the fluency (or lack thereof) of startActive with a finish()-on-close-scope helper.

@mabn
Copy link
Contributor

mabn commented Aug 7, 2017

@bhs I agree with your comments. When it comes to your PR return type of startActive might not be the best. What I mean by that is that it forces the use of Scope in the most common case:

try (Scope scope = tracer.buildSpan("...").startActive()) {
    scope.span().setTag(...);
    scope.span().log(...);
}

It would be great if startActive returned the actual Span here as in this simple scenario Scope wouldn't be even visible:

try (Span span = tracer.buildSpan("...").startActive()) {
    span.setTag(...);
    span.log(...);
}

But - it forces adding close to Span and makes it possible for Span to be active without Scope which is not good.

Basically I'd want to have the cake and eat it but I don't really see how.

@bhs
Copy link
Contributor Author

bhs commented Aug 7, 2017

@mabn

It would be great if startActive returned the actual Span here as in this simple scenario Scope wouldn't be even visible [...]

Basically I'd want to have the cake and eat it but I don't really see how.

Yeah, this is basically how ActiveSpan and Span ended up inheriting from BaseSpan at master. Though that has its own problems. /shrug

@tedsuo
Copy link
Member

tedsuo commented Aug 11, 2017

Hi all. @malafeev and I have started a repo for Java tracing examples: https://github.com/opentracing-contrib/java-examples

We've just started, but the hope is that we can use it to illuminate the many edge cases we are concerned about, and fork it to present API choices we are seriously considering. Personally, I'm having a little trouble making such a weighty decision just on the basis of code snippets. Please contribute examples, and let me know if you think this approach has value.

@bhs
Copy link
Contributor Author

bhs commented Aug 13, 2017

@tedsuo @malafeev thank you!

I decided to take advantage of the example-repo work to demonstrate the instrumentation differences between my proposal here and what's at master... there were some syntactic things I didn't like, so I added a Scope.Observer interface that makes it (relatively) clean to add finish-on-Scope-close semantics.

You can see this new Observer interface in my WIP Scope change, and you can see the ramifications for the rough example code in this diff.

By the way, @objectiser pointed out something important in the other demonstration PR... IMO it's very difficult to cleanly handle his concern without introducing most of the complexity of what's in master already (defer(), Continuation, etc), and it seems (?) that most people don't have the appetite for them (I do have that appetite, but I don't want to over-bias towards my personal preferences).

Curious to hear what others think.

@carlosalberto
Copy link
Collaborator

Hey everybody. Summarized the changes and testing results of Ben's latest approach (https://github.com/bhs/opentracing-java/pull/5/files) here https://gist.github.com/carlosalberto/28ba04ae3201d87f91eff924e7d427da

That will get an overall idea for those who haven't been lately following the discussion, as well as for those interested in any further changes to this design - specially the Continuation concept, and whether it should be included in the main api or not.

And hopefully this will get things rolling again (feedback, for a start) :)

@bhs
Copy link
Contributor Author

bhs commented Aug 26, 2017

@carlosalberto thank you for the above, a helpful and accurate summary of things!

@objectiser @pavolloffay @mabn @sjoerdtalsma @ivantopo @jakerobb @yurishkuro @malafeev @adriancole:

Current status given the research done so far

Here's how I see things...

  • There are naming changes/etc which we can and should haggle over, but AFAICT the only major semantic question still on the table is whether a defer()-like method and Continuation-like interface should be a part of the core API.
  • Along those lines, the only compelling reason to include defer and Continuation would be that we want to allow for some sort of auto-finishing (aka "refcounted") Scope/ActiveSpan concept in the API without casts
  • So, the major concern in my mind is whether there's a use case out there that justifies the additional API surface area and complexity.

Note that there is not a debate about whether Tracer impls need to implement refcounting... that can def be factored out into a util layer if we decide to include these extra API concepts. The question is strictly whether the user-facing API should have those concepts.

If there are no major revelations...

My default at this point is to chart a path towards removing defer and Continuation and incorporating some syntactic things like those found in @carlosalberto or my WIP branches to make common-case code terse and foolproof.

Moving towards a mergeable PR

I would like to try to follow the spirit of @tedsou's process doc... most importantly, before merging we should verify sanity in plentiful OT-contrib repos as well as the most visible Tracer impls (e.g., Jaeger and Brave).

RFC

Please chime in with dissenting opinions about ^^^.

@objectiser
Copy link
Contributor

Thanks @carlosalberto for your additional PR, but I think @bhs statement about removing defer/Continuation for now would be best. Unless there is a compelling usecase for it at this stage, it may be better to move to a 1.0 of the API without it, and then possibly revisit for 2.0?

@carlosalberto
Copy link
Collaborator

I'm all cool with leaving Continuation and related stuff out if indeed there's no compelling case (and with the hope that, when/if we add it later to the API, things won't break). I'm waiting for more feedback for people maintaining tracers, so they can give some input. Else, totally fine with wiping out this.

* @return an {@link ActiveSpan} that encapsulates the given {@link Span} and any other
* {@link ActiveSpanSource}-specific context (e.g., the MDC context map)
*/
ActiveSpan makeActive(Span span);
ActiveSpan makeActive(Span span, ActiveSpan.Observer observer);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One idea (not saying it's a good one... or that I understand all the implications) for making this be more backwards compatible is to make the Observer argument be a var arg. This would have the additional side effect of potentially allowing multiple Observers.

*
* @see BaseSpan#context()
*/
void finish(long finishMicros);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this be a good time to consider separating the units from the value? Perhaps a signature like this:
void finish(long finishTime, TimeUnit units)
I feel having implied unit magnitude often results in confusion. In the past, I've had to write code that dynamically figures out the proper magnitude assuming an incorrectly reported scale (seconds vs milliseconds) all because the api was used incorrectly.

@codefromthecrypt
Copy link
Contributor

@carlosalberto which other tracers would you like feedback from? (ex obv mine is to please remove the api) If folks aren't speaking up either they aren't watching or ignoring. You might need to contact someone directly. A lot of open source work ends up meaning actively getting an answer from folks as day jobs are distracting and threads like this are time consuming. Ex. simplify and ask "How would you feel if we removed X for possible revisit 2.0" to any advertised OpenTracing tracer and/or as an issue on any Tracer you can find implemented on github (latter is not that many)

@bhs
Copy link
Contributor Author

bhs commented Sep 10, 2017

To summarize a lot of discussion on this PR, related PRs, Gitter, OTSC calls, etc: the community seems comfortable removing the defer()/Continuation concepts in the core API. I know of several major software orgs who are happy with same, but there are some POC code snippets (e.g., from @carlosalberto and myself) that demonstrate ways to get most of the benefits with some (IMO acceptable) limited loss of generality.

In terms of actual concrete next-steps, I think the basic shape of the ScopeManager POC is a good compromise. There are a few things I would like to address before sending out a real / non-WIP PR against OT-Java:

  • The Scope.Observer pattern feels a little heavyweight, honestly... the only two things I can really imagine right now are "don't finish() on close()" and "do finish() on close()". Would we be better off with an enum? Or an extensible/varargs list of options on Scope activation?
  • Regardless of that mechanism, as @malafeev (and I think others?) have pointed out, the default Scope semantics should probably be FINISH_ON_CLOSE
  • @yurishkuro wondered if Tracer#setScopeManager(...) might be better to add later. I personally like having it, as there's not much point in having a generic ScopeManager interface if an arbitrary ScopeManager implementation can't be paired with an arbitrary Tracer implementation; and the setScopeManager() method is the means to achieving that goal.
  • People using defer()/Continuation should have a well-defined migration path to the new API. If they're willing to change import paths, the essence of the defer()/Continuation API can be implemented by a bridge that in turn supports the Scope-based API... at the very least, there must be an ultra-clear set of recipes to migrate common defer()/Continuation code to whatever specifically replaces it
  • And of course lots of "soft stuff"... out-of-date comments, a README to update, migration instructions from the defer()/Continuation world, and so on

I would like to volunteer to write a PR that incorporates the above (along with anything else that people want to add to the mix). That said, my calendar and email collude to prevent me from writing any actual code during the weeks these days (maybe this will change someday; I'd like to say so, but that may be naive!). But I will definitely participate in any reviewing / shepherding if others have cycles to make progress on the above in the meantime.

(cc @opentracing/otsc, @tylerbenson, @tedsuo)

@tedsuo tedsuo mentioned this pull request Sep 27, 2017
@tedsuo
Copy link
Member

tedsuo commented Sep 27, 2017

Closing this PR now that 0.31 is underway: #189

@tedsuo tedsuo closed this Sep 27, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.