Clock Skew Adjuster considered harmful #1459

Closed
eikemeier opened this issue Apr 4, 2019 · 17 comments · Fixed by #2513

@eikemeier

eikemeier commented Apr 4, 2019

Edit: duplicate of #961.

Requirement - what kind of business use case are you trying to solve?

The current behavior of clock skew adjustments is surprising and, while well-meant, not beneficial.

Problem - what in Jaeger blocks you from solving the requirement?

Excuse the pun, I'm aware of the existing issues.

The clock skew adjuster currently makes the following assumptions:

  1. Spans are from different hosts, with parent and child span having roughly the same duration.
  2. Clock skew is something unknown to people using monitoring
  3. It is best to hide clock skew and make things look “nice”, rather than relying on administrators to fix it and offering a best-guess workaround as an opt-in alternative

Considering 1.:

The algorithm uses the poorly documented ip tag of the process to distinguish between hosts and assumes that a missing tag indicates a different host.

People using OpenTracing or OpenCensus usually only know about the service name, so a missing ip tag is expected, meaning the algorithm compensates nearly all spans.

Then it compares durations, which only makes sense if one assumes that parent and child spans have roughly the same duration.

Most child spans are shorter, so this is rarely the case. If a child span is a few nanoseconds longer I don't care; if it is significantly longer it should be flagged, but not shifted. It's wrong data, and it is really hard to diagnose once it has been processed further.

Then the algorithm checks whether the end time of the child span is after the end time of the parent. While technically wrong, it is quite easy to close a child span a couple of nanoseconds too late; for tracing this is probably a non-issue, and at most something that should produce a warning instead of moving the span around.

Lastly, it assumes that the difference between parent and child span durations somehow corresponds to network latency.
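
To make this concrete, here is roughly what the heuristic boils down to. This is a simplified sketch in Go with illustrative names, not the actual Jaeger adjuster code:

```go
// Simplified sketch of the heuristic described above; NOT the actual Jaeger
// adjuster, names and details are illustrative only.
package main

import (
	"fmt"
	"time"
)

type span struct {
	hostIP   string // the process "ip" tag; often empty for OpenTracing/OpenCensus clients
	start    time.Time
	duration time.Duration
}

// skewDelta returns how far the child would be shifted so that it "fits"
// inside the parent. A missing or differing ip tag is treated as "different
// host", and the leftover parent time is assumed to be symmetric network latency.
func skewDelta(parent, child span) time.Duration {
	if parent.hostIP != "" && parent.hostIP == child.hostIP {
		return 0 // same host: clocks assumed to agree
	}
	if child.duration > parent.duration {
		return 0 // child cannot fit inside the parent at all
	}
	parentEnd := parent.start.Add(parent.duration)
	childEnd := child.start.Add(child.duration)
	if !child.start.Before(parent.start) && !childEnd.After(parentEnd) {
		return 0 // child already lies within the parent: nothing to adjust
	}
	// Treat the leftover parent duration as network latency, split evenly
	// between request and response, and shift the child to that position.
	latency := (parent.duration - child.duration) / 2
	return parent.start.Add(latency).Sub(child.start)
}

func main() {
	parent := span{start: time.Unix(0, 0), duration: 300 * time.Millisecond}
	child := span{start: time.Unix(0, 0).Add(400 * time.Millisecond), duration: 100 * time.Millisecond}
	fmt.Println(skewDelta(parent, child)) // -300ms: the child gets pulled back inside the parent
}
```

Note how the decision to shift hinges entirely on the ip tag comparison and the duration assumption; nothing in the data says whether the clocks were actually skewed.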

Proposal - what do you suggest to solve the problem or improve the existing situation?

If clock skew is an issue, the span context should try to compensate.

Considering 2. and 3., it would IMHO be best to not mess with the data by default and offer beautification as an opt-in workaround, as opposed to the current solution of turning it on by default and then searching for a way to get rid of it.

Also, it might be useful to see clock skew by default, because it may affect other areas too.

Any open questions to address

I saw very few comments where this skew adjustment is actually wanted. Maybe some of the supporters can chime in?

@jpkrohling
Contributor

best to not mess with the data by default and offer beautification as an opt-in workaround

+1

Also, it might be useful to see clock skew by default, because it may affect other areas too.

As someone looking for the root cause of a problem, how would you like this info to be presented to you?

I saw very few comments where this skew adjustment is actually wanted.

From what I remember, this is particularly useful for mobile clients, where the clock might be really off. Perhaps clients where the clock might be off should be expected to send a hint within the span?

@eikemeier
Author

eikemeier commented Apr 7, 2019

Also, it might be useful to see clock skew by default, because it may affect other areas too.

As someone looking for the root cause of a problem, how would you like this info to be presented to you?

In Java, for example, you would have all kinds of problems. So e.g. marking such a span in red would be fine. This should be rare, which means I probably wouldn't care about nanosecond differences, especially for spans that are closed slightly too late.

It's like having log entries (which usually have timestamps) and seeing that B happens before A, although that is impossible. Pointing this out is usually sufficient.

I saw very few comments where this skew adjustment is actually wanted.

From what I remember, this is particularly useful for mobile clients, where the clock might be really off. Perhaps clients where the clock might be off should be expected to send a hint within the span?

Yes, that would be “opt-in”. One could also add a “ping collector” command, which could give an idea about the clock difference and latency, assuming there is no jitter or other factors. I'm unsure if this is really worth the effort.
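
For illustration, such a ping could estimate the offset the way NTP does, assuming symmetric latency. This is a hypothetical sketch; the function and setup are made up:

```go
// Hypothetical sketch of the "ping collector" idea: estimate the
// client/collector clock offset NTP-style, assuming symmetric network latency.
package main

import (
	"fmt"
	"time"
)

// estimateOffset takes four timestamps:
//   t1 = client sends ping      (client clock)
//   t2 = collector receives it  (collector clock)
//   t3 = collector sends reply  (collector clock)
//   t4 = client receives reply  (client clock)
// and returns the estimated clock offset and round-trip delay.
func estimateOffset(t1, t2, t3, t4 time.Time) (offset, delay time.Duration) {
	offset = (t2.Sub(t1) + t3.Sub(t4)) / 2
	delay = t4.Sub(t1) - t3.Sub(t2)
	return offset, delay
}

func main() {
	// Client clock runs 2 s ahead of the collector; one-way latency is 50 ms.
	base := time.Unix(1_000_000, 0)
	t1 := base.Add(2 * time.Second)                      // client send
	t2 := base.Add(50 * time.Millisecond)                // collector receive
	t3 := base.Add(60 * time.Millisecond)                // collector reply
	t4 := base.Add(2*time.Second + 110*time.Millisecond) // client receive

	offset, delay := estimateOffset(t1, t2, t3, t4)
	fmt.Println(offset, delay) // -2s (client is ahead), 100ms round trip
}
```

Jitter or asymmetric routes would throw the estimate off, which is part of why I'm unsure it is worth the effort.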

Still, auto-fixing might be strange too, since you'll always run into the “B before A” case. And the algorithm does not seem to be very successful for mobile clients; see issue #722.

@TBBle

TBBle commented Feb 5, 2020

After jaegertracing/jaeger-operator#871, we now have a process tag host.ip added by Jaeger Operator-injected Agents, which would be a better input for the Clock Skew Adjuster. The process-local ip tag added by the Jaeger client refers to the address inside a Kubernetes container, and hence won't match other containers sharing a real-time clock, leading to Clock Skew Adjuster misbehaviour.

So it might be beneficial, for this direction of improvement, to make it possible to configure the tag used to identify "same host as parent".

@eikemeier
Author

eikemeier commented Feb 5, 2020

So it might be beneficial, for this direction of improvement, to make it possible to configure the tag used to identify "same host as parent".

We keep getting comments like “I spent a few hours trying to figure out ...”. The point of this issue is that the default behaviour is unexpected, confusing and harmful.

If anyone wants adjustments, fine. But this should be opt-in and off by default. “Better” unexpected and a little less confusing adjustments are still harmful…

@pavolloffay
Member

It seems like a controversial feature. We should consider making it configurable.

Would it also help people if we were labeling the spans which are candidates for clock adjusting?

@TBBle

TBBle commented Feb 5, 2020

I was definitely one of those "I spent a few hours" people. In my case, I did have clock skew, but the lack of an ip tag (we were using the OpenCensus library) meant that instead of seeing the two hosts' spans shift relative to each other, the whole pile shifted. #787 made this very hard to diagnose, as even the raw data was showing time-travel effects.

I'd very much like to see it made turn-off-able, but I don't feel strongly that it should be burned to the ground. And if it's not removed entirely, making it more useful for environments where gethostinfo('0.0.0.0') doesn't actually give a unique 1:1 mapping with a real-time clock would be useful too.

And it's probably lower-hanging fruit, in terms of getting a PR up and delivered, to disable the feature entirely or change it to off-by-default? Or maybe not; I haven't really looked at the Jaeger codebase.

I also have a use case for the Clock Skew Adjuster, since I have a down-the-track need to collect spans from web-based JavaScript pages, and even from other random PC users. And experience suggests "random PC users" are awful at keeping their PC clocks accurate.

@eikemeier
Author

I'd very much like to see it made turn-off-able, but I don't feel strongly that it should be burned to the ground. And if it's not removed entirely, making it more useful for environments where gethostinfo('0.0.0.0') doesn't actually give a unique 1:1 mapping with a real-time clock would be useful too.

Again, I do not vote for “burning to the ground”. It might help people, and thanks to the guys developing and improving it.

The point I'm trying to make is that a (changing) heuristic for how to improve faulty data can be helpful, but it should be off by default and opt-in, nothing more.

You have some machines with broken clocks which you can't fix - fine. The problem is that you don't know how broken they are. Instead of a simple skew it might be jumping - this happens.

So, for people who are sure it is a simple skew but are not able to fix it: turn it on, it might help.

For everyone else: it might hide your problem or damage otherwise perfectly correct data. And this is what I'm seeing in most cases.

Also, as mentioned above, it destroys data when the real cause is just a programming error. The heuristic could be improved, but this experience should tell you that using it by default is not a very good idea: it is hard enough to understand a system as is, and this adjustment makes it even harder.

@flixr

flixr commented Feb 5, 2020

I also want to add that we ran into this problem even though all the services were running on the same host (in containers). We initially had the IP field empty and it was adjusted, but using the IP from inside the container would not have helped either. As a workaround, we have for now hardcoded all spans to report the same IP.

It seems that in our case the clock skew adjustment happened because:

  1. the IP was unset
  2. we send spans as JSON in Zipkin format
  3. the Zipkin format uses timestamps in microseconds, and due to rounding errors (the elapsed steady-clock time is added to the span's system-clock start time) the child spans in some cases finished a microsecond after the parent span, which resulted in a completely unusable adjustment (see the sketch below)...
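
A toy example of how that rounding can flip the ordering. The numbers are made up for illustration, not our actual timestamps:

```go
// Toy illustration of the rounding issue above (made-up numbers). Start
// timestamps come from the system clock, durations from a steady clock, and
// the Zipkin JSON carries whole microseconds, rounded independently.
package main

import "fmt"

// toMicros converts nanoseconds to whole microseconds, rounding half-up.
func toMicros(ns int64) int64 { return (ns + 500) / 1000 }

func main() {
	// True values in nanoseconds: the child really finishes 200 ns BEFORE the parent.
	parentStart, parentDur := int64(1_000_000_400), int64(1_000_400) // true end: 1_001_000_800
	childStart, childDur := int64(1_000_000_600), int64(1_000_000)   // true end: 1_001_000_600

	// Start and duration are rounded separately before being summed, so the
	// rounding errors can add up to a full microsecond.
	parentEnd := toMicros(parentStart) + toMicros(parentDur) // 1_000_000 + 1_000 = 1_001_000 µs
	childEnd := toMicros(childStart) + toMicros(childDur)    // 1_000_001 + 1_000 = 1_001_001 µs

	fmt.Println(childEnd > parentEnd) // true: one microsecond of phantom "skew"
}
```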

@bobrik
Contributor

bobrik commented Feb 5, 2020

How about the following?

  • If there is no skew detected, do nothing
  • If skew is detected, do not adjust it automatically:
    • UI shows a clear warning at the top and in adjustable spans
    • There's a toggle to enable adjustment next to a warning
    • Have an API param to enable adjustments

This way it's a no-op for good spans, and for bad ones the user gets the choice, which points either at an issue in their tracing or at clock skew.

@jpkrohling
Contributor

I added this issue as a topic for the next bi-weekly. Perhaps we can move forward while discussing this with everyone in the same "room": https://docs.google.com/document/d/1ZuBAwTJvQN7xkWVvEFXj5WU9_JmS5TPiNbxCJSvPqX0/edit#heading=h.u5x27f1oevg9

@yurishkuro
Member

We're currently limited by the frontend. A simple temporary solution, for people who are particularly annoyed by the adjustment, is to add a CLI flag to permanently disable it, until someone contributes a corresponding UI change to toggle it on and off.

@jpkrohling
Contributor

is to add a CLI flag to permanently disable it

Given that this feature breaks most people's expectations, shouldn't it be the other way around, even though it means a change in the current behavior?

@yurishkuro
Member

Might be OK too. I think if we go with such a change, we still need to run the adjustment logic but only record a warning saying "had the adjustment been enabled, it would have moved the timestamps by (delta)".

@yurishkuro
Member

Upon more thought, we may want a three-valued flag: off (default), warn, on (current behavior).
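
Roughly something like this. A hypothetical sketch; the names are illustrative, not an actual Jaeger flag or API:

```go
// Hypothetical sketch of a three-valued clock skew setting; illustrative only.
package main

import (
	"fmt"
	"time"
)

type skewMode int

const (
	skewOff  skewMode = iota // default: leave timestamps untouched
	skewWarn                 // run the logic, only attach a warning with the delta
	skewOn                   // current behavior: shift the timestamps
)

type span struct {
	start    time.Time
	warnings []string
}

// applySkew applies a previously computed adjustment delta according to the mode.
func applySkew(s *span, delta time.Duration, mode skewMode) {
	if delta == 0 {
		return
	}
	switch mode {
	case skewWarn:
		s.warnings = append(s.warnings,
			fmt.Sprintf("clock skew adjustment disabled; would have shifted the span by %v", delta))
	case skewOn:
		s.start = s.start.Add(delta)
	}
}

func main() {
	s := &span{start: time.Unix(0, 0)}
	applySkew(s, -300*time.Millisecond, skewWarn)
	fmt.Println(s.start, s.warnings) // timestamp unchanged, warning recorded
}
```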

@nvx

nvx commented Aug 28, 2020

Just in case it's not apparent to someone searching (like me earlier), PR #2119 added the ability to set --query.max-clock-skew-adjustment=0s, which has the effect of disabling adjustments.

@jpkrohling
Contributor

I think we should really consider setting the default value for the clock-skew adjustment to 0s for Jaeger 1.21.x or 2.x. It's simply wrong in all async scenarios. Here is an example, which looks very similar to the plentiful other examples we've seen related to this issue:

[screenshot: example trace where the adjustment was applied to async spans]

Clocks out of sync are a real problem, but I'm afraid we have more async spans out there than server clocks out of sync.

@yurishkuro
Member

Fine with me, we just need to mark it as a breaking change in the changelog.
