Clock Skew Adjuster considered harmful #1459

Closed
eikemeier opened this issue Apr 4, 2019 · 17 comments · Fixed by #2513

@eikemeier

eikemeier commented Apr 4, 2019

Edit: duplicate of #961.

Requirement - what kind of business use case are you trying to solve?

The current behavior of clock skew adjustments is surprising and, while well-meant, not beneficial.

Problem - what in Jaeger blocks you from solving the requirement?

Excuse the pun, I'm aware of the existing issues.

The clock skew adjuster currently makes the following assumptions:

  1. Spans are from different hosts, with parent and child span having roughly the same duration.
  2. Clock skew is something unknown to people using monitoring
  3. It is best to hide clock skew and make things look “nice”, rather than relying on administrators to fix it and offering a best-guess workaround as an opt-in alternative

Considering 1.:

The algorithm uses the poorly documented ip tag of the process to distinguish between hosts and assumes that a missing tag indicates a different host.

People using OpenTracing or OpenCensus usually only know about the service name, so a missing ip tag is expected, meaning the algorithm compensates nearly all spans.

Then it compares durations, which only makes sense if one assumes that parent and child spans have roughly the same duration.

Most child spans are shorter, so this is rarely the case. If a child span is a few nanoseconds longer I don't care; if it is significantly longer it should be flagged, but not shifted. It's wrong data, and it is really hard to diagnose once it has been processed further.

Then the algorithm checks whether the end time of the child span is after the end time of the parent. While technically wrong, it is quite easy to close a child span a couple of nanoseconds too late; for tracing this is probably a non-issue, and at most something that should produce a warning instead of moving the span around.

Lastly, it assumes that the difference between parent and child span durations somehow corresponds to network latency.
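
To make this concrete, here is roughly what the heuristic boils down to. This is a simplified sketch in Go with illustrative names, not the actual Jaeger adjuster code:

```go
// Simplified sketch of the heuristic described above; NOT the actual Jaeger
// adjuster, names and details are illustrative only.
package main

import (
	"fmt"
	"time"
)

type span struct {
	hostIP   string // the process "ip" tag; often empty for OpenTracing/OpenCensus clients
	start    time.Time
	duration time.Duration
}

// skewDelta returns how far the child would be shifted so that it "fits"
// inside the parent. A missing or differing ip tag is treated as "different
// host", and the leftover parent time is assumed to be symmetric network latency.
func skewDelta(parent, child span) time.Duration {
	if parent.hostIP != "" && parent.hostIP == child.hostIP {
		return 0 // same host: clocks assumed to agree
	}
	if child.duration > parent.duration {
		return 0 // child cannot fit inside the parent at all
	}
	parentEnd := parent.start.Add(parent.duration)
	childEnd := child.start.Add(child.duration)
	if !child.start.Before(parent.start) && !childEnd.After(parentEnd) {
		return 0 // child already lies within the parent: nothing to adjust
	}
	// Treat the leftover parent duration as network latency, split evenly
	// between request and response, and shift the child to that position.
	latency := (parent.duration - child.duration) / 2
	return parent.start.Add(latency).Sub(child.start)
}

func main() {
	parent := span{start: time.Unix(0, 0), duration: 300 * time.Millisecond}
	child := span{start: time.Unix(0, 0).Add(400 * time.Millisecond), duration: 100 * time.Millisecond}
	fmt.Println(skewDelta(parent, child)) // -300ms: the child gets pulled back inside the parent
}
```

Note how the decision to shift hinges entirely on the ip tag comparison and the duration assumption; nothing in the data says whether the clocks were actually skewed.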

Proposal - what do you suggest to solve the problem or improve the existing situation?

If clock skew is an issue, the span context should try to compensate.

Considering 2. and 3., it would IMHO be best to not mess with the data by default and offer beautification as an opt-in workaround, as opposed to the current solution of turning it on by default and then searching for a way to get rid of it.

Also, it might be useful to see clock skew by default, because it may affect other areas too.

Any open questions to address

I saw very few comments where this skew adjustment is actually wanted. Maybe some of the supporters can chime in?

@jpkrohling
Contributor

best to not mess with the data by default and offer beautification as an opt-in workaround

+1

Also, it might be useful to see clock skew by default, because it may affect other areas too.

As someone looking for the root cause of a problem, how would you like this info to be presented to you?

I saw very few comments where this skew adjustment is actually wanted.

From what I remember, this is particularly useful for mobile clients, where the clock might be really off. Perhaps clients where the clock might be off should be expected to send a hint within the span?

@eikemeier
Author

eikemeier commented Apr 7, 2019

Also, it might be useful to see clock skew by default, because it may affect other areas too.

As someone looking for the root cause of a problem, how would you like this info to be presented to you?

In Java, for example, you would have all kinds of problems. So e.g. marking such a span in red would be fine. This should be rare, which means I probably wouldn't care about nanosecond differences, especially for spans that are closed slightly too late.

It's like having log entries (which usually have timestamps) and seeing that B happens before A, although that is impossible. Pointing this out is usually sufficient.

I saw very few comments where this skew adjustment is actually wanted.

From what I remember, this is particularly useful for mobile clients, where the clock might be really off. Perhaps clients where the clock might be off should be expected to send a hint within the span?

Yes, that would be “opt-in”. One could also add a “ping collector” command, which could give an idea about the clock difference and latency, assuming there is no jitter or other factors. I'm unsure if this is really worth the effort.
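
For illustration, such a ping could estimate the offset the way NTP does, assuming symmetric latency. This is a hypothetical sketch; the function and setup are made up:

```go
// Hypothetical sketch of the "ping collector" idea: estimate the
// client/collector clock offset NTP-style, assuming symmetric network latency.
package main

import (
	"fmt"
	"time"
)

// estimateOffset takes four timestamps:
//   t1 = client sends ping      (client clock)
//   t2 = collector receives it  (collector clock)
//   t3 = collector sends reply  (collector clock)
//   t4 = client receives reply  (client clock)
// and returns the estimated clock offset and round-trip delay.
func estimateOffset(t1, t2, t3, t4 time.Time) (offset, delay time.Duration) {
	offset = (t2.Sub(t1) + t3.Sub(t4)) / 2
	delay = t4.Sub(t1) - t3.Sub(t2)
	return offset, delay
}

func main() {
	// Client clock runs 2 s ahead of the collector; one-way latency is 50 ms.
	base := time.Unix(1_000_000, 0)
	t1 := base.Add(2 * time.Second)                      // client send
	t2 := base.Add(50 * time.Millisecond)                // collector receive
	t3 := base.Add(60 * time.Millisecond)                // collector reply
	t4 := base.Add(2*time.Second + 110*time.Millisecond) // client receive

	offset, delay := estimateOffset(t1, t2, t3, t4)
	fmt.Println(offset, delay) // -2s (client is ahead), 100ms round trip
}
```

Jitter or asymmetric routes would throw the estimate off, which is part of why I'm unsure it is worth the effort.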

Still, auto-fixing might be strange too, since you'll always run into the “B before A” case. And the algorithm does not seem to be very successful for mobile clients; see issue #722.

@TBBle

TBBle commented Feb 5, 2020

After jaegertracing/jaeger-operator#871, we now have a process tag host.ip added by Jaeger Operator-injected Agents, which would be a better input for the Clock Skew Adjuster. The process-local ip tag added by the Jaeger client refers to the address inside a Kubernetes container, and hence won't match other containers sharing a real-time clock, leading to Clock Skew Adjuster misbehaviour.

So it might be beneficial, for this direction of improvement, to make it possible to configure the tag used to identify "same host as parent".

@eikemeier
Author

eikemeier commented Feb 5, 2020

So it might be beneficial, for this direction of improvement, to make it possible to configure the tag used to identify "same host as parent".

We keep getting comments like “I spent a few hours trying to figure out ...”. The point of this issue is that the default behaviour is unexpected, confusing and harmful.

If anyone wants adjustments, fine. But this should be opt-in and off by default. “Better” unexpected and a little less confusing adjustments are still harmful…

@pavolloffay
Member

It seems like a controversial feature. We should consider making it configurable.

Would it also help people if we were labeling the spans which are candidates for clock adjusting?

@TBBle

TBBle commented Feb 5, 2020

I was definitely one of those "I spent a few hours" people. In my case, I did have clock skew, but the lack of an ip tag (we were using the OpenCensus library) meant that instead of seeing the two hosts' spans shift relative to each other, the whole pile shifted. #787 made this very hard to diagnose, as even the raw data was showing time-travel effects.

I'd very much like to see it made turn-off-able, but I don't feel strongly that it should be burned to the ground. And if it's not removed entirely, making it more useful for environments where gethostinfo('0.0.0.0') doesn't actually give a unique 1:1 mapping with a real-time clock would be useful too.

And it's probably lower-hanging fruit, in terms of getting a PR up and delivered, to disable the feature entirely or change it to off-by-default? Or maybe not; I haven't really looked at the Jaeger codebase.

I also have a use case for the Clock Skew Adjuster, since I have a down-the-track need to collect spans from web-based JavaScript pages, and even from other random PC users. And experience suggests "random PC users" are awful at keeping their PC clocks accurate.

@eikemeier
Author

I'd very much like to see it made turn-off-able, but I don't feel strongly that it should be burned to the ground. And if it's not removed entirely, making it more useful for environments where gethostinfo('0.0.0.0') doesn't actually give a unique 1:1 mapping with a real-time clock would be useful too.

Again, I do not vote for “burning to the ground”. It might help people, and thanks to the guys developing and improving it.

The point I'm trying to make is that a (changing) heuristic for how to improve faulty data can be helpful, but it should be off by default and opt-in, nothing more.

You have some machines with broken clocks which you can't fix - fine. The problem is that you don't know how broken they are. Instead of a simple skew it might be jumping - this happens.

So, for people who are sure it is a simple skew but are not able to fix it: turn it on, it might help.

For everyone else: it might hide your problem or damage otherwise perfectly correct data. And this is what I'm seeing in most cases.

Also, as mentioned above, it destroys data when the real cause is just a programming error. The heuristic could be improved, but this experience should tell you that using it by default is not a very good idea: it is hard enough to understand a system as is, and this adjustment makes it even harder.

@flixr

flixr commented Feb 5, 2020

I also want to add that we ran into this problem even though all the services were running on the same host (in containers). We initially had the IP field empty and it was adjusted, but using the IP from inside the container would not have helped either. As a workaround, we have for now hardcoded all spans to report the same IP.

It seems that in our case the clock skew adjustment happened because:

  1. the IP was unset
  2. we send spans as JSON in Zipkin format
  3. the Zipkin format uses timestamps in microseconds, and due to rounding errors (the elapsed steady-clock time is added to the span's system-clock start time) the child spans in some cases finished a microsecond after the parent span, which resulted in a completely unusable adjustment (see the sketch below)...
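
A toy example of how that rounding can flip the ordering. The numbers are made up for illustration, not our actual timestamps:

```go
// Toy illustration of the rounding issue above (made-up numbers). Start
// timestamps come from the system clock, durations from a steady clock, and
// the Zipkin JSON carries whole microseconds, rounded independently.
package main

import "fmt"

// toMicros converts nanoseconds to whole microseconds, rounding half-up.
func toMicros(ns int64) int64 { return (ns + 500) / 1000 }

func main() {
	// True values in nanoseconds: the child really finishes 200 ns BEFORE the parent.
	parentStart, parentDur := int64(1_000_000_400), int64(1_000_400) // true end: 1_001_000_800
	childStart, childDur := int64(1_000_000_600), int64(1_000_000)   // true end: 1_001_000_600

	// Start and duration are rounded separately before being summed, so the
	// rounding errors can add up to a full microsecond.
	parentEnd := toMicros(parentStart) + toMicros(parentDur) // 1_000_000 + 1_000 = 1_001_000 µs
	childEnd := toMicros(childStart) + toMicros(childDur)    // 1_000_001 + 1_000 = 1_001_001 µs

	fmt.Println(childEnd > parentEnd) // true: one microsecond of phantom "skew"
}
```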

@bobrik
Contributor

bobrik commented Feb 5, 2020

How about the following?

  • If there is no skew detected, do nothing
  • If skew is detected, do not adjust it automatically:
    • UI shows a clear warning at the top and in adjustable spans
    • There's a toggle to enable adjustment next to a warning
    • Have an API param to enable adjustments

This way it's a no-op for good spans, and for bad ones the user gets the choice, which points either at an issue in their tracing or at clock skew.

@jpkrohling
Contributor

I added this issue as a topic for the next bi-weekly. Perhaps we can move forward while discussing this with everyone in the same "room": https://docs.google.com/document/d/1ZuBAwTJvQN7xkWVvEFXj5WU9_JmS5TPiNbxCJSvPqX0/edit#heading=h.u5x27f1oevg9

@yurishkuro
Member

We're currently limited by the frontend. A simple temporary solution, for people who are particularly annoyed by the adjustment, is to add a CLI flag to permanently disable it, until someone contributes a corresponding UI change to toggle it on and off.

@jpkrohling
Contributor

is to add a CLI flag to permanently disable it

Given that this feature breaks most people's expectations, shouldn't it be the other way around, even though it means a change in the current behavior?

@yurishkuro
Member

Might be OK too. I think if we go with such a change, we still need to run the adjustment logic but only record a warning saying "had the adjustment been enabled, it would have moved the timestamps by (delta)".

@yurishkuro
Member

Upon more thought, we may want a three-valued flag: off (default), warn, on (current behavior).
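
Roughly something like this. A hypothetical sketch; the names are illustrative, not an actual Jaeger flag or API:

```go
// Hypothetical sketch of a three-valued clock skew setting; illustrative only.
package main

import (
	"fmt"
	"time"
)

type skewMode int

const (
	skewOff  skewMode = iota // default: leave timestamps untouched
	skewWarn                 // run the logic, only attach a warning with the delta
	skewOn                   // current behavior: shift the timestamps
)

type span struct {
	start    time.Time
	warnings []string
}

// applySkew applies a previously computed adjustment delta according to the mode.
func applySkew(s *span, delta time.Duration, mode skewMode) {
	if delta == 0 {
		return
	}
	switch mode {
	case skewWarn:
		s.warnings = append(s.warnings,
			fmt.Sprintf("clock skew adjustment disabled; would have shifted the span by %v", delta))
	case skewOn:
		s.start = s.start.Add(delta)
	}
}

func main() {
	s := &span{start: time.Unix(0, 0)}
	applySkew(s, -300*time.Millisecond, skewWarn)
	fmt.Println(s.start, s.warnings) // timestamp unchanged, warning recorded
}
```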

@nvx

nvx commented Aug 28, 2020

Just in case it's not apparent to someone searching (like me earlier), PR #2119 added the ability to set --query.max-clock-skew-adjustment=0s, which has the effect of disabling adjustments.

@jpkrohling
Contributor

I think we should really consider setting the default value for the clock-skew adjustment to 0s for Jaeger 1.21.x or 2.x. It's simply wrong in all async scenarios. Here is an example, which looks very similar to the plentiful other examples we've seen related to this issue:

[screenshot: example trace where the adjustment was applied to async spans]

Clocks out of sync are a real problem, but I'm afraid we have more async spans out there than server clocks out of sync.

@yurishkuro
Member

Fine with me, we just need to mark it as a breaking change in the changelog.
