[Question] Additional analysis view for detailed trace overview and span statistics #1779

Open
mariusoe opened this issue Sep 6, 2019 · 8 comments

@mariusoe

mariusoe commented Sep 6, 2019

Hi,

My team at Novatec and the inspectIT project and I are currently working on a new section for Jaeger's "trace view" (most of the work is actually being done by @fylip97). It can be used to analyse the spans contained in a trace and to display aggregated statistics based on them. Basically, it is a different view of a trace: a table showing various statistics.

Please have a look at the attached image for a better understanding.

[Image: jaeger_trace_overview]

We would like to contribute this as a pull request to the Jaeger repository. Before we start cleaning up the code and creating the pull request, I wanted to know whether you are generally interested in such a view.

All the best,
Marius

@yurishkuro
Member

Hi @mariusoe,

it looks useful, thanks for posting. I have a few questions:

  • Could you explain what the columns mean, specifically "Exc"?
  • How do you interpret the first column? It looks like you group by span name and then by service name, which feels odd: aggregating endpoints across different services, even if the endpoints have the same name, is kind of meaningless.
  • What user stories do you have in mind for this view? What kinds of problems has it helped you to solve? (if we add this view we should have some documentation explaining why)

@Sasasu

Sasasu commented Sep 8, 2019

I think "Exc" means the time spent in this service itself: total time = exc time + other time not in my code.

I still have a question about the first column, but I believe that sharing the topology would help me understand it.

@AlexanderWert

AlexanderWert commented Sep 9, 2019

Hi @yurishkuro , Hi @Sasasu ,

thanks for your feedback and questions!

  1. Exc. Time means exclusive time. Exc. Time = Total Time - Sum(Total time of all blocking, directly called child spans). This is, of course, only applicable to synchronous / blocking span execution. In that case, however, our customer experience shows that it is a very helpful metric for quickly identifying hot spots on the critical path. So this metric is closely related to the discussion in: Sync and Async children (FOLLOWS_FROM) open-telemetry/opentelemetry-specification#65

  2. Regarding the grouping in the first column, I fully agree that we should do the grouping the other way round (first by service, then by span name). We will change this!

  3. Regarding user stories: We are using Jaeger not only in microservice environments but also for instrumenting old-school Java Enterprise applications at customer sites. In both "worlds" (JEE and microservices), such a view helps to quickly analyze and identify root causes and anti-patterns. For instance, if you have many spans with the same name on the same service (for instance a database call), where each span is very quick but the sum dominates the response time, this is an indicator of an n+1 anti-pattern. Or, if there is a span with a high exclusive time, this is an indicator that this span dominates the duration of the trace. And so on.
    We are even thinking about providing some automatic evaluation of such heuristics, to provide an additional view that would point directly to potential hot spots, anti-patterns, etc., so users won't need to manually click through the tree structure of a huge trace.
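
The exclusive-time definition from point 1 can be sketched roughly as follows. This is only an illustration, not the code of the proposed view: the `Span` record and its `blocking` flag are hypothetical, since real traces carry no unambiguous sync/async marker, and the result is clamped at zero in case children overlap or are flagged incorrectly.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical, simplified span record; not Jaeger's actual data model.
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_us: int       # total time of the span in microseconds
    blocking: bool = True  # assumed flag; real traces do not carry this


def exclusive_times(spans: List[Span]) -> Dict[str, int]:
    """Exclusive time per span id: total time minus blocking direct children."""
    child_total: Dict[str, int] = {}
    for s in spans:
        if s.parent_id is not None and s.blocking:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0) + s.duration_us
    # Clamp at zero: overlapping or mis-flagged children could go negative.
    return {
        s.span_id: max(0, s.duration_us - child_total.get(s.span_id, 0))
        for s in spans
    }


trace = [
    Span("a", None, "frontend", "GET /order", 100),
    Span("b", "a", "orders", "loadOrder", 60),
    Span("c", "b", "db", "SELECT", 25),
    Span("d", "a", "audit", "log", 30, blocking=False),  # async child, not subtracted
]
result = exclusive_times(trace)
```

Here span `a` keeps 40 of its 100 µs, because only the blocking child `b` is subtracted; the non-blocking child `d` is ignored.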

We are aware that the exclusive time in particular is not fully clean yet, since there is no unambiguous indicator of whether a span was called synchronously or asynchronously. For now, we use heuristics for this.

Regarding OpenTelemetry, I think it would be a great improvement, and would help analysis a lot, if something like a sync / async (blocking / non-blocking) flag could be introduced to the span data model.
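
To make the grouping from point 2 and the n+1 indicator from point 3 more concrete, here is a minimal sketch. The span shape (a plain tuple), the function names, and the thresholds `min_count` and `min_share` are all assumptions chosen for illustration; they are not taken from the actual implementation.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each span is a (service, operation, duration_us) tuple; an assumed, simplified shape.
SpanRow = Tuple[str, str, int]
GroupKey = Tuple[str, str]


def aggregate(spans: List[SpanRow]) -> Dict[GroupKey, Dict[str, float]]:
    """Group by service first, then by span name, and compute basic statistics."""
    groups: Dict[GroupKey, List[int]] = defaultdict(list)
    for service, operation, duration_us in spans:
        groups[(service, operation)].append(duration_us)
    return {
        key: {"count": len(d), "total": sum(d), "avg": sum(d) / len(d), "max": max(d)}
        for key, d in groups.items()
    }


def n_plus_one_candidates(
    stats: Dict[GroupKey, Dict[str, float]],
    trace_duration_us: int,
    min_count: int = 10,     # arbitrary illustrative threshold
    min_share: float = 0.5,  # arbitrary illustrative threshold
) -> List[GroupKey]:
    """Flag groups with many spans whose summed time dominates the trace."""
    return [
        key
        for key, s in stats.items()
        if s["count"] >= min_count and s["total"] / trace_duration_us >= min_share
    ]


spans = [("db", "SELECT item", 5)] * 12 + [
    ("orders", "loadOrder", 90),
    ("db", "SELECT order", 8),
]
stats = aggregate(spans)
flagged = n_plus_one_candidates(stats, trace_duration_us=100)
```

In this made-up trace, the twelve 5 µs `SELECT item` calls add up to 60% of the 100 µs trace, so that group is the only one flagged. A real heuristic would probably also check that each individual span is short.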

@yurishkuro
Member

Re "Exc" - we refer to it as "self-time", e.g. in the graph view of a trace

[Image: Screen Shot 2019-09-12 at 6 33 13 PM]

@vprithvi
Contributor

@AlexanderWert I'm curious - it seems like this table is a proxy for identifying spans that are in the critical path of requests; are the aggregate latency numbers useful on their own, or are they useful mainly to provide ordering?

Could visualizing the critical path in the trace view be better for your use case?

For example, see the following screenshot with the critical path highlighted in an ugly red marker.
An advantage is that one can clearly see where in the lifecycle of the request exclusive time is going, for instance, in the first frontend span, it is only at the tail end.

[Image: embed-trace-view-with-back-button]

@yurishkuro
Member

I find the numerical view also useful, especially in a large trace where visualizing critical path might be difficult, while the table can be easily sorted by the maximum impact of self time.
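
As a small illustration of that sorting, assuming each table row is a plain record with a `self_time_us` field (the field names and values here are made up, not the view's actual columns):

```python
from typing import Any, Dict, List

# Hypothetical table rows; field names are assumptions for illustration only.
rows: List[Dict[str, Any]] = [
    {"service": "db", "operation": "SELECT", "self_time_us": 25},
    {"service": "frontend", "operation": "GET /order", "self_time_us": 40},
    {"service": "audit", "operation": "log", "self_time_us": 30},
    {"service": "orders", "operation": "loadOrder", "self_time_us": 35},
]


def top_by_self_time(table: List[Dict[str, Any]], limit: int = 3) -> List[Dict[str, Any]]:
    """Return the rows with the largest self time, highest first."""
    return sorted(table, key=lambda r: r["self_time_us"], reverse=True)[:limit]


ranked = top_by_self_time(rows)
```

Even in a trace with thousands of spans, the first few rows of such a ranking show where the self time goes.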

@everett980
Contributor

The trace graph is still in an "Experimental" state, but does support coloring based on self time. Refactoring that or adding a button to color based on self/total time as a percentage of parent time may be an intuitive way to surface the critical path.

@AntPeixe

Quite a bump... but has there been progress on this? Is there a plan to add the detailed trace overview?

I'm currently using the all-in-one installation for test/dev and would appreciate this.


7 participants