[Question] Additional analysis view for detailed trace overview and span statistics #1779

mariusoe · 2019-09-06T13:39:51Z

Hi,

My team and I at Novatec and the inspectIT project are currently working on a new section for Jaeger's "trace view" (most of the work is actually being done by @fylip97). It can be used to analyse the spans contained in a trace and to display aggregated statistics based on them. Basically, it is a different view of a trace: a table showing various statistics.

Please have a look at the attached image for better understanding.

We would like to contribute this as a pull request to the Jaeger repository. But before we start cleaning up the code and creating the pull request, I wanted to know whether you are generally interested in such a view.

All the best,
Marius

yurishkuro · 2019-09-07T14:44:50Z

Hi @mariusoe,

it looks useful, thanks for posting. I have a few questions:

Could you explain what the columns mean, specifically "Exc"?
How do you interpret the first column? It looks like you group by the span name and then by service name, which feels kind of odd, because aggregating endpoints across different services, even if the endpoints are named the same, is kind of meaningless.
What user stories do you have in mind for this view? What kinds of problems has it helped you to solve? (if we add this view we should have some documentation explaining why)

Sasasu · 2019-09-08T12:40:50Z

I think "Exc" means the time spent in this service itself: total time = exc time + other time not spent in my code.

I still have a question about the first column, though. I believe sharing the topology would help me understand it.

AlexanderWert · 2019-09-09T05:37:58Z

Hi @yurishkuro , Hi @Sasasu ,

thanks for your feedback and questions!

Exc. Time means exclusive time: Exc. Time = Total Time - Sum(Total Time of all blocking, directly called child spans). This is of course only applicable to synchronous / blocking span execution. However, in our customer experience it is a very helpful metric for quickly identifying hot spots on the critical path. This metric is therefore closely related to the discussion in: Sync and Async children (FOLLOWS_FROM) open-telemetry/opentelemetry-specification#65
Regarding the grouping in the first column, I fully agree that we should do the grouping the other way round (first by service, then by span name). We will change this!
Regarding user stories: We are using Jaeger not only in microservice environments but also for instrumenting old-school Java Enterprise applications at customer sites. In both "worlds" (JEE and microservices) such a view helps to quickly analyze and identify root causes and anti-patterns. For instance, if you have many spans with the same name on the same service (for instance a database call) where each span is very quick but the sum dominates the response time, this is an indicator of an N+1 anti-pattern. Or, if there is a span with a high exclusive time, this is an indicator that this span dominates the duration of the trace. And so on.
We are even thinking about providing some automatic evaluation of such heuristics in an additional view that would point directly to potential hot spots, anti-patterns, etc., so users won't need to manually click through the tree structure of a huge trace.
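
To make the points above concrete, here is a minimal Python sketch of the exclusive-time formula, the grouping by service and then span name, and a simple N+1 check. It is an illustration only, not the code of the proposed view: the `Span` class, its `blocking` flag, the field names, and the `min_count` / `min_share` thresholds are all assumptions made for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_us: int
    blocking: bool = True  # hypothetical: spans carry no such flag today


def exclusive_times(spans):
    """Exc. Time = Total Time - sum(Total Time of blocking direct children)."""
    blocking_children = defaultdict(int)
    for span in spans:
        if span.parent_id is not None and span.blocking:
            blocking_children[span.parent_id] += span.duration_us
    return {s.span_id: s.duration_us - blocking_children[s.span_id] for s in spans}


def aggregate(spans):
    """One row per (service, span name): count, total time, exclusive time."""
    exc = exclusive_times(spans)
    rows = defaultdict(lambda: {"count": 0, "total_us": 0, "exc_us": 0})
    for span in spans:
        row = rows[(span.service, span.operation)]
        row["count"] += 1
        row["total_us"] += span.duration_us
        row["exc_us"] += exc[span.span_id]
    return dict(rows)


def n_plus_one_candidates(rows, trace_duration_us, min_count=3, min_share=0.5):
    """Rows with many spans whose summed time is a large share of the trace."""
    return [
        key for key, row in rows.items()
        if row["count"] >= min_count
        and row["total_us"] / trace_duration_us >= min_share
    ]


# Made-up trace: one request that issues three quick database calls.
trace = [
    Span("a", None, "frontend", "GET /orders", 1000),
    Span("b", "a", "orders", "listOrders", 800),
    Span("c1", "b", "orders", "SELECT item", 200),
    Span("c2", "b", "orders", "SELECT item", 200),
    Span("c3", "b", "orders", "SELECT item", 200),
]

rows = aggregate(trace)
candidates = n_plus_one_candidates(rows, trace_duration_us=1000)

# Print the table sorted by exclusive time, largest first.
for (service, operation), row in sorted(
    rows.items(), key=lambda kv: kv[1]["exc_us"], reverse=True
):
    print(service, operation, row)
```

In this made-up trace each `SELECT item` span is short, but together the three spans account for most of the request, so that row is the one returned as an N+1 candidate. Note that the plain subtraction can go negative when blocking children overlap in time; the sketch does not handle that case.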

We are aware that the exclusive time in particular is not fully clean yet, mainly because there is no unambiguous indicator of whether a span was called synchronously or asynchronously. For now, we use heuristics for this.
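
The heuristics themselves are not described in this thread. As one possible illustration (an assumption, not necessarily what the view does), a child span could be treated as blocking when it starts and finishes within its parent's time window. The function name and parameters below are hypothetical.

```python
def is_probably_blocking(parent_start_us, parent_duration_us,
                         child_start_us, child_duration_us):
    """Guess whether the parent waited for the child (assumed heuristic)."""
    parent_end_us = parent_start_us + parent_duration_us
    child_end_us = child_start_us + child_duration_us
    return parent_start_us <= child_start_us and child_end_us <= parent_end_us


# A child that finishes inside the parent's window is treated as blocking ...
sync_call = is_probably_blocking(0, 1000, 100, 800)
# ... while one that outlives the parent is treated as fire-and-forget.
async_call = is_probably_blocking(0, 1000, 900, 500)
```

A check like this cannot tell apart parallel children that all finish before the parent does, which is one reason an explicit sync / async flag in the span data model would be more reliable.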

Regarding OpenTelemetry, I think it would be a great improvement and would help analysis a lot if something like a sync / async (blocking / non-blocking) flag could be introduced to the span data model.

yurishkuro · 2019-09-12T22:34:21Z

Re "Exc" - we refer to it as "self-time", e.g. in the graph view of a trace

vprithvi · 2019-09-13T17:35:21Z

@AlexanderWert I'm curious - it seems like this table is a proxy for identifying spans that are in the critical path of requests; are the aggregate latency numbers useful on their own, or are they useful mainly to provide ordering?

Could visualizing the critical path in the trace view be better for your use case?

For example, see the following screenshot with the critical path highlighted in an ugly red marker.
An advantage is that one can clearly see where in the lifecycle of the request exclusive time is going; for instance, in the first frontend span, it is only at the tail end.

yurishkuro · 2019-09-13T17:55:53Z

I find the numerical view also useful, especially in a large trace where visualizing critical path might be difficult, while the table can be easily sorted by the maximum impact of self time.

everett980 · 2019-09-13T19:08:18Z

The trace graph is still in an "Experimental" state, but does support coloring based on self time. Refactoring that or adding a button to color based on self/total time as a percentage of parent time may be an intuitive way to surface the critical path.

AntPeixe · 2021-08-13T10:40:51Z

Quite a bump... but has there been progress on this? Is there a plan to add the detailed trace overview?

I'm currently using the all-in-one installation for test/dev and would appreciate this.
