Flame Graph segment width seems to be weighted by sample count rather than time #3558
I suppose my first question should be: is this considered a bug? Perhaps there are reasons to weight by sample rather than by time? I'm struggling to think of reasons, but I don't want to rule it out. It seems a bit confusing to me to have the tooltip (in the first screenshot) say […]
> This comes up particularly with off-cpu profiling

I have a limited understanding of this, but I would say this would happen more for on-cpu profiling.

You're touching the (current) limitations of importing data from non-Gecko sources, and of our history of being a sampling profiler dedicated to Gecko.

Historically the data comes from a sampling profiler. We were approximating the time spent with the simple formula `time * sampleCount`, and this is basically what you see in the flame graph currently.

But this doesn't work for the stack graph, where we need to fill the holes. So we had to come up with an algorithm to determine whether we stay in a function over several samples. As a result we had a duration. But that duration was obviously different from the simple formula, especially when there were holes.

So we decided to use the same algorithm as the stack graph to compute this "tracing time" and use it in the tooltip, in addition to showing the "sample count", because that is what we really observed.

I believe we could use the "tracing time" as the source for the flame graph, but would that be closer to the truth in case of sampling holes? I'm not so sure; we don't actually know what happens in a sampling hole.

Now fast forward to your issue. One problem is that we never really treated Linux perf as a first-class data provider, so we made it work somewhat, with the assumption that our few users would know the limitations.

I believe there's another way to fix this. I don't really know what the source data looks like; you'll know better. If we do have durations for the samples, we could use the concept of "weight", which we don't use enough but which our various algorithms take into account:

https://github.com/firefox-devtools/profiler/blob/cc73c52b3c2f803183b14e401e56ea76283dd342/src/types/profile.js#L97-L142
With the caveat that it's not currently part of the sample data in the Gecko format; it's only part of the sample data in the processed format. So either we could augment the Gecko format to support it, or we can make the Linux perf importer output the processed format instead of the Gecko format. My preferred option would be the latter, but it's more work. What do you think?
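To make the "weight" idea concrete, here is a minimal sketch in plain JavaScript. The names and table shape are illustrative only, not the profiler's actual internals: it shows how a per-sample weight column could drive flame graph widths, degrading to a plain sample count when no weights are present.

```javascript
// Hypothetical sketch of a "processed format"-style samples table:
// parallel arrays indexed by sample, with an optional weight column.
const samples = {
  stack: [0, 0, 1, 1, 1],               // stack index per sample
  weight: [1.0, 1.0, 0.5, 128.0, 0.5],  // e.g. milliseconds per sample
  length: 5,
};

// Aggregate total weight per stack. If `weight` is null/absent, every
// sample counts as 1, reproducing the current sample-count behavior.
function totalWeightPerStack(samples) {
  const totals = new Map();
  for (let i = 0; i < samples.length; i++) {
    const stack = samples.stack[i];
    const w = samples.weight ? samples.weight[i] : 1;
    totals.set(stack, (totals.get(stack) || 0) + w);
  }
  return totals;
}

const totals = totalWeightPerStack(samples);
console.log(totals.get(0)); // 2
console.log(totals.get(1)); // 129 — one long off-CPU sample dominates
```

With weights, the single 128 ms sample in stack 1 would make that segment far wider than stack 0, even though stack 0 has almost as many samples.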
Thank you for this detailed explanation. This is a tricky topic, and I'm
still thinking about it. This is my first draft of trying to clarify my
thoughts...
> I would say this would happen more for on-cpu profiling
Yes, good point, this will also affect on-CPU profiling where all the time
spent off-CPU would be 'filled-in'.
I hadn't considered the 'filling in' algorithm -- it may mitigate most of
the impacts for off-CPU profiling for my use case by 'filling in' the time
spent off-CPU.
> I believe we could use the "tracing time" as source for the flame graph,
> but would that be closer to the truth in case of sampling holes? I'm not so
> sure, we actually don't know what happens in a sampling hole.
I'd like to understand more about this concern. Are you worried about big
holes (i.e. longer than the sampling interval) unintentionally caused by
data-collection problems?
When you have a long hole, I think it's usually due to thread blocking, and
off-CPU profiling tries to get better data for the cause of that blocked
time by taking a stack trace when the thread is descheduled (using the
sched:sched_switch event, I think). So you can see that the 'hole' is
caused by (e.g.) a futex wait, or an IO write.
Many Linux perf files are recorded using 'on-CPU' tracing and don't have
this off-CPU sched_switch stack trace; we should be mindful of those.
> If we do have durations for the samples.
We do :-) In the raw perf script output, we have cpu-clock data
(nanoseconds) associated with each sample.
I'll need to think some more about whether the 'filling in' algorithm you
have is good enough -- I think you're using the timestamp, which I suppose
is the same information as we get from the cpu-clock.
> My preferred option would be the latter, but this is more work.
I'm OK with doing more work to get this right. I wasn't sure whether the
'processed format' is a stable format for storage or data interchange -- is
it? If so, then using the processed format sounds good; it has a lot of
advantages.
I'll think about this some more. I'd want to make sure this has reasonable
output for both on-CPU and off-CPU profiles.
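As an aside, here is a simplified sketch (assumed; not the profiler's actual fill-in algorithm) of deriving per-sample durations from timestamp deltas. When sampling is regular this carries roughly the same information as the per-sample cpu-clock value, but a duration derived from deltas stretches across off-CPU gaps, which is exactly where the two diverge.

```javascript
// Sample timestamps in milliseconds; the large gap models a thread
// that was blocked (off-CPU) between two samples.
const timestamps = [0, 0.25, 0.5, 9600.5, 9600.75];

// Nominal sampling interval: 4000 Hz -> 0.25 ms.
const interval = 0.25;

// Each sample's duration = gap to the next sample; the last sample
// falls back to the nominal interval since it has no successor.
const durations = timestamps.map((t, i) =>
  i + 1 < timestamps.length ? timestamps[i + 1] - t : interval
);

console.log(durations); // [0.25, 0.25, 9600, 0.25, 0.25]
```

A cpu-clock weight for the third sample would instead report only the on-CPU time, so the two signals answer different questions for off-CPU profiles.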
I would recommend using the weight parameter here. If I were to approach this, I would add it to the Gecko profiler format, and then ensure that timing information in tooltips uses the correct name. I'm not sure it's worth converting to the processed format just yet, but you can also add things like memory allocation information to the processed format.

SUM(traced timing): This is the traced time. It is useful when we don't know the length of the sample, but it is somewhat misleading, since gaps in sampling come from things like locks.

SUM(sample weight): I would really like this to be fully supported and implemented well in the UI. It's really nice for imported profiles.

SUM(sample count * sampling interval): I don't know that we should show this value, as it's so misleading. As someone who works on the profiler, I kind of like it because it gives me some information I find useful, but as a profiler user I think it would be misleading.
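A toy comparison of the three totals above (all values illustrative, times in milliseconds) shows how far apart they can be when one long off-CPU sample is present:

```javascript
// Illustrative per-sample data: four short on-CPU samples plus one
// 50 ms sample taken while the thread was blocked.
const interval = 1;                      // nominal sampling interval, ms
const sampleWeights = [1, 1, 1, 50, 1];  // per-sample weight (e.g. cpu-clock)
const tracedTimes   = [1, 1, 1, 50, 1];  // gap-filled durations from timestamps

const sum = (xs) => xs.reduce((a, b) => a + b, 0);

const tracedTotal = sum(tracedTimes);                // SUM(traced timing)
const weightTotal = sum(sampleWeights);              // SUM(sample weight)
const naiveTotal  = sampleWeights.length * interval; // SUM(count * interval)

console.log(tracedTotal, weightTotal, naiveTotal); // 54 54 5
```

The naive count-times-interval total undercounts the blocked time by an order of magnitude here, which is the misleading behavior described above.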
FYI, I created a very rough PR to fix this; at least, it should be working: #4320. Is anyone still interested? If so, I may prioritize finishing up the PR.
Yes, we're still interested! Thanks a lot for looking at this :-)
I'm still interested too! I have been working around this issue by looking
at weighted profiles with pprof, but would far prefer using the time-series
view available in Firefox Profiler.
This comes up particularly with off-cpu profiling, where one stack sample (when a thread blocks on IO or mutex) might represent a large wall-clock duration.
Compare this screenshot, where a segment has 237 samples and takes 49ms:
With this screenshot, where a segment has 75 samples and takes 9600ms:
I'd expect the 9600 ms segment to be about 9600/49 ≈ 196 times wider, but instead the 49 ms span is wider.
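The arithmetic behind that expectation, using the numbers from the two screenshots:

```javascript
// Segment A: many samples, little wall-clock time (on-CPU work).
// Segment B: few samples, huge wall-clock time (blocked off-CPU).
const a = { samples: 237, ms: 49 };
const b = { samples: 75, ms: 9600 };

// Width weighted by sample count: A appears about 3x wider than B...
console.log((a.samples / b.samples).toFixed(1)); // "3.2"

// ...but weighted by wall-clock time, B should be about 196x wider.
console.log((b.ms / a.ms).toFixed(1)); // "195.9"
```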
Here's an offcpu profile to reproduce with:
2021-09-17-200119506-obfuscated.firefox.json.gz
This was generated using Android simpleperf with off-CPU tracing at 4000 Hz for 30 s:
Then converted to firefox profiler format.