WIP: statistical allocation profiling #31915

Closed
wants to merge 2 commits

Conversation

tkluck
Contributor

@tkluck tkluck commented May 3, 2019

Julia has a mature statistical profiler. It sets a timer that captures a backtrace when it is triggered. By the law of large numbers, this gives insight into where an algorithm spends its time, without noticeably slowing the program down.

By comparison, finding out where allocations are happening is quite a bit more cumbersome. It requires starting Julia with a specific command-line switch, code execution is much slower, and after program exit the results are scattered over the file system.

This pull request is an attempt at bringing the ergonomics of statistical runtime profiling to allocations: "statistical allocation profiling". Similar to how, in the former case, `Profile.init` configures a delay between backtraces, this branch adds an option to specify the fraction of allocations that capture a backtrace.

Example usage:

```julia
using Profile
Profile.init(alloc_rate = 0.01)

doublefibonacci(n) = if n <= 2
    return [1, 1]
else
    return doublefibonacci(n - 1) .+ doublefibonacci(n - 2)
end

@profile for i = 1:1000; doublefibonacci(15); end

Profile.print() # but better to use e.g. ProfileView or StatProfilerHTML
```

State of this commit:

  • Linux support only
  • not thread-safe
  • no attempt at a friendly human interface yet; as it stands, the `Profile.init` API almost encourages a linear combination of runtime and allocation profiling, which makes no sense

I'm sending this as a WIP early so I can get feedback before investing time in productionizing this. What do you think?

@vtjnash
Sponsor Member

vtjnash commented May 3, 2019

See also #31534 (I haven't yet looked into either much to compare)

@tkluck
Contributor Author

tkluck commented May 3, 2019

@vtjnash thanks for the reference! I wasn't aware of that one.

From skimming the other PR, it looks like the main differences are:

  • @staticfloat's PR uses separate buffers for memory profiling. That allows capturing some extra specifics, but it also means much of the surrounding tooling needs to be adapted (e.g. ProfileView can't be used as-is).
  • @staticfloat's PR has no statistical component; it just tracks everything.
  • @staticfloat's PR has rich filtering options for different kinds of allocations.
  • @staticfloat's PR also extends the Profile package with a friendly human interface for this new way of profiling.

@yuyichao
Contributor

yuyichao commented May 3, 2019

Why is this not just a display feature? The profile data already contains backtraces that include the allocation functions. The only job should be to find those functions in the backtraces, and it should not involve changing the allocation code.

@tkluck
Contributor Author

tkluck commented May 3, 2019

> Why is this not just a display feature? The profile data already contains backtraces that include the allocation functions. The only job should be to find those functions in the backtraces, and it should not involve changing the allocation code.

Because that is scaled by time spent, not by the number of allocations; the latter is what this PR aims to measure.

This was an oversight in the previous commit.
@timholy
Sponsor Member

timholy commented May 4, 2019

Interesting. I like the tunable runtime overhead. While number of allocations is probably what I'd use this for most, sometimes one might want more info about the size of allocations. Using this approach, could one indirectly get that via an option to trigger every n bytes? (Or once the next rand()*n bytes get allocated, if you're worried about periodic phenomena.)

One option worth considering is to collaborate with @staticfloat to finish #31534, and perhaps integrate the tunable runtime overhead of this approach.
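
For concreteness, here is a rough sketch of what a byte-based trigger along the lines of this suggestion could look like. None of this is in the PR: the names `gc_maybe_sample_bytes`, `bytes_until_sample`, and `sample_interval_bytes` are made up for illustration, and `jl_profile_record_trace` follows the call used in the diff below with an assumed signature.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed hook: the name follows jl_profile_record_trace as used in this
 * PR's diff; the exact signature is a guess. */
void jl_profile_record_trace(void *ctx);

/* Illustrative globals; in the runtime these would live next to gc_num. */
static size_t bytes_until_sample = 0;            /* 0 = byte-based sampling disabled */
static size_t sample_interval_bytes = 1 << 20;   /* aim for ~one sample per MiB allocated */

/* Would be called from the allocation fast paths with the allocation size. */
static inline void gc_maybe_sample_bytes(size_t alloc_sz)
{
    if (bytes_until_sample == 0)
        return;                                  /* sampling disabled */
    if (alloc_sz >= bytes_until_sample) {
        /* Randomize the next gap (roughly uniform in [1, 2*interval]) so the
         * sampler does not lock onto periodic allocation patterns. */
        uint64_t r = (uint64_t)rand();
        bytes_until_sample =
            1 + (size_t)((r * 2 * (uint64_t)sample_interval_bytes) / ((uint64_t)RAND_MAX + 1));
        jl_profile_record_trace(NULL);
    }
    else {
        bytes_until_sample -= alloc_sz;
    }
}
```

Sampling proportional to allocated bytes would weight the resulting flame graph by allocation volume rather than allocation count, which is the distinction raised above.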

```diff
@@ -1108,6 +1116,8 @@ JL_DLLEXPORT jl_value_t *jl_gc_pool_alloc(jl_ptls_t ptls, int pool_offset,
         jl_gc_safepoint_(ptls);
     }
     gc_num.poolalloc++;
     if(gc_statprofile_sample_rate && rand() < gc_statprofile_sample_rate)
```
Contributor

@chethega chethega May 4, 2019


Following @timholy's comment on adjustable overhead: this implementation calls the RNG on every allocation, so even if the sample rate is close to zero, the overhead does not converge to zero.

An alternative would be something like
`if (gc_num.poolalloc++ == gc_num.next_pool_sample) { gc_num.next_pool_sample += gc_statprofile_pool_inverse_rate; jl_profile_record_trace(NULL); }`.

With `gc_num.next_pool_sample = 0`, this would trigger on the next wrap-around, i.e. never, and with `gc_statprofile_pool_inverse_rate` large it would trigger very rarely. We would pay only a single well-predicted branch on allocations we don't want to sample.

Similar treatment could be applied to the `gc_num.bigalloc`, `gc_num.allocd`, etc. counters. We should probably randomize the increment to avoid biases in loops whose period is close to commensurable with the inverse rate. While a Poisson distribution of the gaps (as your code provides) is statistically nicer, something like `1 + (inverse_rate * rand_uint16()) >> 15` is probably good enough.
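
A self-contained sketch of this countdown scheme, for illustration only: the standalone counters stand in for the `gc_num` fields mentioned above, the randomized-gap constant is an example value, and `jl_profile_record_trace` is assumed as before.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed hook, as above (name from the PR's diff, signature guessed). */
void jl_profile_record_trace(void *ctx);

/* Stand-ins for the gc_num fields mentioned in the comment. */
static uint64_t poolalloc_count = 0;
static uint64_t next_pool_sample = 0;       /* 0 => only on wrap-around, i.e. effectively never */
static uint64_t pool_inverse_rate = 10000;  /* expected allocations between samples */

/* Unsampled allocations pay one increment plus one well-predicted branch. */
static inline void gc_count_pool_alloc(void)
{
    if (poolalloc_count++ == next_pool_sample) {
        /* Randomized gap, roughly uniform in [1, 2*pool_inverse_rate], so the
         * sampler does not resonate with loops whose period divides the rate. */
        uint64_t r15 = (uint64_t)(rand() & 0x7fff);  /* 15 random bits */
        next_pool_sample = poolalloc_count + 1 + ((pool_inverse_rate * r15) >> 14);
        jl_profile_record_trace(NULL);
    }
}
```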

Contributor Author


That's a great point. I'll run some timings to see how RNG overhead compares to the allocation itself. If it's significant, I'll investigate the right scheme to use here. If not, there's probably value in keeping Poisson.
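
One possible way to get such timings (purely illustrative, not from the PR) is a small standalone C benchmark comparing a `rand()` call to a small `malloc`/`free` pair. Note that Julia's pool allocator is considerably faster than `malloc`, so the RNG's relative cost on the real fast path would be even larger than what this reports.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

int main(void)
{
    const int N = 10000000;
    volatile long sink = 0;

    /* Cost of the RNG call alone. */
    double t0 = now_sec();
    for (int i = 0; i < N; i++)
        sink += rand();
    double t_rand = now_sec() - t0;

    /* Cost of a small heap allocation (stand-in for a pool allocation). */
    t0 = now_sec();
    for (int i = 0; i < N; i++) {
        char *p = malloc(32);
        if (p) {
            p[0] = (char)i;   /* touch the memory so the pair isn't optimized out */
            sink += p[0];
            free(p);
        }
    }
    double t_alloc = now_sec() - t0;

    printf("rand():          %6.1f ns/call\n", 1e9 * t_rand / N);
    printf("malloc/free(32): %6.1f ns/pair\n", 1e9 * t_alloc / N);
    return (int)(sink & 1);
}
```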

@tkluck
Contributor Author

tkluck commented May 4, 2019

@timholy thanks for the comments. I'll be glad to work together on combining these pull requests. @staticfloat what do you think?

@timholy
Sponsor Member

timholy commented May 13, 2019

@tkluck, thanks again for this. It was extremely useful in JuliaImages/ImageFiltering.jl#94 (comment); highly recommended for anyone else who wants to debug something similar. I am looking forward to whatever form this ends up taking!

@Sacha0
Member

Sacha0 commented Oct 20, 2022

Superseded by #42768? :)
