
Profile: multithread getdict! (redo of #43805) #43816

Merged

Conversation


@IanButterworth IanButterworth commented Jan 14, 2022

It was pointed out that there were at least two problems with #43805 (which has now been reverted):

  • Even preallocated Dicts are unsafe for multithreaded writes, because internal fields are mutated on every write
  • Spawning a task at once for every unique ip isn't a good idea

This PR moves to a channel-based approach and spawns nthreads() tasks via @spawn.

I didn't use Threads.@threads here to avoid blocking, so this is possible:

julia> @profile using DifferentialEquations

julia> @time Profile.retrieve();
  1.956626 seconds (1.19 M allocations: 95.429 MiB, 0.41% gc time, 13.90% compilation time)

julia> Threads.@spawn while true
           rand(10,10) * rand(10,10)
       end
Task (runnable) @0x00007f803ffd5710

julia> @time Profile.retrieve();
  2.674690 seconds (4.53 M allocations: 3.756 GiB, 39.55% gc time)

julia> @time Profile.retrieve();
  2.276615 seconds (3.71 M allocations: 3.067 GiB, 39.71% gc time)

(I think the allocations and GC of the spawned loop are being picked up by @time here?)

cc. @vtjnash @timholy
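The shape of the approach might be sketched as below (this is not the PR's actual code; `parallel_lookup` and `lookup_ip` are hypothetical stand-in names for the real instruction-pointer lookup machinery):

```julia
using Base.Threads: nthreads, @spawn

# Sketch of the channel-plus-workers pattern: indices are fed through a
# Channel and drained by nthreads() long-lived tasks, instead of spawning
# one task per unique ip. `lookup_ip` is a hypothetical stand-in.
function parallel_lookup(unique_ips::Vector{UInt64}, lookup_ip)
    results = Vector{Any}(undef, length(unique_ips))
    ch = Channel{Int}(length(unique_ips))
    foreach(i -> put!(ch, i), eachindex(unique_ips))
    close(ch)   # once drained, the workers' for loops terminate
    workers = map(1:nthreads()) do _
        @spawn for i in ch
            results[i] = lookup_ip(unique_ips[i])
        end
    end
    foreach(wait, workers)
    return results
end
```

Because each worker writes to a distinct index of `results`, no locking is needed; the Channel only coordinates which index each worker takes next.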

@IanButterworth IanButterworth changed the title Profile: safer getdict! via channels and limiting spawns (fix #43805) Profile: safer getdict! (fix #43805) Jan 15, 2022
@IanButterworth IanButterworth changed the title Profile: safer getdict! (fix #43805) Profile: fix unsafe getdict! (fix #43805) Jan 15, 2022
Threads.@spawn begin
    unique_ips = unique(has_meta(data) ? strip_meta(data) : data)
    chnl = Channel{Tuple{UInt64,Vector{StackFrame}}}(length(unique_ips)) do ch
        Threads.@threads for ip in unique_ips
Member:

This will hang if a thread is already running in non-yielding code. Can we at a minimum make this threading optional?

Author:

Good point. Is there a tidy way to make @threads optional?

Member:

I have no idea, I was wondering the same thing when I wrote this. Maybe refactor the internals into its own function?
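One way the "refactor the internals into its own function" idea could look, as a sketch (`lookup_all` and `lookup_ip` are hypothetical names, not the PR's code):

```julia
# Hypothetical sketch: make threading optional by extracting the loop body,
# so the threaded and serial paths share one implementation. Results go
# into an index-addressed vector, which is race-free under @threads.
function lookup_all(unique_ips, lookup_ip; threaded::Bool=true)
    out = Vector{Any}(undef, length(unique_ips))
    body!(i) = (out[i] = lookup_ip(unique_ips[i]))
    if threaded
        Threads.@threads for i in eachindex(unique_ips)
            body!(i)
        end
    else
        foreach(body!, eachindex(unique_ips))
    end
    return out
end
```

A caller worried about non-yielding threads could pass `threaded=false` to fall back to the serial path.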

Author:

Just pushed what appears to be non-blocking in this case.

@IanButterworth IanButterworth changed the title Profile: fix unsafe getdict! (fix #43805) Profile: multithread getdict! (redo of #43805) Jan 15, 2022
# we don't want metadata here as we're just looking up ips
unique_ips = unique(has_meta(data) ? strip_meta(data) : data)
n_unique_ips = length(unique_ips)
n_unique_ips == 0 && return dict
Sponsor Member:

The expensive part is the lookup. Can't you just

iplookups = similar(unique_ips, Vector{StackFrame})
Threads.@threads for i = 1:n_unique_ips
    iplookups[i] = _lookup_corrected(unique_ips[i])
end

and then stash in the Dict in single-threaded mode?
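Completing that suggestion with the serial stash might look like this (a sketch; `lookup_one` is a hypothetical stand-in for `_lookup_corrected`, and the element types are simplified):

```julia
# Sketch: do the expensive lookups in parallel into an index-addressed
# vector (race-free), then fill the Dict in a single-threaded pass,
# since Dict mutation is not thread-safe. `lookup_one` is a stand-in.
function getdict_sketch(unique_ips::Vector{UInt64}, lookup_one)
    n = length(unique_ips)
    iplookups = Vector{Any}(undef, n)
    Threads.@threads for i = 1:n
        iplookups[i] = lookup_one(unique_ips[i])
    end
    dict = Dict{UInt64,Any}()
    for i = 1:n                     # serial: safe Dict writes
        dict[unique_ips[i]] = iplookups[i]
    end
    return dict
end
```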

Author:

I didn't use @threads because @jpsamaroo pointed out @threads would get blocked if any thread is doing something that isn't yielding.

Though, regarding just putting it into a buffer, I was thinking that would be memory inefficient, but maybe I'm wrong because the dict is empty to start with.

I'll try removing the channel stuff.

Sponsor Member:

Not a big deal, but you're not using the limited-capacity aspect of the Channel at all, so this is really just a glorified array-store. If you store by the index of the ip you don't have to worry about races. Indeed, you could replace the @threads with @spawned blocks.
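The "@spawned blocks" variant mentioned here might look like the following sketch (names are hypothetical, not the PR's code):

```julia
using Base.Threads: nthreads, @spawn

# Sketch: replace @threads with explicit @spawn-ed chunks of the index
# range; storing by index means no locking is needed. `lookup_one` is a
# hypothetical stand-in for the real lookup.
function chunked_lookup(unique_ips, lookup_one)
    n = length(unique_ips)
    out = Vector{Any}(undef, n)
    chunksize = max(1, cld(n, nthreads()))
    tasks = Task[]
    for idxs in Iterators.partition(1:n, chunksize)
        t = @spawn for i in idxs
            out[i] = lookup_one(unique_ips[i])
        end
        push!(tasks, t)
    end
    foreach(wait, tasks)
    return out
end
```

Unlike @threads, these spawned tasks can be scheduled even while another thread runs non-yielding code.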


tkf commented Jan 15, 2022

This PR essentially just needs a parallel map. IMHO it doesn't seem like a good idea to write an ad-hoc inefficient parallel map everywhere that needs it. However, Base simply does not have a facility to implement efficient parallel folds, including parallel maps. How about decomposing Profile API in such a way that threaded_getdict! can be implemented outside the julia distribution?

Alternatively, maybe we can write slightly clumsy parallel folds inside Base that can be used for Base and stdlib but intentionally declared as internal to avoid external packages relying on it.
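A minimal internal parallel fold of the kind described might be a divide-and-conquer mapreduce like the following sketch (`_tmapreduce` and its `basesize` cutoff are hypothetical, not Base API):

```julia
# Sketch: divide-and-conquer parallel mapreduce. Below the `basesize`
# cutoff it falls back to the serial mapreduce; above it, the left half
# runs in a spawned task while the right half runs in the current task.
function _tmapreduce(f, op, xs::AbstractVector; basesize::Int=1024)
    n = length(xs)
    n <= basesize && return mapreduce(f, op, xs)
    mid = n ÷ 2
    left = Threads.@spawn _tmapreduce(f, op, view(xs, 1:mid); basesize)
    right = _tmapreduce(f, op, view(xs, mid+1:n); basesize)
    return op(fetch(left), right)
end
```

A parallel map is then the special case where `f` builds `(index, value)` pairs and `op` concatenates, or simply the indexed-store patterns shown earlier in the thread.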


IanButterworth commented Jan 16, 2022

How about decomposing Profile API in such a way that threaded_getdict! can be implemented outside the julia distribution?

I guess a faster getdict! could be provided externally for other consumers, but it would still be nice to have a fast method for the Profile.print() report format.

Regarding how to implement this if it stays in Profile, I've pushed what I think is a decent version of this that could exist in current Base. I'm not pushing for it to stay as is, but as an illustration if it were to be generalized into an internal parallel map func as you suggest.

Note that:

  • Only nthreads() tasks are spawned
  • Base.update_stackframes_callback[] is now executed sequentially to avoid breaking the race-free promise, as you pointed out @tkf
  • It won't block the way @threads would, which lines it up for #43179 (add profiling of already running tasks via SIGINFO/SIGUSR1) if that gets approved
  • I kept the channel given there might be some benefit to operating the two loops in parallel

For a small profile data buffer:

This PR

julia> @btime Profile.retrieve();
  40.233 ms (19765 allocations: 10.06 MiB)

Master

julia> @btime Profile.retrieve();
  331.755 ms (295629 allocations: 26.67 MiB)

@IanButterworth:

How does this look @vtjnash ?

@IanButterworth:

Bump

@tkf tkf left a comment:

LGTM

@IanButterworth:

I was hoping to get your review on this @vtjnash but will merge tomorrow if not

@IanButterworth IanButterworth merged commit e6fa3ec into JuliaLang:master Feb 3, 2022
@IanButterworth IanButterworth deleted the ib/profile_safer_fetch branch February 3, 2022 01:52
N5N3 added a commit to N5N3/julia that referenced this pull request Feb 4, 2022
(repeated: 27 further commits in N5N3/julia referenced this pull request through Feb 17, 2022)
LilithHafner pushed a commit to LilithHafner/julia that referenced this pull request Feb 22, 2022

oscardssmith commented Mar 2, 2022

Sometime between 1.7 and master, Profile printing has gotten way slower. Could this be the cause?

@IanButterworth:

Are you comparing both with the same number of threads?

I don't see a regression, and in larger tests saw an improvement with this PR.
The timing isn't precisely reliable because the number of samples won't be exactly the same.

1.7.1

julia> Threads.nthreads()
6

julia> using Profile

julia> @profile begin 
       t = time()
       while time() < t + 5
       rand(10,10) * rand(10,10)
       end
       end

julia> @time Profile.print()
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
     ╎2126  @Base/client.jl:495; _start()
     ╎ 2126  @Base/client.jl:309; exec_options(opts::Base.JLOpt...
     ╎  2126  @Base/client.jl:379; run_main_repl(interactive::B...
...
  1.470694 seconds (531.68 k allocations: 34.682 MiB, 80.94% compilation time)

julia> @time Profile.print()
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
     ╎2126  @Base/client.jl:495; _start()
     ╎ 2126  @Base/client.jl:309; exec_options(opts::Base.JLOpt...
     ╎  2126  @Base/client.jl:379; run_main_repl(interactive::B...
...
  0.187074 seconds (44.49 k allocations: 8.632 MiB)

master

julia> Threads.nthreads()
6

julia> using Profile

julia> @profile begin 
       t = time()
       while time() < t + 5
       rand(10,10) * rand(10,10)
       end
       end

julia> @time Profile.print()
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
     ╎2317  @Base/client.jl:522; _start()
     ╎ 2317  @Base/client.jl:318; exec_options(opts::Base.JLOpt...
     ╎  2317  @Base/client.jl:404; run_main_repl(interactive::B...
...
  1.441475 seconds (603.16 k allocations: 41.029 MiB, 0.54% gc time, 75.89% compilation time)

julia> @time Profile.print()
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
     ╎2317  @Base/client.jl:522; _start()
     ╎ 2317  @Base/client.jl:318; exec_options(opts::Base.JLOpt...
     ╎  2317  @Base/client.jl:404; run_main_repl(interactive::B...
...
  0.138633 seconds (42.42 k allocations: 11.697 MiB, 0.80% gc time)

6 participants