Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread safety of printing and matmul #26562

Merged
merged 6 commits into from
Mar 28, 2018
Merged

Conversation

stevengj
Copy link
Member

This PR uses a per-thread buffer to make the generic-matmul thread-safe (fixes #22581), floating-point printing thread-safe (fixes #25727), and printf thread-safe (fixes #21335).

In particular, the old DIGITS array in grisu is supplemented by a DIGITSs = [DIGITS] array that is expanded by __init__ into a per-thread buffer. I kept the old DIGITS array to avoid breaking the few external packages that use this — they can migrate gradually to the thread-safe version.

The basic matmul buffer structure is not changed, so #26008 is neither better nor worse.

As I mentioned in #10441, it would be good to have better support for thread-local variables in the compiler. Ideally I could just put @threadlocal const DIGITS=[....] and it would do the right thing under the hood.

@stevengj stevengj requested a review from Keno March 21, 2018 18:36
@stevengj stevengj added the multithreading Base.Threads and related functionality label Mar 21, 2018
@Keno
Copy link
Member

Keno commented Mar 21, 2018

This approach seems fine with me for now. We'll have to carefully think this through with the new threading runtime, since that'll start scheduling tasks on different threads, so thread-local will no longer be right. cc @kpamnany

@JeffBezanson
Copy link
Sponsor Member

Right, in general we need task-local storage, but I expect thread-local to be much more efficient (since there's a fixed number of threads, among other things). So we should continue to use this technique for internal buffers where we know that tasks won't block while using them (they won't, right? right? 😄 ).

@Keno
Copy link
Member

Keno commented Mar 21, 2018

One possible model would be to an explicit acquire/release for task local buffers, such that they get automatically copied along with the task if switching does happen.

@stevengj
Copy link
Member Author

CI failures are unrelated, I think. MacOS Travis is failing with fatal error: 'cholmod.h' file not found, and 32-bit appveyor seems to be a timeout.

@JeffBezanson
Copy link
Sponsor Member

Would be good to spot check that float to string conversion and printf performance are unaffected. I expect there would be no difference, but you never know.

@stevengj
Copy link
Member Author

stevengj commented Mar 22, 2018

old (master) thread-unsafe version:

julia> foo(x) = @sprintf("%3.4f %10.3g", x, x);
       foo(io, x) = @printf(io, "%3.4f %10.3g", x, x);

julia> @btime string(3.7234);
  338.630 ns (4 allocations: 192 bytes)

julia> @btime foo(3.7234);
  738.968 ns (10 allocations: 576 bytes)

julia> buf = IOBuffer(sizehint=100); @btime print($buf, 3.7234);
  255.677 ns (0 allocations: 0 bytes)

julia> buf = IOBuffer(sizehint=100); @btime foo($buf, 3.7234);
  717.572 ns (5 allocations: 288 bytes)

new thread-safe version:

julia> @btime string(3.7234);
  332.072 ns (4 allocations: 192 bytes)

julia> @btime foo(3.7234);
  760.429 ns (10 allocations: 576 bytes)

julia> buf = IOBuffer(sizehint=100); @btime print($buf, 3.7234);
  257.676 ns (0 allocations: 0 bytes)

julia> buf = IOBuffer(sizehint=100); @btime foo($buf, 3.7234);
  745.630 ns (5 allocations: 288 bytes)

So there seems to be at most a 4% slowdown, usually less. This seems a worthwhile price to pay for thread safety.

If you care about @printf performance, the most important thing is perhaps getting rid of the 5 allocations when writing to a pre-allocated buffer, which pre-date this PR.

@stevengj
Copy link
Member Author

Okay to merge?

@vtjnash
Copy link
Sponsor Member

vtjnash commented Mar 23, 2018

access to DIGITSs is inbounds

I think I would prefer we drop this commit unless it is absolutely conclusive that it is essential. We can refactor some of this code to move the computation of the buffer outside hot loops if necessary.

@stevengj
Copy link
Member Author

@vtjnash, reverted. I just tested and it makes no measurable difference in performance.

@stevengj
Copy link
Member Author

Unrelated timeout error on AppVeyor 64-bit

@stevengj stevengj merged commit b80fc88 into JuliaLang:master Mar 28, 2018
@stevengj stevengj deleted the threadsafe branch March 28, 2018 15:45
@simonbyrne
Copy link
Contributor

Do we need to do the same thing for Grisu.BIGNUMS as well?

@quinnj do you know?

@quinnj
Copy link
Member

quinnj commented Jun 28, 2018

Hmmmm, not sure. I think I remember @vtjnash fixing that a while ago though.

@simonbyrne
Copy link
Contributor

No, I managed to trigger an issue. See #27826

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
6 participants