
Use uv_thread_getaffinity when --threads=auto #42340

Merged · 27 commits · Feb 19, 2022

Conversation

@tkf (Member) commented Sep 22, 2021

Closes #35787

This patch detects the "advised" bound of CPU resources using uv_thread_getaffinity. For example, we can now use -tauto with taskset:

$ taskset --cpu-list 0-2 julia -tauto -e '@show Threads.nthreads()'
Threads.nthreads() = 3
$ taskset --cpu-list 0,2,4,8 julia -tauto -e '@show Threads.nthreads()'
Threads.nthreads() = 4

This seems to be very useful in HPC environments since you can just put --threads=auto in the batch script and stop worrying about the inconsistency between Julia's number of threads and the requested number of CPUs. It is also useful in containers like Docker, where cgroups can designate the available CPUs, which in turn is reflected in the affinity setting.
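As an illustration (hypothetical: this assumes the official julia Docker image and a Julia build containing this patch; --cpuset-cpus restricts the container's cgroup CPU set):

$ docker run --cpuset-cpus=0-3 julia julia -t auto -e '@show Threads.nthreads()'
Threads.nthreads() = 4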

TODOs


uv_thread_getaffinity seems to be able to detect CPU settings in many environments. I checked that the following code can detect the number of CPUs configured by taskset, systemd's AllowedCPUs (cgroup), and Slurm:

const pthread_t = Culong  # Linux: pthread_t is an unsigned long
const uv_thread_t = pthread_t

function uv_thread_getaffinity()
    masksize = ccall(:uv_cpumask_size, Cint, ())
    self = ccall(:uv_thread_self, uv_thread_t, ())
    selfref = Ref(self)
    cpumask = zeros(Cchar, masksize)
    err = ccall(
        :uv_thread_getaffinity,
        Cint,
        (Ptr{uv_thread_t}, Ptr{Cchar}, Cssize_t),
        selfref,
        cpumask,
        masksize,
    )
    @assert err == 0
    n = findlast(isone, cpumask)  # index of the last CPU present in the affinity mask
    resize!(cpumask, n)           # drop the trailing zero bytes
    return cpumask
end

# Each byte of the mask is 0 or 1, so the sum is the number of usable CPUs.
num_cpu() = sum(uv_thread_getaffinity())

@show num_cpu()
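For illustration, with the snippet saved as getaffinity.jl (the same file name used further down in this thread), one would expect something like:

$ taskset --cpu-list 0,2,4,8 julia getaffinity.jl
num_cpu() = 4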

@tkf added the "multithreading" label (Base.Threads and related functionality) on Sep 22, 2021
@DilumAluthge (Member)

@maleadt @staticfloat Would this potentially be useful for PkgEval?

@tkf (Member, Author) commented Sep 22, 2021

I think this PR alone is not quite enough for resource-limiting in CI since there has to be some scheduler to set the affinity of the scheduled jobs. I don't know how resource management works in PkgEval, though. I was also kinda hoping that it can be a part of the solution to #42242, provided that buildkite runner can set the affinity.

@maleadt (Member) commented Sep 22, 2021

PkgEval was already updating JULIA_NUM_THREADS accordingly, https://github.com/JuliaCI/PkgEval.jl/blob/b5fda0997792d02115e15a53f3d73d30dc7b87a9/src/run.jl#L144-L148, so this will allow removing that code.

@vchuravy (Member)

cc: @carstenbauer who looked into some adjacent stuff.

How does this interact with the BLAS threads?

src/sys.c: review thread (outdated, resolved)
@carstenbauer (Member) commented Sep 22, 2021

Doesn't seem to work with LIKWID / LIKWID.jl, in particular likwid-pin? ☹️

➜  crstnbr@cbserver likwid  likwid-pin -c 1,3,5,7 julia -tauto getaffinity.jl 
[likwid-pin] Main PID -> hwthread 1 - OK
num_cpu() = 1

➜  crstnbr@cbserver likwid  likwid-pin -s 0xffffffffffffffe1 -c 1,3,5,7 julia -tauto getaffinity.jl 
[likwid-pin] Main PID -> hwthread 1 - OK
num_cpu() = 1

(getaffinity.jl contains the code in the OP)

@staticfloat (Sponsor Member)

I think this PR alone is not quite enough for resource-limiting in CI since there has to be some scheduler to set the affinity of the scheduled jobs.

My assumption is that this is not the responsibility of Julia but rather of the CI environment; e.g. we tell the buildkite agents something like "limit yourselves to these 16 cores" and just divide up the machine's cores appropriately.

@tkf (Member, Author) commented Sep 22, 2021

Doesn't seem to work with LIKWID / LIKWID.jl, in particular likwid-pin?

So from https://github.com/RRZE-HPC/likwid/wiki/Likwid-Pin

Because all common OpenMP implementations rely on the pthread API it is possible for likwid-pin to preload a wrapper library to the pthread_create call. In this wrapper, the threads are pinned using the Linux OS API. likwid-pin can also be used to pin serial applications as a replacement for taskset.

Ahhh... this is interesting. So they are monkey-patching pthread? I can imagine that it's very effective when you don't control all the binaries, but I don't know what can or should be done here. Detecting likwid-pin seems to require hard-coding their monkey-patching strategy, and I'd prefer not to do anything like that. Magic doesn't compose.

My assumption is that this is not the responsibility of Julia, but is rather the responsibility of the CI environment

Yes, I agree (that's why I mentioned "a part of"). This PR is about playing nice with the surrounding environment.

@carstenbauer (Member)

The way I see it (as a non-expert, thus in simple words), we are trying to be smart with this PR and figure out the "advised" number of threads automatically to bound nthreads. This is great when it works, in which case I'm all for it.
However, I'm thinking about cases where it doesn't work as desired. An external tool like likwid-pin has so far worked reasonably well for julia but will not any longer after this PR (please correct me if I'm wrong). This is obviously suboptimal. Perhaps the bounding of nthreads should be opt-in (or at least opt-outable)?

uv_thread_getaffinity seems to be able to detect CPU settings in many environments.

"Many environments" isn't really enough to enforce nthreads bounding, is it?

Detecting likwid-pin seems to require hard-coding their monkey-patching strategy. But I prefer not to do anything like that. Magics don't compose.

I agree, we shouldn't hard-code / special case anything.

@tkf (Member, Author) commented Sep 22, 2021

uv_thread_getaffinity wraps pthread_getaffinity_np, so it's the "official OS API" on POSIX systems, IIUC. I think supporting "well-behaving" environments that use the official API is good enough. LIKWID is kind of an oddball here since it is doing a really (too) clever thing (maybe for programs that ignore the affinity setting?). I wonder if likwid-pin can do both, i.e., set the process affinity and then do the pre-load hack.


Also, related to this, another concern may be that it does not work on macOS:

.. c:function:: int uv_thread_getaffinity(uv_thread_t* tid, char* cpumask, size_t mask_size)

    Gets the specified thread's affinity setting. On Unix, this maps the
    cpu_set_t returned by :man:`pthread_getaffinity_np(3)` to bytes in cpumask.

    The mask_size specifies the number of entries (bytes) in cpumask,
    and must be greater-than-or-equal-to :c:func:`uv_cpumask_size`.

    .. note::
        Thread affinity getting is not atomic on Windows. Unsupported on macOS.

    .. versionadded:: 2.0.0

--- https://github.com/JuliaLang/libuv/blob/julia-uv2-1.42.0/docs/src/threading.rst

Maybe this is fixable in libuv, since it looks like there is a macOS API:
https://developer.apple.com/library/archive/releasenotes/Performance/RN-AffinityAPI/#//apple_ref/doc/uid/TP40006635-CH1-DontLinkElementID_2
https://superuser.com/questions/149312/how-to-set-processor-affinity-on-os-x
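As a sketch (not part of this PR), the pure-Julia helper above could degrade gracefully where libuv lacks affinity support, assuming uv_cpumask_size and uv_thread_getaffinity report the unsupported case as a negative/nonzero return:

const uv_thread_t = Culong  # as in the snippet above; a Linux-centric assumption

function num_cpu_or_fallback()
    masksize = ccall(:uv_cpumask_size, Cint, ())
    masksize > 0 || return Sys.CPU_THREADS  # affinity unsupported (e.g. macOS): fall back
    self = Ref(ccall(:uv_thread_self, uv_thread_t, ()))
    cpumask = zeros(Cchar, masksize)
    err = ccall(:uv_thread_getaffinity, Cint,
                (Ptr{uv_thread_t}, Ptr{Cchar}, Cssize_t), self, cpumask, masksize)
    err == 0 || return Sys.CPU_THREADS      # any error: fall back to the raw CPU count
    return count(isone, cpumask)            # CPUs the current thread may run on
end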

@vtjnash (Sponsor Member) commented Sep 23, 2021

I don't think that API is "enabled" (https://github.com/apple/darwin-xnu/search?q=ml_get_max_affinity_sets is hard-coded to 0), but it is unclear, since it is documented.

@carstenbauer (Member)

uv_thread_getaffinity wraps pthread_getaffinity_np and so it's the "official OS API" on POSIX systems

I think LIKWID is kind of an oddball here

Fair enough. I'm not particularly focused on LIKWID; my point is more general: IMO, an explicit request by the user, e.g. -t 4, should take priority over the uv_thread_getaffinity automatism. So my preference would be to bind this to -t auto so that users can opt out by being explicit. (Perhaps one could also think about a warning or similar if a user forces a number of threads higher than the "advised" number.)
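A hypothetical sketch of that warning idea (not part of this PR; num_cpu is the helper from the first snippet in this thread, and the exact message is made up):

if Threads.nthreads() > num_cpu()
    @warn "Requested $(Threads.nthreads()) threads, but the affinity mask only allows $(num_cpu()) CPUs"
end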

@tkf (Member, Author) commented Sep 23, 2021

So my preference would be to bind this to -t auto so that users can "opt-out" by being explicit.

Yes, definitely. This PR only changes the behavior of -tauto. Explicit -t has a higher priority:

$ taskset --cpu-list 0,2,4,8 julia -t17 -e '@show Threads.nthreads()'
Threads.nthreads() = 17

But now that I look at the PR title, that's not really clear from it. Sorry about the confusion!

@tkf changed the title from "Use uv_thread_getaffinity to bound nthreads" to "Use uv_thread_getaffinity when --threads=auto" on Sep 23, 2021
@DilumAluthge (Member)

I'm assuming that, in addition to using this when the -t auto command-line option is passed, we would also use this when the JULIA_NUM_THREADS environment variable is set to auto?

@tkf (Member, Author) commented Sep 23, 2021

Yeah, JULIA_NUM_THREADS=auto works:

$ JULIA_NUM_THREADS=auto taskset --cpu-list 0,2,4,8 julia-dev -e '@show Threads.nthreads()'
Threads.nthreads() = 4
$ JULIA_NUM_THREADS=11 taskset --cpu-list 0,2,4,8 julia-dev -e '@show Threads.nthreads()'
Threads.nthreads() = 11

I only changed the function jl_cpu_threads that is used for guessing the number of threads to use.

src/sys.c: review thread (outdated, resolved)
@vchuravy added the "needs docs" label (documentation for this change is required) on Sep 24, 2021
@tkf (Member, Author) commented Oct 2, 2021

How does this interact with the BLAS threads?

It looks like OpenBLAS is already doing the right thing when the affinity is set. This is with Julia 1.6:

$ taskset --cpu-list 0-2 julia1.6 -q
julia> using LinearAlgebra

julia> LinearAlgebra.BLAS.get_num_threads()
3

julia> ENV["OPENBLAS_NUM_THREADS"]  # i.e., this is ignored
"8"

@vtjnash (Sponsor Member) left a review comment


I am not sure we want Sys.CPU_THREADS to change dynamically at runtime. Perhaps we should make addprocs and --threads=auto directly check this affinity in some way (cached from startup)?

@tkf (Member, Author) commented Jan 11, 2022

Sys.CPU_THREADS is a global that is set in Sys.__init__, so it won't change after startup, right?

I do want to cache affinity on startup for #42338 though.

@vtjnash requested a review from Keno on January 29, 2022
@tkf (Member, Author) commented Jan 29, 2022

Actually, there's something wrong with linux64 (and freebsd) that I don't understand and can't reproduce locally. I'll convert this to a draft to avoid merging it accidentally.

@tkf marked this pull request as draft on January 29, 2022
@tkf (Member, Author) commented Jan 29, 2022

e.g., https://build.julialang.org/#/builders/70/builds/1821/steps/5/logs/stdio shows

      From worker 4:	fatal: error thrown and no exception handler available.
      From worker 4:	InitError(mod=:Profile, error=ErrorException("could not allocate space for 10000000 instruction pointers per thread being profiled (10 threads, 762.939 MiB total)"))
      From worker 4:	error at ./error.jl:44
      From worker 4:	#init#2 at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:80
      From worker 4:	init##kw at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:67 [inlined]
      From worker 4:	__init__ at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.8/Profile/src/Profile.jl:92
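(For reference, the quoted figure is consistent: 10,000,000 instruction pointers × 8 bytes × 10 threads = 800,000,000 bytes ≈ 762.939 MiB.)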

But in other runs (e.g., https://build.julialang.org/#/builders/70/builds/1791/steps/5/logs/stdio) the worker process seems to simply die without useful information (ProcessExitedException)

@tkf (Member, Author) commented Jan 30, 2022

The HEAD cdd8223 now passes tester_linux64 https://build.julialang.org/#/builders/70/builds/1833

But I'm not sure why these commits would fix the issue: 9a3cfd9...e8f469a

tester_freebsd64 is still failing due to a timeout (never mind, it is a known issue, #42469)

@tkf marked this pull request as ready for review on January 30, 2022
@tkf (Member, Author) commented Feb 17, 2022

@vtjnash Can we merge this? Or do you want to wait for @Keno to look at it?

@vtjnash added the "merge me" label (PR is reviewed; merge when all tests are passing) on Feb 17, 2022
@tkf merged commit 3453775 into JuliaLang:master on Feb 19, 2022
@tkf deleted the useaffinity branch on February 19, 2022
@tkf removed the "merge me" label on Feb 19, 2022
@IanButterworth (Sponsor Member)

I'm hitting the following during the build on macOS after this:

/Users/ian/Documents/GitHub/julia/src/jloptions.c:449:37: error: implicit declaration of function 'jl_effective_threads' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                jl_options.nprocs = jl_effective_threads();
                                    ^
1 error generated.
make[1]: *** [jloptions.o] Error 1

Labels: multithreading (Base.Threads and related functionality)
Development
Successfully merging this pull request may close these issues: "nthread limit when running in taskset/cpuset"
9 participants