Choose number of OpenBLAS threads based on process affinity #55572

Closed
carstenbauer opened this issue Aug 23, 2024 · 6 comments · Fixed by #55574
Labels
linear algebra (Linear algebra), multithreading (Base.Threads and related functionality)

Comments

carstenbauer (Member) commented Aug 23, 2024

Similar to #42340 we should probably also consider the affinity of the Julia process when deciding how many BLAS threads we spawn by default. Currently, we don't:

➜  sca50297@cl7fr1 ~  julia --threads=auto -q
julia> using LinearAlgebra; BLAS.get_num_threads()
16

julia> Threads.nthreads()
32

julia> Sys.CPU_THREADS
32

julia>

➜  sca50297@cl7fr1 ~  taskset -c 0,1 julia --threads=auto -q
julia> using LinearAlgebra; BLAS.get_num_threads()
16

julia> Threads.nthreads()
2

In the latter case, despite the fact that our process is restricted to 2 hardware threads, we spawn 16 BLAS threads. That never seems to be a good choice.
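Until the default changes, the count can be capped by hand. The following is only a sketch: `cap_blas_threads!` is a hypothetical helper, not part of LinearAlgebra, and it assumes Julia was started with `--threads=auto`, so that `Threads.nthreads()` already reflects the affinity mask as in the session above. With the default single Julia thread it would cap BLAS at one thread.

```julia
using LinearAlgebra

# Hypothetical helper: lower the BLAS thread count so that it never exceeds
# `available`, the number of CPUs this process is allowed to use.
function cap_blas_threads!(available::Integer)
    n = clamp(BLAS.get_num_threads(), 1, max(1, available))
    BLAS.set_num_threads(n)
    return BLAS.get_num_threads()
end

# Assumption: with `--threads=auto`, Threads.nthreads() tracks the affinity mask.
cap_blas_threads!(Threads.nthreads())
```

The helper only ever lowers the count, so calling it on an unrestricted machine leaves the existing default untouched.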

carstenbauer (Member, Author) commented Aug 23, 2024

Concrete use case where this will be helpful: HPC. For an MPI application, for example, SLURM automatically sets the affinity for each MPI rank (Julia process). Currently, we are oversubscribing cores as demonstrated above (as @PetrKryslUCSD can tell, because he ran into this issue). If we respected the affinity mask, that'd be much better.

carstenbauer (Member, Author) commented Aug 23, 2024

Interestingly, Taka had mentioned the BLAS case as well. He found that OpenBLAS respected the affinity and did the "right thing". Clearly that's not the case anymore, at least not for my test above (Julia 1.10.4). So it seems like either Julia or OpenBLAS has regressed here.

giordano (Contributor) commented

Related issue: #46226, where it's suggested to use libuv to get the number of available CPUs

carstenbauer (Member, Author) commented Aug 23, 2024

Related indeed, but note that #46226 talks about Julia threads, not BLAS threads.

giordano (Contributor) commented

Sure, but my point is that we can replace Sys.CPU_THREADS with @ccall uv_available_parallelism()::Cint in

@static if Sys.isapple() && Base.BinaryPlatforms.arch(Base.BinaryPlatforms.HostPlatform()) == "aarch64"
    BLAS.set_num_threads(max(1, Sys.CPU_THREADS))
else
    BLAS.set_num_threads(max(1, Sys.CPU_THREADS ÷ 2))
end

$ taskset -c 0,1 ./julia --threads=auto -q
julia> Threads.nthreads()
2

julia> @ccall uv_available_parallelism()::Cint
2

julia> Sys.CPU_THREADS
48
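To make the effect of the substitution concrete, here is a sketch of the same arithmetic with the CPU count passed in as an argument. `default_blas_threads` is a hypothetical name used only for illustration; it is not an existing function and is not necessarily how the eventual fix is written.

```julia
# Hypothetical helper mirroring the snippet above: all CPUs on Apple silicon,
# half of them elsewhere, and never fewer than one thread.
function default_blas_threads(ncpus::Integer;
                              apple_silicon::Bool = Sys.isapple() && Sys.ARCH === :aarch64)
    return apple_silicon ? max(1, ncpus) : max(1, ncpus ÷ 2)
end

# Current behaviour on the 48-thread machine above: based on Sys.CPU_THREADS.
default_blas_threads(48; apple_silicon = false)  # 24

# With the affinity-aware count (2 under `taskset -c 0,1`) as input instead.
default_blas_threads(2; apple_silicon = false)   # 1
```

Passing the result of `@ccall uv_available_parallelism()::Cint` as `ncpus` would give the affinity-aware default discussed here, assuming the symbol is available in the running Julia build.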

giordano (Contributor) commented Aug 23, 2024

This also works with MPI (I'm using OpenMPI here):

$ mpirun -np 6 --map-by node:PE=8 ./julia -e 'println(@ccall uv_available_parallelism()::Cint)'
8
8
8
8
8
8
$ mpirun -np 4 --map-by node:PE=12 ./julia -e 'println(@ccall uv_available_parallelism()::Cint)'
12
12
12
12
$ mpirun -np 2 --map-by node:PE=24 ./julia -e 'println(@ccall uv_available_parallelism()::Cint)'
24
24

And this using Slurm's srun:

[mosgiordano@fj-debug2 bin]$ srun -N 1 -n 6 -c 8 -t 04:00:00 -p short --pty bash
[mosgiordano@fj037 bin]$ srun ./julia -e 'println(@ccall uv_available_parallelism()::Cint)'
8
8
8
8
8
8
[mosgiordano@fj037 bin]$ exit
[mosgiordano@fj-debug2 bin]$ srun -N 1 -n 4 -c 12 -t 04:00:00 -p short --pty bash
[mosgiordano@fj037 bin]$ srun ./julia -e 'println(@ccall uv_available_parallelism()::Cint)'
12
12
12
12
[mosgiordano@fj037 bin]$ exit
[mosgiordano@fj-debug2 bin]$ srun -N 1 -n 2 -c 24 -t 04:00:00 -p short --pty bash
[mosgiordano@fj037 bin]$ srun ./julia -e 'println(@ccall uv_available_parallelism()::Cint)'
24
24
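As a cross-check for the Slurm runs, the requested CPUs per task can also be read from the environment. This is a sketch under the assumption that the job was submitted with `-c`/`--cpus-per-task`, in which case Slurm exports `SLURM_CPUS_PER_TASK`; `slurm_cpus_per_task` is a hypothetical helper, not an existing API.

```julia
# Hypothetical helper: return the CPUs per task requested from Slurm,
# or `nothing` if the variable is unset or not an integer.
function slurm_cpus_per_task(env = ENV)
    n = tryparse(Int, get(env, "SLURM_CPUS_PER_TASK", ""))
    return n === nothing ? nothing : max(1, n)
end

# Example with an explicit environment, mimicking `srun -c 8`.
slurm_cpus_per_task(Dict("SLURM_CPUS_PER_TASK" => "8"))  # 8
```

Unlike the affinity-based count, this only reflects what was requested from the scheduler, so it is useful for comparison rather than as a replacement.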

carstenbauer added the "linear algebra" and "multithreading" labels Aug 23, 2024
giordano added a commit that referenced this issue Sep 7, 2024
…threads` (#55574)

This is a safer estimate than `Sys.CPU_THREADS` to avoid oversubscribing
the machine when running distributed applications, or when the Julia
process is constrained by external controls (`taskset`, `cgroups`,
etc.).

Fix #55572
KristofferC pushed a commit that referenced this issue Sep 12, 2024
…threads` (#55574)
kshyatt pushed a commit that referenced this issue Sep 12, 2024
…threads` (#55574)