Number of cores on nodes calculated incorrectly #12

Open
takaomoriyama opened this issue Feb 1, 2023 · 0 comments

takaomoriyama commented Feb 1, 2023

We have assumed that LSB_MCPU_HOSTS contains a list of hostname and core-count pairs, as follows:

$ echo $LSB_MCPU_HOSTS
host1 7 host2 7 host3 7 host4 7

The number of cores on each host is then computed as follows, accumulating the counts so that a host name may appear more than once.

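echo "Num cpus per host is: $LSB_MCPU_HOSTS"
# Split the whitespace-separated "hostname count" pairs into an array.
IFS=' ' read -r -a array <<< "$LSB_MCPU_HOSTS"
declare -A associative
i=0
len=${#array[@]}
while [ "$i" -lt "$len" ]
do
    key=${array[$i]}
    value=${array[$i+1]}
    # Sum the counts when a host name appears more than once.
    associative[$key]=$(( ${associative[$key]:-0} + value ))
    i=$((i + 2))
done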

The problem is that LSB_MCPU_HOSTS is actually a list of hostname and slot-count pairs, as described in "Running parallel jobs on specific hosts". A slot may contain multiple cores, so the calculation above may produce wrong numbers.

Here is an example.

# job submitted by: bsub -n 4 -R "affinity[core(7,same=socket)]" -gpu num=1/task
$ echo $LSB_MCPU_HOSTS
host1 1 host2 3
$ cat $LSB_AFFINITY_HOSTFILE
host1 1,2,3,4,5,6,7
host2 0,2,3,4,6,7,8
host2 19,21,22,23,24,26,27
host2 28,29,37,41,48,49,50

I requested a job consisting of 4 slots, each with 7 cores and 1 GPU. As a result, 1 slot is allocated on host1 and 3 slots are allocated on host2, as described by the LSB_MCPU_HOSTS variable.
The file specified by LSB_AFFINITY_HOSTFILE contains one line per slot, giving the core allocation for that slot. Each line has the form hostname core-list, where core-list is a comma-separated list of core IDs.
So a possible solution is to count the core IDs for each host in the $LSB_AFFINITY_HOSTFILE file. In this example that gives 7 cores on host1 and 21 cores on host2.
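
The counting idea could be sketched roughly as below. This is only an illustration, not the project's actual fix: the function name count_cores_per_host is hypothetical, and it assumes every line of the file has exactly the form shown in the example above (a host name followed by one comma-separated list of core IDs). Other core-list notations are not handled.

```shell
#!/bin/sh
# Sketch: derive the number of cores per host from an LSF affinity host file.
# Assumption: each line is "hostname id1,id2,..." with one line per slot.

# count_cores_per_host FILE   (hypothetical helper name)
# Prints one "hostname core_count" line per host, sorted by host name.
count_cores_per_host() {
    awk 'NF >= 2 {
             n = split($2, ids, ",")   # number of core IDs listed for this slot
             cores[$1] += n            # accumulate over all slots of the host
         }
         END { for (h in cores) print h, cores[h] }' "$1" | sort
}

# Only run when the variable is set, i.e. inside a job with an affinity request.
if [ -n "${LSB_AFFINITY_HOSTFILE:-}" ]; then
    count_cores_per_host "$LSB_AFFINITY_HOSTFILE"
fi
```

For the host file in the example above, this prints "host1 7" and "host2 21". Summing over lines rather than keying on the first line for a host is what lets a host with several slots be counted correctly.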

takaomoriyama mentioned this issue Feb 1, 2023