So how many GPUs are on a Perlmutter GPU node again? #851
Comments
Is this running on a single node? I see this line:
What if you change it to
and then, when you run your job, use
I find these arguments generally work well for me. Beyond this, I'm not readily familiar with how that flag works. Also, can you verify there are indeed 4 GPUs visible when you run?
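For instance, a quick check from inside the allocation (a minimal sketch, assuming nvidia-smi is on the node's PATH; the srun prefix just runs it as a single task in the job step):

```bash
# List the GPUs visible to the job step -- expect four A100 lines on a Perlmutter GPU node.
srun -n 1 nvidia-smi -L

# Or just count them:
srun -n 1 nvidia-smi --query-gpu=index,name --format=csv,noheader | wc -l
```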
I confirm this. This particular
On a side note, according to
If
then I would expect you to set
Also, as far as using the virtual threads goes, some libraries will use them if available; it might not be the merge code specifically, but maybe some lower-level code. Where is the code for the annulus worker that calls diffBragg? Is this in LS49?
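On the virtual-threads point: if the lower-level code happens to be OpenMP-backed (an assumption here, not something established in this thread), it will typically grab whatever hardware threads it can see unless told otherwise. A common pattern in a Slurm job script is to pin the thread pool to the allocation:

```bash
# Tie the OpenMP thread pool to what Slurm actually allocated per task,
# so OpenMP-backed extensions do not silently oversubscribe the node.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}  # falls back to 1 if -c was not set
export OMP_PLACES=threads
export OMP_PROC_BIND=spread
```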
I had nothing particular in mind when setting it. That's a fair question about GPU use – I looked at CPUs only and somehow neglected to check the GPUs themselves. I have just tested both approaches. The code for the annulus worker is in the psii_spread repository; a workflow that utilizes it is in the LS49 SPREAD README, steps 8–11.
I think the default -c is 2 (just guessing); could you then add -c4 to my command to use 128 threads?
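For context, a Perlmutter GPU node exposes 128 logical CPUs (64 Zen 3 cores with 2 hardware threads each), so the MPI task count and -c (--cpus-per-task) multiply up to at most 128. A hedged sketch of how the flags might combine on one node (the task count and the launched command are placeholders, not taken from the original command in this thread):

```bash
# 32 ranks x 4 CPUs per rank = 128 logical CPUs (the whole node),
# with all four A100s exposed to the job step.
srun -N 1 -n 32 -c 4 --gpus-per-node=4 ./run_merge.sh   # ./run_merge.sh is a placeholder
```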
Oh, and if your code is using
@dermen I think the default is … Adding
Closing because … It seems the meaning of …
Can you reopen, @Baharis? I want to pick this up again, maybe next week. Last week's PM outage and this week's travel upset my schedule a bit.
A Perlmutter GPU node features 1x AMD EPYC 7763 CPU and 4x NVIDIA A100 GPUs (link). Therefore, it would be reasonable to assume that, when running scripts which utilize CUDA or KOKKOS, the environment variable CCTBX_GPUS_PER_NODE should be set to 4. To my surprise, I discovered today that setting it to anything but 1 causes a CUDA assertion error. This might be intended behavior, but I found it confusing. Following @JBlaschke's suggestion, I made this issue to discuss it.
The issue can be recreated by running the following file:
/global/cfs/cdirs/m3562/users/dtchon/p20231/common/ensemble1/SPREAD/8mosaic/debug/mosaic_lastfiles.sh
on a Perlmutter GPU interactive node:
salloc -N 1 -J mosaic_int -A m3562_g -C gpu --qos interactive -t 15
If you don't have access or a cctbx installation including the psii_spread workers, the relevant portion of the file is essentially:
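The sketch below is only a hypothetical stand-in, not the contents of mosaic_lastfiles.sh: the rank count, -c value, and worker script are placeholders, and only the environment variable and the salloc line quoted above come from this issue (libtbx.python is assumed to be the cctbx dispatcher in use).

```bash
# Hypothetical sketch (placeholders only); the real commands are in mosaic_lastfiles.sh above.
# Inside the interactive allocation requested with the salloc line quoted earlier:

export CCTBX_GPUS_PER_NODE=4   # the setting under discussion; 1 runs fine, >1 triggers the CUDA assertion error

# Illustrative launch line; rank count, -c value, and the worker script are placeholders.
srun -n 8 -c 16 --gpus-per-node=4 libtbx.python psii_spread_worker_placeholder.py
```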