Cray OpenSHMEM-X 11 - do not merge #22

Draft: wants to merge 2 commits into master

Conversation

jeffhammond
Collaborator

@jeffhammond jeffhammond commented Jul 30, 2023

This is a record of all the hacks I am doing for LUMI due to Cray OpenSHMEM-X 11 not being standard-compliant.

Currently, I am able to build with warnings. The most egregious hack is due to the Cray header declaring reductions to return void not int.

I think the following has a workaround in the source already but I guess I didn't activate it properly.

jhammond@uan01:~/shmem4py> /opt/cray/pe/craype/2.7.19/bin/cc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -march=x86-64 -march=x86-64 -fPIC -I/pfs/lustrep3/users/jhammond/shmem4py/src -I/opt/cray/pe/python/3.9.13.1/include/python3.9 -c build/temp.linux-x86_64-cpython-39/shmem4py.api.c -o build/temp.linux-x86_64-cpython-39/build/temp.linux-x86_64-cpython-39/shmem4py.api.o -ferror-limit=3
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7133:3: warning: call to undeclared function 'shmem_complexd_prod_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  shmem_complexd_prod_reduce(x0, x1, x2, x3);
  ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7181:5: warning: call to undeclared function 'shmem_complexd_prod_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  { shmem_complexd_prod_reduce(x0, x1, x2, x3); }
    ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7196:3: warning: call to undeclared function 'shmem_complexd_sum_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  shmem_complexd_sum_reduce(x0, x1, x2, x3);
  ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7244:5: warning: call to undeclared function 'shmem_complexd_sum_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  { shmem_complexd_sum_reduce(x0, x1, x2, x3); }
    ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7259:3: warning: call to undeclared function 'shmem_complexf_prod_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  shmem_complexf_prod_reduce(x0, x1, x2, x3);
  ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7307:5: warning: call to undeclared function 'shmem_complexf_prod_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  { shmem_complexf_prod_reduce(x0, x1, x2, x3); }
    ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7322:3: warning: call to undeclared function 'shmem_complexf_sum_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  shmem_complexf_sum_reduce(x0, x1, x2, x3);
  ^
build/temp.linux-x86_64-cpython-39/shmem4py.api.c:7370:5: warning: call to undeclared function 'shmem_complexf_sum_reduce'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
  { shmem_complexf_sum_reduce(x0, x1, x2, x3); }
    ^
In file included from build/temp.linux-x86_64-cpython-39/shmem4py.api.c:570:
In file included from /pfs/lustrep3/users/jhammond/shmem4py/src/libshmem.c:36:
/pfs/lustrep3/users/jhammond/shmem4py/src/libshmem/compat/cray.h:5:6: warning: unused function 'shmem_complexf_sum_to_all' [-Wunused-function]
void shmem_complexf_sum_to_all(float _Complex *dest, const float _Complex *source, int nreduce,
     ^
/pfs/lustrep3/users/jhammond/shmem4py/src/libshmem/compat/cray.h:15:6: warning: unused function 'shmem_complexd_sum_to_all' [-Wunused-function]
void shmem_complexd_sum_to_all(double _Complex *dest, const double _Complex *source, int nreduce,
     ^
/pfs/lustrep3/users/jhammond/shmem4py/src/libshmem/compat/cray.h:25:6: warning: unused function 'shmem_complexf_prod_to_all' [-Wunused-function]
void shmem_complexf_prod_to_all(float _Complex *dest, const float _Complex *source, int nreduce,
     ^
/pfs/lustrep3/users/jhammond/shmem4py/src/libshmem/compat/cray.h:36:6: warning: unused function 'shmem_complexd_prod_to_all' [-Wunused-function]
void shmem_complexd_prod_to_all(double _Complex *dest, const double _Complex *source, int nreduce,
     ^
12 warnings generated.

@jeffhammond jeffhammond changed the title from "Cray openshmemx 11" to "Cray OpenSHMEM-X 11 - do not merge" on Jul 30, 2023
@jeffhammond
Collaborator Author

Here is an example of an error in the Cray header:

#define SHMEM_C_TEAM_REDUCE_ALL(TYPENAME,OP,TYPE)                              \
  EXPORT void shmem_##TYPENAME##_##OP##_reduce(shmem_team_t team, TYPE *dest,  \
                                              _CNST TYPE *source, _SIZ_T nreduce)

The OpenSHMEM 1.5 specification says

int shmem_TYPENAME_<op>_reduce(shmem_team_t team, TYPE *dest, const TYPE *source, size_t nreduce);

@jeffhammond
Collaborator Author

My hack for complex[fd] was not good:

jhammond@uan01:~/shmem4py> srun -n 4 python3 nstream-numpy-shmem.py 10 10000000
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 18, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 18, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 18, in <module>
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 18, in <module>
    from .api import ffi, lib
ImportError: /users/jhammond/.local/lib/python3.9/site-packages/shmem4py/api.abi3.so: undefined symbol: shmem_complexd_prod_reduce
    from .api import ffi, lib
ImportError: /users/jhammond/.local/lib/python3.9/site-packages/shmem4py/api.abi3.so: undefined symbol: shmem_complexd_prod_reduce
    from .api import ffi, lib
ImportError: /users/jhammond/.local/lib/python3.9/site-packages/shmem4py/api.abi3.so: undefined symbol: shmem_complexd_prod_reduce
    from .api import ffi, lib
ImportError: /users/jhammond/.local/lib/python3.9/site-packages/shmem4py/api.abi3.so: undefined symbol: shmem_complexd_prod_reduce
srun: error: nid007283: tasks 0-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=4252541.5
jhammond@uan01:~/shmem4py> srun -n 4 python3 nstream-numpy-shmem.py 10 10000000
/opt/cray/pe/python/3.9.13.1/bin/python3: can't open file '/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py': [Errno 2] No such file or directory
/opt/cray/pe/python/3.9.13.1/bin/python3: can't open file '/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py': [Errno 2] No such file or directory
/opt/cray/pe/python/3.9.13.1/bin/python3: can't open file '/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py': [Errno 2] No such file or directory
/opt/cray/pe/python/3.9.13.1/bin/python3: can't open file '/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py': [Errno 2] No such file or directory
srun: error: nid007283: tasks 0-3: Exited with exit code 2
srun: launch/slurm: _step_signal: Terminating StepId=4252541.6

@jeffhammond
Collaborator Author

I guess changing the type of a team also matters:

Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
Traceback (most recent call last):
  File "/pfs/lustrep3/users/jhammond/shmem4py/nstream-numpy-shmem.py", line 72, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 559, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 559, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 559, in <module>
    from shmem4py import shmem
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 559, in <module>
    TEAM_WORLD:   Team = Team(lib.SHMEM_TEAM_WORLD)
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 394, in __new__
    TEAM_WORLD:   Team = Team(lib.SHMEM_TEAM_WORLD)
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 394, in __new__
    TEAM_WORLD:   Team = Team(lib.SHMEM_TEAM_WORLD)
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 394, in __new__
    TEAM_WORLD:   Team = Team(lib.SHMEM_TEAM_WORLD)
  File "/users/jhammond/.local/lib/python3.9/site-packages/shmem4py/shmem.py", line 394, in __new__
    raise TypeError(f"unexpected type: {type(team)}")
TypeError: unexpected type: <class 'int'>
    raise TypeError(f"unexpected type: {type(team)}")
TypeError: unexpected type: <class 'int'>
    raise TypeError(f"unexpected type: {type(team)}")
TypeError: unexpected type: <class 'int'>
    raise TypeError(f"unexpected type: {type(team)}")
TypeError: unexpected type: <class 'int'>
srun: error: nid007283: tasks 0-3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=4252541.8

@dalcinl
Member

dalcinl commented Jul 30, 2023

> I guess changing the type of a team also matters:

Definitely! This is something we should handle in the build system, with a configure-style check.

Or we just advocate to implementers to stop with the integer-handle obsession, maybe pointing to our MPI-5 efforts.

In the particular case of Cray, we have a very simple point to make: they already use pointers for contexts, why don't they just do the same for teams? Moreover, unsigned long and void* have the same size for the platforms they care about, right?

@jeffhammond
Collaborator Author

I think you should just include shmem.h and use shmem_team_t directly instead of trying to deduce what it is. Is that not possible?

@jeffhammond
Collaborator Author

Treating void * and unsigned long as interchangeable is gross. There is only one C type for which that is guaranteed to be true: (u)intptr_t. Sure, it's true on *nix, but it's not portable C.
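To make the size question concrete, here is a minimal sketch that asks the local C ABI (through Python's standard `ctypes` module) how wide each candidate handle type is. The helper names `handle_sizes` and `same_size_as_pointer` are made up for this illustration. Matching sizes only say the two types occupy the same number of bytes on the machine running the script; they say nothing about what the C standard guarantees.

```python
import ctypes


def handle_sizes():
    """Return the size in bytes of candidate handle representations here."""
    return {
        "void *": ctypes.sizeof(ctypes.c_void_p),
        "unsigned long": ctypes.sizeof(ctypes.c_ulong),
        "int": ctypes.sizeof(ctypes.c_int),
    }


def same_size_as_pointer(type_name):
    """True if the named type is as wide as a data pointer on this platform."""
    sizes = handle_sizes()
    return sizes[type_name] == sizes["void *"]


if __name__ == "__main__":
    for name, size in handle_sizes().items():
        print(f"{name:14s} {size} bytes")
    print("unsigned long matches void *:", same_size_as_pointer("unsigned long"))
```

On LP64 systems (Linux, macOS) both types are 8 bytes; on 64-bit Windows (LLP64) `unsigned long` is 4 bytes while `void *` is 8, which is the portability gap being pointed out.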

@dalcinl
Member

dalcinl commented Jul 30, 2023

> Treating void * and unsigned long as interchangeable is gross.

That's not the point I was trying to make. What I meant is that if Cray switches to pointers, then the team handle will be the same size as it is now on most 64-bit platforms they care about.

> I think you should just include shmem.h and use shmem_team_t directly instead of trying to deduce what it is. Is that not possible?

This is an issue/limitation of the CFFI package. I never figured out how to tell CFFI that this type may be either a pointer or an integral type. And if you assume one or the other, then at some point something breaks (in the sense that the generated C code is broken).

@jeffhammond
Collaborator Author

is it possible to use something like this?

typedef union { int i; void* p; intptr_t ip; } shmem4py_team_t;

@dalcinl
Member

dalcinl commented Jul 30, 2023

> is it possible to use something like this?
>
> typedef union { int i; void* p; intptr_t ip; } shmem4py_team_t;

I've tried that, but then other issues arise. For example, CFFI would not allow you to compare handles by equality, pretty much as you cannot compare two struct values in C.

@jeffhammond
Collaborator Author

sure but these are the internal representation of an opaque type so why should CFFI need to do that? if it needs that, i can do the comparison using something other than ==, can it not?
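As an analogy only, the same situation can be reproduced with `ctypes` from the standard library instead of CFFI: two union instances holding the same value do not compare equal with `==`, but a small helper that compares their raw bytes does the job. The class `TeamHandle` and the helpers `make_handle` and `eq_team` are hypothetical names for this sketch, and `c_ssize_t` stands in for `intptr_t` because ctypes has no `intptr_t` type (same width on mainstream platforms, which is an assumption).

```python
import ctypes


class TeamHandle(ctypes.Union):
    # Hypothetical mirror of:
    #   typedef union { int i; void* p; intptr_t ip; } shmem4py_team_t;
    # c_ssize_t is an assumed same-width stand-in for intptr_t.
    _fields_ = [
        ("i", ctypes.c_int),
        ("p", ctypes.c_void_p),
        ("ip", ctypes.c_ssize_t),
    ]


def make_handle(value):
    """Build a handle whose pointer-sized member holds `value`."""
    handle = TeamHandle()  # ctypes zero-initializes the storage
    handle.ip = value
    return handle


def eq_team(a, b):
    """Compare two handles by their raw bytes instead of with ==."""
    return bytes(a) == bytes(b)


if __name__ == "__main__":
    a, b = make_handle(7), make_handle(7)
    print("a == b      :", a == b)         # identity comparison, not by value
    print("eq_team(a,b):", eq_team(a, b))  # byte-wise comparison
```

Whether an equivalent helper is practical on the CFFI side depends on how the handle is declared there, so this only shows the shape of the idea.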

@jeffhammond
Collaborator Author

what if shmem_team_t were an actual struct - what would we do in that case? i assume CFFI can support such types.

@jeffhammond
Collaborator Author

jeffhammond commented Jul 30, 2023

nevermind, i RTFM'd and it's impossible

There are a few (obscure) limitations to the supported argument and return types. These limitations come from libffi and apply only to calling function pointers; in other words, they don’t apply to non-variadic cdef()-declared functions if you are using the API mode. The limitations are that you cannot pass directly as argument or return type:

  • a union (but a pointer to a union is fine);
  • a struct which uses bitfields (but a pointer to such a struct is fine);
  • a struct that was declared with “...” in the cdef().

https://cffi.readthedocs.io/en/latest/using.html

@dalcinl
Member

dalcinl commented Jul 30, 2023

Jeff, I have tried hard, trust me. I already hit this issue in our first failed attempt to support NVSHMEM, because of the type of shmem_ctx_t. The easy "solution" is to run a configure check to determine whether the ctx/team types are pointers or integrals. I have not done that yet because this is the first time I have faced an implementation (other than NVSHMEM) that does not use pointers.
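A configure-style check of that kind could look roughly like the sketch below. It is not what shmem4py's build does; the function names (`probe_source`, `compiles`, `classify_handle`) are invented here, and it assumes a C compiler reachable as `cc` and a handle type that is a complete type. The trick is that a bitwise OR (`x | 0`) only compiles for integer operands, so a probe that passes a baseline compile but fails with the OR is not an integer type.

```python
import os
import subprocess
import tempfile


def probe_source(header, typename, bitwise):
    """Return C source for a compile-only probe of `typename` from `header`."""
    # `x | 0` is valid only for integer types; pointers and structs reject it.
    body = "return x | 0;" if bitwise else "return x;"
    return (
        f"#include <{header}>\n"
        f"{typename} probe({typename} x) {{ {body} }}\n"
    )


def compiles(source, cc="cc", extra_flags=()):
    """Return True if `source` compiles to an object file with `cc`."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        obj = os.path.join(tmp, "probe.o")
        with open(src, "w") as handle:
            handle.write(source)
        result = subprocess.run(
            [cc, *extra_flags, "-c", src, "-o", obj],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0


def classify_handle(header, typename, cc="cc", extra_flags=()):
    """Classify `typename` as 'integral' or 'non-integral' (pointer/struct)."""
    if not compiles(probe_source(header, typename, False), cc, extra_flags):
        raise RuntimeError(f"cannot compile a baseline probe for {typename}")
    if compiles(probe_source(header, typename, True), cc, extra_flags):
        return "integral"
    return "non-integral"


if __name__ == "__main__":
    # Needs a working `cc`; stddef.h is used so no SHMEM install is required.
    print("size_t:", classify_handle("stddef.h", "size_t"))
```

For the real case one would call something like `classify_handle("shmem.h", "shmem_team_t", extra_flags=[...])` with the include flags for the SHMEM installation and feed the answer into the generated declarations.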

PS: An OpenSHMEM ABI spec would be even simpler than the MPI case, the API is way smaller. Using pointers instead of integers for handles is a necessary step for that.

@jeffhammond
Collaborator Author

i fixed all the compilation errors with shmem4py.api.c by explicitly casting team arguments (under the assumption that unsigned long and void* are compatible) but it still fails at runtime.

dalcinl and others added 2 commits August 1, 2023 17:55
This allows for OpenSHMEM implementations declaring ctx/team types as
either pointer types or integral types.

Declaring `shmem_{ctx|team}_t` as an opaque struct type has an annoying
side-effect: now ctx/team handles can no longer be compared for equality.
Therefore, add a couple auxiliary functions `eq_{ctx|team}` to compare
handles for equality. Additionally, in case we need them in the future,
add functions to convert to/from integer values.
@dalcinl
Member

dalcinl commented Aug 1, 2023

@jeffhammond I force-pushed your branch. Can you give it a new try?

Do you plan to report upstream all the nits we have run into? So far I count the following:

  • It would be ideal if shmem_team_t were a typedef for a pointer type. If Cray does not take action on this, then at least they should take immediate action on the following item.
  • The SHMEM_TEAM_NULL constant should be defined as ULONG_MAX. Right now it is defined as -1, which has type signed int, whereas team handles are of type unsigned long. As a result, users get warnings if they compile with -Wconversion.
  • All reductions have the wrong return type: void instead of int.
  • The shmem_complex{f|d}_sum_reduce functions are not implemented, although they are trivial to support (for example, by invoking the float/double sum_reduce function with twice the element count, as in the hack I added in this PR).
  • The shmem_complex{f|d}_prod_reduce functions are not implemented.
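The reason the sum_reduce item is trivial while the prod_reduce item is not can be checked with a few lines of plain Python. This is an illustration of the arithmetic only, not the code in this PR: each inner list stands for one PE's contribution, and the helper names are made up. A complex sum is just independent sums of the real and imaginary parts, so a real-valued sum over the interleaved (re, im) buffer with twice the element count gives the right answer; a complex product mixes the two parts, so the same trick fails.

```python
import math


def interleave(values):
    """Flatten complex values into [re0, im0, re1, im1, ...]."""
    flat = []
    for z in values:
        flat.extend((z.real, z.imag))
    return flat


def deinterleave(flat):
    """Rebuild complex values from an interleaved real buffer."""
    return [complex(flat[i], flat[i + 1]) for i in range(0, len(flat), 2)]


def sum_reduce(contributions):
    """Elementwise sum across PEs (stand-in for a real-valued sum reduction)."""
    return [sum(column) for column in zip(*contributions)]


def prod_reduce(contributions):
    """Elementwise product across PEs."""
    return [math.prod(column) for column in zip(*contributions)]


def complex_via_real(reduce_fn, contributions):
    """Apply a real-valued reduction to interleaved data, then rebuild."""
    return deinterleave(reduce_fn([interleave(c) for c in contributions]))


if __name__ == "__main__":
    pes = [[1 + 2j, 3 - 1j], [0.5 + 0.5j, -3 + 4j]]
    print("sum matches :", complex_via_real(sum_reduce, pes) == sum_reduce(pes))
    print("prod matches:", complex_via_real(prod_reduce, pes) == prod_reduce(pes))
```

So a product reduction needs a real complex multiply somewhere, either in the library or in a hand-written fallback.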

@jeffhammond
Collaborator Author

Builds fine but crashes because the implementation of teams is buggy.

~/shmem4py> srun -n 16 python3 -m unittest discover -s test
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 1 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 5 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 8 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 9 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 10 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 11 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 12 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 13 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 14 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

.............................. LIBSMA ERROR: _smai_collect_guts was called with invalid arguments for the start, stride or size. PE_start=0 PE_stride=2 PE_size=16
 PE 15 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

 PE 0 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

 PE 2 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

 PE 3 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

 PE 4 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

 PE 6 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

 PE 7 [Tue Aug  1 21:16:28 2023] [unknown] [nid007249]  Aborting job.

srun: error: nid007249: tasks 0-14: Aborted
srun: launch/slurm: _step_signal: Terminating StepId=4287532.0
slurmstepd: error: *** STEP 4287532.0 ON nid007249 CANCELLED AT 2023-08-01T21:16:28 ***
srun: error: nid007249: task 15: Aborted (core dumped)
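One plausible reading of the LIBSMA message, assuming PE_stride is an actual stride (not a log2 stride) and that the job really has 16 PEs as in the srun line: an active set starting at PE 0 with stride 2 and 16 members would end at PE 30, which does not exist. A strided team over 16 PEs with stride 2 can hold at most 8 members, so a reported size of 16 looks like the world size leaking into the team's description. The helper names below are hypothetical and only check the arithmetic.

```python
def last_pe(start, stride, size):
    """Highest PE index touched by an active set (start, stride, size)."""
    return start + (size - 1) * stride


def active_set_fits(start, stride, size, npes):
    """True if every member of the active set is a valid PE in [0, npes)."""
    if start < 0 or stride < 1 or size < 1:
        return False
    return last_pe(start, stride, size) < npes


if __name__ == "__main__":
    npes = 16
    # Values taken from the error message above.
    print("last PE for size 16:", last_pe(0, 2, 16))
    print("size 16 fits:", active_set_fits(0, 2, 16, npes))
    # What an every-other-PE team over 16 PEs would be expected to use.
    print("size 8 fits :", active_set_fits(0, 2, 8, npes))
```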
