Bug fix related to issue #131, used wrong variable in calculation of … #132

Open: pelahi wants to merge 4 commits into master

pelahi
Collaborator

@pelahi pelahi commented Feb 29, 2024

... offsets sending data.
This is to address issue #131. Matthieu can try this fix.


@rtobar rtobar left a comment

Changes look good and well confined. I'll hold off merging until @MatthieuSchaller confirms this fixes the problem, but I'll approve now.

@MatthieuSchaller

I have tried it and it hasn't crashed but the code timed out before completion.
Maybe I did something silly (likely!) or some comms don't match and we hang. Let me check.

@MatthieuSchaller

This seems to have worked. I am crashing when writing the catalogs now but that is another problem I reckon. I am checking with more nodes just to confirm.

@MatthieuSchaller

Actually, crashed again.

32 ranks with 32 threads each. 752 baryon particles. 4x more DM.
Last line in stdout:

[0005] [8068.492] [ info] localfield.cxx:24 Getting local velocity density

stderr:

Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x7f8d5b0a2400, scount=-1422197376, MPI_BYTE, dest=5, stag=20, rbuf=0x2d78df050, rcount=49360, MPI_BYTE, src=5, rtag=20, MPI_COMM_WORLD, status=0x7fffafa97410) failed
MPI_Sendrecv(125): Negative count, value is -1422197376
Fatal error in MPI_Sendrecv: Invalid count, error stack:
MPI_Sendrecv(259): MPI_Sendrecv(sbuf=0x6fd96b6f0, scount=49360, MPI_BYTE, dest=4, stag=20, rbuf=0x7f0f34415840, rcount=-1422197376, MPI_BYTE, src=4, rtag=20, MPI_COMM_WORLD, status=0x7ffd3fab6a10) failed
MPI_Sendrecv(126): Negative count, value is -1422197376
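
The negative count in the error stack is what a byte count above `INT_MAX` looks like once it has been stored in a 32-bit signed `int`. The sketch below reproduces the arithmetic; the helper names are hypothetical and are not taken from the code base. It shows that 2872769920 bytes (about 2.9 GB) is the smallest size that wraps to the reported value, although the true size could differ from it by a multiple of 2^32.

```cpp
#include <cstdint>

// Hypothetical helper: mimics storing a 64-bit byte count in a 32-bit
// signed int. The narrowing keeps the low 32 bits (two's-complement wrap;
// guaranteed since C++20, and the behaviour of mainstream compilers before).
inline std::int32_t wrap_to_int32(std::int64_t nbytes) {
    return static_cast<std::int32_t>(static_cast<std::uint32_t>(nbytes));
}

// Hypothetical helper: smallest non-negative 64-bit value that wraps to
// the given 32-bit count.
inline std::int64_t smallest_unwrapped(std::int32_t count) {
    return count >= 0 ? static_cast<std::int64_t>(count)
                      : static_cast<std::int64_t>(count) + (1LL << 32);
}
```

Calling `smallest_unwrapped(-1422197376)` gives 2872769920, which exceeds `INT_MAX` (2147483647), whereas the other count in the log, 49360, is well within range and passes through unchanged.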

@pelahi
Collaborator Author

pelahi commented Mar 5, 2024

I'll see if I forgot something when pulling out some minimal changes.

@MatthieuSchaller

Would you know of a version SHA that pre-dates the MPI refactoring but provides the same physics features as the latest version?

@pelahi
Collaborator Author

pelahi commented Mar 6, 2024

Hi @MatthieuSchaller, just a quick note that I'll look for the SHA with the older MPI code, but I will only be able to test large runs in two days or so, when Setonix is back online. I think the issue is that I left some variables as ints that should be unsigned ints, and others needed to be long long int (for offsets in the buffer).
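
One common way to keep offsets 64-bit while still handing an `int` count to each MPI call is to split a large buffer into chunks. The following is a minimal sketch under that assumption, not the change made in this PR; `Chunk` and `plan_chunks` are hypothetical names, and the MPI call itself is left out so the example stands alone.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <vector>

// Hypothetical description of one transfer: where it starts in the
// buffer (64-bit) and how many bytes it carries (fits in an int).
struct Chunk {
    std::int64_t offset;
    int count;
};

// Split total_bytes into pieces of at most max_chunk bytes each.
std::vector<Chunk> plan_chunks(std::int64_t total_bytes, int max_chunk = INT_MAX) {
    assert(total_bytes >= 0 && max_chunk > 0);
    std::vector<Chunk> chunks;
    for (std::int64_t offset = 0; offset < total_bytes; offset += max_chunk) {
        const std::int64_t remaining = total_bytes - offset;
        const int count = remaining < max_chunk ? static_cast<int>(remaining)
                                                : max_chunk;
        chunks.push_back({offset, count});
    }
    return chunks;
}
```

For a 2872769920-byte buffer this yields two chunks, of 2147483647 and 725286273 bytes. Each chunk would then be passed to a call such as `MPI_Sendrecv` using `buffer + chunk.offset` and `chunk.count`. In a real exchange both ranks would also need to agree on how many chunks travel in each direction.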

@MatthieuSchaller

Shall I test these latest changes?

@pelahi
Collaborator Author

pelahi commented Mar 13, 2024

@MatthieuSchaller, you're welcome to give it a try, but I haven't tried them yet either. I just figured there must be issues with ints, and these updates should switch the relevant variables to unsigned ints and long long ints where appropriate. I've submitted a few tests on some large and small sims and should have results to look at tomorrow. It is possible that things might still crash, but hopefully I haven't forgotten anything.

@pelahi
Collaborator Author

pelahi commented Mar 14, 2024

Hi @MatthieuSchaller, I tried with a DM 2600^3 sim on 48 MPI ranks, each rank having ~3-4e8 particles, and it ran without issues. Did the updates work for you?

3 participants