Prefix Julia error output with rank #360
Comments
I don't think there is a way to do this across different MPI implementations.
Alternatively, you could write to a file with MPI I/O using a shared file pointer ( |
For what it's worth, I was unable to get |
One option would be to define an interface like
MPI.Cprint(comm, root) do io
    print(io, ...)
end
which would be collective over |
I also suggested what I think is a better solution to the MPI forum: |
This sounds like a good suggestion. However, would we benefit from this for the error output of Julia itself? In that case, the Julia executable would have to be somehow "MPI-aware", wouldn't it? |
I'm not quite sure yet how it would work. One option would be to modify Interestingly, I did try out using the shared file pointers with |
I am back at this issue again, since we started parallelizing Trixi.jl with MPI. It's really annoying that if there is a runtime issue that only occurs on a subset of all ranks (or even just one), there is no way to discern this from the error message. Instead, you have to re-run and this time add copious amounts of
Do you think it would be feasible to convince Julia main to add the option to specify a prefix that is added to all output lines? And would it even be possible to implement something like this in a sane way? I'm thinking about something like julia -e 'using MPI; MPI.Init(); Base.error_prefix(string(MPI.Comm_rank(MPI.COMM_WORLD)) * ": ")' script.jl that would turn
into
I don't know, as I'm writing this, I can already feel that this is not a very elegant solution, but neither can I come up with something better. It's just that not being able to re-use the compile cache while developing a Julia package with MPI is already painful enough (compared to compiled languages), and the fact that there's no obvious way to connect "compiler errors" to the ranks on which they occur just makes this worse :-( |
If you are running under Slurm or another manager, you can also direct the stderr to a file per rank, or create a wrapper script that adds the Slurm task ID as a prefix to the output. The most reliable way is to use OpenMPI with
Adding an option to Julia would be interesting, but very invasive... Especially if you are interested in errors and not just logged messages. |
@vchuravy Thanks a lot for these suggestions! As far as I can tell from the manual, with Slurm I can use, e.g., #SBATCH --error=errors-%j-%t.out which redirects all errors to a file identified by the job id and the task id (= rank).
How would I be able to achieve this?
This is very interesting indeed. EDIT: I just found it... for MPICH, the |
Yeah Simon mentioned that he had trouble with MPICH #360 (comment)
I don't have a ready made solution, but as an example:
which uses and then you can use something like:
Where |
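The wrapper-script idea mentioned in this thread could look roughly like the sketch below. It is not the script from the comment above (that code did not survive on this page), only a minimal illustration under stated assumptions: the file name `prefix-rank.sh` and the helper names `detect_rank` and `prefix_lines` are made up for this example, and it assumes the launcher exports the rank in `SLURM_PROCID` (Slurm), `OMPI_COMM_WORLD_RANK` (Open MPI) or `PMI_RANK` (MPICH/Hydra).

```shell
#!/usr/bin/env bash
# prefix-rank.sh (hypothetical name): prefix each stderr line of a command
# with the MPI rank of the process that produced it.

# Assumption: the launcher exports the rank in one of these variables.
detect_rank() {
    echo "${SLURM_PROCID:-${OMPI_COMM_WORLD_RANK:-${PMI_RANK:-unknown}}}"
}

# Copy stdin to stdout, prepending the given prefix to every line.
prefix_lines() {
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s%s\n' "$1" "$line"
    done
}

# Only wrap a command when one is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    # Let a failure of the wrapped command become the pipeline's exit status.
    set -o pipefail
    # Swap descriptors so that only stderr passes through the filter;
    # stdout of the wrapped command goes straight to the original stdout (fd 3).
    { "$@" 2>&1 1>&3 3>&- | prefix_lines "rank $(detect_rank): " >&2; } 3>&1
fi
```

It would be used by putting the wrapper between the launcher and Julia, for example `mpiexec -n 4 ./prefix-rank.sh julia script.jl`. Every stderr line, including Julia's own `ERROR: ...` output, then starts with `rank N: `. This only labels complete lines as each process emits them; it does not stop the launcher from interleaving output from different ranks.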
The main pain point for me is interleaving within a line, and these workarounds don't fix that issue. Can something be done about that? Eg line buffering? |
Not that I know of: unfortunately there are no APIs for controlling buffers (each MPI implementation handles the output combination differently). |
Short of writing an I/O handler that controls all output to the terminal, no, I don't think so. Non-interleaving line output to the terminal means that there would have to be a central instance that controls the output, which means global serialization of this problem. Since this is in contrast to the core goals of MPI, I don't think this feature will ever be provided by the MPI libraries themselves. You can do something like this on your own for output to files, using MPI I/O (I've done this for logging purposes before), but it becomes very slow soon (IIRC, with >100 cores the overhead is already significant). Otherwise I think you'll have to implement it yourself, I'm afraid :-/ |
That is pretty annoying. The simplest solution (and probably the only one that makes sense for larger process counts) is to do all the printing on process 0. The problem with that is that external libraries (eg Optim) don't know about MPI. A pretty brutal solution to that is to use |
Yes, all sufficiently large (i.e., beyond toy size) MPI-parallel programs that I know of only print from the MPI root. That's no help though if you're debugging and/or experiencing run time errors, where you typically don't control I/O. The problem with external libraries is exactly the reason for me to create this issue (here Julia being the "external" library). |
Currently, if you are running a Julia/MPI program in parallel and something bad happens, you get a lot of
ERROR: LoadError: LoadError: UndefVarError: ...
messages, which are all horribly interleaved. This in itself is a known "user issue" with MPI and (probably) cannot be fixed in an efficient manner. However, it would already help a lot if, when running Julia/MPI programs, the error messages included the global rank, so that a user has at least a fighting chance of finding out which rank died first. E.g., something like
ERROR (rank 2): LoadError: LoadError: UndefVarError: ...
I don't know if this is even possible (injecting information in the Julia runtime output) without changes to upstream Julia, but it would IMHO be a great help to many scientists.