Segfault with vectorization #106
Comments
First remark: you have 256 MPI processes x 14 threads, which is far more than the number of patches (16*16 = 256) that you defined. This means most of your resources will not be used. That said, your error should not happen. I am not sure collisions are the real problem, but I will investigate.
Actually, this error can be caused by the simulation exceeding the time limit on your job request. Have you checked that?
This threads-and-patches issue is a bit unclear to me. What I understand is that the number of patches should be equal to or larger than the number of slots the job is submitted to. On our cluster I submitted the job to 256 slots. I don't know if I can control the number of MPI threads per slot. The second issue I didn't fully understand. If I submit this job without e-e collisions, it takes one or two days to finish. But with e-e collisions on, it crashed after 2 hours of running. In the stdout, one can see only one entry for the output. It should have at least 10 more entries (n_time=57000) before the simulation is finished. If I haven't understood something, then please let me know.
Concerning the first issue, there is some documentation here: http://www.maisondelasimulation.fr/smilei/parallelization.html. Concerning the second issue, I just suggested that the problem could have been the requested time. After what you said, I think this was not the case, so I will investigate. Note that 2 days to run this case is way too much; this is probably due to the first issue.
Actually I have no control over these 14 threads being requested. I talked with our SysAdmin and he said that this is done by Smilei. It would be nice if one could force 1 core = 1 thread in Smilei. We have a cluster where each node has 28 CPU cores, and when running the EPOCH code we just specify the number of slots or CPU cores to run the simulation. I agree that if the total number of threads is too large then Smilei runs slower, as I noticed a while ago. Normally I try running Smilei on 64 slots/CPUs, so that the total number of threads does not become too large. It would be nice if you could explain how to limit the number of MPI threads in the Python script specifying the parameters itself.
There is a problem of vocabulary here. We do not seem to use the same conventions. Let me give you the naming we use: MPI processes and OpenMP threads are software (you choose how many of each to launch), cores and nodes are hardware, and the simulation box is divided into patches.
I don't know what you mean by slot. Smilei does not control the number of MPI processes and OpenMP threads; you have to provide them as explained above. Usually, it is good practice to choose the number of MPI processes equal to the number of nodes, and the number of OpenMP threads equal to the number of cores per node. I don't know how EPOCH works, but it may be that EPOCH does not use this hybrid MPI/OpenMP technique; it would thus be simpler to use, but would lack some performance.
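To make the patch bookkeeping concrete: the decomposition is set in the Python namelist, and the total number of patches should be at least the total number of OpenMP threads (MPI processes times threads per process) so that every thread has work. A minimal, hypothetical sketch of the relevant fragment (the Main block and number_of_patches are standard Smilei namelist names; the values are illustrative, not the reporter's actual setup):

# Hypothetical values: number_of_patches = [16, 16] gives 256 patches in
# total, so e.g. 256 MPI processes x 1 thread, or 16 processes x 16 threads,
# can all be kept busy, whereas 256 processes x 14 threads cannot.
Main(
    geometry = "2Dcartesian",
    number_of_patches = [16, 16],
    # ... cell_length, grid_length, timestep, etc. as in the actual setup
)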
Thanks for the detailed answer. By slot I meant the number of CPU cores (without hyperthreading). I have now set OMP_NUM_THREADS=1 and it has indeed launched 1 OpenMP thread per core. I hope this doesn't compromise the speed advantage of Smilei? I'm running the same simulation again with 256 MPI processes (slots in our convention) to see if it now runs without crashing. I'll let you know the result.
You misunderstood the OMP_NUM_THREADS option. A core is the smallest computing element (it is a piece of hardware). Cores are bundled in nodes: in your case, 28 cores inside one node. Again, a node is a piece of hardware. Do not call a core a CPU, because it is very misleading: some software calls the nodes CPUs, not the cores. Furthermore, do not confuse threads and processes, which are not hardware but software. You should set OMP_NUM_THREADS to the number of cores available to each MPI process, for instance one MPI process per node with OMP_NUM_THREADS=28 on your 28-core nodes.
Thanks for the quick reply. I agree that the terminology is a bit confusing. I talked with our SysAdmin and he says we can't use 10 nodes with OMP_NUM_THREADS=28, as the cluster is never empty. This configuration can only work with the round-robin method of Sun Grid Engine on our cluster, which is a special requirement. Our cluster supports filling up slots (MPI processes) by default. I can try 60 nodes and OMP_NUM_THREADS=5. I read the Parallelization basics page to understand how Smilei works. However, I couldn't find any info about the scaling of Smilei for a given problem. Is it faster with a small number of MPI processes (10) and a high number of OpenMP threads (28), as you just suggested? Or does it not matter, and one can choose a higher number of MPI processes (60) and fewer OpenMP threads (5), as I just mentioned?
Ok. Now I have found out that choosing fewer MPI processes and a higher number of threads is beneficial, as shown here. I'll see how to run Smilei efficiently on our cluster.
But surprisingly, with OMP_NUM_THREADS=1, the code is running and it hasn't crashed yet. In fact it has surpassed the point where it was crashing earlier with e-e collisions. This is puzzling...
What @mccoys said is correct. I just want to add that the last configuration you used (full MPI and 1 OpenMP thread per MPI process) is totally correct, contrary to what you did at first. It is correct, but in many cases not the most efficient; this is what Fred was saying. Nonetheless, this configuration is not that bad when the load remains balanced among the different MPI processes during the whole simulation. I would keep your simulation running to the end, and when you have the time, run it again with 60 nodes and 5 OpenMP threads to compare.
When I ran with 50 MPI processes and OMP_NUM_THREADS=5, it crashed after 1 hour 22 min without displaying any output (see below). With 256 MPI processes and OMP_NUM_THREADS=1, it took 5 hours 16 min to finish. So this is puzzling, and it appears that with e-e collisions on, Smilei crashes with higher OMP_NUM_THREADS. I can try further reducing the total number of threads below the number of patches (256).
It is probably not the cause of the crash, but you should not have this vectorization block in your namelist:

Vectorization(
    mode = "off",
)
@xxirii the vectorization mode is expected. It is due to the fact that collisions require particle sorting.
@mccoys in this case:

Vectorization(
    mode = "on",
)
@Tissot11 Your crash is puzzling, because I have successfully run 12000+ iterations on 2 MPI processes, each containing 64 threads (128 threads in total). There might be an issue with your installation. Do you have more details? Compilers? Libraries? By the way, if your sysadmin tells you that you cannot use whole nodes (you have to share with other users), then Smilei will certainly be less performant, because the communications are increased. The best performance is always achieved by requesting full nodes. If that is impossible, request smaller portions, but make sure that you obtain as many cores as threads, and that you bind threads to cores.
I paste below the output of the ldd smilei command. We use GCC 6.1, OpenMPI 2.0.1, HDF5 1.8.16. I have launched a new job with vectorisation mode on; let's see if that runs. I'll try to use the round-robin queue to launch a job on 4 MPI processes with 28 threads.
linux-vdso.so.1 => (0x00007ffdebba7000)
Neither the vectorisation mode nor launching the job on a machine (occupying all cores) with more than 1 thread per core worked for me. The code always crashed with a segmentation fault. Which versions of OpenMPI, gcc and HDF5 are you using?
In my case, gcc 8.3, OpenMPI 3.1 and HDF5 1.10. The code is usually also tested with Intel 2018 and HDF5 1.8.16.
And which version of Python? I need the exact same configuration as yours (2 MPI processes, each containing 64 threads, 128 threads in total) for this run, to figure out the problem on our cluster, and also to try running more threads and fewer MPI processes as you had suggested.
I use Python 3.7, but you should not try my exact configuration. I ran 64 threads per process on another machine, which is a KNL and thus very different from yours. Your configuration should be fine: I used it recently. The problem is either that the code has not been installed properly, or that it is not set up correctly with respect to your hardware.
Our SysAdmin wanted the exact configuration that you used in order to find the problem. He will look again after the Easter holidays. In the meantime, if you have any idea about what might be wrong on our cluster, please let me know. I would very much like to run Smilei in the most efficient manner. Have a nice Easter holiday!
Wait, I finally got the segmentation fault, after 23000 iterations. This will be hard to investigate...
Ok. Please let me know if you figure out the reasons...
@Tissot11 By any chance, did you manage to get a stack trace from your failed simulation?
@Tissot11 In fact, we have had several reports, I think, of issues between OpenMPI and MPI_THREAD_MULTIPLE. Could you try either using another MPI implementation, or compiling Smilei with config=no_mpi_tm? The second option should do the trick. It might be slightly slower in some cases, but not much. Side note: you are using a ...
I don't have the stack trace of the failed simulations. I can ask our SysAdmin next week when he is back. We also have MVAPICH2 available on our cluster, if that is recommended? I'll have to ask whether IntelMPI can be arranged. I'll compile and run Smilei with config=no_mpi_tm and let you know how it goes. Do you also think that I should take care of the vader issue while running the simulation? http://www.maisondelasimulation.fr/smilei/run.html Thanks for the tips on diagnostics! I'm a noob to Smilei.
You could try the vader thing, but I thought it didn't apply to OpenMPI 3. Not sure though. The best thing to try is no_mpi_tm.
On the suggestion of our SysAdmin, I compiled the code in debug mode with no_mpi_tm and ran the gdb command on the core files. The output looks different now (because of the debug mode). I paste below a section of the full backtrace output. If you have any suggestions for trying a particular command, please let me know.
#0 0x00002b82f488c18c in malloc () from /lib64/libc.so.6
I have a lead to investigate, but this is not an easy task. I will get back to you as soon as possible.
Ok. Thanks for looking into this!
I have made some progress: I can reproduce the same error without collisions, as long as the vectorization mode is set to "adaptive". Note that this option is automatically set for collisions. This shows there is some memory corruption due to the adaptive mode, not collisions. Now I will check whether this is also present in the full-vecto mode.
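Concretely, that setting corresponds to the following namelist block (the same Vectorization block quoted earlier in the thread, with the mode swapped; the inline comment summarizes the adaptive behaviour described in the Smilei documentation):

Vectorization(
    mode = "adaptive",   # each patch switches between scalar and vectorized operators at run time
)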
The error happens again with the vectorization mode on.
Thanks for the update! Does the error happen after a few iterations, as in my case, or does it still take a larger number of iterations (>1000) for you?
I significantly changed your input to reproduce the problem in less time. Still, it takes a few hundred iterations and runs for half an hour. We are still investigating.
Ok. Please let me know if you want me to run any tests etc. on our cluster with respect to this problem.
I was wondering if you have any update on this issue?
We are continuing to investigate, but it is a difficult issue, and last week was a holiday for some people here. We will keep you posted.
Please take your time. I was just curious to ask about it as I plan to run some big simulations soon.
Hi, could you try replacing PusherBorisV with PusherBoris in PusherFactory.h and check whether the segfault is still there? Julien
Hi Julien, is it the line 23 in the PusherFactory.h file? I have replaced it with PusherBoris.h. Just to be sure, do I need to compile the code again? Best regards
No, the one at line 59, in the method.
Thanks for the quick response. Actually I'm getting an error when compiling. Could you please write down the whole line that needs to be changed? My new line 59 now reads as Push = new PusherBoris( params, species ); and I kept the same spacing as in the original PusherBorisV line. Below is the output of the compilation.
In file included from src/Species/SpeciesFactory.h:25:0,
Could you test the branch?
Thanks for the fix. I compiled the code and now I'm running it with 4 MPI processes of 32 threads each (a 4✕32 configuration). The code is running, but it seems that this fix has made Smilei slower. The same run in a 256✕1 configuration yielded the first output in less than 1000 seconds, while the 4✕32 configuration has yet to produce the first output entry after two hours.
It could indeed slow down the code slightly (depending on the number of particles per cell), but not as much as you are seeing.
Actually, the crash before was occurring for 4✕32-type configurations; 256✕1-type configurations I could run successfully. But Smilei is supposed to be faster in 4✕32-type configurations. I did a quick check, and with the fix Smilei is a bit slower in 256✕1-type configurations (compared to the previous version of Smilei), but still much faster than in 4✕32-type configurations (run with the round-robin queue system on our cluster). But now I see that you have pushed a new fix. I'll fetch it and compile the code again. I'll let you know if the new fix speeds up the runtime.
The most important aspect is to solve the segfault! Please confirm if the segfault is gone.
With the second fix the segmentation fault also seems to be gone, based on the results so far! The code is also much faster than with the first fix in 4✕32-type configurations. However, 128✕1-type configurations are still a bit faster than 4✕32-type configurations with this second fix. The speed of the 128✕1-type configurations remains almost the same with the first and second fixes.
Actually, now it seems to run at the same speed in both configurations. The previous post concerned the first output, which was slower with the 4✕32-type configuration. But for the second output, the 4✕32-type configuration seems to be marginally faster than the 128✕1-type configuration. I guess I should wait for the simulation to finish; I'll let you know tomorrow. Do you plan to push the fix to the master branch?
@Tissot11 As this has been merged, I am closing the issue. Don't hesitate to reopen if you see it reappear.
Yeah, thanks for fixing this issue! Till now it's been working fine. I'll let you know in case it comes up again!
Description
I guess there is a problem with e-e collisions. Turning them on causes Smilei to crash with a segmentation error. I talked with our SysAdmin and, after looking into the core files, he suggests that the problem is in the Boris pusher of Smilei. I have fetched the latest version of Smilei via GitHub to avoid the bug reported in #79.
If available, copy-paste faulty code, warnings, errors, etc.
Output of stderr
Output of stdout
Got 256.
Steps to reproduce the problem
If relevant, provide a step-by-step guide
Parameters
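The reporter's actual namelist is not shown here. Purely as an illustration of the kind of setup discussed in this thread (16x16 patches, roughly 57000 timesteps, e-e collisions, adaptive vectorization), a minimal Smilei-style namelist could look like the sketch below; every value is hypothetical and the block and parameter names follow the public Smilei namelist conventions:

# Hypothetical sketch, not the reporter's actual input deck.
Main(
    geometry = "2Dcartesian",
    interpolation_order = 2,
    cell_length = [0.1, 0.1],
    grid_length = [51.2, 51.2],
    number_of_patches = [16, 16],   # 256 patches in total
    timestep = 0.05,
    simulation_time = 2850.,        # ~57000 timesteps, as mentioned above
)

Species(
    name = "electron",
    position_initialization = "random",
    momentum_initialization = "maxwell-juettner",
    particles_per_cell = 16,
    mass = 1.,
    charge = -1.,
    number_density = 1.,
    temperature = [0.01],
)

# Electron-electron collisions, the feature that triggered the crash
Collisions(
    species1 = ["electron"],
    species2 = ["electron"],
    coulomb_log = 2.,
)

# Collisions implicitly require particle sorting, hence the adaptive vectorization mode
Vectorization(
    mode = "adaptive",
)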