Blue Gene SPI
The header files and examples can be found in /bgsys/drivers/ppcfloor/spi/. A rather complete documentation of the SPI code is available here: IBM MU SPI Doxygen documentation.
I have a test code now, running on 32 nodes, which performs boundary exchange using the SPI DirectPut method. It can be found in the github repository https://github.com/urbach/Qspi. It combines MPI with the SPI and already uses the tmLQCD MPI variables. It is based on IBM example code. How do we deal with the permissions and copyrights?
The runjob command is as follows:
runjob --ranks-per-node 1 --envs "MUSPI_NUMINJFIFOS=8" --envs "MUSPI_NUMRECFIFOS=8" --envs "MUSPI_NUMBATIDS=2" --np 32 --cwd ${WORK}/${NAME} --exe ${HOME}/bglhome/head/c99/Qspi/DirectPut
I have now also included a little test of overlapping computation and communication, and it seems to work quite well. A good fraction of the communication can be hidden when there is enough to compute, see DirectPut.c. (One needs to comment the for(int m = ... loop in or out.)
- only communications: 79061 cycles
- only computation: 153625 cycles
- comp + comm, no hide: 232212 cycles
- comp + comm, hide: 179236 cycles
Removing the check that everything was sent (which I guess we don't need in tmLQCD) improves the speed to 176634 cycles.
Increasing the package size from 2048 bytes to 4096, 8192, 16384, and 32768 bytes improves the speed to 165700, 160231, 157499, and 156118 cycles, respectively. The total message size is 32768 bytes right now, so with only one message we almost completely hide the communication!
When I use OpenMP for the computation loop with 64 threads, I get 109176 cycles when hiding communication, compared to 187953 cycles when not hiding.
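Schematically, the overlap pattern measured above looks as follows. This is only a minimal sketch, not the exact code from DirectPut.c: injfifo, muDescriptors, recvCounter, ndirs, nloops and do_local_work are placeholder names, and the precise SPI signatures and header paths should be checked against the Doxygen documentation.

#include <stdint.h>
#include <spi/include/mu/InjFifo.h>  /* MUSPI_InjFifo_t, MUSPI_InjFifoInject */

extern void do_local_work(int m);    /* stand-in for the bulk computation */

void exchange_and_compute(MUSPI_InjFifo_t *injfifo,
                          MUHWI_Descriptor_t *muDescriptors,
                          unsigned int ndirs, int nloops,
                          volatile uint64_t *recvCounter)
{
  /* fire off one direct-put descriptor per direction (non-blocking) */
  for (unsigned int j = 0; j < ndirs; j++)
    MUSPI_InjFifoInject(injfifo, &muDescriptors[j]);

  /* computation that does not depend on the boundary data */
  for (int m = 0; m < nloops; m++)
    do_local_work(m);

  /* recvCounter must have been initialised to the total number of bytes
     expected; the MU decrements it as data arrives, so spin until zero */
  while (*recvCounter > 0)
    ;
}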
- Which subgroup ID should one choose for the BATs? Currently it is 0, but in the so-called SPI "docu" (/bgsys/drivers/ppcfloor/spi/doc/html/) it is written: "The MU SPI application should be coded to look for free base address table entries in all of the subgroups associated with the process." (A sketch of scanning all subgroups for a free entry is given after this list.)
- How do we properly run MPI and SPI together? When I use
MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &g_mpi_prov);
the SPI does not manage to get the resources it needs, even though I use environment variables to reserve them, see above. Adding --envs "PAMID_ASYNC_PROGRESS=1" does not change this. Maybe MPI_THREAD_MULTIPLE is not needed? With MPI_THREAD_SERIALIZED it works fine. Using MPI_Init seems to be slightly faster than MPI_Init_thread in general.
  - MPI_Init_thread will always be a bit slower because the MPI implementation provides "per-object" locks (i.e. locks for the MPI structure for each thread). For MPI_THREAD_SERIALIZED this is often only true in theory, and the implementation is essentially equivalent to MPI_THREAD_FUNNELED or even to the implementation without threading. MPI_THREAD_MULTIPLE is needed only when interleaving communication and computation with MPI (because that requires PAMID_ASYNC_PROGRESS=1, which in turn requires MPI_THREAD_MULTIPLE). I'm currently looking into how this can be achieved in practice, to compare with the SPI performance; a sketch of the MPI-only interleaving is given after this list. If the overhead is small, this could be used for general interleaving for any geometry, saving us the headache of generalizing the SPI code.
  - I've looked into this more, and I've come to the conclusion that our MPI implementation is doing comm/comp overlap, but the overhead is very large. I'm guessing this is because MPI seems to require quite a lot of beef from the CPU, and multiple MPI threads get prioritized over work threads in order to do the communication, thereby reducing overall performance. All these tests were with MPI_THREAD_SERIALIZED. What I said above (about PAMID_ASYNC_PROGRESS) is not necessarily true, but I'm not sure I fully understand it...
- We could consider using MPI_Put from the MPI-2 standard!?
  - No, this doesn't seem to work as required because of the way it's structured (see the MPI_Put sketch after this list). Also, in Lugano some people working on other codes reported that MPI_Put is really slow. Maybe not worth trying?
- Dynamic or static routing?
  * Dynamic seems to be slightly faster in my test programme.
- What's the optimal package size? The best is probably to use one package per direction.
- Persistent descriptors: the inverter doesn't converge with SPI if the lines
  muDescriptors[j].Message_Length = msize;
  muDescriptors[j].Pa_Payload = sendBufPAddr;
  MUSPI_SetRecPutOffset(&muDescriptors[j], roffsets[j]);
  are re-set in every call of Hopping_Matrix. Currently I don't understand why; a better understanding would be useful. (A sketch of the intended once-only setup is given after this list.)
- Should we do the send loop in SPI with 8 threads? (See the threaded injection sketch after this list.)
- How do we properly do this also for the 32-bit version of halfspinor?
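Regarding the BAT subgroup question above: the quoted recommendation would amount to something like the following. This is only a sketch; the exact signatures of Kernel_QueryBaseAddressTable and Kernel_AllocateBaseAddressTable, the header paths, and the range of subgroups owned by the process should be checked against the Doxygen documentation.

#include <stdint.h>
#include <spi/include/kernel/MU.h>     /* Kernel_QueryBaseAddressTable, ... */
#include <spi/include/mu/Addressing.h> /* MUSPI_BaseAddressTableSubGroup_t  */

/* scan the subgroups owned by this process (first_subgrp .. last_subgrp) and
   allocate the first free base address table entry that is found */
int allocate_free_bat(uint32_t first_subgrp, uint32_t last_subgrp,
                      MUSPI_BaseAddressTableSubGroup_t *bat,
                      uint32_t *subgrp_out, uint32_t *batid_out)
{
  for (uint32_t sg = first_subgrp; sg <= last_subgrp; sg++) {
    uint32_t nfree = 0;
    uint32_t free_ids[64];             /* generously sized for the free ids */
    if (Kernel_QueryBaseAddressTable(sg, &nfree, free_ids) != 0) continue;
    if (nfree == 0) continue;          /* nothing free in this subgroup     */
    if (Kernel_AllocateBaseAddressTable(sg, bat, 1, free_ids, 0) == 0) {
      *subgrp_out = sg;
      *batid_out  = free_ids[0];
      return 0;                        /* success */
    }
  }
  return -1;                           /* no free BAT entry in any subgroup */
}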
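Regarding interleaving communication and computation with plain MPI (mentioned in the MPI/SPI item above): the pattern would look roughly like this. Buffer names, counts, tags and neighbour ranks are placeholders, and it assumes the job is run with PAMID_ASYNC_PROGRESS=1 and MPI_THREAD_MULTIPLE so that the non-blocking transfers actually progress in the background.

#include <mpi.h>

extern void compute_interior(void);   /* work that needs no halo data       */
extern void compute_boundary(void);   /* work that needs the received halos */

void halo_exchange_overlap(double *sendbuf, double *recvbuf, int count,
                           int nb_plus, int nb_minus, MPI_Comm comm)
{
  MPI_Request req[4];

  /* post the halo exchange first ... */
  MPI_Irecv(recvbuf,         count, MPI_DOUBLE, nb_minus, 0, comm, &req[0]);
  MPI_Irecv(recvbuf + count, count, MPI_DOUBLE, nb_plus,  1, comm, &req[1]);
  MPI_Isend(sendbuf,         count, MPI_DOUBLE, nb_plus,  0, comm, &req[2]);
  MPI_Isend(sendbuf + count, count, MPI_DOUBLE, nb_minus, 1, comm, &req[3]);

  /* ... then overlap it with the interior computation */
  compute_interior();

  /* wait for the halos and finish the boundary part */
  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
  compute_boundary();
}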
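For reference, regarding the MPI_Put item above: a one-sided halo exchange with fence synchronisation would look roughly like this (buffer names, counts and the neighbour rank are placeholders). The data is only guaranteed to have arrived in recvbuf after the closing MPI_Win_fence.

#include <mpi.h>

/* expose the receive buffer in an RMA window and put the boundary data
   directly into the neighbour's memory; in practice the window would be
   created once and reused */
void halo_exchange_put(double *sendbuf, double *recvbuf, int count,
                       int nb_plus, MPI_Comm comm)
{
  MPI_Win win;
  MPI_Win_create(recvbuf, count * sizeof(double), sizeof(double),
                 MPI_INFO_NULL, comm, &win);

  MPI_Win_fence(0, win);
  MPI_Put(sendbuf, count, MPI_DOUBLE, nb_plus, 0, count, MPI_DOUBLE, win);
  MPI_Win_fence(0, win);   /* only now is the data visible in recvbuf */

  MPI_Win_free(&win);
}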
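Regarding the persistent-descriptor item above, the intended setup is to fill the quoted descriptor fields once during initialisation and afterwards only inject the unchanged descriptors inside Hopping_Matrix. A sketch (the function names and the injection call are assumptions for illustration, not the actual tmLQCD code, and the SPI headers are abbreviated):

#include <stdint.h>
#include <spi/include/mu/InjFifo.h>  /* MUSPI_InjFifo_t, MUSPI_InjFifoInject */

/* one-time setup: fill the direct-put descriptors once */
void init_descriptors(MUHWI_Descriptor_t *muDescriptors, unsigned int ndirs,
                      uint64_t msize, uint64_t sendBufPAddr,
                      const uint64_t *roffsets)
{
  for (unsigned int j = 0; j < ndirs; j++) {
    muDescriptors[j].Message_Length = msize;
    muDescriptors[j].Pa_Payload     = sendBufPAddr;
    MUSPI_SetRecPutOffset(&muDescriptors[j], roffsets[j]);
  }
}

/* per call of Hopping_Matrix: only inject, do not touch the fields again */
void inject_descriptors(MUSPI_InjFifo_t *injfifo,
                        MUHWI_Descriptor_t *muDescriptors, unsigned int ndirs)
{
  for (unsigned int j = 0; j < ndirs; j++)
    MUSPI_InjFifoInject(injfifo, &muDescriptors[j]);
}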
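Regarding the question about doing the send loop with 8 threads: a minimal sketch of what that could look like, assuming one injection FIFO per direction (matching MUSPI_NUMINJFIFOS=8) so that no two threads inject into the same FIFO; injfifos and muDescriptors are placeholder names.

#include <spi/include/mu/InjFifo.h>  /* MUSPI_InjFifo_t, MUSPI_InjFifoInject */

/* inject the 8 boundary descriptors from 8 OpenMP threads; each thread uses
   its own injection FIFO, so no locking between the threads is needed */
void threaded_injection(MUSPI_InjFifo_t *injfifos,
                        MUHWI_Descriptor_t *muDescriptors)
{
#pragma omp parallel for num_threads(8)
  for (int j = 0; j < 8; j++)
    MUSPI_InjFifoInject(&injfifos[j], &muDescriptors[j]);
}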