-
Notifications
You must be signed in to change notification settings - Fork 78
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When using this together with mpi4py I get a double call to MPI_Init #3
Comments
jthestness
added a commit
that referenced
this issue
Aug 14, 2017
* Introduce MPI allreduce in a new contrib project. This commit adds the tensorflow.contrib.mpi namespace and contrib project, which has a variety of ops that work with MPI. The MPI system works by starting a background thread which communicates between the different processes at a regular interval and schedules asynchronous reductions. At every tick, every rank will notify rank zero of the tensors it is ready to reduce, signifying completion with an empty DONE message. Rank zero will count how many ranks are ready to reduce every tensor, and, whenever a tensor is ready to reduce (that is, every rank is ready to reduce it), rank zero will issue a message to all other ranks directing them to reduce that tensor. This repeats for all the tensors that are ready to reduce, after which rank zero sends all other ranks a DONE message indicating that the tick is complete. Reviewed-by: Joel Hestness <[email protected]> * Allreduce/Allgather: Major changes and fixes (#2) This commit constitutes many major updates to the TF MPI allreduce and allgather ops. Specifically, the following changes are included in this commit: 1) The allreduce and allgather ops had race conditions, which this commit fixes. Specifically, the BackgroundThreadLoop previously allocated temporary and output tensors after the main graph traversal thread has completed its call to MPIAll*::ComputeAsync(). Unfortunately, the ops kernel context's memory allocator is only guaranteed to be valid during the ComputeAsync call. This constraint requires ComputeAsync to allocate all tensors before returning; Otherwise, the memory allocator state may reflect allocations and deallocations from further ops that can cause races for the memory locations. To fix this, hoist the memory allocations to ComputeAsync. In this process, introduce a collective op record, which tracks the parameters of the op (e.g. input, output, and configurations). 2) Many models require capability to allreduce or allgather int64 tensors. We add functionality to handle long long data type (64-bit ints). 3) Eliminate the thread sleep. A major to-do item is to eliminate the need for polling between coordinator threads and other ranks. This change will require the coordinator rank to be able to wake up all other ranks when a collective is ready to be performed, but also for all ranks (i.e. background threads) to be woken up by graph traversal threads. In the meantime, remove the thread sleep, because it introduces significant run time overhead (e.g. >20%) for models with quick-running layers (e.g. few recurrent time-steps or few hidden nodes per layer). * mpi_ops.cc: Move toward more TF nature This commit changes a few bits and pieces to align more closely with Tensorflow structures and organization: 1) Use TF mutexes. TF mutexes provide nice scoping and management around std::mutex, and using them is consistent with other TF code. 2) Remove thread sleep at MPI initialization time. Thread sleep should not be used for polling activity. Instead, this commit replaces sleep-polling with a condition variable: The compute graph traversal thread waits on the condition variable until the background thread has completed initialization and signals the graph traversal thread that initialization is complete. 3) Slim MPI initialization check: Since TF permits many threads to be traversing the compute graph concurrently (e.g. with inter_op_parallelism_threads > 1), some graph traversal threads may not have set their GPU device ID. If such a thread executes an MPI op, it would fail the check in InitializedMPIOnSameDevice, because the background thread would be controlling a GPU with ID other than the default (0). Since graph traversal threads do not perform GPU activity, this GPU ID check was unnecessary. Remove it and refactor to just check whether MPI is initialized (IsMPIInitialized). * Rebase to TF 1.3.0-rc1 complete and tested
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Would it be possible to expose an option to avoid the double initialization issue or to initialize MPI explicitly?
The text was updated successfully, but these errors were encountered: