fix: checkpoint workload fails if upload fails #752
Solves the issue just fine. It was a little confusing on the first read (following the execution depends on remembering exactly how generators work in Python), but that really is a direct side effect of fixing the bug.
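For readers who want the generator detail spelled out, here is a minimal sketch of why a `yield` inside a `with` block delays the context manager's exit. The names `save_path`, `workload_buggy`, `workload_fixed`, and `run` are hypothetical stand-ins, not the code in this PR; the "upload" step is assumed to happen when the context exits.

```python
import contextlib

events = []


@contextlib.contextmanager
def save_path():
    # Hypothetical stand-in for storage_manager.save_path():
    # the "upload" is assumed to happen when the context exits.
    events.append("enter")
    yield "/tmp/ckpt"
    events.append("upload")


def workload_buggy():
    with save_path():
        events.append("write")
        # The generator is suspended here with the context still open,
        # so the consumer sees the message before the upload runs.
        yield "completed"


def workload_fixed():
    with save_path():
        events.append("write")
    # The context has already exited (upload done) before we yield.
    yield "completed"


def run(workload):
    # Drive the generator the way a consumer of workload messages might.
    events.clear()
    for message in workload():
        events.append(f"sent {message}")
    return list(events)


print(run(workload_buggy))
print(run(workload_fixed))
```

In the first variant the exit code only runs when the consumer asks the generator for its next value, which is after the message has already been handled.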
@@ -0,0 +1,107 @@
import contextlib
praise: tests!
We talked offline a bit about moving this into a test utility file, but decided this stuff is mostly workload-manager specific.
checkpoint_info.get("framework", ""),
checkpoint_info.get("format", ""),
)
if self.rendezvous_info.get_rank() != 0:
GitHub diffs suck 🙃, but I like losing the indentation.
7c91cdb to 4ae3526
…etermined-ai#752) Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler. Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization. Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async. Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking. Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
Description
Fix a bug that @liamcli found. Asking for a review from @stoksc since it conflicts with a change he has in flight.
Previously, the context manager for storage_manager.save_path() was ordered such that workload-completed messages were sent before the upload finished. As a result, the database would sometimes contain checkpoint UUIDs that were never uploaded to cloud storage.
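A rough sketch of the failure mode, under stated assumptions: `save_path` here is a hypothetical stand-in whose upload runs on context exit, `db` is a plain list standing in for the database, and `UploadError`, `checkpoint_buggy`, `checkpoint_fixed`, and `try_checkpoint` are illustrative names rather than anything from the codebase.

```python
import contextlib


class UploadError(Exception):
    """Hypothetical error raised when the upload step fails."""


@contextlib.contextmanager
def save_path(uploaded, fail_upload):
    # Assumption: the upload happens when the context exits.
    yield "ckpt-uuid"
    if fail_upload:
        raise UploadError("upload failed")
    uploaded.append("ckpt-uuid")


def checkpoint_buggy(db, uploaded, fail_upload):
    with save_path(uploaded, fail_upload) as uuid:
        # Completion is recorded while the context is still open,
        # i.e. before the upload has been attempted.
        db.append(uuid)


def checkpoint_fixed(db, uploaded, fail_upload):
    with save_path(uploaded, fail_upload) as uuid:
        pass  # checkpoint files would be written here
    # Only reached if the upload on context exit succeeded.
    db.append(uuid)


def try_checkpoint(fn, fail_upload):
    # Run one checkpoint attempt and return (db, uploaded) afterwards.
    db, uploaded = [], []
    try:
        fn(db, uploaded, fail_upload)
    except UploadError:
        pass
    return db, uploaded


print(try_checkpoint(checkpoint_buggy, fail_upload=True))
print(try_checkpoint(checkpoint_fixed, fail_upload=True))
```

With the buggy ordering, a failed upload leaves a UUID in `db` that has no matching entry in `uploaded`; with the fixed ordering, the failure propagates before anything is recorded.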
Test Plan
I added a unit test for the workload manager, which fails without the fix.