Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: checkpoint workload fails if upload fails #752

Merged

Conversation

rb-determined-ai
Copy link
Member

Description

Fix a bug that @liamcli found. Asking for a review from @stoksc since it conflicts with a change he has in flight.

Previously, the ordering of the context manager for storage_manager.save_path() was such that workload completed messages were sent before the upload finished. As a result, the database would sometimes contain checkpoint uuid's that never got uploaded to cloud storage.

Test Plan

I added a unit test for the workload manager, which fails without the fix.

Copy link
Contributor

@stoksc stoksc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solves the issue just fine, was a little confusing on the first read (execution depends on remembering exactly how generators work in python) but it really is a direct side-effect of fixing the bug.

@@ -0,0 +1,107 @@
import contextlib
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise: tests!

We talked a bit about moving this into a test utility file offline, decided this stuff is all workload manager specific for the most part.

checkpoint_info.get("framework", ""),
checkpoint_info.get("format", ""),
)
if self.rendezvous_info.get_rank() != 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub diffs suck 🙃, but I like losing the indentation.

@stoksc stoksc assigned rb-determined-ai and unassigned stoksc Jun 22, 2020
Previously, the ordering of the context manager for
storage_manager.save_path() was such that workload completed messages
were sent before the upload finished.  As a result, the database would
sometimes contain checkpoint uuid's that never got uploaded to cloud
storage.
@rb-determined-ai rb-determined-ai merged commit 8e5f03a into determined-ai:master Jun 23, 2020
@rb-determined-ai rb-determined-ai deleted the upload-failure branch June 23, 2020 16:49
tayritenour pushed a commit to tayritenour/determined that referenced this pull request Apr 25, 2023
…etermined-ai#752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
eecsliu pushed a commit to eecsliu/determined that referenced this pull request Jun 23, 2023
…etermined-ai#752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
stoksc pushed a commit that referenced this pull request Jun 26, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
eecsliu pushed a commit that referenced this pull request Jun 28, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
eecsliu pushed a commit that referenced this pull request Jun 28, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
stoksc pushed a commit that referenced this pull request Jul 20, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
eecsliu pushed a commit that referenced this pull request Jul 24, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
stoksc pushed a commit that referenced this pull request Oct 17, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
azhou-determined pushed a commit that referenced this pull request Dec 7, 2023
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
wes-turner pushed a commit that referenced this pull request Feb 2, 2024
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
@dannysauer dannysauer added this to the 0.12.10 milestone Feb 6, 2024
rb-determined-ai pushed a commit that referenced this pull request Feb 29, 2024
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
amandavialva01 pushed a commit that referenced this pull request Mar 18, 2024
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
eecsliu pushed a commit that referenced this pull request Apr 18, 2024
…752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024
…etermined-ai#752)

Deletion of the dispatch is done synchronously in the DispatchExited event handler, need to make it async to avoid blocking the event handler.

Moved content of DispatchExited to a go routine (dispatchExited) except for accesses to m.reqList which should remain to avoid the need for additional synchronization.
Identified one other synchronous call to m.removeDispatchEnvironment that needed to be made async.
Added comments to other call sites indicating they are already invoked from an existing go routine so are non-blocking.

Extracted the m.reqList use from startLauncherJob go routine back into the event handler to avoid the need for additional synchronization.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants