fix: handle failed checkpoints correctly [DET-3853] #1083

Merged
stoksc merged 3 commits into determined-ai:master from failure-checkpoint on Aug 14, 2020

@stoksc stoksc commented Aug 13, 2020

Description

This change fixes a bug in the trial workload sequencer, where it assumes that checkpoints won't fail. When a checkpoint fails, a nil pointer is dereferenced while trying to access checkpoint metrics, which don't exist for a failed checkpoint.

Test Plan

  • add a unit test that covers this case

Commentary (optional)

@@ -485,6 +485,16 @@ def test_fail_on_first_validation() -> None:
)


@pytest.mark.e2e_cpu # type: ignore
Member

Can we make this into a unit test rather than an e2e test? We already have some infrastructure in trial_workload_sequencer_test.go.

Contributor Author

What about both? In theory, this covers more than just the trial workload sequencer. This covers making sure this class of failures is handled by the entire system.

Member

Works for me.

Contributor Author

cool, done

@stoksc stoksc changed the title fix: handle failed checkpoints correctly fix: handle failed checkpoints correctly [DET-3853] Aug 13, 2020
@stoksc stoksc merged commit d0b384a into determined-ai:master Aug 14, 2020
@stoksc stoksc deleted the failure-checkpoint branch August 14, 2020 13:51
@dannysauer dannysauer added this to the 0.13.0 milestone Feb 6, 2024
eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024