fix: handle failed checkpoints correctly [DET-3853] #1083

stoksc · 2020-08-13T22:14:57Z

Description

This change fixes a bug in the trial workload sequencer, which assumed that checkpoints never fail. When a checkpoint failed, the sequencer dereferenced a nil pointer while trying to read checkpoint metrics, which do not exist for a failed checkpoint.
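As a rough illustration of this class of bug, here is a minimal Python sketch in which `None` stands in for the nil metrics pointer. The sequencer itself is written in Go, and every name below (`CheckpointMetrics`, `CompletedMessage`, `record_checkpoint`) is hypothetical and does not come from this change. The point is only that the metrics must be checked before they are read.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CheckpointMetrics:
    # Hypothetical stand-in for the metrics reported with a successful checkpoint.
    uuid: str


@dataclass
class CompletedMessage:
    # Hypothetical completion message; checkpoint_metrics stays None
    # when the checkpoint workload failed.
    step_id: int
    checkpoint_metrics: Optional[CheckpointMetrics] = None


def record_checkpoint(msg: CompletedMessage, saved: Dict[int, str]) -> bool:
    """Record a completed checkpoint; return False if the checkpoint failed."""
    if msg.checkpoint_metrics is None:
        # Failed checkpoint: there are no metrics to read, so skip the
        # bookkeeping instead of dereferencing a missing value.
        return False
    saved[msg.step_id] = msg.checkpoint_metrics.uuid
    return True
```

Without the `None` check, the attribute access on the last lines would raise for a failed checkpoint, which is the Python analogue of the nil dereference described above.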

Test Plan

Add a unit test that covers this case.
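A test for this case might look roughly like the sketch below. It is a self-contained Python toy: `ToySequencer` and its methods are assumptions made for illustration and are not the sequencer's real interface, nor the test added in this PR.

```python
from typing import Optional


class ToySequencer:
    """Hypothetical stand-in for a workload sequencer; not Determined's API."""

    def __init__(self) -> None:
        self.latest_checkpoint: Optional[str] = None
        self.failed_checkpoints = 0

    def checkpoint_completed(self, metrics: Optional[dict]) -> None:
        # A failed checkpoint arrives without metrics; count it and move on.
        if metrics is None:
            self.failed_checkpoints += 1
            return
        self.latest_checkpoint = metrics["uuid"]


def test_failed_checkpoint_is_handled() -> None:
    seq = ToySequencer()
    seq.checkpoint_completed({"uuid": "ckpt-1"})
    seq.checkpoint_completed(None)  # must not raise
    # The earlier successful checkpoint is still the latest one.
    assert seq.latest_checkpoint == "ckpt-1"
    assert seq.failed_checkpoints == 1
```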

rb-determined-ai · 2020-08-13T22:18:58Z

e2e_tests/tests/test_system.py

@@ -485,6 +485,16 @@ def test_fail_on_first_validation() -> None:
    )


+@pytest.mark.e2e_cpu  # type: ignore


Can we make this into a unit test rather than an e2e test? We already have some infrastructure in trial_workload_sequencer_test.go.

What about both? In theory, this covers more than just the trial workload sequencer. This covers making sure this class of failures is handled by the entire system.

Works for me.

stoksc requested a review from rb-determined-ai August 13, 2020 22:14

stoksc assigned rb-determined-ai Aug 13, 2020

cla-bot bot added the cla-signed label Aug 13, 2020

rb-determined-ai approved these changes Aug 13, 2020

stoksc changed the title ~~fix: handle failed checkpoints correctly~~ fix: handle failed checkpoints correctly [DET-3853] Aug 13, 2020

rb-determined-ai approved these changes Aug 13, 2020

stoksc added 3 commits August 13, 2020 20:10

fix: handle failed checkpoints correctly

f779678

add unit test

7279eb2

assume user cancelled exits finished the current step

d326562

stoksc merged commit d0b384a into determined-ai:master Aug 14, 2020

stoksc deleted the failure-checkpoint branch August 14, 2020 13:51

azhou-determined pushed a commit that referenced this pull request Dec 7, 2023

chore: skip copy .launcher.token for agent slurmcluster (#1083)

4738abe

wes-turner pushed a commit that referenced this pull request Feb 2, 2024

chore: skip copy .launcher.token for agent slurmcluster (#1083)

ccf6673

dannysauer added this to the 0.13.0 milestone Feb 6, 2024

rb-determined-ai pushed a commit that referenced this pull request Feb 29, 2024

chore: skip copy .launcher.token for agent slurmcluster (#1083)

a0b44f3

amandavialva01 pushed a commit that referenced this pull request Mar 18, 2024

chore: skip copy .launcher.token for agent slurmcluster (#1083)

ccb0636

eecsliu pushed a commit that referenced this pull request Apr 18, 2024

chore: skip copy .launcher.token for agent slurmcluster (#1083)

4927cc1

eecsliu pushed a commit to determined-ai/determined-release-testing that referenced this pull request Apr 22, 2024

chore: skip copy .launcher.token for agent slurmcluster (determined-a…

9517412

…i#1083)
