Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support bundle checkpoint / preemptible workers #3882

Merged
merged 78 commits into from
Apr 14, 2022
Merged
Show file tree
Hide file tree
Changes from 15 commits
Commits
Show all changes
78 commits
Select commit Hold shift + click to select a range
29f0a05
Bundle checkpoint / preemptible changes
epicfaace Nov 10, 2021
f784441
worker preemptible checkin
epicfaace Nov 10, 2021
be45063
update
epicfaace Nov 10, 2021
6425c3c
add migration
epicfaace Nov 11, 2021
17615b7
fix
epicfaace Nov 11, 2021
d4c787c
fix
epicfaace Nov 11, 2021
d1454e9
fix
epicfaace Nov 11, 2021
48f4ec5
update
epicfaace Nov 17, 2021
d28d669
Merge branch 'master' of github.com:codalab/codalab-worksheets into c…
epicfaace Dec 2, 2021
a40a19a
Update codalab/model/bundle_model.py
epicfaace Dec 2, 2021
d346f3e
Merge branch 'checkpoint' of github.com:codalab/codalab-worksheets in…
epicfaace Dec 2, 2021
9a7a16b
updates
epicfaace Dec 2, 2021
f674228
add documentation
epicfaace Dec 2, 2021
4ea0f52
add test
epicfaace Dec 2, 2021
ae1004c
fix flake8
epicfaace Dec 2, 2021
c55877f
Update Checkpoints.md
epicfaace Feb 8, 2022
dbbb756
Update codalab/model/bundle_model.py
epicfaace Feb 8, 2022
1180e6f
Update docs/Checkpoints.md
epicfaace Feb 8, 2022
3c2c3ec
Merge branch 'master' into checkpoint
epicfaace Feb 15, 2022
47eef0a
Fixes
epicfaace Feb 15, 2022
f226631
fix
epicfaace Feb 15, 2022
20db8a3
Merge branch 'worker2' into checkpoint
epicfaace Feb 15, 2022
79aab01
rename again
epicfaace Feb 15, 2022
4124e34
Revert "rename again"
epicfaace Feb 15, 2022
877269f
fixes, tests
epicfaace Feb 15, 2022
1ca7900
.dev.yml
epicfaace Feb 16, 2022
47f253c
fixes
epicfaace Feb 16, 2022
59828ea
fix test
epicfaace Feb 16, 2022
c3d1a21
fix doc
epicfaace Feb 16, 2022
030256f
fix bug
epicfaace Feb 16, 2022
311e34d
update doc
epicfaace Feb 16, 2022
5a87afc
fix
epicfaace Feb 21, 2022
f5223a0
fix
epicfaace Feb 21, 2022
7abe85c
fix
epicfaace Feb 21, 2022
58932d1
fix
epicfaace Feb 21, 2022
a39bb2c
fix
epicfaace Feb 21, 2022
d6a1398
fix
epicfaace Feb 21, 2022
4c184d3
comment for now
epicfaace Feb 21, 2022
8596d05
Merge branch 'master' of github.com:codalab/codalab-worksheets into c…
epicfaace Mar 16, 2022
a0dadc3
update test
epicfaace Mar 16, 2022
f0d1530
fix
epicfaace Mar 16, 2022
2dda725
ci tmp
epicfaace Mar 16, 2022
0753818
Update test-setup-preemptible.sh
epicfaace Mar 16, 2022
314638e
fix
epicfaace Mar 16, 2022
dcf960b
Merge branch 'checkpoint' of github.com:codalab/codalab-worksheets in…
epicfaace Mar 16, 2022
743f507
sleep
epicfaace Mar 16, 2022
0c6008c
update test
epicfaace Mar 17, 2022
8f9fd41
format / comments
epicfaace Mar 17, 2022
dea755e
fix command
epicfaace Mar 17, 2022
184e405
fix
epicfaace Mar 17, 2022
289b1a2
Update test.yml
epicfaace Mar 21, 2022
78acc06
Update test_cli.py
epicfaace Mar 21, 2022
47aa494
Update test.yml
epicfaace Mar 21, 2022
5eac2d3
Merge branch 'checkpoint' of github.com:codalab/codalab-worksheets in…
epicfaace Mar 21, 2022
115b8b5
fix
epicfaace Mar 21, 2022
ca03ad1
fix
epicfaace Mar 21, 2022
03da441
update
epicfaace Mar 21, 2022
4ad23e6
fix
epicfaace Mar 21, 2022
4061733
fixes
epicfaace Mar 21, 2022
853d4bc
comment
epicfaace Mar 21, 2022
f978cb8
Update test-setup-preemptible.sh
epicfaace Mar 21, 2022
3e9c0c8
Update test_cli.py
epicfaace Mar 21, 2022
663d35b
Update test.yml
epicfaace Mar 21, 2022
0778d58
Update test.yml
epicfaace Mar 21, 2022
6a0a99d
Update test_cli.py
epicfaace Mar 21, 2022
962017f
Update docker-compose.dev.yml
epicfaace Mar 21, 2022
71fb046
update
epicfaace Mar 28, 2022
9dde912
Merge branch 'master' of github.com:codalab/codalab-worksheets into c…
epicfaace Apr 12, 2022
6d45d83
update
epicfaace Apr 12, 2022
6f6c44a
fix
epicfaace Apr 12, 2022
f397936
revert
epicfaace Apr 12, 2022
b77ec32
fix test
epicfaace Apr 12, 2022
6423cc3
update doc
epicfaace Apr 12, 2022
8ff8a4f
tests
epicfaace Apr 12, 2022
5c315a4
update
epicfaace Apr 12, 2022
398bc81
Update codalab/worker/worker_run_state.py
epicfaace Apr 14, 2022
3f5def8
Update test_cli.py
epicfaace Apr 14, 2022
6b03702
Update worker_run_state.py
epicfaace Apr 14, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
4 changes: 4 additions & 0 deletions codalab/bundles/run_bundle.py
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,10 @@ class RunBundle(DerivedBundle):
METADATA_SPECS.append(MetadataSpec('exitcode', int, 'Exitcode of the process.', generated=True))
METADATA_SPECS.append(MetadataSpec('job_handle', str, 'Identifies the job handle (internal).', generated=True, hide_when_anonymous=True))
METADATA_SPECS.append(MetadataSpec('remote', str, 'Where this job is/was run (internal).', generated=True, hide_when_anonymous=True))
METADATA_SPECS.append(MetadataSpec('remote_history', list, 'All workers where this job has been run (internal); multiple values indicate'
'that the bundle was preempted and moved to a different worker.',
generated=True, hide_when_anonymous=True))
METADATA_SPECS.append(MetadataSpec('preemptible', bool, 'Whether the bundle is currently running / finished on a preemptible worker.', generated=True, hide_when_anonymous=True, default=False))
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
# fmt: on

@classmethod
Expand Down
27 changes: 21 additions & 6 deletions codalab/model/bundle_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -915,7 +915,15 @@ def transition_bundle_preparing(self, bundle, user_id, worker_id, start_time, re

bundle_update = {
'state': State.PREPARING,
'metadata': {'started': start_time, 'last_updated': start_time, 'remote': remote},
'metadata': {
'started': start_time,
'last_updated': start_time,
'remote': remote,
'remote_history': getattr(bundle.metadata, "remote_history", [])
+ [
remote
], # Store the history of which workers ran this bundle before in the bundle metadata.
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
},
}
self.update_bundle(bundle, bundle_update, connection)

Expand Down Expand Up @@ -987,6 +995,8 @@ def transition_bundle_worker_offline(self, bundle):
Transitions bundle to WORKER_OFFLINE state:
Updates the last_updated metadata.
Removes the corresponding row from worker_run if it exists.

If the bundle is preemptible, move the bundle to the STAGED state instead.
"""
with self.engine.begin() as connection:
# Check that it still exists and is running
Expand All @@ -1000,15 +1010,20 @@ def transition_bundle_worker_offline(self, bundle):
# The user deleted the bundle or the bundle finished
return False

if getattr(bundle.metadata, "preemptible", False):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we access the preemptible field directly and have to use getattr?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm worried that if we do bundle.metadata["preemptible"], this will break bundles that started running in the previous deploy (but continue running during the deploy) that don't yet have this metadata key yet.

bundle_update = {
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
'state': State.STAGED,
'metadata': {'last_updated': int(time.time())},
}
else:
bundle_update = {
'state': State.WORKER_OFFLINE,
'metadata': {'last_updated': int(time.time())},
}
# Delete row in worker_run
connection.execute(
cl_worker_run.delete().where(cl_worker_run.c.run_uuid == bundle.uuid)
)

bundle_update = {
'state': State.WORKER_OFFLINE,
'metadata': {'last_updated': int(time.time())},
}
self.update_bundle(bundle, bundle_update, connection)
return True

Expand Down
1 change: 1 addition & 0 deletions codalab/model/tables.py
Original file line number Diff line number Diff line change
Expand Up @@ -526,6 +526,7 @@
'exit_after_num_runs', Integer, nullable=False
), # Number of jobs allowed to run on worker.
Column('is_terminating', Boolean, nullable=False),
Column('preemptible', Boolean, nullable=False),
mysql_charset=TABLE_DEFAULT_CHARSET,
)

Expand Down
3 changes: 3 additions & 0 deletions codalab/model/worker_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ def worker_checkin(
tag_exclusive,
exit_after_num_runs,
is_terminating,
preemptible,
):
"""
Adds the worker to the database, if not yet there. Returns the socket ID
Expand All @@ -69,6 +70,7 @@ def worker_checkin(
'tag_exclusive': tag_exclusive,
'exit_after_num_runs': exit_after_num_runs,
'is_terminating': is_terminating,
'preemptible': preemptible,
}

# Populate the group for this worker, if group_name is valid
Expand Down Expand Up @@ -206,6 +208,7 @@ def get_workers(self):
'tag_exclusive': row.tag_exclusive,
'exit_after_num_runs': row.exit_after_num_runs,
'is_terminating': row.is_terminating,
'preemptible': row.preemptible,
}
for row in worker_rows
}
Expand Down
4 changes: 4 additions & 0 deletions codalab/worker/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -190,6 +190,9 @@ def parse_args():
default=1,
help='The shared memory size of the run container in GB (defaults to 1).',
)
parser.add_argument(
'--preemptible', action='store_true', help='Whether the worker is preemptible.',
)
return parser.parse_args()


Expand Down Expand Up @@ -325,6 +328,7 @@ def main():
delete_work_dir_on_exit=args.delete_work_dir_on_exit,
exit_on_exception=args.exit_on_exception,
shared_memory_size_gb=args.shared_memory_size_gb,
preemptible=args.preemptible,
)

# Register a signal handler to ensure safe shutdown.
Expand Down
3 changes: 3 additions & 0 deletions codalab/worker/worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@ def __init__(
# A flag indicating if the worker will exit if it encounters an exception
exit_on_exception=False, # type: bool
shared_memory_size_gb=1, # type: int
preemptible=False, # type: bool
):
self.image_manager = image_manager
self.dependency_manager = dependency_manager
Expand Down Expand Up @@ -114,6 +115,7 @@ def __init__(
self.terminate_and_restage = False
self.pass_down_termination = pass_down_termination
self.exit_on_exception = exit_on_exception
self.preemptible = preemptible

self.checkin_frequency_seconds = checkin_frequency_seconds
self.last_checkin_successful = False
Expand Down Expand Up @@ -422,6 +424,7 @@ def checkin(self):
'tag_exclusive': self.tag_exclusive,
'exit_after_num_runs': self.exit_after_num_runs - self.num_runs,
'is_terminating': self.terminate or self.terminate_and_restage,
'preemptible': self.preemptible,
}
try:
response = self.bundle_service.checkin(self.id, request)
Expand Down
5 changes: 5 additions & 0 deletions codalab/worker_manager/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,11 @@ def main():
type=int,
help="The shared memory size in GB of the run container started by the CodaLab Workers.",
)
parser.add_argument(
'--worker-preemptible',
action='store_true',
help='Whether the CodaLab workers are preemptible.',
)
subparsers = parser.add_subparsers(
title='Worker Manager to run',
description='Which worker manager to run (AWS Batch etc.)',
Expand Down
2 changes: 2 additions & 0 deletions codalab/worker_manager/worker_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,8 @@ def build_command(self, worker_id: str, work_dir: str) -> List[str]:
)
if self.args.worker_shared_memory_size_gb:
command.extend(['--shared-memory-size-gb', str(self.args.worker_shared_memory_size_gb)])
if self.args.worker_preemptible:
command.extend(['--worker-preemptible'])

return command

Expand Down
14 changes: 14 additions & 0 deletions docs/Checkpoints.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Checkpoints

!!! warning "Unreleased feature"
This feature is still in development. It does not fully work yet and will only work once [#3710](https://github.com/codalab/codalab-worksheets/issues/3710) is resolved.

CodaLab supports the concept of bundle checkpoints, which allow bundles to be stopped and then resumed on different workers. This is especially useful if you want to use preemptible compute nodes, such as [EC2 Spot](https://aws.amazon.com/ec2/spot/) instances. When a worker is preempted, the bundle should be resumed on another worker.

## Workflow steps

This section describes a sample workflow for running bundles on a pool of preemptible workers. The workflow uses the tag `preemptiblepool1` to organize compute, but you can change this tag name as needed.

1. Start a worker manager that runs workers on preemptible compute notes, using the following flags: `--preemptible --worker-tag-exclusive --worker-tag="preemptiblepool1"`. Ensure that all the workers started by the worker manager are configured to share the same network disk for their bundle run working directory / dependency cache.
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
1. User can run a bundle by running `cl run --tag="preemptiblepool1"`.
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
1. The bundle manager will assign the bundle to one of the preemptible workers with tag `preemptiblepool1`. If the worker gets preempted, the bundle transitions back to the `STAGED` state and will then be assigned to another preemptible worker with tag `preemptiblepool1`. The history of workers where a bundle has run is stored in the `remote_history` bundle metadata field.
1 change: 1 addition & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -37,6 +37,7 @@ nav:
- Execution: Execution.md
- Competitions: Competitions.md
- Storage Externalization: Externalization.md
- Checkpoints: Checkpoints.md
- Reference:
- CLI Reference: CLI-Reference.md
- REST API Reference: REST-API-Reference.md
Expand Down
1 change: 1 addition & 0 deletions tests/unit/rest/workers_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ def test_checkin(self):
'tag_exclusive': False,
'exit_after_num_runs': 999999999,
'is_terminating': False,
'preemptible': True,
}
response = self.app.post_json('/rest/workers/test_worker/checkin', body)
self.assertEqual(response.status_int, 200)
Expand Down
1 change: 1 addition & 0 deletions tests/unit/server/bundle_manager/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -244,6 +244,7 @@ def mock_worker_checkin(
tag_exclusive=False,
exit_after_num_runs=999999999,
is_terminating=False,
preemptible=False,
)
# Mock a reply from the worker
self.bundle_manager._worker_model.send_json_message = Mock(return_value=True)
Expand Down
12 changes: 12 additions & 0 deletions tests/unit/server/bundle_manager/schedule_run_bundles_test.py
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,18 @@ def test_finalizing_bundle_goes_offline_if_no_worker_claims(self):
bundle = self.bundle_manager._model.get_bundle(bundle.uuid)
self.assertEqual(bundle.state, State.WORKER_OFFLINE)

def test_reassign_stuck_running_preemptible_bundles(self):
"""If no workers exist to claim a bundle, and the bundle is running on a preemptible worker, it should go to the STAGED state in preparation for being reassigned to another worker."""
bundle = self.create_run_bundle(
State.RUNNING, {"preemptible": True, "remote_history": ["remote1"], "remote": "remote1"}
)
self.save_bundle(bundle)
self.bundle_manager._schedule_run_bundles()

bundle = self.bundle_manager._model.get_bundle(bundle.uuid)
self.assertEqual(bundle.state, State.STAGED)
self.assertEqual(bundle.metadata.remote_history, ["remote1"])

def test_finalizing_bundle_gets_finished(self):
"""If a worker checks in with a "finalizing" message, the bundle should transition
to the FINALIZING and then FINISHED state."""
Expand Down