Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support bundle checkpoint / preemptible workers #3882

Merged
merged 78 commits into from
Apr 14, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
78 commits
Select commit Hold shift + click to select a range
29f0a05
Bundle checkpoint / preemptible changes
epicfaace Nov 10, 2021
f784441
worker preemptible checkin
epicfaace Nov 10, 2021
be45063
update
epicfaace Nov 10, 2021
6425c3c
add migration
epicfaace Nov 11, 2021
17615b7
fix
epicfaace Nov 11, 2021
d4c787c
fix
epicfaace Nov 11, 2021
d1454e9
fix
epicfaace Nov 11, 2021
48f4ec5
update
epicfaace Nov 17, 2021
d28d669
Merge branch 'master' of github.com:codalab/codalab-worksheets into c…
epicfaace Dec 2, 2021
a40a19a
Update codalab/model/bundle_model.py
epicfaace Dec 2, 2021
d346f3e
Merge branch 'checkpoint' of github.com:codalab/codalab-worksheets in…
epicfaace Dec 2, 2021
9a7a16b
updates
epicfaace Dec 2, 2021
f674228
add documentation
epicfaace Dec 2, 2021
4ea0f52
add test
epicfaace Dec 2, 2021
ae1004c
fix flake8
epicfaace Dec 2, 2021
c55877f
Update Checkpoints.md
epicfaace Feb 8, 2022
dbbb756
Update codalab/model/bundle_model.py
epicfaace Feb 8, 2022
1180e6f
Update docs/Checkpoints.md
epicfaace Feb 8, 2022
3c2c3ec
Merge branch 'master' into checkpoint
epicfaace Feb 15, 2022
47eef0a
Fixes
epicfaace Feb 15, 2022
f226631
fix
epicfaace Feb 15, 2022
20db8a3
Merge branch 'worker2' into checkpoint
epicfaace Feb 15, 2022
79aab01
rename again
epicfaace Feb 15, 2022
4124e34
Revert "rename again"
epicfaace Feb 15, 2022
877269f
fixes, tests
epicfaace Feb 15, 2022
1ca7900
.dev.yml
epicfaace Feb 16, 2022
47f253c
fixes
epicfaace Feb 16, 2022
59828ea
fix test
epicfaace Feb 16, 2022
c3d1a21
fix doc
epicfaace Feb 16, 2022
030256f
fix bug
epicfaace Feb 16, 2022
311e34d
update doc
epicfaace Feb 16, 2022
5a87afc
fix
epicfaace Feb 21, 2022
f5223a0
fix
epicfaace Feb 21, 2022
7abe85c
fix
epicfaace Feb 21, 2022
58932d1
fix
epicfaace Feb 21, 2022
a39bb2c
fix
epicfaace Feb 21, 2022
d6a1398
fix
epicfaace Feb 21, 2022
4c184d3
comment for now
epicfaace Feb 21, 2022
8596d05
Merge branch 'master' of github.com:codalab/codalab-worksheets into c…
epicfaace Mar 16, 2022
a0dadc3
update test
epicfaace Mar 16, 2022
f0d1530
fix
epicfaace Mar 16, 2022
2dda725
ci tmp
epicfaace Mar 16, 2022
0753818
Update test-setup-preemptible.sh
epicfaace Mar 16, 2022
314638e
fix
epicfaace Mar 16, 2022
dcf960b
Merge branch 'checkpoint' of github.com:codalab/codalab-worksheets in…
epicfaace Mar 16, 2022
743f507
sleep
epicfaace Mar 16, 2022
0c6008c
update test
epicfaace Mar 17, 2022
8f9fd41
format / comments
epicfaace Mar 17, 2022
dea755e
fix command
epicfaace Mar 17, 2022
184e405
fix
epicfaace Mar 17, 2022
289b1a2
Update test.yml
epicfaace Mar 21, 2022
78acc06
Update test_cli.py
epicfaace Mar 21, 2022
47aa494
Update test.yml
epicfaace Mar 21, 2022
5eac2d3
Merge branch 'checkpoint' of github.com:codalab/codalab-worksheets in…
epicfaace Mar 21, 2022
115b8b5
fix
epicfaace Mar 21, 2022
ca03ad1
fix
epicfaace Mar 21, 2022
03da441
update
epicfaace Mar 21, 2022
4ad23e6
fix
epicfaace Mar 21, 2022
4061733
fixes
epicfaace Mar 21, 2022
853d4bc
comment
epicfaace Mar 21, 2022
f978cb8
Update test-setup-preemptible.sh
epicfaace Mar 21, 2022
3e9c0c8
Update test_cli.py
epicfaace Mar 21, 2022
663d35b
Update test.yml
epicfaace Mar 21, 2022
0778d58
Update test.yml
epicfaace Mar 21, 2022
6a0a99d
Update test_cli.py
epicfaace Mar 21, 2022
962017f
Update docker-compose.dev.yml
epicfaace Mar 21, 2022
71fb046
update
epicfaace Mar 28, 2022
9dde912
Merge branch 'master' of github.com:codalab/codalab-worksheets into c…
epicfaace Apr 12, 2022
6d45d83
update
epicfaace Apr 12, 2022
6f6c44a
fix
epicfaace Apr 12, 2022
f397936
revert
epicfaace Apr 12, 2022
b77ec32
fix test
epicfaace Apr 12, 2022
6423cc3
update doc
epicfaace Apr 12, 2022
8ff8a4f
tests
epicfaace Apr 12, 2022
5c315a4
update
epicfaace Apr 12, 2022
398bc81
Update codalab/worker/worker_run_state.py
epicfaace Apr 14, 2022
3f5def8
Update test_cli.py
epicfaace Apr 14, 2022
6b03702
Update worker_run_state.py
epicfaace Apr 14, 2022
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
59 changes: 57 additions & 2 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -355,7 +355,62 @@ jobs:
if: always()
uses: actions/upload-artifact@v1
with:
name: logs-test-protectedmode-${{ matrix.test }}
name: logs-test-${{ matrix.test }}
path: /tmp/logs

test_backend_preemptible_worker:
name: Test backend - preemptible workers
runs-on: ubuntu-latest
needs: [build]
strategy:
matrix:
test:
- preemptible
steps:
- name: Clear free space
run: |
sudo rm -rf /opt/ghc
df -h
- uses: actions/checkout@v2
- uses: actions/setup-python@v1
with:
python-version: 3.6
- uses: actions/cache@v2
with:
path: ~/.cache/pip
key: pip-${{ hashFiles('requirements.txt') }}
restore-keys: |
pip-
- run: pip install -r requirements.txt
- run: pip install -e .
- name: Setup tests
run: |
sudo service mysql stop
python3 codalab_service.py build services --version ${VERSION} --pull
env:
VERSION: ${{ github.head_ref || 'master' }}
- name: Run tests
run: |
python3 codalab_service.py start --services default no-worker worker-preemptible --version ${VERSION}
sleep 20
python3 codalab_service.py start --services worker-preemptible2 --version ${VERSION}
./tests/test-setup-preemptible.sh
python3 test_runner.py --version ${VERSION} ${TEST}
env:
TEST: ${{ matrix.test }}
VERSION: ${{ github.head_ref || 'master' }}
CODALAB_USERNAME: codalab
CODALAB_PASSWORD: codalab
- name: Save logs
if: always()
run: |
mkdir /tmp/logs
for c in $(docker ps -a --format="{{.Names}}"); do docker logs $c > /tmp/logs/$c.log 2> /tmp/logs/$c.err.log; done
- name: Upload logs
if: always()
uses: actions/upload-artifact@v1
with:
name: logs-test-${{ matrix.test }}
path: /tmp/logs

test_backend_azure_blob:
Expand All @@ -371,7 +426,7 @@ jobs:
- run time
- run2
- search read kill write mimic workers edit_user sharing_workers
# - search link read kill write mimic workers edit_user sharing_workers
# - search link read kill write mimic workers edit_user sharing_workers
- resources
- memoize
- copy netcat netcurl
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
"""add preemptible column to worker

Revision ID: f720aaefd0b2
Revises: a325a0756797
Create Date: 2022-02-15 22:26:06.466720

"""

# revision identifiers, used by Alembic.
revision = 'f720aaefd0b2'
down_revision = 'a325a0756797'

from alembic import op
import sqlalchemy as sa


def upgrade():
# ### commands auto generated by Alembic - please adjust! ###
op.add_column('worker', sa.Column('preemptible', sa.Boolean(), nullable=False))
# ### end Alembic commands ###


def downgrade():
# ### commands auto generated by Alembic - please adjust! ###
op.drop_column('worker', 'preemptible')
# ### end Alembic commands ###
5 changes: 5 additions & 0 deletions codalab/bundles/run_bundle.py
Original file line number Diff line number Diff line change
Expand Up @@ -72,6 +72,11 @@ class RunBundle(DerivedBundle):
METADATA_SPECS.append(MetadataSpec('exitcode', int, 'Exitcode of the process.', generated=True))
METADATA_SPECS.append(MetadataSpec('job_handle', str, 'Identifies the job handle (internal).', generated=True, hide_when_anonymous=True))
METADATA_SPECS.append(MetadataSpec('remote', str, 'Where this job is/was run (internal).', generated=True, hide_when_anonymous=True))
METADATA_SPECS.append(MetadataSpec('remote_history', list, 'All workers where this job has been run (internal); multiple values indicate'
'that the bundle was preempted and moved to a different worker.',
generated=True, hide_when_anonymous=True))
METADATA_SPECS.append(MetadataSpec('on_preemptible_worker', bool, 'Whether the bundle is currently running / finished on a preemptible worker.', generated=True, hide_when_anonymous=True,
default=False))
# fmt: on

@classmethod
Expand Down
32 changes: 26 additions & 6 deletions codalab/model/bundle_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -914,10 +914,22 @@ def transition_bundle_preparing(self, bundle, user_id, worker_id, start_time, re
).fetchone()
if not run_row or run_row.user_id != user_id or run_row.worker_id != worker_id:
return False
worker_row = connection.execute(
cl_worker.select().where(and_(cl_worker.c.worker_id == worker_id))
).fetchone()

bundle_update = {
'state': State.PREPARING,
'metadata': {'started': start_time, 'last_updated': start_time, 'remote': remote},
'metadata': {
'started': start_time,
'last_updated': start_time,
'remote': remote,
'on_preemptible_worker': worker_row.preemptible,
'remote_history': getattr(bundle.metadata, "remote_history", [])
+ [
remote
], # Store the history of which workers ran this bundle before in the bundle metadata.
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
},
}
self.update_bundle(bundle, bundle_update, connection)

Expand Down Expand Up @@ -989,6 +1001,8 @@ def transition_bundle_worker_offline(self, bundle):
Transitions bundle to WORKER_OFFLINE state:
Updates the last_updated metadata.
Removes the corresponding row from worker_run if it exists.

If the bundle is preemptible, move the bundle to the STAGED state instead.
"""
with self.engine.begin() as connection:
# Check that it still exists and is running
Expand All @@ -1004,15 +1018,21 @@ def transition_bundle_worker_offline(self, bundle):
# The user deleted the bundle or the bundle finished
return False

if getattr(bundle.metadata, "on_preemptible_worker", False):
# If the bundle is running on a preemptible worker, move the bundle to the STAGED state instead.
bundle_update = {
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
epicfaace marked this conversation as resolved.
Show resolved Hide resolved
'state': State.STAGED,
'metadata': {'last_updated': int(time.time())},
}
else:
bundle_update = {
'state': State.WORKER_OFFLINE,
'metadata': {'last_updated': int(time.time())},
}
# Delete row in worker_run
connection.execute(
cl_worker_run.delete().where(cl_worker_run.c.run_uuid == bundle.uuid)
)

bundle_update = {
'state': State.WORKER_OFFLINE,
'metadata': {'last_updated': int(time.time())},
}
self.update_bundle(bundle, bundle_update, connection)
return True

Expand Down
1 change: 1 addition & 0 deletions codalab/model/tables.py
Original file line number Diff line number Diff line change
Expand Up @@ -521,6 +521,7 @@
'exit_after_num_runs', Integer, nullable=False
), # Number of jobs allowed to run on worker.
Column('is_terminating', Boolean, nullable=False),
Column('preemptible', Boolean, nullable=False), # Whether worker is preemptible.
mysql_charset=TABLE_DEFAULT_CHARSET,
)

Expand Down
3 changes: 3 additions & 0 deletions codalab/model/worker_model.py
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ def worker_checkin(
tag_exclusive,
exit_after_num_runs,
is_terminating,
preemptible,
):
"""
Adds the worker to the database, if not yet there. Returns the socket ID
Expand All @@ -69,6 +70,7 @@ def worker_checkin(
'tag_exclusive': tag_exclusive,
'exit_after_num_runs': exit_after_num_runs,
'is_terminating': is_terminating,
'preemptible': preemptible,
}

# Populate the group for this worker, if group_name is valid
Expand Down Expand Up @@ -206,6 +208,7 @@ def get_workers(self):
'tag_exclusive': row.tag_exclusive,
'exit_after_num_runs': row.exit_after_num_runs,
'is_terminating': row.is_terminating,
'preemptible': row.preemptible,
}
for row in worker_rows
}
Expand Down
1 change: 1 addition & 0 deletions codalab/rest/workers.py
Original file line number Diff line number Diff line change
Expand Up @@ -39,6 +39,7 @@ def checkin(worker_id):
request.json.get("tag_exclusive", False),
request.json.get("exit_after_num_runs", DEFAULT_EXIT_AFTER_NUM_RUNS),
request.json.get("is_terminating", False),
request.json.get("preemptible", False),
)

for run in request.json["runs"]:
Expand Down
4 changes: 4 additions & 0 deletions codalab/worker/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -191,6 +191,9 @@ def parse_args():
default=1,
help='The shared memory size of the run container in GB (defaults to 1).',
)
parser.add_argument(
'--preemptible', action='store_true', help='Whether the worker is preemptible.',
)
return parser.parse_args()


Expand Down Expand Up @@ -328,6 +331,7 @@ def main():
delete_work_dir_on_exit=args.delete_work_dir_on_exit,
exit_on_exception=args.exit_on_exception,
shared_memory_size_gb=args.shared_memory_size_gb,
preemptible=args.preemptible,
)

# Register a signal handler to ensure safe shutdown.
Expand Down
3 changes: 3 additions & 0 deletions codalab/worker/worker.py
Original file line number Diff line number Diff line change
Expand Up @@ -79,6 +79,7 @@ def __init__(
# A flag indicating if the worker will exit if it encounters an exception
exit_on_exception=False, # type: bool
shared_memory_size_gb=1, # type: int
preemptible=False, # type: bool
):
self.image_manager = image_manager
self.dependency_manager = dependency_manager
Expand Down Expand Up @@ -114,6 +115,7 @@ def __init__(
self.terminate_and_restage = False
self.pass_down_termination = pass_down_termination
self.exit_on_exception = exit_on_exception
self.preemptible = preemptible

self.checkin_frequency_seconds = checkin_frequency_seconds
self.last_checkin_successful = False
Expand Down Expand Up @@ -428,6 +430,7 @@ def checkin(self):
'tag_exclusive': self.tag_exclusive,
'exit_after_num_runs': self.exit_after_num_runs - self.num_runs,
'is_terminating': self.terminate or self.terminate_and_restage,
'preemptible': self.preemptible,
}
try:
response = self.bundle_service.checkin(self.id, request)
Expand Down
3 changes: 1 addition & 2 deletions codalab/worker/worker_run_state.py
Original file line number Diff line number Diff line change
Expand Up @@ -344,8 +344,7 @@ def mount_dependency(dependency, shared_file_system):
bundle_dir_wait_num_tries=next_bundle_dir_wait_num_tries,
)
else:
remove_path(run_state.bundle_path)
os.makedirs(run_state.bundle_path)
os.makedirs(run_state.bundle_path, exist_ok=True)

# 2) Set up symlinks
docker_dependencies = []
Expand Down
5 changes: 5 additions & 0 deletions codalab/worker_manager/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -114,6 +114,11 @@ def main():
type=int,
help="The shared memory size in GB of the run container started by the CodaLab Workers.",
)
parser.add_argument(
'--worker-preemptible',
action='store_true',
help='Whether the CodaLab workers are preemptible.',
)
subparsers = parser.add_subparsers(
title='Worker Manager to run',
description='Which worker manager to run (AWS Batch etc.)',
Expand Down
2 changes: 2 additions & 0 deletions codalab/worker_manager/worker_manager.py
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,8 @@ def build_command(self, worker_id: str, work_dir: str) -> List[str]:
)
if self.args.worker_shared_memory_size_gb:
command.extend(['--shared-memory-size-gb', str(self.args.worker_shared_memory_size_gb)])
if self.args.worker_preemptible:
command.extend(['--worker-preemptible'])

return command

Expand Down
6 changes: 6 additions & 0 deletions codalab_service.py
Original file line number Diff line number Diff line change
Expand Up @@ -41,6 +41,8 @@
'worker-manager-cpu',
'worker-manager-gpu',
'worker2',
'worker-preemptible',
'worker-preemptible2',
]

ALL_NO_SERVICES = [
Expand All @@ -59,6 +61,8 @@
'monitor': 'server',
'worker': 'worker',
'worker2': 'worker',
'worker-preemptible': 'worker',
'worker-preemptible2': 'worker',
}

# Max timeout in seconds to wait for request to a service to get through
Expand Down Expand Up @@ -953,6 +957,8 @@ def start_services(self):
else:
self.bring_up_service('worker')
self.bring_up_service('worker2')
self.bring_up_service('worker-preemptible')
self.bring_up_service('worker-preemptible2')

self.bring_up_service('monitor')

Expand Down
48 changes: 48 additions & 0 deletions docker_config/compose_files/docker-compose.yml
Original file line number Diff line number Diff line change
Expand Up @@ -407,6 +407,54 @@ services:
- rest-server
shm_size: '500mb'

worker-preemptible:
image: codalab/worker:${CODALAB_VERSION}
command: |
cl-worker
--server http://rest-server:${CODALAB_REST_PORT}
--work-dir ${CODALAB_WORKER_DIR}
--network-prefix ${CODALAB_WORKER_NETWORK_PREFIX}
--id ${HOSTNAME}-worker-preemptible
--verbose
--preemptible
--tag preemptible
<<: *codalab-base
<<: *codalab-root # Not ideal since worker files saved as root, but without it, can't use docker
depends_on:
- rest-server
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ${CODALAB_WORKER_DIR}:${CODALAB_WORKER_DIR}
networks:
- service
- worker
- rest-server
shm_size: '500mb'

worker-preemptible2:
image: codalab/worker:${CODALAB_VERSION}
command: |
cl-worker
--server http://rest-server:${CODALAB_REST_PORT}
--work-dir ${CODALAB_WORKER_DIR}
--network-prefix ${CODALAB_WORKER_NETWORK_PREFIX}
--id ${HOSTNAME}-worker-preemptible2
--verbose
--preemptible
--tag preemptible
<<: *codalab-base
<<: *codalab-root # Not ideal since worker files saved as root, but without it, can't use docker
depends_on:
- rest-server
volumes:
- /var/run/docker.sock:/var/run/docker.sock
- ${CODALAB_WORKER_DIR}:${CODALAB_WORKER_DIR}
networks:
- service
- worker
- rest-server
shm_size: '500mb'

worker-shared-file-system:
image: codalab/worker:${CODALAB_VERSION}
command: |
Expand Down
14 changes: 14 additions & 0 deletions docs/Checkpoints.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Checkpoints

!!! warning "Unreleased feature"
This feature is still in development. It does not fully work yet and will only work once [#3710](https://github.com/codalab/codalab-worksheets/issues/3710) is resolved.

CodaLab supports the concept of bundle checkpoints, which allow bundles to be stopped and then resumed on different workers. This is especially useful if you want to use preemptible compute nodes, such as [EC2 Spot](https://aws.amazon.com/ec2/spot/) instances. When a worker is preempted, the bundle should be resumed on another worker.

## Workflow steps

This section describes a sample workflow for running bundles on a pool of preemptible workers. The workflow uses the tag `preemptiblepool1` to organize compute, but you can change this tag name as needed.

1. Start a worker manager that runs workers on preemptible compute notes, using the following flags: `--worker-preemptible --worker-tag-exclusive --worker-tag="preemptiblepool1"`. Ensure that all the workers started by the worker manager are configured to share the same network disk for their bundle run working directory / dependency cache.
1. User can run a bundle by running `cl run --request-queue preemptiblepool1`.
1. The bundle manager will assign the bundle to one of the preemptible workers with tag `preemptiblepool1`. If the worker gets preempted, the bundle transitions back to the `STAGED` state and will then be assigned to another preemptible worker with tag `preemptiblepool1`. The history of workers where a bundle has run is stored in the `remote_history` bundle metadata field.
2 changes: 1 addition & 1 deletion docs/Execution.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,7 @@ You can tag workers and run jobs on workers with those tags. To tag a worker, s

To run a job, simply pass the tag in:

cl run date --request-queue tag=<worker_tag>
cl run date --request-queue <worker_tag>

**Other flags**. Run `cl-worker --help` for information on all the supported flags. Aside
from the `--server`, other important flags include `--work-dir`
Expand Down
Loading