Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

More efficient worker check-ins #4024

Merged
merged 12 commits into from
Feb 22, 2022
Merged

More efficient worker check-ins #4024

merged 12 commits into from
Feb 22, 2022

Conversation

epicfaace
Copy link
Member

@epicfaace epicfaace commented Feb 21, 2022

Reasons for making this change

Fixes #4023 and fixes #3645. Running simple bundles is now blazing fast with minimal overhead! This changed the time on my local machine from staging a bundle -> finishing from 52 seconds to 5 seconds. This PR makes it faster in the following ways:

  • Previously, the worker was just blocking when waiting every 5 seconds and not doing any work. This PR lets it continue processing runs while waiting for the next check-in time.
  • Previously, the worker would also wait 5 seconds after finishing all its processing, which this PR removes -- so the worker check-in times will also be closer to the configured 5 seconds rather than something potentially much longer.

Logs for cl run echo with this PR:

User logs:

(venv) :~/environment/codalab/codalab-worksheets $ cl run echo
0x4dfccd1e9e3b441eaaa28c63e6340492

Bundle manager logs:

2022-02-21 20:21:27,037 Staging 0x4dfccd1e9e3b441eaaa28c63e6340492
2022-02-21 20:21:27,066 Starting run bundle 0x4dfccd1e9e3b441eaaa28c63e6340492 on worker ip-172-31-32-14
2022-02-21 20:21:32,297 Acknowledged finalization of run bundle 0x4dfccd1e9e3b441eaaa28c63e6340492 on worker ip-172-31-32-14

Worker logs:

2022-02-21 20:21:27,066 Connected! Successful check in! /opt/codalab/worker/worker.py 429
2022-02-21 20:21:27,067 Received run message: {'type': 'run', 'bundle': {'uuid': '0x4dfccd1e9e3b441eaaa28c63e6340492', 'bundle_type': 'run', 'owner_id': '0', 'command': 'echo', 'data_hash': None, 'state': 'starting', 'frozen': None, 'is_anonymous': False, 'metadata': {'actions': [], 'allow_failed_dependencies': False, 'created': 1645474886, 'description': '', 'exclude_patterns': [], 'last_updated': 1645474887, 'name': 'run-echo', 'remote_history': [], 'request_cpus': 1, 'request_disk': '', 'request_docker_image': '', 'request_gpus': 0, 'request_memory': '2g', 'request_network': True, 'request_priority': 0, 'request_queue': '', 'request_time': '', 'store': '', 'tags': []}, 'args': 'run echo --request-cpus 1 --request-memory 2g --request-network', 'dependencies': [], 'location': None}, 'resources': {'cpus': 1, 'gpus': 0, 'docker_image': 'codalab/default-cpu:latest', 'time': 0, 'memory': 2147483648, 'disk': 0, 'network': True}, 'meta': {'version': '1.4.2'}} /opt/codalab/worker/worker.py 444
2022-02-21 20:21:27,121 Downloading Docker image codalab/default-cpu:latest /opt/codalab/worker/docker_image_manager.py 136
2022-02-21 20:21:27,122 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,152 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,179 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,200 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,219 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,240 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,260 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,287 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,313 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,332 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,350 bundle is not ready yet. uuid: 0x4dfccd1e9e3b441eaaa28c63e6340492. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 20:21:27,375 Download for Docker image codalab/default-cpu:latest complete /opt/codalab/worker/docker_image_manager.py 150
2022-02-21 20:21:27,874 Started Docker container for UUID 0x4dfccd1e9e3b441eaaa28c63e6340492, container ID df2ffef7d2e408081c596ba7d0b3196341b2da24264d7bed56eee7bdee01e220, /opt/codalab/worker/docker_utils.py 215
2022-02-21 20:21:28,309 Finished run with UUID 0x4dfccd1e9e3b441eaaa28c63e6340492, exitcode 0, failure_message None /opt/codalab/worker/worker_run_state.py 569
2022-02-21 20:21:30,146 Bundle 0x4dfccd1e9e3b441eaaa28c63e6340492 is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.UPLOADING_RESULTS /opt/codalab/worker/worker_run_state.py 142
2022-02-21 20:21:30,169 Uploading results for run with UUID 0x4dfccd1e9e3b441eaaa28c63e6340492 /opt/codalab/worker/worker_run_state.py 674
2022-02-21 20:21:30,195 About to read the response... url: /bundles/0x4dfccd1e9e3b441eaaa28c63e6340492/contents/blob/ /opt/codalab/worker/rest_client.py 168
2022-02-21 20:21:30,262 Finished reading the response, url: /bundles/0x4dfccd1e9e3b441eaaa28c63e6340492/contents/blob/ /opt/codalab/worker/rest_client.py 182
2022-02-21 20:21:30,282 Bundle 0x4dfccd1e9e3b441eaaa28c63e6340492 is transitioning from RUN_STAGE.UPLOADING_RESULTS to RUN_STAGE.FINALIZING /opt/codalab/worker/worker_run_state.py 142
2022-02-21 20:21:32,298 Connected! Successful check in! /opt/codalab/worker/worker.py 429
2022-02-21 20:21:32,298 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x4dfccd1e9e3b441eaaa28c63e6340492', 'meta': {'version': '1.4.2'}} /opt/codalab/worker/worker.py 444
2022-02-21 20:21:40,339 Connected! Successful check in! /opt/codalab/worker/worker.py 429
2022-02-21 20:21:48,371 Connected! Successful check in! /opt/codalab/worker/worker.py 429

Logs for cl run echo without this PR:

User logs:

(venv) :~/environment/codalab/codalab-worksheets $ cl run echo
0x3984b6eca9164931a7db198f9f9d6fe1

Bundle manager logs:

2022-02-21 19:55:00,523 Staging 0x3984b6eca9164931a7db198f9f9d6fe1
2022-02-21 19:55:03,955 Starting run bundle 0x3984b6eca9164931a7db198f9f9d6fe1 on worker ip-172-31-32-14
2022-02-21 19:55:52,111 Acknowledged finalization of run bundle 0x3984b6eca9164931a7db198f9f9d6fe1 on worker ip-172-31-32-14

Worker logs:

2022-02-21 19:54:58,287 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:03,956 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:03,956 Received run message: {'type': 'run', 'bundle': {'uuid': '0x3984b6eca9164931a7db198f9f9d6fe1', 'bundle_type': 'run', 'owner_id': '0', 'command': 'echo', 'data_hash': None, 'state': 'starting', 'frozen': None, 'is_anonymous': False, 'metadata': {'actions': [], 'allow_failed_dependencies': False, 'created': 1645473300, 'description': '', 'exclude_patterns': [], 'last_updated': 1645473303, 'name': 'run-echo', 'remote_history': [], 'request_cpus': 1, 'request_disk': '', 'request_docker_image': '', 'request_gpus': 0, 'request_memory': '2g', 'request_network': True, 'request_priority': 0, 'request_queue': '', 'request_time': '', 'store': '', 'tags': []}, 'args': 'run echo --request-cpus 1 --request-memory 2g --request-network', 'dependencies': [], 'location': None}, 'resources': {'cpus': 1, 'gpus': 0, 'docker_image': 'codalab/default-cpu:latest', 'time': 0, 'memory': 2147483648, 'disk': 0, 'network': True}, 'meta': {'version': '1.4.2'}} /opt/codalab/worker/worker.py 446
2022-02-21 19:55:09,007 Downloading Docker image codalab/default-cpu:latest /opt/codalab/worker/docker_image_manager.py 136
2022-02-21 19:55:09,008 bundle is not ready yet. uuid: 0x3984b6eca9164931a7db198f9f9d6fe1. status message: Pulling docker image codalab/default-cpu:latest Download starting /opt/codalab/worker/worker_run_state.py 303
2022-02-21 19:55:09,241 Download for Docker image codalab/default-cpu:latest complete /opt/codalab/worker/docker_image_manager.py 150
2022-02-21 19:55:12,056 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:17,623 Started Docker container for UUID 0x3984b6eca9164931a7db198f9f9d6fe1, container ID 8d35bd9b40a6975a0f5999d38cc2be0843c5ca96b9b19d8ab74c2e59b6eee5d8, /opt/codalab/worker/docker_utils.py 215
2022-02-21 19:55:20,738 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:25,790 Finished run with UUID 0x3984b6eca9164931a7db198f9f9d6fe1, exitcode 0, failure_message None /opt/codalab/worker/worker_run_state.py 571
2022-02-21 19:55:30,809 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:35,856 Bundle 0x3984b6eca9164931a7db198f9f9d6fe1 is transitioning from RUN_STAGE.CLEANING_UP to RUN_STAGE.UPLOADING_RESULTS /opt/codalab/worker/worker_run_state.py 142
2022-02-21 19:55:38,896 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:43,925 Uploading results for run with UUID 0x3984b6eca9164931a7db198f9f9d6fe1 /opt/codalab/worker/worker_run_state.py 676
2022-02-21 19:55:43,943 About to read the response... url: /bundles/0x3984b6eca9164931a7db198f9f9d6fe1/contents/blob/ /opt/codalab/worker/rest_client.py 168
2022-02-21 19:55:43,998 Finished reading the response, url: /bundles/0x3984b6eca9164931a7db198f9f9d6fe1/contents/blob/ /opt/codalab/worker/rest_client.py 182
2022-02-21 19:55:46,978 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:52,005 Bundle 0x3984b6eca9164931a7db198f9f9d6fe1 is transitioning from RUN_STAGE.UPLOADING_RESULTS to RUN_STAGE.FINALIZING /opt/codalab/worker/worker_run_state.py 142
2022-02-21 19:55:52,111 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:55:52,112 Received mark_finalized message: {'type': 'mark_finalized', 'uuid': '0x3984b6eca9164931a7db198f9f9d6fe1', 'meta': {'version': '1.4.2'}} /opt/codalab/worker/worker.py 446
2022-02-21 19:56:00,168 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:56:08,223 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:56:16,280 Connected! Successful check in! /opt/codalab/worker/worker.py 431
2022-02-21 19:56:24,341 Connected! Successful check in! /opt/codalab/worker/worker.py 431

@epicfaace
Copy link
Member Author

Haha bundles are now so fast that some of our tests, which relied on bundles being pretty slow, are now failing:

image

@percyliang
Copy link
Collaborator

This is amazing, @epicfaace !!

self.terminate = True
break
self.process_runs()
time.sleep(0.003)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this sleep and why just 0.003 specifically?

Copy link
Member Author

@epicfaace epicfaace Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm actually not sure if we need it. I added it just in case so it doesn't call process_runs too often.

I used 0.003 as we use sleep(0.003) elsewhere in the codebase when we don't need to sleep for too long.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just get rid of the sleep then?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we can

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't this cause 100% CPU utilization? 0.003 is very different from zero.

Copy link
Member Author

@epicfaace epicfaace Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I re-added it.

We can always get this in with sleep(0.003) (which is a strict improvement) and see later if there is space for more optimization.

time.sleep(self.checkin_frequency_seconds)
last_checkin = time.time()
# Process runs until it's time for the next checkin.
while not self.terminate and (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this check can we just use threading.timer? In this case it won't block at all.

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. If we use threading.Timer, we won't block, that's correct, but then we'll be running process_runs only once very 5 seconds. What makes bundles much faster is that we need to run process_runs much more often -- not rate-limit it as much as self.checkin is rate-limited.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Workers often check in far less frequently than the configured checkin time Low latency for simple jobs
4 participants