Use websockets for json communication #4490
"working quite well now" 👀 👀
…th exponential backoff, it's still not very aggressive at all
…have any sockets open with ws-server
codalab/bin/ws_server.py
Outdated
return

logger.warning(f"All websockets for worker {worker_id} are currently busy.")
await server_websocket.close(1013, f"All websockets for worker {worker_id} are currently busy.")
what is 1013?
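For context, 1013 is the WebSocket close status code registered with IANA as "Try Again Later": the server signals a temporary condition (here, every socket for the worker being busy) and the client may reconnect later. Below is a minimal sketch of how a client might react, assuming a hypothetical retry policy with exponential backoff. The function name, base delay, and cap are illustrative and not taken from this PR.

```python
from typing import Optional

# WebSocket close code 1013 is registered with IANA as "Try Again Later".
TRY_AGAIN_LATER = 1013


def retry_delay_seconds(close_code: int, attempt: int,
                        base: float = 0.5, cap: float = 30.0) -> Optional[float]:
    """Return how long to wait before reconnecting, or None to give up.

    Hypothetical policy: retry only when the server said "try again later",
    doubling the delay on each attempt up to a fixed cap.
    """
    if close_code != TRY_AGAIN_LATER:
        return None
    return min(cap, base * (2 ** attempt))
```

With these assumed defaults, the first retry waits 0.5 seconds and later retries are capped at 30 seconds; any other close code is treated as not retryable in this sketch.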
codalab/lib/download_manager.py
Outdated
logging.info('Unable to reach worker')

def _get_read_response_stream(self, response_socket_id):
with closing(self._worker_model.start_listening(response_socket_id)) as sock:
- header_message = self._worker_model.get_json_message(sock, 60)
+ header_message = self._worker_model.recv_json_message_with_sock(sock, 60)
rename to `get_json_message`?
I do want to keep the `with_sock` portion (I changed it to `with_unix_socket` in this newest version) since it helps make clear that it's using AF_UNIX sockets. However, I could change `recv` back to `get`; I thought `recv` was clearer since that's typically the API for getting data from sockets.
Let me know. I can definitely change this.
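To illustrate the naming argument, here is a minimal sketch of what a `recv`-style helper over an AF_UNIX socket could look like. It is not the PR's implementation: the newline framing, the one-message-per-connection assumption, and returning `None` on timeout or EOF are all assumptions made for this example.

```python
import json
import socket


def recv_json_message_with_unix_socket(sock: socket.socket, timeout_secs: float):
    """Read one newline-terminated JSON message, or return None on timeout/EOF.

    The newline framing is an assumption made for this sketch; it also assumes
    a single message per connection.
    """
    sock.settimeout(timeout_secs)
    buffer = bytearray()
    try:
        while not buffer.endswith(b"\n"):
            chunk = sock.recv(4096)
            if not chunk:  # peer closed before a full message arrived
                return None
            buffer.extend(chunk)
    except socket.timeout:
        return None
    return json.loads(buffer.decode("utf-8"))


if __name__ == "__main__":
    # socket.socketpair() yields a connected AF_UNIX pair on Unix-like systems.
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sender.sendall(json.dumps({"status": "ok"}).encode("utf-8") + b"\n")
        print(recv_json_message_with_unix_socket(receiver, 1.0))
```

The name mirrors `socket.recv`, which is the point made above: a reader sees at a glance that the helper blocks on a Unix socket rather than fetching a value that is already available.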
…or if the client tries to connect to an invalid path.
Looks good to me. This new architecture is all documented in your doc, correct? Can you link to it? (Maybe we also add a link to it as a comment in the code in `ws_server.py` so we never lose it.)
codalab/worker/worker.py
Outdated
@@ -836,7 +852,7 @@ def netcat_fn():
  break
  total_data.append(data)
  s.close()
- reply(None, {}, b''.join(total_data))
+ reply(None, {}, io.BytesIO(b''.join(total_data)))
Do we need this change?
codalab/worker/worker.py
Outdated
@@ -136,6 +140,11 @@ def __init__(

self.ws_server = ws_server

assert (
num_coroutines > 0 and type(num_coroutines) is int
Do we need this?
@AndrewJGaut is this ready to merge?
I figured it was enough to link the doc in the PR description, since it's stated pretty clearly. Do you think we should link it in the code as well? I can do that.
Let me test it out in dev one more time to be sure.
…se-websockets-for-json-communication
Confirmed it's good to merge.
This reverts commit 7c460c1.
This PR addresses #4431. It modifies JSON messaging between the REST server / bundle manager and workers.
An in-depth document describing these changes and their motivation is given here.
Summary
Previously, messages were sent to workers in the HTTP response to worker check-ins. This was inefficient: a server thread would try to send the message through an AF_UNIX socket corresponding to the worker using TCP, and if the sending and receiving threads were not synchronized, the message would not be sent. This need for synchronization made server-to-worker communication highly inefficient. To mitigate this, a websocket server was added that pinged the worker meant to receive the message, causing it to check in immediately so that the receiving server thread would start listening on the AF_UNIX socket as soon as possible. However, desynchronization and the corresponding inefficiency persisted.
After this PR, JSON messages are sent to workers immediately through the websocket server. This means that:
(1) Workers no longer receive any messages through HTTP responses to the check-in.
(2) No synchronization is required between server threads for JSON messaging.
(3) The websocket server no longer uses the ping functionality.
We also add authentication of workers and the server to the websocket server. (The need for this is described here.)
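The "send immediately" flow summarized above can be sketched as follows, assuming each worker keeps several websocket connections open and the server picks the first idle one. `FakeWebSocket`, `WorkerSocketPool`, and `send_json` are hypothetical names used only for illustration, not the PR's actual classes; a real server would hold connection objects from a websocket library instead of the stand-in below.

```python
import asyncio
import json


class FakeWebSocket:
    """Stand-in for a real websocket connection (hypothetical, for illustration)."""

    def __init__(self):
        self.sent = []
        self.lock = asyncio.Lock()  # held while a message is in flight

    async def send(self, data: str) -> None:
        self.sent.append(data)


class WorkerSocketPool:
    """Tracks the open sockets for each worker and sends on the first idle one."""

    def __init__(self):
        self._sockets = {}  # worker_id -> list of FakeWebSocket

    def register(self, worker_id: str, ws: FakeWebSocket) -> None:
        self._sockets.setdefault(worker_id, []).append(ws)

    async def send_json(self, worker_id: str, message: dict) -> bool:
        """Send immediately if some socket is idle; return False if all are busy."""
        for ws in self._sockets.get(worker_id, []):
            if not ws.lock.locked():
                async with ws.lock:
                    await ws.send(json.dumps(message))
                return True
        # No idle socket (or unknown worker): the caller decides how to reject,
        # for example by closing the requesting connection with code 1013.
        return False
```

Because the message goes straight onto an open connection, no server thread has to wait for a worker check-in, which is the synchronization the summary says is no longer needed.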
Speedup
When looking at the `time` tests, the only test that uses worker communication is the `run` test. Before the websockets change, that test took over 11 seconds (11.86 seconds here, 11.44 seconds here). After the websockets change, it takes 5.57 seconds. That's a speedup of over 2x! (To see the time in those links, click on "Run tests using Docker runtime" and then search for `cl info -f name`; the time is the number right above the line that comes up.)
Future Work / TODO
I might do these in a future PR: