[dashboard agent] Catch agent port conflict #23024
Conversation
Minimal install failures seem related.
Yeah, I'm not sure what's different about the minimal env. I should improve or modify the test.
try:
    self.http_server = await self._configure_http_server(modules)
except Exception:
    logger.exception(
Can you add TODO here to remove this in the future? (and do better port resolution)
dashboard/agent.py
Outdated
    gcs_publisher=gcs_publisher,
)
logger.error(message)
# Agent is failed to be started many times.
This comment seems wrong?
dashboard/agent.py
Outdated
gcs_publisher = GcsPublisher(address=args.gcs_address)

traceback_str = ray._private.utils.format_error_message(traceback.format_exc())
message = (
Do we need it? In this case, the raylet will just fail, right?
Yeah, if we enable fate sharing by default, we don't need this. I can remove it.
@@ -603,7 +603,8 @@ async def _perform_iteration(self, publisher):
     await asyncio.sleep(reporter_consts.REPORTER_UPDATE_INTERVAL_MS / 1000)

 async def run(self, server):
-    reporter_pb2_grpc.add_ReporterServiceServicer_to_server(self, server)
+    if server:
Any way to just not run this function instead of adding this logic? Seems a bit ugly
I didn't find a better way. `run` is a common interface of the dashboard agent modules. We can do other things in this function besides starting the gRPC server, so if we don't run it, we could lose some code paths or features.
@@ -121,8 +121,6 @@ def test_port_conflict(call_ray_stop_only, shutdown_only):
     with pytest.raises(ValueError, match="already occupied"):
         ray.init(dashboard_port=9999, include_dashboard=True)

-    sock.close()
Why remove this?
@@ -330,13 +318,13 @@ def f():
     """
     Test task raises an exception.
     """
-    with pytest.raises(RuntimeEnvSetupError):
+    with pytest.raises(ray.exceptions.LocalRayletDiedError):
Can you check that there's an error message saying the reason is the agent failure?
I think I can remove this test. In this test (test_runtime_env_broken), we injected an env to make the agent fail. Agent failure leads to raylet failure, and raylet failure leads to driver failure (the raylet client fails). We can't catch any error message.
ray.get(f.options(runtime_env=runtime_env).remote())
"""
Test actor task raises an exception.
"""
a = A.options(runtime_env=runtime_env).remote()
with pytest.raises(ray.exceptions.RuntimeEnvSetupError):
Same as above
src/ray/raylet/agent_manager.cc
Outdated
  RAY_LOG(INFO) << "HandleRegisterAgent, ip: " << agent_ip_address_
                << ", port: " << agent_port_ << ", pid: " << agent_pid_;
} else {
  RAY_LOG(ERROR) << "The grpc port of agent is invalid (be 0), ip: "
Can you use WARNING instead? ERROR will propagate to the driver
src/ray/raylet/agent_manager.cc
Outdated
@@ -162,10 +142,11 @@ void AgentManager::CreateRuntimeEnv(

   if (runtime_env_agent_client_ == nullptr) {
     // If the agent cannot be restarted anymore, fail the request.
-    if (agent_restart_count_ >= RayConfig::instance().agent_max_restart_count()) {
+    if (disable_agent_client_ || agent_failed_) {
If agent_failed_ == true, this line should never be reached? Maybe instead:
RAY_CHECK(!agent_failed_);
if (disable_agent_client_)
Remove the flag agent_failed_.
/// Whether or not we intend to start the agent. This is false if we
/// are missing Ray Dashboard dependencies, for example.
bool should_start_agent_ = true;
std::string agent_ip_address_;
DelayExecutorFn delay_executor_;
RuntimeEnvAgentClientFactoryFn runtime_env_agent_client_factory_;
std::shared_ptr<rpc::RuntimeEnvAgentClientInterface> runtime_env_agent_client_;
bool disable_agent_client_ = false;
Can you add a comment to explain the variable?
Most of the comments are about making sure we have the right message when things fail due to this issue.
Thanks! I will address the comments (including another PR) tonight and ping you guys again.
I think we can remove …. For …: also, what's the exact current behavior when the local node is dead? Can you tell me how the error is propagated to the driver? (Is the driver just going to crash with a check failure?) Also, can you add a test to verify fate sharing? I couldn't find it, but I might be missing something (if so, please give me the name of the function!)
ray/src/ray/raylet/agent_manager.cc
Line 223 in 2294a7e
CreateRuntimeEnv(job_id, serialized_runtime_env, |
-> We should also handle this better. Maybe just fail the request instead of running the infinite retry? It should be fine since the agent will fate-share with the raylet.
# except grpc port.
run_tasks_with_runtime_env()
I think we don't have a test to verify fate sharing?
python/ray/tests/test_dashboard.py
Outdated
        }
    ],
    indirect=True,
)
def test_dashboard_agent_restart(
Let's just remove it and instead add a test to verify fate sharing
      runtime_env_agent_client_factory_(agent_ip_address_, agent_port_);
  RAY_LOG(INFO) << "HandleRegisterAgent, ip: " << agent_ip_address_
                << ", port: " << agent_port_ << ", pid: " << agent_pid_;
  // Note: `agent_port_` should be 0 if the grpc port of agent is in conflict.
Can you add TODO here to remove this workaround?
src/ray/raylet/agent_manager.cc
Outdated
@@ -230,6 +208,11 @@ void AgentManager::CreateRuntimeEnv(

void AgentManager::DeleteURIs(const std::vector<std::string> &uris,
                              DeleteURIsCallback callback) {
  if (disable_agent_client_) {
    RAY_LOG(ERROR) << "Failed to delete URIs because the agent client is disabled.";
Since it is a user-facing error, we should refine it a bit more, based on https://spark.apache.org/error-message-guidelines.html:

"Failed to delete the runtime environment URI because the Ray agent couldn't be started due to a port conflict. See `dashboard_agent.log` for more details. To solve the problem, start Ray with a hard-coded agent port (`ray start --dashboard-agent-grpc-port [port]`) and make sure the port is not used by other processes."

- Who encountered the problem?
- What was the problem?
- When did the problem happen?
- Where did the problem happen?
- Why did the problem happen?
- How can the problem be solved?
We already have the fate-sharing test named …. About the driver failure, the error message is: …
I saw this in https://buildkite.com/ray-project/ray-builders-pr/builds/26582#4eecbf4d-a236-4ab9-b350-10dfd971fb04
Hmm, this one seems to be a pretty bad error message... Can you at least mention in the error message to take a look at the raylet.out log file? I think we should find a better way to propagate error messages, but it seems difficult to achieve in the short term. Also, do you happen to know why this only happens in the minimal install, not a regular test?
LGTM given the remaining comments are addressed.
Btw, do you think this should be part of 1.12? My impression is it could be a breaking change, and allowing enough time (1 release) doesn't seem like a bad choice. Do you guys have hard dependencies on this feature?
I don't know the reason. Maybe the test finishes before the driver detects the raylet failure.
I think fate sharing, catching port conflicts, and the URI reference refactor can enhance stability rather than cause regressions. Do we also have 1 or 2 weeks to test before the 1.12 release? We have two choices:

Option 1: Merge the current PR and #22828 before the 1.12 cut, and do more testing before the 1.12 release.

Option 2: Merge the current PR and #22828 after the 1.12 cut, but merge them as soon as possible because they block a lot of following PRs.

I lean toward option 1, but it's also OK to choose option 2. @edoakes @architkulkarni What do you think?
Looks good to me; I just added some nits about the wording in some of the error messages (I know some of them predated this PR; it would be nice to fix them, but no need to block the PR on this).
One more comment though: can we add Sang's suggestion to the error messages?

"To solve the problem, start Ray with a hard-coded agent port (`ray start --dashboard-agent-grpc-port [port]`) and make sure the port is not used by other processes."

I think it would be really helpful for users who see the error message.
src/ray/raylet/agent_manager.cc
Outdated
  RAY_LOG(WARNING) << "The grpc port of agent is invalid (be 0), ip: "
                   << agent_ip_address_ << ", pid: " << agent_pid_
                   << ". Disable the agent client in raylet.";
Suggested change:
  RAY_LOG(WARNING) << "The gRPC port of the Ray agent is invalid (0), ip: "
                   << agent_ip_address_ << ", pid: " << agent_pid_
                   << ". The agent client in the raylet has been disabled.";
src/ray/raylet/agent_manager.cc
Outdated
  RAY_LOG(ERROR) << "Raylet exits immediately because the ray agent has failed. "
                    "Raylet fate shares with the agent. It can happen because the "
                    "Ray agent is unexpectedly killed or failed. See "
                    "`dashboard_agent.log` for the root cause.";
Suggested change:
  RAY_LOG(ERROR) << "The raylet exited immediately because the Ray agent failed. "
                    "The raylet fate shares with the agent. This can happen because the "
                    "Ray agent was unexpectedly killed or failed. See "
                    "`dashboard_agent.log` for the root cause.";
src/ray/raylet/agent_manager.cc
Outdated
      << "Runtime environment " << serialized_runtime_env
      << " cannot be created on this node because the agent client is disabled. You "
         "see this error message maybe because the grpc port of agent came into "
         "conflict. Please see `dashboard_agent.log` to get more details.";
Suggested change:
      << "The runtime environment " << serialized_runtime_env
      << " cannot be created on this node because the agent client is disabled. You "
         "might see this error message because the gRPC port of the agent came into "
         "conflict. Please see `dashboard_agent.log` to get more details.";
src/ray/raylet/agent_manager.cc
Outdated
      << "Failed to delete runtime env URIs because the agent client is disabled. You "
         "see this error message maybe because the grpc port of agent came into "
         "conflict. Please see `dashboard_agent.log` to get more details.";
Suggested change:
      << "Failed to delete runtime env URIs because the agent client is disabled. You "
         "might see this error message because the gRPC port of the agent came into "
         "conflict. Please see `dashboard_agent.log` to get more details.";
I agree that the stability will be improved by these two PRs, so I think option 1 would be better.
-      [this,
-       job_id,
-       serialized_runtime_env,
+      [serialized_runtime_env,
Suggested change:
      [serialized_runtime_env = std::move(serialized_runtime_env),

Is it possible to do this?
We can't do this because `serialized_runtime_env` is a reference from the task_spec.
src/ray/raylet/agent_manager.cc
Outdated
@@ -261,7 +234,7 @@ void AgentManager::DeleteURIs(const std::vector<std::string> &uris,
     request.add_uris(uri);
   }
   runtime_env_agent_client_->DeleteURIs(
-      request, [this, uris, callback](Status status, const rpc::DeleteURIsReply &reply) {
+      request, [callback](Status status, const rpc::DeleteURIsReply &reply) {
Suggested change:
      request, [callback = std::move(callback)](Status status, const rpc::DeleteURIsReply &reply) {

?
It's ok
Good comments! Fixed!
Why are these changes needed?
- Enable `raylet_shares_fate_with_agent` by default.

Checks
- I've run `scripts/format.sh` to lint the changes in this PR.