[dashboard agent] Catch agent port conflict #23024

Merged
merged 13 commits into ray-project:master from dev_catch_agent_port_conflict on Mar 15, 2022

Conversation

SongGuyang
Contributor

@SongGuyang SongGuyang commented Mar 10, 2022

Why are these changes needed?

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Contributor

The minimal install failures seem related.

@SongGuyang
Contributor Author

Yeah, I'm not sure what the difference is in the minimal env. I should improve or modify the test.

try:
    self.http_server = await self._configure_http_server(modules)
except Exception:
    logger.exception(
Contributor

Can you add a TODO here to remove this in the future (and do better port resolution)?
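
For context, a hedged sketch of what the requested TODO might look like around this block (the TODO wording, the fallback to `None`, and the log message are illustrative assumptions, not the PR's actual change):

```python
try:
    self.http_server = await self._configure_http_server(modules)
except Exception:
    # TODO: Remove this workaround and resolve the HTTP port conflict
    # properly (e.g. retry on another port) in a follow-up.
    logger.exception("Failed to start the dashboard agent HTTP server.")
    self.http_server = None  # illustrative fallback; keep the agent running
```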

gcs_publisher=gcs_publisher,
)
logger.error(message)
# Agent is failed to be started many times.
Contributor

This comment seems wrong?

gcs_publisher = GcsPublisher(address=args.gcs_address)

traceback_str = ray._private.utils.format_error_message(traceback.format_exc())
message = (
Contributor

Do we need it? In this case, the raylet will just fail, right?

Contributor Author

Yeah, if we enable fate sharing by default, we don't need this. I can remove it.

@@ -603,7 +603,8 @@ async def _perform_iteration(self, publisher):
await asyncio.sleep(reporter_consts.REPORTER_UPDATE_INTERVAL_MS / 1000)

async def run(self, server):
reporter_pb2_grpc.add_ReporterServiceServicer_to_server(self, server)
if server:
Contributor

Is there any way to just not run this function instead of adding this logic? It seems a bit ugly.

Contributor Author

I didn't find a better way. `run` is a common interface of the dashboard agent modules. A module can do other work in this function besides starting the gRPC server, so if we don't run it, we could lose some code paths or features.
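
To illustrate the pattern being discussed, here is a minimal sketch of a module `run()` that tolerates a missing gRPC server (the import path and the placeholder periodic work are assumptions; only the guarded servicer registration mirrors the diff above):

```python
import asyncio

# Assumed import path for the generated gRPC stubs; the registration call
# itself is the one shown in the diff above.
from ray.core.generated import reporter_pb2_grpc


class ReporterModuleSketch:
    """Sketch: a dashboard agent module whose gRPC server may be None."""

    async def run(self, server):
        if server:
            # Register the servicer only when the agent managed to start a
            # gRPC server (it may not have, e.g. due to a port conflict).
            reporter_pb2_grpc.add_ReporterServiceServicer_to_server(self, server)
        # The rest of run() still executes, so the module keeps its other
        # code paths (periodic reporting, etc.) even without a gRPC server.
        while True:
            await asyncio.sleep(1)  # placeholder for the module's periodic work
```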

@@ -121,8 +121,6 @@ def test_port_conflict(call_ray_stop_only, shutdown_only):
with pytest.raises(ValueError, match="already occupied"):
ray.init(dashboard_port=9999, include_dashboard=True)

sock.close()
Contributor

Why remove this?

@@ -330,13 +318,13 @@ def f():
"""
Test task raises an exception.
"""
with pytest.raises(RuntimeEnvSetupError):
with pytest.raises(ray.exceptions.LocalRayletDiedError):
Contributor

Can you check that there's an error message saying the failure is due to the agent failure?

Contributor Author

I think I can remove this test. In this test (test_runtime_env_broken), we inject an env to make the agent fail. The agent failure leads to a raylet failure, and the raylet failure leads to a driver failure (the raylet client fails). We can't catch any error message.

ray.get(f.options(runtime_env=runtime_env).remote())
"""
Test actor task raises an exception.
"""
a = A.options(runtime_env=runtime_env).remote()
with pytest.raises(ray.exceptions.RuntimeEnvSetupError):
Contributor

Same as above

RAY_LOG(INFO) << "HandleRegisterAgent, ip: " << agent_ip_address_
<< ", port: " << agent_port_ << ", pid: " << agent_pid_;
} else {
RAY_LOG(ERROR) << "The grpc port of agent is invalid (be 0), ip: "
Contributor

Can you use WARNING instead? ERROR will propagate to the driver.

@@ -162,10 +142,11 @@ void AgentManager::CreateRuntimeEnv(

if (runtime_env_agent_client_ == nullptr) {
// If the agent cannot be restarted anymore, fail the request.
if (agent_restart_count_ >= RayConfig::instance().agent_max_restart_count()) {
if (disable_agent_client_ || agent_failed_) {
Contributor

If agent_failed == true, this line should never be reached? Maybe instead:

RAY_CHECK(!agent_failed)
if (disable_agent_client_)

Contributor Author

Removed the flag agent_failed_.

/// Whether or not we intend to start the agent. This is false if we
/// are missing Ray Dashboard dependencies, for example.
bool should_start_agent_ = true;
std::string agent_ip_address_;
DelayExecutorFn delay_executor_;
RuntimeEnvAgentClientFactoryFn runtime_env_agent_client_factory_;
std::shared_ptr<rpc::RuntimeEnvAgentClientInterface> runtime_env_agent_client_;
bool disable_agent_client_ = false;
Contributor

Can you add a comment to explain the variable?

@rkooo567 rkooo567 added the `@author-action-required` label (The PR author is responsible for the next step. Remove tag to send back to the reviewer.) on Mar 11, 2022
@rkooo567
Contributor

Most of the comments are about making sure we have the right message when things fail due to this issue.

@SongGuyang
Contributor Author

Thanks! I will address the comments (including another PR) tonight and ping you guys again.

@SongGuyang
Contributor Author

SongGuyang commented Mar 11, 2022

@rkooo567 Please take a look again. One question: can I skip the two tests? 32ce0e2

@SongGuyang SongGuyang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 11, 2022
@rkooo567
Contributor

rkooo567 commented Mar 12, 2022

I think we can remove test_dashboard_agent_restart as long as we have a test to check fate sharing.

For test_runtime_env_broken, can we just test it on a different node than the local one?

Also, what's the exact current behavior when the local node is dead? Can you tell me how the error is propagated to the driver? (Is the driver just going to crash with a check failure?)

Also, can you add a test to verify fate sharing? I couldn't find it, but I might be missing something (if so, please give me the name of the function!)

Contributor

@rkooo567 rkooo567 left a comment

CreateRuntimeEnv(job_id, serialized_runtime_env,

-> We should also handle this better. Maybe just fail the request instead of running the infinite retry? It should be fine since the agent will fate-share with the raylet.

# except grpc port.
run_tasks_with_runtime_env()


Contributor

I think we don't have a test to verify fate sharing?

}
],
indirect=True,
)
def test_dashboard_agent_restart(
Contributor

Let's just remove it and instead add a test to verify fate sharing.
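
For illustration, a rough sketch of what such a fate-sharing test could look like (the psutil-based process discovery, the agent's command-line pattern, and the test name are assumptions, not this PR's actual test):

```python
import psutil
import ray


def test_raylet_and_agent_share_fate_sketch(shutdown_only):
    """Sketch: killing the dashboard agent should take the raylet down with it."""
    ray.init(include_dashboard=True)

    # Assumption: locate the processes by name / command line via psutil.
    raylet = next(
        p for p in psutil.process_iter(["name"]) if p.info["name"] == "raylet"
    )
    agent = next(
        p
        for p in psutil.process_iter(["cmdline"])
        if any("dashboard/agent.py" in arg for arg in (p.info["cmdline"] or []))
    )

    # Kill the agent; with fate sharing, the raylet should exit shortly after.
    agent.kill()
    raylet.wait(timeout=30)
    assert not raylet.is_running()
```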

runtime_env_agent_client_factory_(agent_ip_address_, agent_port_);
RAY_LOG(INFO) << "HandleRegisterAgent, ip: " << agent_ip_address_
<< ", port: " << agent_port_ << ", pid: " << agent_pid_;
// Note: `agent_port_` should be 0 if the grpc port of agent is in conflict.
Contributor

Can you add a TODO here to remove this workaround?

@@ -230,6 +208,11 @@ void AgentManager::CreateRuntimeEnv(

void AgentManager::DeleteURIs(const std::vector<std::string> &uris,
DeleteURIsCallback callback) {
if (disable_agent_client_) {
RAY_LOG(ERROR) << "Failed to delete URIs because the agent client is disabled.";
Contributor

Since it is a user-facing error, we should refine it a bit more.

Based on https://spark.apache.org/error-message-guidelines.html

Failed to delete runtime environment URI because the ray agent couldn't be started due to the port conflict. See `dashboard_agent.log` for more details. To solve the problem, start ray with a hard-coded agent port. `ray start --dashboard-agent-grpc-port [port]` and make sure the port is not used by other processes.
  • Who encountered the problem?
  • What was the problem?
  • When did the problem happen?
  • Where did the problem happen?
  • Why did the problem happen?
  • How can the problem be solved?

@SongGuyang
Contributor Author

We already have the fate sharing test named test_raylet_and_agent_share_fate.

About the driver failure, the error message is:

::test_runtime_env_broken[dict] 2022-03-11 08:33:06,463	INFO worker.py:978 -- Connecting to existing Ray cluster at address: 172.18.0.3:59955
--
  | [2022-03-11 08:33:15,674 E 6912 13117] core_worker_process.cc:218: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

I saw this in the https://buildkite.com/ray-project/ray-builders-pr/builds/26582#4eecbf4d-a236-4ab9-b350-10dfd971fb04

@rkooo567
Contributor

rkooo567 commented Mar 12, 2022

We already have the fate sharing test named test_raylet_and_agent_share_fate.

About the driver failure, the error message is:

::test_runtime_env_broken[dict] 2022-03-11 08:33:06,463	INFO worker.py:978 -- Connecting to existing Ray cluster at address: 172.18.0.3:59955
--
  | [2022-03-11 08:33:15,674 E 6912 13117] core_worker_process.cc:218: Failed to get the system config from raylet because it is dead. Worker will terminate. Status: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:

I saw this in the https://buildkite.com/ray-project/ray-builders-pr/builds/26582#4eecbf4d-a236-4ab9-b350-10dfd971fb04

Hmm, this one seems to be a pretty bad error message... Can you at least make the error message

core_worker_process.cc:218: Failed to get the system config from raylet because it is dead. Worker will terminate.

mention taking a look at the raylet.out log file? I think we should find a better way to propagate error messages, but it seems difficult to achieve in the short term.

Also, do you happen to know why this only happens in the minimal install, not a regular test?

Contributor

@rkooo567 rkooo567 left a comment

LGTM given the remaining comments are addressed.

Btw, do you think this should be a part of 1.12? My impression is it could be a breaking change, and having enough time (one release) doesn't seem like a bad choice. Do you guys have hard dependencies on this feature?

@SongGuyang
Contributor Author

Also, do you happen to know why this only happens in the minimal install, not a regular test?

I don't know the reason. Maybe the test finishes before the driver detects the raylet failure.

@SongGuyang
Contributor Author

Btw, do you think this should be a part of 1.12? My impression is it could be a breaking change, and having enough time (one release) doesn't seem like a bad choice. Do you guys have hard dependencies on this feature?

I think fate sharing, catching the port conflict, and the URI reference refactor will enhance stability rather than cause a regression. Do we also have 1 or 2 weeks to test before the 1.12 release? We have two choices:

Option 1: Merge the current PR and #22828 before the cut of 1.12, and do more testing before the 1.12 release.

Option 2: Merge the current PR and #22828 after the cut of 1.12. But we should merge them as soon as possible because they block a lot of following PRs.

I lean toward option 1, but it's also OK to choose option 2. @edoakes @architkulkarni What do you think?

@kfstorm
Member

kfstorm commented Mar 13, 2022

‼️ ACTION REQUIRED ‼️

We've updated our formatting configuration for C++ code. (see #22725)

This PR includes C++ code changes. To prevent issues with merging your code, here's what you'll need to do:

  1. Merge the latest changes from the upstream/master branch into your branch.

     git pull upstream master
     git merge upstream/master

  2. Resolve merge conflicts (if necessary).

After running these steps, you'll have the updated C++ formatting configuration.

  3. Format changed files.

     scripts/format.sh

  4. Commit your changes.

     git add --all
     git commit -m "Format C++ code"

Contributor

@architkulkarni architkulkarni left a comment

Looks good to me, just added some nits about the wording in some of the error messages (I know some of them predated this PR; it would be nice to address them, but no need to block the PR on this).

One more comment though: can we add Sang's suggestion to the error messages?

To solve the problem, start ray with a hard-coded agent port. ray start --dashboard-agent-grpc-port [port] and make sure the port is not used by other processes.

I think it would be really helpful for users who see the error message.

Comment on lines 43 to 45
RAY_LOG(WARNING) << "The grpc port of agent is invalid (be 0), ip: "
<< agent_ip_address_ << ", pid: " << agent_pid_
<< ". Disable the agent client in raylet.";
Contributor

Suggested change
RAY_LOG(WARNING) << "The grpc port of agent is invalid (be 0), ip: "
<< agent_ip_address_ << ", pid: " << agent_pid_
<< ". Disable the agent client in raylet.";
RAY_LOG(WARNING) << "The gRPC port of the Ray agent is invalid (0), ip: "
<< agent_ip_address_ << ", pid: " << agent_pid_
<< ". The agent client in the raylet has been disabled.";

Comment on lines 107 to 110
RAY_LOG(ERROR) << "Raylet exits immediately because the ray agent has failed. "
"Raylet fate shares with the agent. It can happen because the "
"Ray agent is unexpectedly killed or failed. See "
"`dashboard_agent.log` for the root cause.";
Contributor

Suggested change
RAY_LOG(ERROR) << "Raylet exits immediately because the ray agent has failed. "
"Raylet fate shares with the agent. It can happen because the "
"Ray agent is unexpectedly killed or failed. See "
"`dashboard_agent.log` for the root cause.";
RAY_LOG(ERROR) << "The raylet exited immediately because the Ray agent failed. "
"The raylet fate shares with the agent. This can happen because the "
"Ray agent was unexpectedly killed or failed. See "
"`dashboard_agent.log` for the root cause.";

Comment on lines 148 to 151
<< "Runtime environment " << serialized_runtime_env
<< " cannot be created on this node because the agent client is disabled. You "
"see this error message maybe because the grpc port of agent came into "
"conflict. Please see `dashboard_agent.log` to get more details.";
Contributor

Suggested change
<< "Runtime environment " << serialized_runtime_env
<< " cannot be created on this node because the agent client is disabled. You "
"see this error message maybe because the grpc port of agent came into "
"conflict. Please see `dashboard_agent.log` to get more details.";
<< "The runtime environment " << serialized_runtime_env
<< " cannot be created on this node because the agent client is disabled. You "
"might see this error message because the gRPC port of the agent came into "
"conflict. Please see `dashboard_agent.log` to get more details.";

Comment on lines 219 to 221
<< "Failed to delete runtime env URIs because the agent client is disabled. You "
"see this error message maybe because the grpc port of agent came into "
"conflict. Please see `dashboard_agent.log` to get more details.";
Contributor

Suggested change
<< "Failed to delete runtime env URIs because the agent client is disabled. You "
"see this error message maybe because the grpc port of agent came into "
"conflict. Please see `dashboard_agent.log` to get more details.";
<< "Failed to delete runtime env URIs because the agent client is disabled. You "
"might see this error message because the gRPC port of the agent came into "
"conflict. Please see `dashboard_agent.log` to get more details.";

@architkulkarni
Contributor

Btw, do you think this should be a part of 1.12? My impression is it could be a breaking change, and having enough time (one release) doesn't seem like a bad choice. Do you guys have hard dependencies on this feature?

I think fate sharing, catching the port conflict, and the URI reference refactor will enhance stability rather than cause a regression. Do we also have 1 or 2 weeks to test before the 1.12 release? We have two choices:

Option 1: Merge the current PR and #22828 before the cut of 1.12, and do more testing before the 1.12 release.

Option 2: Merge the current PR and #22828 after the cut of 1.12. But we should merge them as soon as possible because they block a lot of following PRs.

I lean toward option 1, but it's also OK to choose option 2. @edoakes @architkulkarni What do you think?

I agree that the stability will be improved by these two PRs, so I think option 1 would be better.

[this,
job_id,
serialized_runtime_env,
[serialized_runtime_env,
Contributor

Suggested change
[serialized_runtime_env,
[serialized_runtime_env=std::move(serialized_runtime_env),

Is it possible to do this?

Contributor Author

We can't do this because serialized_runtime_env is a reference from task_spec.

@@ -261,7 +234,7 @@ void AgentManager::DeleteURIs(const std::vector<std::string> &uris,
request.add_uris(uri);
}
runtime_env_agent_client_->DeleteURIs(
request, [this, uris, callback](Status status, const rpc::DeleteURIsReply &reply) {
request, [callback](Status status, const rpc::DeleteURIsReply &reply) {
Contributor

Suggested change
request, [callback](Status status, const rpc::DeleteURIsReply &reply) {
request, [callback=std::move(callback)](Status status, const rpc::DeleteURIsReply &reply) {

?

Contributor Author

It's ok

@SongGuyang
Contributor Author

Looks good to me, just added some nits about the wording in some of the error messages (I know some of them predated this PR; it would be nice to address them, but no need to block the PR on this).

One more comment though: can we add Sang's suggestion to the error messages?

To solve the problem, start ray with a hard-coded agent port. ray start --dashboard-agent-grpc-port [port] and make sure the port is not used by other processes.

I think it would be really helpful for users who see the error message.

Good comments! Fixed!

@raulchen raulchen merged commit f659717 into ray-project:master Mar 15, 2022
@raulchen raulchen deleted the dev_catch_agent_port_conflict branch March 15, 2022 08:09