[Core]Save task spec in separate table #22650
Is it still WIP?
Hmm, don't we need it for restarting failed actors?
Some dashboard test cases failed; I'll try to fix them today.
We store the task spec in a separate table and fetch it when needed.
The failed Windows tests and RLlib tests are unrelated.
@rkooo567 Could you help review this?
@rkooo567 Is it OK to go?
@@ -790,6 +793,7 @@ void GcsActorManager::DestroyActor(const ActorID &actor_id,
[this, actor_id, actor_table_data](Status status) {
RAY_CHECK_OK(gcs_publisher_->PublishActor(
actor_id, *GenActorDataOnlyWithStates(*actor_table_data), nullptr));
RAY_CHECK_OK(gcs_table_storage_->ActorTaskSpecTable().Delete(actor_id, nullptr));
There is an edge case where the actor is marked dead but this delete doesn't finish. That leaves an orphaned entry in the task spec table, i.e. a data leak.
It's not a big deal, but we can check for this case when loading data after GCS restarts.
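The check suggested above could look roughly like the following. This is a minimal sketch, not code from the PR: `drop_orphaned_task_specs`, the plain dicts standing in for the two GCS tables, and the `"DEAD"` state string are all hypothetical, and it assumes a dead actor is never restarted, so its task spec is safe to drop.

```python
from typing import Dict, List


def drop_orphaned_task_specs(
    actor_states: Dict[str, str], task_specs: Dict[str, bytes]
) -> List[str]:
    """Remove task specs whose actor is dead or missing from the actor table.

    Both arguments are hypothetical in-memory stand-ins for the tables that
    would be loaded after a GCS restart. Returns the removed actor IDs.
    """
    # A spec is orphaned if its actor is unknown or already marked dead.
    orphaned = sorted(
        actor_id
        for actor_id in task_specs
        if actor_states.get(actor_id, "DEAD") == "DEAD"
    )
    for actor_id in orphaned:
        del task_specs[actor_id]
    return orphaned
```

Running this once during the load step would clean up any entry left behind by a delete that did not complete before the previous shutdown.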
Sorry, I've been off (until this week). Please feel free to merge the PR! @raulchen
I am back. @WangTaoTheTonic I saw you added some new commits after it was approved. Is it ready to be merged? Also, can you add unit tests if the new code was added to handle edge cases?
Yeah, it's ready to be merged.
And the failed test cases are not related except the
@edoakes Hi Edward, this patch changes some proto structures in the actor data table, so it affects the dashboard too. The change only touches message definitions; there is no API change.
"timestamp",
"numExecutedTasks",
}
light_message = {k: v for (k, v) in orig_message.items() if k in fields}
if "taskSpec" in light_message:
actor_class = actor_classname_from_task_spec(light_message["taskSpec"])
Please delete actor_classname_from_task_spec if it's no longer used anywhere.
We just don't use it here. It's still used by other code.
if "functionDescriptor" in light_message["taskSpec"]:
light_message["taskSpec"] = {
"functionDescriptor": light_message["taskSpec"]["functionDescriptor"]
}
This looks like we're changing the returned schema: we used to return light_message["taskSpec"]["functionDescriptor"] but now it's just light_message["functionDescriptor"]. We likely need to change the dashboard to account for this?
The dashboard uses the function descriptor only for display; we extract it from the returned schema regardless of where it sits.
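To make the point about position concrete, a lookup that tolerates both layouts might look like this. It is only an illustrative sketch: `find_function_descriptor` is a hypothetical helper, not a function in the dashboard, and the two layouts are the ones named in the review comment above.

```python
from typing import Any, Dict, Optional


def find_function_descriptor(message: Dict[str, Any]) -> Optional[Any]:
    """Return the function descriptor from either schema layout, or None.

    Hypothetical helper: checks the flattened layout first, then falls back
    to the older layout nested under "taskSpec".
    """
    # Newer, flattened layout: message["functionDescriptor"]
    if "functionDescriptor" in message:
        return message["functionDescriptor"]
    # Older layout: message["taskSpec"]["functionDescriptor"]
    task_spec = message.get("taskSpec")
    if isinstance(task_spec, dict):
        return task_spec.get("functionDescriptor")
    return None
```

A consumer written this way keeps working whichever of the two shapes the backend returns.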
Btw, I need approval from the protobuf code owners. cc @edoakes @wuisawesome
the code owner parts lgtm, didn't review the rest
The failed test must be unrelated.
@@ -372,6 +375,8 @@ void GcsActorManager::HandleGetNamedActorInfo(
} else {
reply->unsafe_arena_set_allocated_actor_table_data(
iter->second->GetMutableActorTableData());
RAY_LOG(INFO) << "WANGTAO " << iter->second->GetState();
Please remove this log line or make it more informative.
Why are these changes needed?
This is a rebased version of #11592. The task spec is only needed when GCS creates or starts an actor, so we can remove it from the actor table. That saves serialization time and memory/network cost whenever GCS clients fetch actor info from GCS.
Our internal repository differs considerably from the community version, so this PR is a simple cherry-pick with some manual checks. Comments are welcome; in the meantime I'll check whether any test cases fail or anything was missed.
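The idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GCS implementation: `InMemoryGcsStorage`, `ActorTableData` as a dataclass, and all method names are hypothetical, and plain dicts stand in for the actor table and the separate task spec table.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ActorTableData:
    # Lightweight record returned to clients; it carries no task spec.
    actor_id: str
    state: str
    class_name: str


class InMemoryGcsStorage:
    """Hypothetical stand-in for GCS table storage, keyed by actor ID."""

    def __init__(self) -> None:
        self.actor_table: Dict[str, ActorTableData] = {}
        self.actor_task_spec_table: Dict[str, bytes] = {}

    def register_actor(self, data: ActorTableData, task_spec: bytes) -> None:
        # The two pieces are written to separate tables.
        self.actor_table[data.actor_id] = data
        self.actor_task_spec_table[data.actor_id] = task_spec

    def get_actor_info(self, actor_id: str) -> Optional[ActorTableData]:
        # Client-facing read: never touches the (larger) task spec.
        return self.actor_table.get(actor_id)

    def get_task_spec(self, actor_id: str) -> Optional[bytes]:
        # Only needed when creating or starting the actor.
        return self.actor_task_spec_table.get(actor_id)

    def destroy_actor(self, actor_id: str) -> None:
        # Mark the actor dead and drop its task spec.
        if actor_id in self.actor_table:
            self.actor_table[actor_id].state = "DEAD"
        self.actor_task_spec_table.pop(actor_id, None)
```

The design choice is that frequent reads such as `get_actor_info` pay only for the small record, while the task spec is loaded on the rarer create/start path.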
Related issue number
Checks
scripts/format.sh to lint the changes in this PR.