Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core]Save task spec in separate table #22650

Merged
merged 28 commits into from
Apr 12, 2022
Merged

Conversation

WangTaoTheTonic
Copy link
Contributor

@WangTaoTheTonic WangTaoTheTonic commented Feb 25, 2022

Why are these changes needed?

This is a rebase version of #11592. As task spec info is only needed when gcs create or start an actor, so we can remove it from actor table and save the serialization time and memory/network cost when gcs clients get actor infos from gcs.

As internal repository varies very much from the community. This pr just add some manual check with simple cherry pick. Welcome to comment first and at the meantime I'll see if there's any test case failed or some points were missed.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@rkooo567
Copy link
Contributor

rkooo567 commented Mar 1, 2022

is it still WIP?

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 1, 2022
@ericl
Copy link
Contributor

ericl commented Mar 1, 2022

Hmm don't we need it for restarting failed actors?

@WangTaoTheTonic
Copy link
Contributor Author

is it still WIP?

Some dashboard test cases failed, I'll try to fix them today.

@WangTaoTheTonic
Copy link
Contributor Author

WangTaoTheTonic commented Mar 1, 2022

Hmm don't we need it for restarting failed actors?

We store task spec in separate table and get them when needed.

@WangTaoTheTonic WangTaoTheTonic changed the title [WIP][Core]Do not save task spec in Gcs actor table [Core]Save task spec in separate table Mar 1, 2022
@WangTaoTheTonic WangTaoTheTonic removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Mar 1, 2022
@WangTaoTheTonic
Copy link
Contributor Author

The failed windows tests and rllib tests are unrelated.

@WangTaoTheTonic
Copy link
Contributor Author

@rkooo567 Could you help to review bro?

@WangTaoTheTonic
Copy link
Contributor Author

@rkooo567 Is it ok to go?

@@ -790,6 +793,7 @@ void GcsActorManager::DestroyActor(const ActorID &actor_id,
[this, actor_id, actor_table_data](Status status) {
RAY_CHECK_OK(gcs_publisher_->PublishActor(
actor_id, *GenActorDataOnlyWithStates(*actor_table_data), nullptr));
RAY_CHECK_OK(gcs_table_storage_->ActorTaskSpecTable().Delete(actor_id, nullptr));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is an edge case where the actor is marked dead, but this delete doesn't finish. This will result in data leak.
It's not a big deal, but we can check this case when loading data after GCS restarts.

@rkooo567
Copy link
Contributor

Sorry. I’ve been off (until this week). Please feel free to merge the pr! @raulchen

@rkooo567
Copy link
Contributor

rkooo567 commented Apr 4, 2022

I am back. @WangTaoTheTonic I saw you added some new commits after it is approved. Is it ready to be merged? Also, can you add unit tests if those new code addition was due to some edge cases?

@WangTaoTheTonic
Copy link
Contributor Author

I am back. @WangTaoTheTonic I saw you added some new commits after it is approved. Is it ready to be merged? Also, can you add unit tests if those new code addition was due to some edge cases?

Yeah, it's ready to be merged.
I added some codes just to prevent the situation that the gcs crashes (no matter what reason) before the task spec is removing but the actor table item has been removed, per @raulchen comments. It's a small sanity check and won't affect the whole pr too much.
The unit test is hard to simulate that situation as we must kill gcs right in the duration between actor table item deleting and task spec item deleting.

@WangTaoTheTonic
Copy link
Contributor Author

And the failed test cases are not related except the LinkCheck, @rkooo567 could you help to confirm?

@WangTaoTheTonic
Copy link
Contributor Author

image
The failed reason of the two cases are same. It looks like the dashboard frontend issue and not related.

@WangTaoTheTonic
Copy link
Contributor Author

@edoakes Hi Edward, this patch changes some proto structures in actor data table, so it impacts dashboard too. The change is only about messages but not api change.
Could you help to take a review? As we need approvals of owners from modules involved.

"timestamp",
"numExecutedTasks",
}
light_message = {k: v for (k, v) in orig_message.items() if k in fields}
if "taskSpec" in light_message:
actor_class = actor_classname_from_task_spec(light_message["taskSpec"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please delete actor_classname_from_task_spec if it's no longer used anywhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just don't use it here. It's still used by other codes.

Comment on lines -67 to -70
if "functionDescriptor" in light_message["taskSpec"]:
light_message["taskSpec"] = {
"functionDescriptor": light_message["taskSpec"]["functionDescriptor"]
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like we're changing the returned schema -- we used to return light_message["taskSpec"]["functionDescriptor"] but now it's just light_message["functionDescriptor"]. We likely need to change the dashboard to account for this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The dashboard uses function descriptor just for showing, we extract it from the returned schema, no matter in what position it is.

@rkooo567
Copy link
Contributor

Btw, I need the approval from the protobuf code owner cc @edoakes @wuisawesome

Copy link
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code owner parts lgtm, didn't review the rest

@rkooo567
Copy link
Contributor

Failed test must be unrelated

@rkooo567 rkooo567 merged commit 6aefe9b into ray-project:master Apr 12, 2022
@WangTaoTheTonic WangTaoTheTonic deleted the task_spec branch April 13, 2022 06:46
@@ -372,6 +375,8 @@ void GcsActorManager::HandleGetNamedActorInfo(
} else {
reply->unsafe_arena_set_allocated_actor_table_data(
iter->second->GetMutableActorTableData());
RAY_LOG(INFO) << "WANGTAO " << iter->second->GetState();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove this log line or make this more informational

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@simon-mo Sorry for missing that, will remove it in #24204

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants