Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Allow asynchronous job deletion #34058

Conversation

dimitris-athanasiou
Copy link
Contributor

This changes the delete job API by adding
the choice to delete a job asynchronously.
The commit adds a wait_for_completion parameter
to the delete job request. When set to false,
the action returns immediately and the response
contains the task id.

This also changes the handling of subsequent
delete requests for a job that is already being
deleted. It now uses the task framework to check
if the job is being deleted instead of the cluster
state. This is a beneficial for it is going to also
be working once the job configs are moved out of the
cluster state and into an index. Also, force delete
requests that are waiting for the job to be deleted
will not proceed with the deletion if the first task
fails. This will prevent overloading the cluster. Instead,
the failure is communicated better via notifications
so that the user may retry.

Finally, this makes the deleting property of the job
visible (also it was renamed from deleted). This allows
a client to render a deleting job differently.

Closes #32836

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@dimitris-athanasiou
Copy link
Contributor Author

I labelled this with "WIP" only because I have a failing test that I need to resolve. The code is pretty stable so can be reviewed. Also, could @imotov please have a look at the usage of the tasks framework?

@dimitris-athanasiou
Copy link
Contributor Author

I have now resolved the test failure and I've tested in cluster with multiple nodes. Ready for review!

@@ -287,7 +287,7 @@ public Builder deleteJob(String jobId, PersistentTasksCustomMetaData tasks) {
if (job == null) {
throw new ResourceNotFoundException("job [" + jobId + "] does not exist");
}
if (job.isDeleted() == false) {
if (job.isDeleting() == false) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, what happens when deleteJob gets called twice in a row while a job is in the deleting status? On the second call, would the error not be ResourceNotFoundException("job [" + jobId + "] does not exist"); Or is the private final SortedMap<String, Job> jobs; repopulated somewhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now protect against subsequent deletes using the task framework. You can see the implementation in TransportDeleteJobAction.masterOperation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@imotov imotov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some task manager-related comments.

@@ -76,7 +76,7 @@ public TaskResult(TaskInfo task, Exception error) throws IOException {
* Construct a {@linkplain TaskResult} for a task that completed successfully.
*/
public TaskResult(TaskInfo task, ToXContent response) throws IOException {
this(true, task, null, toXContent(response));
this(true, task, null, XContentHelper.toXContent(response, Requests.INDEX_CONTENT_TYPE, true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we store it in the human readable format at the moment. Do we need to make it human readable?

Copy link
Contributor Author

@dimitris-athanasiou dimitris-athanasiou Sep 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I went with human readable as I think these responses will be consumed by users, so it might be best to store the human readable form (if any). But I don't feel strongly about it. Happy to do as you suggest.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with human readable, my only concern is that it is somewhat breaking change since it changes the format (by adding new fields). @nik9000 what do you think about this change?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this only changes the stuff in the task part. Which should be fine because we don't map that anyway. And we kind of expect users to be ok with new fields in the http responses and this is fairly similar conceptually. So I'm +1 on it.

throw new UnsupportedOperationException("the Task parameter is required");
private Optional<Task> findExistingStartedTask(Task currentTask) {
return taskManager.getTasks().values().stream().filter(filteredTask ->
currentTask.getDescription().equals(filteredTask.getDescription()) &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comparison seems a bit fragile. If we want to do it with task manager (see my other comment above). I would suggest adding a filter for the action name and then if action matches JobDeletion, we can cast it to JobDeletionTask and get job id from the task instead of description.

ParentTaskAssigningClient parentTaskClient = new ParentTaskAssigningClient(client, taskId);

// Check if there is a deletion task for this job already and if yes wait for it to complete
Optional<Task> existingStartedTask;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with this change we no longer have a race condition, when two tasks could end up waiting for each other as we had before, but it is still very complicated (especially the waitForExistingTaskToComplete part). What would you think about having a static map of arrays of listeners here. On start, in a critical section, you can check if there is already an element with the same job ID present in the map if not, you add your job ID with an empty array and start delete process, if job ID exists - just add yourself to the array of listeners. On completion, again in a critical section, you remove the array with your job ID from the map and execute all listeners that were registered there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to try that! It sounds less complex indeed. Why do you suggest the listener-map should be static? Given the action is a singleton object, it could just be a final member, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you suggest the listener-map should be static?

Well, not enough coffee, I guess. Yes, it should be just a member of action since action is a singleton.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed a commit to refactor using listeners. It did simplify things a lot. Could you have another look please?

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I just saw one thing that needs to be confirmed to ensure the rename won't cause problems.

@@ -559,6 +573,11 @@ public Builder setResultsIndexName(String resultsIndexName) {
return this;
}

public Builder setDeleting(Boolean deleting) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should not be public, because is there ever a case when the end user would sensibly change the value of this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'll reduce its visibility.

@@ -77,7 +77,7 @@
public static final ParseField MODEL_SNAPSHOT_ID = new ParseField("model_snapshot_id");
public static final ParseField MODEL_SNAPSHOT_MIN_VERSION = new ParseField("model_snapshot_min_version");
public static final ParseField RESULTS_INDEX_NAME = new ParseField("results_index_name");
public static final ParseField DELETED = new ParseField("deleted");
public static final ParseField DELETING = new ParseField("deleting");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please can you double check that we have lenient parsing of job configs from cluster state in 6.0.

The reason is that a 6.0 node may end up parsing a 6.5 job that is being deleted if a full cluster restart of a mixed version cluster happens for some reason. Assuming we have lenient parsing for stored job configs in 6.0 this should work as before - the fact that the job was being deleted when the cluster was shut down will be forgotten, but it can be deleted again - same as it always was in 6.0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and there is lenient parsing in 6.0 so we're good on this front.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

This changes the delete job API by adding
the choice to delete a job asynchronously.
The commit adds a `wait_for_completion` parameter
to the delete job request. When set to `false`,
the action returns immediately and the response
contains the task id.

This also changes the handling of subsequent
delete requests for a job that is already being
deleted. It now uses the task framework to check
if the job is being deleted instead of the cluster
state. This is a beneficial for it is going to also
be working once the job configs are moved out of the
cluster state and into an index. Also, force delete
requests that are waiting for the job to be deleted
will not proceed with the deletion if the first task
fails. This will prevent overloading the cluster. Instead,
the failure is communicated better via notifications
so that the user may retry.

Finally, this makes the `deleting` property of the job
visible (also it was renamed from `deleted`). This allows
a client to render a deleting job differently.

Closes elastic#32836
@dimitris-athanasiou
Copy link
Contributor Author

@imotov Has given me the green light on this as well!

@dimitris-athanasiou dimitris-athanasiou merged commit 4dacfa9 into elastic:master Oct 4, 2018
@dimitris-athanasiou dimitris-athanasiou deleted the add-async-job-deletion-option branch October 4, 2018 23:41
dimitris-athanasiou added a commit that referenced this pull request Oct 5, 2018
This changes the delete job API by adding
the choice to delete a job asynchronously.
The commit adds a `wait_for_completion` parameter
to the delete job request. When set to `false`,
the action returns immediately and the response
contains the task id.

This also changes the handling of subsequent
delete requests for a job that is already being
deleted. It now uses the task framework to check
if the job is being deleted instead of the cluster
state. This is a beneficial for it is going to also
be working once the job configs are moved out of the
cluster state and into an index. Also, force delete
requests that are waiting for the job to be deleted
will not proceed with the deletion if the first task
fails. This will prevent overloading the cluster. Instead,
the failure is communicated better via notifications
so that the user may retry.

Finally, this makes the `deleting` property of the job
visible (also it was renamed from `deleted`). This allows
a client to render a deleting job differently.

Closes #32836
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Oct 5, 2018
* master:
  Rename CCR stats implementation (elastic#34300)
  Add max_children limit to nested sort (elastic#33587)
  MINOR: Remove Dead Code from Netty4Transport (elastic#34134)
  Rename clsuterformation -> testclusters (elastic#34299)
  [Build] make sure there are no duplicate classes in third party audit (elastic#34213)
  BWC Build: Read CI properties to determine java version (elastic#34295)
  [DOCS] Fix typo and add [float]
  Allow User/Password realms to disable authc (elastic#34033)
  Enable security automaton caching (elastic#34028)
  Preserve thread context during authentication. (elastic#34290)
  [ML] Allow asynchronous job deletion (elastic#34058)
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request Oct 5, 2018
* master: (63 commits)
  [Build] randomizedtesting: Allow property values to be closures (elastic#34319)
  Feature/hlrc ml docs cleanup (elastic#34316)
  Docs: DRY up CRUD docs (elastic#34203)
  Minor corrections in geo-queries.asciidoc (elastic#34314)
  [DOCS] Remove beta label from normalizers (elastic#34326)
  Adjust size of BigArrays in circuit breaker test
  Adapt bwc version after backport
  Follow stats structure (elastic#34301)
  Rename CCR stats implementation (elastic#34300)
  Add max_children limit to nested sort (elastic#33587)
  MINOR: Remove Dead Code from Netty4Transport (elastic#34134)
  Rename clsuterformation -> testclusters (elastic#34299)
  [Build] make sure there are no duplicate classes in third party audit (elastic#34213)
  BWC Build: Read CI properties to determine java version (elastic#34295)
  [DOCS] Fix typo and add [float]
  Allow User/Password realms to disable authc (elastic#34033)
  Enable security automaton caching (elastic#34028)
  Preserve thread context during authentication. (elastic#34290)
  [ML] Allow asynchronous job deletion (elastic#34058)
  HLRC: ML Adding get datafeed stats API (elastic#34271)
  ...
kcm pushed a commit that referenced this pull request Oct 30, 2018
This changes the delete job API by adding
the choice to delete a job asynchronously.
The commit adds a `wait_for_completion` parameter
to the delete job request. When set to `false`,
the action returns immediately and the response
contains the task id.

This also changes the handling of subsequent
delete requests for a job that is already being
deleted. It now uses the task framework to check
if the job is being deleted instead of the cluster
state. This is a beneficial for it is going to also
be working once the job configs are moved out of the
cluster state and into an index. Also, force delete
requests that are waiting for the job to be deleted
will not proceed with the deletion if the first task
fails. This will prevent overloading the cluster. Instead,
the failure is communicated better via notifications
so that the user may retry.

Finally, this makes the `deleting` property of the job
visible (also it was renamed from `deleted`). This allows
a client to render a deleting job differently.

Closes #32836
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants