[CI] [ML] RestoreModelSnapshotIT failures #36849
Comments
Pinging @elastic/ml-core
Muted on the 6.x branch in e31f80c. The problem is a version conflict exception when updating the job.
From the order of the log messages AutodetectCommunicator logs
I had a PR build fail with a similar version conflict exception: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-2/3034/console
The job this happened to was:
So it's not surprising that results were still being processed at the time the job completion code ran. It's further evidence of a race though, and now that these updates no longer go via the cluster state thread, we need to be more careful to ensure that updates to the job only come from one thread at any phase in the job lifecycle.
#36856) There was a race where the job update in `AutoDetectResultProcessor.updateEstablishedModelMemoryOnJob` could execute after `AutoDetectResultProcessor.awaitCompletion` returned. This was because `jobUpdateSemaphore` was acquired after the call to `jobResultsProvider.getEstablishedMemoryUsage`, and during that call `awaitCompletion` was free to acquire and release the semaphore and then return. This commit fixes the problem. Closes #36849
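To make the ordering problem in that commit message concrete, here is a minimal sketch. It is not the Elasticsearch code: `Processor`, `tryAwaitCompletion`, `updateAcquiringAfterLookup`, `updateAcquiringBeforeLookup` and `applyUpdate` are hypothetical names, the lookup is a synchronous placeholder, and the interleaving is simulated on one thread by running a hook in the middle of the lookup. The commit message does not say how the fix works, so holding the permit across the lookup is only one plausible fix, assumed here for illustration.

```java
import java.util.concurrent.Semaphore;

public class JobUpdateRaceSketch {

    /** Hypothetical stand-in for the result processor; not the real Elasticsearch class. */
    static class Processor {
        private final Semaphore jobUpdateSemaphore = new Semaphore(1);
        private boolean completed = false;
        boolean updateRanAfterCompletion = false;

        /** Models awaitCompletion arriving mid-update: it can only finish if the permit is free. */
        boolean tryAwaitCompletion() {
            if (!jobUpdateSemaphore.tryAcquire()) {
                return false; // an update holds the permit; a blocking version would wait here
            }
            jobUpdateSemaphore.release();
            completed = true;
            return true;
        }

        /** Placeholder for the slow lookup; the hook models what another thread does meanwhile. */
        private long getEstablishedMemoryUsage(Runnable concurrentActivity) {
            concurrentActivity.run();
            return 1024L; // arbitrary value, only here so there is something to "apply"
        }

        /** Records whether the job update was applied after completion was reported. */
        private void applyUpdate(long establishedMemory) {
            updateRanAfterCompletion = completed;
        }

        /** The ordering described as racy: lookup first, permit second. */
        void updateAcquiringAfterLookup(Runnable concurrentActivity) {
            long memory = getEstablishedMemoryUsage(concurrentActivity);
            jobUpdateSemaphore.acquireUninterruptibly();
            try {
                applyUpdate(memory);
            } finally {
                jobUpdateSemaphore.release();
            }
        }

        /** Assumed fix: hold the permit for the whole lookup-and-update sequence. */
        void updateAcquiringBeforeLookup(Runnable concurrentActivity) {
            jobUpdateSemaphore.acquireUninterruptibly();
            try {
                long memory = getEstablishedMemoryUsage(concurrentActivity);
                applyUpdate(memory);
            } finally {
                jobUpdateSemaphore.release();
            }
        }
    }

    public static void main(String[] args) {
        Processor racy = new Processor();
        racy.updateAcquiringAfterLookup(() -> racy.tryAwaitCompletion());
        System.out.println("racy ordering, update after completion: " + racy.updateRanAfterCompletion);

        Processor guarded = new Processor();
        guarded.updateAcquiringBeforeLookup(() -> guarded.tryAwaitCompletion());
        System.out.println("guarded ordering, update after completion: " + guarded.updateRanAfterCompletion);
    }
}
```

With the permit taken only after the lookup, the simulated completion slips through while the lookup is in progress and the update lands afterwards. With the permit taken first, the completion attempt cannot get the permit until the update has finished.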
Enable the previously muted test. This reverts commit e31f80c.
There were 2 recent failures of this test
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=virtual&&linux/128/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+periodic/382/console
Both failed with the same error