Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Improve uniqueness of result document IDs #50644

Merged
merged 5 commits into from
Jan 7, 2020

Conversation

droberts195
Copy link
Contributor

Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613

Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes elastic#50613
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@benwtrent
Copy link
Member

In the case of a job re-running from a snapshot, we delete results directly correct? And do not rely on the IDs to be the same between the two runs?

@droberts195
Copy link
Contributor Author

In the case of a job re-running from a snapshot, we delete results directly correct?

We do if it's an explicit revert with delete_intervening_results set to true. If not then we try to start off from after the last observed input or result, which will generally mean we don't create duplicate results - see

long lastEndTime = Math.max(latestFinalBucketEndTimeMs, latestRecordTimeMs);

This latter scenario will always be the case during a failover from one node to another.

It is actually possible for a model plot document to exist that's more recent than the restart time. This is because model plot documents are written before bucket documents - see https://github.com/elastic/ml-cpp/blob/a9c468cf8b991b8d30f1a9ba2846ff90edaa8bcc/lib/api/CAnomalyJob.cc#L626-L629

So in the worst case we'd persist some model plot documents that would get indexed due to a bulk request filling up, then the node would be killed before the corresponding bucket or data counts documents could be indexed, then the job would restart on a different node, redo the same bucket and we'd get duplicate model plot documents for one bucket. This would only be a problem in the case where the old node was running a version before this fix and the new node was running a version after this fix. I think it's probably worth tolerating this unlikely/single bucket problem to fix the problem of entire partitions never having any model plot.

We do explicit deletes in the case of interim results, so those won't be a problem - see

// Delete any existing interim results generated by a Flush command
// which have not been replaced or superseded by new results.
LOGGER.trace("[{}] Deleting interim results", jobId);
persister.deleteInterimResults(jobId);
deleteInterimRequired = false;

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be nice to have a ID length worst case test where every possible value is at its configurable max.

Looking at the maximum value possible for murmur hash bytes: "170141183460469231722463931679029329919" which is two Long.MAX_VALUE bytes. That as a length of UTF_8 bytes is: 39.

I think we are way under the limit, but it would be nice for such a test to cover that extreme (and unlikely) case.

@droberts195
Copy link
Contributor Author

I think we are way under the limit, but it would be nice for such a test to cover that extreme

I added a test and we're under 200 bytes in total in the worst case.

@droberts195 droberts195 merged commit 1adf4c2 into elastic:master Jan 7, 2020
@droberts195 droberts195 deleted the test_id_uniqueness branch January 7, 2020 10:23
droberts195 added a commit that referenced this pull request Jan 7, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes elastic#50613
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ML] Result document IDs are not sufficiently unique
4 participants