[ML] Improve uniqueness of result document IDs #50644

droberts195 · 2020-01-06T09:56:03Z

Switch from a 32 bit Java hash to a 128 bit Murmur hash for
creating document IDs from by/over/partition field values.
The 32 bit Java hash was not sufficiently unique, and could
produce identical numbers for relatively common combinations
of by/partition field values such as L018/128 and L017/228.

Fixes #50613
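
To illustrate the collision described above, here is a minimal sketch. It assumes the old ID component was a single 32-bit hash over the by, over and partition field values, computed along the lines of `Objects.hash(by, over, partition)` with a null over field. The class and method names are hypothetical and are not taken from the Elasticsearch source.

```java
import java.util.Objects;

class LegacyHashCollision {

    // Hypothetical stand-in for the old scheme: one 32-bit hash over the
    // by, over and partition field values (null when a field is not configured).
    static int legacyHash(String by, String over, String partition) {
        return Objects.hash(by, over, partition);
    }

    public static void main(String[] args) {
        int first = legacyHash("L018", null, "128");
        int second = legacyHash("L017", null, "228");
        // Raising the last character of the by value by one and raising the
        // first character of a three-character partition value by one both
        // shift the combined hash by 31 * 31 = 961, so the results are equal.
        System.out.println(first + " vs " + second + " -> collide: " + (first == second));
    }
}
```

Under that assumption the two pairs map to the same 32-bit value, so any ID that relies on this hash alone to tell them apart would clash.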


elasticmachine · 2020-01-06T09:56:05Z

Pinging @elastic/ml-core (:ml)

benwtrent · 2020-01-06T12:08:00Z

In the case of a job re-running from a snapshot, we delete results directly, correct? And we do not rely on the IDs being the same between the two runs?

droberts195 · 2020-01-06T17:57:49Z

In the case of a job re-running from a snapshot, we delete results directly, correct?

We do if it's an explicit revert with delete_intervening_results set to true. If not then we try to start off from after the last observed input or result, which will generally mean we don't create duplicate results - see

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/datafeed/DatafeedJob.java

Line 95 in 197d5e7

long lastEndTime = Math.max(latestFinalBucketEndTimeMs, latestRecordTimeMs);

This latter scenario will always be the case during a failover from one node to another.

It is actually possible for a model plot document to exist that's more recent than the restart time. This is because model plot documents are written before bucket documents - see https://github.com/elastic/ml-cpp/blob/a9c468cf8b991b8d30f1a9ba2846ff90edaa8bcc/lib/api/CAnomalyJob.cc#L626-L629

So in the worst case, some model plot documents would be indexed because a bulk request filled up, and the node would then be killed before the corresponding bucket or data counts documents could be indexed. The job would restart on a different node and redo the same bucket, and we'd get duplicate model plot documents for that one bucket. This would only be a problem in the case where the old node was running a version before this fix and the new node was running a version after this fix, because only then would the two runs generate different IDs for the same document. I think it's probably worth tolerating this unlikely, single-bucket problem to fix the problem of entire partitions never having any model plot.
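
The mixed-version scenario above can be sketched as follows. This is only an illustration: the in-memory map stands in for an index keyed by document ID, and both ID formats are hypothetical placeholders rather than the real ones.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ReplayDuplicateSketch {

    // Hypothetical ID formats; the real IDs contain more components.
    static String oldSchemeId(String partition) {
        return "job_model_plot_1578300000000_" + partition.hashCode();
    }

    // Placeholder for "some different hash"; this is not Murmur.
    static String newSchemeId(String partition) {
        return "job_model_plot_1578300000000_" + Long.toHexString(partition.hashCode() * 31L + 7L);
    }

    // Index the same bucket's model plot document twice: once before the
    // node dies and once when the restarted job redoes the bucket.
    static int docsAfterReplay(Function<String, String> idBeforeRestart,
                               Function<String, String> idAfterRestart,
                               String partition) {
        Map<String, String> index = new HashMap<>(); // doc ID -> source
        index.put(idBeforeRestart.apply(partition), "model plot (first attempt)");
        index.put(idAfterRestart.apply(partition), "model plot (replayed)");
        return index.size();
    }

    public static void main(String[] args) {
        // Same scheme on both nodes: the replay overwrites the earlier document.
        System.out.println(docsAfterReplay(
            ReplayDuplicateSketch::oldSchemeId, ReplayDuplicateSketch::oldSchemeId, "L018"));
        // Old node before the fix, new node after it: two documents remain.
        System.out.println(docsAfterReplay(
            ReplayDuplicateSketch::oldSchemeId, ReplayDuplicateSketch::newSchemeId, "L018"));
    }
}
```

When both runs use the same ID scheme the second write replaces the first, which is why the duplicate can only appear across the version boundary.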

We do explicit deletes in the case of interim results, so those won't be a problem - see

elasticsearch/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/process/autodetect/output/AutodetectResultProcessor.java

Lines 204 to 208 in 5c3dd57

    
    // Delete any existing interim results generated by a Flush command
    // which have not been replaced or superseded by new results.
    LOGGER.trace("[{}] Deleting interim results", jobId);
    persister.deleteInterimResults(jobId);
    deleteInterimRequired = false;

benwtrent

I think it would be nice to have a worst-case ID length test where every possible value is at its configurable max.

Looking at the maximum value possible for the Murmur hash bytes: "170141183460469231722463931679029329919", which is two Long.MAX_VALUE values concatenated. Its length in UTF_8 bytes is 39.

I think we are way under the limit, but it would be nice for such a test to cover that extreme (and unlikely) case.
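
The arithmetic above can be checked with a short sketch. It assumes the two 64-bit halves of the hash are joined as high and low words of one number and printed in decimal. The class and method names are hypothetical.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

class MurmurIdLength {

    // Join two non-negative 64-bit halves into one 128-bit value:
    // h1 becomes the high word and h2 the low word.
    static BigInteger combine(long h1, long h2) {
        return BigInteger.valueOf(h1).shiftLeft(64).or(BigInteger.valueOf(h2));
    }

    // Length of the decimal representation once encoded as UTF-8.
    static int utf8Length(BigInteger value) {
        return value.toString().getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        BigInteger max = combine(Long.MAX_VALUE, Long.MAX_VALUE);
        System.out.println(max);             // 170141183460469231722463931679029329919
        System.out.println(utf8Length(max)); // 39
    }
}
```

Decimal digits are single-byte in UTF-8, so the byte length equals the digit count. Even a full unsigned 128-bit value (2^128 - 1) has 39 decimal digits, so 39 bytes bounds this component either way.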

droberts195 · 2020-01-07T09:37:20Z

I think we are way under the limit, but it would be nice for such a test to cover that extreme

I added a test and we're under 200 bytes in total in the worst case.


droberts195 added 2 commits January 3, 2020 14:50

Test for result doc ID uniqueness

1ffb83d

droberts195 added >bug :ml Machine learning v8.0.0 v7.6.0 labels Jan 6, 2020

benwtrent reviewed Jan 6, 2020


benwtrent approved these changes Jan 6, 2020


droberts195 added 3 commits January 7, 2020 09:15

Fix more tests

5a3b1d8

Merge branch 'master' into test_id_uniqueness

e54559b

Add a test of maximum length

e4366ae

droberts195 merged commit 1adf4c2 into elastic:master Jan 7, 2020

droberts195 deleted the test_id_uniqueness branch January 7, 2020 10:23

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
