[BUG] Data loss during primary relocation for segrep backed indexes #6315

ashking94 · 2023-02-14T12:21:02Z

Describe the bug
During primary-to-primary relocation, data is lost when indexing is happening at high TPS. The loss specifically starts after initiateTracking happens for the new primary shard: a subset of docs is missing after relocation completes. After the relocation handoff completes, indexing that lands on the new primary shard uses the correct seq no, but the overall doc count is still incorrect.

To Reproduce
Step 1 - Create a SegRep index with index.translog.durability set to async or request. The issue reproduces more easily with the async option.

curl -X PUT "localhost:9200/test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "replication.type" : "SEGMENT",
    "index.translog.durability" : "async",
    "refresh_interval": "1000s"
  }
}
'

Step 2 - Index docs

for i in {1..1000}
do
   curl --location --request POST "localhost:9202/test-index/_doc" \
    --header 'Content-Type: application/json' \
    --data-raw "{
      \"name\":\"abc${i}\"
    }"
    echo "$i\n"
done

Step 3 - Start relocation of index

curl -XPUT localhost:9201/test-index/_settings -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.include._name": "opensearch-node1"
}'
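
A possible way to confirm the loss once relocation finishes is to compare the indexed doc count against the number of docs sent in Step 2. The sketch below is not part of the original report: the host/port (`localhost:9200`), the helper names (`extract_count`, `compare_counts`, `verify_after_relocation`), and the 60s timeout are assumptions for illustration. It relies on the cluster health `wait_for_no_relocating_shards` parameter and the `_refresh` and `_count` APIs, and refreshes explicitly because the index was created with a 1000s refresh interval.

```shell
#!/usr/bin/env bash
# Sketch only: host/port and index name are assumptions, override via env.
HOST="${HOST:-localhost:9200}"
INDEX="${INDEX:-test-index}"

# Pull the numeric "count" field out of a _count response read from stdin.
extract_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Print the result and return non-zero when the counts differ.
compare_counts() {
  local expected="$1" actual="$2"
  if [ "$actual" = "$expected" ]; then
    echo "OK: ${actual} docs"
  else
    echo "MISMATCH: expected ${expected}, found ${actual:-none}"
    return 1
  fi
}

# Wait for relocation to finish, refresh, then compare the doc count.
verify_after_relocation() {
  local expected="$1"
  # Block until no shard is relocating (or the timeout expires).
  curl -s "${HOST}/_cluster/health?wait_for_no_relocating_shards=true&timeout=60s" > /dev/null
  # refresh_interval is 1000s, so refresh explicitly to make docs visible.
  curl -s -X POST "${HOST}/${INDEX}/_refresh" > /dev/null
  compare_counts "$expected" "$(curl -s "${HOST}/${INDEX}/_count" | extract_count)"
}
```

After sourcing the snippet, `verify_after_relocation 1000` would report a mismatch if any of the 1000 docs from Step 2 went missing, assuming the indexing loop had fully finished before the check.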

ashking94 · 2023-02-14T12:25:03Z

I had found an issue for remote-backed indexes relating to relocation: #6214. The same issue exists for segrep indexes as well (I validated this). The fix for remote-backed indexes is in #6314. I have validated that fix for segrep as well and it seems to work. Please feel free to start from the same fix, and we can reason out any alternate approaches as well. cc @dreamer-89 @mch2

mch2 · 2023-02-14T23:14:52Z

Thanks for raising this @ashking94. We've been discussing this on #6065, as it is a cause of some of the flakiness in our relocation ITs. An addition I mentioned on #6065 that I like with SR is to execute a refresh before we do the round of SR for the new primary during relocation. I like this change not only for relocation but also for failover scenarios, to guarantee we are not leaving ops in the xlog. Will include a test for that while concurrently indexing.

dreamer-89 · 2023-04-05T16:42:22Z

Closing this issue as fixed with #6065

@ashking94: Please let us know if you still see this bug.

ashking94 added the bug label Feb 14, 2023

mch2 mentioned this issue Feb 14, 2023

[BUG] failing IT test : SegmentReplicationRelocationIT #6065

Closed

mch2 added the distributed framework label Feb 14, 2023

dreamer-89 closed this as completed Apr 5, 2023
