[BUG] Data loss during primary relocation for segrep backed indexes #6315

Closed
ashking94 opened this issue Feb 14, 2023 · 3 comments
Labels
bug Something isn't working distributed framework

Comments

@ashking94
Member

Describe the bug
During primary-to-primary relocation, we see data loss when indexing is running at high TPS. The loss specifically starts after initiateTracking happens for the new primary shard: a subset of docs is missing once relocation completes. After the relocation handoff completes, indexing that lands on the new primary shard uses the correct seq no, but the overall doc count is still incorrect.

To Reproduce
Step 1 - Create a SegRep index with index.translog.durability set to async or request. The issue reproduces more easily with the async option.

curl -X PUT "localhost:9200/test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "replication.type" : "SEGMENT",
    "index.translog.durability" : "async",
    "refresh_interval": "1000s"
  }
}
'

Step 2 - Index docs

for i in {1..1000}
do
   curl --location --request POST "localhost:9202/test-index/_doc" \
    --header 'Content-Type: application/json' \
    --data-raw "{
      \"name\":\"abc${i}\"
    }"
    # echo already appends a newline; a literal "\n" would be printed as-is
    echo "$i"
done

Step 3 - Start relocation of index

curl -XPUT localhost:9201/test-index/_settings -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.include._name": "opensearch-node1"
}'
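
The steps above do not show how the missing docs were counted. One possible check, as a minimal sketch: refresh the index explicitly (the repro sets refresh_interval to 1000s, so new docs are not searchable otherwise) and compare the `_count` result with the number of docs sent. The helper names `parse_count`, `get_doc_count`, and `report_missing`, and the `HOST` and `INDEX` variables, are hypothetical and not part of the original report.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the doc count after relocation.
# HOST and INDEX are assumptions chosen to match the repro steps.
HOST="${HOST:-localhost:9200}"
INDEX="${INDEX:-test-index}"

# Extract the numeric "count" field from a _count response read on stdin.
parse_count() {
  sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Refresh first so all indexed docs become searchable, then fetch the count.
get_doc_count() {
  curl -s -X POST "$HOST/$INDEX/_refresh" > /dev/null
  curl -s "$HOST/$INDEX/_count" | parse_count
}

# Compare the number of docs sent with the number found in the index.
report_missing() {
  local expected="$1" actual="$2"
  if [ "$actual" -lt "$expected" ]; then
    echo "MISSING $((expected - actual)) of $expected docs"
    return 1
  fi
  echo "OK $actual docs"
}

# Usage once relocation has completed (needs a running cluster):
#   report_missing 1000 "$(get_doc_count)"
```

If some indexing requests failed outright, the expected value should be the number of acknowledged requests rather than 1000.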
@ashking94 ashking94 added the bug Something isn't working label Feb 14, 2023
@ashking94
Member Author

I had found a relocation issue for remote-backed indexes in #6214. The same issue exists for segrep indexes as well (I validated this). The fix for remote-backed indexes is in #6314. I have validated that fix for segrep too, and it seems to work. Please feel free to start from the same fix, and we can reason about alternate approaches as well. cc @dreamer-89 @mch2

@mch2
Member

mch2 commented Feb 14, 2023

Thanks for raising this @ashking94. We've been discussing this on #6065, as it is a cause of some of the flakiness in our relocation ITs. One addition I mentioned on #6065 that I like for segment replication (SR) is to execute a refresh before we do the round of SR for the new primary during relocation. I like this change not only for relocation but also for failover scenarios, to guarantee we are not leaving ops in the translog (xlog). I will include a test for that while concurrently indexing.

@dreamer-89
Member

Closing this issue as fixed with #6065

@ashking94: Please let us know if you still see this bug.
