JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)
OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
During indexing into multiple live indices while performing maintenance on our cluster (creating new indices and deleting others), some bulk requests hit the 30s timeout on their update_mapping calls. Following these timeouts, we began observing inconsistencies between primaries and replicas. The disk size of the affected indices was considerably larger than expected (compared to similar data in indices from previous dates), and synced_flush or force_merge requests could not complete - even after we stopped indexing into these indices entirely.
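For context, the divergence is also visible without a synced flush by comparing per-shard document counts; a minimal check, assuming the standard _cat/shards API (the index name is the one from the example that follows):

```
GET /_cat/shards/fae12b2c5f73_newlogs_20180502-01?v&h=index,shard,prirep,state,docs,store
```

Rows for the same shard number where the p (primary) and r (replica) copies report different docs values correspond to the out-of-sync replicas reported by the synced flush.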
Here is an example response from _flush/synced on one of these indices:
{
  "_shards": {
    "total": 76,
    "successful": 72,
    "failed": 4
  },
  "fae12b2c5f73_newlogs_20180502-01": {
    "total": 76,
    "successful": 72,
    "failed": 4,
    "failures": [
      {
        "shard": 18,
        "reason": "out of sync replica; num docs on replica [4212470]; num docs on primary [4212473]",
        "routing": {
          "state": "STARTED",
          "primary": false,
          "node": "QeSPYyW7TfOOgR1SdVR42g",
          "relocating_node": null,
          "shard": 18,
          "index": "fae12b2c5f73_newlogs_20180502-01",
          "allocation_id": {
            "id": "PArVLA9uSyqZ4ZxKHu_yvQ"
          }
        }
      },
      {
        "shard": 29,
        "reason": "out of sync replica; num docs on replica [4211285]; num docs on primary [4211289]",
        "routing": {
          "state": "STARTED",
          "primary": false,
          "node": "M_QP7sSsQQ6jhGnAblOESg",
          "relocating_node": null,
          "shard": 29,
          "index": "fae12b2c5f73_newlogs_20180502-01",
          "allocation_id": {
            "id": "M_ws8hLURsmKzEtNq26SxQ"
          }
        }
      },
      {
        "shard": 0,
        "reason": "out of sync replica; num docs on replica [4210878]; num docs on primary [4210883]",
        "routing": {
          "state": "STARTED",
          "primary": false,
          "node": "g8cGEEK0SAGYC_I-FfZprg",
          "relocating_node": null,
          "shard": 0,
          "index": "fae12b2c5f73_newlogs_20180502-01",
          "allocation_id": {
            "id": "pWX9k1pZTuuWBKHOpNCJHw"
          }
        }
      },
      {
        "shard": 25,
        "reason": "out of sync replica; num docs on replica [4211536]; num docs on primary [4211537]",
        "routing": {
          "state": "STARTED",
          "primary": false,
          "node": "FmnjvN8YROKJ6PO8E3QEjA",
          "relocating_node": null,
          "shard": 25,
          "index": "fae12b2c5f73_newlogs_20180502-01",
          "allocation_id": {
            "id": "5wMI1o01SH2kJW9baYJymw"
          }
        }
      }
    ]
  }
}
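(For reference, the response above is what the synced-flush endpoint returns:)

```
POST /fae12b2c5f73_newlogs_20180502-01/_flush/synced
```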
Reducing the number of replicas on these indices to 0 and then increasing it back to 1 seems to resolve the problem.
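Sketched as the two index-settings calls (index name from the example above; the health check at the end simply waits for the new replicas to initialize):

```
PUT /fae12b2c5f73_newlogs_20180502-01/_settings
{
  "index": { "number_of_replicas": 0 }
}

PUT /fae12b2c5f73_newlogs_20180502-01/_settings
{
  "index": { "number_of_replicas": 1 }
}

GET /_cluster/health/fae12b2c5f73_newlogs_20180502-01?wait_for_status=green&timeout=5m
```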
However, in some of the indices, one shard remained stuck in a strange state (again, when calling _flush/synced).
For these indices, the problem was eventually resolved by closing the index and re-opening it. Upon re-opening, some of the primary shards were reported as unassigned by Elasticsearch and had to be recovered and initialized.
Overall, this is a very long, time- and resource-consuming operation: recognizing that indices are larger than expected, identifying a primary-replica inconsistency, reducing and then increasing the replica count, waiting for the replicas to fully initialize, closing and re-opening the index, and waiting for Elasticsearch to fully initialize the unassigned primary shards.
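The close/re-open step and the subsequent wait, as REST calls (a sketch; the recovery check just lists the shards still initializing):

```
POST /fae12b2c5f73_newlogs_20180502-01/_close
POST /fae12b2c5f73_newlogs_20180502-01/_open

GET /_cat/recovery/fae12b2c5f73_newlogs_20180502-01?v&active_only=true
GET /_cluster/health/fae12b2c5f73_newlogs_20180502-01?wait_for_status=green&timeout=30m
```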
I think what you are describing is the bug that is fixed by #30244, which will be released in v6.3.0, so I'm going to close this.
However, I'm not completely sure this explains the "ongoing operations on primary" message that required a close-and-reopen to clear. To avoid confusion with #30244, please could you open a thread on https://discuss.elastic.co about that, and include the output of
Elasticsearch version:
6.2.3
Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3
I believe the initial inconsistent state could be related to #30244, and possibly to these discussions as well:
https://discuss.elastic.co/t/flush-threads-stuck-es-6-1-2/117707
https://discuss.elastic.co/t/unable-to-complete-synced-flush-on-idle-index/118449
I'd be happy to provide additional data and information as needed.
Thanks.