Control Plane somehow quits after a few minutes. #4248

Closed
fulmicoton opened this issue Dec 8, 2023 · 2 comments · Fixed by #4251
Assignees
Labels
bug Something isn't working high-priority

Comments

@fulmicoton
Contributor

fulmicoton commented Dec 8, 2023

2023-12-08T11:11:12.850Z  INFO merge{merge_split_id=01HH4HZMZA5PA3T198QMFVF1WR split_ids=["01HH4HZM9PFXCY6Y9H8THFXPWH", "01HH4HXTC0MEQZ3QKC8E981F01"] typ=Merge}:uploader:upload{split=01HH4HZMZA5PA3T198QMFVF1WR}:store_split: quickwit_indexing::split_store::indexing_split_store: store-split-remote-success split_size_in_megabytes=36.936207 num_docs=53687 elapsed_secs=0.3037853 throughput_mb_s=121.586555 is_mature=false
2023-12-08T11:11:12.850Z  INFO merge{merge_split_id=01HH4HZMZA5PA3T198QMFVF1WR split_ids=["01HH4HZM9PFXCY6Y9H8THFXPWH", "01HH4HXTC0MEQZ3QKC8E981F01"] typ=Merge}:uploader:upload{split=01HH4HZMZA5PA3T198QMFVF1WR}:store_split: quickwit_indexing::split_store::indexing_split_store: store-in-cache
2023-12-08T11:11:12.855Z  INFO merge{merge_split_id=01HH4HZMZA5PA3T198QMFVF1WR split_ids=["01HH4HZM9PFXCY6Y9H8THFXPWH", "01HH4HXTC0MEQZ3QKC8E981F01"] typ=Merge}:publisher{split_update=SplitsUpdate { index_id: "simian_5288585943041499547", new_splits: "01HH4HZMZA5PA3T198QMFVF1WR", checkpoint_delta: None }}: quickwit_indexing::actors::publisher: publish-new-splits new_splits=["01HH4HZMZA5PA3T198QMFVF1WR"] checkpoint_delta=None
2023-12-08T11:11:14.920Z  INFO quickwit_actors::spawn_builder: actor-exit actor_id=Supervisor(ControlPlane)-solitary-kFEb exit_status=success
2023-12-08T11:11:54.836Z ERROR quickwit_control_plane::control_plane: failed to forward local shards update to control plane error=the channel is closed
2023-12-08T11:11:59.836Z ERROR quickwit_control_plane::control_plane: failed to forward local shards update to control plane error=the channel is closed
2023-12-08T11:12:05.212Z  INFO quickwit_indexing::actors::indexer: new-split split_id=01HH4J191S18DFQ0NFFGD4TY1C partition_id=0
2023-12-08T11:12:08.777Z  INFO quickwit_indexing::actors::indexer: send-to-index-serializer commit_trigger=NumDocsLimit split_ids=01HH4J191S18DFQ0NFFGD4TY1C num_docs=100259
2023-12-08T11:12:08.810Z  INFO quickwit_indexing::actors::indexer: new-split split_id=01HH4J1CJ94RPEPJJ418HRJ82B partition_id=0
2023-12-08T11:12:09.392Z  INFO index-doc-batches{index_id=simian_10737006966677483985 source_id=_ingest-source pipeline_uid=01HH4HPD9ANV6CJJ7HNRT4BSQT workbench_id=01HH4J191SRCYEKSN3YW74RJR4}:packager: quickwit_indexing::actors::packager: start-packaging-splits split_ids=["01HH4J191S18DFQ0NFFGD4TY1C"]
2023-12-08T11:12:09.392Z  INFO index-doc-batches{index_id=simian_10737006966677483985 source_id=_ingest-source pipeline_uid=01HH4HPD9ANV6CJJ7HNRT4BSQT workbench_id=01HH4J191SRCYEKSN3YW74RJR4}:packager: quickwit_indexing::actors::packager: create-packaged-split split_id="01HH4J191S18DFQ0NFFGD4TY1C"
2023-12-08T11:12:09.431Z  INFO index-doc-batches{index_id=simian_10737006966677483985 source_id=_ingest-source pipeline_uid=01HH4HPD9ANV6CJJ7HNRT4BSQT workbench_id=01HH4J191SRCYEKSN3YW74RJR4}:uploader: quickwit_indexing::actors::uploader: start-stage-and-store-splits split_ids=["01HH4J191S18DFQ0NFFGD4TY1C"]
2023-12-08T11:12:09.835Z ERROR quickwit_control_plane::control_plane: failed to forward local shards update to control plane error=the channel is closed
2023-12-08T11:12:09.954Z  INFO index-doc-batches{index_id=simian_10737006966677483985 source_id=_ingest-source pipeline_uid=01HH4HPD9ANV6CJJ7HNRT4BSQT workbench_id=01HH4J191SRCYEKSN3YW74RJR4}:uploader:upload{split=01HH4J191S18DFQ0NFFGD4TY1C}:store_split: quickwit_indexing::split_store::indexing_split_store: store-split-remote-success split_size_in_megabytes=66.61766 num_docs=100259 elapsed_secs=0.5011575 throughput_mb_s=132.92758 is_mature=true
2023-12-08T11:12:09.963Z  INFO index-doc-batches{index_id=simian_10737006966677483985 source_id=_ingest-source pipeline_uid=01

Running with:
'chicobonbon(rate=500000)*3' 'chicobonbon(rate=15000000)

@fulmicoton fulmicoton added the bug Something isn't working label Dec 8, 2023
@fulmicoton fulmicoton self-assigned this Dec 8, 2023
@fulmicoton
Contributor Author

controlplanequits.txt

@fulmicoton
Contributor Author

It shuts down because it runs out of messages while no one holds its mailbox.

Possible explanation:

This should not happen: the ControlLoop message is meant to guarantee that there is never an instant at which there are no more messages and no one is holding the mailbox.

We probably have a race condition today that goes as follows:

  • scheduler holds the mailbox
  • actor loop detects the absence of messages
  • scheduler sends a message and drops the mailbox
  • actor loop checks the number of mailbox holders, finds none, and exits with the message still pending.
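The interleaving above can be replayed deterministically. The following is a minimal sketch, not Quickwit's actual actor code and not the fix that was merged: `Mailbox`, `Decision`, `racy_step`, `safe_step`, and `scheduler_sends_then_drops` are hypothetical names, and the `interleaved` callback stands in for whatever another thread does between the two checks. It assumes that once the holder count reaches zero it can never go back up, which is exactly the kind of assumption a weak handle that can be upgraded would break.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified mailbox: pending messages plus a count of
/// live sender handles ("holders").
struct Mailbox {
    queue: VecDeque<&'static str>,
    holders: usize,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Process(&'static str),
    Wait,
    Exit,
}

/// Racy order: look at the queue first, then at the holder count.
/// Anything that happens in between (`interleaved`) can make the two
/// observations inconsistent.
fn racy_step(mailbox: &mut Mailbox, interleaved: impl FnOnce(&mut Mailbox)) -> Decision {
    let next = mailbox.queue.pop_front();
    interleaved(mailbox);
    match next {
        Some(msg) => Decision::Process(msg),
        None if mailbox.holders == 0 => Decision::Exit,
        None => Decision::Wait,
    }
}

/// Safer order: read the holder count first, then the queue.
/// If no holder existed before the queue was checked, nobody could have
/// enqueued anything afterwards, so an empty queue really means "done".
fn safe_step(mailbox: &mut Mailbox, interleaved: impl FnOnce(&mut Mailbox)) -> Decision {
    let no_holders = mailbox.holders == 0;
    interleaved(mailbox);
    match mailbox.queue.pop_front() {
        Some(msg) => Decision::Process(msg),
        None if no_holders => Decision::Exit,
        None => Decision::Wait,
    }
}

/// The scheduler's side of the race: send one message, then drop the handle.
fn scheduler_sends_then_drops(mailbox: &mut Mailbox) {
    mailbox.queue.push_back("ControlLoop");
    mailbox.holders -= 1;
}

fn main() {
    let mut mailbox = Mailbox { queue: VecDeque::new(), holders: 1 };
    let racy = racy_step(&mut mailbox, scheduler_sends_then_drops);
    println!("racy order: {:?} with {} pending message(s)", racy, mailbox.queue.len());

    let mut mailbox = Mailbox { queue: VecDeque::new(), holders: 1 };
    let safe = safe_step(&mut mailbox, scheduler_sends_then_drops);
    println!("safe order: {:?}", safe);
}
```

With the racy ordering the loop decides to exit even though a message is still queued, which matches the `actor-exit ... exit_status=success` line followed by `the channel is closed` errors in the log. Reversing the order of the two checks is only one possible remedy under the stated assumption.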

fulmicoton added a commit that referenced this issue Dec 9, 2023
The bug could surface in any actor looping alone,
like the control plane.

Closes #4248
fulmicoton added a commit that referenced this issue Dec 11, 2023
The weak mailbox was messing up the refcounting,
and we technically had a race condition.

The bug could surface in any actor looping alone,
like the control plane.

Closes #4248
fulmicoton added a commit that referenced this issue Dec 11, 2023
)

The weak mailbox was messing up the refcounting,
and we technically had a race condition.

The bug could surface in any actor looping alone,
like the control plane.

Closes #4248