Dist blocker #51

arcusfelis · 2024-03-06T09:26:00Z

In this PR we introduce cets_dist_blocker, which will set cookies that would prevent nodes from reconnecting before cleaning is done.

We use set_cookie/2, which allows to set a particular cookie for a specific remote node (and specific local node). If we set cookie to some random value, the node would not be able to connect to us (also we would not able to connect). But the existing connection will remain. Also, it only affects that specific node, other nodes would use cookie provided by get_cookie/0 function.

The idea is:

on nodeup we set cookie to something different for that particular node, so the next time it tries to reconnect, it will not able to.
on nodedown we would wait for all the cleaner processes to ack that they've finished handling nodedown message (sent using monitor_nodes call).
we only do it on a local node (i.e. no coordination between dist_blockers in cluster). Each MongooseIM node has a dist_blocker process though, just no messaging between them. Reason: simplicity. Also, that approach is enough to fix most of the issues.

This PR addresses "sessionCount is incorrect during upgrades".
The final goal is to make global's prevent_overlapped_partitions reliable.

Proposed changes include:

cets_dist_blocker server, which provides API to block node from reconnecting, unless we do cleaning first.
Test helper to start peer nodes.

PR to MongooseIM: esl/MongooseIM#4234
PR to MongooseHelm: esl/MongooseHelm#38

It prevents nodes from reconnecting too fast

codecov · 2024-03-06T09:27:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.79%. Comparing base (fbaf1d8) to head (b14d8b8).
Report is 3 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #51      +/-   ##
==========================================
+ Coverage   98.35%   98.79%   +0.43%     
==========================================
  Files          10       11       +1     
  Lines         792      831      +39     
==========================================
+ Hits          779      821      +42     
+ Misses         13       10       -3

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

…s_done Add dist_blocker_skip_blocking_if_no_cleaners testcase

…s_down

dist_blocker_blocks_if_cleaner_says_done_and_second_cleaner_does_not_ack testcase Common connect_and_disconnect function

chrzaszcz

Looks good. My only concern is how to protect against the cleaner taking too long on some node. Some timeout maybe? But let's not make it too complex at once.

src/cets_dist_blocker.erl

arcusfelis · 2024-03-08T15:48:38Z

Looks good. My only concern is how to protect against the cleaner taking too long on some node. Some timeout maybe? But let's not make it too complex at once.

Could be timeout. Could be a very noisy logging. With timeout allowing node to connect would lead to the "unusual state" (i.e. untested state). It could be ok, it could be just adding more load to the cluster and causing errors.

In this case there is a chance that there would be "that one node, that prevented the join".
But it is fine-tuning, core functionality is here.

Currently we have:

nodes are getting cleaned one-by-one.
cleaning would take some reasonably short amount of time. Though not always, we have code that calls mod_last to update last_seen timestamp from SM on cleanup.
there is a chance that node would be able to connect to one of the nodes that finished cleaning, and not connect to other nodes. Some coordination could be required. Could make a small test and an issue later on, but it is a separate issue.
The idea is to have the basic dist_cleaner, and then maybe we can add cross-node cleaners (i.e. node would set cookie to unblocked on all nodes at once, if at least one node sets cookie to blocked - all nodes in the cluster would set to blocked).

Oh, I think there is 30 seconds grace period during which cets_discovery would not try to reconnect - so, it should be enough for the cluster to be able to accept the node again. Maybe it could be enough.

arcusfelis added 3 commits March 6, 2024 10:18

Create cets_test_peer helper

8d24d27

Add cets_dist_blocker

b009ba6

It prevents nodes from reconnecting too fast

Add tests for unknown messages in cets_dist_blocker

853c4ee

arcusfelis added 4 commits March 6, 2024 11:01

Add dist_blocker_unblocks_if_cleaner_goes_down_and_second_cleaner_say…

21c5777

…s_done Add dist_blocker_skip_blocking_if_no_cleaners testcase

Improve docs

788931f

Add dist_blocker_unblocks_if_cleaner_says_done_and_second_cleaner_goe…

dfa6add

…s_down

More strict order control in tests for dist_blocker

b37ef11

dist_blocker_blocks_if_cleaner_says_done_and_second_cleaner_does_not_ack testcase Common connect_and_disconnect function

arcusfelis force-pushed the dist-blocker branch from 6073b63 to b37ef11 Compare March 6, 2024 11:48

Use two groups of tests in cets_dist_blocker_SUITE

43f5298

chrzaszcz reviewed Mar 8, 2024

View reviewed changes

src/cets_dist_blocker.erl Outdated Show resolved Hide resolved

src/cets_dist_blocker.erl Outdated Show resolved Hide resolved

Fix comments after the review

b14d8b8

chrzaszcz approved these changes Mar 8, 2024

View reviewed changes

chrzaszcz merged commit 0f69ec9 into main Mar 8, 2024
9 checks passed

chrzaszcz deleted the dist-blocker branch March 8, 2024 16:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Dist blocker #51

Dist blocker #51

arcusfelis commented Mar 6, 2024 •

edited

Loading

codecov bot commented Mar 6, 2024 •

edited

Loading

chrzaszcz left a comment

arcusfelis commented Mar 8, 2024

Dist blocker #51

Dist blocker #51

Conversation

arcusfelis commented Mar 6, 2024 • edited Loading

codecov bot commented Mar 6, 2024 • edited Loading

Codecov Report

chrzaszcz left a comment

Choose a reason for hiding this comment

arcusfelis commented Mar 8, 2024

arcusfelis commented Mar 6, 2024 •

edited

Loading

codecov bot commented Mar 6, 2024 •

edited

Loading