-
Notifications
You must be signed in to change notification settings - Fork 426
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revert dist blocker #4251
Revert dist blocker #4251
Conversation
The dist_blocker feature was introduced to protect disconnected nodes from reconnecting before other nodes finished their clenaup. However, when dist_blocker kicks in, multiple unexpected error messages would appear on several nodes. This means that the solution cannot be accepted.
elasticsearch_and_cassandra_26 / elasticsearch_and_cassandra_mnesia / 509759e small_tests_25 / small_tests / 509759e small_tests_26 / small_tests / 509759e small_tests_26_arm64 / small_tests / 509759e ldap_mnesia_25 / ldap_mnesia / 509759e dynamic_domains_mysql_redis_26 / mysql_redis / 509759e ldap_mnesia_26 / ldap_mnesia / 509759e dynamic_domains_pgsql_mnesia_26 / pgsql_mnesia / 509759e internal_mnesia_26 / internal_mnesia / 509759e dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / 509759e graphql_SUITE:tls_enabled:tls_connect_admin_selfsigned_certificate{error,{{assertMatch,[{module,graphql_SUITE},
{line,259},
{expression,"Result"},
{pattern,"{ error , { tls_alert , { bad_certificate , _ } } }"},
{value,{error,connection_closed}}]},
[{graphql_SUITE,tls_connect_admin_selfsigned_certificate,1,
[{file,"/home/circleci/project/big_tests/tests/graphql_SUITE.erl"},
{line,259}]},
{test_server,ts_tc,3,[{file,"test_server.erl"},{line,1793}]},
{test_server,run_test_case_eval1,6,
[{file,"test_server.erl"},{line,1302}]},
{test_server,run_test_case_eval,9,
[{file,"test_server.erl"},{line,1234}]}]}} pgsql_mnesia_25 / pgsql_mnesia / 509759e dynamic_domains_pgsql_mnesia_25 / pgsql_mnesia / 509759e pgsql_mnesia_26 / pgsql_mnesia / 509759e mysql_redis_26 / mysql_redis / 509759e mssql_mnesia_26 / odbc_mssql_mnesia / 509759e pgsql_cets_26 / pgsql_cets / 509759e dynamic_domains_mssql_mnesia_26 / odbc_mssql_mnesia / 509759e |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## consistent-node-cleanup #4251 +/- ##
===========================================================
- Coverage 84.44% 84.43% -0.02%
===========================================================
Files 552 552
Lines 33549 33546 -3
===========================================================
- Hits 28331 28324 -7
- Misses 5218 5222 +4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! 👍
The
dist_blocker
feature was introduced to protect a disconnected node from reconnecting before other nodes finished the cleanup to avoid issues with inconsistent CETS tables. The solution was to change the cookie value onnodeup
(sic!) for the particular node until the cleanup for that node is finished.There are following issues with that solution:
nodedown
is received, and cleanup is started. However, when the node comes back online,nodedown
would be received, followed bynodeup
. The problem is thatdist_blocker
is preventing the node from connecting, which might delay reconnection. In MongooseHelm tests this was causing a 15-second delay before reconnection, but it could be more in some configurations. There is even potential for a deadlock, but I couldn't trigger it in tests.dist_blocker
kicks in, multiple error messages would appear on several nodes, informing about the blocked connections. These unwanted and surprising logs could confuse the user.Because of these concerns,
dist_blocker
is disabled in this PR. It is enough to disable it in MIM, because CETS does not enable it by default.The MongooseHelm tests have shown that after #4250 it is difficult to reproduce the issue - it occurred only once in about 50 tests, which try to quickly restart MIM to trigger the issue. Keep in mind, that with
dist_blocker
it has also failed at least twice on CI.We can prevent the issues in different ways, and we can do so in separate PR's.