Restart db service once domain core is down #3224

Closed

Conversation

arcusfelis
Contributor

This PR addresses MIM-1470

Proposed changes include:

  • Restart db service once domain core is down
  • Add logging for initial sync
  • Add test demonstrating that Core process is restarted, but domain service is not

@codecov

codecov bot commented Aug 18, 2021

Codecov Report

Merging #3224 (91ab7b8) into master (6d68ad1) will decrease coverage by 0.01%.
The diff coverage is 88.63%.


@@            Coverage Diff             @@
##           master    #3224      +/-   ##
==========================================
- Coverage   80.31%   80.31%   -0.01%     
==========================================
  Files         398      399       +1     
  Lines       32505    32519      +14     
==========================================
+ Hits        26108    26119      +11     
- Misses       6397     6400       +3     
Impacted Files                              | Coverage Δ
--------------------------------------------|---------------------------
src/domain/mongoose_lazy_routing.erl        | 84.84% <33.33%> (ø)
src/domain/mongoose_subdomain_core.erl      | 89.52% <33.33%> (ø)
src/domain/service_domain_db.erl            | 79.71% <94.44%> (+1.58%) ⬆️
src/domain/mongoose_domain_core.erl         | 88.23% <100.00%> (+0.14%) ⬆️
src/domain/mongoose_domain_db_cleaner.erl   | 75.00% <100.00%> (ø)
src/domain/mongoose_domain_loader.erl       | 76.36% <100.00%> (+0.89%) ⬆️
src/domain/mongoose_domain_sql.erl          | 82.07% <100.00%> (+0.17%) ⬆️
src/domain/mongoose_domain_sup.erl          | 100.00% <100.00%> (ø)
src/ejabberd_sup.erl                        | 86.20% <100.00%> (+0.49%) ⬆️
src/ejabberd.erl                            | 45.00% <0.00%> (-10.00%) ⬇️
... and 11 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6d68ad1...91ab7b8.

@@ -130,6 +130,7 @@ get_start_args() ->
 %% gen_server callbacks
 %%--------------------------------------------------------------------
 init([Pairs, AllowedHostTypes]) ->
+    service_domain_db:reset_last_event_id(),
Contributor Author

Fix for @NelsonVides's bug.
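
For context, a minimal sketch of the semantics this reset gives us (an illustration, not the actual service_domain_db code): once the remembered event id is dropped, the next check falls back to a full initial sync.

    %% Illustration only, with an invented module name: dropping the
    %% remembered event id forces the next check to run a full initial
    %% sync instead of an incremental one.
    -module(last_event_id_sketch).
    -export([reset/1, next_sync_kind/1]).

    reset(State) ->
        State#{last_event_id => undefined}.

    next_sync_kind(#{last_event_id := undefined}) -> initial_sync;
    next_sync_kind(#{last_event_id := _Id}) -> incremental.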

    receive_all_check_for_updates(),
    PageSize = 1000,
    LastEventId2 = mongoose_domain_loader:check_for_updates(LastEventId, PageSize),
    maybe_set_last_event_id(LastEventId, LastEventId2),
    TRef = erlang:send_after(Interval, self(), check_for_updates),
-   State#{last_event_id => LastEventId2, check_for_updates => TRef}.
+   State#{last_event_id => LastEventId2, check_for_updates_tref => TRef}.
Contributor Author

Fix for @chrzaszcz's bug with too many queries.
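
As I read the bug (my interpretation, not confirmed in this thread): with the timer ref stored under the same name as the tick message, an old timer could be left armed, so ticks and therefore DB queries multiplied. A sketch of the intended scheduling:

    %% Sketch with an invented module name; the point is that the
    %% previous timer is cancelled before a new one is armed, so only
    %% one check_for_updates tick is ever in flight.
    -module(update_timer_sketch).
    -export([schedule/2]).

    schedule(Interval, State) ->
        case maps:get(check_for_updates_tref, State, undefined) of
            undefined -> ok;
            OldTRef -> erlang:cancel_timer(OldTRef)
        end,
        TRef = erlang:send_after(Interval, self(), check_for_updates),
        State#{check_for_updates_tref => TRef}.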

@mongoose-im
Collaborator

mongoose-im commented Aug 19, 2021

small_tests_24 / small_tests / 0036269
Reports root / small

internal_mnesia_24 / internal_mnesia / 0036269
Reports root / big
OK: 1588 / Failed: 0 / User-skipped: 288 / Auto-skipped: 0

small_tests_22 / small_tests / 0036269
Reports root / small

dynamic_domains_23 / pgsql_mnesia / 0036269
Reports root / big
OK: 1644 / Failed: 0 / User-skipped: 111 / Auto-skipped: 0

dynamic_domains_24 / pgsql_mnesia / 0036269
Reports root / big
OK: 1644 / Failed: 0 / User-skipped: 111 / Auto-skipped: 0

small_tests_23 / small_tests / 0036269
Reports root / small

ldap_mnesia_24 / ldap_mnesia / 0036269
Reports root / big
OK: 1515 / Failed: 0 / User-skipped: 361 / Auto-skipped: 0

ldap_mnesia_22 / ldap_mnesia / 0036269
Reports root / big
OK: 1515 / Failed: 0 / User-skipped: 361 / Auto-skipped: 0

ldap_mnesia_23 / ldap_mnesia / 0036269
Reports root / big
OK: 1515 / Failed: 0 / User-skipped: 361 / Auto-skipped: 0

pgsql_mnesia_22 / pgsql_mnesia / 0036269
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

pgsql_mnesia_24 / pgsql_mnesia / 0036269
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

elasticsearch_and_cassandra_24 / elasticsearch_and_cassandra_mnesia / 0036269
Reports root / big
OK: 1891 / Failed: 0 / User-skipped: 284 / Auto-skipped: 0

pgsql_mnesia_23 / pgsql_mnesia / 0036269
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

mysql_redis_24 / mysql_redis / 0036269
Reports root / big
OK: 3076 / Failed: 1 / User-skipped: 201 / Auto-skipped: 0

mam_SUITE:rdbms_cache_prefs_cases:messages_filtered_when_prefs_default_policy_is_never
{error,{test_case_failed,"ASSERT EQUAL\n\tExpected []\n\tValue [ok]\n"}}

Report log

mssql_mnesia_24 / odbc_mssql_mnesia / 0036269
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

riak_mnesia_24 / riak_mnesia / 0036269
Reports root / big
OK: 1738 / Failed: 0 / User-skipped: 287 / Auto-skipped: 0

@@ -148,6 +148,7 @@ select_from(FromId, Limit) ->
     Pool = get_db_pool(),
     Args = rdbms_queries:add_limit_arg(Limit, [FromId]),
     {selected, Rows} = execute_successfully(Pool, domain_select_from, Args),
+    ?LOG_ERROR(#{what => select_from, rows => Rows, from_id => FromId, limit => Limit}),
Collaborator

Error seems a bit hardcore here 🤔
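
If the intent is to keep the trace without the noise, one option (my suggestion, not something settled in this PR) would be a debug-level log that skips the full row dump:

    %% Hypothetical softer variant: debug level, and only the row count
    %% rather than the rows themselves, to keep production logs quiet.
    %% (?LOG_DEBUG is the standard OTP logger macro; assuming the same
    %% logging header used for ?LOG_ERROR is in scope here.)
    ?LOG_DEBUG(#{what => domain_select_from, row_count => length(Rows),
                 from_id => FromId, limit => Limit}),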

    SupFlags = #{strategy => rest_for_one,
                 intensity => 100, period => 5,
                 shutdown => 1000},
    ChildSpecs = [],
Collaborator

If we have this supervisor here, why don't we give it its child specs and start the children here, instead of hardcoding them in the mongooseim:start() entry point? Also, a strategy of rest_for_one would restart all children started after the one that died (which kinda seems right here), and right now there is no single place that makes clear the order in which the children are started.
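
To illustrate the suggestion, a sketch of a rest_for_one supervisor that owns its child specs, so the start order is visible in one place (module and child names are invented, not the PR's code):

    %% Sketch only; names are hypothetical.
    -module(domain_sup_sketch).
    -behaviour(supervisor).
    -export([start_link/0, init/1]).

    start_link() ->
        supervisor:start_link({local, ?MODULE}, ?MODULE, []).

    init([]) ->
        SupFlags = #{strategy => rest_for_one, intensity => 100, period => 5},
        %% rest_for_one: if domain_core dies, every child listed after
        %% it (here, the db service) is restarted as well.
        ChildSpecs = [#{id => domain_core,
                        start => {domain_core_sketch, start_link, []}},
                      #{id => domain_db_service,
                        start => {domain_db_service_sketch, start_link, []}}],
        {ok, {SupFlags, ChildSpecs}}.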

@@ -543,6 +546,37 @@ db_restarts_properly(_) ->
     end,
     mongoose_helper:wait_until(F, true, #{time_left => timer:seconds(15)}).

+db_loads_domains_after_node_joins_cluster(Config) ->
Collaborator

This is not exactly the scenario that I had in mind, but a much simpler one: put a bunch of domains on one node, then add an empty second node to the cluster with the first; we then expect this second node to serve the same domains as the first does.
How does the scenario you propose prove that my scenario is also valid?

Contributor Author

> an empty second node

It is never empty if it is running.
It gets the domains from the DB.

Collaborator

That is explicitly not true. From my tests, it only loads the static TOML config, nothing else. Try this (using macros from dynamic_domains_SUITE):

    insert_domains([mim()], ?DOMAINS),
    Config1 = cluster_nodes(?CLUSTER_NODES, Config),
    R1 = rpc(mim(), ets, tab2list, [mongoose_domain_core]),
    ct:pal("Value ~p~n", [R1]),
    R2 = rpc(mim2(), ets, tab2list, [mongoose_domain_core]),
    ct:pal("Value ~p~n", [R2]),
    ok.

And you'll see they report different values.

Contributor Author

This code:

  • does not start service_domain_db on either of the nodes;
  • does not start service_domain_db after clustering (you can check that the service is running by calling rpc(mim(), mongoose_service, is_loaded, [service_domain_db]));
  • uses mongoose_domain_core:insert/3 to insert domains, which does not touch the database and should only be used internally or for testing. Use mongoose_domain_api:insert_domain instead; see the sketch below.
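
Applying those points, the earlier snippet could look roughly like this (a sketch: it assumes ?DOMAINS is a list of {Domain, HostType} pairs, that insert_domain takes those two arguments and returns ok on success, and that the helpers are the same ones used in the original snippet):

    %% Sketch: the earlier snippet adjusted per the points above.
    %% Go through the API so the domains actually reach the database
    %% (assuming insert_domain/2 returns ok on success):
    [ok = rpc(mim(), mongoose_domain_api, insert_domain, [Domain, HostType])
     || {Domain, HostType} <- ?DOMAINS],
    Config1 = cluster_nodes(?CLUSTER_NODES, Config),
    %% The DB service must be running on both nodes for them to sync:
    true = rpc(mim(), mongoose_service, is_loaded, [service_domain_db]),
    true = rpc(mim2(), mongoose_service, is_loaded, [service_domain_db]),
    R1 = rpc(mim(), ets, tab2list, [mongoose_domain_core]),
    R2 = rpc(mim2(), ets, tab2list, [mongoose_domain_core]),
    ct:pal("Values ~p~n~p~n", [R1, R2]),
    ok.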

Collaborator

Manually running this branch against that test still returns different ETS tables. Am I wrong to expect both tables to be equal?

@mongoose-im
Collaborator

mongoose-im commented Aug 19, 2021

small_tests_24 / small_tests / 91ab7b8
Reports root / small

internal_mnesia_24 / internal_mnesia / 91ab7b8
Reports root / big
OK: 1588 / Failed: 0 / User-skipped: 288 / Auto-skipped: 0

small_tests_22 / small_tests / 91ab7b8
Reports root / small

dynamic_domains_24 / pgsql_mnesia / 91ab7b8
Reports root / big
OK: 1644 / Failed: 0 / User-skipped: 111 / Auto-skipped: 0

dynamic_domains_23 / pgsql_mnesia / 91ab7b8
Reports root / big
OK: 1644 / Failed: 0 / User-skipped: 111 / Auto-skipped: 0

small_tests_23 / small_tests / 91ab7b8
Reports root / small

ldap_mnesia_22 / ldap_mnesia / 91ab7b8
Reports root / big
OK: 1515 / Failed: 0 / User-skipped: 361 / Auto-skipped: 0

ldap_mnesia_24 / ldap_mnesia / 91ab7b8
Reports root / big
OK: 1515 / Failed: 0 / User-skipped: 361 / Auto-skipped: 0

ldap_mnesia_23 / ldap_mnesia / 91ab7b8
Reports root / big
OK: 1515 / Failed: 0 / User-skipped: 361 / Auto-skipped: 0

pgsql_mnesia_22 / pgsql_mnesia / 91ab7b8
Reports root / big
OK: 3086 / Failed: 1 / User-skipped: 184 / Auto-skipped: 0

dynamic_domains_SUITE:with_mod_dynamic_domains_test:iq_handling_for_subdomain
{error,
  {{assertion_failed,assert,is_iq_result,
     [{xmlel,<<"iq">>,
        [{<<"to">>,<<"subdomain2.example.test">>},
         {<<"type">>,<<"get">>},
         {<<"id">>,<<"8017f8a72d8efab669e07258b280eeda">>}],
        [{xmlel,<<"query">>,[{<<"xmlns">>,<<"dummy.namespace">>}],[]}]}],
     {xmlel,<<"iq">>,
       [{<<"from">>,<<"subdomain2.example.test">>},
        {<<"to">>,<<"[email protected]/res1">>},
        {<<"type">>,<<"error">>},
        {<<"xml:lang">>,<<"en">>},
        {<<"id">>,<<"8017f8a72d8efab669e07258b280eeda">>}],
       [{xmlel,<<"query">>,[{<<"xmlns">>,<<"dummy.namespace">>}],[]},
        {xmlel,<<"error">>,
          [{<<"code">>,<<"404">>},{<<"type">>,<<"cancel">>}],
          [{xmlel,<<"remote-server-not-found">>,
             [{<<"xmlns">>,
             <<"urn:ietf:params:xml:ns:xmpp-stanzas">>}],
             []},
           {xmlel,<<"text">>,
             [{<<"xmlns">>,
             <<"urn:ietf:params:xml:ns:xmpp-stanzas">>}],
             [{xmlcdata,<<"From s2s (waiting)">>}]}]}]},
     "<iq from='subdomain2.example.test' to='[email protected]/res1' type='error' xml:lang='en' id='8017f8a72d8efab669e07258b280eeda'><query xmlns='dummy.namespace'/><error code='404' type='cancel'><remote-server-not-found xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'/><text xmlns='urn:ietf:params:xml:ns:xmpp-stanzas'>From s2s (waiting)</text></error></iq>"},
   [{escalus_new_assert,assert_true,2,
      [{file,
         "/home/circleci/app/big_tests/_build/default/lib...

Report log

pgsql_mnesia_23 / pgsql_mnesia / 91ab7b8
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

elasticsearch_and_cassandra_24 / elasticsearch_and_cassandra_mnesia / 91ab7b8
Reports root / big
OK: 1891 / Failed: 0 / User-skipped: 284 / Auto-skipped: 0

pgsql_mnesia_24 / pgsql_mnesia / 91ab7b8
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

mssql_mnesia_24 / odbc_mssql_mnesia / 91ab7b8
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

mysql_redis_24 / mysql_redis / 91ab7b8
Reports root / big
OK: 3070 / Failed: 0 / User-skipped: 201 / Auto-skipped: 0

riak_mnesia_24 / riak_mnesia / 91ab7b8
Reports root / big
OK: 1738 / Failed: 0 / User-skipped: 287 / Auto-skipped: 0

pgsql_mnesia_22 / pgsql_mnesia / 91ab7b8
Reports root / big
OK: 3087 / Failed: 0 / User-skipped: 184 / Auto-skipped: 0

@arcusfelis
Contributor Author

See #3226

@arcusfelis arcusfelis closed this Aug 20, 2021
@vkatsuba vkatsuba deleted the mu-restart-domain-db-service-once-core-is-down branch August 20, 2021 10:58