
Establish replication connections more efficiently in an edge case #136

Closed
nars1 opened this issue Jan 26, 2018 · 0 comments
nars1 commented Jan 26, 2018

Final Release Note

Initiating replication connections between Source and Receiver Servers is more efficient. Previously, in rare cases, the Source Server unnecessarily disconnected and reconnected. [YDB#136]

Description

If the source server is started with a non-zero compression level (-cmplvl qualifier) and the receiver server is started with a zero compression level, the receiver server, as part of its initial handshake, sends a REPL_CMP2UNCMP message to the source server to let it know to send uncompressed records in the replication pipe. In rare cases, before this message is sent, the source server can close the connection for other reasons (e.g., a heartbeat from the receiver was not received for a while). In that case, the receiver server could incorrectly send the REPL_CMP2UNCMP message as part of the initial handshake with the source server on a new connection (which is established soon after the source server closes the old connection). If this happens, the source server gets confused by this unexpected message at the start of a connection and closes the connection once more, with a message of the form "UNKNOWN msg (type = ...) received when waiting for msg (type = ...). Closing connection." in the source server log file.

The fix ensures that the receiver server does not send the REPL_CMP2UNCMP message (which applied only to the old connection) the moment a new connection is established with the source server.

Draft Release Note

The replication receiver server sends the correct handshake messages on a new connection with the source server. In YottaDB r1.10 and before, it was possible in rare cases for the source server to unnecessarily disconnect the connection.

@nars1 nars1 added this to the r120 milestone Jan 26, 2018
@nars1 nars1 self-assigned this Jan 26, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue Jan 26, 2018
…ts reset; Not doing so could affect new connection when it gets established

This issue was observed from a test failure (in the merge/tp_stress subtest) during internal testing. The primary failure symptom was an assert failure in the source server:

  %GTM-F-ASSERT, Assert failed in sr_unix/gtmsource_process_ops.c line 1167 for expression (FALSE)

The source server assert failed because it was expecting a REPL_INSTINFO message
(msgtype == 29) but instead got a REPL_CMP2UNCMP message (msgtype == 25).

The merge/tp_stress subtest induces a REPL_CMP2UNCMP situation using white-box code so the fact
that the receiver sent a REPL_CMP2UNCMP message is not surprising. But the timing of the send
is where the problem is.

The issue is that whether a CMP2UNCMP message needs to be sent is tracked in a static variable "send_cmp2uncmp", which was set to TRUE (in gtmrecv_poll_actions.c) but never cleared before moving on to the new connection. This leftover static variable caused the incorrect send and resulted in the assert failure of the source server. In pro, the source server would have issued an error message, closed the connection, and opened a new connection with the receiver server, so the user would not notice any issue (in the sense that the source server would not go down), even though the source server log would show evidence of an unnecessary disconnect/reconnect sequence.

As part of the connection reset, we currently reset a few static variables of the old connection but do not reset the send_cmp2uncmp variable. There are a few other statics in use as well, and they are best cleared whenever a connection gets reset.

So that is how this fix proceeds. All statics related to the current connection are bundled up
in a conn_state_t structure and maintained in the "curr_conn_state" static variable. Anytime
a connection gets reset, the entire structure gets cleared.
nars1 added a commit that referenced this issue Jan 26, 2018
@nars1 nars1 closed this as completed Jan 26, 2018
@ksbhaskar ksbhaskar changed the title Replication source server could unnecessarily close (and reopen) its connection with the receiver server in rare cases Establish replication connections more quickly in an edge case Mar 19, 2018
@ksbhaskar ksbhaskar changed the title Establish replication connections more quickly in an edge case Establish replication connections more efficiently in an edge case Mar 19, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue May 8, 2018
…ssed out in YottaDB#136) soon after connection gets reset

Issue YottaDB#136 moved all state-related information in gtmrecv_poll_actions.c into a structure
variable "curr_conn_state". But there was another global variable, "gtmrecv_send_cmp2uncmp",
which was set in gtmrecv_process.c and mirrored "curr_conn_state.send_cmp2uncmp".
This meant that whenever "curr_conn_state.send_cmp2uncmp" got cleared (due to the memset
change introduced in YottaDB#136), "gtmrecv_send_cmp2uncmp" should also have been cleared.
Not doing so meant it was still possible for a stale state variable from the old connection
to be carried over to a new connection, resulting in an assert failure in the
source server in rare cases (just like YottaDB#136) in dbg, or a connection loss and reconnect
sequence in pro (this did happen in one in-house test failure). This is now fixed
so both variables are kept in sync.

The original idea was to avoid the duplication by replacing all usages of
"gtmrecv_send_cmp2uncmp" with "curr_conn_state.send_cmp2uncmp", but that was not trivial:
it required moving the "conn_state_t" type definition from gtmsource_poll_actions.c
into a header file so gtmrecv_process.c could also use that definition. Since this
is an edge case, and there was a risk of other subtle regressions, the simpler change
was adopted even though it is not as desirable.
nars1 added a commit that referenced this issue May 8, 2018