Establish replication connections more efficiently in an edge case #136
nars1
added a commit
to nars1/YottaDB
that referenced
this issue
Jan 26, 2018
…ts reset; Not doing so could affect the new connection when it gets established.

This issue was observed from a test failure (in the merge/tp_stress subtest) during internal testing. The primary failure symptom was an assert failure in the source server:

%GTM-F-ASSERT, Assert failed in sr_unix/gtmsource_process_ops.c line 1167 for expression (FALSE)

The source server assert failed because it expected a REPL_INSTINFO message (msgtype == 29) but instead got a REPL_CMP2UNCMP message (msgtype == 25). The merge/tp_stress subtest induces a REPL_CMP2UNCMP situation using white-box code, so the fact that the receiver sent a REPL_CMP2UNCMP message is not surprising; the problem is the timing of the send. We record the fact that a CMP2UNCMP message needs to be sent in a static variable "send_cmp2uncmp", which was set to TRUE (in gtmrecv_poll_actions.c) but not cleared before moving on to the new connection. This leftover static variable caused the incorrect send and resulted in the assert failure of the source server. In pro, the source server would have issued an error message, closed the connection, and opened a new connection with the receiver server, so the user would not notice any issues (in the sense that the source server would not go down), even though the source server log would show evidence of an unnecessary disconnect/reconnect sequence.

As part of a connection reset, we currently reset a few static variables of the old connection but do not reset the send_cmp2uncmp variable. A few other statics are in use too and are best cleared whenever a connection gets reset, so that is how this fix proceeds: all statics related to the current connection are bundled into a conn_state_t structure maintained in the "curr_conn_state" static variable, and anytime a connection gets reset, the entire structure is cleared.
nars1
added a commit
that referenced
this issue
Jan 26, 2018
ksbhaskar
changed the title
Replication source server could unnecessarily close (and reopen) its connection with the receiver server in rare cases
Establish replication connections more quickly in an edge case
Mar 19, 2018
ksbhaskar
changed the title
Establish replication connections more quickly in an edge case
Establish replication connections more efficiently in an edge case
Mar 19, 2018
nars1
added a commit
to nars1/YottaDB
that referenced
this issue
May 8, 2018
…ssed out in YottaDB#136) soon after connection gets reset.

Issue YottaDB#136 moved all state-related information in gtmrecv_poll_actions.c into a structure variable "curr_conn_state". But there was another global variable, "gtmrecv_send_cmp2uncmp", set in gtmrecv_process.c, that mirrored "curr_conn_state.send_cmp2uncmp". Whenever "curr_conn_state.send_cmp2uncmp" was cleared (by the memset introduced in YottaDB#136), "gtmrecv_send_cmp2uncmp" should also have been cleared. Not doing so meant it was still possible for a stale state variable from the old connection to be carried over to a new connection, resulting in an assert failure in the source server in rare cases (just like YottaDB#136) in dbg, or a connection loss and reconnect sequence in pro (this did happen in one in-house test failure). This is now fixed so both variables are kept in sync.

We originally considered avoiding the duplication by replacing all usages of "gtmrecv_send_cmp2uncmp" with "curr_conn_state.send_cmp2uncmp", but that was not trivial: the "conn_state_t" type definition would have had to move from gtmrecv_poll_actions.c into a header file so gtmrecv_process.c could also inherit it. Since this is an edge case, and there was a risk of other subtle regressions, the simpler change was adopted even though it is less desirable.
nars1
added a commit
that referenced
this issue
May 8, 2018
Final Release Note
Initiating replication connections between Source and Receiver Servers is more efficient. Previously, in rare cases, the Source Server unnecessarily disconnected and reconnected. [YDB#136]
Description
If the source server is started with a non-zero compression level (-cmplvl qualifier) and the receiver server is started with a zero compression level, the receiver server, as part of its initial handshake, sends a REPL_CMP2UNCMP message to the source server to tell it to send uncompressed records in the replication pipe. In rare cases, the source server can close the connection for other reasons (e.g., a heartbeat from the receiver was not received for a while) before this message is sent.

In that case, the receiver server could incorrectly send the REPL_CMP2UNCMP message as part of the initial handshake with the source server on a new connection (which is established soon after the source server closes the old one). If this happens, the source server could be confused by this unexpected message at the start of a connection and close the connection once more, with a message of the form "UNKNOWN msg (type = ...) received when waiting for msg (type = ...). Closing connection." in the source server log file.
The fix is that the receiver server no longer sends the REPL_CMP2UNCMP message (which was needed only for the old connection) the moment a new connection is established with the source server.
Draft Release Note
The replication receiver server sends the correct handshake messages on a new connection with the source server. In YottaDB r1.10 and before, it was possible in rare cases for the source server to unnecessarily disconnect the connection.