
Establish replication connections more efficiently in an edge case #136

Closed
nars1 opened this issue Jan 26, 2018 · 0 comments
nars1 commented Jan 26, 2018

Final Release Note

Initiating replication connections between Source and Receiver Servers is more efficient. Previously, in rare cases, the Source Server unnecessarily disconnected and reconnected. [YDB#136]

Description

If the source server is started with a non-zero compression level (-cmplvl qualifier) and the receiver server is started with a zero compression level, the receiver server, as part of its initial handshake, sends a REPL_CMP2UNCMP message to the source server to let it know to send uncompressed records in the replication pipe. In rare cases, before this message is sent, the source server can close the connection for other reasons (e.g., a heartbeat from the receiver was not received for a while). In that case, the receiver server could incorrectly send the REPL_CMP2UNCMP message as part of the initial handshake with the source server on a new connection (which is established soon after the source server closes the old connection). If this happens, the source server gets confused by this unexpected message at the start of a connection and closes the connection once more, with a message of the form "UNKNOWN msg (type = ...) received when waiting for msg (type = ...). Closing connection." in the source server log file.

The fix ensures that the receiver server does not send the REPL_CMP2UNCMP message (which applied only to the old connection) the moment a new connection is established with the source server.

Draft Release Note

The replication receiver server sends the correct handshake messages on a new connection with the source server. In YottaDB r1.10 and before, it was possible in rare cases for the source server to unnecessarily disconnect the connection.

@nars1 nars1 added this to the r120 milestone Jan 26, 2018
@nars1 nars1 self-assigned this Jan 26, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue Jan 26, 2018
…ts reset; Not doing so could affect new connection when it gets established

This issue was observed from a test failure (in the merge/tp_stress subtest) during internal testing. The primary failure symptom was an assert failure in the source server:

  %GTM-F-ASSERT, Assert failed in sr_unix/gtmsource_process_ops.c line 1167 for expression (FALSE)

The source server assert failed because it was expecting a REPL_INSTINFO message
(msgtype == 29) but instead got a REPL_CMP2UNCMP message (msgtype == 25).

The merge/tp_stress subtest induces a REPL_CMP2UNCMP situation using white-box code so the fact
that the receiver sent a REPL_CMP2UNCMP message is not surprising. But the timing of the send
is where the problem is.

The issue is that whether a CMP2UNCMP message needs to be sent is tracked in a static variable "send_cmp2uncmp", which was set to TRUE (in gtmrecv_poll_actions.c) but never cleared before moving on to the new connection. This leftover static variable caused the incorrect send and resulted in the assert failure of the source server. In pro, the source server would have issued an error message, closed the connection, and opened a new connection with the receiver server, so the user would not notice any issue (in the sense that the source server would not go down), even though the source server log would show evidence of an unnecessary disconnect/reconnect sequence.

As part of the connection reset, we currently reset a few static variables of the old connection but do not reset the send_cmp2uncmp variable. There are a few other statics in use as well, and they are best cleared whenever a connection gets reset.

So that is how this fix proceeds. All statics related to the current connection are bundled up
in a conn_state_t structure and maintained in the "curr_conn_state" static variable. Anytime
a connection gets reset, the entire structure gets cleared.
nars1 added a commit that referenced this issue Jan 26, 2018
@nars1 nars1 closed this as completed Jan 26, 2018
@ksbhaskar ksbhaskar changed the title Replication source server could unnecessarily close (and reopen) its connection with the receiver server in rare cases Establish replication connections more quickly in an edge case Mar 19, 2018
@ksbhaskar ksbhaskar changed the title Establish replication connections more quickly in an edge case Establish replication connections more efficiently in an edge case Mar 19, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue May 8, 2018
…ssed out in YottaDB#136) soon after connection gets reset

Issue YottaDB#136 moved all state-related information in gtmrecv_poll_actions.c into a structure
variable "curr_conn_state". But there was another global variable, "gtmrecv_send_cmp2uncmp",
which was set in gtmrecv_process.c and mirrored "curr_conn_state.send_cmp2uncmp".
This meant that whenever "curr_conn_state.send_cmp2uncmp" got cleared (due to the memset
change introduced in YottaDB#136), "gtmrecv_send_cmp2uncmp" should also have been cleared.
Not doing so meant it was still possible for a stale state variable from the old connection
to be carried over to a new connection, resulting in an assert failure in the
source server in rare cases (just like YottaDB#136) in dbg, or a connection loss and reconnect
sequence in pro (this did happen in one in-house test failure). This is now fixed
so both variables are kept in sync.

The original idea was to avoid the duplication by replacing all usages of
"gtmrecv_send_cmp2uncmp" with "curr_conn_state.send_cmp2uncmp", but that was not trivial:
it required moving the "conn_state_t" type definition from gtmsource_poll_actions.c
into a header file so gtmrecv_process.c could also use that definition. Since this
is an edge case, and there was a risk of other subtle regressions, the simpler change
was adopted even though it is not as desirable.
nars1 added a commit that referenced this issue May 8, 2018