
Source server spin loops sometimes (instead of issuing FILTERNOTALIVE error) when external filter program terminates #264

Closed
nars1 opened this issue May 29, 2018 · 0 comments
nars1 commented May 29, 2018

Final Release Note

Description

When the replication source server is started with an external filter M program and that filter terminates prematurely, the source server can go into a spin loop around the read() system call in repl_filter_recv_line(). In this case read() returns a length of 0, indicating EOF on the pipe since the filter program that writes to it is gone, but that return value is not handled correctly, so the loop immediately reissues the read(). If a full transaction has already been sent, a timer for $ydb_repl_filter_timeout seconds (default 64 seconds) fortunately terminates the loop; but if a full transaction has not yet been sent (e.g. the pipe is full and the filter needs to read some data before more can be sent), the source server loops eternally.

Draft Release Note

The replication source server now correctly issues a FILTERNOTALIVE error when it is started with an external filter (using the -FILTER qualifier) that terminates abnormally. Previously, the source server could spin-loop waiting for a response from the long-dead filter program.

@nars1 nars1 added this to the r124 milestone May 29, 2018
@nars1 nars1 self-assigned this May 29, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue May 29, 2018
Not doing so could cause the source server to loop indefinitely in case the filter program
terminates prematurely.
nars1 added a commit that referenced this issue May 30, 2018
@nars1 nars1 closed this as completed May 30, 2018
nars1 added a commit to nars1/YottaDBtest that referenced this issue May 31, 2018
With this revised test case, the source server loops indefinitely on r1.22 x86_64 and ARM Linux, but passes with the code fix for YottaDB#264. The previous test case failed only rarely, and only on ARM Linux, while always passing on x86_64 Linux.
nars1 added a commit to nars1/YottaDBtest that referenced this issue May 31, 2018

The enhanced test case (which includes a hang in notalive.m) runs randomly, so
the previous test code path is exercised too.
nars1 added a commit to nars1/YottaDBtest that referenced this issue May 31, 2018
nars1 added a commit to YottaDB/YDBTest that referenced this issue May 31, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue May 31, 2018
…BRT/SIG-6 errors

The check for whether the temporary file name is longer than the allocated buffer can hold
was incorrect. It assumed 4 bytes (SIZEOF(uint4)) for the string following the hexadecimal
process-id, but the actual name has _XXXXXX following it, which accounts for 7 bytes. For
example, the actual temporary file name was DEFAULT_17fa_XXXXXX (where XXXXXX is replaced
with a temporary string by "mkstemp"), but the check accounted only for DEFAULT_17fa_XXX,
so a 3-byte overflow could go unnoticed. This logic is now reworked to use SNPRINTF, which
accurately tells us whether the allocated buffer is too small; if so, a FILENAMETOOLONG
error is issued.

In addition, the FILENAMETOOLONG error is now prefixed with a FILEPARSE error message
that prints the entire path name exceeding the 255-byte limit. Previously, only the
absolute directory path was printed, without the temporary file name. This could confuse
the user: the directory path shown in the FILEPARSE error message was under 255
characters, yet an error claimed it was longer than 255 characters.
nars1 added a commit that referenced this issue May 31, 2018
…-6 errors

nars1 added a commit that referenced this issue May 31, 2018
1) / is not guaranteed to be last byte in tempdir_full.addr in various cases.
   So the / that was removed from the prior #267 commit is inserted back.

2) Previously, tempdir_full.addr held the fully constructed name but it no longer does
   after the prior #267 commit. So tempdir_full.addr usages after tempfilename has been
   constructed need to be replaced with tempfilename.

3) One use of orig_buff_size was replaced with SIZEOF(tempfilename), since the latter
   is what was used as the maximum size in the SNPRINTF call. Both hold the same value,
   so there is no functional issue, but it is safer to be consistent.
nars1 added a commit to nars1/YottaDB that referenced this issue Jun 1, 2018
…le name to keep its length at 4-bytes even if the process_id is less than 2**24
nars1 added a commit to nars1/YottaDB that referenced this issue Jun 1, 2018
nars1 added a commit that referenced this issue Jun 1, 2018
…; pid is a 4-byte number so need 8x (not 4x) in SNPRINTF format string

* Zero pad the hexadecimal pid in the backup temporary file name to keep its length at 4-bytes even if the process_id is less than 2**24

* 4-byte pid translates to 8 hexadecimal digits so need 8x instead of 4x in the SNPRINTF format string.

* Remove duplication of the format string by moving it to a macro TEMP_FILE_FMT_STRING. This fixes a few differences in the format string used in the two places.