Source server spin loops sometimes (instead of issuing FILTERNOTALIVE error) when external filter program terminates #264

nars1 · 2018-05-29T20:34:51Z

Final Release Note

Description

When the replication source server is started with an external filter M program and that filter terminates prematurely, it is possible for the source server to go into a spin loop where it invokes the read() system call in repl_filter_recv_line(). In this case, the read() would return a length of 0 (to indicate EOF in the pipe since the filter program that writes to the pipe is gone) but that is not currently handled correctly so we loop back to do the read(). In some cases (if we have sent a full transaction), we start a timer for $ydb_repl_filter_timeout seconds (default of 64 seconds) and that fortunately terminates the loop but if we have not yet sent a full transaction (e.g. the pipe is full and the filter needs to read some data for us to send more), the source server will loop eternally.

Draft Release Note

The replication source server correctly issues a FILTERNOTALIVE error in case it was started with an external filter (using the -FILTER qualifier) that terminates abnormally. Previously, it was possible for the source server to spin-loop waiting for a response from the long-dead filter program.

Not doing so could cause the source server to loop indefinitely in case the filter program terminates prematurely.

The source server loops indefinitely with r1.22 x86_64 and arm linux with this revised testcase. But passes with the code fix for YottaDB#264. Whereas the previous test case used to fail rarely only on arm linux and pass always on x86_64 linux.

The source server loops indefinitely with r1.22 x86_64 and arm linux with this revised testcase. But passes with the code fix for YottaDB#264. Whereas the previous test case used to fail rarely only on arm linux and pass always on x86_64 linux. The enhanced test case (which includes a hang in notalive.m) is done randomly so we exercise the previous test codepath too.

The source server loops indefinitely with r1.22 x86_64 and arm linux with this revised testcase. But passes with the code fix for #264. Whereas the previous test case used to fail rarely only on arm linux and pass always on x86_64 linux. The enhanced test case (which includes a hang in notalive.m) is done randomly so we exercise the previous test codepath too.

…BRT/SIG-6 errors The check for whether the temporary file name is longer than the allocated buffer can hold was incorrect. It was assuming 4 bytes (SIZEOF(uint4)) for the string that followed the hexadecimal process-id but the actual name had _XXXXXX following it which accounts for 7 bytes. For example, the actual temporary file name was DEFAULT_17fa_XXXXXX (where XXXXXX was replaced with a temporary string by "mkstemp") but the check was accounting only for DEFAULT_17fa_XXX. So there was a potential for a 3-byte overflow to go unnoticed. This logic is now reworked to use SNPRINTF to let us accurately know if the allocated buffer is not enough and if so issue a FILENAMETOOLONG error. In addition, the FILENAMETOOLONG error is now prefixed with a FILEPARSE error message which prints the entire path name that exceeded the 255 byte limit. Previously this only printed the absolute directory path without including the temporary file name. This could be potentially confusing to the user since they would see the directory path printed in the FILEPARSE error message is less than 255 characters but still an error is being issued saying it is longer than 255 characters.

…-6 errors The check for whether the temporary file name is longer than the allocated buffer can hold was incorrect. It was assuming 4 bytes (SIZEOF(uint4)) for the string that followed the hexadecimal process-id but the actual name had _XXXXXX following it which accounts for 7 bytes. For example, the actual temporary file name was DEFAULT_17fa_XXXXXX (where XXXXXX was replaced with a temporary string by "mkstemp") but the check was accounting only for DEFAULT_17fa_XXX. So there was a potential for a 3-byte overflow to go unnoticed. This logic is now reworked to use SNPRINTF to let us accurately know if the allocated buffer is not enough and if so issue a FILENAMETOOLONG error. In addition, the FILENAMETOOLONG error is now prefixed with a FILEPARSE error message which prints the entire path name that exceeded the 255 byte limit. Previously this only printed the absolute directory path without including the temporary file name. This could be potentially confusing to the user since they would see the directory path printed in the FILEPARSE error message is less than 255 characters but still an error is being issued saying it is longer than 255 characters.

1) / is not guaranteed to be last byte in tempdir_full.addr in various cases. So the / that was removed from the prior #267 commit is inserted back. 2) Previously, tempdir_full.addr held the fully constructed name but it no longer does after the prior #267 commit. So tempdir_full.addr usages after tempfilename has been constructed need to be replaced with tempfilename. 3) One use of orig_buff_size was replaced with SIZEOF(tempfilename) since the latter is what got used as the maximum size in the SNPRINTF call. Both the values hold the same value so there is no issue but still it is better/safer to be consistent.

…le name to keep its length at 4-bytes even if the process_id is less than 2**24

…; pid is a 4-byte number so need 8x (not 4x) in SNPRINTF format string * Zero pad the hexadecimal pid in the backup temporary file name to keep its length at 4-bytes even if the process_id is less than 2**24 * 4-byte pid translates to 8 hexadecimal digits so need 8x instead of 4x in the SNPRINTF format string. * Remove duplication of the format string by moving it to a macro TEMP_FILE_FMT_STRING. This fixes a few differences in the format string used in the two places.

nars1 added the upstream issue label May 29, 2018

nars1 added this to the r124 milestone May 29, 2018

nars1 self-assigned this May 29, 2018

nars1 added a commit to nars1/YottaDB that referenced this issue May 29, 2018

[YottaDB#264] Handle return value of 0 from read() i.e. EOF on pipe fd

2b2e45e

Not doing so could cause the source server to loop indefinitely in case the filter program terminates prematurely.

nars1 added a commit that referenced this issue May 30, 2018

[#264] Handle return value of 0 from read() i.e. EOF on pipe fd

035ec59

Not doing so could cause the source server to loop indefinitely in case the filter program terminates prematurely.

nars1 closed this as completed May 30, 2018

nars1 added a commit to nars1/YottaDB that referenced this issue Jun 1, 2018

[YottaDB#264] Zero pad the hexadecimal pid in the backup temporary fi…

7833fd4

…le name to keep its length at 4-bytes even if the process_id is less than 2**24

nars1 added a commit to nars1/YottaDB that referenced this issue Jun 1, 2018

[YottaDB#264] Address review comments

52ad3fb

This was referenced Jun 1, 2018

[#264] Fix buffer overflow in MUPIP BACKUP that can cause SIGABRT/SIG-6 errors #268

Merged

[#264] Zero pad the hexadecimal pid in the backup temporary file name to keep its length at 4-bytes even if the process_id is less than 2**24 #269

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Source server spin loops sometimes (instead of issuing FILTERNOTALIVE error) when external filter program terminates #264

Source server spin loops sometimes (instead of issuing FILTERNOTALIVE error) when external filter program terminates #264

nars1 commented May 29, 2018

Source server spin loops sometimes (instead of issuing FILTERNOTALIVE error) when external filter program terminates #264

Source server spin loops sometimes (instead of issuing FILTERNOTALIVE error) when external filter program terminates #264

Comments

nars1 commented May 29, 2018

Final Release Note

Description

Draft Release Note