Source server behaves correctly (no SIG-11) if started with an external filter followed by a deactivate and shutdown #30

nars1 · 2017-09-08T14:23:51Z

Final Release Note

The replication source server terminates normally in case it was started with an external filter and later transitioned from an active to a passive source server before being shut down. Previously, such a source server would terminate abnormally with a SIG-11. (YDB#30)

Description

If the replication source server is started with an external filter and connects to a receiver server and is later deactivated (i.e. changed from an active to a passive source server) and soon afterwards a source server shutdown command is issued, the source server terminates abnormally with a SIG-11.

Below is the output and the corresponding test script.

> source repl.csh
.
.
Wed Sep  6 17:27:22 2017 : REPL INFO - Last Seqno written in jnlpool : 0  Last Seqno sent : 0  Number of unsent Seqno : 0
Wed Sep  6 17:27:22 2017 : REPL INFO - Jnl Total : 0  Msg Total : 0  CmpMsg Total : 0
%GTM-F-KILLBYSIGSINFO1, MUPIP process 756 has been killed by a signal 11 at address 0x000000000057A901 (vaddr 0x0000000000000000)
%GTM-F-SIGMAPERR, Signal was caused by an address not mapped to an object
unset echo
.
.

> cat repl.csh 
set echo
set verbose
setenv gtm_repl_instance mumps.repl
cp $gtm_test/T998/com/extfilter.m .
@ port = 5001
# Setup primary
rm -f mumps.gld mumps.dat mumps.mjl* *.log* mumps.repl
gde exit; mupip create
mupip replicate -instance -name=INSTA
mupip set -replication=on -reg "*"
# Setup secondary
rm -rf tmp; mkdir tmp; cd tmp
rm -f mumps.gld mumps.dat mumps.mjl* *.log* mumps.repl
gde exit; mupip create
mupip replicate -instance -name=INSTB
mupip set -replication=on -reg "*"
mupip replic -source -start -passive -log=passive_source.log -buf=1 -instsecondary=INSTA
mupip replic -receiv -start -listen=$port -log=receiver.log -buf=2097152
cd ..
mupip replic -source -start -secondary=${HOST}:$port -log=source.log -buf=1 -instsecondary=INSTB -filter=\"$gtm_dist/mumps -run ^extfilter\"
set srcpid = `mupip replicate -source -checkhealth |& grep ^PID | awk '{print $2}'`
mupip replic -source -deactivate -instsecondary=INSTB -rootprimary
mupip replicate -source -shut -time=0
cat source.log
unset echo
unset verbose

Draft Release Note

The replication source server terminates normally in case it was started with an external filter and later transitioned from an active to a passive source server before being asked to shut down. Previously, the source server would terminate abnormally with a SIG-11.

The text was updated successfully, but these errors were encountered:

…secondary errors if primary error is out-of-memory If already exiting, do not open any object/source directories (which could include relinkctl files) as part of $ZROUTINES initialization. This avoids potentially nasty codepaths particulary if the reason we are exiting is an out-of-memory. We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered necessary to create a user-visible issue for this. For example, below are two C-stacks that showed up in core dumps while running the simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid such cores. Core1 ------ Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no space allocating the C-stack in this core. (gdb) where #0 ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41 #1 util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731 #2 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871 #3 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #4 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261 #5 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #6 rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159 #7 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #8 relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679 #9 relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159 #12 zro_init () at sr_port/zro_init.c:51 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108 #29 gvnset () at fatalerror.c:56 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #31 main () at fatalerror.c:32 Core2 ----- In this case there is a SIG-11 deep inside syslog(). Most likely due to an out-of-memory situation. Program terminated with signal SIGSEGV, Segmentation fault. #0 vfprintf () from /usr/lib64/libc.so.6 #1 fprintf () from /usr/lib64/libc.so.6 #2 __vsyslog_chk () from /usr/lib64/libc.so.6 #3 syslog () from /usr/lib64/libc.so.6 #4 util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761 #5 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871 #6 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #7 send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149 #8 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79 #9 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159 #17 zro_init () at sr_port/zro_init.c:51 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108 #34 gvnset () at fatalerror.c:56 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #36 main () at fatalerror.c:32

…that can fail in rare cases * The below assert failed once in 1500 r126/ydb464 subtest runs across a dozen boxes. ```c assert((1 != forced_exit) || GET_DEFERRED_EXIT_CHECK_NEEDED); #27 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 #29 wcs_stale (tid=8350064, hd_len=8, region=0x872fe8) at sr_port/t_end_sysops.c:1386 #30 timer_handler (why=14, info=0x7fadda7caea8 <stapi_signal_handler_oscontext+10728>, context=0x7fadda7caf28 <stapi_signal_handler_oscontext+10856>) at sr_unix/gt_timers.c:773 #31 ydb_os_signal_handler (sig=14, info=0x7ffdee359eb0, context=0x7ffdee359d80) at sr_unix/ydb_os_signal_handler.c:66 #32 <signal handler called> #33 generic_signal_handler (sig=2, info=0x7fadda7c9588 <stapi_signal_handler_oscontext+4296>, context=0x7fadda7c9608 <stapi_signal_handler_oscontext+4424>) at sr_unix/generic_signal_handler.c:244 #34 ydb_os_signal_handler (sig=2, info=0x7ffdee35a870, context=0x7ffdee35a740) at sr_unix/ydb_os_signal_handler.c:88 #35 <signal handler called> #36 clock_nanosleep () from /usr/lib64/libc.so.6 #37 m_usleep (useconds=1000) at sr_unix/sleep.c:28 #38 wcs_sleep (sleepfactor=1) at sr_port/wcs_sleep.c:28 #39 wcs_phase2_commit_wait (csa=0x7f4840, cr=0x7fadd7b88a50) at sr_port/wcs_phase2_commit_wait.c:268 #40 t_qread (blk=6, cycle=0x7ffdee35c568, cr_out=0x7ffdee35c558) at sr_port/t_qread.c:668 #41 gvcst_search (pKey=0x7f6040, pHist=0x0) at sr_port/gvcst_search.c:438 #42 gvcst_put2 (val=0x7fadda7d3940 <increment_delta_mval>, parms=0x7ffdee35f8b0) at sr_port/gvcst_put.c:980 #43 gvcst_put (val=0x7fadda7d3940 <increment_delta_mval>) at sr_port/gvcst_put.c:299 #44 gvcst_incr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/gvcst_incr.c:60 #45 op_gvincr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/op_gvincr.c:60 #46 ydb_incr_s (varname=0x7ffdee362b10, subs_used=0, subsarray=0x7ffdee3629d0, increment=0x7ffdee362aa0, ret_value=0x7ffdee362a90) at sr_unix/ydb_incr_s.c:156 #47 runProc (settings=0x7ffdee363490, curDepth=0) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:447 #48 runProc_driver (settings=0x7ffdee363490) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:145 #49 main () at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:93 (gdb) f 28 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 830 ENABLE_INTERRUPTS(INTRPT_IN_WCS_WTSTART, prev_intrpt_state); (gdb) p forced_exit $5 = 1 (gdb) p deferred_signal_handling_needed $6 = 0 ``` * This is because the signal handler for SIGALRM (sig=14) happened after line 183 but before line 193. ```c sr_port/have_crit.h ------------------- 183 forced_exit = 1; \ 184 if (in_os_signal_handler) \ 185 { /* If we are inside an OS signal handler and therefore had to defer exit \ 186 * handling, set "outofband" also to TRUE as this is checked by lots of \ 187 * potentially long-running commands in the runtime (e.g. HANG etc.) and we \ 188 * want all of those to automatically trigger process exit handling. \ 189 */ \ 190 outofband = deferred_signal; \ 191 } \ 192 /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */ \ 193 SET_DEFERRED_EXIT_CHECK_NEEDED; \ ``` * We cannot easily protect against this situation and so it is better to remove the assert.

…signal_handling_needed; Fixes rare test failures Background ---------- * The `locks/locks_main` subtest failed on an ARMV7L Pi 3 system with the following assert in in-house testing (Note: this failure happened only once and is not easily reproducible in hundreds of test runs). ```diff > %YDB-F-ASSERT, Assert failed in sr_unix/util_output.c line 761 for expression (!GET_DEFERRED_EXIT_CHECK_NEEDED || (1 == forced_exit)) ``` * The C-stack from the core file looked as follows. ```c #25 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:134 #26 ch_cond_core () at sr_unix/ch_cond_core.c:80 #27 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192 #28 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #29 util_out_send_oper (addr=0x7ea37734 "%YDB-I-SUSPENDING, Process Received Signal 19. Suspending processing on user request or attempt to do terminal I/O while running in the background -- generated from 0x75F61ACC.", len=177) at sr_unix/util_output.c:761 #30 util_out_print_vaparm (message=0x0, flush=4, var=..., faocnt=2147483647) at sr_unix/util_output.c:880 #31 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:913 #32 send_msg_va (csa=0x0, arg_count=0, var=...) at sr_unix/send_msg.c:193 #33 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:86 #34 suspend (sig=19) at sr_unix/suspend.c:39 #35 deferred_signal_handler () at sr_port/deferred_signal_handler.c:86 #36 gtm_hiber_start_wait_any (hiber=1) at sr_unix/gt_timers.c:347 #37 op_lock2_common (timeout=9223372036854775800, laflag=64 '@') at sr_port/op_lock2.c:358 #38 op_lock2 (timeout=0xe8bca8, laflag=64 '@') at sr_port/op_lock2.c:158 #39 op_incrlock (timeout=0xe8bca8) at sr_port/op_incrlock.c:37 (gdb) f 29 #29 0x75fa3960 in util_out_send_oper (addr=0x7ea37734 "%YDB-I-SUSPENDING, Process Received Signal 19. Suspending processing on user request or attempt to do terminal I/O while running in the background -- generated from 0x75F61ACC.", len=177) at sr_unix/util_output.c:761 761 ENABLE_INTERRUPTS(INTRPT_IN_LOG_FUNCTION, prev_intrpt_state); (gdb) p deferred_signal_handling_needed $4 = 4294967292 (gdb) p/x deferred_signal_handling_needed $5 = 0xfffffffc ``` * As can be seen from the gdb output above, the global variable `deferred_signal_handling_needed` had a negative value (`0xfffffffc` is equal to `-4`). This global variable is supposed to always be 0 or a positive value indicating a bitmask of deferred events that need to be processed. A negative value implies an out-of-design state resulting in incorrect interpretations of the deferred events that need to be processed. The assert that failed indicates that `forced_exit` is not 1 (implying that a SIGTERM or similar fatal signal was not deferred) but the global variable `deferred_signal_handling_needed` indicates that such a fatal signal had been deferred implying a contradiction. * After a lot of review of the code, I have a theory as to what might have happened. Let us say the global variable `deferred_signal_handling_needed` had the initial value of 1 indicating there is a deferred timer event. The `CLEAR_DEFERRED_TIMERS_CHECK_NEEDED` macro first does a `GET_DEFERRED_TIMERS_CHECK_NEEDED` call and if that returns TRUE (indicating the bit denoting the deferred timer signal is set), it then does an `INTERLOCK_ADD` call to decrement the global variable by the value `DEFERRED_SIGNAL_HANDLING_TIMERS` (which is 1) effectively clearing the bit. But it is possible a timer (or any other signal) interrupt occurs right after the `GET_DEFERRED_TIMERS_CHECK_NEEDED` macro call but before the `INTERLOCK_ADD` call. And inside the signal handler this global variable could be updated to a different value. For example, `timer_handler()` (invoked if a SIGALRM signal is received) does a `CLEAR_DEFERRED_TIMERS_CHECK_NEEDED` macro call on its own which would clear the bit in the global variable and set it to the value 0. Once the `timer_handler()` call returns, the `INTERLOCK_ADD()` will now be doing a decrement of the global variable by the `DEFERRED_SIGNAL_HANDLING_TIMERS` value thinking it is clearing the bit without realizing the bit got cleared by a signal handler in between. This decrement will now take the global variable value to an out of design state of `-1`. Once there, the global variable can take on other out of design values like `-2` and `-4` etc. like what was seen in the core. Fix --- * The primary issue is that the `INTERLOCK_ADD()` macro is not aware of state changes to the global variable that occur after we examine the value of the global variable but before we do the addition/subtraction. * This is fixed by reworking that to instead use `COMPSWAP_LOCK` macro with a for loop that is retried in case the global variable value changes concurrently and retries the loop. * Took this opportunity to add some asserts and rework the macros in `have_crit.h` so they are simplified and code duplication is avoided. Also entailed changing the type of `deferred_signal_handling_needed` to `global_latch_t`. * Added a `GLOBAL_LATCH_VALUE` macro in `sr_port/mdef.h` to get the latch value.

…ady started exiting * As part of a prior commit (SHA 723688c) various functions that started a timer (`wcs_clean_dbsync()`, `wcs_stale()` etc.) were fixed to not start one if we have already started exit processing. * One such timer function that should also have been fixed but was left out is `gtmsource_heartbeat_timer()`. We had an in-house test failure which failed an assert in `start_timer()` because `gtmsource_heartbeat_timer()` was being started while we had already started exit processing. Below is the C-stack of the failure for the record. ```c #0 __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va () at sr_unix/rts_error.c:192 #5 rts_error_csa () at sr_unix/rts_error.c:99 #6 start_timer () at sr_unix/gt_timers.c:433 #7 gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:74 #8 timer_handler () at sr_unix/gt_timers.c:889 #9 ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63 #10 <signal handler called> #11 __GI___libc_write () at ../sysdeps/unix/sysv/linux/write.c:26 #12 _IO_new_file_write () at fileops.c:1181 #13 new_do_write () at libioP.h:948 #14 _IO_new_file_xsputn () at fileops.c:1255 #15 _IO_new_file_xsputn () at fileops.c:1197 #16 __GI__IO_fwrite () at libioP.h:948 #17 gtm_fwrite () at sr_port/eintr_wrappers.h:334 #18 gtm_fprintf () at tdio.c:82 #19 util_out_print_vaparm () at sr_nix/util_output.c:876 #20 util_out_print () at sr_unix/util_output.c:914 #21 gtm_putmsg_csa () at sr_unix/gtm_putmsg.c:73 #22 gds_rundown () at sr_unix/gds_rundown.c:1060 #23 gv_rundown () at sr_port/gv_rundown.c:122 #24 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144 #25 __run_exit_handlers () at exit.c:108 #26 __GI_exit () at exit.c:139 #27 gtm_image_exit () at sr_unix/gtm_image_exit.c:27 #28 util_base_ch () at sr_port/util_base_ch.c:124 #29 gtmsource_ch () at sr_port/gtmsource_ch.c:96 #30 gtmsource_readfiles () at aDB/V999_R131/sr_unix/gtmsource_readfiles.c:2023 #31 gtmsource_get_jnlrecs () attaDB/V999_R131/sr_unix/gtmsource_process_ops.c:980 #32 gtmsource_process () at sr_unix/gtmsource_process.c:1546 #33 gtmsource () at sr_unix/gtmsource.c:525 #34 mupip_main () at sr_unix/mupip_main. #35 dlopen_libyottadb () at /Distri9_R131/sr_unix/dlopen_libyottadb.c:151 #36 main () at sr_unix/mupip.c:22 ``` * This failure is now fixed by checking `exit_handler_active` and if it is `TRUE` we skip starting this timer.

… detect signal/timer handling Background ---------- * We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover` subtest failed with the following symptom. ```sh $ cat ROLLBACK1_3.logx mupip journal -ROLLBACK -back -verify -verbose "*" -noonline -resync=369813 -lost=ROLLBACK1_3.lost Sat Sep 9 04:17:18 PM EDT 2023 . . %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep 9 16:19:23 2023 %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF] %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495] %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep 9 16:19:23 2023 . . %YDB-I-MUINFOSTR, Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233 %YDB-I-MUINFOUINT4, Record Offset : 65744 [0x000100D0] %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use)) Sat Sep 9 04:20:35 PM EDT 2023 The time the mupip command took: 197 ``` * The core file corresponding to the above assert failure had the following stack trace. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110 #9 mur_close_files () at sr_port/mur_close_files.c:841 #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116 #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78 #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74 #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862 #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445 #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913 #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312 #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78 #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862 #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344 #21 op_tcommit () at sr_port/op_tcommit.c:535 #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323 #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362 #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400 #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122 #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158 #27 mupip_recover () at sr_port/mupip_recover.c:588 #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122 #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22 (gdb) p gtm_threadgbl_true->dio_buff.aligned $5 = 0x563c78429000 "GDSDYNUNX04" (gdb) p csd $6 = (sgmnt_data_ptr_t) 0x563c78429000 (gdb) p timer_in_handler $1 = 1 (gdb) p multi_thread_in_use $2 = 0 (gdb) p forced_exit $3 = 2 (gdb) p exit_handler_active $4 = 1 (gdb) p in_os_signal_handler $1 = 0 ``` Issue ----- * The assert failure was in the db_ipcs_reset() -> DB_LSEEKREAD -> DBG_CHECK_DIO_ALIGNMENT. * The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment. ```c 53 /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer \ 54 * code or in threaded code (as we have only ONE buffer to use). Assert that. \ 55 */ \ 56 assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use)); \ ``` * In the failure case, even though we are executing in timer code we are actually in exit handler code (as can be seen by the `forced_exit` and `exit_handler_active` variables in the gdb analysis above). In this case, the exit handler code will not return out of the timer code and so it is okay for the assert to not be TRUE. * The global variable being checked in the assert is `timer_in_handler`. This is where the issue is. That global variable being TRUE just means the `timer_handler()` function is in the current call stack. It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code. The assert is intended to protect against signal handler interrupting the mainline code. Therefore, the correct global variable to check in the assert is `in_os_signal_handler`. Fix --- * The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.

nars1 added the bug label Sep 8, 2017

nars1 added this to the r110 milestone Sep 8, 2017

nars1 self-assigned this Sep 8, 2017

nars1 closed this as completed Oct 5, 2017

ksbhaskar changed the title ~~SIG-11 from source server if started with an external filter followed by a deactivate and shutdown~~ Source server behaves correctly (no SIG-11) if started with an external filter followed by a deactivate and shutdown Oct 25, 2017

nars1 added upstream issue and removed bug labels Dec 11, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Source server behaves correctly (no SIG-11) if started with an external filter followed by a deactivate and shutdown #30

Source server behaves correctly (no SIG-11) if started with an external filter followed by a deactivate and shutdown #30

nars1 commented Sep 8, 2017 •

edited by ksbhaskar

Loading

Source server behaves correctly (no SIG-11) if started with an external filter followed by a deactivate and shutdown #30

Source server behaves correctly (no SIG-11) if started with an external filter followed by a deactivate and shutdown #30

Comments

nars1 commented Sep 8, 2017 • edited by ksbhaskar Loading

Final Release Note

Description

Draft Release Note

nars1 commented Sep 8, 2017 •

edited by ksbhaskar

Loading