-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create linker map files for all executables (including libgtmshr) to … #27
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…help with debugging undefined symbol issues in build
estess
approved these changes
Aug 31, 2017
chathaway-codes
pushed a commit
that referenced
this pull request
Nov 18, 2018
…secondary errors if primary error is out-of-memory If already exiting, do not open any object/source directories (which could include relinkctl files) as part of $ZROUTINES initialization. This avoids potentially nasty codepaths particulary if the reason we are exiting is an out-of-memory. We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered necessary to create a user-visible issue for this. For example, below are two C-stacks that showed up in core dumps while running the simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid such cores. Core1 ------ Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no space allocating the C-stack in this core. (gdb) where #0 ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41 #1 util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731 #2 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871 #3 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #4 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261 #5 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #6 rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159 #7 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #8 relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679 #9 relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159 #12 zro_init () at sr_port/zro_init.c:51 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108 #29 gvnset () at fatalerror.c:56 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #31 main () at fatalerror.c:32 Core2 ----- In this case there is a SIG-11 deep inside syslog(). Most likely due to an out-of-memory situation. Program terminated with signal SIGSEGV, Segmentation fault. #0 vfprintf () from /usr/lib64/libc.so.6 #1 fprintf () from /usr/lib64/libc.so.6 #2 __vsyslog_chk () from /usr/lib64/libc.so.6 #3 syslog () from /usr/lib64/libc.so.6 #4 util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761 #5 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871 #6 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #7 send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149 #8 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79 #9 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159 #17 zro_init () at sr_port/zro_init.c:51 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108 #34 gvnset () at fatalerror.c:56 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #36 main () at fatalerror.c:32
chathaway-codes
pushed a commit
that referenced
this pull request
Nov 21, 2018
…CK being called during exit handling When a C program that spawned off multiple threads that used the SimpleThreadAPI (e.g. ydb_tp_st() etc.) was deadlocked (due to a code issue), pressing Ctrl-C (SIGINT) did nothing so pressing Ctrl-\ (SIGQUIT) to terminate the C program caused a MAXRTSERRDEPTH fatal error and resulted in a core dump. Below is the actual output. ^C^\%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with coreQuit (core dumped) The corresponding C-stack follows. (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52090) at sr_unix/rts_error.c:144 #3 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52270) at sr_unix/rts_error.c:146 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52450) at sr_unix/rts_error.c:146 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #8 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52630) at sr_unix/rts_error.c:146 #9 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #10 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52810) at sr_unix/rts_error.c:146 #11 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #12 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df529f0) at sr_unix/rts_error.c:146 #13 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #14 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52bd0) at sr_unix/rts_error.c:146 #15 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #16 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52db0) at sr_unix/rts_error.c:146 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #18 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52f90) at sr_unix/rts_error.c:146 #19 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #20 send_msg_va (csa=0x0, arg_count=8, var=0x7fb28df53570) at sr_unix/send_msg.c:125 #21 send_msg_csa (csa=0x0, arg_count=8) at sr_unix/send_msg.c:84 #22 generic_signal_handler (sig=3, info=0x7fb28df53830, context=0x7fb28df53700) at sr_unix/generic_signal_handler.c:244 #23 <signal handler called> #24 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb2880180a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #25 __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb288018040, cond=0x7fb288018080) at pthread_cond_wait.c:502 #26 __pthread_cond_wait (cond=0x7fb288018080, mutex=0x7fb288018040) at pthread_cond_wait.c:655 #27 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:80 #28 start_thread (arg=0x7fb28df54700) at pthread_create.c:463 #29 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The primary error was at #20 in send_msg_va() inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro. The actual assert that failed inside the macro was the following. sr_unix/gtm_multi_thread.h --------------------------- 99 /* We should never use pthread_* calls inside a signal/timer handler. Assert that */ \ 100 assert(!in_nondeferrable_signal_handler); \ We were in a signal handler handling a non-deferrable signal (Ctrl-\ aka SIGQUIT) and are about to do a pthread_mutex_lock() library call which is a no-no. If we are in an exit handler, it is possible for send_msg() to be needed (to log the signal that was received etc.) but it is safer to not do any pthread activity since we cannot be sure if we are exiting while inside a signal handler or not. Therefore the fix for this is to check if "process_exiting" global variable is TRUE and if so, we skip all pthread* calls in the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED macros.
chathaway-codes
pushed a commit
that referenced
this pull request
Jun 12, 2019
In one v60000/gtm4525b subtest run using imptpgo.go, a process assert failed. > %YDB-F-ASSERT, Assert failed in sr_port/tp_clean_up.c line 104 for expression (!update_trans) Below is the C-stack #0 __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 ch_cond_core () at sr_unix/ch_cond_core.c:77 #3 rts_error_va () at sr_unix/rts_error.c:192 #4 rts_error_csa () at sr_unix/rts_error.c:99 #5 tp_clean_up () at sr_port/tp_clean_up.c:104 #6 op_trollback () at sr_port/op_trollback.c:149 #7 t_abort () at sr_port/t_abort.c:53 #8 secshr_db_clnup () at sr_port/secshr_db_clnup.c:568 #9 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:212 #10 __run_exit_handlers () at exit.c:83 #11 __GI_exit () at exit.c:105 #12 gtm_image_exit () at sr_unix/gtm_image_exit.c:27 #13 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:489 #14 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:526 #15 wcs_wtstart () at sr_unix/wcs_wtstart.c:702 #16 wcs_stale () at sr_port/t_end_sysops.c:1387 #17 timer_handler () at sr_unix/gt_timers.c:834 #18 <signal handler called> #19 __clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:45 #20 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:503 #21 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:526 #22 t_retry () at sr_port/t_retry.c:183 #23 t_end () at sr_port/t_end.c:1874 #24 gvcst_bmp_mark_free () at sr_port/gvcst_bmp_mark_free.c:215 #25 gvcst_expand_free_subtree () at sr_port/gvcst_expand_free_subtree.c:182 #26 op_tcommit () at sr_port/op_tcommit.c:581 #27 stkok3 () at sr_armv7l/opp_tcommit.s:38 (gdb) f 5 #5 0xb66ef674 in tp_clean_up (clnup_state=TP_ROLLBACK) at /Distrib/YottaDB/V998_R124/sr_port/tp_clean_up.c:104 104 assert(!update_trans); 100 if (tp_pointer->implicit_tstart) 101 { /* Resetting this is necessary to avoid blowing an assert in t_begin that it is 0 at the start of a transaction. */ 102 update_trans = 0; 103 } else 104 assert(!update_trans); (gdb) p process_exiting $4 = 1 The assert at line 104 is now enhanced to allow for the "process_exiting" case. A comment has been added to the code to explain why this is okay.
nars1
added a commit
that referenced
this pull request
May 4, 2020
…d malloc issues * We had an in-house test failure on an ARMV6L box with the following diff. ```diff > ideminter_rolrec_0/mupipstop_rollback_or_recover/impjob_imptp0.mje5 > %YDB-F-ASSERT, Assert failed in sr_port/gtm_malloc_src.h line 695 for expression (FALSE) ``` Below is the C-stack at the time of the assert failure. ```gdb #0 __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 ch_cond_core () at sr_unix/ch_cond_core.c:77 #3 rts_error_va () at sr_unix/rts_error.c:192 #4 rts_error_csa () at sr_unix/rts_error.c:99 #5 gtm_malloc () at sr_port/gtm_malloc_src.h:695 #6 condstk_expand () at sr_unix/condstk_expand.c:53 #7 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:59 #8 deferred_signal_handler () at sr_port/deferred_signal_handler.c:57 #9 gtm_malloc () at sr_port/gtm_malloc_src.h:748 #10 iorm_use () at sr_unix/iorm_use.c:988 #11 iorm_open () at sr_unix/iorm_open.c:254 #12 io_open_try () at sr_unix/io_open_try.c:616 #13 op_open () at sr_port/op_open.c:160 #14 open_source_file () at sr_unix/source_file.c:253 #15 compiler_startup () at sr_port/compiler_startup.c:130 #16 compile_source_file () at sr_unix/source_file.c:173 #17 op_zcompile () at sr_port/op_zcompile.c:57 #18 gtm_trigger_complink () at sr_unix/gtm_trigger.c:451 #19 gtm_trigger () at sr_unix/gtm_trigger.c:551 #20 gvtr_match_n_invoke () at sr_unix/gv_trigger.c:1683 #21 gvcst_put2 () at sr_port/gvcst_put.c:2806 #22 gvcst_put () at sr_port/gvcst_put.c:299 #23 op_gvput () at sr_port/op_gvput.c:79 #24 ydb_set_s () at sr_unix/ydb_set_s.c:137 #25 ydb_set_st () at sr_unix/ydb_set_st.c:42 #26 _cgo_d187034042ca_Cfunc_ydb_set_st () at cgo-gcc-prolog:170 #27 runtime.asmcgocall () at /usr/lib/go-1.11/src/runtime/asm_arm.s:617 ``` * The cause of the assert failure is a nested call to `gtm_malloc()` (frames 9 and 5 above). And the reason that nested call happened is because the initial allocation of the condition handler stack size of 5 was not enough when `sr_unix/ydb_stm_invoke_deferred_signal_handler.c` tried to do an ESTABLISH and add one more condition handler (at frame number 7). This is because the condition handler stack was already used up with the following handlers. ```gdb (gdb) p chnd[0].ch $14 = (void (*)()) 0xb62f4f70 <stop_image_conditional_core> (gdb) p chnd[1].ch $15 = (void (*)()) 0xb63b0f10 <ydb_simpleapi_ch> (gdb) p chnd[2].ch $16 = (void (*)()) 0xb67f1e0c <gtm_trigger_complink_ch> (gdb) p chnd[3].ch $17 = (void (*)()) 0xb69c45e8 <source_ch> (gdb) p chnd[4].ch $18 = (void (*)()) 0xb6ce211c <compiler_ch> ``` * The initial condition handler stack size (controlled by the `CONDSTK_INITIAL_INCR` macro) is currently set to 5 (last changed from 2 to 5 as part of GT.M V6.3-000) for DEBUG builds and set to 8 for PRO/Release builds. * Due to YottaDB's use of SimpleAPI, this limit of 5 is clearly not enough (as shown by the above failure) so it is now being bumped to 8 for DEBUG and to 16 for PRO/Release builds (just to be safe).
nars1
added a commit
that referenced
this pull request
Aug 7, 2020
…that can fail in rare cases * The below assert failed once in 1500 r126/ydb464 subtest runs across a dozen boxes. ```c assert((1 != forced_exit) || GET_DEFERRED_EXIT_CHECK_NEEDED); #27 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 #29 wcs_stale (tid=8350064, hd_len=8, region=0x872fe8) at sr_port/t_end_sysops.c:1386 #30 timer_handler (why=14, info=0x7fadda7caea8 <stapi_signal_handler_oscontext+10728>, context=0x7fadda7caf28 <stapi_signal_handler_oscontext+10856>) at sr_unix/gt_timers.c:773 #31 ydb_os_signal_handler (sig=14, info=0x7ffdee359eb0, context=0x7ffdee359d80) at sr_unix/ydb_os_signal_handler.c:66 #32 <signal handler called> #33 generic_signal_handler (sig=2, info=0x7fadda7c9588 <stapi_signal_handler_oscontext+4296>, context=0x7fadda7c9608 <stapi_signal_handler_oscontext+4424>) at sr_unix/generic_signal_handler.c:244 #34 ydb_os_signal_handler (sig=2, info=0x7ffdee35a870, context=0x7ffdee35a740) at sr_unix/ydb_os_signal_handler.c:88 #35 <signal handler called> #36 clock_nanosleep () from /usr/lib64/libc.so.6 #37 m_usleep (useconds=1000) at sr_unix/sleep.c:28 #38 wcs_sleep (sleepfactor=1) at sr_port/wcs_sleep.c:28 #39 wcs_phase2_commit_wait (csa=0x7f4840, cr=0x7fadd7b88a50) at sr_port/wcs_phase2_commit_wait.c:268 #40 t_qread (blk=6, cycle=0x7ffdee35c568, cr_out=0x7ffdee35c558) at sr_port/t_qread.c:668 #41 gvcst_search (pKey=0x7f6040, pHist=0x0) at sr_port/gvcst_search.c:438 #42 gvcst_put2 (val=0x7fadda7d3940 <increment_delta_mval>, parms=0x7ffdee35f8b0) at sr_port/gvcst_put.c:980 #43 gvcst_put (val=0x7fadda7d3940 <increment_delta_mval>) at sr_port/gvcst_put.c:299 #44 gvcst_incr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/gvcst_incr.c:60 #45 op_gvincr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/op_gvincr.c:60 #46 ydb_incr_s (varname=0x7ffdee362b10, subs_used=0, subsarray=0x7ffdee3629d0, increment=0x7ffdee362aa0, ret_value=0x7ffdee362a90) at sr_unix/ydb_incr_s.c:156 #47 runProc (settings=0x7ffdee363490, curDepth=0) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:447 #48 runProc_driver (settings=0x7ffdee363490) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:145 #49 main () at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:93 (gdb) f 28 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 830 ENABLE_INTERRUPTS(INTRPT_IN_WCS_WTSTART, prev_intrpt_state); (gdb) p forced_exit $5 = 1 (gdb) p deferred_signal_handling_needed $6 = 0 ``` * This is because the signal handler for SIGALRM (sig=14) happened after line 183 but before line 193. ```c sr_port/have_crit.h ------------------- 183 forced_exit = 1; \ 184 if (in_os_signal_handler) \ 185 { /* If we are inside an OS signal handler and therefore had to defer exit \ 186 * handling, set "outofband" also to TRUE as this is checked by lots of \ 187 * potentially long-running commands in the runtime (e.g. HANG etc.) and we \ 188 * want all of those to automatically trigger process exit handling. \ 189 */ \ 190 outofband = deferred_signal; \ 191 } \ 192 /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */ \ 193 SET_DEFERRED_EXIT_CHECK_NEEDED; \ ``` * We cannot easily protect against this situation and so it is better to remove the assert.
nars1
added a commit
that referenced
this pull request
Aug 7, 2020
…signal_handling_needed; Fixes rare test failures Background ---------- * The `locks/locks_main` subtest failed on an ARMV7L Pi 3 system with the following assert in in-house testing (Note: this failure happened only once and is not easily reproducible in hundreds of test runs). ```diff > %YDB-F-ASSERT, Assert failed in sr_unix/util_output.c line 761 for expression (!GET_DEFERRED_EXIT_CHECK_NEEDED || (1 == forced_exit)) ``` * The C-stack from the core file looked as follows. ```c #25 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:134 #26 ch_cond_core () at sr_unix/ch_cond_core.c:80 #27 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192 #28 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #29 util_out_send_oper (addr=0x7ea37734 "%YDB-I-SUSPENDING, Process Received Signal 19. Suspending processing on user request or attempt to do terminal I/O while running in the background -- generated from 0x75F61ACC.", len=177) at sr_unix/util_output.c:761 #30 util_out_print_vaparm (message=0x0, flush=4, var=..., faocnt=2147483647) at sr_unix/util_output.c:880 #31 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:913 #32 send_msg_va (csa=0x0, arg_count=0, var=...) at sr_unix/send_msg.c:193 #33 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:86 #34 suspend (sig=19) at sr_unix/suspend.c:39 #35 deferred_signal_handler () at sr_port/deferred_signal_handler.c:86 #36 gtm_hiber_start_wait_any (hiber=1) at sr_unix/gt_timers.c:347 #37 op_lock2_common (timeout=9223372036854775800, laflag=64 '@') at sr_port/op_lock2.c:358 #38 op_lock2 (timeout=0xe8bca8, laflag=64 '@') at sr_port/op_lock2.c:158 #39 op_incrlock (timeout=0xe8bca8) at sr_port/op_incrlock.c:37 (gdb) f 29 #29 0x75fa3960 in util_out_send_oper (addr=0x7ea37734 "%YDB-I-SUSPENDING, Process Received Signal 19. Suspending processing on user request or attempt to do terminal I/O while running in the background -- generated from 0x75F61ACC.", len=177) at sr_unix/util_output.c:761 761 ENABLE_INTERRUPTS(INTRPT_IN_LOG_FUNCTION, prev_intrpt_state); (gdb) p deferred_signal_handling_needed $4 = 4294967292 (gdb) p/x deferred_signal_handling_needed $5 = 0xfffffffc ``` * As can be seen from the gdb output above, the global variable `deferred_signal_handling_needed` had a negative value (`0xfffffffc` is equal to `-4`). This global variable is supposed to always be 0 or a positive value indicating a bitmask of deferred events that need to be processed. A negative value implies an out-of-design state resulting in incorrect interpretations of the deferred events that need to be processed. The assert that failed indicates that `forced_exit` is not 1 (implying that a SIGTERM or similar fatal signal was not deferred) but the global variable `deferred_signal_handling_needed` indicates that such a fatal signal had been deferred implying a contradiction. * After a lot of review of the code, I have a theory as to what might have happened. Let us say the global variable `deferred_signal_handling_needed` had the initial value of 1 indicating there is a deferred timer event. The `CLEAR_DEFERRED_TIMERS_CHECK_NEEDED` macro first does a `GET_DEFERRED_TIMERS_CHECK_NEEDED` call and if that returns TRUE (indicating the bit denoting the deferred timer signal is set), it then does an `INTERLOCK_ADD` call to decrement the global variable by the value `DEFERRED_SIGNAL_HANDLING_TIMERS` (which is 1) effectively clearing the bit. But it is possible a timer (or any other signal) interrupt occurs right after the `GET_DEFERRED_TIMERS_CHECK_NEEDED` macro call but before the `INTERLOCK_ADD` call. And inside the signal handler this global variable could be updated to a different value. For example, `timer_handler()` (invoked if a SIGALRM signal is received) does a `CLEAR_DEFERRED_TIMERS_CHECK_NEEDED` macro call on its own which would clear the bit in the global variable and set it to the value 0. Once the `timer_handler()` call returns, the `INTERLOCK_ADD()` will now be doing a decrement of the global variable by the `DEFERRED_SIGNAL_HANDLING_TIMERS` value thinking it is clearing the bit without realizing the bit got cleared by a signal handler in between. This decrement will now take the global variable value to an out of design state of `-1`. Once there, the global variable can take on other out of design values like `-2` and `-4` etc. like what was seen in the core. Fix --- * The primary issue is that the `INTERLOCK_ADD()` macro is not aware of state changes to the global variable that occur after we examine the value of the global variable but before we do the addition/subtraction. * This is fixed by reworking that to instead use `COMPSWAP_LOCK` macro with a for loop that is retried in case the global variable value changes concurrently and retries the loop. * Took this opportunity to add some asserts and rework the macros in `have_crit.h` so they are simplified and code duplication is avoided. Also entailed changing the type of `deferred_signal_handling_needed` to `global_latch_t`. * Added a `GLOBAL_LATCH_VALUE` macro in `sr_port/mdef.h` to get the latch value.
nars1
added a commit
that referenced
this pull request
Mar 15, 2021
…ady started exiting * As part of a prior commit (SHA 723688c) various functions that started a timer (`wcs_clean_dbsync()`, `wcs_stale()` etc.) were fixed to not start one if we have already started exit processing. * One such timer function that should also have been fixed but was left out is `gtmsource_heartbeat_timer()`. We had an in-house test failure which failed an assert in `start_timer()` because `gtmsource_heartbeat_timer()` was being started while we had already started exit processing. Below is the C-stack of the failure for the record. ```c #0 __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va () at sr_unix/rts_error.c:192 #5 rts_error_csa () at sr_unix/rts_error.c:99 #6 start_timer () at sr_unix/gt_timers.c:433 #7 gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:74 #8 timer_handler () at sr_unix/gt_timers.c:889 #9 ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63 #10 <signal handler called> #11 __GI___libc_write () at ../sysdeps/unix/sysv/linux/write.c:26 #12 _IO_new_file_write () at fileops.c:1181 #13 new_do_write () at libioP.h:948 #14 _IO_new_file_xsputn () at fileops.c:1255 #15 _IO_new_file_xsputn () at fileops.c:1197 #16 __GI__IO_fwrite () at libioP.h:948 #17 gtm_fwrite () at sr_port/eintr_wrappers.h:334 #18 gtm_fprintf () at tdio.c:82 #19 util_out_print_vaparm () at sr_nix/util_output.c:876 #20 util_out_print () at sr_unix/util_output.c:914 #21 gtm_putmsg_csa () at sr_unix/gtm_putmsg.c:73 #22 gds_rundown () at sr_unix/gds_rundown.c:1060 #23 gv_rundown () at sr_port/gv_rundown.c:122 #24 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144 #25 __run_exit_handlers () at exit.c:108 #26 __GI_exit () at exit.c:139 #27 gtm_image_exit () at sr_unix/gtm_image_exit.c:27 #28 util_base_ch () at sr_port/util_base_ch.c:124 #29 gtmsource_ch () at sr_port/gtmsource_ch.c:96 #30 gtmsource_readfiles () at aDB/V999_R131/sr_unix/gtmsource_readfiles.c:2023 #31 gtmsource_get_jnlrecs () attaDB/V999_R131/sr_unix/gtmsource_process_ops.c:980 #32 gtmsource_process () at sr_unix/gtmsource_process.c:1546 #33 gtmsource () at sr_unix/gtmsource.c:525 #34 mupip_main () at sr_unix/mupip_main. #35 dlopen_libyottadb () at /Distri9_R131/sr_unix/dlopen_libyottadb.c:151 #36 main () at sr_unix/mupip.c:22 ``` * This failure is now fixed by checking `exit_handler_active` and if it is `TRUE` we skip starting this timer.
nars1
added a commit
that referenced
this pull request
Jan 31, 2022
…on a garbage file descriptor Background ---------- * This is a very rare test failure that was seen only once and on a slow ARM in-house box in internal testing. * The `stress/concurr` subtest failed with the following diff. ```diff --- concurr/concurr.diff --- 69a70,181 > host:REMOTE_SIDE:stress_1/concurr/stress_oli.out > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE) > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE) > %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown ``` * The assert failure created a core file with the following stack trace ```c #6 gtm_close (fd=1626061471) at sr_unix/gtm_fd_trace.c:185 #7 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192 #8 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:501 #9 gv_rundown () at sr_port/gv_rundown.c:122 #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144 #11 __run_exit_handlers (status=150374524, listp=0xffff9c805680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108 #12 __GI_exit (status=<optimized out>) at exit.c:139 #13 gtm_image_exit (status=150374524) at sr_unix/gtm_image_exit.c:27 #14 util_base_ch (arg=150374524) at sr_port/util_base_ch.c:124 #15 mu_int_ch (arg=150374524) at sr_unix/mu_int_ch.c:35 #16 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #18 gtm_close (fd=-559038737) at sr_unix/gtm_fd_trace.c:185 #19 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192 #20 jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:74 #21 timer_handler (why=0, info=0xffff9c65df68 <stapi_signal_handler_oscontext+47048>, context=0xffff9c65dff0 <stapi_signal_handler_oscontext+47184>, is_os_signal_handler=0) at sr_unix/gt_timers.c:889 #22 check_for_deferred_timers () at sr_unix/gt_timers.c:1267 #23 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78 #24 gtm_free (addr=0xaaaaffca1980) at sr_port/gtm_malloc_src.h:1038 #25 ss_release (ss_ctx=0xaaaaffc78910) at sr_unix/ss_release.c:226 #26 mupip_integ () at sr_port/mupip_integ.c:801 #27 mupip_main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip_main.c:122 #28 dlopen_libyottadb (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980, main_func=0xaaaae7dd6648 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #29 main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip.c:22 ``` Issue ----- * Frame 18 in the stack trace above indicates a `gtm_close()` call happening with an fd of `-559038737`. * Frame 6 in the stack trace above indicates a `gtm_close()` call happening with an fd of `fd=1626061471`. * The real issue is in Frame 26 in the stack trace above where we call `ss_release()`. The relevant code is pasted below. **sr_port/mupip_integ.c** ```c 799 assert(SNAPSHOTS_IN_PROG(csa)); 800 assert(NULL != csa->ss_ctx); 801 ss_release(&csa->ss_ctx); 802 CLEAR_SNAPSHOTS_IN_PROG(csa); ``` * Line 801 does the `ss_release()` call and Line 802 clears the flag in `csa` that records that a snapshot is in progress. * But `ss_release()` first calls `ss_context_destroy()` and then calls `free()` so it is possible that a timer interrupt gets handled in a deferred fashion right after the `free()` but before the `CLEAR_SNAPSHOTS_IN_PROG` macro gets executed. This means we would invoke `ss_destroy_context()` on the `csa->ss_ctx` structure again inside the timer. And that would be looking at an already freed context structure. Which can then explain why garbage values of `fd` got used in the `gtm_close()` calls. Fix --- * The fix is in `sr_port/mupip_integ.c` to clear all context in global variables that indicate a snapshot is in progress BEFORE calling `ss_release()`. * Additionally, the following files were changed since the warning text from `clang-tidy` changed a bit. While at it, I also verified that this warning is a false alarm. - ci/tidy_warnings_debug.ref - ci/tidy_warnings_release.ref
nars1
added a commit
that referenced
this pull request
Jul 25, 2023
… it can cause hang with CLANG/ASAN Background ---------- * While running the YDBOcto tests with CLANG, I noticed various tests hang. All of them had a similar stack-trace. ```c (gdb) where #0 __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) () #1 __sanitizer::Semaphore::Wait() () #2 __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) () #3 __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::PerClass*, __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long) () #4 __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, unsigned long, unsigned long) () #5 __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) () #6 __asan::asan_calloc(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*) () #7 calloc () #8 __pthread_attr_extension (attr=0x7f29af3cee48) at ./nptl/pthread_attr_extension.c:28 #9 __GI___pthread_attr_setaffinity_np (attr=attr@entry=0x7f29af3cee48, cpusetsize=cpusetsize@entry=32, cpuset=cpuset@entry=0x603000001b40) at ./nptl/pthread_attr_setaffinity.c:45 #10 __pthread_getattr_np (thread_id=139817006390848, attr=0x7f29af3cee48) at ./nptl/pthread_getattr_np.c:194 #11 __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) () #12 __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) () #13 __asan::PlatformUnpoisonStacks() () #14 __asan_handle_no_return () #15 generic_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:187 #16 ydb_os_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0) at sr_unix/ydb_os_signal_handler.c:85 #17 <signal handler called> #18 sched_yield () at ../sysdeps/unix/syscall-template.S:120 #19 __sanitizer::StopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), void*) () #20 __lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*) () #21 __GI___dl_iterate_phdr (callback=0x55bd48373320 <__lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*)>, data=0x7ffe13010eb8) at ./elf/dl-iteratephdr.c:74 #22 __lsan::LockStuffAndStopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), __lsan::CheckForLeaksParam*) () #23 __lsan::CheckForLeaks() () #24 __lsan::DoLeakCheck() () #25 __cxa_finalize (d=0x55bd483af128) at ./stdlib/cxa_finalize.c:83 #26 __do_global_dtors_aux () #27 ?? () #28 _dl_fini () at ./elf/dl-fini.c:142 ``` Issue ----- * The YottaDB SIG-15/SIGTERM signal handler got invoked for a SIG-15. But it noticed that all YottaDB exit handler code has already been run (`exit_handler_complete` global variable is TRUE). In that case, it invoked any non-YottaDB signal handler for SIG-15 and afterwards, it invoked `_exit()` to terminate the process (in line 187). **sr_unix/generic_signal_handler.c** ```c 182 if (exit_handler_complete) 183 { 184 if (!using_alternate_sighandling) /* Go does not send us signals so no need to forward */ 185 { 186 drive_non_ydb_signal_handler_if_any("generic_signal_handler1", sig, info, context, TRUE); 187 UNDERSCORE_EXIT(-sig); 188 } 189 return; /* Nothing we can do if exit handler has run */ 190 } ``` * And because of the `_exit()` all, the CLANG/ASAN library ended up doing a `calloc()` call which hung waiting for a futex. Most likely due to re-entrant invocations of C library functions that are not async-signal safe. * The cause of this is line 187 above in my opinion. * If YottaDB exit handler has already run (as part of SIGTERM handling) and we are getting the SIGTERM signal again, then I don't see any reason to do the `_exit()` call (using the `UNDERSCORE_EXIT` macro in line 187). * This code has been there for a long time but I don't think it is doing the right thing. Fix --- * Lines 184-188 are now removed in this commit. I think the right thing to do is to just return in case the YottaDB exit handler has already been invoked. * With this change, I verified that the CLANG/ASAN tests run fine in YDBOcto. So at least one Simple API use case runs fine with the fix in this commit. * Initially I thought of disabling lines 184-188 above only when ASAN is enabled. But then I realized it is a good change for all cases and so removed lines 184-188.
nars1
added a commit
that referenced
this pull request
Sep 11, 2023
… detect signal/timer handling Background ---------- * We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover` subtest failed with the following symptom. ```sh $ cat ROLLBACK1_3.logx mupip journal -ROLLBACK -back -verify -verbose "*" -noonline -resync=369813 -lost=ROLLBACK1_3.lost Sat Sep 9 04:17:18 PM EDT 2023 . . %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep 9 16:19:23 2023 %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF] %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495] %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep 9 16:19:23 2023 . . %YDB-I-MUINFOSTR, Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233 %YDB-I-MUINFOUINT4, Record Offset : 65744 [0x000100D0] %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use)) Sat Sep 9 04:20:35 PM EDT 2023 The time the mupip command took: 197 ``` * The core file corresponding to the above assert failure had the following stack trace. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110 #9 mur_close_files () at sr_port/mur_close_files.c:841 #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116 #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78 #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74 #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862 #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445 #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913 #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312 #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78 #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862 #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344 #21 op_tcommit () at sr_port/op_tcommit.c:535 #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323 #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362 #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400 #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122 #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158 #27 mupip_recover () at sr_port/mupip_recover.c:588 #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122 #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22 (gdb) p gtm_threadgbl_true->dio_buff.aligned $5 = 0x563c78429000 "GDSDYNUNX04" (gdb) p csd $6 = (sgmnt_data_ptr_t) 0x563c78429000 (gdb) p timer_in_handler $1 = 1 (gdb) p multi_thread_in_use $2 = 0 (gdb) p forced_exit $3 = 2 (gdb) p exit_handler_active $4 = 1 (gdb) p in_os_signal_handler $1 = 0 ``` Issue ----- * The assert failure was in the db_ipcs_reset() -> DB_LSEEKREAD -> DBG_CHECK_DIO_ALIGNMENT. * The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment. ```c 53 /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer \ 54 * code or in threaded code (as we have only ONE buffer to use). Assert that. \ 55 */ \ 56 assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use)); \ ``` * In the failure case, even though we are executing in timer code we are actually in exit handler code (as can be seen by the `forced_exit` and `exit_handler_active` variables in the gdb analysis above). In this case, the exit handler code will not return out of the timer code and so it is okay for the assert to not be TRUE. * The global variable being checked in the assert is `timer_in_handler`. This is where the issue is. That global variable being TRUE just means the `timer_handler()` function is in the current call stack. It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code. The assert is intended to protect against signal handler interrupting the mainline code. Therefore, the correct global variable to check in the assert is `in_os_signal_handler`. Fix --- * The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.
nars1
added a commit
that referenced
this pull request
Nov 15, 2023
…_port/deferred_events.c Background ---------- * The `v61000/intrpt_wcs_wtstart` subtest (in the YDBTest project) failed a few rare occasions during internal testing with the following symptom. ```diff 12a13,299 > v61000_0_22/intrpt_wcs_wtstart/mumps-wb.out > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events.c line 114 for expression (no_event == outofband || (event_type == outofband)) ``` Issue ----- * The stack trace and relevant details from the gdb core analysis are pasted below. ```c (gdb) where #0 pthread_kill () from /lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7ffcc56fd8c0) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114 #7 jobinterrupt_event (sig=10, info=0x7fb372b8a518 <stapi_signal_handler_oscontext+5528>, context=0x7fb372b8a598 <stapi_signal_handler_oscontext+5656>) at sr_port/jobinterrupt_event.c:61 #8 <signal handler called> #9 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 #10 m_usleep (useconds=10000) at sr_unix/sleep.c:37 #11 wcs_sleep (sleepfactor=6310) at sr_port/wcs_sleep.c:28 #12 wcs_flu (options=519) at sr_unix/wcs_flu.c:571 #13 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:632 #14 gv_rundown () at sr_port/gv_rundown.c:122 #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233 #16 signal_exit_handler (exit_handler_name=0x7fb372a19ecf "generic_signal_handler", sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78 #17 generic_signal_handler (sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:502 #18 ydb_os_signal_handler (sig=15, info=0x7ffcc56ffd30, context=0x7ffcc56ffc00) at sr_unix/ydb_os_signal_handler.c:88 #19 <signal handler called> #20 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 #21 m_usleep (useconds=999000) at sr_unix/sleep.c:37 #22 wcs_wtstart (region=0xc30970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:216 #23 wcs_timer_start (reg=0xc30970, io_ok=1) at sr_port/t_end_sysops.c:1346 #24 t_end (hist1=0xcfe798, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1848 #25 gvcst_put2 (val=0xc928b8, parms=0x7ffcc5709a80) at sr_port/gvcst_put.c:2796 #26 gvcst_put (val=0xc928b8) at sr_port/gvcst_put.c:302 #27 op_gvput (var=0xc928b8) at sr_port/op_gvput.c:79 (gdb) f 6 #6 xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114 114 assert(no_event == outofband || (event_type == outofband)); (gdb) p (enum outofbands)no_event $2 = no_event (gdb) p (enum outofbands)outofband $1 = deferred_signal (gdb) p (enum outofbands)event_type $3 = jobinterrupt ``` * The test sends a SIGTERM (i.e. SIG-15) signal. This caused `outofband` variable to be set to `deferred_signal` in frame 17 above (`generic_signal_handler.c` inside the `SET_FORCED_EXIT_STATE` macro). * And then the process was sleeping (due to a white-box test case in the test). * At that point, it was holding crit and another process was waiting for this and so was about to send a `MUTEXLCKALERT` message. At this point, since the test framework had set the `gtm_procstuckexec` env var to `com/gtmprocstuck_get_stack_trace.csh`, that was invoked and it in turn invoked `^%YDBPROCSTUCKEXEC` which in turn sent a `SIGUSR1` signal (i.e. a `mupip intrpt`) to this very same process that was sleeping while holding crit. * And at this point, the process got the assert failure because the `outofband` variable indicated that a `SIG-15` signal needs to be handled whereas the `event_type` variable indicated that the current out of band event is a `jobinterrupt` event. Fix --- * This seems like a valid scenario and I suspect the assert is invalid. * I noticed that this very same assert has been removed in a later GT.M release V7.1-001. ```diff $ cd YDB $ git show tags/V7.1-001 sr_port/deferred_events.c | head -35 | tail -8 @@ -127,7 +127,6 @@ boolean_t xfer_set_handlers(int4 event_type, int4 param_val, boolean_t popped_e } if (!already_ev_handling) { - assert(no_event == outofband || (event_type == outofband)); assert(!dollar_zininterrupt || (jobinterrupt != event_type)); if (entry != (TREF(save_xfer_root_ptr))->ev_que.fl) { /* no event in play so pend this one by jiggeriing the xfer_table */ ``` * I assume GT.M noticed a similar issue but not while releasing V7.0-001 (which is what YottaDB master currently has merged) but when releasing a much later V7.1-001 version and fixed it then. * Therefore, I am removing the assert that failed. * This should let the `v61000/intrpt_wcs_wtstart` test run fine until GT.M V7.1-001 gets merged into the YottaDB master branch.
nars1
added a commit
that referenced
this pull request
Jul 10, 2024
Background ---------- * In internal testing, we noticed a rare failure in the `v51000/mu_bkup_stop` subtest where a `mupip backup` process that was sent a `SIGTERM` (by the test) ended up creating a core file due to ASAN assert failing on a double free. * Below are relevant details from the core file. ```c Core was generated by `mupip backup -online -dbg * ./49181_online1'. Program terminated with signal SIGSEGV, Segmentation fault. (gdb) where #0 ydb_os_signal_handler (sig=11, info=0x7fd09968c3f0, context=0x7fd09968c2c0) at sr_unix/ydb_os_signal_handler.c:57 #1 <signal handler called> #2 ydb_os_signal_handler (sig=6, info=0x7fd09968caf0, context=0x7fd09968c9c0) at sr_unix/ydb_os_signal_handler.c:57 #3 <signal handler called> #4 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44 #5 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 #6 __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #7 __GI_abort () at ./stdlib/abort.c:79 #8 __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:143 #9 __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58 #10 __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffda6de6ebe, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:190 #11 __asan::ReportDoubleFree (addr=140533757257728, free_stack=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:224 #12 __asan::Allocator::ReportInvalidFree (this=<optimized out>, stack=0x7ffda6de79f0, chunk_state=<optimized out>, ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:757 #13 __interceptor_free (ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53 #14 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1485 #15 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854 #16 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501 #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 #18 mupip_backup_call_on_signal () at sr_port/mupip_backup.c:208 #19 signal_exit_handler (exit_handler_name=0x7fd097f1dda0 "deferred_exit_handler", sig=15, info=0x7fd098480fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fd098481058 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:67 #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95 #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486 #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854 #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501 #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 #26 mupip_backup () at sr_port/mupip_backup.c:1585 #27 mupip_main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip_main.c:130 #28 dlopen_libyottadb (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50, main_func=0x55af49fd9020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #29 main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip.c:21 (gdb) f 25 #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 103 free(ptr->backup_hdr); (gdb) f 17 #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 103 free(ptr->backup_hdr); (gdb) down #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501 1501 gtm_free_main(addr, TAIL_CALL_LEVEL); (gdb) down #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854 854 system_free(addr); (gdb) down #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486 1486 ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state); (gdb) list 1481 { 1482 intrpt_state_t prev_intrpt_state; 1483 1484 DEFER_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state); 1485 free(addr); 1486 ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state); 1487 return; 1488 } ``` Issue ----- * We did a `free(ptr->backup_hdr)` at line 103. And that in turn ended up using the system `free()` function because the test framework had randomly set the `gtmdbglvl` env var to a value of `0x80000000`. * So at line 1485 above, the system free finished but at line 1486 we noticed the SIGTERM that was deferred and so decided to handle it. But the `ptr->backup_hdr` variable was still set to a non-NULL value so as part of the deferred exit handler, we tried to free this again resulting in the double free. Fix --- * The fix is to note `ptr->backup_hdr` in a local variable and clear the former and then attempting the `free()` on the local variable. This way if we decide to do deferred exit handling after the `free()` occurred, we will notice a NULL value of `ptr->backup_hdr` and so avoid the double free. Notes ----- * This is considered a too rare a race condition to be encountered in practice and so it is expected to not be noticed by a user. Therefore no YDB issue is created for this.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
…help with debugging undefined symbol issues in build