Change references to GT.M into references to YottaDB - shared libraries #23

Closed
ksbhaskar opened this issue Aug 22, 2017 · 0 comments

ksbhaskar commented Aug 22, 2017

Final Release Note

The file libyottadb.so contains the runtime logic that was previously in libgtmshr.so, which is now a relative symbolic link to libyottadb.so. Similarly, libyottadbutil.so contains the object code for utility routines, and libgtmutil.so is a relative symbolic link to libyottadbutil.so. If UTF-8 support is installed, a similar change also occurs in the utf8 subdirectory. There should be no change to the behavior of any application program or scripting that does not explicitly check the nature of libgtmshr.so and libgtmutil.so.

Description

This is one of several naming changes to allow YottaDB to have its own identity while sharing a code base with GT.M and remaining upward compatible. At the moment, it is a place to collect the needed changes so that they can be made at one time. The general pattern is that the file with the GT.M name is a symbolic link to the file with the YottaDB name. In each of the top level directory and lower level utf8 subdirectory, libgtmshr.so should be a relative pointer to libyottadb.so and libgtmutil.so should be a relative pointer to libyottadbutil.so.

Draft Release Note

The file libyottadb.so contains the runtime logic that was previously in libgtmshr.so, which is now a relative symbolic link to libyottadb.so. Similarly, libyottadbutil.so contains the object code for utility routines, and libgtmutil.so is a relative symbolic link to libyottadbutil.so. If UTF-8 support is installed, a similar change also occurs in the utf8 subdirectory. There should be no change to the behavior of any application program or scripting that does not explicitly check the nature of libgtmshr.so and libgtmutil.so.

nars1 added this to the r120 milestone Jan 8, 2018
nars1 self-assigned this Jan 26, 2018
nars1 removed the help wanted label Jan 26, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue Jan 26, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue Jan 26, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue Jan 26, 2018
nars1 added a commit to nars1/YottaDB that referenced this issue Jan 26, 2018
…libyottadb.so/libyottadbutil.so

Also for backward compatibility purposes
  a) Install libgtmshr.so as a soft link to libyottadb.so
  b) Install libgtmutil.so as a soft link to libyottadbutil.so
nars1 added a commit that referenced this issue Jan 28, 2018
…adb.so/libyottadbutil.so

Also for backward compatibility purposes
  a) Install libgtmshr.so as a soft link to libyottadb.so
  b) Install libgtmutil.so as a soft link to libyottadbutil.so
nars1 closed this as completed Jan 29, 2018
chathaway-codes pushed a commit that referenced this issue Nov 18, 2018
…secondary errors if primary error is out-of-memory

If already exiting, do not open any object/source directories (which could include relinkctl files)
as part of $ZROUTINES initialization. This avoids potentially nasty codepaths, particularly if the
reason we are exiting is an out-of-memory error.

We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered
necessary to create a user-visible issue for this.

For example, below are two C-stacks that showed up in core dumps while running the
simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid
such cores.

Core1
------
Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no
space left to allocate the C-stack in this core.

(gdb) where
 #0  ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41
 #1  util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731
 #2  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871
 #3  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #4  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261
 #5  gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #6  rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159
 #7  rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #8  relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679
 #9  relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159
 #12 zro_init () at sr_port/zro_init.c:51
 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694
 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232
 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152
 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159
 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145
 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698
 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192
 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60
 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108
 #29 gvnset () at fatalerror.c:56
 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #31 main () at fatalerror.c:32

Core2
-----
In this case there is a SIG-11 (SIGSEGV) deep inside syslog(), most likely due to an out-of-memory situation.

Program terminated with signal SIGSEGV, Segmentation fault.
 #0  vfprintf () from /usr/lib64/libc.so.6
 #1  fprintf () from /usr/lib64/libc.so.6
 #2  __vsyslog_chk () from /usr/lib64/libc.so.6
 #3  syslog () from /usr/lib64/libc.so.6
 #4  util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761
 #5  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871
 #6  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #7  send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149
 #8  send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79
 #9  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264
 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159
 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679
 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159
 #17 zro_init () at sr_port/zro_init.c:51
 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694
 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232
 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152
 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159
 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145
 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698
 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192
 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60
 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108
 #34 gvnset () at fatalerror.c:56
 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #36 main () at fatalerror.c:32
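
The guard described in this commit message (skip $ZROUTINES initialization once the process is already exiting) can be sketched roughly as follows. All names here are stand-ins chosen for illustration: `process_exiting` mimics the runtime's exit flag and `zro_init_stub()` stands in for the real `zro_init()`. The actual change in the runtime may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins; illustrative only, not the runtime's definitions. */
static bool process_exiting = false;  /* set once exit handling has begun */
static int  zro_init_calls = 0;       /* counts how often the stub ran */

/* Stand-in for zro_init(): in the real runtime this would open object/source
 * directories (and possibly relinkctl files), which may need memory.
 */
static void zro_init_stub(void)
{
    assert(!process_exiting);  /* must never be reached while exiting */
    zro_init_calls++;
}

/* Returns true if initialization ran, false if it was skipped. */
bool zroutines_init_if_safe(void)
{
    if (process_exiting)
        return false;  /* e.g. exiting after an out-of-memory error */
    zro_init_stub();
    return true;
}
```

A caller such as a ZSHOW-style dump would then have to tolerate $ZROUTINES being uninitialized when `zroutines_init_if_safe()` returns false.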
chathaway-codes pushed a commit that referenced this issue Nov 21, 2018
…CK being called during exit handling

When a C program that spawned multiple threads using the SimpleThreadAPI (e.g. ydb_tp_st() etc.)
was deadlocked (due to a code issue), pressing Ctrl-C (SIGINT) did nothing. Pressing Ctrl-\ (SIGQUIT)
to terminate the C program then caused a MAXRTSERRDEPTH fatal error and resulted in a core dump.

Below is the actual output.

^C^\%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with coreQuit (core dumped)

The corresponding C-stack follows.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52090) at sr_unix/rts_error.c:144
 #3  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52270) at sr_unix/rts_error.c:146
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52450) at sr_unix/rts_error.c:146
 #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #8  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52630) at sr_unix/rts_error.c:146
 #9  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #10 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52810) at sr_unix/rts_error.c:146
 #11 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #12 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df529f0) at sr_unix/rts_error.c:146
 #13 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #14 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52bd0) at sr_unix/rts_error.c:146
 #15 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #16 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52db0) at sr_unix/rts_error.c:146
 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #18 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52f90) at sr_unix/rts_error.c:146
 #19 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #20 send_msg_va (csa=0x0, arg_count=8, var=0x7fb28df53570) at sr_unix/send_msg.c:125
 #21 send_msg_csa (csa=0x0, arg_count=8) at sr_unix/send_msg.c:84
 #22 generic_signal_handler (sig=3, info=0x7fb28df53830, context=0x7fb28df53700) at sr_unix/generic_signal_handler.c:244
 #23 <signal handler called>
 #24 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb2880180a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #25 __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb288018040, cond=0x7fb288018080) at pthread_cond_wait.c:502
 #26 __pthread_cond_wait (cond=0x7fb288018080, mutex=0x7fb288018040) at pthread_cond_wait.c:655
 #27 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:80
 #28 start_thread (arg=0x7fb28df54700) at pthread_create.c:463
 #29 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The primary error was at #20 in send_msg_va() inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro.
The actual assert that failed inside the macro was the following.

sr_unix/gtm_multi_thread.h
---------------------------
     99                 /* We should never use pthread_* calls inside a signal/timer handler. Assert that */                    \
    100                 assert(!in_nondeferrable_signal_handler);                                                               \

We were in a signal handler handling a non-deferrable signal (Ctrl-\ aka SIGQUIT) and were about to do
a pthread_mutex_lock() library call, which is a no-no.

If we are in an exit handler, it is possible for send_msg() to be needed (to log the signal that was received
etc.) but it is safer to not do any pthread activity since we cannot be sure if we are exiting while inside
a signal handler or not. Therefore the fix for this is to check if "process_exiting" global variable is TRUE
and if so, we skip all pthread* calls in the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED
macros.
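
A minimal sketch of the idea in the fix above, assuming simplified macros: the `SKETCH_*` macro names, the `multi_thread_in_use` flag and `send_msg_sketch()` are invented for illustration and are not the contents of `sr_unix/gtm_multi_thread.h`. Only `process_exiting` is a name taken from the description.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for runtime state; illustrative only. */
static bool process_exiting = false;     /* set once exit handling has begun */
static bool multi_thread_in_use = true;  /* threads are active in this process */
static pthread_mutex_t sketch_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Take the mutex only if threads are in use AND we are not exiting. */
#define SKETCH_MUTEX_LOCK_IF_NEEDED(was_locked)                   \
do {                                                              \
    (was_locked) = false;                                         \
    if (multi_thread_in_use && !process_exiting)                  \
    {                                                             \
        int lock_status_ = pthread_mutex_lock(&sketch_mutex);     \
        assert(0 == lock_status_);                                \
        (void)lock_status_;                                       \
        (was_locked) = true;                                      \
    }                                                             \
} while (0)

/* Release the mutex only if the matching lock macro actually took it. */
#define SKETCH_MUTEX_UNLOCK_IF_NEEDED(was_locked)                 \
do {                                                              \
    if (was_locked)                                               \
    {                                                             \
        int unlock_status_ = pthread_mutex_unlock(&sketch_mutex); \
        assert(0 == unlock_status_);                              \
        (void)unlock_status_;                                     \
        (was_locked) = false;                                     \
    }                                                             \
} while (0)

/* Returns true if the mutex was taken around the (elided) logging work. */
bool send_msg_sketch(void)
{
    bool was_locked, took_lock;

    SKETCH_MUTEX_LOCK_IF_NEEDED(was_locked);
    took_lock = was_locked;
    /* ... format and log the message here ... */
    SKETCH_MUTEX_UNLOCK_IF_NEEDED(was_locked);
    return took_lock;
}
```

Tracking `was_locked` keeps the unlock consistent with the lock even if `process_exiting` changes between the two macros.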
chathaway-codes pushed a commit that referenced this issue Jan 10, 2019
…ThreadAPI is active

This issue was exposed by a failure in the dual_fail_extend/dual_fail2_mustop_sigquit subtest.
This test terminates processes by sending them a SIGQUIT/SIG-3 or SIGTERM/SIG-15 signal.
One of the threads (the MAIN worker thread) in this multi-threaded process was inside wcs_wtstart() in a
non-interruptible code zone (DEFER_INTERRUPTS had been done), so the exit handler invoked in
another concurrently running thread decided to defer the exit until the ENABLE_INTERRUPTS
happened in the worker thread. When the ENABLE_INTERRUPTS did happen, the worker thread invoked
exit handling code while it was already inside a timer handler. Since this particular test
was running with GDSV4 format blocks, wcs_wtstart() could not flush such blocks: flushing them requires
a call to gtm_malloc(), which means a pthread_mutex_lock() call while inside a timer handler, which is
a no-no. As a result, wcs_flu() was not able to flush any blocks as part of exit handling, causing it to
fail an assert. Below is the C-stack corresponding to the assert failure.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7f59dccc02a0) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  wcs_flu (options=519) at sr_unix/wcs_flu.c:587
 #7  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608
 #8  gv_rundown () at sr_port/gv_rundown.c:123
 #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:204
 #10 __run_exit_handlers (status=-3, listp=0x7f59e2319718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
 #11 __GI_exit (status=<optimized out>) at exit.c:139
 #12 gtm_image_exit (status=-3) at sr_unix/gtm_image_exit.c:27
 #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:111
 #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:45
 #15 wcs_wtstart (region=0x55b9581d66d8, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:829
 #16 wcs_stale (tid=94254535632600, hd_len=8, region=0x55b9581d62a8) at sr_port/t_end_sysops.c:1387
 #17 timer_handler (why=14) at sr_unix/gt_timers.c:821
 #18 <signal handler called>
 #19 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:277
 #20 gtm_memcpy_validate_and_execute (target=0x7f59dccc25c0, src=0x7f59e32fd6c6, len=0) at sr_port/gtm_memcpy_validate_and_execute.c:42
 #21 gvcst_put2 (val=0x7f59e30c7440 <increment_delta_mval>, parms=0x7f59dccc4be0) at sr_port/gvcst_put.c:626
 #22 gvcst_put (val=0x7f59e30c7440 <increment_delta_mval>) at sr_port/gvcst_put.c:299
 #23 gvcst_incr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/gvcst_incr.c:56
 #24 op_gvincr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/op_gvincr.c:58

The fix for this issue is to not invoke exit handling while inside the timer handler if we know
SimpleThreadAPI is active. In that case, finish the timer handler first and invoke exit handling
a little later in mainline code where it is safe to invoke exit handling.
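
The deferral described in the fix above can be sketched as a small state machine. Everything below is an assumption made for illustration (flag names, function names, and the notion of a "safe point"); it is not the code in `sr_unix/deferred_exit_handler.c` or `sr_unix/gt_timers.c`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state flags; illustrative only. */
static bool in_timer_handler = false;
static bool simple_thread_api_active = false;
static bool exit_pending = false;
static int  exit_handler_runs = 0;

/* Stand-in for the real exit handling (database rundown etc.). */
static void run_exit_handling(void)
{
    /* Exit handling must not run inside a timer handler with threads active */
    assert(!(in_timer_handler && simple_thread_api_active));
    exit_handler_runs++;
}

/* Called when a previously deferred exit becomes eligible to run. */
void request_exit(void)
{
    if (in_timer_handler && simple_thread_api_active)
    {   /* Not safe here: remember the request and return to the timer code */
        exit_pending = true;
        return;
    }
    run_exit_handling();
}

/* Runs a timer body with the "inside timer handler" flag set. */
void timer_handler_sketch(void (*timer_body)(void))
{
    in_timer_handler = true;
    timer_body();
    in_timer_handler = false;
}

/* Called from mainline code at a point where exit handling is safe. */
void mainline_safe_point(void)
{
    if (exit_pending)
    {
        exit_pending = false;
        run_exit_handling();
    }
}
```

With `simple_thread_api_active` set, `timer_handler_sketch(request_exit)` only records the request, and the next `mainline_safe_point()` call performs the exit handling.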
chathaway-codes pushed a commit that referenced this issue Jun 12, 2019
In one v60000/gtm4525b subtest run using imptpgo.go, a process assert failed.

> %YDB-F-ASSERT, Assert failed in sr_port/tp_clean_up.c line 104 for expression (!update_trans)

Below is the C-stack

 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
 #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
 #3  rts_error_va () at sr_unix/rts_error.c:192
 #4  rts_error_csa () at sr_unix/rts_error.c:99
 #5  tp_clean_up () at sr_port/tp_clean_up.c:104
 #6  op_trollback () at sr_port/op_trollback.c:149
 #7  t_abort () at sr_port/t_abort.c:53
 #8  secshr_db_clnup () at sr_port/secshr_db_clnup.c:568
 #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:212
 #10 __run_exit_handlers () at exit.c:83
 #11 __GI_exit () at exit.c:105
 #12 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
 #13 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:489
 #14 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:526
 #15 wcs_wtstart () at sr_unix/wcs_wtstart.c:702
 #16 wcs_stale () at sr_port/t_end_sysops.c:1387
 #17 timer_handler () at sr_unix/gt_timers.c:834
 #18 <signal handler called>
 #19 __clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:45
 #20 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:503
 #21 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:526
 #22 t_retry () at sr_port/t_retry.c:183
 #23 t_end () at sr_port/t_end.c:1874
 #24 gvcst_bmp_mark_free () at sr_port/gvcst_bmp_mark_free.c:215
 #25 gvcst_expand_free_subtree () at sr_port/gvcst_expand_free_subtree.c:182
 #26 op_tcommit () at sr_port/op_tcommit.c:581
 #27 stkok3 () at sr_armv7l/opp_tcommit.s:38

(gdb) f 5
 #5  0xb66ef674 in tp_clean_up (clnup_state=TP_ROLLBACK) at /Distrib/YottaDB/V998_R124/sr_port/tp_clean_up.c:104
 104                     assert(!update_trans);

 100  if (tp_pointer->implicit_tstart)
 101  {       /* Resetting this is necessary to avoid blowing an assert in t_begin that it is 0 at the start of a transaction. */
 102          update_trans = 0;
 103  } else
 104          assert(!update_trans);

(gdb) p process_exiting
 $4 = 1

The assert at line 104 is now enhanced to allow for the "process_exiting" case. A comment has been
added to the code to explain why this is okay.
nars1 added a commit that referenced this issue May 4, 2020
…d malloc issues

* We had an in-house test failure on an ARMV6L box with the following diff.

  ```diff
   > ideminter_rolrec_0/mupipstop_rollback_or_recover/impjob_imptp0.mje5
   > %YDB-F-ASSERT, Assert failed in sr_port/gtm_malloc_src.h line 695 for expression (FALSE)
   ```

  Below is the C-stack at the time of the assert failure.

  ```gdb
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
  #3  rts_error_va () at sr_unix/rts_error.c:192
  #4  rts_error_csa () at sr_unix/rts_error.c:99
  #5  gtm_malloc () at sr_port/gtm_malloc_src.h:695
  #6  condstk_expand () at sr_unix/condstk_expand.c:53
  #7  ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:59
  #8  deferred_signal_handler () at sr_port/deferred_signal_handler.c:57
  #9  gtm_malloc () at sr_port/gtm_malloc_src.h:748
  #10 iorm_use () at sr_unix/iorm_use.c:988
  #11 iorm_open () at sr_unix/iorm_open.c:254
  #12 io_open_try () at sr_unix/io_open_try.c:616
  #13 op_open () at sr_port/op_open.c:160
  #14 open_source_file () at sr_unix/source_file.c:253
  #15 compiler_startup () at sr_port/compiler_startup.c:130
  #16 compile_source_file () at sr_unix/source_file.c:173
  #17 op_zcompile () at sr_port/op_zcompile.c:57
  #18 gtm_trigger_complink () at sr_unix/gtm_trigger.c:451
  #19 gtm_trigger () at sr_unix/gtm_trigger.c:551
  #20 gvtr_match_n_invoke () at sr_unix/gv_trigger.c:1683
  #21 gvcst_put2 () at sr_port/gvcst_put.c:2806
  #22 gvcst_put () at sr_port/gvcst_put.c:299
  #23 op_gvput () at sr_port/op_gvput.c:79
  #24 ydb_set_s () at sr_unix/ydb_set_s.c:137
  #25 ydb_set_st () at sr_unix/ydb_set_st.c:42
  #26 _cgo_d187034042ca_Cfunc_ydb_set_st () at cgo-gcc-prolog:170
  #27 runtime.asmcgocall () at /usr/lib/go-1.11/src/runtime/asm_arm.s:617
  ```

* The cause of the assert failure is a nested call to `gtm_malloc()` (frames 9 and 5 above).
  The nested call happened because the initial condition handler stack size of 5 was not enough
  when `sr_unix/ydb_stm_invoke_deferred_signal_handler.c` tried to do an ESTABLISH and add one more
  condition handler (at frame number 7). The condition handler stack was already used up by the
  following handlers.

  ```gdb
  (gdb) p chnd[0].ch
  $14 = (void (*)()) 0xb62f4f70 <stop_image_conditional_core>
  (gdb) p chnd[1].ch
  $15 = (void (*)()) 0xb63b0f10 <ydb_simpleapi_ch>
  (gdb) p chnd[2].ch
  $16 = (void (*)()) 0xb67f1e0c <gtm_trigger_complink_ch>
  (gdb) p chnd[3].ch
  $17 = (void (*)()) 0xb69c45e8 <source_ch>
  (gdb) p chnd[4].ch
  $18 = (void (*)()) 0xb6ce211c <compiler_ch>
  ```

* The initial condition handler stack size (controlled by the `CONDSTK_INITIAL_INCR` macro) is currently
  set to 5 (last changed from 2 to 5 as part of GT.M V6.3-000) for DEBUG builds and set to 8 for
  PRO/Release builds.

* Due to YottaDB's use of SimpleAPI, this limit of 5 is clearly not enough (as shown by the above failure)
  so it is now being bumped to 8 for DEBUG and to 16 for PRO/Release builds (just to be safe).
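
To illustrate why the initial size matters, here is a sketch of a growable handler stack that counts how often it has to expand. The type and function names are hypothetical and `SKETCH_CONDSTK_INITIAL_INCR` merely mirrors the new DEBUG value of 8; the real `condstk_expand()` logic is not reproduced here.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_CONDSTK_INITIAL_INCR 8  /* assumed; mirrors the new DEBUG size */

typedef void (*handler_fn)(void);

typedef struct
{
    handler_fn *base;       /* array of established handlers */
    size_t      used;       /* number of handlers currently established */
    size_t      capacity;   /* number of slots allocated */
    size_t      expansions; /* how many times the stack had to grow */
} cond_stack;

/* Trivial handler used only to have something to establish. */
void noop_handler(void)
{
}

int cond_stack_init(cond_stack *stk)
{
    assert(NULL != stk);
    stk->base = malloc(SKETCH_CONDSTK_INITIAL_INCR * sizeof(*stk->base));
    if (NULL == stk->base)
        return -1;
    stk->used = 0;
    stk->capacity = SKETCH_CONDSTK_INITIAL_INCR;
    stk->expansions = 0;
    return 0;
}

/* Push a handler; grows the stack (an allocation!) only when it is full. */
int cond_stack_establish(cond_stack *stk, handler_fn ch)
{
    assert((NULL != stk) && (NULL != stk->base));
    if (stk->used == stk->capacity)
    {   /* This is the step to avoid while already inside an allocator */
        size_t new_capacity = stk->capacity + SKETCH_CONDSTK_INITIAL_INCR;
        handler_fn *grown = realloc(stk->base, new_capacity * sizeof(*grown));

        if (NULL == grown)
            return -1;
        stk->base = grown;
        stk->capacity = new_capacity;
        stk->expansions++;
    }
    stk->base[stk->used++] = ch;
    return 0;
}

void cond_stack_free(cond_stack *stk)
{
    assert(NULL != stk);
    free(stk->base);
    stk->base = NULL;
    stk->used = stk->capacity = 0;
}
```

With an initial size of 8, establishing the six handlers seen in the failure (five already on the stack plus the one added at frame 7) needs no expansion, so no allocation happens at that point.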
nars1 added a commit that referenced this issue Jun 12, 2020
… is sent to a YottaDB process

* It is possible a timer interrupt comes in while we are canceling the timer in `sys_canc_timer()`
  (invoked in `generic_signal_handler()`). This can cause problems since we might end up trying to
  start a posix system timer on a non-existing timer id (as shown by the below C-stack we saw in
  a test failure).

  ```gdb
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  sys_settimer (tid=1978083808, time_to_expir=0x7eeeb62c) at sr_unix/gt_timers.c:564
  #7  start_first_timer (curr_time=0x7eeeb6fc) at sr_unix/gt_timers.c:633
  #8  timer_handler (why=14, info=0x76bad060 <stapi_signal_handler_oscontext+8808>, context=0x76bad0e0 <stapi_signal_handler_oscontext+8936>) at sr_unix/gt_timers.c:853
  #9  <signal handler called>
  #10 timer_delete (timerid=0x823a38) at ../sysdeps/unix/sysv/linux/timer_delete.c:38
  #11 sys_canc_timer () at sr_unix/gt_timers.c:1041
  #12 generic_signal_handler (sig=15, info=0x76babbc0 <stapi_signal_handler_oscontext+3528>, context=0x76babc40 <stapi_signal_handler_oscontext+3656>) at sr_unix/generic_signal_handler.c:401
  #13 <signal handler called>
  #14 write () at ../sysdeps/unix/syscall-template.S:84
  #15 iorm_wteol (x=1, iod=0x83c420) at sr_unix/iorm_wteol.c:226
  #16 write_text_newline_and_flush_pio (text=0x7eeec298) at sr_port/flush_pio.c:128
  #17 util_out_print_vaparm (message=0x76ab1864 "Blocks coalesced    : !SL ", flush=1, var=..., faocnt=2147483647) at sr_unix/util_output.c:872
  #18 util_out_print (message=0x76ab1864 "Blocks coalesced    : !SL ", flush=1) at sr_unix/util_output.c:913
  #19 reorg_finish (dest_blk_id=6003, blks_processed=1, blks_killed=0, blks_reused=0, file_extended=0, lvls_reduced=0, blks_coalesced=0, blks_split=0, blks_swapped=0) at sr_port/mu_reorg.c:720
  #20 mu_reorg (gl_ptr=0x10bdca0, exclude_glist_ptr=0x7eeed5a8, resume=0x7eeed4c4, index_fill_factor=100, data_fill_factor=100, reorg_op=0) at sr_port/mu_reorg.c:556
  #21 mupip_reorg () at sr_port/mupip_reorg.c:283
  #22 mupip_main (argc=2, argv=0x7eef7914, envp=0x7eef7920) at sr_unix/mupip_main.c:122
  #23 dlopen_libyottadb (argc=2, argv=0x7eef7914, envp=0x7eef7920, main_func=0x115f4 "mupip_main") at sr_unix/dlopen_libyottadb.c:148
  #24 main (argc=2, argv=0x7eef7914, envp=0x7eef7920) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  0x75df09f0 in sys_settimer (tid=1978083808, time_to_expir=0x7eeeb62c) at sr_unix/gt_timers.c:564
  564                     assert(WBTEST_ENABLED(WBTEST_SETITIMER_ERROR));
  (gdb) list
  559             assert(sys_timer.it_value.tv_sec || sys_timer.it_value.tv_nsec);
  560             sys_timer.it_interval.tv_sec = sys_timer.it_interval.tv_nsec = 0;
  561             if ((-1 == timer_settime(posix_timer_id, 0, &sys_timer, &old_sys_timer)) || WBTEST_ENABLED(WBTEST_SETITIMER_ERROR))
  562             {
  563                     save_errno = errno;
  564                     assert(WBTEST_ENABLED(WBTEST_SETITIMER_ERROR));
  565                     WBTEST_ONLY(WBTEST_SETITIMER_ERROR,
  566                             save_errno = EINVAL;
  567                     );
  568                     rts_error_csa(CSA_ARG(NULL) VARLSTCNT(8)
  569                                             ERR_SYSCALL, 5, RTS_ERROR_LITERAL("timer_settime()"), CALLFROM, save_errno);

  (gdb) p save_errno
  $1 = 22
  ```

  The fix is to remove the `sys_canc_timer()` call in `generic_signal_handler()` as it is not clear to me what
  purpose it serves. Later in exit handling (in `gtm_exit_handler()` etc.), we do a call to
  `CANCEL_TIMERS` anyway to cancel any active unsafe timers. This is a safer way of doing the `sys_canc_timer()`
  (as it blocks SIGALRM).

* That said, as part of the code review @estess indicated that he remembered this as being necessary for some
  reason when we were about to dump a core due to a fatal signal (e.g. assert etc.). Therefore, I have
  added code to block SIGALRM only in that code path even though similar code also exists and would be invoked
  a little later in `sr_unix/gtm_fork_n_core.c`.
nars1 added a commit that referenced this issue Aug 7, 2020
…e si->kill_set_tail set to NULL

* Below is the C-stack from the failure (1 out of 500 runs).

  ```gdb
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  generic_signal_handler (sig=11, info=0x7f4c6ee65588 <stapi_signal_handler_oscontext+4296>, context=0x7f4c6ee65608 <stapi_signal_handler_oscontext+4424>) at sr_unix/generic_signal_handler.c:422
  #4  ydb_os_signal_handler (sig=11, info=0x7ffe3eb88530, context=0x7ffe3eb88400) at sr_unix/ydb_os_signal_handler.c:84
  #5  <signal handler called>
  #6  tp_clean_up (clnup_state=TP_ROLLBACK) at sr_port/tp_clean_up.c:215
  #7  op_trollback (rb_levels=0) at sr_port/op_trollback.c:148
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569
  #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:212
  #10 signal_exit_handler (sig=2, info=0x7f4c6ee65588 <stapi_signal_handler_oscontext+4296>, context=0x7f4c6ee65608 <stapi_signal_handler_oscontext+4424>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:77
  #11 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:111
  #12 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #13 gtm_free (addr=0x1918040) at sr_port/gtm_malloc_src.h:1038
  #14 rollbk_sgm_tlvl_info (newlevel=1, si=0x191b840) at sr_port/tp_incr_clean_up.c:381
  #15 tp_incr_clean_up (newlevel=1) at sr_port/tp_incr_clean_up.c:96
  #16 op_trollback (rb_levels=-1) at sr_port/op_trollback.c:218
  #17 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8a240, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:301
  #18 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8a240, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #19 runProc (settings=0x7ffe3eb8c1f0, curDepth=1) at simpleapi/inref/randomWalk.c:666
  #20 tpHelper (tpfnparm=0x7ffe3eb8b770) at simpleapi/inref/randomWalk.c:691
  #21 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8b770, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256
  #22 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8b770, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #23 runProc (settings=0x7ffe3eb8c1f0, curDepth=0) at simpleapi/inref/randomWalk.c:666
  #24 runProc_driver (settings=0x7ffe3eb8c1f0) at simpleapi/inref/randomWalk.c:145
  #25 main () at simpleapi/inref/randomWalk.c:93

  (gdb) f 6
  #6  0x00007f4c6e39e4d2 in tp_clean_up (clnup_state=TP_ROLLBACK) at sr_port/tp_clean_up.c:215
  215                                             FREE_KILL_SET(ks);

  (gdb) p ks
  $1 = (kill_set *) 0xdeadbeefdeadbeef
  ```

* The SIG-11 was because we were done with a `FREE_KILL_SET` (in frame 14 above) when we realized the need
  to handle a deferred signal and as part of handling that we ended up doing another `FREE_KILL_SET` (in
  frame 6 above) on the same kill-set element resulting in a double free.

* This is now fixed by first saving `si->kill_set_head` in a temporary variable, then setting the global
  variable to NULL, and only then invoking `FREE_KILL_SET` on the saved copy.

* The `FREE_KILL_SET` macro is now passed an additional parameter which is the global variable to reset.
  The macro resets the passed in global variable to NULL before it does any `free()` calls.
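
The "reset before free" pattern described above can be sketched as follows. The struct, the `FREE_KILL_SET_SKETCH` macro and the helper functions are simplified stand-ins invented for illustration; the real `kill_set` structure and `FREE_KILL_SET` macro differ.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-in for a kill-set list element. */
typedef struct kill_set_sketch
{
    struct kill_set_sketch *next_kill_set;
} kill_set_sketch;

/* Detach the list from `head_var` BEFORE freeing anything. If an interrupt
 * handler re-enters this macro midway, it sees NULL and frees nothing, so
 * no element is freed twice.
 */
#define FREE_KILL_SET_SKETCH(head_var)                            \
do {                                                              \
    kill_set_sketch *ks_ = (head_var);                            \
                                                                  \
    (head_var) = NULL;                                            \
    while (NULL != ks_)                                           \
    {                                                             \
        kill_set_sketch *next_ = ks_->next_kill_set;              \
                                                                  \
        free(ks_);                                                \
        ks_ = next_;                                              \
    }                                                             \
} while (0)

/* Prepends a new element; returns the old head unchanged if malloc fails. */
kill_set_sketch *kill_set_prepend(kill_set_sketch *head)
{
    kill_set_sketch *node = malloc(sizeof(*node));

    if (NULL == node)
        return head;
    node->next_kill_set = head;
    return node;
}

/* Frees the whole list reachable from *head_ptr and leaves it NULL. */
void release_kill_sets(kill_set_sketch **head_ptr)
{
    assert(NULL != head_ptr);
    FREE_KILL_SET_SKETCH(*head_ptr);
}
```

Calling `release_kill_sets()` a second time on the same head is then a harmless no-op rather than a double free.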
nars1 added a commit that referenced this issue Mar 15, 2021
…ady started exiting

* As part of a prior commit (SHA 723688c) various functions that started a
  timer (`wcs_clean_dbsync()`, `wcs_stale()` etc.) were fixed to not start one if we have already started
  exit processing.

* One such timer function that should also have been fixed but was left out is `gtmsource_heartbeat_timer()`.
  We had an in-house test failure which failed an assert in `start_timer()` because `gtmsource_heartbeat_timer()`
  was being started while we had already started exit processing. Below is the C-stack of the failure for the record.

  ```c
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va () at sr_unix/rts_error.c:192
  #5  rts_error_csa () at sr_unix/rts_error.c:99
  #6  start_timer () at sr_unix/gt_timers.c:433
  #7  gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:74
  #8  timer_handler () at sr_unix/gt_timers.c:889
  #9  ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63
  #10 <signal handler called>
  #11 __GI___libc_write () at ../sysdeps/unix/sysv/linux/write.c:26
  #12 _IO_new_file_write () at fileops.c:1181
  #13 new_do_write () at libioP.h:948
  #14 _IO_new_file_xsputn () at fileops.c:1255
  #15 _IO_new_file_xsputn () at fileops.c:1197
  #16 __GI__IO_fwrite () at libioP.h:948
  #17 gtm_fwrite () at sr_port/eintr_wrappers.h:334
  #18 gtm_fprintf () at tdio.c:82
  #19 util_out_print_vaparm () at sr_unix/util_output.c:876
  #20 util_out_print () at sr_unix/util_output.c:914
  #21 gtm_putmsg_csa () at sr_unix/gtm_putmsg.c:73
  #22 gds_rundown () at sr_unix/gds_rundown.c:1060
  #23 gv_rundown () at sr_port/gv_rundown.c:122
  #24 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #25 __run_exit_handlers () at exit.c:108
  #26 __GI_exit () at exit.c:139
  #27 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
  #28 util_base_ch () at sr_port/util_base_ch.c:124
  #29 gtmsource_ch () at sr_port/gtmsource_ch.c:96
  #30 gtmsource_readfiles () at aDB/V999_R131/sr_unix/gtmsource_readfiles.c:2023
  #31 gtmsource_get_jnlrecs () attaDB/V999_R131/sr_unix/gtmsource_process_ops.c:980
  #32 gtmsource_process () at sr_unix/gtmsource_process.c:1546
  #33 gtmsource () at sr_unix/gtmsource.c:525
  #34 mupip_main () at sr_unix/mupip_main.
  #35 dlopen_libyottadb () at /Distri9_R131/sr_unix/dlopen_libyottadb.c:151
  #36 main () at sr_unix/mupip.c:22
  ```

* This failure is now fixed by checking `exit_handler_active`; if it is `TRUE`, we skip starting this timer.
nars1 added a commit that referenced this issue Mar 18, 2021
…if process has already started exiting

* As part of a prior commit (a37022e) `sr_unix/gtmsource_heartbeat.c` was
  fixed to skip starting a timer if the process has already started exiting.

  It turns out there is one more place in the same file where the timer is started. It needed a similar
  fix but was missed in the prior commit.

* We had an in-house test failure with the following C-stack that exercised the missed code path
  (`sr_unix/gtmsource_heartbeat.c` line 75, frame 7 below).

  ```c
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va () at sr_unix/rts_error.c:192
  #5  rts_error_csa () at sr_unix/rts_error.c:99
  #6  start_timer () at sr_unix/gt_timers.c:433
  #7  gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:75
  #8  timer_handler () at sr_unix/gt_timers.c:889
  #9  ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63
  #10 <signal handler called>
  #11 gds_rundown () at sr_unix/gds_rundown.c:249
  #12 gv_rundown () at sr_port/gv_rundown.c:122
  #13 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #14 __run_exit_handlers () at exit.c:108
  #15 __GI_exit () at exit.c:139
  #16 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
  #17 util_base_ch () at sr_port/util_base_ch.c:124
  #18 gtmsource_ch () at sr_port/gtmsource_ch.c:96
  #19 gtmsource_readfiles () at sr_unix/gtmsource_readfiles.c:2023
  #20 gtmsource_get_jnlrecs () at sr_unix/gtmsource_process_ops.c:966
  #21 gtmsource_process () at sr_unix/gtmsource_process.c:1557
  #22 gtmsource () at sr_unix/gtmsource.c:525
  #23 mupip_main () at sr_unix/mupip_main.c:122
  #24 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:151
  #25 main () at sr_unix/mupip.c:22
  ```

* A similar fix is now applied to this code path. A new macro `START_GTMSOURCE_HEARTBEAT_TIMER_IF_NOT_EXITING`
  implements the fix from the prior commit and is invoked from both code paths. This way we avoid
  code duplication.
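
The idea of one guard macro shared by both call sites can be sketched roughly as below. The body of the real `START_GTMSOURCE_HEARTBEAT_TIMER_IF_NOT_EXITING` macro is not shown in this commit message, so the macro name, `start_timer_stub()`, the two call-site functions and `demo_heartbeat()` here are hypothetical stand-ins.

```c
#include <assert.h>
#include <stdbool.h>

static bool exit_handler_active;   /* modeled here as a plain flag */
static int timers_started;

/* Stand-in for start_timer(); asserts like the real one reportedly does. */
static void start_timer_stub(void)
{
    assert(!exit_handler_active);
    timers_started++;
}

/* Hypothetical guard macro: start the timer only if we are not exiting. */
#define START_HEARTBEAT_TIMER_IF_NOT_EXITING()  \
    do {                                        \
        if (!exit_handler_active)               \
            start_timer_stub();                 \
    } while (0)

/* Two separate code paths that both need the same guard. */
static void first_call_site(void)  { START_HEARTBEAT_TIMER_IF_NOT_EXITING(); }
static void second_call_site(void) { START_HEARTBEAT_TIMER_IF_NOT_EXITING(); }

/* Returns how many timers were started across both phases. */
int demo_heartbeat(void)
{
    exit_handler_active = false;
    timers_started = 0;
    first_call_site();
    second_call_site();
    exit_handler_active = true;    /* exit processing has begun */
    first_call_site();             /* skipped, no assert failure */
    second_call_site();            /* skipped, no assert failure */
    return timers_started;
}
```

Keeping the check inside the macro means a future third call site picks up the guard automatically instead of having to repeat the `exit_handler_active` test.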
nars1 added a commit that referenced this issue Jan 15, 2022
…ready exiting (fixes random r132/ydb635 subtest failure)

Background
----------
* The `r132/ydb635` subtest (in the YDBTest project) started to fail on a RHEL 7 in-house system
  after merging GT.M V6.3-011.

* The failure symptom was a core file with the following stack trace.

  ```c
  Thread 1 (Thread 0x7f5c261dd740 (LWP 49939)):
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_overrun () at sr_unix/ch_overrun.c:35
  #3  rts_error_va (csa=0x0, argcnt=4, var=0x7ffdc23a7840) at sr_unix/rts_error.c:198
  #4  rts_error_csa (csa=0x0, argcnt=4) at sr_unix/rts_error.c:99
  #5  resetterm (iod=0x1630c40) at sr_unix/resetterm.c:55
  #6  io_rundown (rundown_type=1) at sr_port/io_rundown.c:74
  #7  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:171
  #8  __run_exit_handlers () from /usr/lib64/libc.so.6
  #9  exit () from /usr/lib64/libc.so.6
  #10 gtm_image_exit (status=150373082) at sr_unix/gtm_image_exit.c:27
  #11 util_base_ch (arg=150373082) at sr_port/util_base_ch.c:124
  #12 gtmio_ch (arg=150373082) at sr_unix/gtmio_ch.c:24
  #13 rts_error_va (csa=0x0, argcnt=1, var=0x7ffdc23a8250) at sr_unix/rts_error.c:198
  #14 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #15 iott_readfl (v=0x16774b8, length=32766, nsec_timeout=9223372036854775800) at sr_unix/iott_readfl.c:973
  #16 iott_read (v=0x16774b8, nsec_timeout=9223372036854775800) at sr_unix/iott_read.c:29
  #17 op_read (v=0x16774b8, timeout=0x7f5c2491d1c0 <literal_notimeout>) at sr_port/op_read.c:68
  #18 cli_get_parm (entry=0x7ffdc23b8c90 "WHAT", val_buf=0x7ffdc23b0b90 "") at sr_unix/cli_parse.c:1025
  #19 cli_get_str (entry=0x7f5c2460d784 "WHAT", dst=0x162dcd4 "", max_len=0x162dcd2) at sr_unix/cli.c:285
  #20 mupip_integ () at sr_port/mupip_integ.c:290
  #21 mupip_main (argc=2, argv=0x7ffdc23c4e88, envp=0x7ffdc23c4ea0) at sr_unix/mupip_main.c:122
  #22 dlopen_libyottadb (argc=2, argv=0x7ffdc23c4e88, envp=0x7ffdc23c4ea0, main_func=0x401470 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #23 main (argc=2, argv=0x7ffdc23c4e88, envp=0x7ffdc23c4ea0) at sr_unix/mupip.c:22
  ```

* As can be seen from the gdb output below, we got an IOEOF error in frame 15 and then went to the exit handler
  in frame 7. As part of exiting, we encountered an ERR_TCSETATTR error in frame 5.

  ```c
  (gdb) f 15
  #15 iott_readfl (v=0x16774b8, length=32766, nsec_timeout=9223372036854775800) at sr_unix/iott_readfl.c:973
  973                    rts_error_csa(CSA_ARG(NULL) VARLSTCNT(1) ERR_IOEOF);
  (gdb) f 7
  #7  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:171
  171             io_rundown(RUNDOWN_EXCEPT_STD);
  (gdb) f 5
  #5  resetterm (iod=0x1630c40) at sr_unix/resetterm.c:55
  55                  rts_error_csa(CSA_ARG(NULL) VARLSTCNT(4) ERR_TCSETATTR, 1, ttptr->fildes, save_errno);
  ```

Issue
-----
* In frame 5, there was no condition handler to handle the ERR_TCSETATTR error and so we generated a core file.
  This is because we are already exiting due to an error.

Fix
----
* The fix is in `sr_unix/resetterm.c` to check if `exit_handler_active` is TRUE and if so not issue the
  ERR_TCSETATTR error. Reasoning is described in a code comment.

* While at this, I realized that it would be nice to issue a NOPRINCIO error message to the syslog and
  terminate the process in case we already encountered an error while writing to the terminal. Therefore
  added a call to the ISSUE_NOPRINCIO_BEFORE_RTS_ERROR_IF_APPROPRIATE macro that currently exists in
  `sr_unix/iott_use.c`. And moved the macro to `sr_port/io.h` so it can be called from multiple places.

  Also noticed a pre-existing usage in `sr_unix/iott_use.c` where a `TCFLUSH()` call failure could also
  benefit from issuing the NOPRINCIO error message. So added that too.

* With these changes, the test (which kills the terminal in an `expect` session before the `mupip integ`
  process could return back to the shell prompt) now passes reliably. In the syslog, I do see a
  `NOPRINCIO` error message now whereas it did not show up before.
nars1 added a commit that referenced this issue Jan 26, 2022
…e specification using ^[..] syntax

Background
----------
* This is an issue identified by fuzz testing.

* Below is a simple example illustrating the failure using a `set` command.

  ```m
  YDB>set ^[$order(@x,1)
  %YDB-F-GTMASSERT2, YottaDB r998 Linux x86_64 - Assert failed sr_port/f_order.c line 121 for expression (DEPTH)
  ```

* Interestingly though, a similar example using the `write` command instead of the `set` command
  works fine in that it correctly issues the EXTGBLDEL error.

  ```m
  YDB>write ^[$order(@x,1)
  %YDB-E-EXTGBLDEL, Invalid delimiter for extended global syntax
          write ^[$order(@x,1)
                              ^-----
  ```

Issue
-----
* The C-stack from the core file is the following.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140639387280448) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140639387280448) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140639387280448, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=9, var=0x7ffdb630a260) at sr_unix/rts_error.c:198
  #7  rts_error (argcnt=9) at sr_unix/rts_error.c:88
  #8  gtm_assert2 (condlen=5, condtext=0x7fe92512a100 "DEPTH", file_name_len=44, file_name=0x7fe925129fe0 "sr_port/f_order.c", line_no=121) at sr_port/gtm_assert2.c:36
  #9  f_order (a=0x7ffdb630aab0, op=OC_FNORDER) at sr_port/f_order.c:121
  #10 expritem (a=0x7ffdb630aab0) at sr_port/expritem.c:637
  #11 expratom (a=0x7ffdb630aab0) at sr_port/expratom.c:29
  #12 expratom_coerce_mval (a=0x7ffdb630aab0) at sr_port/expratom_coerce_mval.c:34
  #13 gvn () at sr_port/gvn.c:70
  #14 m_set () at sr_port/m_set.c:300
  #15 cmd () at sr_port/cmd.c:312
  #16 linetail () at sr_port/linetail.c:35
  #17 line (lnc=0x7ffdb630c9c0) at sr_port/line.c:230
  #18 compiler_startup () at sr_port/compiler_startup.c:183
  #19 compile_source_file (flen=44, faddr=0x7ffdb630d1f0 "x.m", MFtIsReqd=1) at sr_unix/source_file.c:174
  #20 gtm_compile () at sr_unix/gtm_compile.c:113
  #21 init_gtm () at sr_unix/init_gtm.c:183
  #22 gtm_main (argc=2, argv=0x7ffdb6311d68, envp=0x7ffdb6311d80) at sr_unix/gtm_main.c:178
  #23 dlopen_libyottadb (argc=2, argv=0x7ffdb6311d68, envp=0x7ffdb6311d80, main_func=0x56087a968020 "gtm_main") at sr_unix/dlopen_libyottadb.c:151
  #24 main (argc=2, argv=0x7ffdb6311d68, envp=0x7ffdb6311d80) at sr_unix/gtm.c:20

  (gdb) f 9
  #9  f_order (a=0x7ffdb630aab0, op=OC_FNORDER) at sr_port/f_order.c:121
  121    DISABLE_SIDE_EFFECT_AT_DEPTH;    /* doing this here let's us know specifically if direction had SE threat */
  ```

* The failure was because `TREF(expr_depth)` was 0 whereas the `DISABLE_SIDE_EFFECT_AT_DEPTH` macro was
  expecting a non-zero expression depth.

* When we are in `f_order()`, we are guaranteed a non-zero expression depth if we were called from `expr()`.
  But in case we are processing an extended global reference using the `^[...]` syntax, we use
  `expratom_coerce_mval()` instead of `expr()` (at frame 13 in gvn.c, line 70 below).

  **sr_port/gvn.c**
  ```c
        67     if (vbar)
        68             parse_status = expr(sb1++, MUMPS_EXPR);
        69     else
   -->  70             parse_status = expratom_coerce_mval(sb1++);
  ```

  In this case, `TREF(expr_depth)` is not incremented. And so we cannot invoke `DISABLE_SIDE_EFFECT_AT_DEPTH`
  inside `f_order()` deep down in the stack.

Fix
---
* The fix is to enhance the `DISABLE_SIDE_EFFECT_AT_DEPTH` macro to handle the case that `TREF(expr_depth)`
  can be zero in rare cases. In that case, we do not propagate the side effect state one depth down. We
  just ignore the side effect state till now and reset the current state at depth 0 to be FALSE and return.

* This takes care of all callers of the `DISABLE_SIDE_EFFECT_AT_DEPTH` macro that do not go through the
  `DECREMENT_EXPR_DEPTH` macro.

* In the case of the `DECREMENT_EXPR_DEPTH` macro, we do expect `TREF(expr_depth)` to be non-zero even if
  it is called from `expratom_coerce_mval()`. This is because whichever deep function invocation in the
  stack did the `DECREMENT_EXPR_DEPTH` should have previously done a corresponding `INCREMENT_EXPR_DEPTH`.
  Therefore this now has a newly added `assert(TREF(expr_depth));`.
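
A rough model of the depth-zero handling described above is sketched below. It assumes the side-effect state is kept as one flag per expression depth; the array, the function names and `demo_depth()` are hypothetical and are not the actual macro definitions.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_DEPTH 8

static bool side_effect[MAX_DEPTH];   /* hypothetical per-depth flags */
static unsigned expr_depth;           /* models TREF(expr_depth) */

/* Models DISABLE_SIDE_EFFECT_AT_DEPTH with the depth-zero case handled. */
static void disable_side_effect_at_depth(void)
{
    if (expr_depth == 0) {
        /* Nothing below depth 0 to propagate to: just reset and return. */
        side_effect[0] = false;
        return;
    }
    /* Propagate the state one depth down, then clear the current depth. */
    side_effect[expr_depth - 1] = side_effect[expr_depth - 1] || side_effect[expr_depth];
    side_effect[expr_depth] = false;
}

/* Models DECREMENT_EXPR_DEPTH, which still expects a non-zero depth. */
static void decrement_expr_depth(void)
{
    assert(expr_depth > 0);
    disable_side_effect_at_depth();
    expr_depth--;
}

/* Returns 0 if both the depth-zero and the normal path behave as described. */
int demo_depth(void)
{
    expr_depth = 0;
    side_effect[0] = true;
    disable_side_effect_at_depth();   /* must not index side_effect[-1] */
    if (side_effect[0])
        return 1;

    expr_depth = 1;
    side_effect[1] = true;
    decrement_expr_depth();
    if (!side_effect[0] || side_effect[1] || expr_depth != 0)
        return 2;
    return 0;
}
```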
nars1 added a commit that referenced this issue Jan 31, 2022
…on a garbage file descriptor

Background
----------
* This is a very rare test failure that was seen only once and on a slow ARM in-house box in
  internal testing.

* The `stress/concurr` subtest failed with the following diff.

  ```diff
  --- concurr/concurr.diff ---
  69a70,181
  > host:REMOTE_SIDE:stress_1/concurr/stress_oli.out
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE)
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE)
  > %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown
  ```

* The assert failure created a core file with the following stack trace

  ```c
   #6 gtm_close (fd=1626061471) at sr_unix/gtm_fd_trace.c:185
   #7 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192
   #8 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:501
   #9 gv_rundown () at sr_port/gv_rundown.c:122
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #11 __run_exit_handlers (status=150374524, listp=0xffff9c805680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
  #12 __GI_exit (status=<optimized out>) at exit.c:139
  #13 gtm_image_exit (status=150374524) at sr_unix/gtm_image_exit.c:27
  #14 util_base_ch (arg=150374524) at sr_port/util_base_ch.c:124
  #15 mu_int_ch (arg=150374524) at sr_unix/mu_int_ch.c:35
  #16 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192
  #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #18 gtm_close (fd=-559038737) at sr_unix/gtm_fd_trace.c:185
  #19 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192
  #20 jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:74
  #21 timer_handler (why=0, info=0xffff9c65df68 <stapi_signal_handler_oscontext+47048>, context=0xffff9c65dff0 <stapi_signal_handler_oscontext+47184>, is_os_signal_handler=0) at sr_unix/gt_timers.c:889
  #22 check_for_deferred_timers () at sr_unix/gt_timers.c:1267
  #23 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #24 gtm_free (addr=0xaaaaffca1980) at sr_port/gtm_malloc_src.h:1038
  #25 ss_release (ss_ctx=0xaaaaffc78910) at sr_unix/ss_release.c:226
  #26 mupip_integ () at sr_port/mupip_integ.c:801
  #27 mupip_main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip_main.c:122
  #28 dlopen_libyottadb (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980, main_func=0xaaaae7dd6648 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip.c:22
  ```

Issue
-----
* Frame 18 in the stack trace above indicates a `gtm_close()` call happening with an fd of `-559038737`.

* Frame 6 in the stack trace above indicates a `gtm_close()` call happening with an fd of `1626061471`.

* The real issue is in Frame 26 in the stack trace above where we call `ss_release()`. The relevant code
  is pasted below.

  **sr_port/mupip_integ.c**
  ```c
       799     assert(SNAPSHOTS_IN_PROG(csa));
       800     assert(NULL != csa->ss_ctx);
       801     ss_release(&csa->ss_ctx);
       802     CLEAR_SNAPSHOTS_IN_PROG(csa);
  ```

* Line 801 does the `ss_release()` call and Line 802 clears the flag in `csa` that records that a snapshot
  is in progress.

* But `ss_release()` first calls `ss_destroy_context()` and then calls `free()`, so it is possible that a
  timer interrupt gets handled in a deferred fashion right after the `free()` but before the
  `CLEAR_SNAPSHOTS_IN_PROG` macro gets executed. This means we would invoke `ss_destroy_context()` on the
  `csa->ss_ctx` structure again inside the timer, and that would be looking at an already freed context
  structure. This explains why garbage values of `fd` got used in the `gtm_close()` calls.

Fix
---
* The fix is in `sr_port/mupip_integ.c` to clear all context in global variables that indicate a snapshot
  is in progress BEFORE calling `ss_release()`.

* Additionally, the following files were changed since the warning text from `clang-tidy` changed a bit.
  While at it, I also verified that this warning is a false alarm.
  - ci/tidy_warnings_debug.ref
  - ci/tidy_warnings_release.ref
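
The ordering change in `sr_port/mupip_integ.c` can be illustrated with the following sketch. It is a simplified model under stated assumptions: `db_area`, `snapshot_ctx`, `release_context()`, `timer_handler()` and `demo_snapshot()` are hypothetical names, and the deferred timer is simulated by a direct call right after `free()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int fd; } snapshot_ctx;   /* hypothetical context structure */

typedef struct {
    snapshot_ctx *ss_ctx;        /* models csa->ss_ctx */
    bool snapshot_in_prog;       /* models the snapshot-in-progress flag */
} db_area;

static db_area area;
static int destroy_calls;

static void destroy_context(snapshot_ctx *ctx)
{
    (void)ctx;   /* the real code would close ctx->fd here */
    destroy_calls++;
}

/* Models a timer that cleans up if a snapshot still looks active. */
static void timer_handler(void)
{
    if (area.snapshot_in_prog && area.ss_ctx != NULL)
        destroy_context(area.ss_ctx);
}

/* Models the release: destroy, free, then a deferred timer fires. */
static void release_context(snapshot_ctx *ctx)
{
    destroy_context(ctx);
    free(ctx);
    timer_handler();   /* simulated deferred timer right after free() */
}

/* Returns how many times the context was destroyed (should be once). */
int demo_snapshot(void)
{
    snapshot_ctx *ctx = malloc(sizeof(*ctx));

    if (ctx == NULL)
        return -1;
    ctx->fd = 3;
    area.ss_ctx = ctx;
    area.snapshot_in_prog = true;
    destroy_calls = 0;

    /* Fixed ordering: clear the globals BEFORE releasing the context. */
    area.snapshot_in_prog = false;
    area.ss_ctx = NULL;
    release_context(ctx);
    return destroy_calls;
}
```

With the globals cleared first, the simulated timer finds nothing to clean up and never touches the freed structure.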
nars1 added a commit that referenced this issue Jul 21, 2022
Background
----------
* While running the TCK04 bats subtest in the YDBOcto repo using a Debug build of YottaDB
  that was built using `clang` (not `gcc`), I encountered a very rare failure (took hundreds
  of test reruns to reproduce once).

* Although the failure happened only with `clang`, the same issue can happen with `gcc` builds
  of YottaDB too given the right timing of events/signals.

* Below is the stack trace of the core file from the assert using the gdb debugger.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140299547846464) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140299547846464) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140299547846464, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  generic_signal_handler (sig=11, info=0x7f9a08aecca8 <stapi_signal_handler_oscontext+4424>, context=0x7f9a08aecd28 <stapi_signal_handler_oscontext+4552>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:492
  #6  ydb_os_signal_handler (sig=11, info=0x7fff10881b70, context=0x7fff10881a40) at sr_unix/ydb_os_signal_handler.c:85
  #7  <signal handler called>
  #8  cleanup_list (list=0xaf8a40) at sr_port/buddy_list.c:205
  #9  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1098
  #10 gv_rundown () at sr_port/gv_rundown.c:122
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #12 signal_exit_handler (exit_handler_name=0x7f9a0898ce5a "deferred_exit_handler", sig=15, info=0x7f9a08aecca8 <stapi_signal_handler_oscontext+4424>, context=0x7f9a08aecd28 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 gtm_malloc_main (size=520, stack_level=1) at sr_port/gtm_malloc_src.h:800
  #16 gtm_malloc (size=520) at sr_port/gtm_malloc_src.h:1486
  #17 initialize_list (list=0xaf8a40, elemSize=192, initAlloc=64) at sr_port/buddy_list.c:52
  #18 gvcst_tp_init (greg=0xaf1a18) at sr_port/gvcst_tp_init.c:55
  #19 tp_set_sgm () at sr_port/tp_set_sgm.c:53
  #20 change_reg () at sr_port/change_reg.c:57
  #21 gv_bind_name (addr=0xaf1470, gvname=0x7fff10882e98) at sr_port/gv_bind_name.c:144
  #22 op_gvname_common (count=4, hash_code=-1391378772, val_arg=0x7f9a08f1f998, var=0x7fff10889c00) at sr_port/op_gvname.c:117
  #23 op_gvname_fast (count_arg=6, hash_code=-1391378772, val_arg=0x7f9a08f1f998) at sr_port/op_gvname.c:81

  (gdb) f 8
  #8  cleanup_list (list=0xaf8a40) at sr_port/buddy_list.c:205
  205             while(*curr)

  (gdb) f 17
  #17 initialize_list (list=0xaf8a40, elemSize=192, initAlloc=64) at sr_port/buddy_list.c:52
  52              list->ptrArray = (char **)malloc((size_t)SIZEOF(char *) * (MAX_MEM_SIZE_IN_BITS + 2));
  ```

Issue
-----
* A SIG-15/SIGTERM signal interrupted the `initialize_list()` call in frame 17. In frame 18, we were
  trying to initialize `si->tlvl_cw_set_list` as the below line of code indicates.

  **sr_port/gvcst_tp_init.c**
  ```c
     55   initialize_list(si->tlvl_cw_set_list, SIZEOF(cw_set_element), TLVL_CW_SET_LIST_INIT_ALLOC);
  ```

* The signal caused us to proceed to exit handling and as part of that we tried to cleanup the
  incompletely set up structure `si->tlvl_cw_set_list` at line 1098 below.

  **sr_unix/gds_rundown.c**
  ```c
   1082                 if (csa->sgm_info_ptr)
   1083                 {
   1084                         si = csa->sgm_info_ptr;
   1085                         /* It is possible we got interrupted before initializing all fields of "si"
   1086                          * completely so account for NULL values while freeing/releasing those fields.
   1087                          */
   1088                         assert((si->tp_csa == csa) || (NULL == si->tp_csa));
   1089                         if (si->jnl_tail)
   1090                         {
   1091                                 PROBE_FREEUP_BUDDY_LIST(si->format_buff_list);
   1092                                 PROBE_FREEUP_BUDDY_LIST(si->jnl_list);
   1093                                 FREE_JBUF_RSRV_STRUCT(si->jbuf_rsrv_ptr);
   1094                         }
   1095                         PROBE_FREEUP_BUDDY_LIST(si->recompute_list);
   1096                         PROBE_FREEUP_BUDDY_LIST(si->new_buff_list);
   1097                         PROBE_FREEUP_BUDDY_LIST(si->tlvl_info_list);
   1098                         PROBE_FREEUP_BUDDY_LIST(si->tlvl_cw_set_list);
   1099                         PROBE_FREEUP_BUDDY_LIST(si->cw_set_list);
  ```

* And that caused the SIG-11.

Fix
---
* A lot of the above cleanup in `sr_unix/gds_rundown.c` happens only if `csa->sgm_info_ptr` is non-NULL.

* But this field gets set to a non-NULL value at the very start of `sr_port/gvcst_tp_init.c` before
  a lot of the individual fields (like `si->tlvl_cw_set_list` etc.) get initialized.

* Therefore, the fix is to set `csa->sgm_info_ptr` to a non-NULL value only AFTER all the initialization
  of the individual members in that structure has happened.
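
The "initialize fully, then publish the pointer" ordering can be sketched as below. This is only a model: the structure layouts, `exit_cleanup()`, `init_list()` and `demo_publish()` are hypothetical, and the SIGTERM is simulated by calling the cleanup from inside the initialization.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { char **ptrArray; } buddy_list_model;        /* hypothetical */
typedef struct { buddy_list_model list; } sgm_info_model;    /* hypothetical */

static sgm_info_model *sgm_info_ptr;   /* models csa->sgm_info_ptr */
static int unsafe_visits;              /* cleanups that saw a half-built list */

/* Models exit-time cleanup, which only runs if the pointer is non-NULL. */
static void exit_cleanup(void)
{
    sgm_info_model *si = sgm_info_ptr;

    if (si == NULL)
        return;
    if (si->list.ptrArray == NULL)
        unsafe_visits++;   /* the real cleanup would dereference this */
}

static void init_list(buddy_list_model *list)
{
    exit_cleanup();   /* simulate a signal arriving before ptrArray is set */
    list->ptrArray = calloc(4, sizeof(char *));
}

/* Returns the number of unsafe cleanup visits (expected to be zero). */
int demo_publish(void)
{
    sgm_info_model *si = calloc(1, sizeof(*si));
    int result;

    if (si == NULL)
        return -1;
    unsafe_visits = 0;
    sgm_info_ptr = NULL;
    init_list(&si->list);             /* initialize every member first */
    if (si->list.ptrArray == NULL) {
        free(si);
        return -1;
    }
    sgm_info_ptr = si;                /* publish the pointer last */
    exit_cleanup();                   /* now sees a fully built structure */
    result = unsafe_visits;
    sgm_info_ptr = NULL;
    free(si->list.ptrArray);
    free(si);
    return result;
}
```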

Notes
-----
* Even though the user-visible symptom is a SIG-11, this issue is considered too rare for a user to
  encounter, so a separate issue was not created for this fix.
nars1 added a commit that referenced this issue Aug 9, 2022
… .m file is attempted

Background
----------
* Below is a simple test case obtained from a fuzz test failure in in-house testing.

  ```m
  $ cat test.m
   set fn="generated.m"
   open fn:new
   use fn
   write " z"
   Set $ZROUTINES=""
   zlink "generated.m"

  $ $ydb_dist/yottadb -run test
  %YDB-F-KILLBYSIGSINFO1, YottaDB process 55439 has been killed by a signal 11 at address 0x00007F4F4F82EED7 (vaddr 0x0000000000000008)
  %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
  Segmentation fault (core dumped)
  ```

* This is a failure in both Release and Debug builds of YottaDB as well as the upstream GT.M.

Issue
-----
* Below is the stack trace from the core file.

  ```c
  (gdb) where
  #0  ins_errtriple (in_error=150373618) at sr_port/ins_errtriple.c:51
  #1  stx_error_va (in_error=150373618, args=0x7f6559aa53c0) at sr_port/stx_error.c:164
  #2  rts_error_va (csa=0x0, argcnt=1, var=0x7f6559aa54a0) at sr_unix/rts_error.c:179
  #3  rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #4  iorm_wteol (x=1, iod=0x62d000004840) at sr_unix/iorm_wteol.c:87
  #5  iorm_cond_wteol (iod=0x62d000004840) at sr_unix/iorm_flush.c:42
  #6  iorm_close (iod=0x62d000004840, pp=0x7f6559aa63b0) at sr_unix/iorm_close.c:112
  #7  io_dev_close (d=0x62d000005ec0) at sr_port/io_rundown.c:102
  #8  io_rundown (rundown_type=0) at sr_port/io_rundown.c:60
  #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:239
  #10 signal_exit_handler (exit_handler_name=0x7f6555366520 "generic_signal_handler", sig=11, info=0x7f6555881948 <stapi_signal_handler_oscontext+4424>, context=0x7f65558819c8 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #11 generic_signal_handler (sig=11, info=0x7f6555881948 <stapi_signal_handler_oscontext+4424>, context=0x7f65558819c8 <stapi_signal_handler_oscontext+4552>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:500
  #12 ydb_os_signal_handler (sig=11, info=0x7f6559aa6bf0, context=0x7f6559aa6ac0) at sr_unix/ydb_os_signal_handler.c:85
  #13 <signal handler called>
  #14 ins_errtriple (in_error=150373618) at sr_port/ins_errtriple.c:51
  #15 stx_error_va (in_error=150373618, args=0x7ffe77c31f90) at sr_port/stx_error.c:164
  #16 rts_error_va (csa=0x0, argcnt=1, var=0x7ffe77c32070) at sr_unix/rts_error.c:179
  #17 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #18 iorm_wteol (x=1, iod=0x62d000004840) at sr_unix/iorm_wteol.c:87
  #19 iorm_readfl (v=0x7ffe77c33bb0, width=32767, nsec_timeout=<optimized out>) at sr_unix/iorm_readfl.c:229
  #20 op_readfl (v=0x7ffe77c33bb0, length=32767, timeout=0x7f65555111a0 <literal_notimeout>) at sr_port/op_readfl.c:80
  #21 read_source_file () at sr_unix/source_file.c:290
  #22 compiler_startup () at sr_port/compiler_startup.c:159
  #23 zlcompile (len=11 '\v', addr=0x7ffe77c34820 "generated.m") at sr_port/zlcompile.c:45
  #24 op_zlink (v=0x62d0000062e0, quals=0x7f6555fbe6c0) at sr_unix/op_zlink.c:496
  ```

* The SIG-11 happened because we were trying to access `TREF(pos_in_chain)` to get the last triple
  before we started parsing the current line.

  **sr_port/ins_errtriple.c**
  ```c
    49   x = (TREF(pos_in_chain)).exorder.bl;
    50   /* If first error in the current line/cmd, delete all triples and replace them with an OC_RTERROR triple. */
    51   add_rterror_triple = (OC_RTERROR != x->exorder.fl->opcode);
  ```

  But it turns out we are issuing an error even before we started parsing the first line in the M program.
  This is because the `iorm_wteol()` call, made while reading from the M source file as part of the ZLINK,
  tried to write an EOL to the source M program. It could not, because the source is opened read-only, and so
  it issued an ERR_DEVICEREADONLY error.

  And because of this, the contents of `TREF(pos_in_chain)` are not appropriately initialized and so are not
  reliable (they will contain triples left over from the previous compile and can point to freed memory
  or NULL pointers resulting in SIG-11).

Fix
---
* The first fix is to initialize `TREF(pos_in_chain)` to `*TREF(curtchain)` in `sr_port/tripinit.c` right
  after `TREF(curtchain)` is initialized.

  This way any errors in compilation will result in `ins_errtriple()` referencing an initialized
  `TREF(pos_in_chain)`.

* The second fix is in `sr_port/ins_errtriple.c` where we should now account for the possibility that
  `TREF(pos_in_chain).exorder.bl` could be `NULL`. In that case, we should add an `OC_RTERROR` triple
  just like we would if we find that the start of the current M line already has triples and the first
  triple in that chain is not already an `OC_RTERROR` triple. So the change is to set the `add_rterror_triple`
  variable to TRUE in case we find `TREF(pos_in_chain).exorder.bl` is NULL.

* With just the above two fixes, the simple test case presented above no longer failed with a
  SIG-11. But it still had some extraneous output.

  ```sh
  $ $ydb_dist/yottadb -run test40

                                     ^-----
                  At column 28, line 1, source module generated.m
  %YDB-E-DEVICEREADONLY, Cannot write to read-only device
  ```

  I expected only the `%YDB-E-DEVICEREADONLY` error line, not the 3 lines before it, which are syntax
  highlighting for a non-existent M source line.

  Turns out this is an issue in `sr_port/show_source_line.c` where we issue a sequence of `ERR_SRCLIN`,
  `ERR_SRCLNNTDSP` and `ERR_SRCLOC` messages to take care of the syntax highlighting even if there is
  no M source code to highlight.

  This is now fixed by checking `line_chwidth` and only if it is greater than 0 do we issue those messages.
  Otherwise we skip those messages.

  With that change, the revised output is as follows. This looks a lot cleaner to me.

  ```sh
  $ $ydb_dist/yottadb -run test40
  %YDB-E-DEVICEREADONLY, Cannot write to read-only device
  ```
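
The NULL guard from the second fix can be sketched like this. The `triple_model` type, the opcode constants, `need_rterror_triple()` and `demo_errtriple()` are hypothetical simplifications of the real triple chain, shown only to illustrate the decision logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { OC_NOOP_MODEL, OC_RTERROR_MODEL };   /* hypothetical opcodes */

typedef struct triple_model {
    int opcode;
    struct triple_model *fl;   /* forward link */
    struct triple_model *bl;   /* backward link */
} triple_model;

/* Decide whether an error triple must be added for the current line. */
static bool need_rterror_triple(const triple_model *pos_in_chain)
{
    const triple_model *x = pos_in_chain->bl;

    if (x == NULL)
        return true;   /* nothing parsed yet: treat as "first error" */
    return x->fl->opcode != OC_RTERROR_MODEL;
}

/* Bit 0: NULL back link needs a triple. Bit 1: existing error triple needs one. */
int demo_errtriple(void)
{
    triple_model empty = { OC_NOOP_MODEL, NULL, NULL };
    triple_model err   = { OC_RTERROR_MODEL, NULL, NULL };
    triple_model prev  = { OC_NOOP_MODEL, &err, NULL };
    triple_model pos   = { OC_NOOP_MODEL, NULL, &prev };
    int result = 0;

    if (need_rterror_triple(&empty))
        result |= 1;   /* back link is NULL, so a triple is needed */
    if (need_rterror_triple(&pos))
        result |= 2;   /* line already starts with an error triple */
    return result;
}
```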
nars1 added a commit that referenced this issue Jul 25, 2023
… it can cause hang with CLANG/ASAN

Background
----------
* While running the YDBOcto tests with CLANG, I noticed various tests hang. All of them had a
  similar stack-trace.

  ```c
  (gdb) where
  #0  __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
  #1  __sanitizer::Semaphore::Wait() ()
  #2  __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) ()
  #3  __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::PerClass*, __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long) ()
  #4  __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, unsigned long, unsigned long) ()
  #5  __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
  #6  __asan::asan_calloc(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*) ()
  #7  calloc ()
  #8  __pthread_attr_extension (attr=0x7f29af3cee48) at ./nptl/pthread_attr_extension.c:28
  #9  __GI___pthread_attr_setaffinity_np (attr=attr@entry=0x7f29af3cee48, cpusetsize=cpusetsize@entry=32, cpuset=cpuset@entry=0x603000001b40) at ./nptl/pthread_attr_setaffinity.c:45
  #10 __pthread_getattr_np (thread_id=139817006390848, attr=0x7f29af3cee48) at ./nptl/pthread_getattr_np.c:194
  #11 __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ()
  #12 __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ()
  #13 __asan::PlatformUnpoisonStacks() ()
  #14 __asan_handle_no_return ()
  #15 generic_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:187
  #16 ydb_os_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0) at sr_unix/ydb_os_signal_handler.c:85
  #17 <signal handler called>
  #18 sched_yield () at ../sysdeps/unix/syscall-template.S:120
  #19 __sanitizer::StopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), void*) ()
  #20 __lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*) ()
  #21 __GI___dl_iterate_phdr (callback=0x55bd48373320 <__lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*)>, data=0x7ffe13010eb8) at ./elf/dl-iteratephdr.c:74
  #22 __lsan::LockStuffAndStopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), __lsan::CheckForLeaksParam*) ()
  #23 __lsan::CheckForLeaks() ()
  #24 __lsan::DoLeakCheck() ()
  #25 __cxa_finalize (d=0x55bd483af128) at ./stdlib/cxa_finalize.c:83
  #26 __do_global_dtors_aux ()
  #27 ?? ()
  #28 _dl_fini () at ./elf/dl-fini.c:142
  ```

Issue
-----
* The YottaDB SIG-15/SIGTERM signal handler got invoked for a SIG-15. But it noticed that all YottaDB
  exit handler code has already been run (`exit_handler_complete` global variable is TRUE). In that
  case, it invoked any non-YottaDB signal handler for SIG-15 and afterwards, it invoked `_exit()` to
  terminate the process (in line 187).

  **sr_unix/generic_signal_handler.c**
  ```c
    182         if (exit_handler_complete)
    183         {
    184                 if (!using_alternate_sighandling)       /* Go does not send us signals so no need to forward */
    185                 {
    186                         drive_non_ydb_signal_handler_if_any("generic_signal_handler1", sig, info, context, TRUE);
    187                         UNDERSCORE_EXIT(-sig);
    188                 }
    189                 return;         /* Nothing we can do if exit handler has run */
    190         }
  ```

* And because of the `_exit()` call, the CLANG/ASAN library ended up doing a `calloc()` call which hung
  waiting for a futex, most likely due to re-entrant invocations of C library functions that are not
  async-signal safe.

* The cause of this is line 187 above in my opinion.

* If YottaDB exit handler has already run (as part of SIGTERM handling) and we are getting the SIGTERM signal
  again, then I don't see any reason to do the `_exit()` call (using the `UNDERSCORE_EXIT` macro in line 187).

* This code has been there for a long time but I don't think it is doing the right thing.

Fix
---
* Lines 184-188 are now removed in this commit. I think the right thing to do is to just return in case the
  YottaDB exit handler has already been invoked.

* With this change, I verified that the CLANG/ASAN tests run fine in YDBOcto. So at least one Simple API
  use case runs fine with the fix in this commit.

* Initially I thought of disabling lines 184-188 above only when ASAN is enabled. But then I realized it
  is a good change for all cases and so removed lines 184-188.
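
A minimal sketch of the "just return" behavior described above, not the actual YottaDB code. The names `sketch_signal_handler`, `late_signals_ignored` and `deliver_sigterm_twice` are hypothetical, and `exit_handler_complete` here is only a stand-in for the real global. The handler touches only `sig_atomic_t` variables, so a repeated SIGTERM cannot re-enter non-async-signal-safe library code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-ins for the globals named in the commit message. */
static volatile sig_atomic_t exit_handler_complete = 0;
static volatile sig_atomic_t late_signals_ignored = 0;

/* Once exit handling has completed, a repeated signal is simply ignored:
 * no _exit() call and no forwarding, only a return.
 */
static void sketch_signal_handler(int sig)
{
    (void)sig;
    if (exit_handler_complete)
    {
        late_signals_ignored++;
        return;
    }
    /* First delivery: a real handler would run (or defer) exit handling here. */
    exit_handler_complete = 1;
}

/* Installs the handler, delivers SIGTERM twice and returns how many
 * deliveries were ignored because exit handling was already complete.
 */
static int deliver_sigterm_twice(void)
{
    struct sigaction act;

    act.sa_handler = sketch_signal_handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    if (0 != sigaction(SIGTERM, &act, NULL))
        return -1;
    raise(SIGTERM);
    raise(SIGTERM);
    return (int)late_signals_ignored;
}
```

Calling `deliver_sigterm_twice()` from a test program leaves the process running after the second SIGTERM, which is the property the fix relies on.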
nars1 added a commit that referenced this issue Sep 11, 2023
… detect signal/timer handling

Background
----------
* We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover`
  subtest failed with the following symptom.

  ```sh
  $ cat ROLLBACK1_3.logx
  mupip journal -ROLLBACK -back -verify -verbose "*"  -noonline -resync=369813 -lost=ROLLBACK1_3.lost
  Sat Sep  9 04:17:18 PM EDT 2023
  .
  .
  %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep  9 16:19:23 2023
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF]
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495]
  %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep  9 16:19:23 2023
  .
  .
  %YDB-I-MUINFOSTR,     Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233
  %YDB-I-MUINFOUINT4,     Record Offset : 65744 [0x000100D0]
  %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP
  %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use))
  Sat Sep  9 04:20:35 PM EDT 2023
  The time the mupip command took:  197
  ```

* The core file corresponding to the above assert failure had the following stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110
  #9  mur_close_files () at sr_port/mur_close_files.c:841
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116
  #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445
  #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913
  #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312
  #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344
  #21 op_tcommit () at sr_port/op_tcommit.c:535
  #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323
  #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362
  #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400
  #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122
  #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158
  #27 mupip_recover () at sr_port/mupip_recover.c:588
  #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122
  #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22

  (gdb) p gtm_threadgbl_true->dio_buff.aligned
  $5 = 0x563c78429000 "GDSDYNUNX04"
  (gdb) p csd
  $6 = (sgmnt_data_ptr_t) 0x563c78429000
  (gdb) p timer_in_handler
  $1 = 1
  (gdb) p multi_thread_in_use
  $2 = 0

  (gdb) p forced_exit
  $3 = 2
  (gdb) p exit_handler_active
  $4 = 1
  (gdb) p in_os_signal_handler
  $1 = 0
  ```

Issue
-----
* The assert failure was in the `db_ipcs_reset()` -> `DB_LSEEKREAD` -> `DBG_CHECK_DIO_ALIGNMENT` call chain.

* The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment.

  ```c
     53         /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer     \
     54          * code or in threaded code (as we have only ONE buffer to use). Assert that.                           \
     55          */                                                                                                     \
     56         assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use));    \
  ```

* In the failure case, even though we are executing in timer code we are actually in exit handler code
  (as can be seen by the `forced_exit` and `exit_handler_active` variables in the gdb analysis above).
  In this case, the exit handler code will not return out of the timer code and so it is okay for the
  assert to not be TRUE.

* The global variable being checked in the assert is `timer_in_handler`. This is where the issue is.
  That global variable being TRUE just means the `timer_handler()` function is in the current call stack.
  It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code.
  The assert is intended to protect against signal handler interrupting the mainline code. Therefore,
  the correct global variable to check in the assert is `in_os_signal_handler`.

Fix
---
* The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.
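
As an illustration only, the two versions of the assert condition can be written as plain predicates and evaluated with the values seen in the core file. The function names `dio_check_old` and `dio_check_new` and the `using_shared_buff` parameter (standing for `dio_buff.aligned == csd`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition before the fix: keyed on timer_in_handler, which is TRUE whenever
 * timer_handler() is anywhere on the call stack.
 */
static bool dio_check_old(bool using_shared_buff, bool timer_in_handler, bool multi_thread_in_use)
{
    return !using_shared_buff || (!timer_in_handler && !multi_thread_in_use);
}

/* Condition after the fix: keyed on in_os_signal_handler, which is TRUE only
 * while a signal handler is interrupting mainline code.
 */
static bool dio_check_new(bool using_shared_buff, bool in_os_signal_handler, bool multi_thread_in_use)
{
    return !using_shared_buff || (!in_os_signal_handler && !multi_thread_in_use);
}
```

With the gdb values above (`timer_in_handler` = 1, `in_os_signal_handler` = 0, `multi_thread_in_use` = 0 and the shared buffer in use), the old predicate is false while the new one is true.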
nars1 added a commit that referenced this issue Nov 15, 2023
…ert failure)

Background
----------
* Below is a first-time failure, when running the `r126/ydb464` subtest (from the YDBTest project), that
  I noticed while trying to reproduce some other failure.

  ```diff
  --- ydb464/ydb464.diff ---
  19a20,73
  > r126_0_31/ydb464/simpleapi2/child98118.log
  > %YDB-F-ASSERT, Assert failed in sr_port/insert_region.c line 110 for expression ((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit))
  ```

* The C-stack and relevant variables from the core file are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffee07f7480) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  #7  mlk_unlock (p=0x1591940) at sr_port/mlk_unlock.c:70
  #8  tp_unwind (newlevel=0, invocation_type=ROLLBACK_INVOCATION, tprestart_rc=0x0) at sr_port/tp_unwind.c:294
  #9  op_trollback (rb_levels=0) at sr_port/op_trollback.c:200
  #10 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:230
  #12 signal_exit_handler (exit_handler_name=0x7ff4913b071e "deferred_exit_handler", sig=2, info=0x7ff491795458 <stapi_signal_handler_oscontext+3224>, context=0x7ff4917954d8 <stapi_signal_handler_oscontext+3352>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 rel_crit (reg=0x14d0170) at sr_unix/rel_crit.c:81
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  #17 op_lock2_common (timeout=0, laflag=64 '@') at sr_port/op_lock2.c:242
  #18 op_incrlock_common (timeout=0) at sr_port/op_incrlock.c:49
  #19 ydb_lock_incr_s (timeout_nsec=0, varname=0x7ffee07f8c30, subs_used=0, subsarray=0x0) at sr_unix/ydb_lock_incr_s.c:91
  #20 runProc (settings=0x7ffee07fab80, curDepth=1) at simpleapi/inref/randomWalk.c:489
  #21 tpHelper (tpfnparm=0x7ffee07fa100) at simpleapi/inref/randomWalk.c:691
  #22 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256
  #23 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #24 runProc (settings=0x7ffee07fab80, curDepth=0) at simpleapi/inref/randomWalk.c:666
  #25 runProc_driver (settings=0x7ffee07fab80) at simpleapi/inref/randomWalk.c:145
  #26 main () at simpleapi/inref/randomWalk.c:93

  (gdb) f 6
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  110                                     assert((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit));

  (gdb) p process_exiting
  $3 = 1

  (gdb) p t_tries
  $4 = 3

  (gdb) p dollar_tlevel
  $5 = 1

  (gdb) p csa->now_crit
  $6 = 0

  (gdb) up
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  120                             TPNOTACID_CHECK(LOCKGCINTP);
  ```

Issue
-----
* The assert that failed in `insert_region()` (frame 6 in above stack trace) indicates that we were in the
  final retry (i.e. `t_tries` is equal to `3` or `CDB_STAGNATE`) but we did not hold crit on the current
  region where we are trying to do an `mlk_unlock()` operation.

* The assert is valid and did expose an issue.

* In frame 16, in `mlk_lock()`, we did a `rel_crit()` call in the `TPNOTACID_CHECK` macro while in the
  final retry.

  **sr_port/mlk_lock.c**
  ```c
    120                         TPNOTACID_CHECK(LOCKGCINTP);
  ```

* Below is the code inside the macro.

  **sr_port/tp.h**
  ```c
     979 #define TPNOTACID_CHECK(CALLER_STR)                                                                                             \
     980 {                                                                                                                               \
     981         GBLREF  boolean_t       mupip_jnl_recover;                                                                              \
     982         mval            zpos;                                                                                                   \
     983                                                                                                                                 \
     984         if (IS_TP_AND_FINAL_RETRY)                                                                                              \
     985         {                                                                                                                       \
  -> 986                 TP_REL_CRIT_ALL_REG;                                                                                            \
     987                 assert(!mupip_jnl_recover);                                                                                     \
     988                 TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK;                                                                         \
  ```

* Line 986 is where the issue is. We do a `rel_crit()` call there but `t_tries` is still not decremented.
  The decrement of `t_tries` happens 2 lines later at line 988.

* Before doing the `rel_crit()` call, we need to decrement `t_tries`. This way, in case `rel_crit()`
  decides to invoke exit handling due to handling a deferred SIGINT signal (sent in the `ydb464` subtest),
  the assert in `insert_region()` would not be confused by seeing this out-of-design state and will not
  attempt to invoke `t_retry()` etc. which is a no-no as we should not transfer control to M code as
  part of a TP restart while the process is about to terminate on receipt of a SIGINT signal.

Fix
---
* Notice that in `sr_port/t_commit_cleanup.c`, the `t_tries` decrement happens BEFORE the `rel_crit()`
  call.

  **sr_port/t_commit_cleanup.c**
  ```c
    288       if (CDB_STAGNATE <= t_tries)
    289               TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK; /* t_tries untouched for rollback and recover */
      .
      .
    303               if (!csa->hold_onto_crit && csa->now_crit)
    304                       rel_crit(tr->reg);      /* Undo Step (CMT01) */
  ```

* In a similar fashion, in the `TPNOTACID_CHECK` macro in `sr_port/tp.h`, the `TP_REL_CRIT_ALL_REG` call
  should happen AFTER the `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK` call. And that is the fix.

* While doing this fix, I noticed a similar ordering issue in `sr_port/gvcst_init.c` and so fixed that too.
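
The ordering problem can be modeled in a few lines. This is a simplified sketch, not the real macros: `sketch_rel_crit()` stands in for `rel_crit()` handling a deferred signal, `check_invariant()` stands in for the assert in `insert_region()` (with `dollar_tlevel` assumed non-zero), and all `sketch_`/`SKETCH_` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_CDB_STAGNATE 3   /* final-retry value; t_tries was 3 in the core file */

static int  t_tries;
static bool now_crit;
static bool invariant_held;

/* Models the assert in insert_region(): being in the final retry implies crit is held. */
static void check_invariant(void)
{
    if (!((SKETCH_CDB_STAGNATE > t_tries) || now_crit))
        invariant_held = false;
}

/* Models rel_crit(): crit is released and a deferred signal may then trigger
 * exit handling, which ends up evaluating the invariant.
 */
static void sketch_rel_crit(void)
{
    now_crit = false;
    check_invariant();
}

static void reset_state(void)
{
    t_tries = SKETCH_CDB_STAGNATE;
    now_crit = true;
    invariant_held = true;
}

/* Order before the fix: release crit first, decrement t_tries afterwards. */
static bool release_then_decrement(void)
{
    reset_state();
    sketch_rel_crit();
    t_tries--;
    return invariant_held;
}

/* Order after the fix: decrement t_tries first, then release crit. */
static bool decrement_then_release(void)
{
    reset_state();
    t_tries--;
    sketch_rel_crit();
    return invariant_held;
}
```

In this model only `decrement_then_release()` keeps the invariant intact, which matches the order already used in `sr_port/t_commit_cleanup.c`.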

Notes
-----
* While this failure happened with a Debug build of YottaDB, I suspect there is an issue in the Release
  build of YottaDB too, but I am not sure exactly what the user-visible implications are. Even so, it is
  unlikely to be encountered in practice and so no user-visible issue is created for this.
nars1 added a commit that referenced this issue Nov 15, 2023
…_port/deferred_events.c

Background
----------
* The `v61000/intrpt_wcs_wtstart` subtest (in the YDBTest project) failed on a few rare occasions
  during internal testing with the following symptom.

  ```diff
  12a13,299
  > v61000_0_22/intrpt_wcs_wtstart/mumps-wb.out
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events.c line 114 for expression (no_event == outofband || (event_type == outofband))
  ```

Issue
-----
* The stack trace and relevant details from the gdb core analysis are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcc56fd8c0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  #7  jobinterrupt_event (sig=10, info=0x7fb372b8a518 <stapi_signal_handler_oscontext+5528>, context=0x7fb372b8a598 <stapi_signal_handler_oscontext+5656>) at sr_port/jobinterrupt_event.c:61
  #8  <signal handler called>
  #9  clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #10 m_usleep (useconds=10000) at sr_unix/sleep.c:37
  #11 wcs_sleep (sleepfactor=6310) at sr_port/wcs_sleep.c:28
  #12 wcs_flu (options=519) at sr_unix/wcs_flu.c:571
  #13 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:632
  #14 gv_rundown () at sr_port/gv_rundown.c:122
  #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #16 signal_exit_handler (exit_handler_name=0x7fb372a19ecf "generic_signal_handler", sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #17 generic_signal_handler (sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:502
  #18 ydb_os_signal_handler (sig=15, info=0x7ffcc56ffd30, context=0x7ffcc56ffc00) at sr_unix/ydb_os_signal_handler.c:88
  #19 <signal handler called>
  #20 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #21 m_usleep (useconds=999000) at sr_unix/sleep.c:37
  #22 wcs_wtstart (region=0xc30970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:216
  #23 wcs_timer_start (reg=0xc30970, io_ok=1) at sr_port/t_end_sysops.c:1346
  #24 t_end (hist1=0xcfe798, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1848
  #25 gvcst_put2 (val=0xc928b8, parms=0x7ffcc5709a80) at sr_port/gvcst_put.c:2796
  #26 gvcst_put (val=0xc928b8) at sr_port/gvcst_put.c:302
  #27 op_gvput (var=0xc928b8) at sr_port/op_gvput.c:79

  (gdb) f 6
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  114                     assert(no_event == outofband || (event_type == outofband));

  (gdb) p (enum outofbands)no_event
  $2 = no_event

  (gdb) p (enum outofbands)outofband
  $1 = deferred_signal

  (gdb) p (enum outofbands)event_type
  $3 = jobinterrupt
  ```

* The test sends a SIGTERM (i.e. SIG-15) signal. This caused `outofband` variable to be set to
  `deferred_signal` in frame 17 above (`generic_signal_handler.c` inside the `SET_FORCED_EXIT_STATE` macro).

* And then the process was sleeping (due to a white-box test case in the test).

* At that point, it was holding crit and another process was waiting for this and so was about to send
  a `MUTEXLCKALERT` message. At this point, since the test framework had set the `gtm_procstuckexec` env
  var to `com/gtmprocstuck_get_stack_trace.csh`, that was invoked and it in turn invoked `^%YDBPROCSTUCKEXEC`
  which in turn sent a `SIGUSR1` signal (i.e. a `mupip intrpt`) to this very same process that was sleeping
  while holding crit.

* And at this point, the process got the assert failure because the `outofband` variable indicated that
  a `SIG-15` signal needs to be handled whereas the `event_type` variable indicated that the current
  out of band event is a `jobinterrupt` event.

Fix
---
* This seems like a valid scenario and I suspect the assert is invalid.

* I noticed that this very same assert has been removed in a later GT.M release V7.1-001.

  ```diff
  $ cd YDB
  $ git show tags/V7.1-001 sr_port/deferred_events.c | head -35 | tail -8
  @@ -127,7 +127,6 @@ boolean_t xfer_set_handlers(int4  event_type, int4 param_val, boolean_t popped_e
          }
          if (!already_ev_handling)
          {
  -               assert(no_event == outofband || (event_type == outofband));
                  assert(!dollar_zininterrupt || (jobinterrupt != event_type));
                  if (entry != (TREF(save_xfer_root_ptr))->ev_que.fl)
                  {       /* no event in play so pend this one by jiggeriing the xfer_table */
  ```

* I assume GT.M noticed a similar issue, not while releasing V7.0-001 (which is what YottaDB master
  currently has merged) but when releasing the much later V7.1-001 version, and fixed it then.

* Therefore, I am removing the assert that failed.

* This should let the `v61000/intrpt_wcs_wtstart` test run fine until GT.M V7.1-001 gets merged into
  the YottaDB master branch.
nars1 added a commit that referenced this issue Mar 26, 2024
…ofband_clear.c)

Background
----------
* After GT.M V7.0-002 changes were merged, the `r130/ydb560` subtest started failing with the
  following symptom.

  ```
  %YDB-F-ASSERT, Assert failed in sr_port/outofband_clear.c line 43 for expression (TRUE == status)
  ```

* A simple way to reproduce this issue is to run the following and in a parallel terminal send
  a `kill -4` to the `mumps` process (that is stuck in the `hang` command).

  ```sh
  $ cat test.m
   set x=1
   hang 100

  $ mumps -run test
  ```

* Before V7.0-002 merge, one would see just 1 core file (due to the `kill -4`). But after the
  merge, one would see 3 core files. And the 2nd core file had the following stack trace.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140112165532736, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd024690b0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  #9  outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:58
  #10 async_action (lnfetch_or_start=false) at sr_port/deferred_events.c:394
  #11 lvzwr_var (lv=0x60f0000005f0, n=0) at sr_port/lvzwr_var.c:184
  #12 lvzwr_fini (out=0x7ffd02471dc0, t=1) at sr_port/lvzwr_fini.c:84
  #13 op_lvpatwrite (count=0, arg1=140724641668224) at sr_port/op_lvpatwrite.c:85
  #14 zshow_zwrite (output=0x7ffd02471dc0) at sr_port/zshow_zwrite.c:40
  #15 op_zshow (func=0x7ffd0247a0e0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
  #16 jobexam_dump (dump_filename_arg=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fatal_file_name_buff=0x7ffd0247ae20 "/extra4/testarea1/nars/V998/tst_V998_R201_dbg_28_240320_111309/r130_0/ydb560/YDB_FATAL_ERROR.ZSHOW_DMP_89246_1.txt", fmt=0x0, dev_in_use=0x7ffd0247a240) at sr_port/jobexam_process.c:238
  #17 jobexam_process (dump_file_name=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fmt=0x0) at sr_port/jobexam_process.c:147
  #18 create_fatal_error_zshow_dmp (signal=4) at sr_port/create_fatal_error_zshow_dmp.c:66
  #19 signal_exit_handler (exit_handler_name=0x7f6e64c43140 "deferred_exit_handler", sig=4, info=0x7f6e6519f938 <stapi_signal_handler_oscontext+3320>, context=0x7f6e6519f9b8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:59
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #23 async_action (lnfetch_or_start=true) at sr_port/deferred_events.c:380
  #24 l1 () at sr_x86_64/op_startintrrpt.s:40

  (gdb) f 8
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  43                      assert(TRUE == status);

  (gdb) list
  41              {
  42                      status = xfer_reset_if_setter(outofband);
  43                      assert(TRUE == status);
  44              }
  45      }

  (gdb) p outofband
  $1 = 11

  (gdb) p (enum outofbands)outofband
  $2 = deferred_signal
  ```

Issue
-----
* The issue was that `xfer_reset_if_setter()` had been reworked in GT.M V7.0-002. That caused the
  `deferred_signal` type of outofband (which is a YottaDB-only value, unknown to the GT.M code base)
  to not be handled correctly.

* The reason why `xfer_reset_if_setter()` returned FALSE in line 42 above is that the `event_state`
  for the `deferred_signal` event_type at line 249 below was `pending`, not `active`. So the call at
  line 250, which would have done the real reset that was needed, got skipped.

  **sr_port/deferred_events.c**
  ```c
    212 boolean_t xfer_reset_if_setter(int4 event_type)
      .
    249     if (res = (active == TAREF1(save_xfer_root, event_type).event_state))   /* WARNING: assignment */
    250             res = (real_xfer_reset(event_type));
  ```

Fix
---
* The fix was to set the event_state for `deferred_signal` outofband to `active` in `deferred_signal_set()`
  just like it is done for `jobinterrupt` outofband in `jobinterrupt_set()`.

* After this change though, an assert in line 370 below (in the `async_action()` function) failed.

  **sr_port/deferred_events.c**
  ```c
    350 void async_action(bool lnfetch_or_start)
      .
    358         if (jobinterrupt == outofband)
    359         {
      .
    367                 TAREF1(save_xfer_root, jobinterrupt).event_state = pending;     /* jobinterrupt gets a pass from the assert below */
    368         } else if (!lnfetch_or_start)
    369         {       /* something other than a new line caugth this, so  */
    370                 assert(pending >= TAREF1(save_xfer_root, outofband).event_state);
    371                 TAREF1(save_xfer_root, outofband).event_state = pending;        /* make it pending in case it was not there yet */
    372         }
  ```

  I noticed that `jobinterrupt` gets special handling in line 367, so I decided to have special handling
  for `deferred_signal` as well. The special handling is different here in that we do not modify
  the `event_state` (as is done for `jobinterrupt` in line 367 above) for the `deferred_signal` case.
  We just skip lines 370-371.

* With the changes in the above 2 bullets, the simple test case shown above started working fine in that
  it only generated 1 core file (not 3 core files).
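
A simplified model of why the state matters, assuming a three-value state enum; the real enum and function bodies in `sr_port/deferred_events.c` may differ, and every `sketch_`/`SKETCH_` name here is hypothetical. The reset succeeds only when the event is `active`, which is what the first part of the fix arranges.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified event states; the real enum in YottaDB may differ. */
enum sketch_event_state { SKETCH_NOT_IN_PLAY, SKETCH_PENDING, SKETCH_ACTIVE };

static enum sketch_event_state deferred_signal_state = SKETCH_NOT_IN_PLAY;

/* Models deferred_signal_set(): with the fix the event is marked active
 * instead of being left pending.
 */
static void sketch_deferred_signal_set(bool with_fix)
{
    deferred_signal_state = with_fix ? SKETCH_ACTIVE : SKETCH_PENDING;
}

/* Models the check quoted at lines 249-250: the real reset only happens
 * for an event whose state is active.
 */
static bool sketch_reset_if_setter(void)
{
    if (SKETCH_ACTIVE != deferred_signal_state)
        return false;
    deferred_signal_state = SKETCH_NOT_IN_PLAY;
    return true;
}
```

Without the fix the reset is skipped and the caller's `assert(TRUE == status)` would fail; with it the reset runs.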
nars1 added a commit that referenced this issue Jul 10, 2024
Background
----------
* In internal testing, we noticed a rare failure in the `v51000/mu_bkup_stop` subtest
  where a `mupip backup` process that was sent a `SIGTERM` (by the test) ended up
  creating a core file due to ASAN assert failing on a double free.

* Below are relevant details from the core file.

  ```c
  Core was generated by `mupip backup -online -dbg * ./49181_online1'.
  Program terminated with signal SIGSEGV, Segmentation fault.

  (gdb) where
  #0  ydb_os_signal_handler (sig=11, info=0x7fd09968c3f0, context=0x7fd09968c2c0) at sr_unix/ydb_os_signal_handler.c:57
  #1  <signal handler called>
  #2  ydb_os_signal_handler (sig=6, info=0x7fd09968caf0, context=0x7fd09968c9c0) at sr_unix/ydb_os_signal_handler.c:57
  #3  <signal handler called>
  #4  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
  #5  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
  #7  __GI_abort () at ./stdlib/abort.c:79
  #8  __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:143
  #9  __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
  #10 __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffda6de6ebe, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:190
  #11 __asan::ReportDoubleFree (addr=140533757257728, free_stack=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:224
  #12 __asan::Allocator::ReportInvalidFree (this=<optimized out>, stack=0x7ffda6de79f0, chunk_state=<optimized out>, ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:757
  #13 __interceptor_free (ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53
  #14 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1485
  #15 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #16 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #18 mupip_backup_call_on_signal () at sr_port/mupip_backup.c:208
  #19 signal_exit_handler (exit_handler_name=0x7fd097f1dda0 "deferred_exit_handler", sig=15, info=0x7fd098480fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fd098481058 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:67
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #26 mupip_backup () at sr_port/mupip_backup.c:1585
  #27 mupip_main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip_main.c:130
  #28 dlopen_libyottadb (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50, main_func=0x55af49fd9020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip.c:21

  (gdb) f 25
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) f 17
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) down
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  1501            gtm_free_main(addr, TAIL_CALL_LEVEL);

  (gdb) down
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  854                     system_free(addr);

  (gdb) down
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);

  (gdb) list
  1481    {
  1482            intrpt_state_t  prev_intrpt_state;
  1483
  1484            DEFER_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1485            free(addr);
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1487            return;
  1488    }
  ```

Issue
-----
* We did a `free(ptr->backup_hdr)` at line 103. And that in turn ended up using the system `free()`
  function because the test framework had randomly set the `gtmdbglvl` env var to a value of
  `0x80000000`.

* So at line 1485 above, the system free finished but at line 1486 we noticed the SIGTERM that was
  deferred and so decided to handle it. But the `ptr->backup_hdr` variable was still set to a
  non-NULL value so as part of the deferred exit handler, we tried to free this again resulting
  in the double free.

Fix
---
* The fix is to note `ptr->backup_hdr` in a local variable, clear the former and then attempt
  the `free()` on the local variable. This way if we decide to do deferred exit handling after the
  `free()` occurred, we will notice a NULL value of `ptr->backup_hdr` and so avoid the double free.
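
The pattern can be sketched as follows. This is an illustration under stated assumptions, not the actual `mubclnup()` code: `sketch_backup_entry`, `sketch_system_free()` and `sketch_cleanup()` are hypothetical, and the deferred exit handling is simulated by a flag that re-runs the cleanup right after `free()` returns.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the backup list entry that owns backup_hdr. */
struct sketch_backup_entry
{
    void *backup_hdr;
};

static struct sketch_backup_entry *entry_under_cleanup;
static bool deferred_exit_pending;
static int  free_calls;

static void sketch_cleanup(struct sketch_backup_entry *ptr);

/* Models system_free(): once free() returns, a deferred SIGTERM may be
 * handled, and the exit handler runs the same cleanup again.
 */
static void sketch_system_free(void *addr)
{
    free(addr);
    free_calls++;
    if (deferred_exit_pending)
    {
        deferred_exit_pending = false;
        sketch_cleanup(entry_under_cleanup);
    }
}

/* Save the pointer in a local, clear the field, then free the local copy.
 * A re-entrant cleanup sees NULL and does not free a second time.
 */
static void sketch_cleanup(struct sketch_backup_entry *ptr)
{
    void *hdr;

    if (NULL != ptr->backup_hdr)
    {
        hdr = ptr->backup_hdr;
        ptr->backup_hdr = NULL;
        sketch_system_free(hdr);
    }
}

/* Runs the cleanup with a simulated deferred exit; returns the number of free() calls. */
static int cleanup_with_deferred_exit(void)
{
    struct sketch_backup_entry entry;

    entry.backup_hdr = malloc(64);
    if (NULL == entry.backup_hdr)
        return -1;
    entry_under_cleanup = &entry;
    free_calls = 0;
    deferred_exit_pending = true;
    sketch_cleanup(&entry);
    return free_calls;
}
```

Even though the cleanup is entered twice in this model, `free()` is reached only once because the field is already NULL on the second pass.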

Notes
-----
* This is considered too rare a race condition to be encountered in practice and so it is not expected
  to be noticed by a user. Therefore no YDB issue is created for this.