Update copyrights for all modules changed by YottaDB since V6.3-001A #17

Merged
merged 2 commits into from
Aug 3, 2017

Conversation

nars1
Collaborator

@nars1 nars1 commented Aug 3, 2017

No description provided.

@nars1 nars1 self-assigned this Aug 3, 2017
@nars1 nars1 requested a review from estess August 3, 2017 17:20
CMakeLists.txt Outdated
@@ -2,6 +2,9 @@
# #
# Copyright (c) 2012-2016 Fidelity National Information #
# Services, Inc. and/or its subsidiaries. All rights reserved. #
# #
# Copyright (c) 2017 YottaDB LLC. and/or its subsidiaries. #
# All rights reserved. #
# #
# Copyright (c) 2017 Finxact, LLC. and/or its subsidiaries. #
Contributor

My only question is why the Finxact CC is being kept. The only interest Finxact had in it was spun off into YottaDB LLC, so this in effect mentions our interest twice. If we do keep it for some reason, the YottaDB LLC mention should be last, since copyrights are temporally ordered.

@YottaDB
Collaborator

YottaDB commented Aug 3, 2017 via email

@nars1 nars1 merged commit d60c6c3 into YottaDB:master Aug 3, 2017
nars1 added a commit to nars1/YottaDB that referenced this pull request Oct 24, 2017
…andler in threaded code

```
The v63000/gtm8394 subtest failed an assert with the following stack trace.

 #0  0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6
 #1  0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69
 #2  0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211
 #3  0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59
 #4  0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153
 #5  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85
 #6  0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33
 #7  0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153
 #8  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85
 #9  0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074
 #10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724
 #11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392
 #12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436
 #13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499
 #14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483
 #15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064
 #16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535
 #17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228
 #18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0
 #19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6

This is a test where a memory error is forced (using limit vmemorysize) and various rollbacks are run. One of them runs with multiple threads, and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit, which in turn signals the other threads to exit, and that is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided that an out-of-memory situation means we abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). The UNWIND macro had an assert that we are not inside multi-threaded code, but that is exactly where we were in this failure.

The UNWIND macro has that assert because, in a pro build, UNWIND returns control to the erroring thread and lets it continue processing without releasing the pthread mutex lock that rts_error_va() obtained for this thread. No other thread can then get this lock, for the various actions they do, until the erroring thread next tries to obtain it (at which point we notice that we already hold the lock and do not try to get it again) and later releases it.

The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg).

The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error could, in the worst case, effectively turn a multi-threaded recovery into a non-threaded recovery, thereby slowing it down. No other user-visible implications are expected.

```
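The fix described in the commit message above can be illustrated with a small, self-contained sketch. It is not the actual UNWIND macro: the names `thread_lock`, `lock_held`, `try_expand` and `lock_is_free` are hypothetical, and `setjmp`/`longjmp` stands in for the condition-handler unwind. The point it shows is only that the error path releases the thread-level lock before control returns to the erroring thread, so the caller can fall back to the old table without blocking other threads.

```c
#include <assert.h>
#include <pthread.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-ins; none of these names come from the YottaDB sources. */
static pthread_mutex_t thread_lock = PTHREAD_MUTEX_INITIALIZER;
static bool            lock_held   = false;  /* tracks ownership for the assert below */
static jmp_buf         resume_point;

/* Error path: take the thread-level lock (as an error routine would),
 * then unwind back to the caller that established resume_point.
 */
static void raise_error_and_unwind(void)
{
    pthread_mutex_lock(&thread_lock);
    lock_held = true;
    /* The fix: assert that we hold the lock, then release it before unwinding. */
    assert(lock_held);
    lock_held = false;
    pthread_mutex_unlock(&thread_lock);
    longjmp(resume_point, 1);
}

/* A "good-to-expand" operation that keeps the old table if it cannot expand.
 * Returns true if the expansion happened, false if it was abandoned.
 */
bool try_expand(bool simulate_out_of_memory)
{
    if (setjmp(resume_point))
        return false;            /* unwound: continue with the previous table */
    if (simulate_out_of_memory)
        raise_error_and_unwind();
    return true;
}

/* Reports whether the bookkeeping flag says the lock has been released. */
bool lock_is_free(void)
{
    return !lock_held;
}
```

Calling `try_expand(true)` returns false and leaves `lock_is_free()` true; without the unlock in the error path, the lock would stay held after the unwind.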
chathaway-codes pushed a commit that referenced this pull request Oct 17, 2018
When ydb_chset env var is set to "M", compiling the following line

	set c=$PIECE("Hello "_$ZCH(190)_" world!",$ZCH(191),1,2)

failed an assert

%YDB-F-ASSERT, Assert failed in sr_unix/gtm_utf8.c line 273 for expression (gtm_utf8_mode)

with the following C-stack

 #6  utf8_badchar_real () at sr_unix/gtm_utf8.c:273
 #7  utf8_badchar_dec () at sr_unix/gtm_utf8.c:249
 #8  valid_utf_string () at sr_unix/gtm_utf8.c:410
 #9  op_fnzpiece () at sr_port/op_fnzpiece.c:53
 #10 f_piece () at sr_unix/f_piece.c:171
 #11 expritem () at sr_port/expritem.c:619
 #12 expratom () at sr_port/expratom.c:29
 #13 eval_expr () at sr_port/eval_expr.c:63
 #14 expr () at sr_port/expr.c:29
 #15 m_write () at sr_port/m_write.c:71
 #16 cmd () at sr_port/cmd.c:302
 #17 linetail () at sr_port/linetail.c:35
 #18 line () at sr_port/line.c:230
 #19 compiler_startup () at sr_port/compiler_startup.c:144
 #20 compile_source_file () at sr_unix/source_file.c:132
 #21 gtm_compile () at sr_unix/gtm_compile.c:120

The assert that failed is correct. The issue is that we called the utf8_badchar_real() function
in non-UTF8 mode (i.e. when "gtm_utf8_mode" is 0). The fix is in op_fnzpiece(), which now
invokes the valid_utf_string() function only if we are in UTF-8 mode (indicated by "gtm_utf8_mode == 1").
The assert failure (likely introduced as part of GTM-7762 in GT.M V6.3-000) is now fixed.
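The idea behind this commit, that UTF-8 validation should only be reached in UTF-8 mode, can be sketched as below. This is an illustration under assumptions, not the op_fnzpiece() code: `is_well_formed_utf8` and `delimiter_acceptable` are hypothetical helpers, and the validator is deliberately minimal (it rejects stray continuation bytes and truncated sequences but does not check for overlong encodings).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal UTF-8 well-formedness check (hypothetical helper). */
static bool is_well_formed_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len)
    {
        unsigned char c = s[i];
        size_t        extra;    /* number of continuation bytes expected */

        if (c < 0x80)
            extra = 0;
        else if ((c & 0xE0) == 0xC0)
            extra = 1;
        else if ((c & 0xF0) == 0xE0)
            extra = 2;
        else if ((c & 0xF8) == 0xF0)
            extra = 3;
        else
            return false;       /* stray continuation byte or invalid lead byte */
        if (extra > len - i - 1)
            return false;       /* sequence runs past the end of the string */
        for (size_t k = 1; k <= extra; k++)
            if ((s[i + k] & 0xC0) != 0x80)
                return false;
        i += 1 + extra;
    }
    return true;
}

/* Guarded entry point: in M mode every byte string is acceptable, so the
 * UTF-8 validator (and any bad-character error path) is never reached.
 */
bool delimiter_acceptable(const unsigned char *s, size_t len, bool utf8_mode)
{
    if (!utf8_mode)
        return true;
    return is_well_formed_utf8(s, len);
}
```

With the single byte 0xBF (decimal 191, the delimiter in the failing line), the function accepts the delimiter in M mode and rejects it in UTF-8 mode.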
chathaway-codes pushed a commit that referenced this pull request Nov 18, 2018
…secondary errors if primary error is out-of-memory

If already exiting, do not open any object/source directories (which could include relinkctl files)
as part of $ZROUTINES initialization. This avoids potentially nasty codepaths, particularly if the
reason we are exiting is an out-of-memory condition.

We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered
necessary to create a user-visible issue for this.
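
The guard described above can be sketched as follows. This is a simplified model, not the zro_init() code: the flag `exiting` and the functions `ensure_zroutines`, `open_routine_dirs`, `begin_exit` and `dirs_opened_count` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state; the real runtime keeps equivalent information elsewhere. */
static bool exiting         = false;  /* set once exit handling has started */
static bool zroutines_ready = false;  /* set once directories have been opened */
static int  dirs_opened     = 0;      /* counts how often initialization ran */

/* Stand-in for the expensive step that opens object/source directories. */
static void open_routine_dirs(void)
{
    dirs_opened++;
    zroutines_ready = true;
}

/* Lazily initialize, but never start initialization while exiting.
 * Returns true if the routine search path is usable.
 */
bool ensure_zroutines(void)
{
    if (zroutines_ready)
        return true;
    if (exiting)
        return false;   /* skip: avoid secondary errors during exit */
    open_routine_dirs();
    return true;
}

void begin_exit(void)
{
    exiting = true;
}

int dirs_opened_count(void)
{
    return dirs_opened;
}
```

A caller that produces a diagnostic dump during exit would treat a false return as "not available" and move on instead of attempting the initialization.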

For example, below are two C-stacks that showed up in core dumps while running the
simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid
such cores.

Core1
------
Notice that the arguments in frame #0 show "Cannot access memory" errors. Most likely there was no
space left to allocate the C-stack in this core.

(gdb) where
 #0  ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41
 #1  util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731
 #2  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871
 #3  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #4  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261
 #5  gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #6  rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159
 #7  rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #8  relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679
 #9  relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159
 #12 zro_init () at sr_port/zro_init.c:51
 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694
 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232
 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152
 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159
 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145
 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698
 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192
 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60
 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108
 #29 gvnset () at fatalerror.c:56
 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #31 main () at fatalerror.c:32

Core2
-----
In this case there is a SIG-11 deep inside syslog(), most likely due to an out-of-memory situation.

Program terminated with signal SIGSEGV, Segmentation fault.
 #0  vfprintf () from /usr/lib64/libc.so.6
 #1  fprintf () from /usr/lib64/libc.so.6
 #2  __vsyslog_chk () from /usr/lib64/libc.so.6
 #3  syslog () from /usr/lib64/libc.so.6
 #4  util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761
 #5  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871
 #6  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #7  send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149
 #8  send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79
 #9  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264
 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159
 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679
 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159
 #17 zro_init () at sr_port/zro_init.c:51
 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694
 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232
 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152
 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159
 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145
 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698
 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192
 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60
 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108
 #34 gvnset () at fatalerror.c:56
 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #36 main () at fatalerror.c:32
chathaway-codes pushed a commit that referenced this pull request Nov 21, 2018
…CK being called during exit handling

A C program that spawned multiple threads using the SimpleThreadAPI (e.g. ydb_tp_st() etc.)
was deadlocked (due to a code issue). Pressing Ctrl-C (SIGINT) did nothing, and pressing Ctrl-\ (SIGQUIT)
to terminate the C program caused a MAXRTSERRDEPTH fatal error and resulted in a core dump.

Below is the actual output.

^C^\%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with coreQuit (core dumped)

The corresponding C-stack follows.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52090) at sr_unix/rts_error.c:144
 #3  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52270) at sr_unix/rts_error.c:146
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52450) at sr_unix/rts_error.c:146
 #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #8  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52630) at sr_unix/rts_error.c:146
 #9  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #10 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52810) at sr_unix/rts_error.c:146
 #11 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #12 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df529f0) at sr_unix/rts_error.c:146
 #13 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #14 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52bd0) at sr_unix/rts_error.c:146
 #15 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #16 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52db0) at sr_unix/rts_error.c:146
 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #18 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52f90) at sr_unix/rts_error.c:146
 #19 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #20 send_msg_va (csa=0x0, arg_count=8, var=0x7fb28df53570) at sr_unix/send_msg.c:125
 #21 send_msg_csa (csa=0x0, arg_count=8) at sr_unix/send_msg.c:84
 #22 generic_signal_handler (sig=3, info=0x7fb28df53830, context=0x7fb28df53700) at sr_unix/generic_signal_handler.c:244
 #23 <signal handler called>
 #24 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb2880180a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #25 __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb288018040, cond=0x7fb288018080) at pthread_cond_wait.c:502
 #26 __pthread_cond_wait (cond=0x7fb288018080, mutex=0x7fb288018040) at pthread_cond_wait.c:655
 #27 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:80
 #28 start_thread (arg=0x7fb28df54700) at pthread_create.c:463
 #29 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The primary error was at #20 in send_msg_va() inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro.
The actual assert that failed inside the macro was the following.

sr_unix/gtm_multi_thread.h
---------------------------
     99                 /* We should never use pthread_* calls inside a signal/timer handler. Assert that */                    \
    100                 assert(!in_nondeferrable_signal_handler);                                                               \

We were in a signal handler handling a non-deferrable signal (Ctrl-\ aka SIGQUIT) and were about to do
a pthread_mutex_lock() library call, which is a no-no.

If we are in an exit handler, it is possible for send_msg() to be needed (to log the signal that was received
etc.) but it is safer to not do any pthread activity since we cannot be sure if we are exiting while inside
a signal handler or not. Therefore the fix for this is to check if "process_exiting" global variable is TRUE
and if so, we skip all pthread* calls in the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED
macros.
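The behavior described above can be sketched as a pair of small functions. This is only an illustration of the idea: the real PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED are macros with more conditions, and `lock_if_needed`, `unlock_if_needed` and `thread_mutex` are hypothetical names. Only `process_exiting` is taken from the commit message.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t thread_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Name taken from the commit message; TRUE once exit handling has begun. */
bool process_exiting = false;

/* Take the thread-level lock only when threads are in use and we are not
 * exiting. Returns true if the lock was taken and must be released later.
 */
bool lock_if_needed(bool multi_thread_in_use)
{
    if (!multi_thread_in_use || process_exiting)
        return false;   /* skip all pthread calls while exiting */
    pthread_mutex_lock(&thread_mutex);
    return true;
}

/* Release the lock if lock_if_needed() took it, again skipping all
 * pthread calls once the process is exiting.
 */
void unlock_if_needed(bool was_locked)
{
    if (was_locked && !process_exiting)
        pthread_mutex_unlock(&thread_mutex);
}
```

The design choice is to trade mutual exclusion for safety during exit: since the code cannot tell whether it is running inside a signal handler at that point, it avoids pthread calls altogether.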
chathaway-codes pushed a commit that referenced this pull request Jan 10, 2019
…ThreadAPI is active

This issue was exposed by a failure in the dual_fail_extend/dual_fail2_mustop_sigquit subtest.
This test terminates processes by sending them a SIGQUIT/SIG-3 or SIGTERM/SIG-15 signal.
But since one of the threads (the MAIN worker thread) in this multi-threaded process was inside wcs_wtstart() in a
non-interruptible code zone (DEFER_INTERRUPTS had been done), the exit handler invoked in
another concurrently running thread decided to defer the exit until the ENABLE_INTERRUPTS
happened in the worker thread. When the ENABLE_INTERRUPTS did happen, the worker thread invoked
exit handling code while it was already inside a timer handler. And since this particular test
was running with GDSV4 format blocks, wcs_wtstart() could not flush such blocks (since it required
a call to gtm_malloc() which meant a pthread_mutex_lock() call while inside a timer handler which is
a no-no) and so wcs_flu() was not able to flush any blocks as part of exit handling causing it to
fail an assert. Below is the C-stack corresponding to the assert failure.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7f59dccc02a0) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  wcs_flu (options=519) at sr_unix/wcs_flu.c:587
 #7  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608
 #8  gv_rundown () at sr_port/gv_rundown.c:123
 #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:204
 #10 __run_exit_handlers (status=-3, listp=0x7f59e2319718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
 #11 __GI_exit (status=<optimized out>) at exit.c:139
 #12 gtm_image_exit (status=-3) at sr_unix/gtm_image_exit.c:27
 #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:111
 #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:45
 #15 wcs_wtstart (region=0x55b9581d66d8, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:829
 #16 wcs_stale (tid=94254535632600, hd_len=8, region=0x55b9581d62a8) at sr_port/t_end_sysops.c:1387
 #17 timer_handler (why=14) at sr_unix/gt_timers.c:821
 #18 <signal handler called>
 #19 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:277
 #20 gtm_memcpy_validate_and_execute (target=0x7f59dccc25c0, src=0x7f59e32fd6c6, len=0) at sr_port/gtm_memcpy_validate_and_execute.c:42
 #21 gvcst_put2 (val=0x7f59e30c7440 <increment_delta_mval>, parms=0x7f59dccc4be0) at sr_port/gvcst_put.c:626
 #22 gvcst_put (val=0x7f59e30c7440 <increment_delta_mval>) at sr_port/gvcst_put.c:299
 #23 gvcst_incr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/gvcst_incr.c:56
 #24 op_gvincr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/op_gvincr.c:58

The fix for this issue is to not invoke exit handling while inside the timer handler if we know
SimpleThreadAPI is active. In that case, finish the timer handler first and invoke exit handling
a little later in mainline code where it is safe to invoke exit handling.
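The deferral described above can be modeled with a few flags. This is a single-threaded sketch under assumptions, not the YottaDB implementation: all names (`in_timer_handler`, `exit_pending`, `request_exit`, `mainline_safe_point`, and so on) are hypothetical, and real signal-handler code would need `volatile sig_atomic_t` flags and proper signal masking.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state flags. */
static bool threaded_api_active = true;   /* stands in for "SimpleThreadAPI is active" */
static bool in_timer_handler    = false;
static bool exit_pending        = false;
static bool exited              = false;

/* Stand-in for the real exit handling (database rundown, flushing, etc.). */
static void run_exit_handling(void)
{
    exited = true;
}

/* Called when a previously deferred exit becomes runnable. */
void request_exit(void)
{
    if (threaded_api_active && in_timer_handler)
    {
        exit_pending = true;    /* unsafe here: finish the timer handler first */
        return;
    }
    run_exit_handling();
}

/* Body of a timer handler during which a deferred exit may become runnable. */
void timer_handler_body(bool exit_becomes_runnable)
{
    in_timer_handler = true;
    if (exit_becomes_runnable)
        request_exit();
    in_timer_handler = false;
}

/* Called from mainline code at a point where exit handling is safe. */
void mainline_safe_point(void)
{
    if (exit_pending)
    {
        exit_pending = false;
        run_exit_handling();
    }
}

bool has_exited(void)
{
    return exited;
}
```

After `timer_handler_body(true)` the process has not exited yet; the exit runs only at the next `mainline_safe_point()` call.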
chathaway-codes pushed a commit that referenced this pull request Mar 25, 2019
…nvocation can be interrupted by a real timer handler interrupt i.e. timer_handler(SIGALRM)

In an M program invocation (i.e. no SimpleThreadAPI), the below assert failed.

%YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 730 for expression
	(simpleThreadAPI_active || !STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_timer_handler))

And below is the corresponding C-stack

 #0  pthread_kill () from /usr/lib64/libpthread.so.0
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:79
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffef57234f0) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  timer_handler (why=14, info=0x7f26ac6b3e48 <stapi_signal_handler_oscontext+10728>, context=0x7f26ac6b3ec8 <stapi_signal_handler_oscontext+10856>) at sr_unix/gt_timers.c:730
 #7  <signal handler called>
 #8  timer_handler (why=0, info=0x0, context=0x0) at sr_unix/gt_timers.c:727
 #9  check_for_deferred_timers () at sr_unix/gt_timers.c:1205
 #10 deferred_signal_handler () at sr_port/deferred_signal_handler.c:68
 #11 gtm_trigger_complink () at sr_unix/gtm_trigger.c:382
 #12 process_xecute () at sr_unix/trigger_parse.c:1214
 #13 trigger_parse () at sr_unix/trigger_parse.c:1446
 #14 trigger_update_rec () at sr_unix/trigger_update.c:1253
 #15 trigger_update_rec_helper () at sr_unix/trigger_update.c:2007
 #16 trigger_update () at sr_unix/trigger_update.c:2060
 #17 op_fnztrigger () at sr_port/op_fnztrigger.c:245

The issue is that STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_timer_handler) is set to TRUE
by the debug-only STAPI_FAKE_TIMER_HANDLER_WAS_DEFERRED macro invocation in check_for_deferred_timers()
before calling timer_handler(DUMMY_SIG_NUM). And if a real timer handler interrupt happens before
the timer_handler(DUMMY_SIG_NUM) invocation is finished, we will fail the assert.

The assert is not of much use now that a lot more assertions are already folded into the
FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED macro, so it is removed.
chathaway-codes pushed a commit that referenced this pull request Mar 25, 2019
…as a new signal; Reuse pre-existing info/context from original signal handler for this case too

Fixes an occasional dual_fail_extend/dual_fail2_mustop_sigquit subtest failure where
a KILLBYSIGUINFO message is expected when another process sends a SIG-3 but instead we
see a KILLBYSIGSINFO1 message in the .mje file.

Below is one such stack trace where such an incorrect message gets sent out. While we were
handling the SIG-3 in a deferred fashion (through deferred_signal_handler()), another SIG-3
came in from the MAIN worker thread which drove generic_signal_handler() in a nested fashion
and caused the issue.

(gdb) where
 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
 #3  rts_error_va () at sr_unix/rts_error.c:194
 #4  rts_error_csa () at sr_unix/rts_error.c:101
 #5  generic_signal_handler () at sr_unix/generic_signal_handler.c:195
 #6  <signal handler called>
 #7  semop () at ../sysdeps/unix/sysv/linux/semop.c:30
 #8  try_semop_get_c_stack () at sr_unix/gtm_c_stack_trace_semop.c:59
 #9  ftok_sem_lock () at sr_unix/ftok_sems.c:232
 #10 gds_rundown () at sr_unix/gds_rundown.c:324
 #11 gv_rundown () at sr_port/gv_rundown.c:123
 #12 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:215
 #13 __run_exit_handlers () at exit.c:108
 #14 __GI_exit () at exit.c:139
 #15 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
 #16 generic_signal_handler () at sr_unix/generic_signal_handler.c:361
 #17 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:51
 #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:55
 #19 tp_tend () at sr_port/tp_tend.c:1887
 #20 op_tcommit () at sr_port/op_tcommit.c:496

This is now fixed by checking if the signal came in from another thread in the same process
(SI_TKILL) and if so treat this as a forwarded signal and not reset info/context but instead
reuse whatever was there from the original signal handler invocation.

A consequence of this change is that a pre-existing assert (that checked "stapi_signal_handler_deferred")
could now fail. That is now removed.
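The check described above can be sketched as follows. This is an illustration under assumptions, not the generic_signal_handler() code: `saved_signal` and `record_signal` are hypothetical names, and `SI_TKILL` is a Linux-specific `si_code` value, so the sketch targets Linux only.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical record of the signal that originally triggered exit handling. */
typedef struct
{
    siginfo_t info;
    bool      valid;
} saved_signal;

/* Record an incoming signal, unless it looks like a signal forwarded from
 * another thread of the same process (si_code is SI_TKILL) while an original
 * is already saved. In that case the original info is kept.
 * Returns true if the signal was treated as forwarded.
 */
bool record_signal(saved_signal *saved, const siginfo_t *incoming)
{
    if (saved->valid && SI_TKILL == incoming->si_code)
        return true;            /* reuse info from the original invocation */
    saved->info = *incoming;
    saved->valid = true;
    return false;
}
```

Keeping the original `siginfo_t` preserves the sender details (such as `si_pid`) that a message like KILLBYSIGUINFO reports, instead of overwriting them with the in-process forward.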
chathaway-codes pushed a commit that referenced this pull request May 28, 2019
…database files of some regions do not exist

There are 2 issues.

1) In sr_port/op_gvorder.c (and sr_port/op_zprevious.c), we set "gv_cur_region" to a region in the
   global directory before invoking gv_init_reg(). But if the gv_init_reg() call fails (e.g. DBFILERR
   error due to a missing database file), we end up with gv_cur_region set to a non-NULL value but
   gv_cur_region->open is FALSE which is an out-of-design state for most of the database code as that
   assumes the global variable "gv_cur_region" corresponds to a valid and open database file.

2) This out-of-design state leaves the process vulnerable to a later SIG-11, as in
   the below C-stack. The SIG-11 happens in change_reg() because gv_cur_region is set to a non-NULL
   value but the region has not yet been opened (due to the missing database file).

 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  generic_signal_handler (sig=11, ...) at sr_unix/generic_signal_handler.c:405
 #4  <signal handler called>
 #5  change_reg () at sr_port/change_reg.c:49
 #6  gvzwrite_clnup () at sr_port/gvzwrite_clnup.c:47
 #7  gvzwrite_ch () at sr_port/gvzwrite_ch.c:20
 #8  rts_error_va () at sr_unix/rts_error.c:192
 #9  rts_error_csa () at sr_unix/rts_error.c:99
 #10 dbfilopn () at sr_unix/gvcst_init_sysops.c:613
 #11 gvcst_init () at sr_port/gvcst_init.c:862
 #12 gv_init_reg () at sr_port/gv_init_reg.c:56
 #13 gv_bind_name () at sr_port/gv_bind_name.c:75
 #14 op_gvname_common () at sr_port/op_gvname.c:117
 #15 op_gvname () at sr_port/op_gvname.c:70
 #16 gvzwr_fini () at sr_port/gvzwr_fini.c:76
 #17 op_gvzwrite () at sr_port/op_gvzwrite.c:65

The fixes are twofold as well.

1) Primary fix is to ensure the out-of-design state is not created by op_gvorder.c (and op_zprevious.c).
   This is done by moving the initialization of the global variable "gv_cur_region" to AFTER the
   gv_init_reg() call. This ensures if a DBFILERR error occurs inside gv_init_reg(), the global
   gv_cur_region still reflects the state it was in before the name level $ORDER started.

2) Secondary fix is to change_reg.c to ensure it handles the out-of-design state (if any other code that
   we don't know of creates that situation) without a SIG-11. This is done by checking
   gv_cur_region->open and, if it is FALSE, setting cs_addrs/cs_data to NULL, just as is already done
   in the TP_CHANGE_REG macro.
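
The two fixes can be pictured with a minimal sketch. Everything below is hypothetical and heavily simplified: the `_sketch` types, globals and functions are stand-ins invented for illustration, not the YottaDB sources, and a failed open is modelled as a `false` return rather than the DBFILERR error path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a region; the real structure is far larger. */
typedef struct region_sketch
{
	bool	open;	/* true once the database file has been opened */
	void	*csa;	/* stands in for the per-region cs_addrs value */
	void	*csd;	/* stands in for the per-region cs_data value */
} region_sketch;

/* Stand-ins for the globals gv_cur_region, cs_addrs and cs_data. */
region_sketch	*cur_region_sketch;
void		*cur_addrs_sketch;
void		*cur_data_sketch;

/* Secondary fix: tolerate a NULL or unopened region instead of dereferencing it. */
void change_reg_sketch(void)
{
	if ((NULL == cur_region_sketch) || !cur_region_sketch->open)
	{
		cur_addrs_sketch = NULL;
		cur_data_sketch = NULL;
		return;
	}
	cur_addrs_sketch = cur_region_sketch->csa;
	cur_data_sketch = cur_region_sketch->csd;
}

/* Primary fix: publish the region in the global only AFTER the open succeeded. */
bool select_region_sketch(region_sketch *reg, bool (*open_fn)(region_sketch *))
{
	assert((NULL != reg) && (NULL != open_fn));
	if (!reg->open && !open_fn(reg))
		return false;	/* the global still reflects the previous state */
	cur_region_sketch = reg;
	change_reg_sketch();
	return true;
}

/* Two toy "open" routines, used only to exercise the sketch. */
bool open_succeeds_sketch(region_sketch *reg)
{
	reg->open = true;
	return true;
}

bool open_fails_sketch(region_sketch *reg)
{
	(void)reg;	/* models a missing database file */
	return false;
}
```

With this shape, a failed open leaves `cur_region_sketch` pointing at whatever region was current before, and `change_reg_sketch()` stays safe even if some other path leaves an unopened region in the global.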
chathaway-codes pushed a commit that referenced this pull request Jun 12, 2019
In one v60000/gtm4525b subtest run using imptpgo.go, a process assert failed.

> %YDB-F-ASSERT, Assert failed in sr_port/tp_clean_up.c line 104 for expression (!update_trans)

Below is the C-stack

 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
 #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
 #3  rts_error_va () at sr_unix/rts_error.c:192
 #4  rts_error_csa () at sr_unix/rts_error.c:99
 #5  tp_clean_up () at sr_port/tp_clean_up.c:104
 #6  op_trollback () at sr_port/op_trollback.c:149
 #7  t_abort () at sr_port/t_abort.c:53
 #8  secshr_db_clnup () at sr_port/secshr_db_clnup.c:568
 #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:212
 #10 __run_exit_handlers () at exit.c:83
 #11 __GI_exit () at exit.c:105
 #12 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
 #13 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:489
 #14 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:526
 #15 wcs_wtstart () at sr_unix/wcs_wtstart.c:702
 #16 wcs_stale () at sr_port/t_end_sysops.c:1387
 #17 timer_handler () at sr_unix/gt_timers.c:834
 #18 <signal handler called>
 #19 __clock_nanosleep () at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:45
 #20 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:503
 #21 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:526
 #22 t_retry () at sr_port/t_retry.c:183
 #23 t_end () at sr_port/t_end.c:1874
 #24 gvcst_bmp_mark_free () at sr_port/gvcst_bmp_mark_free.c:215
 #25 gvcst_expand_free_subtree () at sr_port/gvcst_expand_free_subtree.c:182
 #26 op_tcommit () at sr_port/op_tcommit.c:581
 #27 stkok3 () at sr_armv7l/opp_tcommit.s:38

(gdb) f 5
 #5  0xb66ef674 in tp_clean_up (clnup_state=TP_ROLLBACK) at /Distrib/YottaDB/V998_R124/sr_port/tp_clean_up.c:104
 104                     assert(!update_trans);

 100  if (tp_pointer->implicit_tstart)
 101  {       /* Resetting this is necessary to avoid blowing an assert in t_begin that it is 0 at the start of a transaction. */
 102          update_trans = 0;
 103  } else
 104          assert(!update_trans);

(gdb) p process_exiting
 $4 = 1

The assert at line 104 is now enhanced to allow for the "process_exiting" case. A comment has been
added to the code to explain why this is okay.
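
One plausible shape of the relaxed check is sketched below. This assumes the enhancement simply ORs `process_exiting` into the asserted condition; the actual expression in `sr_port/tp_clean_up.c` may differ, and the function name and signature are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if the (assumed) relaxed assert condition holds.
 * Mirrors the structure of lines 100-104 quoted above:
 *   - implicit TSTART: update_trans is reset to 0, nothing is asserted.
 *   - otherwise: update_trans must be 0 UNLESS the process is exiting.
 */
bool tp_clean_up_check_sketch(bool implicit_tstart, bool process_exiting, int *update_trans)
{
	assert(NULL != update_trans);
	if (implicit_tstart)
	{
		*update_trans = 0;
		return true;
	}
	return (0 == *update_trans) || process_exiting;
}
```

In the failing core, `update_trans` was non-zero and `process_exiting` was 1, which is the combination the extra term lets through.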
nars1 pushed a commit that referenced this pull request Feb 21, 2020
The assert was only allowing cs_addrs to be NULL in the case of the update process, but a very rare
failure in refresh_sec_1/refresh_secondary_from_secondary showed that it can be NULL under other
conditions too. We have therefore removed is_updproc from the cs_addrs check. This
matches a previous change that had been made on line 120 for the same reason.

Here is the stack of the failure:

 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:91
 #3  generic_signal_handler (sig=11, info=, context=) at sr_unix/generic_signal_handler.c:409
 #4  <signal handler called>
 #5  mutex_deadlock_check (criticalPtr=, csa=) at sr_port/mutex_deadlock_check.c:113
 #6  gtm_mutex_lock (reg=, mutex_spin_parms=, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE) at sr_unix/mutex.c:813
 #7  grab_lock (reg=, is_blocking_wait=1, onln_rlbk_action=2) at sr_unix/grab_lock.c:83
 #8  repl_inst_ftok_counter_halted (udi=) at sr_unix/repl_inst_ftok_counter_halted.c:45
 #9  jnlpool_init (pool_user=GTMRELAXED, gtmsource_startup=0, jnlpool_creator=, gd_ptr=) at sr_unix/jnlpool_init.c:767
 #10 gvcst_init (reg=) at sr_port/gvcst_init.c:949
 #11 gv_init_reg (reg=) at sr_port/gv_init_reg.c:56
 #12 gv_bind_name (addr=, gvname=) at sr_port/gv_bind_name.c:75
 #13 op_gvname_common (count=2, hash_code=-751208200, val_arg=, var=) at sr_port/op_gvname.c:117
 #14 op_gvname (count_arg=3, val_arg=) at sr_port/op_gvname.c:70
 #15 callg (fnptr= <op_gvname>, paramlist=) at sr_unix/callg.c:61
 #16 ydb_get_s (varname=, subs_used=2, subsarray=, ret_value=) at sr_unix/ydb_get_s.c:189
 #17 ydb_get_st (tptoken=0, errstr=, varname=, subs_used=2, subsarray=, ret_value=) at sr_unix/ydb_get_st.c:42
 #18 _cgo_0782cc9ff37d_Cfunc_ydb_get_st (v=) at cgo-gcc-prolog:47
 #19 runtime.asmcgocall () at /usr/lib/go-1.10/src/runtime/asm_amd64.s:688
 #20 ?? ()
 #21 func.* ()
 #22 ?? ()
nars1 added a commit that referenced this pull request Mar 25, 2020
…HECK_NEEDED || (1 != forced_exit));`

In-house testing revealed a rare test failure in the v53003/D9I10002706 subtest.
A MUPIP REORG process started by this subtest was sent a MUPIP STOP by the subtest and failed an assert.

```
> cat ./mupip_reorg_7.logx
Fill Factor:: Index blocks 100%: Data blocks 100%
%YDB-F-ASSERT, Assert failed in sr_port/db_csh_getn.c line 480 for expression (GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit))

(gdb) where
 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
 #4  rts_error_va () at sr_unix/rts_error.c:192
 #5  rts_error_csa () at sr_unix/rts_error.c:99
 #6  db_csh_getn () at sr_port/db_csh_getn.c:480
 #7  t_qread () at sr_port/t_qread.c:444
 #8  gvcst_search () at sr_port/gvcst_search.c:526
 #9  gvcst_data2 () at sr_port/gvcst_data.c:150
 #10 gvcst_data () at sr_port/gvcst_data.c:76
 #11 op_gvdata () at sr_port/op_gvdata.c:48
 #12 gv_select_reg () at sr_port/gv_select.c:373
 #13 gv_select () at sr_port/gv_select.c:302
 #14 mupip_reorg () at sr_port/mupip_reorg.c:238
 #15 mupip_main () at sr_unix/mupip_main.c:122
 #16 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:148
 #17 main () at sr_unix/mupip.c:22

(gdb) f 6
 #6  0x00007fd5b9bd69da in db_csh_getn (block=1343) at sr_port/db_csh_getn.c:480
480                     ENABLE_INTERRUPTS(INTRPT_IN_DB_CSH_GETN, prev_intrpt_state);

(gdb) p forced_exit
$4 = 1

(gdb) p deferred_signal_handling_needed
$5 = 4
```

The assert that failed evaluates to TRUE in the core. That is, `GET_DEFERRED_EXIT_CHECK_NEEDED` is TRUE
and `forced_exit` is 1. I suspect that the MUPIP STOP was received right after the
`GET_DEFERRED_EXIT_CHECK_NEEDED` macro was evaluated (as FALSE), interrupting the assert.
`generic_signal_handler()` then set `forced_exit` to 1 and made `GET_DEFERRED_EXIT_CHECK_NEEDED`
TRUE before returning to the assert code, which then found `forced_exit` to be 1 and failed the assert.

To prevent failures like this all we need is to reverse the operands in the `||` condition. This way
we check `1 != forced_exit` first. And only if that is not TRUE (i.e. `1 == forced_exit`) will we even
go to do the `GET_DEFERRED_EXIT_CHECK_NEEDED` check. At that point, we are guaranteed the macro will
evaluate to a non-zero value.
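
The effect of the operand order can be reproduced deterministically with a small sketch. All names are hypothetical; the "signal" is simulated by a hook that runs right after the first operand has been read, which is the window suspected above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for forced_exit and the state behind GET_DEFERRED_EXIT_CHECK_NEEDED. */
static volatile int	forced_exit_sketch;
static volatile int	deferred_exit_check_needed_sketch;

typedef void (*hook_fn_sketch)(void);

void reset_state_sketch(void)
{
	forced_exit_sketch = 0;
	deferred_exit_check_needed_sketch = 0;
}

/* What the signal handler does in this model: set both pieces of state. */
void simulated_mupip_stop_sketch(void)
{
	forced_exit_sketch = 1;
	deferred_exit_check_needed_sketch = 1;
}

/* Each reader samples its variable, then lets the optional hook "interrupt". */
static bool read_deferred_check_sketch(hook_fn_sketch after)
{
	bool	value = (0 != deferred_exit_check_needed_sketch);

	if (NULL != after)
		after();
	return value;
}

static bool read_not_forced_exit_sketch(hook_fn_sketch after)
{
	bool	value = (1 != forced_exit_sketch);

	if (NULL != after)
		after();
	return value;
}

/* Old order: macro first, forced_exit second. */
bool old_order_holds_sketch(hook_fn_sketch between)
{
	return read_deferred_check_sketch(between) || read_not_forced_exit_sketch(NULL);
}

/* New order: forced_exit first, macro only if forced_exit is already 1. */
bool new_order_holds_sketch(hook_fn_sketch between)
{
	return read_not_forced_exit_sketch(between) || read_deferred_check_sketch(NULL);
}

void operand_order_demo_sketch(void)
{
	reset_state_sketch();
	assert(!old_order_holds_sketch(simulated_mupip_stop_sketch));	/* the observed failure */
	reset_state_sketch();
	assert(new_order_holds_sketch(simulated_mupip_stop_sketch));	/* signal between operands */
	reset_state_sketch();
	simulated_mupip_stop_sketch();
	assert(new_order_holds_sketch(NULL));				/* signal before the assert */
}
```

The new order holds in both interleavings because the handler sets the two variables together: if `forced_exit` is seen as 1, the deferred-exit state has already been set.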
nars1 added a commit that referenced this pull request May 4, 2020
…d malloc issues

* We had an in-house test failure on an ARMV6L box with the following diff.

  ```diff
   > ideminter_rolrec_0/mupipstop_rollback_or_recover/impjob_imptp0.mje5
   > %YDB-F-ASSERT, Assert failed in sr_port/gtm_malloc_src.h line 695 for expression (FALSE)
   ```

  Below is the C-stack at the time of the assert failure.

  ```gdb
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
  #3  rts_error_va () at sr_unix/rts_error.c:192
  #4  rts_error_csa () at sr_unix/rts_error.c:99
  #5  gtm_malloc () at sr_port/gtm_malloc_src.h:695
  #6  condstk_expand () at sr_unix/condstk_expand.c:53
  #7  ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:59
  #8  deferred_signal_handler () at sr_port/deferred_signal_handler.c:57
  #9  gtm_malloc () at sr_port/gtm_malloc_src.h:748
  #10 iorm_use () at sr_unix/iorm_use.c:988
  #11 iorm_open () at sr_unix/iorm_open.c:254
  #12 io_open_try () at sr_unix/io_open_try.c:616
  #13 op_open () at sr_port/op_open.c:160
  #14 open_source_file () at sr_unix/source_file.c:253
  #15 compiler_startup () at sr_port/compiler_startup.c:130
  #16 compile_source_file () at sr_unix/source_file.c:173
  #17 op_zcompile () at sr_port/op_zcompile.c:57
  #18 gtm_trigger_complink () at sr_unix/gtm_trigger.c:451
  #19 gtm_trigger () at sr_unix/gtm_trigger.c:551
  #20 gvtr_match_n_invoke () at sr_unix/gv_trigger.c:1683
  #21 gvcst_put2 () at sr_port/gvcst_put.c:2806
  #22 gvcst_put () at sr_port/gvcst_put.c:299
  #23 op_gvput () at sr_port/op_gvput.c:79
  #24 ydb_set_s () at sr_unix/ydb_set_s.c:137
  #25 ydb_set_st () at sr_unix/ydb_set_st.c:42
  #26 _cgo_d187034042ca_Cfunc_ydb_set_st () at cgo-gcc-prolog:170
  #27 runtime.asmcgocall () at /usr/lib/go-1.11/src/runtime/asm_arm.s:617
  ```

* The cause of the assert failure is a nested call to `gtm_malloc()` (frames 9 and 5 above).
  The nested call happened because the initial condition handler stack size of 5 was not
  enough when `sr_unix/ydb_stm_invoke_deferred_signal_handler.c` tried to do an ESTABLISH and
  add one more condition handler (at frame number 7). The condition handler stack was already
  used up with the following handlers.

  ```gdb
  (gdb) p chnd[0].ch
  $14 = (void (*)()) 0xb62f4f70 <stop_image_conditional_core>
  (gdb) p chnd[1].ch
  $15 = (void (*)()) 0xb63b0f10 <ydb_simpleapi_ch>
  (gdb) p chnd[2].ch
  $16 = (void (*)()) 0xb67f1e0c <gtm_trigger_complink_ch>
  (gdb) p chnd[3].ch
  $17 = (void (*)()) 0xb69c45e8 <source_ch>
  (gdb) p chnd[4].ch
  $18 = (void (*)()) 0xb6ce211c <compiler_ch>
  ```

* The initial condition handler stack size (controlled by the `CONDSTK_INITIAL_INCR` macro) is currently
  set to 5 (last changed from 2 to 5 as part of GT.M V6.3-000) for DEBUG builds and set to 8 for
  PRO/Release builds.

* Due to YottaDB's use of SimpleAPI, this limit of 5 is clearly not enough (as shown by the above failure)
  so it is now being bumped to 8 for DEBUG and to 16 for PRO/Release builds (just to be safe).
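
Why the initial size matters can be seen in a minimal sketch of a growable handler stack. This is an illustration under stated assumptions, not the `condstk_expand()` implementation: the `_sketch` names are hypothetical, and growth by doubling is chosen only for simplicity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef void (*handler_fn_sketch)(void);

typedef struct
{
	handler_fn_sketch	*base;
	size_t			used;
	size_t			capacity;
	unsigned		expansions;	/* how often establishing a handler had to allocate */
} condstk_sketch;

/* Allocate the stack up front with room for "initial_incr" handlers. */
int condstk_init_sketch(condstk_sketch *stk, size_t initial_incr)
{
	assert((NULL != stk) && (0 < initial_incr));
	stk->base = malloc(initial_incr * sizeof(*stk->base));
	stk->used = 0;
	stk->capacity = (NULL != stk->base) ? initial_incr : 0;
	stk->expansions = 0;
	return (NULL != stk->base) ? 0 : -1;
}

/* Push one handler; a full stack forces an allocator call at this point. */
int condstk_establish_sketch(condstk_sketch *stk, handler_fn_sketch handler)
{
	assert(NULL != stk);
	if (stk->used == stk->capacity)
	{
		size_t			newcap = (0 != stk->capacity) ? (stk->capacity * 2) : 1;
		handler_fn_sketch	*grown = realloc(stk->base, newcap * sizeof(*grown));

		if (NULL == grown)
			return -1;
		stk->base = grown;
		stk->capacity = newcap;
		stk->expansions++;
	}
	stk->base[stk->used++] = handler;
	return 0;
}

void condstk_free_sketch(condstk_sketch *stk)
{
	assert(NULL != stk);
	free(stk->base);
	stk->base = NULL;
	stk->used = stk->capacity = 0;
}

void noop_handler_sketch(void)
{
}
```

Establishing a sixth handler on a stack sized for 5 is the point where the allocator gets called; if that happens while an outer allocation is still in progress (as in frames 9 and 5 above), the call is nested. Sizing the stack for 8 keeps the same six handlers allocation-free.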
nars1 added a commit that referenced this pull request Jun 12, 2020
… is sent to a YottaDB process

* It is possible a timer interrupt comes in while we are canceling the timer in `sys_canc_timer()`
  (invoked in `generic_signal_handler()`). This can cause problems since we might end up trying to
  start a posix system timer on a non-existing timer id (as shown by the below C-stack we saw in
  a test failure).

  ```gdb
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  sys_settimer (tid=1978083808, time_to_expir=0x7eeeb62c) at sr_unix/gt_timers.c:564
  #7  start_first_timer (curr_time=0x7eeeb6fc) at sr_unix/gt_timers.c:633
  #8  timer_handler (why=14, info=0x76bad060 <stapi_signal_handler_oscontext+8808>, context=0x76bad0e0 <stapi_signal_handler_oscontext+8936>) at sr_unix/gt_timers.c:853
  #9  <signal handler called>
  #10 timer_delete (timerid=0x823a38) at ../sysdeps/unix/sysv/linux/timer_delete.c:38
  #11 sys_canc_timer () at sr_unix/gt_timers.c:1041
  #12 generic_signal_handler (sig=15, info=0x76babbc0 <stapi_signal_handler_oscontext+3528>, context=0x76babc40 <stapi_signal_handler_oscontext+3656>) at sr_unix/generic_signal_handler.c:401
  #13 <signal handler called>
  #14 write () at ../sysdeps/unix/syscall-template.S:84
  #15 iorm_wteol (x=1, iod=0x83c420) at sr_unix/iorm_wteol.c:226
  #16 write_text_newline_and_flush_pio (text=0x7eeec298) at sr_port/flush_pio.c:128
  #17 util_out_print_vaparm (message=0x76ab1864 "Blocks coalesced    : !SL ", flush=1, var=..., faocnt=2147483647) at sr_unix/util_output.c:872
  #18 util_out_print (message=0x76ab1864 "Blocks coalesced    : !SL ", flush=1) at sr_unix/util_output.c:913
  #19 reorg_finish (dest_blk_id=6003, blks_processed=1, blks_killed=0, blks_reused=0, file_extended=0, lvls_reduced=0, blks_coalesced=0, blks_split=0, blks_swapped=0) at sr_port/mu_reorg.c:720
  #20 mu_reorg (gl_ptr=0x10bdca0, exclude_glist_ptr=0x7eeed5a8, resume=0x7eeed4c4, index_fill_factor=100, data_fill_factor=100, reorg_op=0) at sr_port/mu_reorg.c:556
  #21 mupip_reorg () at sr_port/mupip_reorg.c:283
  #22 mupip_main (argc=2, argv=0x7eef7914, envp=0x7eef7920) at sr_unix/mupip_main.c:122
  #23 dlopen_libyottadb (argc=2, argv=0x7eef7914, envp=0x7eef7920, main_func=0x115f4 "mupip_main") at sr_unix/dlopen_libyottadb.c:148
  #24 main (argc=2, argv=0x7eef7914, envp=0x7eef7920) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  0x75df09f0 in sys_settimer (tid=1978083808, time_to_expir=0x7eeeb62c) at sr_unix/gt_timers.c:564
  564                     assert(WBTEST_ENABLED(WBTEST_SETITIMER_ERROR));
  (gdb) list
  559             assert(sys_timer.it_value.tv_sec || sys_timer.it_value.tv_nsec);
  560             sys_timer.it_interval.tv_sec = sys_timer.it_interval.tv_nsec = 0;
  561             if ((-1 == timer_settime(posix_timer_id, 0, &sys_timer, &old_sys_timer)) || WBTEST_ENABLED(WBTEST_SETITIMER_ERROR))
  562             {
  563                     save_errno = errno;
  564                     assert(WBTEST_ENABLED(WBTEST_SETITIMER_ERROR));
  565                     WBTEST_ONLY(WBTEST_SETITIMER_ERROR,
  566                             save_errno = EINVAL;
  567                     );
  568                     rts_error_csa(CSA_ARG(NULL) VARLSTCNT(8)
  569                                             ERR_SYSCALL, 5, RTS_ERROR_LITERAL("timer_settime()"), CALLFROM, save_errno);

  (gdb) p save_errno
  $1 = 22
  ```

  The fix is to remove the `sys_canc_timer()` call in `generic_signal_handler()` as it is not clear to me what
  purpose it serves. Later in exit handling (in `gtm_exit_handler()` etc.), we call
  `CANCEL_TIMERS` anyway to cancel any active unsafe timers. This is a safer way of doing the `sys_canc_timer()`
  (as it blocks SIGALRM).

* That said, as part of the code review @estess indicated that he remembered this as being necessary for some
  reason when we were about to dump a core due to a fatal signal (e.g. assert etc.). Therefore, I have
  added code to block SIGALRM only in that code path even though similar code also exists and would be invoked
  a little later in `sr_unix/gtm_fork_n_core.c`.
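
The general technique of blocking SIGALRM around a timer-sensitive step can be sketched with plain POSIX calls. The helper names are hypothetical and this is not the code that was added; it uses `sigprocmask()`, so a multi-threaded process would use `pthread_sigmask()` in the same way.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Returns 1 if SIGALRM is currently blocked, 0 if not, -1 on error. */
int sigalrm_is_blocked_sketch(void)
{
	sigset_t	current;

	/* With a NULL "set" argument the mask is only queried, not changed. */
	if (0 != sigprocmask(SIG_BLOCK, NULL, &current))
		return -1;
	return sigismember(&current, SIGALRM);
}

/* Run "critical" with SIGALRM blocked, then restore the previous mask. */
int run_with_sigalrm_blocked_sketch(void (*critical)(void))
{
	sigset_t	block_set, saved_set;

	assert(NULL != critical);
	sigemptyset(&block_set);
	sigaddset(&block_set, SIGALRM);
	if (0 != sigprocmask(SIG_BLOCK, &block_set, &saved_set))
		return -1;
	critical();	/* e.g. cancelling a timer; no timer handler can run in here */
	return sigprocmask(SIG_SETMASK, &saved_set, NULL);
}

/* Toy critical section that records whether SIGALRM was blocked while it ran. */
int	seen_blocked_sketch = -1;

void record_mask_sketch(void)
{
	seen_blocked_sketch = sigalrm_is_blocked_sketch();
}
```

A SIGALRM that arrives while blocked stays pending and is delivered once the saved mask is restored, so the timer handler can no longer run in the middle of the cancellation as in frames 8 to 11 above.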
nars1 added a commit that referenced this pull request Jul 23, 2020
…ion (blktn < ctn)

Issue description
-----------------

* In-house testing showed the following rare test failure (it happened 3 times, in different tests,
  out of thousands of test runs). The primary failure is the below assert.

  ```diff
  > host:refresh_sec_1_10/refresh_secondary/impjob_imptp0.mje3
  > %YDB-F-ASSERT, Assert failed in sr_port/gvcst_blk_build.c line 135 for expression (!IS_MCODE_RUNNING || !cs_addrs->t_commit_crit || (dba_bg != cs_data->acc_meth) || (n_gds_t_op < cse->mode) || (cse->mode == gds_t_acquired) || ((!cs_data->asyncio && (blktn < ctn)) || (cs_data->asyncio && (blktn <= ctn))))
  ```

* Below is the C-stack.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
  #3  rts_error_va (csa=0x0, argcnt=7, var=0x7fff88215080) at sr_unix/rts_error.c:192
  #4  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #5  gvcst_blk_build (cse=0x1cbffd0, base_addr=0x7f26683c9000 "\001", ctn=100550) at sr_port/gvcst_blk_build.c:134
  #6  bg_update_phase2 (cs=0x1cbffd0, ctn=100550, effective_tn=100550, si=0x1b79240) at sr_port/t_end_sysops.c:1106
  #7  tp_tend () at sr_port/tp_tend.c:1627
  #8  op_tcommit () at sr_port/op_tcommit.c:495
  #9  ydb_tp_s_common (lydbrtn=LYDB_RTN_TP_COMMIT, tpfn=0x0, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:279
  #10 ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x51e3e0 <ydb_tp_st_wrapper_cgo>, tpfnparm=0xc00016dd10, transid=0x1abe7f0 "BATCH", namecount=0, varnames=0x1abe770) at sr_unix/ydb_tp_st.c:104
  #11 _cgo_d187034042ca_Cfunc_ydb_tp_st (v=0xc0000e3808) at cgo-gcc-prolog:83
  #12 runtime.asmcgocall () at /usr/lib/golang/src/runtime/asm_amd64.s:655
  #13 ?? ()
  #14 __tsan::Release(__tsan::ThreadState*, unsigned long, unsigned long) ()
  #15 racecall () at /usr/lib/golang/src/runtime/race_amd64.s:381
  #16 ?? () at /usr/lib/golang/src/runtime/proc.go:1080
  #17 runtime.rt0_go () at /usr/lib/golang/src/runtime/asm_amd64.s:220
  #18 ?? ()

  (gdb) f 5
  #5  0x00007f26753d77c7 in gvcst_blk_build (cse=0x1cbffd0, base_addr=0x7f26683c9000 "\001", ctn=100550) at /Distrib/YottaDB/V998_R129/sr_port/gvcst_blk_build.c:134
  134             assert(!IS_MCODE_RUNNING || !cs_addrs->t_commit_crit || (dba_bg != cs_data->acc_meth) || (n_gds_t_op < cse->mode)
  135                    || (cse->mode == gds_t_acquired) || ((!cs_data->asyncio && (blktn < ctn)) || (cs_data->asyncio && (blktn <= ctn))));

  (gdb) p blktn
  $1 = 100550

  (gdb) p ctn
  $2 = 100550
  ```

* The assert failed because a block that we are about to update as part of this transaction commit
  already has a block-header with a transaction number that is equal to the database current transaction
  number of 100550. This is an out-of-design situation. We expect the block-header to always have a
  transaction number that is LESS THAN the current db tn.

* After some code review and detailed analysis of the core, we found that this is a case where NOISOLATION
  was used (i.e. `cse->recompute_list_head` is non-NULL for various cw-set elements as shown below). Because
  of that, we had to invoke `bg_update_phase2()` for such NOISOLATION related cses while still in phase1
  of the transaction when we hold crit.

* As part of this TP transaction, we had to update 11 blocks.

  ```c
  (gdb) up
  #6  0x00007f26755fd820 in bg_update_phase2 (cs=0x1cbffd0, ctn=100550, effective_tn=100550, si=0x1b79240) at /Distrib/YottaDB/V998_R129/sr_port/t_end_sysops.c:1106
  1106                            gvcst_blk_build(cs, blk_ptr, effective_tn);

  (gdb) p si->cw_set_depth
  $3 = 11
  ```

* As part of invoking `bg_update_phase2()` for the 10th block, we ended up using the same
  cache-record (and hence the same GDS block buffer in shared memory) that we had previously used to
  store the 2nd block in the same TP transaction. This is confirmed by `cs->old_block` pointing to a
  different buffer than `cs->cr->buffaddr` as shown below.

  ```c
  (gdb) p cs->old_block
  $4 = (sm_uc_ptr_t) 0x7f26683c4000 "\001"

  (gdb) p (sm_uc_ptr_t)(cs->cr->buffaddr + (char *)cs_addrs->mlkctl)
  $5 = (unsigned char *) 0x7f26683c9000 "\001"
  ```

* This is further confirmed by the cr values that are identical for 2 different cw_set elements below.

  ```c
  (gdb) p si->first_cw_set->next_cw_set
  $6 = (struct cw_set_element_struct *) 0x1cbf9d0
  (gdb) p $6->cr
  $7 = (cache_rec_ptr_t) 0x7f266831fc30
  (gdb) p $6->blk
  $8 = 4636
  (gdb) p $6->recompute_list_head
  $9 = (key_cum_value *) 0x1ccdec0

  (gdb) p si->first_cw_set->next_cw_set->next_cw_set->next_cw_set->next_cw_set->next_cw_set->next_cw_set->next_cw_set->next_cw_set->next_cw_set
  $11 = (struct cw_set_element_struct *) 0x1cbffd0
  (gdb) p $11->cr
  $13 = (cache_rec_ptr_t) 0x7f266831fc30
  (gdb) p $11->blk
  $12 = 6463
  (gdb) p $11->recompute_list_head
  $14 = (key_cum_value *) 0x1cd4160
  (gdb) p $11->cr->blk
  $15 = 6463
  ```

* And since the 2nd block changes had already been done by the time we came to update the 10th block,
  we see that the buffer (which is the same for both blocks) already has a block header with the tn
  that is EQUAL to the db curr tn. That explains the assert failure.

* The reason why the same cache-record got reused for multiple block updates in the same TP transaction
  is because the 10th block did not pin a cache-record as part of the commit validation. Only the
  2nd block did. This is because the 10th block had already built a private copy of the block and set
  `cse->new_buff` to a non-NULL value so there is no need to pin a shared memory buffer since we have
  a private buffer already. This is confirmed by the debugger below where the 2nd block had a NULL
  new_buff, but the 10th block had a non-NULL new_buff.

  ```c
  (gdb) p $6->new_buff
  $16 = (unsigned char *) 0x0

  (gdb) p $11->new_buff
  $17 = (unsigned char *) 0x1cc59b0 "\001"
  ```

* This means when `bg_update_phase1()` call is done for the 10th block, a `db_csh_getn()` call will
  happen in that function and that can end up picking the exact same buffer that we used to update
  and commit the 2nd block. That explains the last missing link in this puzzle.

* There is no issue with this. The updates of the 2nd block already got flushed to disk as part
  of the `db_csh_getn()` call. And the updates for the 10th block are going to happen right now in the
  same buffer. There is no need for the 2nd block to stay in shared memory to finish the rest of the
  TP transaction commit so there is no issue. It is just an assert that did not take this exceptional
  situation into consideration.

Fix
----
* When `db_csh_getn()` picks up a new buffer while in the midst of the commit, it is possible for it
  to pick a buffer that was already used in the same commit for a prior block. Therefore reset the
  block header tn in the buffer to one less than the db curr tn (only in debug builds). This should
  ensure the assert in `gvcst_blk_build()` does not fail for the buffer reuse case. The fixes are in
  `sr_port/t_end_sysops.c` towards the end of the `bg_update_phase1()` function.
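
The invariant and the debug-only reset can be illustrated with a minimal sketch. The structure and function names are hypothetical, only the non-asyncio form of the check (`blktn < ctn`) is modelled, and `NDEBUG` is used here as a stand-in for whatever macro distinguishes debug builds in the real sources.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a block header; only the transaction number matters here. */
typedef struct
{
	uint64_t	tn;
} blk_hdr_sketch;

/* The condition the assert in gvcst_blk_build() relies on (non-asyncio case). */
bool blk_tn_precedes_sketch(const blk_hdr_sketch *hdr, uint64_t ctn)
{
	assert(NULL != hdr);
	return hdr->tn < ctn;
}

/* Called when a buffer is handed out for reuse in the middle of a commit.
 * In debug builds, rewind the stale header tn to one less than the current
 * database tn so the later assert sees the state it expects.
 */
void reset_reused_buffer_tn_sketch(blk_hdr_sketch *hdr, uint64_t ctn)
{
	assert((NULL != hdr) && (0 < ctn));
#	ifndef NDEBUG
	hdr->tn = ctn - 1;
#	else
	(void)hdr;
	(void)ctn;
#	endif
}
```

With the values from the core (`blktn` and `ctn` both 100550), the check fails before the reset and passes after it, since the header then carries 100549.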
nars1 added a commit that referenced this pull request Aug 7, 2020
…ue to nested malloc()/free() invocation

Background
-----------
* The r126/ydb464 subtest tests Ctrl-C handling in a SimpleAPI program. Once in a while it hangs with
  two types of C-stack traces (pasted below). In both cases, the C program was in the middle of
  a glibc `malloc()` call when it got the Ctrl-C. The signal handler then ends up invoking some
  function (`sys_settimer()` in the first case below and `syslog()` in the second case below) that in
  turn requires a `malloc()` and that hangs because it requires an internal glibc/system lock that is
  currently held by the interrupted `malloc()`.

  ```c
  (gdb) where
  #0  __lll_lock_wait_private () from /usr/lib64/libc.so.6
  #1  _L_lock_17166 () from /usr/lib64/libc.so.6
  #2  malloc () from /usr/lib64/libc.so.6
  #3  timer_create@@GLIBC_2.3.3 () from /usr/lib64/librt.so.1
  #4  sys_settimer () at sr_unix/gt_timers.c:545
  #5  start_first_timer () at sr_unix/gt_timers.c:633
  #6  cancel_unsafe_timers () at sr_unix/gt_timers.c:1086
  #7  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:212
  #8  generic_signal_handler () at sr_unix/generic_signal_handler.c:449
  #9  <signal handler called>
  #10 _int_malloc () from /usr/lib64/libc.so.6
  #11 malloc () from /usr/lib64/libc.so.6
  #12 runProc () at simpleapi/inref/randomWalk.c:212
  #13 runProc_driver () at simpleapi/inref/randomWalk.c:145
  #14 main () at simpleapi/inref/randomWalk.c:93

  (gdb) where
  #0  __lll_lock_wait_private () from /usr/lib64/libc.so.6
  #1  _L_lock_17166 () from /usr/lib64/libc.so.6
  #2  malloc () from /usr/lib64/libc.so.6
  #3  open_memstream () from /usr/lib64/libc.so.6
  #4  __vsyslog_chk () from /usr/lib64/libc.so.6
  #5  syslog () from /usr/lib64/libc.so.6
  #6  util_out_send_oper () at sr_unix/util_output.c:770
  #7  util_out_print_vaparm () at sr_unix/util_output.c:880
  #8  util_out_print () at sr_unix/util_output.c:913
  #9  send_msg_va () at sr_unix/send_msg.c:179
  #10 send_msg_csa () at sr_unix/send_msg.c:84
  #11 forced_exit_err_display () at sr_unix/forced_exit_err_display.c:72
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:87
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:67
  #14 wcs_wtstart () at sr_unix/wcs_wtstart.c:831
  #15 wcs_stale () at sr_port/t_end_sysops.c:1387
  #16 timer_handler () at sr_unix/gt_timers.c:773
  #17 <signal handler called>
  #18 _int_malloc () from /usr/lib64/libc.so.6
  #19 malloc () from /usr/lib64/libc.so.6
  #20 runProc () at simpleapi/inref/randomWalk.c:217
  #21 runProc_driver () at simpleapi/inref/randomWalk.c:145
  #22 main () at simpleapi/inref/randomWalk.c:93
  ```

* The main C program cannot be expected to avoid standard C functions like `malloc()` (which is
  what it would take to avoid hangs like the above). But the YottaDB signal handler has no way of
  knowing where the signal interrupt occurred, so it has to keep its code simple (it cannot otherwise
  avoid the hang in the signal handler code).

* Therefore, I think YottaDB should avoid calls like `sys_settimer()` and `syslog()` while it is inside a
  signal handler if it determines the base program is a C program. While this might reduce the ability
  of YDB to log events in the system log, it might be our best bet at avoiding a hang in case of a Ctrl-C.

* The suggestion in the previous bullet was originally made in the issue description at gitlab. But some
  more thought on this problem made me realize it is not easily possible to go through all the system
  call usages inside all code that is reachable from `generic_signal_handler()`. Therefore a different
  approach to fixing this issue was taken as described in the following section.

Fix
----
* A new function `ydb_os_signal_handler()` is registered as the signal handler for SIGALRM/SIGINT/SIGTERM.
  This replaces the functions `generic_signal_handler()` and `timer_handler()` which were previously
  registered as the signal handlers for these signals. A new global variable "in_os_signal_handler" is set
  to a non-zero value while we are inside `ydb_os_signal_handler()`. Note that this variable is not just
  a boolean (FALSE/TRUE) but can take on values of 2 (for example if a SIGALRM is received while we are
  handling a SIGTERM). Control is then transferred from `ydb_os_signal_handler()` to the actual signal
  handler function (`timer_handler()` or `generic_signal_handler()` depending on the signal).

* This variable is now checked in the `DEFER_EXIT_PROCESSING` macro (in `generic_signal_handler.c`)
  and if found to be non-zero, we defer exit processing. This avoids calls to `gtm_exit_handler()` (and
  later calls to `malloc()` that hang as described in the 1st type of C-stack in the `Background` section).

* This variable is checked in `deferred_signal_handler()` and if non-zero, the function returns right
  away without doing deferred signal handling thereby avoiding potentially risky system calls. This
  prevents `deferred_signal_handler()` from later making a `syslog()` call and causing a hang (like the
  2nd type of C-stack described in the `Background` section).

* In addition, various other pieces of code look at this new `in_os_signal_handler` global variable to
  know if we are inside a signal handler and if so avoid doing anything that is risky to do inside a
  signal handler (like `malloc`/`free`/`syslog`).

* Because we now skip exit handling if we are inside the signal handler, we have a new problem. Previously
  we were guaranteed to exit the process (assuming we are not holding crit etc.) when the SIGTERM/SIGINT
  is received. But now we will not do exit handling because we are inside the os signal handler at that
  point. How, then, do we ensure the process does exit soon afterwards?

  To solve this, all places which can get an `EINTR` return code from a system call were examined. All those
  now need to check if the system call got interrupted because of a SIGTERM/SIGINT signal and if so check
  if deferred signal handling needs to be invoked now that we are outside the signal handler.

  This meant introducing a new `EINTR_HANDLING_CHECK` macro that is invoked in all places where we have a
  `do/while` loop based on `EINTR`. This necessitated changes to a LOT of files.

  For an example of the changes, see `sr_port/eintr_wrappers.h`.
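
The overall pattern (count nesting inside the OS-level handler, only record the signal there, and act on it from an `EINTR` retry loop once outside the handler) can be sketched as follows. All `_sketch` names are hypothetical, and this is not the `EINTR_HANDLING_CHECK` macro itself; it only shows where such a check sits in a retry loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <unistd.h>

/* Nesting depth inside the OS-level handler; can exceed 1 if signals nest. */
static volatile sig_atomic_t	in_os_signal_handler_sketch;
/* Signal number of a pending SIGTERM/SIGINT, or 0 if none. */
static volatile sig_atomic_t	exit_signal_pending_sketch;

/* OS-level handler: only records state, does nothing that could call malloc(). */
void os_signal_handler_sketch(int sig)
{
	in_os_signal_handler_sketch++;
	if ((SIGTERM == sig) || (SIGINT == sig))
		exit_signal_pending_sketch = sig;
	in_os_signal_handler_sketch--;
}

/* True only outside the handler, when an exit signal has been recorded. */
int deferred_exit_due_sketch(void)
{
	return (0 == in_os_signal_handler_sketch) && (0 != exit_signal_pending_sketch);
}

/* write() retry loop with the extra check: an EINTR caused by an exit signal
 * stops the retries and tells the caller to run deferred exit handling.
 */
ssize_t write_eintr_checked_sketch(int fd, const void *buf, size_t len, int *exit_due)
{
	ssize_t	rc;

	assert(NULL != exit_due);
	*exit_due = 0;
	do
	{
		rc = write(fd, buf, len);
		if ((-1 == rc) && (EINTR == errno) && deferred_exit_due_sketch())
		{
			*exit_due = 1;
			break;
		}
	} while ((-1 == rc) && (EINTR == errno));
	return rc;
}
```

A caller that sees `*exit_due` set is in normal (non-handler) context, so it can safely run exit processing that may call `malloc()`, `free()` or `syslog()`.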
nars1 added a commit that referenced this pull request Aug 7, 2020
…e si->kill_set_tail set to NULL

* Below is the C-stack from the failure (1 out of 500 runs).

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  generic_signal_handler (sig=11, info=0x7f4c6ee65588 <stapi_signal_handler_oscontext+4296>, context=0x7f4c6ee65608 <stapi_signal_handler_oscontext+4424>) at sr_unix/generic_signal_handler.c:422
  #4  ydb_os_signal_handler (sig=11, info=0x7ffe3eb88530, context=0x7ffe3eb88400) at sr_unix/ydb_os_signal_handler.c:84
  #5  <signal handler called>
  #6  tp_clean_up (clnup_state=TP_ROLLBACK) at sr_port/tp_clean_up.c:215
  #7  op_trollback (rb_levels=0) at sr_port/op_trollback.c:148
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569
  #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:212
  #10 signal_exit_handler (sig=2, info=0x7f4c6ee65588 <stapi_signal_handler_oscontext+4296>, context=0x7f4c6ee65608 <stapi_signal_handler_oscontext+4424>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:77
  #11 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:111
  #12 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #13 gtm_free (addr=0x1918040) at sr_port/gtm_malloc_src.h:1038
  #14 rollbk_sgm_tlvl_info (newlevel=1, si=0x191b840) at sr_port/tp_incr_clean_up.c:381
  #15 tp_incr_clean_up (newlevel=1) at sr_port/tp_incr_clean_up.c:96
  #16 op_trollback (rb_levels=-1) at sr_port/op_trollback.c:218
  #17 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8a240, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:301
  #18 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8a240, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #19 runProc (settings=0x7ffe3eb8c1f0, curDepth=1) at simpleapi/inref/randomWalk.c:666
  #20 tpHelper (tpfnparm=0x7ffe3eb8b770) at simpleapi/inref/randomWalk.c:691
  #21 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8b770, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256
  #22 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffe3eb8b770, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #23 runProc (settings=0x7ffe3eb8c1f0, curDepth=0) at simpleapi/inref/randomWalk.c:666
  #24 runProc_driver (settings=0x7ffe3eb8c1f0) at simpleapi/inref/randomWalk.c:145
  #25 main () at simpleapi/inref/randomWalk.c:93

  (gdb) f 6
  #6  0x00007f4c6e39e4d2 in tp_clean_up (clnup_state=TP_ROLLBACK) at sr_port/tp_clean_up.c:215
  215                                             FREE_KILL_SET(ks);

  (gdb) p ks
  $1 = (kill_set *) 0xdeadbeefdeadbeef
  ```

* The SIG-11 was because we were done with a `FREE_KILL_SET` (in frame 14 above) when we realized the need
  to handle a deferred signal and as part of handling that we ended up doing another `FREE_KILL_SET` (in
  frame 6 above) on the same kill-set element resulting in a double free.

* This is now fixed by first copying the global variable `si->kill_set_head` into a temporary variable, then
  setting the global variable to NULL, and only then invoking `FREE_KILL_SET` on the temporary copy.

* The `FREE_KILL_SET` macro is now passed an additional parameter which is the global variable to reset.
  The macro resets the passed in global variable to NULL before it does any `free()` calls.
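The "reset the global before freeing" idea in the bullets above can be sketched as follows. This is a minimal illustration and not the actual `FREE_KILL_SET` macro: the `kill_set` layout, the `next_kill_set` field, the `deferred_hook` pointer and every `*_sketch` name are assumptions made for this example, and a plain function stands in for the macro.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real kill_set structure. */
typedef struct kill_set_struct {
    struct kill_set_struct *next_kill_set;
} kill_set;

static kill_set *kill_set_head;     /* stand-in for si->kill_set_head */
static int nodes_freed;             /* bookkeeping for this sketch only */
static void (*deferred_hook)(void); /* models a deferred signal handler firing mid-free */

/* Detach the list from the global first, then free the private copy.
 * Anything that re-enters this function while the loop is running sees
 * a NULL global and so has nothing left to free a second time.
 */
static void free_kill_set_sketch(kill_set **global_head)
{
    kill_set *ks, *next;

    ks = *global_head;
    *global_head = NULL;            /* reset before any free() call */
    for ( ; NULL != ks; ks = next) {
        next = ks->next_kill_set;
        free(ks);
        nodes_freed++;
        if (NULL != deferred_hook)
            deferred_hook();        /* simulated nested cleanup */
    }
}

/* Nested cleanup that targets the same global list. */
static void nested_cleanup_sketch(void)
{
    free_kill_set_sketch(&kill_set_head);
}

/* Builds a two-node list and frees it while a nested cleanup fires.
 * Returns the number of nodes freed, or -1 if allocation failed.
 */
static int demo_nested_cleanup(void)
{
    kill_set *a = malloc(sizeof(*a));
    kill_set *b = malloc(sizeof(*b));

    if ((NULL == a) || (NULL == b)) {
        free(a);
        free(b);
        return -1;
    }
    a->next_kill_set = b;
    b->next_kill_set = NULL;
    kill_set_head = a;
    nodes_freed = 0;
    deferred_hook = nested_cleanup_sketch;
    free_kill_set_sketch(&kill_set_head);
    deferred_hook = NULL;
    return nodes_freed;
}
```

In this model each node is freed exactly once even though the nested cleanup runs after every `free()`, because the nested call finds the global already set to NULL.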
nars1 added a commit that referenced this pull request Aug 7, 2020
…n multi-threaded programs

* The `dualfail_ms/dual_fail_multisite` subtest failed with the following diff.

  ```diff
  > host:dualfail_ms_1_23/dual_fail_multisite/instance2/impjob_imptp1.mje1
  > %YDB-F-ASSERT, Assert failed in sr_unix/util_output.c line 770 for expression (!in_os_signal_handler)
  ```

  The C-stack of the process is as follows.

  ```c
  > #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
  > #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  > #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
  > #3  rts_error_va (csa=0x0, argcnt=7, var=0x7f2ee2b2b7d0) at sr_unix/rts_error.c:192
  > #4  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  > #5  util_out_send_oper (addr=0x7f2ee2b2c020 "%YDB-I-DBFILEXT, Database file dualfail_ms_1_23/dual_fail_multisite/instance2/mumps.dat extended from 0x0000CDDF blocks to 0x0000CE44 at t"..., len=268) at sr_unix/util_output.c:770
  > #6  util_out_print_vaparm (message=0x0, flush=4, var=0x7f2ee2b2c870, faocnt=2147483647) at sr_unix/util_output.c:880
  > #7  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:913
  > #8  send_msg_va (csa=0x898040, arg_count=0, var=0x7f2ee2b2ce30) at sr_unix/send_msg.c:193
  > #9  send_msg_csa (csa=0x898040, arg_count=7) at sr_unix/send_msg.c:86
  > #10 gdsfilext (blocks=52703, filesize=52703, trans_in_prog=1) at sr_unix/gdsfilext.c:578
  > #11 bm_getfree (hint=1, blk_used=0x7f2ee2b2d174, cw_work=5, cs=0x9a1ea0, cw_depth_ptr=0x988310) at sr_port/bm_getfree.c:108
  > #12 op_tcommit () at sr_port/op_tcommit.c:385
  > #13 gvcst_put (val=0x7f2ee2b31e90) at sr_port/gvcst_put.c:422
  > #14 op_gvput (var=0x7f2ee2b31e90) at sr_port/op_gvput.c:79
  > #15 ydb_set_s (varname=0x862bc0, subs_used=2, subsarray=0x862c00, value=0x860a70) at sr_unix/ydb_set_s.c:136
  > #16 ydb_set_st (tptoken=0, errstr=0x0, varname=0x862bc0, subs_used=2, subsarray=0x862c00, value=0x860a70) at sr_unix/ydb_set_st.c:42
  > #17 _cgo_d187034042ca_Cfunc_ydb_set_st (v=0xc0000c39f8) at cgo-gcc-prolog:185
  > #18 runtime.asmcgocall () at /snap/go/6123/src/runtime/asm_amd64.s:655
  > #19 runtime.exitsyscallfast.func1 () at /snap/go/6123/src/runtime/proc.go:3181
  ```

  Notice that `ydb_os_signal_handler()` is not in the C-stack but the assert failed because
  `in_os_signal_handler` is non-zero.

  This is because `in_os_signal_handler` got set to a non-zero value in `ydb_os_signal_handler()` invoked
  in another thread in this multi-threaded Go program. But that thread did not own the YDB engine lock.

  This exposed the issue that we should not be updating the `in_os_signal_handler` global variable while
  not holding the YDB engine lock in case of a multi-threaded program. Or else the thread that owns the
  lock could incorrectly assert fail.

* The fix is to not update the global variable in `ydb_os_signal_handler()` when we don't hold the YDB
  engine lock. Instead, the value is passed as a new parameter to the signal handler functions (`timer_handler()`
  and `generic_signal_handler()`), which set the global variable based on the passed-in parameter
  only after the `FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED` macro has been invoked. The fact that the
  macro did not return implies we hold the YDB engine lock and so it is safe to update the global variable.
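A minimal single-threaded sketch of this parameter-passing approach follows. It is not the YottaDB code: lock ownership is modeled by a plain boolean, `forward_sig_if_needed_sketch()` is a hypothetical stand-in for the `FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED` macro, and restoring the previous flag value on the way out is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>

static bool in_os_signal_handler;      /* stand-in for the real global */
static bool thread_holds_engine_lock;  /* hypothetical: models engine lock ownership */
static int  signals_handled;           /* bookkeeping for this sketch only */
static bool flag_seen_while_handling;  /* value of the global during handling */

/* Hypothetical stand-in for FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED.
 * Returns true when the signal was handed off and the caller must return.
 */
static bool forward_sig_if_needed_sketch(void)
{
    return !thread_holds_engine_lock;
}

static void generic_signal_handler_sketch(int sig, bool is_os_signal_handler)
{
    bool save;

    (void)sig;
    if (forward_sig_if_needed_sketch())
        return;                         /* not the lock owner: global stays untouched */
    /* Reaching this point implies this thread holds the engine lock. */
    save = in_os_signal_handler;
    in_os_signal_handler = is_os_signal_handler;
    flag_seen_while_handling = in_os_signal_handler;
    signals_handled++;                  /* ... real signal handling would go here ... */
    in_os_signal_handler = save;        /* assumption: restore the prior value */
}

/* OS-level entry point: passes the flag down instead of setting the global. */
static void os_signal_handler_sketch(int sig)
{
    generic_signal_handler_sketch(sig, true);
}
```

With this shape, a thread that does not own the lock can enter the OS-level handler without disturbing state that the lock owner may be asserting on.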
nars1 added a commit that referenced this pull request Dec 9, 2020
…llback into account

* In-house testing failed once in hundreds of runs with the following signature in an online rollback.

  ```sh
  $ cat ideminter_rolrec_0_11/interrupted_rollback_or_recover/ROLLBACK2_1.logx
  mupip journal -ROLLBACK -back -noverify -verbose "*"  -online -resync=1029453 -lost=ROLLBACK2_1.lost
  Tue Dec  8 23:33:57 EST 2020
  %YDB-I-MUJNLSTAT, Initial processing started at Tue Dec  8 23:33:57 2020
  %YDB-I-MUJPOOLRNDWNSUC, Jnlpool section (id = 485425229) belonging to the replication instance ideminter_rolrec_0_11/interrupted_rollback_or_recover/mumps.repl successfully rundown
  %YDB-I-ORLBKSTART, ONLINE ROLLBACK started on instance INSTANCE1 corresponding to ideminter_rolrec_0_11/interrupted_rollback_or_recover/mumps.repl
  %YDB-F-ASSERT, Assert failed in sr_unix/wcs_flu.c line 352 for expression (in_mu_rndwn_file)
  ```

* Below was the C-stack trace in the core file produced by the assert failure.

  ```gdb
  #14 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #15 wcs_flu (options=0) at sr_unix/wcs_flu.c:352
  #16 mur_open_files (retry=0) at sr_port/mur_open_files.c:653
  #17 mupip_recover () at sr_port/mupip_recover.c:197
  ```

* The assert that failed was inside `wcs_flu.c`. But the variable `in_mu_rndwn_file` that the assert failed
  on is not set in the case of an online rollback.

  In the case of a MUPIP RECOVER or an offline/standalone MUPIP ROLLBACK, the call to `wcs_flu()` happens
  at line 444 below. Whereas, the call to `wcs_flu()` in the case of an online MUPIP ROLLBACK happens at
  line 653 below. It is the line 653 call that failed the assert inside `wcs_flu()`.

  ```c
  sr_port/mur_open_files.c
  -------------------------
    442                                 if (!jgbl.onlnrlbk)
    443                                 {
    444                                         if (!STANDALONE(rctl->gd))      /* STANDALONE macro calls mu_rndwn_file() */
    445                                         {
      .
    490         if (jgbl.onlnrlbk)
    491         {
      .
    653                         if (!wcs_flu(WCSFLU_NONE))
  ```

* The fix is to enhance the assert to additionally take an online rollback into account.

* This is a Debug-only issue. In Release/PRO builds, the code proceeds to do the right thing.
nars1 added a commit that referenced this pull request Feb 24, 2021
…imer thread has already been terminated

Background
----------
* This commit started out to fix the following test failure.

  The `v53003_1/D9I10002703` subtest failed in in-house testing with the following symptom.

  ```
  > %YDB-F-ASSERT, Assert failed in send_msg.c line 113 for expression ((EXIT_IMMED == exit_state) || in_fake_enospc)
  ```

  And with the below C-stack trace.

  ```c
  (gdb) where
      #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
      #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
      #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
      #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
      #4  rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192
      #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  --> #6  send_msg_va (csa=0x0, arg_count=4, var=...) at sr_unix/send_msg.c:113
      #7  send_msg_csa (csa=0x0, arg_count=4) at sr_unix/send_msg.c:86
  --> #8  gtm_close (fd=7) at sr_unix/gtm_fd_trace.c:181
      #9  ss_destroy_context (lcl_ss_ctx=...) at sr_unix/ss_context_mgr.c:167
      #10 jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:72
      #11 timer_handler (why=14, info=..., context=..., is_os_signal_handler=1) at sr_unix/gt_timers.c:840
      #12 ydb_os_signal_handler (sig=14, info=..., context=...) at sr_unix/ydb_os_signal_handler.c:63
      #13 <signal handler called>
      #14 __GI___close (fd=7) at ../sysdeps/unix/sysv/linux/close.c:27
  --> #15 gtm_close (fd=7) at sr_unix/gtm_fd_trace.c:176
      #16 ss_destroy_context (lcl_ss_ctx=...) at sr_unix/ss_context_mgr.c:167
      #17 ss_create_context (lcl_ss_ctx=..., ss_shmcycle=419) at sr_unix/ss_context_mgr.c:90
      #18 t_end (hist1=..., hist2=0x0, ctn=18446744071629176832) at sr_port/t_end.c:499
      #19 gvcst_put2 (val=..., parms=...) at sr_port/gvcst_put.c:2659
      #20 gvcst_put (val=...) at sr_port/gvcst_put.c:299
      #21 op_gvput (var=...) at sr_port/op_gvput.c:79
  ```

* The assert failed in frame `#6` because we were inside an os signal handler and trying to invoke `syslog()`
  which is not an `async-signal-safe` function.

  The reason is that the original `gtm_close()` call in frame `#15` got interrupted by a
  `SIGALRM` and we did a `gtm_close()` call on the very same file descriptor (`fd=7`) in a nested frame `#8`.
  That nested call returned `EBADF` because the `fd` was already closed in the outer call that got interrupted.

  In Release builds, we would not invoke `gtm_close()` but instead invoke `close()` and not check the
  return status so it is a Debug-build only issue.

* But this failure revealed a more fundamental issue and that is that `jnl_file_close_timer()` which does
  calls to `shmdt()` and `syslog()` should not be invoked while inside an os signal handler. This is because
  those functions are not `async-signal-safe` functions. Therefore `jnl_file_close_timer()` should be
  considered an unsafe timer handler function just like we already handle `wcs_stale()` and `wcs_clean_dbsync()`.

* Note that this function did not have a `shmdt()` invocation until `GT.M V6.3-007` changes were merged into
  the YottaDB master branch. As part of those changes, the function `jnl_file_close_timer()` (whose primary
  purpose was to close file descriptors corresponding to stale journal files) was overloaded to also close
  stale snapshot file descriptors (by a call to the `SS_RELEASE_IF_NEEDED` macro).

  I would have preferred a separate timer handler function for closing stale snapshot file descriptors
  instead of piggybacking it on a function that is journal file related. That is because the latter works
  only on journaled regions whereas the former is needed even for non-journaled regions. But I did not want
  to open a can of worms now so let that be.
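The handling of unsafe timer handlers described in the bullets above can be sketched as follows. This is a simplified single-threaded model and not the code in `sr_unix/gt_timers.c`: all `*_sketch` names, the single-slot deferral, and the `is_known_unsafe_sketch()` check are assumptions made for illustration.

```c
#include <assert.h>
#include <signal.h>
#include <stdbool.h>
#include <stddef.h>

typedef void (*timer_handler_fn)(void);

static volatile sig_atomic_t deferred_timer_pending; /* set in signal context */
static timer_handler_fn deferred_handler;            /* single slot, for brevity */
static int unsafe_runs, safe_runs;                   /* bookkeeping for this sketch */

/* Imagine shmdt()/syslog() calls here: not async-signal-safe. */
static void unsafe_handler_sketch(void) { unsafe_runs++; }

/* Imagine only async-signal-safe work here. */
static void safe_handler_sketch(void) { safe_runs++; }

/* Hypothetical stand-in for a "known unsafe handler" check. */
static bool is_known_unsafe_sketch(timer_handler_fn fn)
{
    return (fn == unsafe_handler_sketch);
}

/* Called when a timer pops. Inside the OS signal handler, unsafe handlers
 * are only recorded; they run later from a safe point in mainline code.
 * A real implementation would also need to protect the handler slot
 * against concurrent updates; this sketch ignores that.
 */
static void timer_pop_sketch(timer_handler_fn fn, bool is_os_signal_handler)
{
    if (is_os_signal_handler && is_known_unsafe_sketch(fn)) {
        deferred_handler = fn;
        deferred_timer_pending = 1;
        return;
    }
    fn();
}

/* Called from mainline code at a point where it is safe to run anything. */
static void check_for_deferred_timers_sketch(void)
{
    if (deferred_timer_pending) {
        deferred_timer_pending = 0;
        timer_pop_sketch(deferred_handler, false);
    }
}
```

In this model a safe handler still runs immediately in signal context, while an unsafe one runs only once mainline code reaches its deferred-timer check.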

Fix
---
* `sr_unix/gt_timers.c` : The function `jnl_file_close_timer()` is now treated as an unsafe timer handler by adding
  it to the `IS_KNOWN_UNSAFE_TIMER_HANDLER` macro.

  Also added a function pointer global variable `jnl_file_close_timer_fptr` like is done for other unsafe
  handlers. Added a comment as to why this is necessary (to avoid executable size bloat in `gtmsecshr`).

  And added `jnl_file_close_timer_fptr` as okay to be added as a timer in `start_timer()` just like we already
  handle `wcs_clean_dbsync_fptr` and `wcs_stale_fptr`. The only values of `intrpt_ok_state` that we have seen
  possible (other than `INTRPT_OK_TO_INTERRUPT`) in in-house testing are `INTRPT_IN_DB_CSH_GETN` and
  `INTRPT_IN_GDS_RUNDOWN` (with the corresponding C-stacks pasted below for the record).

  ```c
  (gdb) where
      #6  start_timer (tid=..., time_to_expir=60000000000, handler=..., hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:451
      #7  jnl_file_open (reg=..., init=0) at sr_unix/jnl_file_open.c:227
      #8  jnl_ensure_open (reg=..., csa=...) at sr_port/jnl_ensure_open.c:71
      #9  wcs_get_space (reg=..., needed=0, cr=...) at sr_unix/wcs_get_space.c:206
  --> #10 db_csh_getn (block=384) at sr_port/db_csh_getn.c:311
      #11 t_qread (blk=384, cycle=..., cr_out=...) at sr_port/t_qread.c:444
      #12 gvcst_search (pKey=..., pHist=0x0) at sr_port/gvcst_search.c:438
      #13 updproc_preread () at sr_port/updhelper_reader.c:378
      #14 updhelper_reader () at sr_port/updhelper_reader.c:139
      #15 mupip_main (argc=4, argv=..., envp=...) at sr_unix/mupip_main.c:122
      #16 dlopen_libyottadb (argc=4, argv=..., envp=..., main_func=... "mupip_main") at sr_unix/dlopen_libyottadb.c:151
      #17 main (argc=4, argv=..., envp=...) at sr_unix/mupip.c:22

  (gdb) where
      #6  start_timer (tid=..., time_to_expir=..., handler=..., hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:439
      #7  jnl_file_open (reg=..., init=0) at sr_unix/jnl_file_open.c:218
      #8  jnl_ensure_open (reg=..., csa=...) at sr_port/jnl_ensure_open.c:71
      #9  wcs_flu (options=519) at sr_unix/wcs_flu.c:419
  --> #10 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:583
      #11 gv_rundown () at sr_port/gv_rundown.c:122
      #12 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:224
      #13 __run_exit_handlers (status=0, listp=..., run_list_atexit=..., run_dtors=...) at exit.c:108
      #14 __GI_exit (status=<optimized out>) at exit.c:139
      #15 gtm_image_exit (status=0) at sr_unix/gtm_image_exit.c:27
      #16 op_zhalt (retcode=0, is_zhalt=0) at sr_port/op_zhalt.c:99
  ```

* `sr_unix/gt_timers_add_safe_hndlrs.c` : The function `jnl_file_close_timer()` is removed from the list of
  known safe handlers.

* Since we now do not invoke `jnl_file_close_timer()` inside the signal handler (due to the above bullets),
  the original assert is no longer possible either (it required `jnl_file_close_timer()` to be invoked
  inside the signal handler to cause the nested `gtm_close()` invocation). So the assert failure is also
  automatically fixed.

* While calling a function that is not `async-signal-safe` from the os signal handler
  is a potential issue (behavior is undefined according to the man pages), it is not clear if this
  translates into a user-visible issue so no gitlab issue is created for this.

* With the changes described in the above bullets, some tests occasionally failed an assert in Debug builds
  and the following error in Release builds.

  ```
  > %YDB-E-SYSCALL, Error received from system call timer_create() -- called from module sr_unix/gt_timers.c at line 600
  ```

  The C-stack at the time of this error was the following.

  ```
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
  #3  rts_error_va () at sr_unix/rts_error.c:192
  #4  rts_error_csa () at sr_unix/rts_error.c:99
  #5  sys_settimer () at sr_unix/gt_timers.c:598
  #6  start_first_timer () at sr_unix/gt_timers.c:683
  #7  start_timer_int () at sr_unix/gt_timers.c:499
  #8  start_timer () at sr_unix/gt_timers.c:461
  #9  try_semop_get_c_stack () at sr_unix/gtm_c_stack_trace_semop.c:62
  #10 ftok_sem_lock () at sr_unix/ftok_sems.c:231
  #11 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:331
  #12 gv_rundown () at sr_port/gv_rundown.c:122
  #13 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:224
  #14 ydb_exit () at sr_unix/ydb_exit.c:150
  #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:201
  #16 __run_exit_handlers () at exit.c:108
  #17 __GI_exit () at exit.c:139
  #18 __libc_start_main () at ../csu/libc-start.c:342
  ```

  The issue was that `CANCEL_TIMERS` (invoked as part of the `LOCK_RUNDOWN_MACRO` macro in `gtm_exit_handler.c`)
  cancels all unsafe timers, which meant it now also canceled the `jnl_file_close_timer()` timer since that
  became an unsafe timer as part of this commit. Since that was the only timer left in the timer chain
  at the time `CANCEL_TIMERS` got invoked, the system timer got stopped too. Later, when `gds_rundown()`
  needed to start a timer, it also needed to restart the system timer. That posed a problem in a
  SimpleThreadAPI environment: we are in exit processing and the MAIN worker thread that handles `SIGALRM`
  has already terminated, so using its dead thread id in the `timer_create()` call resulted in an error.

  This is fixed by changing `sr_unix/gt_timers.c` to additionally check if this is a SimpleThreadAPI environment
  and if `exit_handler_active` is TRUE and if so not use the MAIN worker thread id (which is already stored in
  the `posix_timer_thread_id` global variable) but copy the current thread id into the global variable.
  Turns out this is exactly the issue noticed in another unrelated test failure described by #679. Coincidentally
  the same fix done here was suggested there so that is good.

  Related to this change, a few other files were modified just in case.
  - sr_unix/jnl_file_close_timer.h : If `exit_handler_active` is TRUE, we do not start the `jnl_file_close_timer()`
    as it does not do much and might cause issues during exit processing.
  - sr_unix/ss_context_mgr.c : An assert is added to ensure we never invoke this function (which has a `shmdt()`
    call and is not async-signal-safe) inside an os invoked signal handler.
  - sr_unix/gds_rundown.c : Reordered the steps to `destroy`/`free`/`clear` of `csa->ss_ctx` so the `clear`
    happens first. This way even if we get interrupted by a signal after the `destroy` and `free` step but before
    the `clear` step and proceed to exit the process as part of that interrupt (e.g. if a `SIGTERM` is sent thrice
    to the same process, we immediately proceed to exit even if it is potentially not safe to do so) an invocation
    of `ss_destroy_context()` inside `gds_rundown()` does not try a `shmdt()` of the shmid stored in an already
    freed `csa->ss_ctx` (e.g. the shmid would have a value of `0xdeadbeef` in case `ydb_dbglvl` is set to `0x1F0`).

* Additionally, @estess had the following question (pasted from
  https://gitlab.com/YottaDB/DB/YDB/-/merge_requests/929#note_514699904).

  - This fix is re-starting timers but with the current thread becoming the interrupt thread. That will
    work for C but it may or may not work for Go (if a timer actually pops). This is because once the
    exit handler starts, a Go process can no longer get SIGALRM interrupts. In a Go process, signals are
    fielded by Go and the Go wrapper but after the exit handler runs, the wrapper can no longer call into
    YDB to drive the timer handler because the checks in LIBYOTTADB_RUNTIME_CHECK*() looking at the global
    exit_handler_active.  I'm not sure what the solution is for that or if that's just a limitation we
    have to deal with. Additionally, should we shutdown the timer system at some point like just before
    we either exit or drive the Go panic callback? YottaDB may be returning to Go or other language main
    code instead of exiting and that main code may not be prepared to handle a timer popping.

  To which @nars1 had the following response (pasted from
  https://gitlab.com/YottaDB/DB/YDB/-/merge_requests/929#note_515285207).

  - Regarding your question, I think YottaDB should not start a timer if it has started shutting
    down. The following are the timers that I can think of which can get started during shutdown.
    * `jnl_file_close_timer` : !929 has already fixed it to not be started as part of shutdown processing.
    * `wcs_stale` and `wcs_clean_dbsync` : These are flush timers that are nice to have and not
      necessary and can definitely be skipped if we are exiting. No correctness issues. And like you
      say, it would be better to not have these timers pop long after YottaDB has shut down. So I plan
      on fixing the code to not start these timers if `exit_handler_active` is `TRUE`.
    * `semwt2long_handler` : While trying to grab the ftok lock, we start this timer (in
      `sr_unix/gtm_c_stack_trace_semop.c`) to detect hangs. But we cancel this timer in the same function
      before returning so this won't pop after YottaDB has shut down like the timers in the above
      bullets. It would be nice to have a timer started for this case even if Go is the main program. But
      that is not possible since Go did the `sigaction()` of `SIGALRM` and so will receive control and not
      YottaDB if/when this timer pops. A better solution would be to use `semtimedop()` instead of `semop()`.
      This avoids the need for a timer altogether.

  The fixes are in

  - `sr_port/t_end_sysops.c` : To not start `wcs_stale` as a timer when `exit_handler_active` is `TRUE`.
  - `sr_unix/wcs_clean_dbsync.h` : To not start `wcs_clean_dbsync` as a timer when `exit_handler_active` is `TRUE`.
  - `sr_unix/gtm_c_stack_trace_semop.c` : To use `semtimedop()` instead of `semop()` thereby avoiding the need
    to start a timer if this code is invoked when `exit_handler_active` is TRUE. This also involved removing
    a global variable `TREF(semwait2long)`, the function `semwt2long_handler()`, the macro
    `CANCEL_TIMER_AND_RETURN_SUCCESS` and an unused macro `ISSUE_CRITSEMFAIL_AND_RETURN`. On a related note,
    the macro `MAX_SEM_WAIT_TIME` was removed and a new macro `MAX_SEM_WAIT_TIME_IN_SECONDS` introduced instead.
    The removal of the 4-byte `TREF(semwait2long)` caused some changes in the following clang-tidy warnings
    reference file so those were updated to reflect latest output.
    * ci/tidy_warnings_debug.ref
    * ci/tidy_warnings_release.ref
  - `sr_unix/gt_timers.c` : Added an assert that we never come to `start_timer()` with `exit_handler_active`
    set to `TRUE`.  Further testing revealed various test failures where this newly added assert failed.
    * One such call graph was `gds_rundown()` -> `send_mesg2gtmsecshr()` -> `start_timer()` where the timer
      handler function was `client_timer_handler()`. This timer is now avoided by setting the `SO_RCVTIMEO`
      socket option to a timeout (of `CLIENT_ACK_TIMEOUT_IN_SECONDS` seconds) using a `setsockopt()` call.
      And the `client_timer_handler()` function is now removed.
    * One failure involved `fake_enospc()` being invoked during exit handling. This is now fixed by a new
      `START_TIMER` macro in `sr_unix/fake_enospc.c` to skip the `start_timer()` invocation if
      `exit_handler_active` is `TRUE`.
    * One failure involved `turn_tracing_off()` starting a timer as part of a `db_init()` call, which it can
      do if it needs to open a database file corresponding to the tracing global even though this happens
      as part of exit handling. This is fixed in `sr_unix/gtm_exit_handler.c` by moving the mprof rundown step
      to before the lock rundown step and setting `exit_handler_active` to `TRUE` in between these
      two steps.
    * One failure involved a timer signal getting delivered while we are in `gtm_exit_handler()` after
      `exit_handler_active` was set to `TRUE` but before `CANCEL_TIMERS` macro was invoked and invoking
      `jnl_file_close_timer()` which ended up starting a new timer of itself. This is now fixed by
      checking for `exit_handler_active` and if it is `TRUE` skipping the `start_timer()` invocation.
      Pre-existing usages of the global variable `process_exiting` were fixed to use `exit_handler_active`
      instead as the latter is the more accurate one to use (as this is what is checked by `start_timer()`).
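Two of the fixes above replace a timer with a timeout built into the blocking call itself: `semtimedop()` instead of `semop()`, and the `SO_RCVTIMEO` socket option instead of a timer around a receive. A minimal Linux sketch of both follows. The function names and parameters are assumptions made for this example and are not the YottaDB code; the retry on `EINTR` restarts the full timeout, which is a simplification.

```c
#define _GNU_SOURCE             /* semtimedop() is a Linux/glibc extension */
#include <assert.h>
#include <errno.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Decrement semaphore 0 of `semid`, waiting at most the given time.
 * Returns 0 on success, or -1 with errno set; EAGAIN means the wait timed out.
 * No timer (and therefore no SIGALRM) is involved in detecting the timeout.
 */
static int sem_lock_with_timeout(int semid, time_t wait_sec, long wait_nsec)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    struct timespec timeout = { .tv_sec = wait_sec, .tv_nsec = wait_nsec };
    int rc;

    do {
        rc = semtimedop(semid, &op, 1, &timeout);
    } while ((-1 == rc) && (EINTR == errno));
    return rc;
}

/* Make every later recv()/read() on `sockfd` give up after the given time.
 * A timed-out receive then fails with errno set to EAGAIN or EWOULDBLOCK.
 */
static int set_recv_timeout(int sockfd, time_t wait_sec, suseconds_t wait_usec)
{
    struct timeval tv = { .tv_sec = wait_sec, .tv_usec = wait_usec };

    return setsockopt(sockfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}
```

Because the kernel enforces the timeout in both cases, neither path depends on which thread (or which language runtime) currently receives `SIGALRM`.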

Test
----
* Manually verified with a debugger.

  Terminal 1 : Started a `mupip integ -online -reg "*"` process, set a break point in `ss_initiate()`,
  waited for that function to return and then paused this terminal.

  Terminal 2 : Started a `yottadb -direct` process, set a break point in `jnl_file_close_timer()`, and
  then did an update and verified that the break point was reached but `ydb_os_signal_handler()` (the os
  signal handler) was not in the C-stack but instead a deferred signal handler was (which is what we want).

  ```c
  (gdb) where
  #0  jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:41
  #1  timer_handler (why=0, info=..., context=..., is_os_signal_handler=0) at sr_unix/gt_timers.c:844
  #2  check_for_deferred_timers () at sr_unix/gt_timers.c:1222
  #3  deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #4  eintr_handling_check () at sr_port/eintr_handling_check.c:29
  #5  dm_read (v=0x5555555cc958) at sr_unix/dm_read.c:382
  #6  op_dmode () at sr_port/op_dmode.c:123
  ```

  It is not easy to test this in an automated test case so not spending time coming up with one.

* Ran E_ALL (existing automated tests in YDBTest repo) dozens of times with these changes to ensure there are
  no regressions.
nars1 added a commit that referenced this pull request Mar 15, 2021
…ady started exiting

* As part of a prior commit (SHA 723688c) various functions that started a
  timer (`wcs_clean_dbsync()`, `wcs_stale()` etc.) were fixed to not start one if we have already started
  exit processing.

* One such timer function that should also have been fixed but was left out is `gtmsource_heartbeat_timer()`.
  We had an in-house test failure which failed an assert in `start_timer()` because `gtmsource_heartbeat_timer()`
  was being started while we had already started exit processing. Below is the C-stack of the failure for the record.

  ```c
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va () at sr_unix/rts_error.c:192
  #5  rts_error_csa () at sr_unix/rts_error.c:99
  #6  start_timer () at sr_unix/gt_timers.c:433
  #7  gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:74
  #8  timer_handler () at sr_unix/gt_timers.c:889
  #9  ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63
  #10 <signal handler called>
  #11 __GI___libc_write () at ../sysdeps/unix/sysv/linux/write.c:26
  #12 _IO_new_file_write () at fileops.c:1181
  #13 new_do_write () at libioP.h:948
  #14 _IO_new_file_xsputn () at fileops.c:1255
  #15 _IO_new_file_xsputn () at fileops.c:1197
  #16 __GI__IO_fwrite () at libioP.h:948
  #17 gtm_fwrite () at sr_port/eintr_wrappers.h:334
  #18 gtm_fprintf () at tdio.c:82
  #19 util_out_print_vaparm () at sr_unix/util_output.c:876
  #20 util_out_print () at sr_unix/util_output.c:914
  #21 gtm_putmsg_csa () at sr_unix/gtm_putmsg.c:73
  #22 gds_rundown () at sr_unix/gds_rundown.c:1060
  #23 gv_rundown () at sr_port/gv_rundown.c:122
  #24 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #25 __run_exit_handlers () at exit.c:108
  #26 __GI_exit () at exit.c:139
  #27 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
  #28 util_base_ch () at sr_port/util_base_ch.c:124
  #29 gtmsource_ch () at sr_port/gtmsource_ch.c:96
  #30 gtmsource_readfiles () at sr_unix/gtmsource_readfiles.c:2023
  #31 gtmsource_get_jnlrecs () at sr_unix/gtmsource_process_ops.c:980
  #32 gtmsource_process () at sr_unix/gtmsource_process.c:1546
  #33 gtmsource () at sr_unix/gtmsource.c:525
  #34 mupip_main () at sr_unix/mupip_main.
  #35 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:151
  #36 main () at sr_unix/mupip.c:22
  ```

* This failure is now fixed by checking `exit_handler_active` and if it is `TRUE` we skip starting this timer.
nars1 added a commit that referenced this pull request Mar 18, 2021
…if process has already started exiting

* As part of a prior commit (a37022e) `sr_unix/gtmsource_heartbeat.c` was
  fixed to skip starting a timer if the process has already started exiting.

  Turns out there is one more place in the same file where the timer is started and that needed a similar
  fix but was missed out in the prior commit.

* We had an in-house test failure with the following C-stack that exercised the missed out code path
  (`sr_unix/gtmsource_heartbeat.c` line 75, frame 7 below).

  ```c
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va () at sr_unix/rts_error.c:192
  #5  rts_error_csa () at sr_unix/rts_error.c:99
  #6  start_timer () at sr_unix/gt_timers.c:433
  #7  gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:75
  #8  timer_handler () at sr_unix/gt_timers.c:889
  #9  ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63
  #10 <signal handler called>
  #11 gds_rundown () at sr_unix/gds_rundown.c:249
  #12 gv_rundown () at sr_port/gv_rundown.c:122
  #13 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #14 __run_exit_handlers () at exit.c:108
  #15 __GI_exit () at exit.c:139
  #16 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
  #17 util_base_ch () at sr_port/util_base_ch.c:124
  #18 gtmsource_ch () at sr_port/gtmsource_ch.c:96
  #19 gtmsource_readfiles () at sr_unix/gtmsource_readfiles.c:2023
  #20 gtmsource_get_jnlrecs () at sr_unix/gtmsource_process_ops.c:966
  #21 gtmsource_process () at sr_unix/gtmsource_process.c:1557
  #22 gtmsource () at sr_unix/gtmsource.c:525
  #23 mupip_main () at sr_unix/mupip_main.c:122
  #24 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:151
  #25 main () at sr_unix/mupip.c:22
  ```

* A similar fix is now applied to this code path. A new macro `START_GTMSOURCE_HEARTBEAT_TIMER_IF_NOT_EXITING`
  now implements the fix from the prior commit and is now invoked from both the code paths. This way we avoid
  code duplication.
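A macro of this kind might look roughly like the sketch below. It is an illustration only: the body of the real `START_GTMSOURCE_HEARTBEAT_TIMER_IF_NOT_EXITING` macro is not shown in this commit message, and every `*_sketch` name here (as well as the `timers_started` counter) is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

static bool exit_handler_active;   /* stand-in for the real global */
static int  timers_started;        /* bookkeeping for this sketch only */

/* Stand-in for start_timer(): asserts that exit processing has not begun. */
static void start_timer_sketch(void)
{
    assert(!exit_handler_active);
    timers_started++;
}

/* One macro shared by every call site, so the exit check cannot be
 * forgotten in one of them.
 */
#define START_HEARTBEAT_TIMER_IF_NOT_EXITING_SKETCH()   \
do {                                                    \
    if (!exit_handler_active)                           \
        start_timer_sketch();                           \
} while (0)

/* First call site: initial start of the heartbeat timer. */
static void heartbeat_init_sketch(void)
{
    START_HEARTBEAT_TIMER_IF_NOT_EXITING_SKETCH();
}

/* Second call site: the timer handler re-arming itself. */
static void heartbeat_timer_sketch(void)
{
    /* ... heartbeat work would go here ... */
    START_HEARTBEAT_TIMER_IF_NOT_EXITING_SKETCH();
}
```

Once `exit_handler_active` is set, both call sites silently skip the timer start instead of tripping the assert inside the timer code.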
nars1 added a commit that referenced this pull request Mar 26, 2021
…of frame_pointer global variable

* In-house testing revealed a test failure in the `dual_fail_extend/dual_fail2_mustop_sigquit` subtest
  with the following symptom.

  ```diff
  52a53,248
  > hostname:dual_fail_extend_1_2/dual_fail2_mustop_sigquit/impjob_imptp0.mje3
  > %YDB-F-ASSERT, Assert failed in sr_unix/ydb_exit.c line 123 for expression (NULL != frame_pointer)
  ```

* After some non-trivial analysis, it was found that this test sends a `SIG-15` (aka `SIGTERM`) and it
  is possible that the signal gets handled in `ci_ret_code_quit()` as part of a call to `gtmci_isv_restore()`
  while `frame_pointer` is `NULL`.

  The full C-stack that demonstrates where exit handling kicked in is pasted below for the record. Note that
  it does not include the `ydb_exit()` call that assert failed since `ydb_exit()` is invoked from a different
  thread at a later point (as part of invoking the defer handler in Go which is the last thing that happens
  before the process dies and is handled in the Go side).

  ```gdb
  (gdb) where
  #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
  #3  rts_error_va () at sr_unix/rts_error.c:192
  #4  rts_error_csa () at sr_unix/rts_error.c:99
  #5  signal_exit_handler () at sr_unix/signal_exit_handler.c:46
  #6  generic_signal_handler () at sr_unix/generic_signal_handler.c:500
  #7  ydb_altmain_sighandler () at sr_unix/ydb_altmain_sighandler.c:27
  #8  process_pending_signals () at sr_unix/ydb_sig_dispatch.c:289
  #9  gtm_free () at sr_port/gtm_malloc_src.h:1038
  #10 gtmci_isv_restore () at sr_unix/gtmci_isv.c:62
  #11 ci_ret_code_quit () at sr_unix/ci_ret_code.c:37
  #12 ydb_ci_exec () at sr_unix/gtmci.c:1012
  #13 ydb_cip_helper () at sr_unix/ydb_cip_helper.c:51
  #14 ydb_cip_t () at sr_unix/ydb_cip_t.c:46
  #15 callg_nc () at sr_port/callg_nc.c:67
  #16 ydb_call_variadic_plist_func () at sr_unix/ydb_call_variadic_plist_func.c:24
  #17 _cgo_5e4589acf993_Cfunc_ydb_call_variadic_plist_func () at cgo-gcc-prolog:54
  #18 runtime.asmcgocall () at /snap/go/7221/src/runtime/asm_amd64.s:667
  #19 threaded_api_ydb_engine_unlock () at sr_unix/libyottadb_int.h:1034
  #20 __tsan::Release()
  #21 racecall () at /snap/go/7221/src/runtime/race_amd64.s:413
  #22 ?? () at /snap/go/7221/src/runtime/signal_unix.go:1035
  ```

* If exit handling had been deferred until a few lines later in `ci_ret_code_quit()`, where `frame_pointer`
  gets set to a value from the parent base frame (value stored in the M stack in `msp`), `frame_pointer`
  would have been a non-NULL value.

* This is now implemented by a set of `DEFER_INTERRUPTS`/`ENABLE_INTERRUPTS` macro calls that surround the
  window of code where `frame_pointer` can be temporarily `NULL`. This is possible in the following files
  so all of them were fixed.
  - sr_unix/ci_ret_code.c
  - sr_unix/gtm_trigger.c
  - sr_unix/ojchildparms.c

* For the record, this type of failure was seen once in 100 runs of the test, and every time it was
  in the case where `imptp` was built using the `YDBGo` wrapper. It is not yet clear whether it is possible
  with other wrappers (e.g. `YDBRust`, `SimpleThreadAPI`, `SimpleAPI` etc.). In any case, this issue seems to
  be a Debug-only one: in Release builds `ydb_exit()` will not go into the block that relies on
  a non-NULL value of `frame_pointer`, since `process_exiting` would be `TRUE` after the exit handler code
  has run. Therefore, no user-visible symptom is expected, so no issue was created on GitLab for this
  and this commit is tagged as `[DEBUG-ONLY]`.

* Without the fixes, the test failure was seen 8 times in 800 test runs with the following pertinent
  settings from `settings.csh`. Out of these, the `ydb_imptp_flavor` setting is what matters the most in my
  understanding.

  ```csh
  setenv ydb_imptp_flavor 3
  # Go environment variables
  setenv ydb_go_race_detector_on 1
  setenv GOGC 1
  ```

  In any case, with the fixes, and using the same settings as above, no test failures were seen in 800 test
  runs thus confirming the fixes in this commit work.
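
The deferral described above (surrounding the window where `frame_pointer` is `NULL` with `DEFER_INTERRUPTS`/`ENABLE_INTERRUPTS`) can be sketched in a few lines of C. This is only a single-threaded model under stated assumptions: every name below (`defer_depth`, `exit_pending`, `signal_arrives()` and so on) is a hypothetical stand-in, not the YottaDB macros or globals, and a plain function call stands in for the signal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the YottaDB definitions. */
static int defer_depth;         /* > 0 while exit handling must be deferred */
static int exit_pending;        /* an exit signal arrived while deferred */
static int frame_at_exit = -1;  /* 1 if frame_pointer was non-NULL when exit ran */
static int base_frame;          /* stands in for the parent base frame */
static int *frame_pointer = &base_frame;

void run_exit_handler(void)
{
	frame_at_exit = (NULL != frame_pointer);
}

/* Models a SIGTERM being noticed at a safe point such as a free() call */
void signal_arrives(void)
{
	if (defer_depth > 0)
		exit_pending = 1;       /* remember it and handle it later */
	else
		run_exit_handler();
}

void defer_interrupts(void)
{
	defer_depth++;
}

void enable_interrupts(void)
{
	assert(defer_depth > 0);
	if ((0 == --defer_depth) && exit_pending)
	{
		exit_pending = 0;
		run_exit_handler();
	}
}

/* Unprotected window: returns 0 because the handler saw a NULL frame_pointer */
int unwind_without_deferral(void)
{
	frame_pointer = NULL;
	signal_arrives();
	frame_pointer = &base_frame;
	return frame_at_exit;
}

/* Protected window: returns 1 because handling waits until frame_pointer is restored */
int unwind_with_deferral(void)
{
	defer_interrupts();
	frame_pointer = NULL;           /* window where frame_pointer is NULL */
	signal_arrives();               /* signal lands inside the window */
	frame_pointer = &base_frame;    /* restored from the parent base frame */
	enable_interrupts();            /* pending exit handling runs here */
	return frame_at_exit;
}
```

In this model the unprotected path reports a `NULL` `frame_pointer` at exit time, while the protected path only runs exit handling once the pointer has been restored.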
nars1 added a commit that referenced this pull request Mar 31, 2021
…le descriptor

* In-house testing showed a rare test failure in the `v53003_1/D9I10002703` subtest with the following diff.

  ```diff
  > hostname:v53003_1_10/D9I10002703/child_d002703.mje1
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 183 for expression (FALSE)
  ```

  The C-stack corresponding to the assert failure was the following.

  ```gdb
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=..., argcnt=7, var=...) at sr_unix/rts_error.c:192
  #5  rts_error_csa (csa=..., argcnt=7) at sr_unix/rts_error.c:99
  #6  gtm_close (fd=5) at sr_unix/gtm_fd_trace.c:183
  #7  ss_destroy_context (lcl_ss_ctx=...) at sr_unix/ss_context_mgr.c:176
  #8  jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:74
  #9  timer_handler (why=0, info=..., context=..., is_os_signal_handler=0) at sr_unix/gt_timers.c:889
  #10 check_for_deferred_timers () at sr_unix/gt_timers.c:1267
  #11 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #12 eintr_handling_check () at sr_port/eintr_handling_check.c:29
  #13 ss_destroy_context (lcl_ss_ctx=...) at sr_unix/ss_context_mgr.c:176
  #14 ss_create_context (lcl_ss_ctx=..., ss_shmcycle=339) at sr_unix/ss_context_mgr.c:93
  #15 t_end (hist1=..., hist2=..., ctn=18446744071629176832) at sr_port/t_end.c:499
  #16 gvcst_put2 (val=..., parms=...) at sr_port/gvcst_put.c:2661
  #17 gvcst_put (val=...) at sr_port/gvcst_put.c:299
  #18 op_gvput (var=...) at sr_port/op_gvput.c:79
  ```

* The issue is that the `CLOSEFILE_RESET` macro (in frame 13 above) invoked `eintr_handling_check()`
  AFTER it had invoked the `close()` call on the snapshot shadow file descriptor. But it had not yet
  cleared the file descriptor so the `eintr_handling_check()` eventually invoked `CLOSEFILE_RESET` again
  (in frame 7 above). And when it invoked the `close()` call again on the same file descriptor, it failed
  an assert in `sr_unix/gtm_fd_trace.c` line 183.

* This is now fixed by saving a copy of `lcl_ss_ctx->shdw_fd` into a local variable, setting the global
  variable to `FD_INVALID` and then invoking `CLOSEFILE_RESET` on the local variable. This way, while
  inside the `CLOSEFILE_RESET` macro we are guaranteed a call into `eintr_handling_check()` can never
  call `CLOSEFILE_RESET` macro on `lcl_ss_ctx->shdw_fd` again since it would be `FD_INVALID`.

* This is a Debug build only issue. In a Release build, the return value of `close()` is ignored so
  there will not be any user-visible symptom.
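
The "copy to a local, invalidate the shared field, then close the local" ordering described in the fix above can be sketched as follows. This is a minimal model, not the YottaDB code: `FD_INVALID_SKETCH`, `shdw_fd`, `destroy_context()` and the re-entry hook are hypothetical stand-ins, and a counter stands in for the real `close()` call.

```c
#include <assert.h>

#define FD_INVALID_SKETCH (-1)  /* hypothetical stand-in for FD_INVALID */

static int shdw_fd = 5;         /* models lcl_ss_ctx->shdw_fd */
static int close_calls;         /* how many times a close was attempted */
static int last_closed = FD_INVALID_SKETCH;
static int in_nested;           /* guards the simulated timer re-entry */

void destroy_context(void);

/* Models close() followed by the interrupt check done inside the macro */
void close_then_check_interrupts(int fd)
{
	assert(FD_INVALID_SKETCH != fd);
	close_calls++;                  /* stands in for close(fd) */
	last_closed = fd;
	if (!in_nested)
	{	/* a deferred timer pops here and re-enters the cleanup */
		in_nested = 1;
		destroy_context();
		in_nested = 0;
	}
}

void destroy_context(void)
{
	int save_fd;

	if (FD_INVALID_SKETCH == shdw_fd)
		return;                         /* nothing left to close */
	save_fd = shdw_fd;
	shdw_fd = FD_INVALID_SKETCH;            /* invalidate BEFORE closing */
	close_then_check_interrupts(save_fd);   /* nested call now sees FD_INVALID */
}
```

Because the shared field is already invalid when the nested call runs, the descriptor is closed exactly once in this model.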
nars1 added a commit that referenced this pull request Jan 15, 2022
…ready exiting (fixes random r132/ydb635 subtest failure)

Background
----------
* The `r132/ydb635` subtest (in the YDBTest project) started to fail on a RHEL 7 in-house system
  after merging GT.M V6.3-011.

* The failure symptom was a core file with the following stack trace.

  ```c
  Thread 1 (Thread 0x7f5c261dd740 (LWP 49939)):
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  ch_overrun () at sr_unix/ch_overrun.c:35
  #3  rts_error_va (csa=0x0, argcnt=4, var=0x7ffdc23a7840) at sr_unix/rts_error.c:198
  #4  rts_error_csa (csa=0x0, argcnt=4) at sr_unix/rts_error.c:99
  #5  resetterm (iod=0x1630c40) at sr_unix/resetterm.c:55
  #6  io_rundown (rundown_type=1) at sr_port/io_rundown.c:74
  #7  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:171
  #8  __run_exit_handlers () from /usr/lib64/libc.so.6
  #9  exit () from /usr/lib64/libc.so.6
  #10 gtm_image_exit (status=150373082) at sr_unix/gtm_image_exit.c:27
  #11 util_base_ch (arg=150373082) at sr_port/util_base_ch.c:124
  #12 gtmio_ch (arg=150373082) at sr_unix/gtmio_ch.c:24
  #13 rts_error_va (csa=0x0, argcnt=1, var=0x7ffdc23a8250) at sr_unix/rts_error.c:198
  #14 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #15 iott_readfl (v=0x16774b8, length=32766, nsec_timeout=9223372036854775800) at sr_unix/iott_readfl.c:973
  #16 iott_read (v=0x16774b8, nsec_timeout=9223372036854775800) at sr_unix/iott_read.c:29
  #17 op_read (v=0x16774b8, timeout=0x7f5c2491d1c0 <literal_notimeout>) at sr_port/op_read.c:68
  #18 cli_get_parm (entry=0x7ffdc23b8c90 "WHAT", val_buf=0x7ffdc23b0b90 "") at sr_unix/cli_parse.c:1025
  #19 cli_get_str (entry=0x7f5c2460d784 "WHAT", dst=0x162dcd4 "", max_len=0x162dcd2) at sr_unix/cli.c:285
  #20 mupip_integ () at sr_port/mupip_integ.c:290
  #21 mupip_main (argc=2, argv=0x7ffdc23c4e88, envp=0x7ffdc23c4ea0) at sr_unix/mupip_main.c:122
  #22 dlopen_libyottadb (argc=2, argv=0x7ffdc23c4e88, envp=0x7ffdc23c4ea0, main_func=0x401470 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #23 main (argc=2, argv=0x7ffdc23c4e88, envp=0x7ffdc23c4ea0) at sr_unix/mupip.c:22
  ```

* As can be seen from the gdb output below, we got an IOEOF error in frame 15 and then went to the exit
  handler in frame 7 and, as part of exiting, we encountered an ERR_TCSETATTR error in frame 5.

  ```c
  (gdb) f 15
  #15 iott_readfl (v=0x16774b8, length=32766, nsec_timeout=9223372036854775800) at sr_unix/iott_readfl.c:973
  973                    rts_error_csa(CSA_ARG(NULL) VARLSTCNT(1) ERR_IOEOF);
  (gdb) f 7
  #7  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:171
  171             io_rundown(RUNDOWN_EXCEPT_STD);
  (gdb) f 5
  #5  resetterm (iod=0x1630c40) at sr_unix/resetterm.c:55
  55                  rts_error_csa(CSA_ARG(NULL) VARLSTCNT(4) ERR_TCSETATTR, 1, ttptr->fildes, save_errno);
  ```

Issue
-----
* In frame 5, there was no condition handler to handle the ERR_TCSETATTR error and so we generated a core file.
  This is because we are already exiting due to an error.

Fix
----
* The fix is in `sr_unix/resetterm.c` to check if `exit_handler_active` is TRUE and if so not issue the
  ERR_TCSETATTR error. Reasoning is described in a code comment.

* While at this, I realized that it would be nice to issue a NOPRINCIO error message to the syslog and
  terminate the process in case we already encountered an error while writing to the terminal. Therefore
  added a call to the ISSUE_NOPRINCIO_BEFORE_RTS_ERROR_IF_APPROPRIATE macro that currently exists in
  `sr_unix/iott_use.c`. And moved the macro to `sr_port/io.h` so it can be called from multiple places.

  Also noticed a pre-existing usage in `sr_unix/iott_use.c` where a `TCFLUSH()` call failure could also
  benefit from issuing the NOPRINCIO error message. So added that too.

* With these changes, the test (which kills the terminal in an `expect` session before the `mupip integ`
  process could return back to the shell prompt) now passes reliably. In the syslog, I do see a
  `NOPRINCIO` error message now whereas it did not show up before.
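
The core of the `resetterm.c` change (not raising an error for a failed terminal reset when the process is already in its exit handler) can be sketched as below. This is a simplified illustration, not the actual code: `exit_handler_active_sketch` and `reset_terminal_sketch()` are hypothetical names, and the NOPRINCIO syslog step is left out. Only the POSIX `tcsetattr()` call is real.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <termios.h>

/* Hypothetical stand-in for the exit_handler_active global */
int exit_handler_active_sketch;

/* Restores terminal attributes on fd.
 * Returns 0 on success or when the failure is deliberately ignored,
 * otherwise the errno value that the caller should report.
 */
int reset_terminal_sketch(int fd, const struct termios *saved)
{
	assert(NULL != saved);
	if (0 == tcsetattr(fd, TCSANOW, saved))
		return 0;
	if (exit_handler_active_sketch)
		return 0;       /* already exiting: no condition handler is left to take an error */
	return errno;
}
```

With an invalid descriptor, the call reports a failure during normal operation and stays silent once the exit-handler flag is set.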
nars1 added a commit that referenced this pull request Jan 19, 2022
…ournals get concurrently switched

Background
----------
* This is a long standing issue in the replication source server code that showed up as a very rare
  test failure in in-house testing.

* The `online_bkup_1/online2` subtest failed with the following diff.

  ```diff
  > host:online_bkup_1_21/online2/SRC_02_16_10.log
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtmsource_readfiles.c line 592 for expression ((new_eof_addr >= prev_eof_addr) || (DIVIDE_ROUND_DOWN(prev_eof_addr, REPL_BLKSIZE(rb)) - ((0 == prev_eof_addr % REPL_BLKSIZE(rb)) ? 1 : 0) == DIVIDE_ROUND_DOWN(new_eof_addr, REPL_BLKSIZE(rb))))
  ```

* The source server failed an assert and the resulting core file had the following stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd74815a60) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  update_eof_addr (ctl=0x62d000021040, eof_change=0x7ffd74816040) at sr_unix/gtmsource_readfiles.c:590
  #7  position_read (ctl=0x62d000021040, read_seqno=1003065) at sr_unix/gtmsource_readfiles.c:1379
  #8  read_transaction (ctl=0x62d000021040, buff=0x7ffd74817f40, bufsiz=0x7ffd74817f80, read_jnl_seqno=1003064) at sr_unix/gtmsource_readfiles.c:1094
  #9  read_regions (buff=0x7ffd74817f40, buff_avail=0x7ffd74817f80, attempt_open_oldnew=0, brkn_trans=0x7ffd74817f90, read_jnl_seqno=1003064) at sr_unix/gtmsource_readfiles.c:1910
  #10 read_and_merge (buff=0x7f4e01a5e890 "", maxbufflen=2097144, read_jnl_seqno=1003064) at sr_unix/gtmsource_readfiles.c:1569
  #11 gtmsource_readfiles (buff=0x7f4e01a5e848 "\aH", data_len=0x7ffd74819020, maxbufflen=2097144, read_multiple=1) at sr_unix/gtmsource_readfiles.c:2026
  #12 gtmsource_get_jnlrecs (buff=0x7f4e01a5e848 "\aH", data_len=0x7ffd74819020, maxbufflen=2097144, read_multiple=1) at sr_unix/gtmsource_process_ops.c:966
  #13 gtmsource_process () at sr_unix/gtmsource_process.c:1565
  #14 gtmsource () at sr_unix/gtmsource.c:525
  #15 mupip_main (argc=11, argv=0x7ffd7481ea68, envp=0x7ffd7481eac8) at sr_unix/mupip_main.c:122
  #16 dlopen_libyottadb (argc=11, argv=0x7ffd7481ea68, envp=0x7ffd7481eac8, main_func=0x4ddfe0 <str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #17 main (argc=11, argv=0x7ffd7481ea68, envp=0x7ffd7481eac8) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  update_eof_addr (ctl=0x62d000021040, eof_change=0x7ffd74816040) at sr_unix/gtmsource_readfiles.c:590
  590             assert((new_eof_addr >= prev_eof_addr)

  (gdb) p/x new_eof_addr
  $2 = 0x101a8

  (gdb) p/x prev_eof_addr
  $3 = 0x4cea948

  (gdb) p ctl->eof_addr_final
  $11 = 0

  (gdb) p fc->id
  $8 = {inode = 5769810, device = 65028}

  (gdb) p csa->nl->jnl_file.u
  $9 = {inode = 5770666, device = 65028}

  (gdb) p/x rb->fc.jfh->alignsize
  $14 = 0x4000000
  ```

* The issue is that `new_eof_addr` is less than `prev_eof_addr`. As the inode analysis above shows,
  `new_eof_addr` corresponds to an offset in the latest journal file whereas `prev_eof_addr` corresponds to
  an offset in the previous generation journal file.

Issue
-----
* There turns out to be a longstanding timing issue in the code. We note down `prev_eof_addr` first, then
  check if there has been a concurrent journal file switch (using the `is_gdid_gdid_identical()` call at
  line 557 below) and once we know there has been no journal file switch we note down `new_eof_addr` from
  `jb->dskaddr` in shared memory in line 559.

  **sr_unix/gtmsource_readfiles.c**
  ```c
    555         prev_eof_addr = fc->eof_addr;
    556         *eof_change = 0;
    557         if (is_gdid_gdid_identical(&fc->id, JNL_GDID_PTR(csa)))
    558         {
    559                 new_eof_addr = csa->jnl->jnl_buff->dskaddr;
  ```

* But it is possible a concurrent journal file switch happens AFTER line 557 but BEFORE line 559.
  In that case, `new_eof_addr` would correspond to an offset in a different journal file than `prev_eof_addr`.
  And that can result in the assert failure.

Fix
---
* The `new_eof_addr` computation is moved to BEFORE the `is_gdid_gdid_identical()` function call.
  If the function call returns TRUE indicating the journal file has not switched, we will use the
  already noted down value of `new_eof_addr`. If the function call returns FALSE indicating the journal
  file has switched, we will compute `new_eof_addr` afresh. In this case, the computation of `new_eof_addr`
  BEFORE the `is_gdid_gdid_identical()` function call is wasted work but is not a big issue since journal
  file switches are rare events anyways and so the wasted work happens only in rare cases.
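
The reordering can be illustrated with a small single-threaded model in which a hook simulates the journal switch landing between the two steps. All names here (`struct jnl_shared`, `sample_check_first()`, `sample_read_first()`, `switch_journal()`) are hypothetical, and the sketch ignores memory-ordering concerns that real shared-memory code has to deal with.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the shared journal state; not the YottaDB structures */
struct jnl_shared
{
	uint64_t file_id;       /* identity of the current journal file */
	uint32_t dskaddr;       /* last offset written to disk in that file */
};

typedef void (*hook_t)(struct jnl_shared *);

/* Models a concurrent switch to a new, much shorter journal file */
void switch_journal(struct jnl_shared *js)
{
	js->file_id++;
	js->dskaddr = 0x101a8;
}

/* Old order: check identity first, then sample the offset */
int sample_check_first(struct jnl_shared *js, uint64_t my_id, hook_t between, uint32_t *eof)
{
	assert(NULL != eof);
	if (js->file_id != my_id)
		return 0;
	if (NULL != between)
		between(js);            /* race window: switch happens here */
	*eof = js->dskaddr;             /* may now belong to a different file */
	return 1;
}

/* New order: sample the offset first, then check identity */
int sample_read_first(struct jnl_shared *js, uint64_t my_id, hook_t between, uint32_t *eof)
{
	uint32_t sampled;

	assert(NULL != eof);
	sampled = js->dskaddr;
	if (NULL != between)
		between(js);            /* same race window */
	if (js->file_id != my_id)
		return 0;               /* switch detected: caller computes the offset afresh */
	*eof = sampled;                 /* known to belong to the file we are reading */
	return 1;
}
```

In the old order the model returns an offset from the wrong file when the switch lands in the window. In the new order the same switch is detected and the sampled value is discarded.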

Notes
-----
* The consequences of the assert failure in a Release build (PRO build) are that the source server would
  note down a much smaller offset as the last valid offset of a previous generation journal file and it
  is most likely going to end up with errors if it tries to replicate the journal records that lie after
  the incorrect small offset.

  No such issues have been reported yet by any users. That is not surprising as this requires a journal
  file switch to happen in a very small instruction window.
nars1 added a commit that referenced this pull request Jan 26, 2022
…e specification using ^[..] syntax

Background
----------
* This is an issue identified by fuzz testing.

* Below is a simple example illustrating the failure using a `set` command.

  ```m
  YDB>set ^[$order(@x,1)
  %YDB-F-GTMASSERT2, YottaDB r998 Linux x86_64 - Assert failed sr_port/f_order.c line 121 for expression (DEPTH)
  ```

* Interestingly though, a similar example using the `write` command instead of the `set` command
  works fine in that it correctly issues the EXTGBLDEL error.

  ```m
  YDB>write ^[$order(@x,1)
  %YDB-E-EXTGBLDEL, Invalid delimiter for extended global syntax
          write ^[$order(@x,1)
                              ^-----
  ```

Issue
-----
* The C-stack from the core file is the following.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140639387280448) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140639387280448) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140639387280448, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=9, var=0x7ffdb630a260) at sr_unix/rts_error.c:198
  #7  rts_error (argcnt=9) at sr_unix/rts_error.c:88
  #8  gtm_assert2 (condlen=5, condtext=0x7fe92512a100 "DEPTH", file_name_len=44, file_name=0x7fe925129fe0 "sr_port/f_order.c", line_no=121) at sr_port/gtm_assert2.c:36
  #9  f_order (a=0x7ffdb630aab0, op=OC_FNORDER) at sr_port/f_order.c:121
  #10 expritem (a=0x7ffdb630aab0) at sr_port/expritem.c:637
  #11 expratom (a=0x7ffdb630aab0) at sr_port/expratom.c:29
  #12 expratom_coerce_mval (a=0x7ffdb630aab0) at sr_port/expratom_coerce_mval.c:34
  #13 gvn () at sr_port/gvn.c:70
  #14 m_set () at sr_port/m_set.c:300
  #15 cmd () at sr_port/cmd.c:312
  #16 linetail () at sr_port/linetail.c:35
  #17 line (lnc=0x7ffdb630c9c0) at sr_port/line.c:230
  #18 compiler_startup () at sr_port/compiler_startup.c:183
  #19 compile_source_file (flen=44, faddr=0x7ffdb630d1f0 "x.m", MFtIsReqd=1) at sr_unix/source_file.c:174
  #20 gtm_compile () at sr_unix/gtm_compile.c:113
  #21 init_gtm () at sr_unix/init_gtm.c:183
  #22 gtm_main (argc=2, argv=0x7ffdb6311d68, envp=0x7ffdb6311d80) at sr_unix/gtm_main.c:178
  #23 dlopen_libyottadb (argc=2, argv=0x7ffdb6311d68, envp=0x7ffdb6311d80, main_func=0x56087a968020 "gtm_main") at sr_unix/dlopen_libyottadb.c:151
  #24 main (argc=2, argv=0x7ffdb6311d68, envp=0x7ffdb6311d80) at sr_unix/gtm.c:20

  (gdb) f 9
  #9  f_order (a=0x7ffdb630aab0, op=OC_FNORDER) at sr_port/f_order.c:121
  121    DISABLE_SIDE_EFFECT_AT_DEPTH;    /* doing this here let's us know specifically if direction had SE threat */
  ```

* The failure was because `TREF(expr_depth)` was 0 whereas the `DISABLE_SIDE_EFFECT_AT_DEPTH` macro was
  expecting a non-zero expression depth.

* When we are in `f_order()`, we are guaranteed a non-zero expression depth if we were called from `expr()`.
  But in case we are processing an extended global reference using the `^[...]` syntax, we use
  `expratom_coerce_mval()` instead of `expr()` (at frame 13 in gvn.c, line 70 below).

  **sr_port/gvn.c**
  ```c
        67     if (vbar)
        68             parse_status = expr(sb1++, MUMPS_EXPR);
        69     else
   -->  70             parse_status = expratom_coerce_mval(sb1++);
  ```

  In this case, `TREF(expr_depth)` is not incremented. And so we cannot invoke `DISABLE_SIDE_EFFECT_AT_DEPTH`
  inside `f_order()` deep down in the stack.

Fix
---
* The fix is to enhance the `DISABLE_SIDE_EFFECT_AT_DEPTH` macro to handle the case that `TREF(expr_depth)`
  can be zero in rare cases. In that case, we do not propagate the side effect state one depth down. We
  just ignore the side effect state till now and reset the current state at depth 0 to be FALSE and return.

* This takes care of all callers of the `DISABLE_SIDE_EFFECT_AT_DEPTH` macro that do not go through the
  `DECREMENT_EXPR_DEPTH` macro.

* In the case of the `DECREMENT_EXPR_DEPTH` macro, we do expect `TREF(expr_depth)` to be non-zero even if
  it is called from `expratom_coerce_mval()`. This is because whichever deep function invocation in the
  stack did the `DECREMENT_EXPR_DEPTH` should have previously done a corresponding `INCREMENT_EXPR_DEPTH`.
  Therefore this now has a newly added `assert(TREF(expr_depth));`.
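
The behavior described in the fix can be modeled roughly as below. This is a hedged sketch only: the array `side_effect[]`, the counter `expr_depth` and the three functions are hypothetical stand-ins for the `TREF()`-based macros, whose real definitions are not shown in this commit message.

```c
#include <assert.h>

#define MAX_DEPTH 8     /* hypothetical bound for this sketch */

int expr_depth;                 /* models TREF(expr_depth) */
int side_effect[MAX_DEPTH];     /* models the per-depth side effect state */

/* Models DISABLE_SIDE_EFFECT_AT_DEPTH with the new depth-zero handling */
void disable_side_effect_at_depth(void)
{
	if (0 == expr_depth)
	{	/* no enclosing depth to propagate to: just reset the current state */
		side_effect[0] = 0;
		return;
	}
	side_effect[expr_depth - 1] |= side_effect[expr_depth];
	side_effect[expr_depth] = 0;
}

void increment_expr_depth(void)
{
	assert(expr_depth < MAX_DEPTH - 1);
	expr_depth++;
	side_effect[expr_depth] = 0;
}

/* Models DECREMENT_EXPR_DEPTH: a matching increment must have happened */
void decrement_expr_depth(void)
{
	assert(expr_depth);
	disable_side_effect_at_depth();
	expr_depth--;
}
```

At depth zero the state is simply reset, while a balanced increment/decrement pair still propagates the state one level down as before.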
nars1 added a commit that referenced this pull request Jan 31, 2022
…on a garbage file descriptor

Background
----------
* This is a very rare test failure that was seen only once and on a slow ARM in-house box in
  internal testing.

* The `stress/concurr` subtest failed with the following diff.

  ```diff
  --- concurr/concurr.diff ---
  69a70,181
  > host:REMOTE_SIDE:stress_1/concurr/stress_oli.out
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE)
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE)
  > %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown
  ```

* The assert failure created a core file with the following stack trace.

  ```c
   #6 gtm_close (fd=1626061471) at sr_unix/gtm_fd_trace.c:185
   #7 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192
   #8 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:501
   #9 gv_rundown () at sr_port/gv_rundown.c:122
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #11 __run_exit_handlers (status=150374524, listp=0xffff9c805680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
  #12 __GI_exit (status=<optimized out>) at exit.c:139
  #13 gtm_image_exit (status=150374524) at sr_unix/gtm_image_exit.c:27
  #14 util_base_ch (arg=150374524) at sr_port/util_base_ch.c:124
  #15 mu_int_ch (arg=150374524) at sr_unix/mu_int_ch.c:35
  #16 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192
  #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #18 gtm_close (fd=-559038737) at sr_unix/gtm_fd_trace.c:185
  #19 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192
  #20 jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:74
  #21 timer_handler (why=0, info=0xffff9c65df68 <stapi_signal_handler_oscontext+47048>, context=0xffff9c65dff0 <stapi_signal_handler_oscontext+47184>, is_os_signal_handler=0) at sr_unix/gt_timers.c:889
  #22 check_for_deferred_timers () at sr_unix/gt_timers.c:1267
  #23 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #24 gtm_free (addr=0xaaaaffca1980) at sr_port/gtm_malloc_src.h:1038
  #25 ss_release (ss_ctx=0xaaaaffc78910) at sr_unix/ss_release.c:226
  #26 mupip_integ () at sr_port/mupip_integ.c:801
  #27 mupip_main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip_main.c:122
  #28 dlopen_libyottadb (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980, main_func=0xaaaae7dd6648 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip.c:22
  ```

Issue
-----
* Frame 18 in the stack trace above indicates a `gtm_close()` call happening with an fd of `-559038737`.

* Frame 6 in the stack trace above indicates a `gtm_close()` call happening with an fd of `1626061471`.

* The real issue is in Frame 26 in the stack trace above where we call `ss_release()`. The relevant code
  is pasted below.

  **sr_port/mupip_integ.c**
  ```c
       799     assert(SNAPSHOTS_IN_PROG(csa));
       800     assert(NULL != csa->ss_ctx);
       801     ss_release(&csa->ss_ctx);
       802     CLEAR_SNAPSHOTS_IN_PROG(csa);
  ```

* Line 801 does the `ss_release()` call and Line 802 clears the flag in `csa` that records that a snapshot
  is in progress.

* But `ss_release()` first calls `ss_destroy_context()` and then calls `free()`, so it is possible that a
  timer interrupt gets handled in a deferred fashion right after the `free()` but before the
  `CLEAR_SNAPSHOTS_IN_PROG` macro gets executed. This means we would invoke `ss_destroy_context()` on the
  `csa->ss_ctx` structure again inside the timer, and that would be looking at an already freed context
  structure, which explains why garbage values of `fd` got used in the `gtm_close()` calls.

Fix
---
* The fix is in `sr_port/mupip_integ.c` to clear all context in global variables that indicate a snapshot
  is in progress BEFORE calling `ss_release()`.

* Additionally, the following files were changed since the warning text from `clang-tidy` changed a bit.
  While at it, I also verified that this warning is a false alarm.
  - ci/tidy_warnings_debug.ref
  - ci/tidy_warnings_release.ref
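
The ordering used by the `sr_port/mupip_integ.c` fix (clear the global "snapshot in progress" state first, release the context afterwards) can be sketched as follows. The names `g_ctx`, `in_progress`, `timer_pops()`, `release_ctx()` and `finish_snapshot()` are hypothetical, and the deferred timer is simulated by a direct call right after `free()`.

```c
#include <assert.h>
#include <stdlib.h>

struct ss_ctx_sketch
{
	int shdw_fd;
};

struct ss_ctx_sketch *g_ctx;    /* models csa->ss_ctx */
int in_progress;                /* models SNAPSHOTS_IN_PROG(csa) */
int timer_touched_ctx;          /* counts unsafe accesses by the timer */

int start_snapshot(void)
{
	g_ctx = malloc(sizeof(*g_ctx));
	if (NULL == g_ctx)
		return -1;
	g_ctx->shdw_fd = 5;
	in_progress = 1;
	return 0;
}

/* Models the deferred timer: it cleans up only if a snapshot still looks active */
void timer_pops(void)
{
	if (in_progress)
		timer_touched_ctx++;    /* real code would read the freed context here */
}

/* Models the release step: free the context, after which deferred timers run */
void release_ctx(struct ss_ctx_sketch *ctx)
{
	assert(NULL != ctx);
	free(ctx);
	timer_pops();
}

void finish_snapshot(void)
{
	struct ss_ctx_sketch *ctx = g_ctx;

	in_progress = 0;        /* clear global state BEFORE releasing */
	g_ctx = NULL;
	release_ctx(ctx);
}
```

Since the flag is already clear when the simulated timer runs, it never looks at the freed context in this model.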
nars1 added a commit that referenced this pull request Feb 2, 2022
…imers.c (fixes rare set_jnl/mu_backup_sa_access test failure)

Background
----------
* The `set_jnl/mu_backup_sa_access` subtest failed in one rare test run on an ARM system as follows.

  ```diff
  205a206,367
  > set_jnl_0_4/mu_backup_sa_access/imptp.out
  > %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 561 for expression (FALSE == oldjnlclose_started)
  > %YDB-F-ASSERT, Assert failed in sr_unix/gds_rundown_ch.c line 32 for expression (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state)
  > %YDB-E-GTMSECSHRSTART, Client - 87068 : gtmsecshr failed to startup
  > %YDB-F-ASSERT, Assert failed in sr_unix/ipcrmid.c line 75 for expression (FALSE)
  > %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown
  > %YDB-E-GVRUNDOWN, Error during global database rundown
  ```

Issue
-----
* The primary failure is an assert related to the `oldjnlclose_started` global variable.
  The C-stack of the core file is as follows.

  ```c
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  clear_timers () at sr_unix/gt_timers.c:561
  #7  create_server () at sr_unix/secshr_client.c:470
  #8  send_mesg2gtmsecshr (code=2, id=203915338, path=0x0, path_len=0) at sr_unix/secshr_client.c:296
  #9  sem_rmid (ipcid=203915338) at sr_unix/ipcrmid.c:72
  #10 ftok_sem_release (reg=0xaaaad6035288, decr_cnt=1, immediate=0) at sr_unix/ftok_sems.c:291
  #11 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1050
  #12 gv_rundown () at sr_port/gv_rundown.c:122
  #13 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #14 __run_exit_handlers (status=0, listp=0xffffb36d16b8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
  #15 __GI_exit (status=<optimized out>) at exit.c:139
  #16 gtm_image_exit (status=0) at sr_unix/gtm_image_exit.c:27
  #17 op_zhalt (retcode=0, is_zhalt=0) at sr_port/op_zhalt.c:106
  #18 ?? ()

  (gdb) f 6
  #6  clear_timers () at sr_unix/gt_timers.c:561
  561                     assert(FALSE == oldjnlclose_started);
  ```

* The issue is in frame 13 (`gtm_exit_handler.c`) where we do a `CANCEL_TIMERS` call. We could do this
  while we still have a `jnl_file_close_timer()` timer entry active and waiting to pop. In this case,
  the `CANCEL_TIMERS` call would remove the timer entry but the global variable `oldjnlclose_started` which
  had been set to `TRUE` (when we started the timer in the `START_JNL_FILE_CLOSE_TIMER_IF_NEEDED` macro
  in `sr_unix/jnl_file_close_timer.h`) would not have been reset and so there is an out-of-sync situation
  where the timer queue has no entry (i.e. `timeroot` is NULL) but `oldjnlclose_started` is still TRUE.

Fix
---
* This is fixed by replacing the assert with a statement that sets the global variable to FALSE, which
  is the value the assert expected. A comment has been added to explain why this is done.
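
A rough model of the out-of-sync state and the resync is shown below. It is a sketch under assumptions: the queue, `start_close_timer()`, `cancel_timers()` and the body of `clear_timers()` are hypothetical simplifications, and only the names `timeroot` and `oldjnlclose_started` come from the description above.

```c
#include <assert.h>
#include <stddef.h>

struct timer_entry
{
	struct timer_entry *next;
};

struct timer_entry *timeroot;   /* head of the timer queue */
int oldjnlclose_started;        /* TRUE while a journal close timer is queued */
static struct timer_entry close_timer;

void start_close_timer(void)
{
	if (!oldjnlclose_started)
	{
		close_timer.next = timeroot;
		timeroot = &close_timer;
		oldjnlclose_started = 1;
	}
}

/* Models the cancel step: empties the queue but knows nothing about the flag */
void cancel_timers(void)
{
	timeroot = NULL;
}

void clear_timers(void)
{
	timeroot = NULL;
	/* Previously an assert that the flag was already FALSE. The flag can
	 * legitimately still be TRUE after a cancel, so bring it back in sync.
	 */
	oldjnlclose_started = 0;
	assert(NULL == timeroot);
}
```

After a cancel followed by `clear_timers()`, the flag and the queue agree again, so a later timer start is not suppressed by a stale flag.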
nars1 added a commit that referenced this pull request Jul 21, 2022
Background
----------
* While running the TCK04 bats subtest in the YDBOcto repo using a Debug build of YottaDB
  that was built using `clang` (not `gcc`), I encountered a very rare failure (took hundreds
  of test reruns to reproduce once).

* Below is the stack trace of the core file from the assert using the gdb debugger.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140433852622656) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140433852622656) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140433852622656, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd0ac05210) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1108
  #9  gv_rundown () at sr_port/gv_rundown.c:122
  #10 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #11 signal_exit_handler (exit_handler_name=0x7fb94dc90e5a "deferred_exit_handler", sig=15, info=0x7fb94ddf0ca8 <stapi_signal_handler_oscontext+4424>, context=0x7fb94ddf0d28 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #14 gtm_malloc_main (size=512, stack_level=1) at sr_port/gtm_malloc_src.h:800
  #15 gtm_malloc (size=512) at sr_port/gtm_malloc_src.h:1486
  #16 gvcst_tp_init (greg=0x22b98d8) at sr_port/gvcst_tp_init.c:68
  #17 tp_set_sgm () at sr_port/tp_set_sgm.c:53
  #18 change_reg () at sr_port/change_reg.c:57
  #19 gv_bind_name (addr=0x22b94e0, gvname=0x7ffd0ac06048) at sr_port/gv_bind_name.c:144
  #20 op_gvname_common (count=8, hash_code=112891184, val_arg=0x7fb94e21c978, var=0x7ffd0ac0cdb0) at sr_port/op_gvname.c:117
  #21 op_gvname_fast (count_arg=10, hash_code=112891184, val_arg=0x7fb94e21c978) at sr_port/op_gvname.c:81

  (gdb) f 8
  #8  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1108
  1108              assert(NULL != si->cr_array);

  (gdb) f 16
  #16 gvcst_tp_init (greg=0x22b98d8) at sr_port/gvcst_tp_init.c:68
  68              si->cr_array = (cache_rec_ptr_ptr_t)malloc(SIZEOF(cache_rec_ptr_t) * si->cr_array_size);
  ```

Issue
-----
* `si->cr_array_size` is initialized at line 67 below and `si->cr_array` is initialized at line 68 below.

  **sr_port/gvcst_tp_init.c**
  ```c
     67    si->cr_array_size = si->cur_tp_hist_size;
     68    si->cr_array = (cache_rec_ptr_ptr_t)malloc(SIZEOF(cache_rec_ptr_t) * si->cr_array_size);
  ```

* But the assert in line 1108 below assumes that if `si->cr_array_size` is set, then `si->cr_array` must
  also have been set. This is not right if a signal (say `SIG-15` aka `SIGTERM`) comes in between lines
  67 and 68 above like it did in the above failure.

  **sr_unix/gds_rundown.c**
  ```c
   1100                         if (NULL != si->blks_in_use)
   1101                         {
   1102                                 free_hashtab_int4(si->blks_in_use);
   1103                                 free(si->blks_in_use);
   1104                                 si->blks_in_use = NULL;
   1105                         }
   1106                         if (si->cr_array_size)
   1107                         {
   1108                                 assert(NULL != si->cr_array);
   1109                                 if (NULL != si->cr_array)
   1110                                         free(si->cr_array);
   1111                         }
  ```

Fix
---
* `si->cr_array` is checked directly for whether it is `NULL` or not and only in the latter case do we
  invoke `free(si->cr_array)`. This is no longer based on the value of `si->cr_array_size`. This is more
  in line with how we already handle `si->blks_in_use` in line 1100.

* In effect the assert at line 1108 is now removed.
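
The cleanup rule (decide on the pointer itself, not on the size field that is set one statement earlier) can be sketched as below. `struct sgm_info_sketch` and the three functions are hypothetical stand-ins that reuse only the field names `cr_array` and `cr_array_size` from the excerpts above.

```c
#include <assert.h>
#include <stdlib.h>

struct sgm_info_sketch
{
	size_t cr_array_size;
	void **cr_array;
};

/* Models an initialization interrupted after the size is set but before the allocation */
void init_interrupted(struct sgm_info_sketch *si)
{
	assert(NULL != si);
	si->cr_array_size = 64;
	/* the signal arrives here, so cr_array stays NULL */
}

int init_complete(struct sgm_info_sketch *si)
{
	assert(NULL != si);
	si->cr_array_size = 64;
	si->cr_array = malloc(sizeof(void *) * si->cr_array_size);
	return (NULL != si->cr_array) ? 0 : -1;
}

/* Rundown keyed on the pointer alone; returns 1 if an array was freed */
int rundown(struct sgm_info_sketch *si)
{
	assert(NULL != si);
	if (NULL == si->cr_array)
		return 0;
	free(si->cr_array);
	si->cr_array = NULL;
	return 1;
}
```

The half-initialized case is then harmless: a non-zero size with a `NULL` array results in no `free()` call and no assert.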

Notes
-----
* In Release builds, the `assert` had no effect and so there was no issue as we later did an `if` check
  anyways.
nars1 added a commit that referenced this pull request Jul 21, 2022
Background
----------
* While running the TCK04 bats subtest in the YDBOcto repo using a Debug build of YottaDB
  that was built using `clang` (not `gcc`), I encountered a very rare failure (took hundreds
  of test reruns to reproduce once).

* Although the failure happened only with `clang`, the same issue can happen with `gcc` builds
  of YottaDB too given the right timing of events/signals.

* Below is the stack trace of the core file from the assert using the gdb debugger.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140299547846464) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140299547846464) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140299547846464, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  generic_signal_handler (sig=11, info=0x7f9a08aecca8 <stapi_signal_handler_oscontext+4424>, context=0x7f9a08aecd28 <stapi_signal_handler_oscontext+4552>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:492
  #6  ydb_os_signal_handler (sig=11, info=0x7fff10881b70, context=0x7fff10881a40) at sr_unix/ydb_os_signal_handler.c:85
  #7  <signal handler called>
  #8  cleanup_list (list=0xaf8a40) at sr_port/buddy_list.c:205
  #9  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1098
  #10 gv_rundown () at sr_port/gv_rundown.c:122
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #12 signal_exit_handler (exit_handler_name=0x7f9a0898ce5a "deferred_exit_handler", sig=15, info=0x7f9a08aecca8 <stapi_signal_handler_oscontext+4424>, context=0x7f9a08aecd28 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 gtm_malloc_main (size=520, stack_level=1) at sr_port/gtm_malloc_src.h:800
  #16 gtm_malloc (size=520) at sr_port/gtm_malloc_src.h:1486
  #17 initialize_list (list=0xaf8a40, elemSize=192, initAlloc=64) at sr_port/buddy_list.c:52
  #18 gvcst_tp_init (greg=0xaf1a18) at sr_port/gvcst_tp_init.c:55
  #19 tp_set_sgm () at sr_port/tp_set_sgm.c:53
  #20 change_reg () at sr_port/change_reg.c:57
  #21 gv_bind_name (addr=0xaf1470, gvname=0x7fff10882e98) at sr_port/gv_bind_name.c:144
  #22 op_gvname_common (count=4, hash_code=-1391378772, val_arg=0x7f9a08f1f998, var=0x7fff10889c00) at sr_port/op_gvname.c:117
  #23 op_gvname_fast (count_arg=6, hash_code=-1391378772, val_arg=0x7f9a08f1f998) at sr_port/op_gvname.c:81

  (gdb) f 8
  #8  cleanup_list (list=0xaf8a40) at sr_port/buddy_list.c:205
  205             while(*curr)

  (gdb) f 17
  #17 initialize_list (list=0xaf8a40, elemSize=192, initAlloc=64) at sr_port/buddy_list.c:52
  52              list->ptrArray = (char **)malloc((size_t)SIZEOF(char *) * (MAX_MEM_SIZE_IN_BITS + 2));
  ```

Issue
-----
* A SIG-15/SIGTERM signal interrupted the `initialize_list()` call in frame 17. In frame 18, we were
  trying to initialize `si->tlvl_cw_set_list` as the below line of code indicates.

  **sr_port/gvcst_tp_init.c**
  ```c
     55   initialize_list(si->tlvl_cw_set_list, SIZEOF(cw_set_element), TLVL_CW_SET_LIST_INIT_ALLOC);
  ```

* The signal caused us to proceed to exit handling and as part of that we tried to cleanup the
  incompletely set up structure `si->tlvl_cw_set_list` at line 1098 below.

  **sr_unix/gds_rundown.c**
  ```c
   1082                 if (csa->sgm_info_ptr)
   1083                 {
   1084                         si = csa->sgm_info_ptr;
   1085                         /* It is possible we got interrupted before initializing all fields of "si"
   1086                          * completely so account for NULL values while freeing/releasing those fields.
   1087                          */
   1088                         assert((si->tp_csa == csa) || (NULL == si->tp_csa));
   1089                         if (si->jnl_tail)
   1090                         {
   1091                                 PROBE_FREEUP_BUDDY_LIST(si->format_buff_list);
   1092                                 PROBE_FREEUP_BUDDY_LIST(si->jnl_list);
   1093                                 FREE_JBUF_RSRV_STRUCT(si->jbuf_rsrv_ptr);
   1094                         }
   1095                         PROBE_FREEUP_BUDDY_LIST(si->recompute_list);
   1096                         PROBE_FREEUP_BUDDY_LIST(si->new_buff_list);
   1097                         PROBE_FREEUP_BUDDY_LIST(si->tlvl_info_list);
   1098                         PROBE_FREEUP_BUDDY_LIST(si->tlvl_cw_set_list);
   1099                         PROBE_FREEUP_BUDDY_LIST(si->cw_set_list);
  ```

* And that caused the SIG-11.

Fix
---
* A lot of the above cleanup in `sr_unix/gds_rundown.c` happens only if `csa->sgm_info_ptr` is non-NULL.

* But this field gets set to a non-NULL value at the very start of `sr_port/gvcst_tp_init.c` before
  a lot of the individual fields (like `si->tlvl_cw_set_list` etc.) get initialized.

* Therefore, the fix is to set `csa->sgm_info_ptr` to a non-NULL value `AFTER` all the initialization
  of the individual members in that structure has happened.
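
* The ordering change can be sketched as follows. All `sketch_` names are hypothetical, heavily reduced stand-ins for the real structures, and the array size of 8 is arbitrary. The point is only that the parent pointer is assigned last, so rundown code that keys off that pointer never sees a half-built structure.

  ```c
  #include <stddef.h>
  #include <stdlib.h>

  /* Hypothetical stand-ins, not the real YottaDB types. */
  typedef struct { char **ptrArray; } sketch_buddy_list;
  typedef struct { sketch_buddy_list *tlvl_cw_set_list; } sketch_sgm_info;
  typedef struct { sketch_sgm_info *sgm_info_ptr; } sketch_csa;

  /* Builds the structure fully, then publishes it. Returns 0 on success, -1 on failure. */
  static int sketch_tp_init(sketch_csa *csa)
  {
      sketch_sgm_info *si;

      si = calloc(1, sizeof(*si));
      if (NULL == si)
          return -1;
      si->tlvl_cw_set_list = calloc(1, sizeof(*si->tlvl_cw_set_list));
      if (NULL == si->tlvl_cw_set_list)
      {
          free(si);
          return -1;
      }
      si->tlvl_cw_set_list->ptrArray = calloc(8, sizeof(char *));
      if (NULL == si->tlvl_cw_set_list->ptrArray)
      {
          free(si->tlvl_cw_set_list);
          free(si);
          return -1;
      }
      /* Publish only after every member is initialized. An exit handler that runs
       * before this point sees a NULL sgm_info_ptr and skips the cleanup.
       */
      csa->sgm_info_ptr = si;
      return 0;
  }

  /* Rundown that relies on "published implies fully initialized". */
  static void sketch_rundown(sketch_csa *csa)
  {
      sketch_sgm_info *si = csa->sgm_info_ptr;

      if (NULL == si)
          return;
      free(si->tlvl_cw_set_list->ptrArray);
      free(si->tlvl_cw_set_list);
      free(si);
      csa->sgm_info_ptr = NULL;
  }
  ```

  In this sketch an interruption before the final assignment leaks the partially built structure instead of crashing, which seems an acceptable trade-off at process exit.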

Notes
-----
* Even though the user-visible symptom is a SIG-11, this issue is considered too rare for a user to
  encounter, so a separate issue is not created for this fix.
nars1 added a commit that referenced this pull request Aug 9, 2022
… .m file is attempted

Background
----------
* Below is a simple test case obtained from a fuzz test failure in in-house testing.

  ```m
  $ cat test.m
   set fn="generated.m"
   open fn:new
   use fn
   write " z"
   Set $ZROUTINES=""
   zlink "generated.m"

  $ $ydb_dist/yottadb -run test
  %YDB-F-KILLBYSIGSINFO1, YottaDB process 55439 has been killed by a signal 11 at address 0x00007F4F4F82EED7 (vaddr 0x0000000000000008)
  %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
  Segmentation fault (core dumped)
  ```

* This is a failure in both Release and Debug builds of YottaDB as well as the upstream GT.M.

Issue
-----
* Below is the stack trace from the core file.

  ```c
  (gdb) where
  #0  ins_errtriple (in_error=150373618) at sr_port/ins_errtriple.c:51
  #1  stx_error_va (in_error=150373618, args=0x7f6559aa53c0) at sr_port/stx_error.c:164
  #2  rts_error_va (csa=0x0, argcnt=1, var=0x7f6559aa54a0) at sr_unix/rts_error.c:179
  #3  rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #4  iorm_wteol (x=1, iod=0x62d000004840) at sr_unix/iorm_wteol.c:87
  #5  iorm_cond_wteol (iod=0x62d000004840) at sr_unix/iorm_flush.c:42
  #6  iorm_close (iod=0x62d000004840, pp=0x7f6559aa63b0) at sr_unix/iorm_close.c:112
  #7  io_dev_close (d=0x62d000005ec0) at sr_port/io_rundown.c:102
  #8  io_rundown (rundown_type=0) at sr_port/io_rundown.c:60
  #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:239
  #10 signal_exit_handler (exit_handler_name=0x7f6555366520 "generic_signal_handler", sig=11, info=0x7f6555881948 <stapi_signal_handler_oscontext+4424>, context=0x7f65558819c8 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #11 generic_signal_handler (sig=11, info=0x7f6555881948 <stapi_signal_handler_oscontext+4424>, context=0x7f65558819c8 <stapi_signal_handler_oscontext+4552>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:500
  #12 ydb_os_signal_handler (sig=11, info=0x7f6559aa6bf0, context=0x7f6559aa6ac0) at sr_unix/ydb_os_signal_handler.c:85
  #13 <signal handler called>
  #14 ins_errtriple (in_error=150373618) at sr_port/ins_errtriple.c:51
  #15 stx_error_va (in_error=150373618, args=0x7ffe77c31f90) at sr_port/stx_error.c:164
  #16 rts_error_va (csa=0x0, argcnt=1, var=0x7ffe77c32070) at sr_unix/rts_error.c:179
  #17 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #18 iorm_wteol (x=1, iod=0x62d000004840) at sr_unix/iorm_wteol.c:87
  #19 iorm_readfl (v=0x7ffe77c33bb0, width=32767, nsec_timeout=<optimized out>) at sr_unix/iorm_readfl.c:229
  #20 op_readfl (v=0x7ffe77c33bb0, length=32767, timeout=0x7f65555111a0 <literal_notimeout>) at sr_port/op_readfl.c:80
  #21 read_source_file () at sr_unix/source_file.c:290
  #22 compiler_startup () at sr_port/compiler_startup.c:159
  #23 zlcompile (len=11 '\v', addr=0x7ffe77c34820 "generated.m") at sr_port/zlcompile.c:45
  #24 op_zlink (v=0x62d0000062e0, quals=0x7f6555fbe6c0) at sr_unix/op_zlink.c:496
  ```

* The SIG-11 happened because we were trying to access `TREF(pos_in_chain)` to get the last triple
  before we started parsing the current line.

  **sr_port/ins_errtriple.c**
  ```c
    49   x = (TREF(pos_in_chain)).exorder.bl;
    50   /* If first error in the current line/cmd, delete all triples and replace them with an OC_RTERROR triple. */
    51   add_rterror_triple = (OC_RTERROR != x->exorder.fl->opcode);
  ```

  But it turns out we are issuing an error even before we started parsing the first line in the M program.
  This is because the `iorm_wteol()` call, made while trying to read from the M source file as part of the
  ZLINK, tried to write an EOL to the source M program. It could not, because the source is opened read-only,
  and so it issued an `ERR_DEVICEREADONLY` error.

  And because of this, the contents of `TREF(pos_in_chain)` are not appropriately initialized and so are not
  reliable (they will contain triples left over from the previous compile and can point to freed memory
  or NULL pointers resulting in SIG-11).

Fix
---
* The first fix is to initialize `TREF(pos_in_chain)` to `*TREF(curtchain)` in `sr_port/tripinit.c` right
  after `TREF(curtchain)` is initialized.

  This way any errors in compilation will result in `ins_errtriple()` referencing an initialized
  `TREF(pos_in_chain)`.

* The second fix is in `sr_port/ins_errtriple.c` where we should now account for the possibility that
  `TREF(pos_in_chain).exorder.bl` could be `NULL`. In that case, we should add an `OC_RTERROR` triple
  just like we would if we find that the start of the current M line already has triples and the first
  triple in that chain is not already an `OC_RTERROR` triple. So the change is to set the `add_rterror_triple`
  variable to TRUE in case we find `TREF(pos_in_chain).exorder.bl` is NULL.

* With just the above two fixes, I noticed the simple test case presented above no longer failing with a
  SIG-11. But it still had some extraneous output.

  ```sh
  $ $ydb_dist/yottadb -run test40

                                     ^-----
                  At column 28, line 1, source module generated.m
  %YDB-E-DEVICEREADONLY, Cannot write to read-only device
  ```

  I expected only the `%YDB-E-DEVICEREADONLY` error line, not the 3 lines before it, which are syntax
  highlighting for a non-existent M source line.

  Turns out this is an issue in `sr_port/show_source_line.c` where we issue a sequence of `ERR_SRCLIN`,
  `ERR_SRCLNNTDSP` and `ERR_SRCLOC` messages to take care of the syntax highlighting even if there is
  no M source code to highlight.

  This is now fixed by checking `line_chwidth` and only if it is greater than 0 do we issue those messages.
  Otherwise we skip those messages.

  With that change, the revised output is as follows. This looks a lot cleaner to me.

  ```sh
  $ $ydb_dist/yottadb -run test40
  %YDB-E-DEVICEREADONLY, Cannot write to read-only device
  ```
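
* The NULL check described in the second fix can be sketched as below. The `sketch_` names and opcode values are hypothetical, and the nested `exorder` member of the real triple is flattened into plain `fl`/`bl` links for brevity.

  ```c
  #include <stddef.h>

  /* Hypothetical opcode values, for illustration only. */
  #define SKETCH_OC_RTERROR   1
  #define SKETCH_OC_OTHER     2

  /* Hypothetical, flattened stand-in for a triple in the execution-order chain. */
  typedef struct sketch_triple
  {
      int                     opcode;
      struct sketch_triple    *fl;    /* forward link */
      struct sketch_triple    *bl;    /* backward link */
  } sketch_triple;

  /* Returns 1 if an error triple needs to be added for the current line. */
  static int sketch_need_rterror(const sketch_triple *pos_in_chain)
  {
      const sketch_triple *x = pos_in_chain->bl;

      /* No earlier triple (error raised before any line was parsed): add one. */
      if (NULL == x)
          return 1;
      /* Otherwise add one only if the line does not already start with an error triple. */
      return (SKETCH_OC_RTERROR != x->fl->opcode);
  }
  ```

  The first branch is what keeps the dereference of `x->fl` from running when the error is raised before any M source line has been parsed.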
nars1 added a commit that referenced this pull request Aug 26, 2022
…rence using [] syntax

Background
----------
* This is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/860#note_1079650087.

* This comment is to track a longstanding issue identified by ongoing fuzz testing.
  This is an issue present even in the upstream GT.M versions.

* Below is a simple test case demonstrating the issue.

  **Release build**
  ```m
  YDB>lock +[(0!^|"x"|a)]x
  %YDB-F-KILLBYSIGSINFO1, YottaDB process 31691 has been killed by a signal 11 at address 0x00007FEB3DDC7E15 (vaddr 0x0000000000000008)
  %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
  ```

  **Debug build**
  ```m
  YDB>lock +[(0!^|"x"|a)]x
  %YDB-F-ASSERT, Assert failed in sr_port/gvn.c line 188 for expression (NULL != TREF(expr_start))
  ```

Issue
-----
* Below is the stack trace from the assert failure

  ```c
  (gdb) where
  .
  .
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  gvn () at sr_port/gvn.c:188
  #9  glvn (a=0x7ffde466d940) at sr_port/glvn.c:38
  #10 expratom (a=0x7ffde466d940) at sr_port/expratom.c:27
  #11 eval_expr (a=0x7ffde466dc00) at sr_port/eval_expr.c:248
  #12 expritem (a=0x7ffde466dc00) at sr_port/expritem.c:551
  #13 expratom (a=0x7ffde466dc00) at sr_port/expratom.c:29
  #14 expratom_coerce_mval (a=0x7ffde466dc00) at sr_port/expratom_coerce_mval.c:34
  #15 lkglvn (gblvn=0) at sr_port/lkglvn.c:63
  #16 nref () at sr_port/nref.c:40
  #17 m_lock () at sr_port/m_lock.c:93
  #18 cmd () at sr_port/cmd.c:312
  #19 linetail () at sr_port/linetail.c:35
  #20 op_commarg (v=0x5603a5cfe598, argcode=19 '\023') at sr_port/op_commarg.c:84
  #21 op_dmode () at sr_port/op_dmode.c:159

  (gdb) f 8
  #8  gvn () at sr_port/gvn.c:188
  188                             assert(NULL != TREF(expr_start));
  ```

* The issue is that in frame number 8, we saw `TREF(shift_side_effects)` to be TRUE at line 55.

  **sr_port/gvn.c**
  ```c
     55         if (shifting = (TREF(shift_side_effects) && (!TREF(saw_side_effect) || (YDB_BOOL == TREF(ydb_fullbool)
  ```

* This caused the `shifting` variable to be set to TRUE.

* And at the end of that function, we had to insert an `OC_GVSAVTARG` triple but found that `TREF(expr_start)`
  was NULL.

* The issue is that `TREF(expr_start)` and `TREF(shift_side_effects)` were out of sync.

* If `TREF(shift_side_effects)` was non-zero, then `TREF(expr_start)` should also have been non-NULL.

* `TREF(shift_side_effects)` was set to 1 by frame number 11 in the below line.

  **sr_port/eval_expr.c**
  ```c
    104                                 TREF(shift_side_effects) = TRUE;
  ```

* And `TREF(expr_start)` was also set to a non-NULL value around then.

  **sr_port/eval_expr.c**
  ```c
     95                                 TREF(expr_start) = TREF(expr_start_orig) = ref;
  ```

* But the issue was that frame 8 `gvn()` invoked `expr()`.

  **sr_port/gvn.c**
  ```c
     69                         parse_status = expr(sb1++, MUMPS_EXPR);
  ```

  And that in turn did the following.

  **sr_port/expr.c**
  ```c
     29         INCREMENT_EXPR_DEPTH;
  ```

  And this macro found `TREF(expr_depth)` set to 0 and therefore cleared `TREF(expr_start)`.

  **sr_port/compiler.h**
  ```c
    420 #define INCREMENT_EXPR_DEPTH
    424         if (!(TREF(expr_depth))++)
    425                 TREF(expr_start) = TREF(expr_start_orig) = NULL;
  ```

* Therefore, `TREF(expr_start)` was non-NULL when we entered frame 8 `gvn()` but was NULL
  towards the end of that function and that is the issue.

* The real issue is that `TREF(expr_depth)` was 0 even though we were already evaluating a boolean
  expression (and doing shifting operations for global references).

* And the cause of this is that there are 3 callers of `eval_expr()`.
  - sr_port/bool_expr.c
  - sr_port/expr.c
  - sr_port/expritem.c

* The first 2 of the above callers do an `INCREMENT_EXPR_DEPTH` before calling `eval_expr()`.

* But the 3rd caller does not. And that is where the issue lies.

* It is not clear to me why this inconsistency was there all this while. I suspect it is an oversight
  instead of being intentional.

Fix
---
* The fix is very simple and that is to call `INCREMENT_EXPR_DEPTH` (and `DECREMENT_EXPR_DEPTH`) in
  the 3rd caller `sr_port/expritem.c` before calling `eval_expr()`. This ensures `TREF(expr_depth)`
  stays a non-zero value in case `TREF(expr_start)` gets set to a non-NULL value inside `eval_expr()`.

* Additionally, I also added an assert in `sr_port/eval_expr.c` that if ever we set `TREF(expr_start)`
  to a non-NULL value, the `TREF(expr_depth)` global variable better be greater than 0.
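
* The interaction between the depth counter and `TREF(expr_start)` can be modeled with the small sketch below. Every `sketch_` name is hypothetical and the parser is reduced to two globals. It only illustrates why a caller that skips the increment loses the start pointer when a nested expression is parsed.

  ```c
  #include <assert.h>
  #include <stddef.h>

  static int          sketch_expr_depth;  /* models TREF(expr_depth) */
  static const char   *sketch_expr_start; /* models TREF(expr_start) */

  /* Models INCREMENT_EXPR_DEPTH: a 0 -> 1 transition resets the start pointer. */
  static void sketch_increment_depth(void)
  {
      if (!sketch_expr_depth++)
          sketch_expr_start = NULL;
  }

  /* Models DECREMENT_EXPR_DEPTH. */
  static void sketch_decrement_depth(void)
  {
      assert(0 < sketch_expr_depth);
      sketch_expr_depth--;
  }

  /* Models the nested expr() call made while parsing a global reference. */
  static void sketch_nested_expr(void)
  {
      sketch_increment_depth();
      sketch_decrement_depth();
  }

  /* Models eval_expr(): sets the start pointer, then parses a nested expression.
   * Returns 1 if the start pointer survived the nested parse.
   */
  static int sketch_eval_expr(void)
  {
      sketch_expr_start = "anchor";
      sketch_nested_expr();
      return (NULL != sketch_expr_start);
  }

  /* Models a caller of eval_expr() that either does or does not bump the depth. */
  static int sketch_caller(int bump_depth)
  {
      int survived;

      if (bump_depth)
          sketch_increment_depth();
      survived = sketch_eval_expr();
      if (bump_depth)
          sketch_decrement_depth();
      return survived;
  }
  ```

  In this model `sketch_caller(0)` corresponds to the old `sr_port/expritem.c` behavior and returns 0 (start pointer lost), while `sketch_caller(1)` corresponds to the fixed behavior and returns 1.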
nars1 added a commit that referenced this pull request Nov 15, 2022
…onds while waiting for WIP queue to clear

Background
----------
* This is an internal test failure that happened once. One of the symptoms of the failure was the following
  assert.

  ```diff
  > v53003_0_4/D9I10002706/bkgrnd_d002706.mje1
  > %YDB-F-ASSERT, Assert failed in sr_unix/sleep.c line 28 for expression ((8 == SIZEOF(useconds)) || ((MICROSECS_IN_SEC > useconds) && (0 < useconds)))
  ```

* The gdb analysis of the resulting core file showed the following.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140004321584192) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140004321584192) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140004321584192, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd4c844df0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  m_usleep (useconds=0) at sr_unix/sleep.c:28
  #9  wcs_sleep (sleepfactor=0) at sr_port/wcs_sleep.c:28
  #10 wait_for_wip_queue_to_clear (cnl=0x7f55457cb000, crwipq=0x7f5545a5b000, cr=0x7f5545abe6a8, reg=0x61d0000015b0) at sr_unix/wcs_wt.h:122
  #11 wcs_get_space (reg=0x61d0000015b0, needed=0, cr=0x7f5545abe6a8) at sr_unix/wcs_get_space.c:211
  #12 bt_put (reg=0x61d0000015b0, block=1833) at sr_port/bt_put.c:78
  #13 bg_update_phase1 (cs=0x7f55483e7680 <cw_set+192>, ctn=140737488387042, si=0x0) at sr_port/t_end_sysops.c:471
  #14 t_end (hist1=0x61b000194500, hist2=0x6160000ad480, ctn=18446744071629176832) at sr_port/t_end.c:1664
  #15 gvcst_kill2 (do_subtree=1, span_status=0x0, killing_chunks=0) at sr_port/gvcst_kill.c:781
  #16 gvcst_kill (do_subtree=1) at sr_port/gvcst_kill.c:149
  #17 op_gvkill () at sr_port/op_gvkill.c:83

  (gdb) f 8
  #8  0x00007f55466a5222 in m_usleep (useconds=0) at sr_unix/sleep.c:28
  28              SLEEP_USEC(useconds, TRUE);

  (gdb) p useconds
  $4 = 0

  (gdb) up
  #9  0x00007f5547016fe2 in wcs_sleep (sleepfactor=0) at sr_port/wcs_sleep.c:28
  28              SHORT_SLEEP(slpfctr);

  (gdb) p slpfctr
  $5 = 0

  (gdb) up
  #10 0x00007f5547624a07 in wait_for_wip_queue_to_clear (cnl=0x7f55457cb000, crwipq=0x7f5545a5b000, cr=0x7f5545abe6a8, reg=0x61d0000015b0) at sr_unix/wcs_wt.h:122
  122                     wcs_sleep(lcnt);

  (gdb) p lcnt
  $6 = 0
  ```

Issue
-----
* `wcs_sleep()` is not designed to be invoked for a `0` milli-second sleep.

Fix
---
* `lcnt` is reset to `1` in case it becomes `0` after the modulo operation (`%`).
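
* A minimal sketch of that kind of wraparound guard. The helper name and the modulus `SKETCH_MAX_SLEEP_FACTOR` are hypothetical, and the actual expression in `sr_unix/wcs_wt.h` may differ.

  ```c
  /* Hypothetical modulus, the real code uses its own limit. */
  #define SKETCH_MAX_SLEEP_FACTOR 8

  /* Advances the sleep factor, wrapping around but never yielding 0. */
  static int sketch_next_lcnt(int lcnt)
  {
      lcnt = (lcnt + 1) % SKETCH_MAX_SLEEP_FACTOR;
      if (0 == lcnt)
          lcnt = 1;   /* a 0 factor would request a zero-length sleep */
      return lcnt;
  }
  ```

  With this guard, the value passed on to the sleep routine stays in the range 1 to `SKETCH_MAX_SLEEP_FACTOR - 1`.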

Notes
-----
* Interestingly, this fix is seen in GT.M V7.0-001 in the upstream project so we will eventually get this fix.
nars1 added a commit that referenced this pull request Feb 8, 2023
…ut in dee9d0c)

Background
----------
* The `mem_stress_1/memleak` subtest failed in one rare test run on a slow in-house system with
  various core files. Below is an analysis of the first core file using gdb.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffed66c2ec0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  lvzwr_out_targkey (one=0x7ffed66c30c0) at sr_port/lvzwr_out.c:96
  #7  lvzwr_out (lvp=0x103fb48) at sr_port/lvzwr_out.c:286
  #8  lvzwr_var (lv=0x103fb48, n=3) at sr_port/lvzwr_var.c:233
  #9  lvzwr_var (lv=0x103faf0, n=2) at sr_port/lvzwr_var.c:312
  #10 lvzwr_var (lv=0x103fa98, n=1) at sr_port/lvzwr_var.c:312
  #11 lvzwr_var (lv=0x10c95d0, n=0) at sr_port/lvzwr_var.c:309
  #12 lvzwr_fini (out=0x7ffed66ce590, t=1) at sr_port/lvzwr_fini.c:83
  #13 op_lvpatwrite (count=0, arg1=140732495881408) at sr_port/op_lvpatwrite.c:85
  #14 zshow_zwrite (output=0x7ffed66ce590) at sr_port/zshow_zwrite.c:40
  #15 op_zshow (func=0x7ffed66d66c0, type=1, lvn=0x0) at sr_port/op_zshow.c:220
  #16 jobexam_dump (dump_filename_arg=0x7f28bdf51b60, dump_file_spec=0x10323b8, fatal_file_name_buff=0x7ffed66d7210 "", zshowcodes=0x7f28bdf51b60, dev_in_use=0x7ffed66d67a0) at sr_port/jobexam_process.c:237
  #17 jobexam_process (dump_file_name=0x7f28bdf51b60, zshowcodes=0x7f28bdf51b60, dump_file_spec=0x10323b8) at sr_port/jobexam_process.c:147
  #18 op_fnzjobexam (prelimSpec=0x7f28bdf51b60, zshowcodes=0x7f28bdf51b60, finalSpec=0x10323b8) at sr_port/op_fnzjobexam.c:22

  (gdb) f 6
  #6  lvzwr_out_targkey (one=0x7ffed66c30c0) at sr_port/lvzwr_out.c:96
  96              assert(MAX_STRLEN       /* WARNING assignment below; check in op_putindx should assure this */
  97                      >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len));

  (gdb) p gtm_threadgbl_true->util_outbuff
  $1 = "%YDB-F-ASSERT, Assert failed in sr_port/lvzwr_out.c line 97 for expression (MAX_STRLEN >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len))", '\000' <repeats 5946 times>

  (gdb) p length
  $2 = 1048577

  (gdb) p ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len
  $4 = 1048576
  ```

* Based on this, I was able to come up with a simple test case that demonstrates the same issue.

  ```m
  YDB>set x(1,$justify(2,2**20))="" zwrite x
  %YDB-F-ASSERT, Assert failed in sr_port/lvzwr_out.c line 97 for expression (MAX_STRLEN >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len))
  ```

* This failure happens only in a Debug build. A Release build runs fine and prints a long string
  corresponding to the contents of the subscripted local variable node `x(1,<2**20-long-string>)`
  in the zwrite format.

Issue
-----
* As part of dee9d0c, the following change happened where we started allowing sets of subscripted
  local variable nodes where each subscript is 1MiB long.

* Below is relevant text from the commit message of dee9d0c.

  ```
  Files that had merge conflicts but the V63003 change was discarded
  ------------------------------------------------------------------
  Reason for discard is mentioned below against each module.

  * sr_port/op_fnquery.c & sr_port/op_putindx.c
          --> GTM-6115/GTM-8792 in GT.M V6.3-003 release notes describes that only $QUERY
          --> on lvns with subscripts exceeding 1Mb in total length will be prohibited, not
          --> other operations like SET but the change in this module does the exact opposite.
  ```

* This meant YottaDB allowed SETs of lvns where each subscript was 1MiB long. Whereas GT.M did not.

  Below is an example using GT.M V7.0-005.

  GT.M only allows a subscript that is 5 bytes shorter than 1MiB when there are just 2 subscripts in
  the lvn. It does not allow a subscript that is 4 bytes shorter than 1MiB.

  ```m
  GTM>set x($justify(1,2**20-4))=""
  %GTM-E-MAXSTRLEN, Maximum string length exceeded

  GTM>set x($justify(1,2**20-5))=""

  GTM>
  ```

  But if one tries to use 3 subscripts, GT.M only allows a subscript that is 68 bytes short of 1MiB.

  ```m
  GTM>set x($justify(1,2,2**20-5))=""
  %GTM-E-MAXSTRLEN, Maximum string length exceeded

  GTM>set x($justify(1,2,2**20-67))=""
  %GTM-E-MAXSTRLEN, Maximum string length exceeded

  GTM>set x($justify(1,2,2**20-68))=""

  GTM>
  ```

  So the maximum allowed subscript length is dependent on other subscripts in the lvn.

* The assert that failed in `sr_port/lvzwr_out.c` is tied to this logic in GT.M and relies on the
  fact that a SET of such a lvn would have been disallowed in `sr_port/op_putindx.c`.

* But YottaDB allows each subscript to be 1MiB long since dee9d0c. Independent of other subscripts
  in the lvn.

* Therefore this assert should have been removed as part of dee9d0c but was missed out then.

Fix
---
* The assert is removed in this commit. Along with it, a debug-only variable `length` as well as some
  comments describing the reliance on the obsolete `sr_port/op_putindx.c` behavior also got removed.
nars1 added a commit that referenced this pull request Jul 25, 2023
… it can cause hang with CLANG/ASAN

Background
----------
* While running the YDBOcto tests with CLANG, I noticed various tests hang. All of them had a
  similar stack-trace.

  ```c
  (gdb) where
  #0  __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
  #1  __sanitizer::Semaphore::Wait() ()
  #2  __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) ()
  #3  __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::PerClass*, __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long) ()
  #4  __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, unsigned long, unsigned long) ()
  #5  __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
  #6  __asan::asan_calloc(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*) ()
  #7  calloc ()
  #8  __pthread_attr_extension (attr=0x7f29af3cee48) at ./nptl/pthread_attr_extension.c:28
  #9  __GI___pthread_attr_setaffinity_np (attr=attr@entry=0x7f29af3cee48, cpusetsize=cpusetsize@entry=32, cpuset=cpuset@entry=0x603000001b40) at ./nptl/pthread_attr_setaffinity.c:45
  #10 __pthread_getattr_np (thread_id=139817006390848, attr=0x7f29af3cee48) at ./nptl/pthread_getattr_np.c:194
  #11 __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ()
  #12 __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ()
  #13 __asan::PlatformUnpoisonStacks() ()
  #14 __asan_handle_no_return ()
  #15 generic_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:187
  #16 ydb_os_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0) at sr_unix/ydb_os_signal_handler.c:85
  #17 <signal handler called>
  #18 sched_yield () at ../sysdeps/unix/syscall-template.S:120
  #19 __sanitizer::StopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), void*) ()
  #20 __lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*) ()
  #21 __GI___dl_iterate_phdr (callback=0x55bd48373320 <__lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*)>, data=0x7ffe13010eb8) at ./elf/dl-iteratephdr.c:74
  #22 __lsan::LockStuffAndStopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), __lsan::CheckForLeaksParam*) ()
  #23 __lsan::CheckForLeaks() ()
  #24 __lsan::DoLeakCheck() ()
  #25 __cxa_finalize (d=0x55bd483af128) at ./stdlib/cxa_finalize.c:83
  #26 __do_global_dtors_aux ()
  #27 ?? ()
  #28 _dl_fini () at ./elf/dl-fini.c:142
  ```

Issue
-----
* The YottaDB SIG-15/SIGTERM signal handler got invoked for a SIG-15. But it noticed that all YottaDB
  exit handler code has already been run (`exit_handler_complete` global variable is TRUE). In that
  case, it invoked any non-YottaDB signal handler for SIG-15 and afterwards, it invoked `_exit()` to
  terminate the process (in line 187).

  **sr_unix/generic_signal_handler.c**
  ```c
    182         if (exit_handler_complete)
    183         {
    184                 if (!using_alternate_sighandling)       /* Go does not send us signals so no need to forward */
    185                 {
    186                         drive_non_ydb_signal_handler_if_any("generic_signal_handler1", sig, info, context, TRUE);
    187                         UNDERSCORE_EXIT(-sig);
    188                 }
    189                 return;         /* Nothing we can do if exit handler has run */
    190         }
  ```

* And because of the `_exit()` call, the CLANG/ASAN library ended up doing a `calloc()` call which hung
  waiting for a futex, most likely due to re-entrant invocations of C library functions that are not
  async-signal safe.

* The cause of this is line 187 above in my opinion.

* If YottaDB exit handler has already run (as part of SIGTERM handling) and we are getting the SIGTERM signal
  again, then I don't see any reason to do the `_exit()` call (using the `UNDERSCORE_EXIT` macro in line 187).

* This code has been there for a long time but I don't think it is doing the right thing.

Fix
---
* Lines 184-188 are now removed in this commit. I think the right thing to do is to just return in case the
  YottaDB exit handler has already been invoked.

* With this change, I verified that the CLANG/ASAN tests run fine in YDBOcto. So at least one Simple API
  use case runs fine with the fix in this commit.

* Initially I thought of disabling lines 184-188 above only when ASAN is enabled. But then I realized it
  is a good change for all cases and so removed lines 184-188.
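
* The "just return" behavior can be sketched with a plain POSIX signal handler as below. All `sketch_` names are hypothetical. The real handler does far more and terminates the process after the first exit handling, whereas this sketch only shows that a repeat signal arriving after exit handling has completed is ignored.

  ```c
  #define _POSIX_C_SOURCE 200809L
  #include <signal.h>
  #include <string.h>

  static volatile sig_atomic_t sketch_exit_handler_complete;
  static volatile sig_atomic_t sketch_exit_handler_runs;

  /* Runs the (simulated) exit handling once and ignores later deliveries. */
  static void sketch_signal_handler(int sig)
  {
      (void)sig;
      if (sketch_exit_handler_complete)
          return;     /* nothing more to do: no forwarding, no _exit() */
      sketch_exit_handler_runs++;
      sketch_exit_handler_complete = 1;
  }

  /* Installs the handler, delivers SIGTERM twice, and returns how often exit handling ran. */
  static int sketch_deliver_twice(void)
  {
      struct sigaction sa;

      memset(&sa, 0, sizeof(sa));
      sa.sa_handler = sketch_signal_handler;
      sigemptyset(&sa.sa_mask);
      sigaction(SIGTERM, &sa, NULL);
      raise(SIGTERM);
      raise(SIGTERM);
      return (int)sketch_exit_handler_runs;
  }
  ```

  Returning early keeps the second delivery from calling anything that is not async-signal safe, which is the property the fix relies on.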
nars1 added a commit that referenced this pull request Sep 11, 2023
… detect signal/timer handling

Background
----------
* We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover`
  subtest failed with the following symptom.

  ```sh
  $ cat ROLLBACK1_3.logx
  mupip journal -ROLLBACK -back -verify -verbose "*"  -noonline -resync=369813 -lost=ROLLBACK1_3.lost
  Sat Sep  9 04:17:18 PM EDT 2023
  .
  .
  %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep  9 16:19:23 2023
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF]
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495]
  %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep  9 16:19:23 2023
  .
  .
  %YDB-I-MUINFOSTR,     Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233
  %YDB-I-MUINFOUINT4,     Record Offset : 65744 [0x000100D0]
  %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP
  %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use))
  Sat Sep  9 04:20:35 PM EDT 2023
  The time the mupip command took:  197
  ```

* The core file corresponding to the above assert failure had the following stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110
  #9  mur_close_files () at sr_port/mur_close_files.c:841
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116
  #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445
  #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913
  #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312
  #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344
  #21 op_tcommit () at sr_port/op_tcommit.c:535
  #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323
  #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362
  #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400
  #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122
  #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158
  #27 mupip_recover () at sr_port/mupip_recover.c:588
  #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122
  #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22

  (gdb) p gtm_threadgbl_true->dio_buff.aligned
  $5 = 0x563c78429000 "GDSDYNUNX04"
  (gdb) p csd
  $6 = (sgmnt_data_ptr_t) 0x563c78429000
  (gdb) p timer_in_handler
  $1 = 1
  (gdb) p multi_thread_in_use
  $2 = 0

  (gdb) p forced_exit
  $3 = 2
  (gdb) p exit_handler_active
  $4 = 1
  (gdb) p in_os_signal_handler
  $1 = 0
  ```

Issue
-----
* The assert failure was in the `db_ipcs_reset()` -> `DB_LSEEKREAD` -> `DBG_CHECK_DIO_ALIGNMENT` call chain.

* The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment.

  ```c
          /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer     \
           * code or in threaded code (as we have only ONE buffer to use). Assert that.                           \
           */                                                                                                     \
          assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use));    \
  ```

* In the failure case, even though we are executing in timer code, we are actually in exit handler code
  (as the `forced_exit` and `exit_handler_active` variables in the gdb analysis above show).
  The exit handler code will not return out of the timer code, so it is okay for the assert
  condition to not be TRUE in this case.

* The global variable being checked in the assert is `timer_in_handler`. This is where the issue is.
  That global variable being TRUE just means the `timer_handler()` function is in the current call stack.
  It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code.
  The assert is intended to protect against a signal handler interrupting the mainline code. Therefore,
  the correct global variable to check in the assert is `in_os_signal_handler`.

Fix
---
* The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.
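
The difference between the two conditions can be sketched in a few lines of C. This is only an illustration under stated assumptions: the globals below are hypothetical stand-ins that reuse the names from the commit message, `shared_dio_buff` models the single `dio_buff.aligned` buffer, and `old_check_ok()`, `new_check_ok()` and `check_core_file_state()` are made-up helper names, not YottaDB functions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the YottaDB globals named above. */
bool timer_in_handler;      /* timer_handler() is somewhere in the call stack */
bool in_os_signal_handler;  /* an OS signal handler interrupted mainline code */
bool multi_thread_in_use;
char shared_dio_buff[64];   /* models the single "dio_buff.aligned" buffer */

/* Condition as written before the fix. */
bool old_check_ok(const char *buff)
{
	return (shared_dio_buff != buff) || (!timer_in_handler && !multi_thread_in_use);
}

/* Condition after the fix: only a real signal interrupt is disallowed. */
bool new_check_ok(const char *buff)
{
	return (shared_dio_buff != buff) || (!in_os_signal_handler && !multi_thread_in_use);
}

/* Reproduce the state seen in the core file and confirm the new condition holds. */
void check_core_file_state(void)
{
	timer_in_handler = true;
	in_os_signal_handler = false;
	multi_thread_in_use = false;
	assert(new_check_ok(shared_dio_buff));
}
```

With the core file state (`timer_in_handler` set, `in_os_signal_handler` clear), the old condition is false while the new one is true, which matches the reasoning in the Issue section.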
nars1 added a commit that referenced this pull request Sep 25, 2023
…mplete deferred state setup before invoking xfer_set_handlers()

* After merging GT.M V7.0-001, the following tests failed in rare cases.
  - -t dual_fail_extend -replic -st dual_fail2_mustop_sigquit
  - -t v60000 -replic -st gtm4525b

* The failure symptom was the following.

  ```c
  (gdb) x/s gtm_threadgbl_true->util_outbuff
  0x17d3ed8:  "%YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 38 for expression (GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit))"
  ```

* And the C-stack was the following.

  ```c
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fffa2b5d390) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:38
  #7  set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #8  xfer_set_handlers (event_type=11, param_val=1730866112, popped_entry=0) at sr_port/deferred_events.c:191
  #9  generic_signal_handler (sig=15, info=0x7f7167e24218 <stapi_signal_handler_oscontext+3320>, context=0x7f7167e24298 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305
  #10 ydb_os_signal_handler (sig=15, info=0x7fffa2b5d9f0, context=0x7fffa2b5d8c0) at sr_unix/ydb_os_signal_handler.c:85
  #11 <signal handler called>
  #12 __GI___clock_nanosleep (clock_id=1, flags=1, req=0x7fffa2b5e058, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  #13 wait_for_repl_inst_unfreeze_nocsa_jpl (jpl=0x17ec240) at sr_port/anticipatory_freeze.h:517
  #14 wait_for_repl_inst_unfreeze (csa=0x18f7040) at sr_port/anticipatory_freeze.h:547
  #15 jnl_write_attempt (jpc=0x18f7a40, threshold=29324848) at sr_port/jnl_write_attempt.c:348
  #16 jnl_flush (reg=0x189afe8) at sr_port/jnl_flush.c:57
  #17 tp_tend () at sr_port/tp_tend.c:795
  #18 op_tcommit () at sr_port/op_tcommit.c:497

  (gdb) f 6
  #6  0x00007f71672ae771 in deferred_signal_handler () at sr_port/deferred_signal_handler.c:38
  38              assert(GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit));

  ```

* The `SET_FORCED_EXIT_STATE` macro call (in frame 9 above) is where the issue is.

  **sr_port/have_crit.h**
  ```c
      172 #define SET_FORCED_EXIT_STATE(SIG)                                                                                              \
      173 {                                                                                                                               \
      174         char                    *rname;                                                                                         \
      175                                                                                                                                 \
      176         GBLREF VSIG_ATOMIC_T    forced_exit;                                                                                    \
      177         GBLREF int              forced_exit_sig;                                                                                \
      178         GBLREF boolean_t        (*xfer_set_handlers_fnptr)(int4, void (*callback)(int4), int4 param, boolean_t popped_entry);   \
      179         GBLREF void             (*deferred_signal_set_fnptr)(int4 dummy_val);                                                   \
      180                                                                                                                                 \
      181         /* Below code is not thread safe as it modifies global variables "forced_exit"                                          \
      182          * and "forced_exit_sig".                                                                                               \
      183          */                                                                                                                     \
      184         assert(!INSIDE_THREADED_CODE(rname));                                                                                   \
      185         assert((0 == forced_exit) || (1 == forced_exit));                                                                       \
  --> 186         forced_exit = 1;                                                                                                        \
      187         forced_exit_sig = SIG;          /* Record the signal forcing us to exit */                                              \
      188         if (in_os_signal_handler)                                                                                               \
      189         {       /* If we are inside an OS signal handler and therefore had to defer exit                                        \
      190                  * handling, treat this as an outofband event as this is checked by lots of                                     \
      191                  * potentially long-running commands in the runtime (e.g. HANG etc.) and we                                     \
      192                  * want all of those to automatically trigger process exit handling.                                            \
      193                  * The below invocation takes care of the signal as a deferred outofband event                                  \
      194                  * that gets handled at the earliest safe point.                                                                \
      195                  */                                                                                                             \
      196                 if (NULL != xfer_set_handlers_fnptr)                                                                            \
  --> 197                         (*xfer_set_handlers_fnptr)(deferred_signal, deferred_signal_set_fnptr, 0, FALSE);                       \
      198                 /* else: it is "gtmsecshr" in which case outofband does not apply */                                            \
      199         }                                                                                                                       \
      200         /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */                                    \
  --> 201         SET_DEFERRED_EXIT_CHECK_NEEDED;                                                                                         \
      202         SET_FORCED_THREAD_EXIT;         /* Signal any running threads to stop */                                                \
      203         SET_FORCED_MULTI_PROC_EXIT;     /* Signal any parallel processes to stop */                                             \
      204 }
  ```

* Line 186 sets `forced_exit` and Line 201 sets the corresponding deferred event. But Line 197,
  which runs in between, ends up invoking `deferred_signal_handler()`, whose assert expects that
  whenever Line 186 has taken effect, Line 201 has taken effect too.

* This is fixed by moving lines 188-199 above to execute AFTER lines 200-203. That way the state setup
  of the forced exit is finished first, and only then does the outofband setup happen through the
  `xfer_set_handlers_fnptr` call.

* Now that Lines 186 and 201 are executed BEFORE line 197 in this commit, the assert failure seen
  in the test failure should be automatically fixed.
nars1 added a commit that referenced this pull request Nov 15, 2023
…ert failure)

Background
----------
* Below is a first-time failure, when running the `r126/ydb464` subtest (from the YDBTest project), that
  I noticed while trying to reproduce some other failure.

  ```diff
  --- ydb464/ydb464.diff ---
  19a20,73
  > r126_0_31/ydb464/simpleapi2/child98118.log
  > %YDB-F-ASSERT, Assert failed in sr_port/insert_region.c line 110 for expression ((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit))
  ```

* The C-stack and relevant variables from the core file are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffee07f7480) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  #7  mlk_unlock (p=0x1591940) at sr_port/mlk_unlock.c:70
  #8  tp_unwind (newlevel=0, invocation_type=ROLLBACK_INVOCATION, tprestart_rc=0x0) at sr_port/tp_unwind.c:294
  #9  op_trollback (rb_levels=0) at sr_port/op_trollback.c:200
  #10 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:230
  #12 signal_exit_handler (exit_handler_name=0x7ff4913b071e "deferred_exit_handler", sig=2, info=0x7ff491795458 <stapi_signal_handler_oscontext+3224>, context=0x7ff4917954d8 <stapi_signal_handler_oscontext+3352>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 rel_crit (reg=0x14d0170) at sr_unix/rel_crit.c:81
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  #17 op_lock2_common (timeout=0, laflag=64 '@') at sr_port/op_lock2.c:242
  #18 op_incrlock_common (timeout=0) at sr_port/op_incrlock.c:49
  #19 ydb_lock_incr_s (timeout_nsec=0, varname=0x7ffee07f8c30, subs_used=0, subsarray=0x0) at sr_unix/ydb_lock_incr_s.c:91
  #20 runProc (settings=0x7ffee07fab80, curDepth=1) at simpleapi/inref/randomWalk.c:489
  #21 tpHelper (tpfnparm=0x7ffee07fa100) at simpleapi/inref/randomWalk.c:691
  #22 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256
  #23 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #24 runProc (settings=0x7ffee07fab80, curDepth=0) at simpleapi/inref/randomWalk.c:666
  #25 runProc_driver (settings=0x7ffee07fab80) at simpleapi/inref/randomWalk.c:145
  #26 main () at simpleapi/inref/randomWalk.c:93

  (gdb) f 6
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  110                                     assert((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit));

  (gdb) p process_exiting
  $3 = 1

  (gdb) p t_tries
  $4 = 3

  (gdb) p dollar_tlevel
  $5 = 1

  (gdb) p csa->now_crit
  $6 = 0

  (gdb) up
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  120                             TPNOTACID_CHECK(LOCKGCINTP);
  ```

Issue
-----
* The assert that failed in `insert_region()` (frame 6 in above stack trace) indicates that we were in the
  final retry (i.e. `t_tries` is equal to `3`, which is `CDB_STAGNATE`) but we did not hold crit on the
  current region where we are trying to do an `mlk_unlock()` operation.

* The assert is valid and did expose an issue.

* In frame 16, in `mlk_lock()`, we did a `rel_crit()` call in the `TPNOTACID_CHECK` macro while in the
  final retry.

  **sr_port/mlk_lock.c**
  ```c
    120                         TPNOTACID_CHECK(LOCKGCINTP);
  ```

* Below is the code inside the macro.

  **sr_port/tp.h**
  ```c
     979 #define TPNOTACID_CHECK(CALLER_STR)                                                                                             \
     980 {                                                                                                                               \
     981         GBLREF  boolean_t       mupip_jnl_recover;                                                                              \
     982         mval            zpos;                                                                                                   \
     983                                                                                                                                 \
     984         if (IS_TP_AND_FINAL_RETRY)                                                                                              \
     985         {                                                                                                                       \
  -> 986                 TP_REL_CRIT_ALL_REG;                                                                                            \
     987                 assert(!mupip_jnl_recover);                                                                                     \
     988                 TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK;                                                                         \
  ```

* Line 986 is where the issue is. We do a `rel_crit()` call there but `t_tries` has not yet been decremented.
  The decrement of `t_tries` happens 2 lines later at line 988.

* Before doing the `rel_crit()` call, we need to decrement `t_tries`. That way, if `rel_crit()`
  decides to invoke exit handling for a deferred SIGINT signal (sent in the `ydb464` subtest),
  the assert in `insert_region()` does not see this out-of-design state, and the code does not
  attempt to invoke `t_retry()` etc. The latter is a no-no, as we should not transfer control to M code
  as part of a TP restart while the process is about to terminate on receipt of a SIGINT signal.

Fix
---
* Notice that in `sr_port/t_commit_cleanup.c`, the `t_tries` decrement happens BEFORE the `rel_crit()`
  call.

  **sr_port/t_commit_cleanup.c**
  ```c
    288       if (CDB_STAGNATE <= t_tries)
    289               TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK; /* t_tries untouched for rollback and recover */
      .
      .
    303               if (!csa->hold_onto_crit && csa->now_crit)
    304                       rel_crit(tr->reg);      /* Undo Step (CMT01) */
  ```

* In a similar fashion, in the `TPNOTACID_CHECK` macro in `sr_port/tp.h`, the `TP_REL_CRIT_ALL_REG` call
  should happen AFTER the `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK` call. And that is the fix.

* While doing this fix, I noticed a similar ordering issue in `sr_port/gvcst_init.c` and so fixed that too.
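
The ordering problem can be reproduced in a small standalone C model. This is a sketch under assumptions, not YottaDB code: `rel_crit_model()` stands in for `rel_crit()` running deferred exit handling, the decrement is unconditional (the real `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK` macro is conditional), `CDB_STAGNATE` is taken as `3` based on the gdb output above, and every function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define CDB_STAGNATE	3	/* assumed from the gdb output (t_tries == 3 in the final retry) */

int  t_tries;
int  dollar_tlevel;
bool now_crit;
bool exit_pending;	/* a deferred SIGINT is waiting to be handled */
bool assert_ok;		/* what the insert_region() assert would evaluate to */

/* Models the assert in insert_region(), reached from exit handling. */
void exit_handler_check(void)
{
	assert_ok = (CDB_STAGNATE > t_tries) || (dollar_tlevel && now_crit);
}

/* Models rel_crit(): releases crit, then handles any deferred exit. */
void rel_crit_model(void)
{
	now_crit = false;
	if (exit_pending)
		exit_handler_check();
}

void enter_final_retry(void)
{
	t_tries = CDB_STAGNATE;
	dollar_tlevel = 1;
	now_crit = true;
	exit_pending = true;
	assert_ok = true;
}

/* Old ordering: release crit first, decrement t_tries afterwards. */
void tpnotacid_old(void)
{
	rel_crit_model();
	t_tries--;
}

/* New ordering: decrement t_tries first, then release crit. */
void tpnotacid_new(void)
{
	t_tries--;
	rel_crit_model();
}

/* Confirms that the new ordering keeps the insert_region() condition true. */
void verify_new_ordering(void)
{
	enter_final_retry();
	tpnotacid_new();
	assert(assert_ok);
}
```

In this model the old ordering evaluates the condition with `t_tries` still at `CDB_STAGNATE` and crit already released, which is the state seen in the core file; the new ordering avoids it.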

Notes
-----
* While this failure happened with a Debug build of YottaDB, I suspect there is an issue in the Release
  build of YottaDB too, but I am not sure exactly what the user-visible implications are. Even so, it is
  unlikely to be encountered in practice, and so no user-visible issue is created for this.
nars1 added a commit that referenced this pull request Nov 15, 2023
…_port/deferred_events.c

Background
----------
* The `v61000/intrpt_wcs_wtstart` subtest (in the YDBTest project) failed on a few rare occasions
  during internal testing with the following symptom.

  ```diff
  12a13,299
  > v61000_0_22/intrpt_wcs_wtstart/mumps-wb.out
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events.c line 114 for expression (no_event == outofband || (event_type == outofband))
  ```

Issue
-----
* The stack trace and relevant details from the gdb core analysis are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcc56fd8c0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  #7  jobinterrupt_event (sig=10, info=0x7fb372b8a518 <stapi_signal_handler_oscontext+5528>, context=0x7fb372b8a598 <stapi_signal_handler_oscontext+5656>) at sr_port/jobinterrupt_event.c:61
  #8  <signal handler called>
  #9  clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #10 m_usleep (useconds=10000) at sr_unix/sleep.c:37
  #11 wcs_sleep (sleepfactor=6310) at sr_port/wcs_sleep.c:28
  #12 wcs_flu (options=519) at sr_unix/wcs_flu.c:571
  #13 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:632
  #14 gv_rundown () at sr_port/gv_rundown.c:122
  #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #16 signal_exit_handler (exit_handler_name=0x7fb372a19ecf "generic_signal_handler", sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #17 generic_signal_handler (sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:502
  #18 ydb_os_signal_handler (sig=15, info=0x7ffcc56ffd30, context=0x7ffcc56ffc00) at sr_unix/ydb_os_signal_handler.c:88
  #19 <signal handler called>
  #20 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #21 m_usleep (useconds=999000) at sr_unix/sleep.c:37
  #22 wcs_wtstart (region=0xc30970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:216
  #23 wcs_timer_start (reg=0xc30970, io_ok=1) at sr_port/t_end_sysops.c:1346
  #24 t_end (hist1=0xcfe798, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1848
  #25 gvcst_put2 (val=0xc928b8, parms=0x7ffcc5709a80) at sr_port/gvcst_put.c:2796
  #26 gvcst_put (val=0xc928b8) at sr_port/gvcst_put.c:302
  #27 op_gvput (var=0xc928b8) at sr_port/op_gvput.c:79

  (gdb) f 6
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  114                     assert(no_event == outofband || (event_type == outofband));

  (gdb) p (enum outofbands)no_event
  $2 = no_event

  (gdb) p (enum outofbands)outofband
  $1 = deferred_signal

  (gdb) p (enum outofbands)event_type
  $3 = jobinterrupt
  ```

* The test sends a SIGTERM (i.e. SIG-15) signal. This caused the `outofband` variable to be set to
  `deferred_signal` in frame 17 above (`generic_signal_handler.c` inside the `SET_FORCED_EXIT_STATE` macro).

* And then the process was sleeping (due to a white-box test case in the test).

* At that point, it was holding crit and another process was waiting for this and so was about to send
  a `MUTEXLCKALERT` message. At this point, since the test framework had set the `gtm_procstuckexec` env
  var to `com/gtmprocstuck_get_stack_trace.csh`, that was invoked and it in turn invoked `^%YDBPROCSTUCKEXEC`
  which in turn sent a `SIGUSR1` signal (i.e. a `mupip intrpt`) to this very same process that was sleeping
  while holding crit.

* And at this point, the process got the assert failure because the `outofband` variable indicated that
  a `SIG-15` signal needs to be handled whereas the `event_type` variable indicated that the current
  out of band event is a `jobinterrupt` event.
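
The failing condition can be checked in isolation with a short C sketch. The enum below is a hypothetical subset: the values `3` and `11` are taken from the `event_type` arguments in the gdb output in this page, while `no_event = 0` is an assumption, and `pending_event_assert_holds()` is a made-up helper name.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the out-of-band event types. */
enum outofbands
{
	no_event = 0,		/* assumed value */
	jobinterrupt = 3,	/* event_type=3 in frame 6 above */
	deferred_signal = 11	/* event_type=11 when SIGTERM handling was deferred */
};

/* The condition from sr_port/deferred_events.c line 114. */
bool pending_event_assert_holds(enum outofbands outofband, enum outofbands event_type)
{
	return (no_event == outofband) || (event_type == outofband);
}

/* A new event arriving when nothing is pending satisfies the condition. */
void verify_no_pending_event_case(void)
{
	assert(pending_event_assert_holds(no_event, jobinterrupt));
}
```

With `outofband` equal to `deferred_signal` and `event_type` equal to `jobinterrupt`, the condition evaluates to false, which is the combination the core file shows.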

Fix
---
* This seems like a valid scenario and I suspect the assert is invalid.

* I noticed that this very same assert has been removed in a later GT.M release V7.1-001.

  ```diff
  $ cd YDB
  $ git show tags/V7.1-001 sr_port/deferred_events.c | head -35 | tail -8
  @@ -127,7 +127,6 @@ boolean_t xfer_set_handlers(int4  event_type, int4 param_val, boolean_t popped_e
          }
          if (!already_ev_handling)
          {
  -               assert(no_event == outofband || (event_type == outofband));
                  assert(!dollar_zininterrupt || (jobinterrupt != event_type));
                  if (entry != (TREF(save_xfer_root_ptr))->ev_que.fl)
                  {       /* no event in play so pend this one by jiggeriing the xfer_table */
  ```

* I assume GT.M noticed a similar issue and fixed it, not in V7.0-001 (which is what YottaDB master
  currently has merged) but in the much later V7.1-001 release.

* Therefore, I am removing the assert that failed.

* This should let the `v61000/intrpt_wcs_wtstart` test run fine until GT.M V7.1-001 gets merged into
  the YottaDB master branch.
nars1 added a commit that referenced this pull request Nov 17, 2023
…Simple Thread API application

Background
----------
* The `r126/ydb464` subtest failed in one rare run with the following failure symptom.

  ```diff
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events_queue.c line 48 for expression (INTRPT_IN_EVENT_HANDLING == intrpt_ok_state)
  ```

* When this specific test was rerun around 10000 times, we saw around a dozen failures (with differing assert
  failures but all pointing to the same underlying issue) so this failure was reproducible but not easily.

Issue
-----
* Relevant details from the core file analysis are pasted below.

  ```c
  (gdb) thread apply all bt

  Thread 6 (Thread 0x7fa62a6c0640 (LWP 99885)):
    .
  #6  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #7  set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:37
  #8  xfer_set_handlers (event_type=11, param_val=939582496, popped_entry=0) at sr_port/deferred_events.c:190
  #9  generic_signal_handler (sig=2, info=0x7fa63a1b6fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fa63a1b7058 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305
  #10 ydb_os_signal_handler (sig=2, info=0x7fa625096bf0, context=0x7fa625096ac0) at sr_unix/ydb_os_signal_handler.c:88
  #11 <signal handler called>
  #12 __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at ./nptl/pthread_create.c:835
  #13 pthread_create ()
  #14 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=6) at simplethreadapi/inref/randomWalk.c:662
  #15 threadHelper (args=0x7fa62a6ba880) at simplethreadapi/inref/randomWalk.c:723
  #16 tpHelper (tptoken=..., errstr=0x7fa62a6ba850, tpfnparm=0x7fa62a6ba880) at simplethreadapi/inref/randomWalk.c:712
  #17 ydb_tp_st (tptoken=..., errstr=0x7fa62a6ba850, tpfn=0x55fa406e2d20 <tpHelper>, tpfnparm=0x7fa62a6ba880, transid=0x55fa406f69ea "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:100
  #18 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=5) at simplethreadapi/inref/randomWalk.c:642
  #19 threadHelper (args=0x7fa62a6bb7e0) at simplethreadapi/inref/randomWalk.c:723
    .
  #41 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  Thread 1 (Thread 0x7fa61ef2e640 (LWP 7158)):
    .
  #6  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #7  xfer_reset_handlers (event_type=11) at sr_port/deferred_events.c:235
  #8  outofband_clear () at sr_port/outofband_clear.c:41
  #9  outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:55
  #10 ydb_zwr2str_s (zwr=0x7fa61ef2d550, str=0x7fa61ef2d560) at sr_unix/ydb_zwr2str_s.c:55
  #11 ydb_zwr2str_st (tptoken=..., errstr=0x7fa61ef2d530, zwr=0x7fa61ef2d550, str=0x7fa61ef2d560) at sr_unix/ydb_zwr2str_st.c:40
  #12 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=7) at simplethreadapi/inref/randomWalk.c:545
  #13 threadHelper (args=0x7fa62a6b9940) at simplethreadapi/inref/randomWalk.c:723
  #14 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
  #15 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  (gdb) p dollar_tlevel
  $4 = 6

  (gdb) p/x ydb_engine_threadsafe_mutex_holder[0]
  $14 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[1]
  $15 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[2]
  $16 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[3]
  $17 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[4]
  $18 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[5]
  $19 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[6]
  $20 = 0x7fa61ef2e640
  ```

* This is a case where signals are sent (SIGINT aka SIG-2 in this case) to a Simple Thread API process.
  One thread (`Thread 1` above) is already running under the YottaDB engine lock, but the signal
  gets delivered to another thread (`Thread 6` above), which incorrectly starts executing the signal
  handler, which in turn invokes `xfer_set_handlers()` etc. So 2 threads are executing YottaDB
  engine/runtime code at the same time although only one holds the lock. This is a no-no since YottaDB
  runtime logic is not multi-thread safe.

* From the above analysis, it is clear that the process was executing a TP transaction with `dollar_tlevel`
  equal to `6`.

  `Thread 6` had invoked `ydb_tp_st()` (in frame 17) which in turn invoked a callback function that created
  a new thread `Thread 1`.

  `Thread 6` held the YottaDB engine multi-thread lock for tlevels 0, 1, 2, 3, 4, 5.

  For tlevel 6, `Thread 1` held the YottaDB engine multi-thread lock.

* But the `SIGINT` signal (sent by the test) got sent to `Thread 6`. Therefore, it should have realized,
  while in `generic_signal_handler()`, that `dollar_tlevel` is 6 and it does not own the tlevel=6 lock
  (`Thread 1` owns it) and therefore should have done a `return` at line 404 below.

  ```c
     315 #define FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED(SIGHNDLRTYPE, SIG, IS_EXI_SIGNAL, INFO, CONTEXT)                                   \
       .
     332     if (simpleThreadAPI_active)                                                                                     \
     333     {                                                                                                               \
       .
     355             thisThreadId = pthread_self();                                                                          \
     356             assert(thisThreadId);                                                                                   \
     357             SET_YDB_ENGINE_MUTEX_HOLDER_THREAD_ID(mutexHolderThreadId, tLevel);                                     \
       .
     374             thisThreadIsMutexHolder = pthread_equal(mutexHolderThreadId, thisThreadId);                             \
       .
     386             if (!thisThreadIsMutexHolder                                                                            \
     387                             || (!IS_EXI_SIGNAL && (tLevel && (!isSigThreadDirected || signalForwarded))))           \
     388             {       /* Two possibilities.                                                                           \
       .
  -> 404                     return;                                                                                         \
     405             } else                                                                                                  \
  ```

* But clearly that did not happen (from the core file). Therefore, `thisThreadIsMutexHolder` (set at line
  374 above) must have been `TRUE`.

* How that happened can be seen in line 286 below inside the macro (invoked from line 357 above).

  ```c
    268 #define SET_YDB_ENGINE_MUTEX_HOLDER_THREAD_ID(HOLDER_THREAD_ID, TLEVEL)                                         \
    269 {                                                                                                               \
    270    GBLREF  uint4           dollar_tlevel;                                                                  \
    271    GBLREF  pthread_t       ydb_engine_threadsafe_mutex_holder[];                                           \
    272                                                                                                            \
    273    /* If not in TP, the YottaDB engine lock index is 0 (i.e. ydb_engine_threadsafe_mutex_holder[0] is      \
    274     * current lock holder thread if it is non-zero). But if we are in TP, then lock index could be         \
    275     * "dollar_tlevel"     : e.g. if a "ydb_get_st" call occurs inside of the "ydb_tp_st" call OR           \
    276     * "dollar_tlevel - 1" : if control is in the TP callback function inside "ydb_tp_st" but not a         \
    277     *      SimpleThreadAPI call like "ydb_get_st" etc.                                                     \
    278     */                                                                                                     \
    279    TLEVEL = dollar_tlevel; /* take a local copy of global variable as it could be concurrently changing */ \
    280    if (!TLEVEL)                                                                                            \
    281            HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[0];                                       \
    282    else                                                                                                    \
    283    {                                                                                                       \
    284            HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[TLEVEL];                                  \
    285            if (!HOLDER_THREAD_ID)                                                                          \
    286                    HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[TLEVEL - 1];                      \
    287    }                                                                                                       \
    288 }
  ```

* Line 284 must have returned a value of 0 for `HOLDER_THREAD_ID` and so we went to line 286 and
  used the thread owner of tlevel=5 which was `Thread 6`.

* In the core file, we see that tlevel=6 lock owned is `Thread 1`. But at the time line 284 got executed,
  `Thread 1` was not owning the lock.

* That can be explained if `Thread 1` had not yet done the `ydb_zwr2str_st()` call when line 284 got
  executed.

* The issue then is that when we found no one holding the tlevel=6 lock, we went to see who holds the
  tlevel=5 lock and returned that thread id as the current YottaDB engine multi-thread lock holder.

* This is where the issue is. `Thread 1`, even though it had not yet attempted to get the lock, owns
  the lock at this point, since `Thread 6` has invoked the callback function and has no control over
  what calls the callback function can make (including creating new threads that in turn make
  Simple Thread API calls of their own, as happened with `Thread 1`).

* Treating `Thread 6` as owning the lock ended up with a situation where 2 threads think they each
  own the engine lock and run YottaDB code at the same time causing the assert failures.

* This issue is long standing (started in 2afcbd2, which was committed 2019/03/25) but it manifests
  as assert failures only after the GT.M V7.0-001 code merge. That is because the deferred event queue
  handling got reworked in V7.0-001 making it possible for more logic to execute while in the
  signal handler thereby exposing this long standing issue.

* Note that even then it has taken a few months of testing to show this one failure in a C program that
  invokes multiple threads. So it is really a rare issue.

Fix
---
* The fix is thankfully simple and is to remove lines 285-286 above. That is, check the lock
  holder for the tlevel which `dollar_tlevel` global currently points to. Do not go one before that
  if we find the top level not being held by any thread.

* With this change, `Thread 6` will not incorrectly conclude it is the owner. This is because it will
  find that the owner of the YottaDB engine lock is no thread in this case and since that does not
  match its own thread id, it will `return` if it gets delivered the SIGINT (after noting down the
  fact that this signal handling was deferred) and the next thread that runs YottaDB runtime logic
  will notice this happened and handle the signal while it holds the engine lock.
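
The difference between the two lookups can be sketched as below. This is a minimal illustration, not the actual macro: `thread_id_t`, `MAX_TP_DEPTH` and both function names are hypothetical, and the array is a stand-in for `ydb_engine_threadsafe_mutex_holder[]` where 0 means no holder.

```c
#include <assert.h>

#define MAX_TP_DEPTH 8	/* hypothetical bound, for illustration only */

/* Hypothetical stand-in for a pthread_t holder id; 0 means "no holder". */
typedef unsigned long thread_id_t;

/* Lookup before the fix: falls back to the enclosing TP level if the current level is unheld. */
thread_id_t holder_with_fallback(const thread_id_t holder[], unsigned int tlevel)
{
	thread_id_t	id;

	assert(tlevel < MAX_TP_DEPTH);
	id = holder[tlevel];
	if ((0 != tlevel) && (0 == id))
		id = holder[tlevel - 1];	/* corresponds to lines 285-286 of the macro */
	return id;
}

/* Lookup after the fix: only the level that dollar_tlevel currently points to is consulted. */
thread_id_t holder_without_fallback(const thread_id_t holder[], unsigned int tlevel)
{
	assert(tlevel < MAX_TP_DEPTH);
	return holder[tlevel];
}
```

With level 5 held by thread 6 and level 6 not yet held, the first function reports thread 6 as the holder while the second reports no holder, which is what lets `Thread 6` defer the signal instead of handling it.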

Notes
-----
* Since this issue is very unlikely to be seen in practice (needs a Simple Thread API application that
  creates threads while inside a `ydb_tp_st()` call and also sends SIGINT signals), no YDB issue is
  created for this.
nars1 added a commit that referenced this pull request Jan 19, 2024
…ABANDONED (fixes GTM-9400 for real)

Background
----------
The below is pasted from https://gitlab.com/YottaDB/DB/YDBTest/-/issues/550#note_1733171439

* While trying to test YDBTest#550, I noticed that the KILLABANDONED error happens even with V7.0-001
  whereas the GTM-9400 release note in GT.M V7.0-001 indicates this as being fixed in V7.0-001.

* Below is the test case (using `tcsh`, not `sh`) that stops after a few seconds with a `KILLABANDONED`
  error with V7.0-000 as well as with V7.0-001.

  ```sh
  cat > kill.csh << CAT_EOF
  while (1)
          k15 reorg
          if (-e STOP) then
                  break
          endif
  end
  CAT_EOF

  rm -f STOP
  unsetenv ydb_gbldir
  setenv gtmgbldir mumps.gld
  rm -f mumps.gld mumps.dat
  $gtm_dist/mumps -run GDE exit
  $gtm_dist/mupip create
  $gtm_dist/mumps -run %XCMD 'for i=1:1:100000 set ^x(i)=$j(i,200)'
  source kill.csh &
  while (1)
          foreach fillfactor (50 10 90)
                  $gtm_dist/mupip reorg -fill=$fillfactor -region DEFAULT
                  $gtm_dist/mupip integ -reg "*"
                  if ($status) then
                          touch STOP
                          break
                  endif
          end
          if (-e STOP) then
                  break
          endif
  end
  ```

Issue
-----
* I added an assert in `secshr_db_clnup.c` where it invoked the `INCR_ABANDONED_KILLS` macro and ran
  the above test to see the cause of the above `KILLABANDONED` error.

* With that change, I got an assert failure (in line 446 below) and the core file showed the below
  stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140185231378240, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcb98a5b50) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446
  #9  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:124
  #10 signal_exit_handler (exit_handler_name=0x7f7f6ac987be "deferred_exit_handler", sig=15, info=0x7f7f6ae3ddf8 <stapi_signal_handler_oscontext+3320>, context=0x7f7f6ae3de78 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #11 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #12 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813
  #14 mu_reorg (gl_ptr=0x5617ac865bc0, exclude_glist_ptr=0x7ffcb98aa450, resume=0x7ffcb98aa344, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:572
  #15 mupip_reorg () at sr_port/mupip_reorg.c:334
  #16 mupip_main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip_main.c:117
  #17 dlopen_libyottadb (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948, main_func=0x5617ab37c004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #18 main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip.c:22

  (gdb) f 8
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446
  446                assert(!mu_reorg_process);

  (gdb) list
  441             } else if (!dollar_tlevel)
  442             {
  443                     if ((NULL != kip_csa) && (csa == kip_csa))
  444                     {
  445                             /* Assert that MUPIP REORG never leaves the database with an abandoned kill */
  446                             assert(!mu_reorg_process);
  447                             assert(0 < kip_csa->hdr->kill_in_prog);
  448                             DECR_KIP(csd, csa, kip_csa);
  449                             INCR_ABANDONED_KILLS(csd, csa);
  450                     }

  (gdb) f 13
  #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813
  1813            REVERT; /* no need for t_ch to be invoked if any errors occur after this point */
  ```

* Below is the macro sequence showing how the `REVERT` in frame 13 ends up calling `deferred_signal_handler` in frame 12.

  ```
  REVERT -> ENABLE_INTERRUPTS -> DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED -> deferred_signal_handler
  ```

* GT.M V7.0-001 fixed GTM-9400 by adding a `DEFERRED_EXIT_REORG_CHECK` macro at logical points in
  the mupip reorg code flow where we are guaranteed the kill-in-progress condition has been cleared
  (i.e. DECR_KIP has been called).

  ```diff
  $ git show -U1 tags/V7.0-001 sr_port/mu_reorg.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK
  @@ -449,2 +449,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                                          DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                                                       DEFERRED_EXIT_REORG_CHECK;
                                                          if (detailed_log)
  --
  @@ -579,2 +580,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                                  DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                                               DEFERRED_EXIT_REORG_CHECK;
                                                  if (detailed_log)
  @@ -677,2 +679,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                  DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                               DEFERRED_EXIT_REORG_CHECK;
                                  if (detailed_log)

  $ git show -U1 tags/V7.0-001 sr_unix/mu_swap_root.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK
  @@ -271,2 +248,3 @@ void        mu_swap_root(glist *gl_ptr, int *root_swap_statistic_ptr)
          }
  +       DEFERRED_EXIT_REORG_CHECK;      /* a single directory tree has to be quick, so check at end, rather than each DECR_KIP  */
          return;
  ```

* But what it did not realize is that even before those logical points are reached, it is possible
  for `t_end.c` to invoke the `REVERT` macro which in turn would invoke `deferred_signal_handler` like
  is seen in the above stack trace.

* Not sure how this did not get caught during the GT.M testing.

Fix
---
* In any case, the fix is simple and is to enhance `sr_port/deferred_signal_handler.c` to not invoke
  `deferred_exit_handler()` but instead `return` in case we are a `mupip reorg` process (indicated by
  the boolean_t typed `mu_reorg_process` global variable being TRUE) and we are in the middle of a
  kill-in-progress (indicated by `cs_data->kill_in_prog` being TRUE).

* This way, we delay the deferred signal handling of the `MUPIP STOP` (aka `SIG-15`/SIGTERM) a little
  more until the logical point in `sr_port/mu_reorg.c` or `sr_unix/mu_swap_root.c` is reached where
  the `DEFERRED_EXIT_REORG_CHECK` macro is invoked.
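
The added guard can be sketched roughly as follows. This is only an illustration under stated assumptions: `struct reorg_state`, its `exit_pending` field and `ok_to_run_deferred_exit()` are hypothetical names, and `kill_in_prog` stands in for `cs_data->kill_in_prog`. The real check lives in `sr_port/deferred_signal_handler.c`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the process-level state named in the commit message. */
struct reorg_state
{
	bool		mu_reorg_process;	/* true if this process is a MUPIP REORG */
	unsigned int	kill_in_prog;		/* stand-in for cs_data->kill_in_prog */
	bool		exit_pending;		/* a MUPIP STOP (SIGTERM) was deferred earlier */
};

/* Returns true if the deferred exit handler may run now, false if it has to stay deferred. */
bool ok_to_run_deferred_exit(const struct reorg_state *st)
{
	assert(NULL != st);
	if (!st->exit_pending)
		return false;	/* nothing to handle */
	if (st->mu_reorg_process && (0 != st->kill_in_prog))
		return false;	/* wait for the DEFERRED_EXIT_REORG_CHECK point after DECR_KIP */
	return true;
}
```

The exit request is not lost in the deferred case. It stays pending until the kill-in-progress indication is cleared.
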
nars1 added a commit that referenced this pull request Mar 26, 2024
…ofband_clear.c)

Background
----------
* After GT.M V7.0-002 changes were merged, the `r130/ydb560` subtest started failing with the
  following symptom.

  ```
  %YDB-F-ASSERT, Assert failed in sr_port/outofband_clear.c line 43 for expression (TRUE == status)
  ```

* A simple way to reproduce this issue is to run the following and in a parallel terminal send
  a `kill -4` to the `mumps` process (that is stuck in the `hang` command).

  ```sh
  $ cat test.m
   set x=1
   hang 100

  $ mumps -run test
  ```

* Before V7.0-002 merge, one would see just 1 core file (due to the `kill -4`). But after the
  merge, one would see 3 core files. And the 2nd core file had the following stack trace.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140112165532736, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd024690b0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  #9  outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:58
  #10 async_action (lnfetch_or_start=false) at sr_port/deferred_events.c:394
  #11 lvzwr_var (lv=0x60f0000005f0, n=0) at sr_port/lvzwr_var.c:184
  #12 lvzwr_fini (out=0x7ffd02471dc0, t=1) at sr_port/lvzwr_fini.c:84
  #13 op_lvpatwrite (count=0, arg1=140724641668224) at sr_port/op_lvpatwrite.c:85
  #14 zshow_zwrite (output=0x7ffd02471dc0) at sr_port/zshow_zwrite.c:40
  #15 op_zshow (func=0x7ffd0247a0e0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
  #16 jobexam_dump (dump_filename_arg=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fatal_file_name_buff=0x7ffd0247ae20 "/extra4/testarea1/nars/V998/tst_V998_R201_dbg_28_240320_111309/r130_0/ydb560/YDB_FATAL_ERROR.ZSHOW_DMP_89246_1.txt", fmt=0x0, dev_in_use=0x7ffd0247a240) at sr_port/jobexam_process.c:238
  #17 jobexam_process (dump_file_name=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fmt=0x0) at sr_port/jobexam_process.c:147
  #18 create_fatal_error_zshow_dmp (signal=4) at sr_port/create_fatal_error_zshow_dmp.c:66
  #19 signal_exit_handler (exit_handler_name=0x7f6e64c43140 "deferred_exit_handler", sig=4, info=0x7f6e6519f938 <stapi_signal_handler_oscontext+3320>, context=0x7f6e6519f9b8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:59
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #23 async_action (lnfetch_or_start=true) at sr_port/deferred_events.c:380
  #24 l1 () at sr_x86_64/op_startintrrpt.s:40

  (gdb) f 8
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  43                      assert(TRUE == status);

  (gdb) list
  41              {
  42                      status = xfer_reset_if_setter(outofband);
  43                      assert(TRUE == status);
  44              }
  45      }

  (gdb) p outofband
  $1 = 11

  (gdb) p (enum outofbands)outofband
  $2 = deferred_signal
  ```

Issue
-----
* The issue was that `xfer_reset_if_setter()` had been reworked in GT.M V7.0-002. As a result, the
  `deferred_signal` type of outofband (which is a YottaDB-only value, unknown to the GT.M code base)
  was no longer handled correctly.

* The reason why `xfer_reset_if_setter()` returned FALSE in line 42 above is that the `event_state`
  for the `deferred_signal` event_type at line 249 below was `pending`, not `active`. So the call at
  line 250, which would have done the real reset that was needed, got skipped.

  **sr_port/deferred_events.c**
  ```c
    212 boolean_t xfer_reset_if_setter(int4 event_type)
      .
    249     if (res = (active == TAREF1(save_xfer_root, event_type).event_state))   /* WARNING: assignment */
    250             res = (real_xfer_reset(event_type));
  ```

Fix
---
* The fix was to set the event_state for `deferred_signal` outofband to `active` in `deferred_signal_set()`
  just like it is done for `jobinterrupt` outofband in `jobinterrupt_set()`.

* After this change though, an assert in line 370 below (in the `async_action()` function) failed.

  **sr_port/deferred_events.c**
  ```c
    350 void async_action(bool lnfetch_or_start)
      .
    358         if (jobinterrupt == outofband)
    359         {
      .
    367                 TAREF1(save_xfer_root, jobinterrupt).event_state = pending;     /* jobinterrupt gets a pass from the assert below */
    368         } else if (!lnfetch_or_start)
    369         {       /* something other than a new line caugth this, so  */
    370                 assert(pending >= TAREF1(save_xfer_root, outofband).event_state);
    371                 TAREF1(save_xfer_root, outofband).event_state = pending;        /* make it pending in case it was not there yet */
    372         }
  ```

  I noticed that `jobinterrupt` gets special handling in line 367, so I decided to add special handling
  for `deferred_signal` as well. The special handling is different here in that we do not modify
  the `event_state` for the `deferred_signal` case (like is done for `jobinterrupt` in line 367 above).
  We just skip lines 370-371.

* With the changes in the above 2 bullets, the simple test case shown above started working fine in that
  it only generated 1 core file (not 3 core files).
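
The state check behind the failing assert can be sketched as below. This is a simplified model, not the code in `sr_port/deferred_events.c`: the enum ordering is an assumption inferred only from the `pending >= ... event_state` assert quoted above, the event indices and function names are hypothetical, and resetting the state to `not_in_play` is a stand-in for whatever `real_xfer_reset()` actually does.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed ordering, inferred only from the "pending >= event_state" assert quoted above. */
enum event_state { not_in_play, pending, active };

/* Hypothetical event indices, for illustration only. */
enum { jobinterrupt_ev, deferred_signal_ev, num_events };

/* Mirrors the check at line 249: the reset happens only for an event whose state is active. */
bool reset_if_setter(enum event_state state[], int event_type)
{
	assert((0 <= event_type) && (event_type < num_events));
	if (active != state[event_type])
		return false;			/* the path that tripped the assert in outofband_clear() */
	state[event_type] = not_in_play;	/* stand-in for real_xfer_reset() */
	return true;
}

/* Stand-in for deferred_signal_set() after the fix: mark the event active when it is set. */
void deferred_signal_set_fixed(enum event_state state[])
{
	state[deferred_signal_ev] = active;
}
```

In this model, a `deferred_signal` event left at `pending` makes the reset return false, whereas one marked `active` by the setter is reset successfully.
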
nars1 added a commit that referenced this pull request Jul 10, 2024
Background
----------
* In internal testing, we noticed a rare failure in the `v51000/mu_bkup_stop` subtest
  where a `mupip backup` process that was sent a `SIGTERM` (by the test) ended up
  creating a core file due to ASAN assert failing on a double free.

* Below are relevant details from the core file.

  ```c
  Core was generated by `mupip backup -online -dbg * ./49181_online1'.
  Program terminated with signal SIGSEGV, Segmentation fault.

  (gdb) where
  #0  ydb_os_signal_handler (sig=11, info=0x7fd09968c3f0, context=0x7fd09968c2c0) at sr_unix/ydb_os_signal_handler.c:57
  #1  <signal handler called>
  #2  ydb_os_signal_handler (sig=6, info=0x7fd09968caf0, context=0x7fd09968c9c0) at sr_unix/ydb_os_signal_handler.c:57
  #3  <signal handler called>
  #4  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
  #5  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
  #7  __GI_abort () at ./stdlib/abort.c:79
  #8  __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:143
  #9  __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
  #10 __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffda6de6ebe, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:190
  #11 __asan::ReportDoubleFree (addr=140533757257728, free_stack=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:224
  #12 __asan::Allocator::ReportInvalidFree (this=<optimized out>, stack=0x7ffda6de79f0, chunk_state=<optimized out>, ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:757
  #13 __interceptor_free (ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53
  #14 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1485
  #15 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #16 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #18 mupip_backup_call_on_signal () at sr_port/mupip_backup.c:208
  #19 signal_exit_handler (exit_handler_name=0x7fd097f1dda0 "deferred_exit_handler", sig=15, info=0x7fd098480fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fd098481058 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:67
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #26 mupip_backup () at sr_port/mupip_backup.c:1585
  #27 mupip_main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip_main.c:130
  #28 dlopen_libyottadb (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50, main_func=0x55af49fd9020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip.c:21

  (gdb) f 25
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) f 17
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) down
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  1501            gtm_free_main(addr, TAIL_CALL_LEVEL);

  (gdb) down
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  854                     system_free(addr);

  (gdb) down
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);

  (gdb) list
  1481    {
  1482            intrpt_state_t  prev_intrpt_state;
  1483
  1484            DEFER_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1485            free(addr);
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1487            return;
  1488    }
  ```

Issue
-----
* We did a `free(ptr->backup_hdr)` at line 103. And that in turn ended up using the system `free()`
  function because the test framework had randomly set the `gtmdbglvl` env var to a value of
  `0x80000000`.

* So at line 1485 above, the system free finished but at line 1486 we noticed the SIGTERM that was
  deferred and so decided to handle it. But the `ptr->backup_hdr` variable was still set to a
  non-NULL value so as part of the deferred exit handler, we tried to free this again resulting
  in the double free.

Fix
---
* The fix is to note `ptr->backup_hdr` in a local variable, clear the former, and then attempt
  the `free()` on the local variable. This way if we decide to do deferred exit handling after the
  `free()` occurred, we will notice a NULL value of `ptr->backup_hdr` and so avoid the double free.
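
The clear-then-free ordering can be sketched as follows. This is a minimal illustration and not the code in `sr_port/mubclnup.c`: `struct backup_entry` and `release_backup_hdr()` are hypothetical names, and the explicit NULL check and return value are added here only so the behavior can be observed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, trimmed-down stand-in for the backup list entry that owns backup_hdr. */
struct backup_entry
{
	void	*backup_hdr;
};

/* Clear the shared pointer before freeing, so a re-entrant cleanup sees NULL and skips it.
 * Returns 1 if a buffer was released by this call and 0 if there was nothing to release.
 */
int release_backup_hdr(struct backup_entry *ptr)
{
	void	*hdr;

	assert(NULL != ptr);
	hdr = ptr->backup_hdr;
	if (NULL == hdr)
		return 0;
	ptr->backup_hdr = NULL;	/* must happen before free() so an interrupting cleanup sees NULL */
	free(hdr);
	return 1;
}
```

If a deferred exit handler runs the same cleanup right after the `free()` returns, the second call finds `ptr->backup_hdr` already NULL and releases nothing.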

Notes
-----
* This is considered too rare a race condition to be encountered in practice, so it is not expected
  to be noticed by a user. Therefore no YDB issue is created for this.