Repeated calls to $order(^xxx("")) run faster #12

Closed

nars1 opened this issue Jul 14, 2017 · 1 comment

Collaborator

nars1 commented Jul 14, 2017

Final Release Note

Repeated calls to $order(^xxx("")) where xxx is a global name run faster. (YDB#12)

Description

The test case below shows that $order(^x("")) done in a loop takes 0.2487 microseconds on average, whereas $order(^x(""),-1) takes 0.1652 microseconds on average. That is, forward $order is about 50% slower than reverse $order. In this case, the global ^x has 2 levels of index blocks; for taller globals, the slowdown is even greater. The cause is the clue, which is set on any traversal of the GVT: it is not set in the forward $order case but is set in the reverse $order case. The clue lets later calls to the reverse $order run a lot faster by avoiding linear scans of the index blocks and data blocks.

mumps -run orderperf
450,094 nodes created in 357,239 microseconds
Avg. time to access first node .2487 microseconds
Avg. time to access last node .1652 microseconds

cat orderperf.m
orderperf ; $order() performance statistics
        do create
        do order
        do previous
        quit
create ;
        kill ^x
        set time1=$zut
        for i=1:1:1E5 do
        . set j=$random(1E9),k=$random(10)
        . for l=1:1:k if $increment(m) set ^x(j,l)=$random(1E6)
        set time2=$zut
        write $fnumber(m,",")," nodes created in ",$fnumber(time2-time1,",")," microseconds",!
        quit
order ;
        set time1=$zut
        set o=0
        for i=1:1:1E4 do
        . set x=$order(^x(""))
        set time2=$zut if $increment(o,time2-time1)
        write "Avg. time to access first node ",$fnumber(o/i,",")," microseconds",!
        quit
previous;
        set time1=$zut
        set x=$order(^x(""))
        set time1=$zut
        set o=0
        for i=1:1:1E4 do
        . set x=$order(^x(""),-1)
        set time2=$zut if $increment(o,time2-time1)
        write "Avg. time to access last node ",$fnumber(o/i,",")," microseconds",!
        quit
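
To illustrate why a clue matters, here is a minimal sketch in C. It is only an analogy, not YottaDB's GVT code: the `block_search` type, the `next_after()` function and the single sorted array standing in for a block are all assumptions made for this example. The search remembers where it ended last time and, if that position still brackets the target, skips the linear scan.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for one block of sorted keys plus a search clue. */
typedef struct {
    const int *keys;        /* sorted keys in the block */
    size_t     count;       /* number of keys */
    size_t     clue;        /* index where the previous search ended */
    int        clue_valid;  /* nonzero once a search has set the clue */
} block_search;

/* Returns the index of the first key greater than target (or count if none).
 * *steps receives the number of keys examined by the linear scan;
 * it stays 0 when the clue answers the query directly. */
size_t next_after(block_search *b, int target, size_t *steps)
{
    size_t i;

    *steps = 0;
    if (b->clue_valid) {
        size_t c = b->clue;
        int lower_ok = (c == 0) || (b->keys[c - 1] <= target);
        int upper_ok = (c == b->count) || (b->keys[c] > target);
        if (lower_ok && upper_ok)
            return c;               /* clue still brackets the target: no scan */
    }
    for (i = 0; i < b->count; i++) {
        (*steps)++;
        if (b->keys[i] > target)
            break;
    }
    b->clue = i;                    /* remember the position for the next call */
    b->clue_valid = 1;
    return i;
}
```

Repeating the same query is the case that benefits: the first call pays for the scan and every later call is answered from the clue, which mirrors the repeated $order(^x("")) calls in the test above.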

Draft Release Note

Repeated calls to $order(^xxx("")) where xxx is a global name run faster.

Member

ksbhaskar commented Jul 17, 2017

Suggested alternate release note that omits implementation details:

Repeated calls to $order(^xxx("")) where xxx is a global name run faster.

YottaDB added enhancement help wanted labels

nars1 added this to the r110 milestone

nars1 self-assigned this

nars1 removed the help wanted label

nars1 closed this as completed

nars1 added a commit to nars1/YottaDB that referenced this issue


[NARS1] [estess] Handle the possibility of an UNWIND in a condition handler in threaded code

5d49253

```
The v63000/gtm8394 subtest failed an assert with the following stack trace.

 #0  0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6
 #1  0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69
 #2  0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211
 #3  0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59
 #4  0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153
 #5  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85
 #6  0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33
 #7  0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153
 #8  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85
 #9  0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074
 #10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724
 #11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392
 #12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436
 #13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499
 #14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483
 #15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064
 #16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535
 #17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228
 #18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0
 #19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6

This is a test where a memory error is forced (using limit vmemorysize) and various rollbacks are run. One of them runs with multiple threads, and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit, which in turn signals the other threads to exit, and that is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided that an out-of-memory situation means we abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). The UNWIND macro had an assert that we are not inside multi-threaded code, but that is exactly where we were in this failure.

The reason the UNWIND macro has that assert is that in pro it would return control to the erroring thread and let it continue processing, but without releasing the pthread-mutex-lock that we obtained in rts_error_va() for this thread. No other thread would then be able to get this lock for the various actions it does until the erroring thread tries to obtain the lock again (at which point we would notice that we already hold the lock and not try to get it again) and later releases it.

The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg).

The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error could, in the worst case, turn a multi-threaded recovery into a non-threaded recovery, thereby slowing it down. No other user-visible implications are expected.

```
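
The lock leak described in this commit message can be sketched in a few lines of C. This is a simplified model under stated assumptions, not the actual UNWIND macro: a boolean stands in for the pthread mutex taken in rts_error_va(), `setjmp`/`longjmp` stand in for the condition handler unwinding, and every function name here is hypothetical.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not YottaDB's data structures. */
static bool thread_lock_held = false;  /* models the lock taken when an error is raised */
static jmp_buf establish_point;        /* where the condition handler unwinds to */

static void grab_thread_lock(void)
{
    if (!thread_lock_held)             /* an already-held lock is not taken again */
        thread_lock_held = true;
}

static void release_thread_lock(void)
{
    assert(thread_lock_held);          /* dbg-style check that we do hold the lock */
    thread_lock_held = false;
}

/* Models UNWIND. With release_lock set, it behaves like the fix described above. */
static void unwind_to_establish_point(bool release_lock)
{
    if (release_lock)
        release_thread_lock();
    longjmp(establish_point, 1);
}

/* Models a memory error during an optional hashtable expansion. */
static void raise_memory_error(bool fixed)
{
    grab_thread_lock();                /* taken on entry to error processing */
    unwind_to_establish_point(fixed);  /* handler chooses to continue with the old table */
}

/* Returns true if the lock is still held once the erroring thread resumes. */
bool lock_leaked_after_error(bool fixed)
{
    thread_lock_held = false;
    if (setjmp(establish_point) == 0)
        raise_memory_error(fixed);
    return thread_lock_held;
}
```

With `fixed` false the thread resumes while still holding the lock, which is the state that would block the other threads; with `fixed` true the lock is released before control returns.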

nars1 added a commit that referenced this issue


[NARS1] [estess] Handle the possibility of an UNWIND in a condition handler in threaded code

377c436

ksbhaskar changed the title ~~Speed up repeated calls to $order(^xxx("")) where xxx is a global name~~ Repeated calls to $order(^xxx("")) where xxx is a global name run faster

ksbhaskar changed the title ~~Repeated calls to $order(^xxx("")) where xxx is a global name run faster~~ Repeated calls to $order(^xxx("")) run faster

nars1 added a commit to nars1/YottaDB that referenced this issue


[NARS1] [estess] Do not play with triple chains in case of compile-time errors as they could cause SIG-11 #90

08316a4

A few issues related to compile-time errors were discovered.

1) The below M program correctly issues a PATNOTFOUND error when compiling. But if one tries to run the compiled object code (which should be okay since the execution does not reach the portions of the M code where the compiler error was found), a SIG-11 is observed. This happens only with GT.M V63002 (and in turn YottaDB r1.10) but not with V63001A (and in turn YottaDB r1.00).

```
> cat test.m
main    ;
        do good
        quit
bad     ;
        if 1?1B
        quit
good    ;
        write "hello",!
        quit
```

Related to the above, the below M program test1.m produces a GTMASSERT when run with a debug build. Unlike the previous test case (test.m), the production build did not have problems with test1.m.

```
> cat test1.m
        if 1?1B

> $gtm_dist/mumps -run test1
%GTM-F-GTMASSERT, GT.M V6.3-002 Linux x86_64 - Assert failed /Distrib/GT.M/V63002/sr_port/chktchain.c line 28
```

The primary issue in both of the above tests was in bx_boollit(), which noticed a pattern match operator usage with both operands being literals and hence invoked do_patfixed(), which encountered a PATNOTFOUND error. That caused ins_errtriple() to be invoked, which in turn removed all triples corresponding to the current M line (dqdelchain() call) and returned to bx_boollit(). Unaware of this, bx_boollit() went ahead with manipulating the triple chains (dqrins() call etc.) and returned to its caller bool_expr(), which also did triple chain manipulation (dqdel() call etc.), all the while operating on triples that were no longer part of the execution chain (due to the prior dqdelchain() call). This corrupted the doubly-linked triple list in "t_orig", which resulted in incorrect object code being generated and later in the SIG-11 when one tried running this M program.

In GT.M V63002, boolean expression evaluation and literal optimization got a significant rework. As part of that change, the macros RETURN_IF_RTS_ERROR and RETURN_EXPR_IF_RTS_ERROR were introduced to check for compile-time errors and if so return from functions right away instead of manipulating triple chains. These safety checks needed to be added in a few more places. That fixed the primary issue.
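
The early-return idea behind RETURN_IF_RTS_ERROR can be sketched as follows. This is a miniature under stated assumptions, not the compiler's code: the `triple` struct, the `compile_error_seen` flag and all function names are hypothetical stand-ins for the real triple chain and error state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical miniature of a compiler triple on a circular doubly-linked chain. */
typedef struct triple {
    struct triple *fl, *bl;            /* forward and backward links */
} triple;

static bool compile_error_seen = false; /* stands in for the compile-time error state */

void chain_init(triple *head) { head->fl = head->bl = head; }

void chain_append(triple *head, triple *t)
{
    t->bl = head->bl;
    t->fl = head;
    head->bl->fl = t;
    head->bl = t;
}

/* Models the error path: drop every triple of the current line and flag the error. */
void discard_line_on_error(triple *head)
{
    chain_init(head);
    compile_error_seen = true;
}

/* Models a literal-folding step that can hit a compile-time error. */
static void fold_pattern_literal(triple *head, bool pattern_is_valid)
{
    if (!pattern_is_valid)
        discard_line_on_error(head);
}

/* Returns true if it went on to edit the chain, false if it returned early. */
bool optimize_boolean(triple *head, triple *t, bool pattern_is_valid)
{
    fold_pattern_literal(head, pattern_is_valid);
    if (compile_error_seen)            /* same idea as RETURN_IF_RTS_ERROR */
        return false;                  /* t is off the chain now: leave the links alone */
    t->bl->fl = t->fl;                 /* chain is intact, so unlinking t is safe */
    t->fl->bl = t->bl;
    return true;
}

size_t chain_length(const triple *head)
{
    size_t n = 0;
    for (const triple *t = head->fl; t != head; t = t->fl)
        n++;
    return n;
}
```

The point of the check is ordering: the caller tests the error state immediately after any call that may have discarded the chain, before it touches a single link.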

2) In addition, it was noticed that the following M program fails an assert when run with the debug build.

```
> cat test2.m
        xecute "if ""a""?1B"

> mumps -run test2
%GTM-F-ASSERT, Assert failed in /Distrib/GT.M/V63002/sr_port/zlcompile.c line 81 for expression ((FALSE == run_time) && (TRUE == TREF(compile_time)))
```

Below is the corresponding C-stack.

```
 #0  0x00007ff2e6988767 in kill () at ../sysdeps/unix/syscall-template.S:84
 #1  0x00007ff2e6014a5c in gtm_dump_core () at /Distrib/GT.M/V63002/sr_unix/gtm_dump_core.c:69
 #2  0x00007ff2e5f1de97 in gtm_fork_n_core () at /Distrib/GT.M/V63002/sr_unix/gtm_fork_n_core.c:211
 #3  0x00007ff2e6007f2b in ch_cond_core () at /Distrib/GT.M/V63002/sr_unix/ch_cond_core.c:59
 #4  0x00007ff2e5f443a2 in rts_error_va (csa=0x0, argcnt=7, var=0x7ffffc4b0a90) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:153
 #5  0x00007ff2e5f439b8 in rts_error_csa (csa=0x0, argcnt=7) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:85
 #6  0x00007ff2e636d64c in zlcompile (len=48 '0', addr=0x7ffffc4b0e30 "/extra1/testarea1/nars/test/temp/tmp/tmp/test2.m") at /Distrib/GT.M/V63002/sr_port/zlcompile.c:81
 #7  0x00007ff2e60e6f1c in op_zlink (v=0x7ffffc4b14a0, quals=0x7ffffc4b0cf0) at /Distrib/GT.M/V63002/sr_unix/op_zlink.c:443
 #8  0x00007ff2e5f6a2d7 in job_addr (rtn=0x7ffffc4b1590, label=0x7ffffc4b15a0, offset=0, hdr=0x7ffffc4b1518, labaddr=0x7ffffc4b1510, need_rtnobj_shm_free=0x7ffffc4b14e4) at /Distrib/GT.M/V63002/sr_port/job_addr.c:41
 #9  0x00007ff2e5f40b48 in jobchild_init () at /Distrib/GT.M/V63002/sr_unix/jobchild_init.c:146
 #10 0x00007ff2e5f3835d in gtm_startup (svec=0x7ffffc4b1d30) at /Distrib/GT.M/V63002/sr_unix/gtm_startup.c:252
 #11 0x00007ff2e5f3b2f6 in init_gtm () at /Distrib/GT.M/V63002/sr_unix/init_gtm.c:201
 #12 0x00007ff2e5f072ea in gtm_main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm_main.c:162
 #13 0x0000000000400cbe in main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm.c:131
```

In this case, run_time was TRUE, which caused the assert failure. This was due to the m_xecute() function (invoked by zlcompile()) temporarily setting run_time to FALSE. When a PATNOTFOUND error was encountered, the condition handler compiler_ch() was invoked and did an UNWIND back to zlcompile(), incorrectly persisting the global variable changes made by the interim m_xecute() call.

The fix for this was to reset the run_time and TREF(xecute_literal_parse) global variables, just as is done in mdb_condition_handler().
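
The general pattern (a callee flips a global temporarily, an unwind skips the restore, so the handler must reset it) can be sketched in C. This is an illustration under assumptions: `literal_parse_active` is a hypothetical flag loosely modeled on TREF(xecute_literal_parse), `setjmp`/`longjmp` stand in for UNWIND, and none of the function names come from the source tree.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Hypothetical compiler-wide flag that a callee sets only while it works. */
static bool literal_parse_active = false;
static jmp_buf compile_frame;          /* where the condition handler unwinds to */

/* Models the condition handler: optionally reset the flag, then unwind. */
static void handler_unwind(bool reset_globals)
{
    if (reset_globals)
        literal_parse_active = false;  /* the fix: undo what the interrupted callee changed */
    longjmp(compile_frame, 1);
}

/* Models a callee that changes the flag for the duration of its own work. */
static void parse_xecute_literal(bool pattern_ok, bool reset_globals)
{
    literal_parse_active = true;
    if (!pattern_ok)
        handler_unwind(reset_globals); /* error path: the restore below is skipped */
    literal_parse_active = false;      /* normal path restores the flag */
}

/* Returns the flag value the caller sees once compilation is over. */
bool flag_after_compile(bool pattern_ok, bool reset_globals)
{
    literal_parse_active = false;
    if (setjmp(compile_frame) == 0)
        parse_xecute_literal(pattern_ok, reset_globals);
    return literal_parse_active;
}
```

Without the reset in the handler, the caller resumes with the callee's temporary value still in place, which is the kind of stale state the assert in zlcompile() caught.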

nars1 added a commit that referenced this issue


[NARS1] [estess] Do not play with triple chains in case of compile-time errors as they could cause SIG-11 #90

abacef8

ChristopherEdwards pushed a commit to ChristopherEdwards/YottaDB that referenced this issue


          Merge pull request YottaDB#12 from tuskentower/darwin

59d1d84

CMake build improvements and GT.M can use MOSX's ICU library

chathaway-codes pushed a commit that referenced this issue


Handle edge case in jnlpool mutex_salvage with kill -9 of online rollback process

45344b0

This is an issue identified based on a rare ideminter_rolrec/interrupted_rollback_or_recover subtest
failure. The test ran 4 rollbacks and killed all but the last one using kill -9 before they finished.
The last rollback failed an assert.

%YDB-F-ASSERT, Assert failed in sr_unix/mutex.c line 1045 for expression (((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno)

Below is the C-stack

(gdb) where
 #0  kill () at ../sysdeps/unix/syscall-template.S:84
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:69
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fff2eebdeb0) at sr_unix/rts_error.c:159
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:92
 #6  mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045
 #7  gtm_mutex_lock (reg=0x106f960, mutex_spin_parms=0x7f7690440080, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE) at sr_unix/mutex.c:703
 #8  grab_lock (reg=0x106f960, is_blocking_wait=1, onln_rlbk_action=1) at sr_unix/grab_lock.c:83
 #9  mur_open_files () at sr_port/mur_open_files.c:492
 #10 mupip_recover () at sr_port/mupip_recover.c:195
 #11 mupip_main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip_main.c:124
 #12 dlopen_libyottadb (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688, main_func=0x401424 "mupip_main") at sr_unix/dlopen_libyottadb.c:148
 #13 main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip.c:19

(gdb) f 6
 #6  0x00007f76931b75bd in mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045
1045            assert(((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno);

(gdb) p lastJplCmt->jnl_seqno
$1 = 323033

(gdb) p jpl->jnl_seqno
$2 = 293120

We were expecting the two seqnos to be 1 apart but they are far apart.

This is because the test had killed a prior rollback (the first rollback) just before it finished.
Below is its log.

> cat ROLLBACK2_1.logx
.
.
%YDB-I-RLBKJNSEQ, Journal seqno of the instance after rollback is 293120 [0x0000000000047900]
.
%YDB-I-FILERENAME, File ideminter_rolrec_0_7/interrupted_rollback_or_recover/g.mjl_2018284231355 is renamed to ideminter_rolrec_0_7/interrupted_rollback_or_recover/rolled_bak_g.mjl_2018284231355
Killed

The fact that it printed the RLBKJNSEQ and FILERENAME messages implies it was in mur_close_files()
when it was killed.

There is code in mur_close_files() where we reset various fields in "jpl" to reset the state of
the journal pool based on the post-rollback instance seqno. This code also needs to clear a few
2-phase-jnl-commit related fields so "mutex_salvage" when it comes in later (in this test, it came in
as part of the later rollback) after the kill -9 does not fail the above assert.
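
A small C sketch of the invariant and the reset, using the seqnos from the gdb session above. The struct layout and function names are hypothetical; only the invariant itself is taken from the quoted assert. The starting state (pool seqno 323034) is an assumed consistent pre-rollback value, not something shown in the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the journal pool fields involved. */
typedef struct { uint64_t jnl_seqno; } commit_slot;
typedef struct {
    uint64_t    jnl_seqno;    /* next seqno of the instance */
    commit_slot last_commit;  /* most recent 2-phase-jnl-commit entry */
} jnlpool_state;

/* The condition asserted at mutex.c line 1045, as quoted above. */
bool salvage_invariant_holds(const jnlpool_state *jpl)
{
    return ((jpl->last_commit.jnl_seqno + 1) == jpl->jnl_seqno)
        || !jpl->last_commit.jnl_seqno;
}

/* Models the end-of-rollback reset. With clear_commit_fields set, it also
 * clears the commit entry, which is the behavior the fix adds. */
void reset_after_rollback(jnlpool_state *jpl, uint64_t post_rollback_seqno,
                          bool clear_commit_fields)
{
    jpl->jnl_seqno = post_rollback_seqno;
    if (clear_commit_fields)
        jpl->last_commit.jnl_seqno = 0;
}
```

Resetting only the pool seqno to 293120 leaves the stale commit entry at 323033 and breaks the invariant; clearing the entry as well makes the second branch of the condition hold.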

This failure can be easily reproduced by running an online rollback with the -resync qualifier to take back
the state of the instance to a prior seqno, setting a break point in "rel_lock()" and quitting from the
debugger once that break point is hit (this simulates kill -9 of online rollback). Reissuing the same online
rollback out of the debugger should show the assert failure. Reissuing the online rollback with a production
build did not show any issues so the suspicion is that this is a dbg-only issue hence no tracking is done
as a separate issue at gitlab.

chathaway-codes pushed a commit that referenced this issue


          Fix incorrect assert exposed by v63005/gtm8956 subtest

aeb1036

When ydb_chset env var is set to "M", compiling the following line

	set c=$PIECE("Hello "_$ZCH(190)_" world!",$ZCH(191),1,2)

Failed an assert

%YDB-F-ASSERT, Assert failed in sr_unix/gtm_utf8.c line 273 for expression (gtm_utf8_mode)

with the following C-stack

 #6  utf8_badchar_real () at sr_unix/gtm_utf8.c:273
 #7  utf8_badchar_dec () at sr_unix/gtm_utf8.c:249
 #8  valid_utf_string () at sr_unix/gtm_utf8.c:410
 #9  op_fnzpiece () at sr_port/op_fnzpiece.c:53
 #10 f_piece () at sr_unix/f_piece.c:171
 #11 expritem () at sr_port/expritem.c:619
 #12 expratom () at sr_port/expratom.c:29
 #13 eval_expr () at sr_port/eval_expr.c:63
 #14 expr () at sr_port/expr.c:29
 #15 m_write () at sr_port/m_write.c:71
 #16 cmd () at sr_port/cmd.c:302
 #17 linetail () at sr_port/linetail.c:35
 #18 line () at sr_port/line.c:230
 #19 compiler_startup () at sr_port/compiler_startup.c:144
 #20 compile_source_file () at sr_unix/source_file.c:132
 #21 gtm_compile () at sr_unix/gtm_compile.c:120

The assert that failed is correct. The issue is that we called the utf8_badchar_real() function
in non-UTF8 mode (i.e. when "gtm_utf8_mode" is 0). The issue is in op_fnzpiece() where we
invoke the valid_utf_string() function only if we are in UTF-8 mode (indicated by "gtm_utf8_mode == 1").
The assert (likely introduced as part of GTM-7762 in GT.M V6.3-000) is now fixed to take care of this.
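
As an illustration of the guard described here, the sketch below validates a $PIECE-style delimiter only when UTF-8 mode is on. It assumes the corrected check has the shape "not in UTF-8 mode, or the string is valid"; that shape, the function names and the simplified validator (structural check only, no overlong or surrogate rejection) are assumptions for this example, not the actual op_fnzpiece() code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal structural UTF-8 check: lead byte followed by the right number
 * of continuation bytes. Overlong forms and surrogates are not rejected. */
bool is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        size_t extra;
        unsigned char c = s[i++];

        if (c < 0x80)
            extra = 0;
        else if ((c & 0xE0) == 0xC0)
            extra = 1;
        else if ((c & 0xF0) == 0xE0)
            extra = 2;
        else if ((c & 0xF8) == 0xF0)
            extra = 3;
        else
            return false;              /* stray continuation or invalid lead byte */
        if (extra > len - i)
            return false;              /* sequence runs past the end of the string */
        while (extra--) {
            if ((s[i++] & 0xC0) != 0x80)
                return false;
        }
    }
    return true;
}

/* In M mode any byte string is acceptable; in UTF-8 mode it must validate. */
bool delimiter_ok(bool utf8_mode, const unsigned char *delim, size_t len)
{
    return !utf8_mode || is_valid_utf8(delim, len);
}
```

The single byte produced by $ZCH(191) is a lone continuation byte, so it passes in M mode and is rejected only when UTF-8 mode is on.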

chathaway-codes pushed a commit that referenced this issue


Fix incorrect assert (rare debug-only v53002/C9E04002596 subtest failure)

bb2141a

In one run of the v53002/C9E04002596 subtest, the following assert failed.

%YDB-F-ASSERT, Assert failed in sr_port/mutex_deadlock_check.c line 102 for expression
	(!jnlpool || !jnlpool->jnlpool_dummy_reg || jnlpool->jnlpool_dummy_reg->open || (repl_csa->critical != criticalPtr) || (NULL == cs_addrs))

And below was the C-stack at the time of the assert failure.

(gdb) where
 #0  kill () at ../sysdeps/unix/syscall-template.S:84
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:69
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va () at sr_unix/rts_error.c:159
 #5  rts_error_csa () at sr_unix/rts_error.c:92
 #6  mutex_deadlock_check () at sr_port/mutex_deadlock_check.c:101
 #7  mutex_long_sleep () at sr_unix/mutex.c:511
 #8  gtm_mutex_lock () at sr_unix/mutex.c:856
 #9  grab_lock () at sr_unix/grab_lock.c:83
 #10 repl_inst_ftok_counter_halted () at sr_unix/repl_inst_ftok_counter_halted.c:45
 #11 jnlpool_init () at sr_unix/jnlpool_init.c:764
 #12 gvcst_init () at sr_port/gvcst_init.c:917
 #13 gv_init_reg () at sr_port/gv_init_reg.c:56
 #14 gv_bind_name () at sr_port/gv_bind_name.c:75
 #15 op_gvname_common () at sr_port/op_gvname.c:117
 #16 op_gvname_fast () at sr_port/op_gvname.c:81

The assert at line 102 expects that whenever we are in mutex_deadlock_check() for the jnlpool, we have
not opened any database file (i.e. cs_addrs is NULL). This call sequence is a clear counterexample.
While opening the database, we need to open the journal pool (because the ydb_custom_errors env var is set).
While opening the journal pool, we notice a 32K semaphore counter overflow (artificially created by the test
system in this debug run using the ydb_db_counter_sem_incr env var), which requires getting a lock on the jnlpool
before making changes to the instance file. While trying to get the lock, we notice it is being held by
some other process which is taking a long time, so we eventually go to mutex_deadlock_check() for the jnlpool
and fail the assert while in the middle of a database open (so cs_addrs is non-NULL).

The assert is clearly wrong for this case.

The assert was introduced only in V6.3-001A, in the following form.

assert((NULL == jnlpool.jnlpool_dummy_reg) || jnlpool.jnlpool_dummy_reg->open || (repl_csa->critical != criticalPtr));

It was modified in V6.3-003 and V6.3-005 to add || conditions, most likely to account for exceptions to
the assert as they were encountered.

I don't see the value in this assert so removing it.

Also corrected a pre-existing comment a few lines after the removed assert to reflect our renewed understanding
of the current failure possibility.

chathaway-codes pushed a commit that referenced this issue


Skip $ZROUTINES initialization if process is already exiting; Avoids secondary errors if primary error is out-of-memory

f8a6125

If already exiting, do not open any object/source directories (which could include relinkctl files)
as part of $ZROUTINES initialization. This avoids potentially nasty codepaths, particularly if the
reason we are exiting is an out-of-memory error.

We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered
necessary to create a user-visible issue for this.
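
The guard itself is simple; a minimal C sketch follows. The `process_state` struct, its fields and `ensure_search_path()` are hypothetical names chosen for this example and do not correspond to the actual zro_init() code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical process-wide state relevant to the guard. */
typedef struct {
    bool exiting;             /* process is already on its way out */
    bool search_path_loaded;  /* $ZROUTINES-style search path is initialized */
    int  directories_opened;  /* how many directories initialization has opened */
} process_state;

/* Returns true if the search path is available after the call.
 * While exiting, it refuses to start a fresh initialization. */
bool ensure_search_path(process_state *ps, int directory_count)
{
    if (ps->search_path_loaded)
        return true;          /* already initialized: nothing to open */
    if (ps->exiting)
        return false;         /* do not open directories or control files now */
    ps->directories_opened += directory_count;
    ps->search_path_loaded = true;
    return true;
}
```

A caller such as a fatal-error dump would treat a false return as "value unavailable" and carry on, rather than risk a secondary error while memory is already exhausted.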

For example, below are two C-stacks that showed up in core dumps while running the
simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid
such cores.

Core1
------
Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no
space left to allocate the C-stack in this core.

(gdb) where
 #0  ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41
 #1  util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731
 #2  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871
 #3  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #4  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261
 #5  gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #6  rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159
 #7  rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #8  relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679
 #9  relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159
 #12 zro_init () at sr_port/zro_init.c:51
 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694
 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232
 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152
 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159
 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145
 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698
 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192
 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60
 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108
 #29 gvnset () at fatalerror.c:56
 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #31 main () at fatalerror.c:32

Core2
-----
In this case there is a SIG-11 deep inside syslog(), most likely due to an out-of-memory situation.

Program terminated with signal SIGSEGV, Segmentation fault.
 #0  vfprintf () from /usr/lib64/libc.so.6
 #1  fprintf () from /usr/lib64/libc.so.6
 #2  __vsyslog_chk () from /usr/lib64/libc.so.6
 #3  syslog () from /usr/lib64/libc.so.6
 #4  util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761
 #5  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871
 #6  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #7  send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149
 #8  send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79
 #9  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264
 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159
 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679
 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159
 #17 zro_init () at sr_port/zro_init.c:51
 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694
 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232
 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152
 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159
 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145
 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698
 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192
 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60
 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108
 #34 gvnset () at fatalerror.c:56
 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #36 main () at fatalerror.c:32
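
The guard this commit describes (do not initialize $ZROUTINES once the process is already exiting) might
look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the
YottaDB code: zro_init_stub(), zro_initialized, zro_init_calls and zroutines_for_zshow() are hypothetical
names, and process_exiting here is a plain local stand-in for the real global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for process-wide state; not the real YottaDB globals. */
static bool process_exiting = false;	/* set once exit handling has begun */
static bool zro_initialized = false;	/* has $ZROUTINES been set up already? */
static int  zro_init_calls = 0;		/* counts how often the expensive path ran */

/* Stand-in for zro_init(). In the traces above, the real call goes on to open
 * directories and map relinkctl files, which can itself fail when memory is short. */
static void zro_init_stub(void)
{
	zro_init_calls++;
	zro_initialized = true;
}

/* Return the $ZROUTINES text to display, or NULL if it is unavailable.
 * While exiting, never initialize lazily; report only what already exists. */
static const char *zroutines_for_zshow(void)
{
	if (!zro_initialized)
	{
		if (process_exiting)
			return NULL;	/* skip initialization on the exit path */
		zro_init_stub();
	}
	assert(zro_initialized);
	return ".";	/* placeholder value for this sketch */
}
```

With this shape, a ZSHOW-style dump taken from a fatal-error handler simply omits $ZROUTINES instead of
entering zro_init().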

chathaway-codes pushed a commit that referenced this issue


          [#205] Fix one case of MAXRTSERRDEPTH failure due to PTHREAD_MUTEX_LO…

cbbf955

…CK being called during exit handling

When a C program that spawned multiple threads using the SimpleThreadAPI (e.g. ydb_tp_st() etc.)
was deadlocked (due to a code issue), pressing Ctrl-C (SIGINT) did nothing. Pressing Ctrl-\ (SIGQUIT)
to terminate the C program then caused a MAXRTSERRDEPTH fatal error and resulted in a core dump.

Below is the actual output.

^C^\%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with coreQuit (core dumped)

The corresponding C-stack follows.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52090) at sr_unix/rts_error.c:144
 #3  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52270) at sr_unix/rts_error.c:146
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52450) at sr_unix/rts_error.c:146
 #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #8  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52630) at sr_unix/rts_error.c:146
 #9  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #10 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52810) at sr_unix/rts_error.c:146
 #11 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #12 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df529f0) at sr_unix/rts_error.c:146
 #13 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #14 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52bd0) at sr_unix/rts_error.c:146
 #15 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #16 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52db0) at sr_unix/rts_error.c:146
 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #18 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52f90) at sr_unix/rts_error.c:146
 #19 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #20 send_msg_va (csa=0x0, arg_count=8, var=0x7fb28df53570) at sr_unix/send_msg.c:125
 #21 send_msg_csa (csa=0x0, arg_count=8) at sr_unix/send_msg.c:84
 #22 generic_signal_handler (sig=3, info=0x7fb28df53830, context=0x7fb28df53700) at sr_unix/generic_signal_handler.c:244
 #23 <signal handler called>
 #24 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb2880180a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #25 __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb288018040, cond=0x7fb288018080) at pthread_cond_wait.c:502
 #26 __pthread_cond_wait (cond=0x7fb288018080, mutex=0x7fb288018040) at pthread_cond_wait.c:655
 #27 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:80
 #28 start_thread (arg=0x7fb28df54700) at pthread_create.c:463
 #29 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The primary error was at #20 in send_msg_va() inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro.
The actual assert that failed inside the macro was the following.

sr_unix/gtm_multi_thread.h
---------------------------
     99                 /* We should never use pthread_* calls inside a signal/timer handler. Assert that */                    \
    100                 assert(!in_nondeferrable_signal_handler);                                                               \

We were in a signal handler handling a non-deferrable signal (Ctrl-\ aka SIGQUIT) and were about to do
a pthread_mutex_lock() library call, which is a no-no.

If we are in an exit handler, it is possible for send_msg() to be needed (to log the signal that was received
etc.) but it is safer to not do any pthread activity since we cannot be sure if we are exiting while inside
a signal handler or not. Therefore the fix for this is to check if "process_exiting" global variable is TRUE
and if so, we skip all pthread* calls in the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED
macros.
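
A rough sketch of that idea follows. The SKETCH_* macros, sketch_mutex and log_message_sketch() are
hypothetical; the real PTHREAD_MUTEX_LOCK_IF_NEEDED / PTHREAD_MUTEX_UNLOCK_IF_NEEDED macros live in
sr_unix/gtm_multi_thread.h and may well differ in arguments and detail.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for this sketch only. */
static bool process_exiting = false;
static pthread_mutex_t sketch_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Lock only when not exiting. LOCKED records whether the lock was taken so
 * that the matching unlock is skipped as well. */
#define SKETCH_MUTEX_LOCK_IF_NEEDED(LOCKED)				\
do {									\
	(LOCKED) = false;						\
	if (!process_exiting)						\
	{								\
		int rc_ = pthread_mutex_lock(&sketch_mutex);		\
		assert(0 == rc_);					\
		(void)rc_;						\
		(LOCKED) = true;					\
	}								\
} while (0)

#define SKETCH_MUTEX_UNLOCK_IF_NEEDED(LOCKED)				\
do {									\
	if (LOCKED)							\
	{								\
		int rc_ = pthread_mutex_unlock(&sketch_mutex);		\
		assert(0 == rc_);					\
		(void)rc_;						\
		(LOCKED) = false;					\
	}								\
} while (0)

/* Stand-in for a send_msg()-like routine; returns true if it ran under the mutex. */
static bool log_message_sketch(void)
{
	bool locked, ran_locked;

	SKETCH_MUTEX_LOCK_IF_NEEDED(locked);
	ran_locked = locked;
	/* ... format and emit the message here ... */
	SKETCH_MUTEX_UNLOCK_IF_NEEDED(locked);
	return ran_locked;
}
```

Tracking the lock state in a local variable keeps lock and unlock paired even if process_exiting flips
between the two macro invocations.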

chathaway-codes pushed a commit that referenced this issue


          [#205] Fix assert to handle case where a SimpleThreadAPI process is e…

1db06c5

…xiting and exit handler is being invoked in a thread other than the MAIN worker thread

We got a test failure in the simplethreadapi/tp subtest where a SimpleThreadAPI process was exiting
and as part of the exit handler, we ended up checking for deferred timers and that failed the following
assert in timer_handler().

	assert(gtm_is_main_thread() || gtm_jvm_process);

In this case, we were exiting as the below C-stack shows.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffe7c683120) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  timer_handler (why=0) at sr_unix/gt_timers.c:724
 #7  check_for_deferred_timers () at sr_unix/gt_timers.c:1178
 #8  deferred_signal_handler () at sr_port/deferred_signal_handler.c:49
 #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:191
 #10 __run_exit_handlers (status=0, listp=0x7fb6d8d2f5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82
 #11 __GI_exit (status=<optimized out>) at exit.c:104
 #12 __libc_start_main (main=0x400f76 <main>, argc=1, argv=0x7ffe7c683648, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe7c683638) at ../csu/libc-start.c:325
 #13 _start ()

So the assert is enhanced to reflect this.

chathaway-codes pushed a commit that referenced this issue


          [#205] Ensure TP worker thread does not invoke YottaDB engine (even t…

4c4ca46

…o issue an error); Nix NOTSUPSTAPI message

The simplethreadapi/tp subtest failed once with the following signature in the
tp5_TPTIMEOUT.c section of the test.

%YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 725 for expression
	(gtm_is_main_thread() || gtm_jvm_process || exit_handler_active && (DUMMY_SIG_NUM == why))

with the following C-stack

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fe3a13888b0) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  timer_handler (why=0) at sr_unix/gt_timers.c:725
 #7  check_for_deferred_timers () at sr_unix/gt_timers.c:1179
 #8  deferred_signal_handler () at sr_port/deferred_signal_handler.c:49
 #9  rts_error_va (csa=0x0, argcnt=4, var=0x7fe3a1388c70) at sr_unix/rts_error.c:194
 #10 rts_error_csa (csa=0x0, argcnt=4) at sr_unix/rts_error.c:101
 #11 ydb_hiber_start (sleep_nsec=1000000) at sr_unix/ydb_hiber_start.c:46
 #12 gvnset (tptoken=1) at tp5_TPTIMEOUT.c:77
 #13 ydb_stm_tpthreadq_process (curTPWorkQHead=0x13dac40) at sr_unix/ydb_stm_tpthread.c:197
 #14 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:78
 #15 start_thread (arg=0x7fe3a1389700) at pthread_create.c:333
 #16 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

This was a process that had already made SimpleThreadAPI calls but is now making a SimpleAPI call
(ydb_hiber_start()) and so a NOTSUPSTAPI error is about to be issued. But since this call is happening
in the user-defined callback function inside a TP transaction, it is the TP worker thread (not the
MAIN worker thread) that is executing the "ydb_hiber_start". This means the rts_error_csa invocation
is running in the TP worker thread while the YottaDB engine is concurrently being modified by the
MAIN worker thread. A no-no since the YottaDB engine is not multi-threaded.

To fix this issue, the VERIFY_NON_THREADED_API macro is now fixed to do a "return YDB_ERR_INVAPIMODE".
This means ydb_hiber_start() would return a lot sooner thereby not requiring an "rts_error" invocation.

But while doing this change, noticed a few issues. The VERIFY_NON_THREADED_API is used from a few
functions that do not return any value (sr_unix/ydb_free.c and sr_unix/ydb_timer_cancel.c) so a new
macro VERIFY_NON_THREADED_API_NORETVAL is created which is very similar except it does a plain "return".
sr_unix/ydb_malloc.c needed special handling since it was returning a "void *" and so a new
VERIFY_NON_THREADED_API_RETNULL macro is created for that purpose.

While at this, noticed that the VERIFY_NON_THREADED_API macro was not resetting
TREF(libyottadb_active_rtn) in case of an INVAPIMODE return (since this macro is usually invoked
after a LIBYOTTADB_INIT) so fixed it to do so.
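
The three return flavors and the reset of the active routine can be sketched as below. This is only an
illustration: every SKETCH_* name, the sketch_* functions, and the numeric codes are hypothetical
(in particular, the real YDB_ERR_INVAPIMODE value is not reproduced here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Placeholder codes for this sketch only. */
#define SKETCH_OK		0
#define SKETCH_ERR_INVAPIMODE	(-1)
#define SKETCH_RTN_NONE		0
#define SKETCH_RTN_HIBER_START	1

static bool threaded_api_active = false;	/* true once a threaded-API call was made */
static int  active_rtn = SKETCH_RTN_NONE;	/* stand-in for TREF(libyottadb_active_rtn) */
static int  free_calls = 0;

/* Shared body: on an API-mode mismatch, clear the active routine and then run
 * the caller-supplied return statement. */
#define SKETCH_VERIFY_COMMON(RETURN_STMT)	\
do {						\
	if (threaded_api_active)		\
	{					\
		active_rtn = SKETCH_RTN_NONE;	\
		RETURN_STMT;			\
	}					\
} while (0)

#define SKETCH_VERIFY_NON_THREADED_API		SKETCH_VERIFY_COMMON(return SKETCH_ERR_INVAPIMODE)
#define SKETCH_VERIFY_NON_THREADED_API_NORETVAL	SKETCH_VERIFY_COMMON(return)
#define SKETCH_VERIFY_NON_THREADED_API_RETNULL	SKETCH_VERIFY_COMMON(return NULL)

/* int-returning API: reports the mode error through its return value. */
static int sketch_hiber_start(void)
{
	active_rtn = SKETCH_RTN_HIBER_START;	/* what an init step would record */
	SKETCH_VERIFY_NON_THREADED_API;
	/* ... the actual sleep would happen here ... */
	active_rtn = SKETCH_RTN_NONE;
	return SKETCH_OK;
}

/* void API: can only return silently. */
static void sketch_free(void *ptr)
{
	SKETCH_VERIFY_NON_THREADED_API_NORETVAL;
	free_calls++;
	free(ptr);
}

/* pointer-returning API: signals the mode error with NULL. */
static void *sketch_malloc(size_t size)
{
	SKETCH_VERIFY_NON_THREADED_API_RETNULL;
	assert(0 < size);
	return malloc(size);
}
```

Passing the return statement as a macro argument is just one way to share the body; three separately
written macros would behave the same.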

Note that the VERIFY_THREADED_API macro stayed the same in that it did not do this reset since it is
usually invoked before the LIBYOTTADB_INIT macro. But two exceptions to this rule were found,
sr_unix/ydb_cip_helper.c and sr_unix/ydb_tp_s_common.c. They are now fixed so the VERIFY_THREADED_API
macro invocation happens before the LIBYOTTADB_INIT macro.

Another issue that was noticed is that "ydb_ci" and "ydb_cip" were not doing a VERIFY_NON_THREADED_API
check like other SimpleAPI calls do so a new sr_unix/ydb_ci.c and sr_unix/ydb_cip.c were created to
do this before invoking ydb_ci_exec(). And the existing ydb_ci() and ydb_cip() function definitions
in sr_unix/gtmci.c were removed. A new VERIFY_NON_THREADED_API_DO_NOT_SHUTOFF_ACTIVE_RTN macro was
introduced for this purpose since we do not want to do a LIBYOTTADB_INIT in these functions (to avoid
unnecessary SIMPLEAPINEST errors).

With all these changes, the NOTSUPSTAPI error (currently issued in sr_unix/ydb_hiber_start_wait_any.c
and sr_unix/ydb_hiber_start.c) was no longer necessary since an INVAPIMODE error would have been issued
before this error codepath is reached in all callers. So this error message is now removed.

chathaway-codes pushed a commit that referenced this issue


          [#205] Skip deferred exit handling if inside timer handler and Simple…

8c78c78

…ThreadAPI is active

This issue was exposed by a failure in the dual_fail_extend/dual_fail2_mustop_sigquit subtest.
This test terminates processes by sending them a SIGQUIT/SIG-3 or SIGTERM/SIG-15 signal.
But since one of the threads (the MAIN worker thread) in this multi-threaded process was inside wcs_wtstart() in a
non-interruptible code zone (DEFER_INTERRUPTS had been done), the exit handler invoked in
another concurrently running thread decided to defer the exit until the ENABLE_INTERRUPTS
happened in the worker thread. When the ENABLE_INTERRUPTS did happen, the worker thread invoked
exit handling code while it was already inside a timer handler. And since this particular test
was running with GDSV4 format blocks, wcs_wtstart() could not flush such blocks (since it required
a call to gtm_malloc() which meant a pthread_mutex_lock() call while inside a timer handler which is
a no-no) and so wcs_flu() was not able to flush any blocks as part of exit handling causing it to
fail an assert. Below is the C-stack corresponding to the assert failure.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7f59dccc02a0) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  wcs_flu (options=519) at sr_unix/wcs_flu.c:587
 #7  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608
 #8  gv_rundown () at sr_port/gv_rundown.c:123
 #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:204
 #10 __run_exit_handlers (status=-3, listp=0x7f59e2319718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
 #11 __GI_exit (status=<optimized out>) at exit.c:139
 #12 gtm_image_exit (status=-3) at sr_unix/gtm_image_exit.c:27
 #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:111
 #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:45
 #15 wcs_wtstart (region=0x55b9581d66d8, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:829
 #16 wcs_stale (tid=94254535632600, hd_len=8, region=0x55b9581d62a8) at sr_port/t_end_sysops.c:1387
 #17 timer_handler (why=14) at sr_unix/gt_timers.c:821
 #18 <signal handler called>
 #19 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:277
 #20 gtm_memcpy_validate_and_execute (target=0x7f59dccc25c0, src=0x7f59e32fd6c6, len=0) at sr_port/gtm_memcpy_validate_and_execute.c:42
 #21 gvcst_put2 (val=0x7f59e30c7440 <increment_delta_mval>, parms=0x7f59dccc4be0) at sr_port/gvcst_put.c:626
 #22 gvcst_put (val=0x7f59e30c7440 <increment_delta_mval>) at sr_port/gvcst_put.c:299
 #23 gvcst_incr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/gvcst_incr.c:56
 #24 op_gvincr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/op_gvincr.c:58

The fix for this issue is to not invoke exit handling while inside the timer handler if we know
SimpleThreadAPI is active. In that case, finish the timer handler first and invoke exit handling
a little later in mainline code where it is safe to invoke exit handling.
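
A simplified model of this deferral is sketched below. All names are hypothetical stand-ins (the real
logic is spread across the timer handler and deferred_signal_handler()), and nested timer handlers are
deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical process state for this sketch. */
static bool simple_thread_api_active = false;
static bool in_timer_handler = false;
static bool exit_pending = false;	/* an exit was requested but deferred */
static int  exit_handler_runs = 0;

static void exit_handler_sketch(void)
{
	exit_pending = false;
	exit_handler_runs++;
}

/* Called whenever interrupts are re-enabled. */
static void deferred_signal_handler_sketch(void)
{
	if (!exit_pending)
		return;
	if (in_timer_handler && simple_thread_api_active)
		return;	/* leave exit_pending set; mainline code retries later */
	exit_handler_sketch();
}

static void timer_handler_sketch(void)
{
	assert(!in_timer_handler);	/* this sketch does not model nested timers */
	in_timer_handler = true;
	/* ... timer work that ends by re-enabling interrupts ... */
	deferred_signal_handler_sketch();
	in_timer_handler = false;
}
```

The pending flag stays set when the timer handler declines to exit, so the next call from mainline code
performs the exit.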

chathaway-codes pushed a commit that referenced this issue


          [#205] If MAIN worker thread gets SIG-15, wait for MAIN/TP worker thr…

305fe69

…eads to reach logical point before starting exit handler processing

We had a test failure (in the dual_fail_extend/dual_fail2_mustop_sigquit subtest) where a SimpleThreadAPI
process was sent a SIG-15 by the test and the signal got delivered to the MAIN worker thread but it
went ahead with exit handler processing (including rolling back an active TP transaction) while a
TP worker thread was concurrently running the TP callback function without realizing all of this was going on.
The TP worker thread effectively got an INVTPTRANS error since it was using a non-zero tptoken in a
ydb_set_st() call when there was no active TP transaction (due to the exit handler doing an op_trollback()).

The fix is to defer exit processing in generic_signal_handler.c if we find out that we are the
MAIN worker thread. This way the MAIN worker thread will invoke the exit handler gtm_exit_handler()
inside ydb_stm_thread() when it knows it is a logical/safe point to do so.

In addition, deferred_signal_handler() is now fixed to skip invoking the exit handler in case we
are the MAIN worker thread. This is because ydb_stm_thread() has an already established scheme
(using "forced_simplethreadapi_exit" global variable) to determine the logical point and then invoke
gtm_exit_handler().
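
The handshake can be pictured with the sketch below. It is an assumption-laden illustration: the names
are hypothetical, forced_exit_requested merely plays the role of "forced_simplethreadapi_exit", and what
other threads do on receiving the signal is simplified to "exit at once".

```c
#include <assert.h>
#include <pthread.h>
#include <signal.h>
#include <stdbool.h>

static volatile sig_atomic_t forced_exit_requested = 0;	/* set by handler, read by worker */
static bool main_worker_known = false;
static pthread_t main_worker_tid;
static int exit_handler_runs = 0;

static void exit_handler_sketch(void)
{
	exit_handler_runs++;
}

static bool is_main_worker_thread(void)
{
	return main_worker_known && pthread_equal(pthread_self(), main_worker_tid);
}

/* Signal handler body: the MAIN worker thread only records the request. */
static void signal_handler_sketch(int sig)
{
	(void)sig;
	if (is_main_worker_thread())
	{
		forced_exit_requested = 1;
		return;
	}
	exit_handler_sketch();	/* simplification for every other case */
}

/* One iteration of the worker loop; returns false when the loop should stop. */
static bool main_worker_iteration_sketch(void)
{
	assert(is_main_worker_thread());
	/* ... process one queued request to completion ... */
	if (forced_exit_requested)
	{	/* logical/safe point: no request is half done */
		exit_handler_sketch();
		return false;
	}
	return true;
}
```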

Below is the C-stack of all threads at the time of the core for the record.

(gdb) thread apply all bt

Thread 3 (Thread 0x7fde4cb67700 (LWP 14698)):
 #0  fsync () from /usr/lib64/libc.so.6
 #1  jnl_fsync (reg=0x55af6c90e7b8, fsync_addr=38517184) at sr_unix/jnl_fsync.c:134
 #2  wcs_flu (options=519) at sr_unix/wcs_flu.c:413
 #3  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608
 #4  gv_rundown () at sr_port/gv_rundown.c:123
 #5  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:216
 #6  __run_exit_handlers () from /usr/lib64/libc.so.6
 #7  exit () from /usr/lib64/libc.so.6
 #8  gtm_image_exit (status=-15) at sr_unix/gtm_image_exit.c:27
 #9  generic_signal_handler (sig=15, info=0x7fde4cb66830, context=0x7fde4cb66700) at sr_unix/generic_signal_handler.c:380
 #10 <signal handler called>
 #11 pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
 #12 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:123
 #13 start_thread () from /usr/lib64/libpthread.so.0
 #14 clone () from /usr/lib64/libc.so.6

Thread 2 (Thread 0x7fde510c6dc0 (LWP 14695)):
 #0  do_futex_wait.constprop () from /usr/lib64/libpthread.so.0
 #1  __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0
 #2  ydb_stm_args (callblk=0x55af6c96b550) at sr_unix/ydb_stm_args.c:183
 #3  ydb_stm_args5 (tptoken=0, errstr=0x0, calltyp=16, p1=94211928230125, p2=140733677288928, p3=94211928265280, p4=1, p5=140733677288912) at sr_unix/ydb_stm_args.c:320
 #4  ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x55af6c8408ed <tpfn_stage1>, tpfnparm=0x7fff1cd7bde0, transid=0x55af6c849240 <tptypebuff> "BATCH", namecount=1, varnames=0x7fff1cd7bdd0) at sr_unix/ydb_tp_st.c:33
 #5  impjob (childnum=2) at simplethreadapi_imptp.c:1148
 #6  main (argc=1, argv=0x7fff1cd7c198) at simplethreadapi_imptp.c:602

Thread 1 (Thread 0x7fde47fff700 (LWP 14705)):
 #0  pthread_kill () from /usr/lib64/libpthread.so.0
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  ch_cond_core () at sr_unix/ch_cond_core.c:76
 #3  rts_error_va (csa=0x0, argcnt=7, var=0x7fde47ffeaa0) at sr_unix/rts_error.c:194
 #4  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #5  ydb_stm_args (callblk=0x7fde40000b20) at sr_unix/ydb_stm_args.c:126
 #6  ydb_stm_args4 (tptoken=7085, errstr=0x0, calltyp=12, p1=94211928265184, p2=2, p3=94211928261632, p4=94211928263568) at sr_unix/ydb_stm_args.c:298
 #7  ydb_set_st (tptoken=7085, errstr=0x0, varname=0x55af6c8491e0 <ygbl_arandom>, subs_used=2, subsarray=0x55af6c848400 <subscr>, value=0x55af6c848b90 <ybuff_val>) at sr_unix/ydb_set_st.c:33
 #8  tpfn_stage1 (tptoken=7085, errstr=0x0, parm_array=0x7fff1cd7bde0) at simplethreadapi_imptp.c:1384
 #9  ydb_stm_tpthreadq_process (curTPWorkQHead=0x7fde48024c40, forced_simplethreadapi_exit_seen=0x7fde47ffeea8) at sr_unix/ydb_stm_tpthread.c:225
 #10 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:84
 #11 start_thread () from /usr/lib64/libpthread.so.0
 #12 clone () from /usr/lib64/libc.so.6

chathaway-codes pushed a commit that referenced this issue


          [#419] Avoid hangs in source server due to trying to flush the journa…

021f634

…l buffers while instance freeze is ON

The DO_JNL_FLUSH_IF_POSSIBLE macro expresses a desire to flush only if possible. If the act of flushing
is going to hang due to a frozen instance, it is better to skip the jnl flush and avoid the hang.
That is what is done as the fix in this commit.
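
In sketch form, with hypothetical names (instance_frozen, jnl_flush_sketch() and
flush_if_possible_sketch() are illustrative stand-ins, not the macro's real body):

```c
#include <assert.h>
#include <stdbool.h>

static bool instance_frozen = false;	/* stand-in for the instance freeze state */
static int  flush_calls = 0;

/* Stand-in for the journal flush; in this model it must never be entered
 * while the instance is frozen, because it would wait for the unfreeze. */
static void jnl_flush_sketch(void)
{
	assert(!instance_frozen);
	flush_calls++;
}

/* Best-effort flush: returns true if a flush was attempted, false if skipped. */
static bool flush_if_possible_sketch(void)
{
	if (instance_frozen)
		return false;	/* skip rather than hang */
	jnl_flush_sketch();
	return true;
}
```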

This addresses a hang seen in the v62000/gtm8086 subtest where the source server was
stuck waiting for the instance to be unfrozen (while trying to flush the journal file using the
DO_JNL_FLUSH_IF_POSSIBLE macro) while the test script (which does the unfreeze) was waiting for the
source server to clear some backlog. Below is the C-stack of the stuck source server for the record.

(gdb) where
 #0  clock_nanosleep () from /usr/lib64/libc.so.6
 #1  m_usleep () at sr_unix/sleep.c:25
 #2  wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:490
 #3  wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:513
 #4  jnl_write_attempt () at sr_port/jnl_write_attempt.c:335
 #5  jnl_flush () at sr_port/jnl_flush.c:57
 #6  update_max_seqno_info () at sr_unix/gtmsource_readfiles.c:741
 #7  first_read () at sr_unix/gtmsource_readfiles.c:881
 #8  read_regions () at sr_unix/gtmsource_readfiles.c:1711
 #9  read_and_merge () at sr_unix/gtmsource_readfiles.c:1544
 #10 gtmsource_readfiles () at sr_unix/gtmsource_readfiles.c:1974
 #11 gtmsource_get_jnlrecs () at sr_unix/gtmsource_process_ops.c:980
 #12 gtmsource_process () at sr_unix/gtmsource_process.c:1544
 #13 gtmsource () at sr_unix/gtmsource.c:528
 #14 mupip_main () at sr_unix/mupip_main.c:124
 #15 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:148
 #16 main () at sr_unix/mupip.c:19

chathaway-codes pushed a commit that referenced this issue


          [DEBUG_ONLY] Fix assert to take into account an edge case with update…

b17b53f

… process

We had the update process fail with a SIG-11 (only in a debug build) due to a bad assert.
Below is the C-stack trace for the record.

(gdb) where
 #0  pthread_kill () from /usr/lib64/libpthread.so.0
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  generic_signal_handler () at sr_unix/generic_signal_handler.c:341
 #4  <signal handler called>
 #5  mutex_deadlock_check () at sr_port/mutex_deadlock_check.c:110
 #6  mutex_long_sleep () at sr_unix/mutex.c:513
 #7  gtm_mutex_lock () at sr_unix/mutex.c:858
 #8  grab_lock () at sr_unix/grab_lock.c:83
 #9  updproc_actions () at sr_port/updproc.c:907
 #10 updproc () at sr_port/updproc.c:501
 #11 mupip_main () at sr_unix/mupip_main.c:124
 #12 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:148
 #13 main () at sr_unix/mupip.c:19

(gdb) f 5
 #5  mutex_deadlock_check () at sr_port/mutex_deadlock_check.c:110
110      assert(REPL_ALLOWED(cs_addrs));

(gdb) p cs_addrs
$1 = (sgmnt_addrs *) 0x0

     106   if ((NULL != repl_csa) && (repl_csa->critical == criticalPtr))
     107   {       /* grab_lock going for crit on the jnlpool region. gv_cur_region points to the current region of
     108            * interest, which better have REPL_ENABLED or REPL_WAS_ENABLED. Assert that.
     109            */
 --> 110           assert(REPL_ALLOWED(cs_addrs));

In case of the update process, it is possible we do not have any current region of interest (like
the comment in line 107 above indicates) if we are doing a grab_lock() call to add a history
record to the replication instance file. The assert is now modified to allow cs_addrs to be NULL
only in case of the update process.
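
One possible shape for the relaxed condition is sketched here. csa_sketch, is_updproc_sketch and both
functions are hypothetical; the actual assert in sr_port/mutex_deadlock_check.c may be written
differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-in for sgmnt_addrs; only the property the assert cares about. */
typedef struct
{
	bool repl_allowed;
} csa_sketch;

static bool is_updproc_sketch = false;	/* true only in the update process */

/* The precondition checks for NULL before dereferencing. A NULL csa is
 * acceptable only for the update process; otherwise the region must allow replication. */
static bool jnlpool_crit_precondition(const csa_sketch *csa)
{
	if (NULL == csa)
		return is_updproc_sketch;
	return csa->repl_allowed;
}

static void deadlock_check_sketch(const csa_sketch *csa)
{
	assert(jnlpool_crit_precondition(csa));
	(void)csa;	/* keeps builds with NDEBUG warning-free */
}
```

Testing for NULL first is what avoids the SIG-11 seen above, where the old assert dereferenced a NULL
cs_addrs.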

chathaway-codes pushed a commit that referenced this issue


          [#420] Fix assert to take into account timer_handler(DUMMY_SIG_NUM) i…

ce35b55

…nvocation can be interrupted by a real timer handler interrupt i.e. timer_handler(SIGALRM)

In an M program invocation (i.e. no SimpleThreadAPI), the below assert failed.

%YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 730 for expression
	(simpleThreadAPI_active || !STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_timer_handler))

And below is the corresponding C-stack

 #0  pthread_kill () from /usr/lib64/libpthread.so.0
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:79
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffef57234f0) at sr_unix/rts_error.c:194
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  timer_handler (why=14, info=0x7f26ac6b3e48 <stapi_signal_handler_oscontext+10728>, context=0x7f26ac6b3ec8 <stapi_signal_handler_oscontext+10856>) at sr_unix/gt_timers.c:730
 #7  <signal handler called>
 #8  timer_handler (why=0, info=0x0, context=0x0) at sr_unix/gt_timers.c:727
 #9  check_for_deferred_timers () at sr_unix/gt_timers.c:1205
 #10 deferred_signal_handler () at sr_port/deferred_signal_handler.c:68
 #11 gtm_trigger_complink () at sr_unix/gtm_trigger.c:382
 #12 process_xecute () at sr_unix/trigger_parse.c:1214
 #13 trigger_parse () at sr_unix/trigger_parse.c:1446
 #14 trigger_update_rec () at sr_unix/trigger_update.c:1253
 #15 trigger_update_rec_helper () at sr_unix/trigger_update.c:2007
 #16 trigger_update () at sr_unix/trigger_update.c:2060
 #17 op_fnztrigger () at sr_port/op_fnztrigger.c:245

The issue is that STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_timer_handler) is set to TRUE
by the debug-only STAPI_FAKE_TIMER_HANDLER_WAS_DEFERRED macro invocation in check_for_deferred_timers()
before calling timer_handler(DUMMY_SIG_NUM). And if a real timer handler interrupt happens before
the timer_handler(DUMMY_SIG_NUM) invocation is finished, we will fail the assert.

The assert is not of much use now that a lot more assertions are already folded into the
FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED macro so it is removed.

chathaway-codes pushed a commit that referenced this issue


          [#420] Do not treat case of signal forwarded from MAIN worker thread …

e46f313

…as a new signal; Reuse pre-existing info/context from original signal handler for this case too

Fixes an occasional dual_fail_extend/dual_fail2_mustop_sigquit subtest failure where
a KILLBYSIGUINFO message is expected when another process sends a SIG-3 but instead we
see a KILLBYSIGSINFO1 message in the .mje file.

Below is one such stack trace where such an incorrect message gets sent out. While we were
handling the SIG-3 in a deferred fashion (through deferred_signal_handler()), another SIG-3
came in from the MAIN worker thread which drove generic_signal_handler() in a nested fashion
and caused the issue.

(gdb) where
 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  ch_cond_core () at sr_unix/ch_cond_core.c:77
 #3  rts_error_va () at sr_unix/rts_error.c:194
 #4  rts_error_csa () at sr_unix/rts_error.c:101
 #5  generic_signal_handler () at sr_unix/generic_signal_handler.c:195
 #6  <signal handler called>
 #7  semop () at ../sysdeps/unix/sysv/linux/semop.c:30
 #8  try_semop_get_c_stack () at sr_unix/gtm_c_stack_trace_semop.c:59
 #9  ftok_sem_lock () at sr_unix/ftok_sems.c:232
 #10 gds_rundown () at sr_unix/gds_rundown.c:324
 #11 gv_rundown () at sr_port/gv_rundown.c:123
 #12 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:215
 #13 __run_exit_handlers () at exit.c:108
 #14 __GI_exit () at exit.c:139
 #15 gtm_image_exit () at sr_unix/gtm_image_exit.c:27
 #16 generic_signal_handler () at sr_unix/generic_signal_handler.c:361
 #17 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:51
 #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:55
 #19 tp_tend () at sr_port/tp_tend.c:1887
 #20 op_tcommit () at sr_port/op_tcommit.c:496

This is now fixed by checking if the signal came in from another thread in the same process
(SI_TKILL) and if so treat this as a forwarded signal and not reset info/context but instead
reuse whatever was there from the original signal handler invocation.

A consequence of this change is that a pre-existing assert (that checked "stapi_signal_handler_deferred")
could now fail. That is now removed.
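
The SI_TKILL check can be tried out with the Linux-specific sketch below. The handler, the saved_*
variables and demo_sketch() are hypothetical and far simpler than generic_signal_handler(); the sketch
only shows how si_code separates a signal sent with kill() from one sent with tkill/tgkill, which is how
raise() and pthread_kill() deliver on Linux.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* makes SI_TKILL visible under strict -std= settings */
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Details remembered from the original signal; hypothetical stand-ins. */
static volatile sig_atomic_t saved_valid = 0;
static volatile sig_atomic_t saved_si_code = 0;
static volatile sig_atomic_t forwarded_count = 0;

static void handler_sketch(int sig, siginfo_t *info, void *context)
{
	(void)sig;
	(void)context;
	if (saved_valid && (NULL != info) && (SI_TKILL == info->si_code))
	{	/* Thread-directed signal: treat as a forward and keep the saved details. */
		forwarded_count++;
		return;
	}
	saved_si_code = (NULL != info) ? info->si_code : 0;
	saved_valid = 1;
}

static int install_handler_sketch(int sig)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_sigaction = handler_sketch;
	act.sa_flags = SA_SIGINFO;	/* needed to receive the siginfo_t */
	sigemptyset(&act.sa_mask);
	return sigaction(sig, &act, NULL);
}

/* Drive the handler twice: once like an outside sender, once like a forward.
 * Returns the number of signals classified as forwarded, or -1 on setup failure. */
static int demo_sketch(void)
{
	assert(0 == saved_valid);	/* the demo expects a fresh state */
	if (0 != install_handler_sketch(SIGUSR1))
		return -1;
	kill(getpid(), SIGUSR1);	/* si_code is SI_USER: the "original" signal */
	raise(SIGUSR1);			/* si_code is SI_TKILL on Linux: the "forward" */
	return (int)forwarded_count;
}
```

Because the second signal is recognized as a forward, the details saved from the first one (here its
si_code) survive unchanged, which is the property the fix relies on.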

chathaway-codes pushed a commit that referenced this issue


          [DEBUG_ONLY] Fix incorrect assert introduced in #350 (failed after in…

b7a9304

…tegrating GT.M V6.3-006)

After integrating GT.M V6.3-006 into the YottaDB master branch, keying in "mupip upgrade" in a terminal
and hitting <Enter> failed with the following assert.

%YDB-F-ASSERT, Assert failed in sr_unix/iott_readfl.c line 1238 for expression (!1 || IS_SETTERM_DONE(io_ptr))

The failure C-stack was the following.

 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
 #1  0x00007f5c9ef1cc03 in gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  0x00007f5c9ef1e12b in gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  0x00007f5c9eefad61 in ch_cond_core () at sr_unix/ch_cond_core.c:80
 #4  0x00007f5c9efef55c in rts_error_va () at sr_unix/rts_error.c:192
 #5  0x00007f5c9efeea6a in rts_error_csa () at sr_unix/rts_error.c:99
 #6  0x00007f5c9f4a994a in iott_readfl () at sr_unix/iott_readfl.c:1238
 #7  0x00007f5c9f4a149d in iott_read () at sr_unix/iott_read.c:26
 #8  0x00007f5c9f27b397 in op_read () at sr_port/op_read.c:70
 #9  0x00007f5c9ef02f57 in cli_get_parm () at sr_unix/cli_parse.c:1025
 #10 0x00007f5c9eefb8cd in cli_get_str () at sr_unix/cli.c:271
 #11 0x00007f5c9f1fa1a1 in mupip_upgrade () at sr_port/mupip_upgrade.c:115
 #12 0x00007f5c9eebf406 in mupip_main () at sr_unix/mupip_main.c:124
 #13 0x000055ba15601913 in dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:148
 #14 0x000055ba15601240 in main () at sr_unix/mupip.c:19

This is because cli_get_parm() now does an op_read() call, so even MUPIP can go through
iott_readfl(). The RESETTERM_IF_NEEDED macro invocation done at the end of iott_readfl() then
fails its assert because the macro was not written to be in sync with the SETTERM_IF_NEEDED macro.

The SETTERM_IF_NEEDED macro skips the setterm() call if IS_GTM_IMAGE is FALSE.
But the RESETTERM_IF_NEEDED macro has an assert that does not take this into account.
This is now fixed.

chathaway-codes pushed a commit that referenced this issue


          Fix rare possibility of SIG-11 while issuing DBCCERR, SYSCALL and FILEOPENFAIL errors

16ee020

On a slow ARMV6L box, we saw a test failure in the manually_start/4g_journal subtest
where the update process reader helper got a SIG-11. The helper hit a DBCCERR error because
the timeout of 200 seconds was not enough to get the queue interlock on the slow system.
While issuing the DBCCERR error, it passed the message arguments in the wrong order
(LIT_AND_LEN usage instead of LEN_AND_LIT) and that caused the SIG-11. The SIG-11 is now
fixed by using LEN_AND_LIT.
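
A minimal sketch of why the argument order matters, under stated assumptions: the two macro definitions below are my assumed shapes (each expands to two arguments, in opposite orders), and `format_ad()` is a hypothetical stand-in for a message routine whose `!AD` directive consumes a length followed by an address. It is not the real util_format() code.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Assumed macro shapes: same two values, opposite order. */
#define LEN_AND_LIT(LIT)    (int)(sizeof(LIT) - 1), (LIT)
#define LIT_AND_LEN(LIT)    (LIT), (int)(sizeof(LIT) - 1)

/* Hypothetical consumer: reads (int length, char *address) from the variable arguments. */
static size_t format_ad(char *out, size_t outsize, ...)
{
    va_list     ap;
    int         len;
    const char  *addr;
    size_t      n;

    assert((NULL != out) && (0 < outsize));
    va_start(ap, outsize);
    len = va_arg(ap, int);          /* length is expected first */
    addr = va_arg(ap, char *);      /* address is expected second */
    va_end(ap);
    n = (0 < len) ? (size_t)len : 0;
    if (n > outsize - 1)
        n = outsize - 1;
    memcpy(out, addr, n);
    out[n] = '\0';
    return n;
}
```

Calling `format_ad(buf, sizeof(buf), LEN_AND_LIT("DEFAULT"))` matches what the consumer reads. Passing `LIT_AND_LEN("DEFAULT")` instead would make it read a pointer as the length and a small integer as the address, which is the kind of garbage (`ptr=0xa`, a huge negative `len`) visible in frame 5 of the C-stack below. The compiler cannot catch this because variadic arguments are not type checked.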

Below is the C-stack of the process.

 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  0xb5e9e384 in gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  0xb5e9fae0 in gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  0xb5e911c4 in generic_signal_handler (sig=11, info=0xb6bedad8 <stapi_signal_handler_oscontext+3528>, context=0xb6bedb58 <stapi_signal_handler_oscontext+3656>) at sr_unix/generic_signal_handler.c:377
 #4  <signal handler called>
 #5  0xb5ea6760 in gtm_wcswidth (ptr=0xa <error: Cannot access memory at address 0xa>, len=-1230382516, strict=0, nonprintwidth=1) at sr_unix/gtm_utf8.c:147
 #6  0xb5fec564 in util_format (message=0xbe9a961e "", fao=..., buff=0xedec30 "%YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object", size=2046, faocnt=0) at sr_unix/util_output.c:365
 #7  0xb5fef290 in util_out_print_vaparm ( message=0xbe9a95cc "%YDB-E-DBCCERR, Interlock instruction failure in critical mechanism for region !AD", flush=0, var=..., faocnt=2) at sr_unix/util_output.c:798
 #8  0xb5ea3a64 in gtm_putmsg_list (csa=0x0, arg_count=7, var=...) at sr_unix/gtm_putmsg_list.c:119
 #9  0xb5ea2208 in gtm_putmsg_csa (csa=0x0, argcnt=9) at sr_unix/gtm_putmsg.c:71
 #10 0xb63cb02c in updproc_preread () at sr_port/updhelper_reader.c:227
 #11 0xb63cac44 in updhelper_reader () at sr_port/updhelper_reader.c:139
 #12 0xb5e280d4 in mupip_main (argc=4, argv=0xbe9ac5a4, envp=0xbe9ac5b8) at sr_unix/mupip_main.c:124
 #13 0x000114ac in dlopen_libyottadb (argc=4, argv=0xbe9ac5a4, envp=0xbe9ac5b8, main_func=0x115f8 "mupip_main") at sr_unix/dlopen_libyottadb.c:148
 #14 0x00010b68 in main (argc=4, argv=0xbe9ac5a4, envp=0xbe9ac5b8) at sr_unix/mupip.c:19

While fixing this, a couple of similar issues were found in other places of the code and they are
also fixed.

Since all of these are rare error scenarios, a user-visible issue is not created for this.

chathaway-codes pushed a commit that referenced this issue


          [#456] Fix SIG-11 from ZWRITE of global after a name-level $ORDER if database files of some regions do not exist

31e91c3

There are 2 issues.

1) In sr_port/op_gvorder.c (and sr_port/op_zprevious.c), we set "gv_cur_region" to a region in the
   global directory before invoking gv_init_reg(). But if the gv_init_reg() call fails (e.g. DBFILERR
   error due to a missing database file), we end up with gv_cur_region set to a non-NULL value but
   gv_cur_region->open is FALSE which is an out-of-design state for most of the database code as that
   assumes the global variable "gv_cur_region" corresponds to a valid and open database file.

2) This out-of-design state leaves the process vulnerable to a later SIG-11 like the one in the
   C-stack below. The SIG-11 happens in change_reg() because gv_cur_region is set to a non-NULL
   value but the region has not yet been opened (due to the missing database file).

 #0  __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
 #3  generic_signal_handler (sig=11, ...) at sr_unix/generic_signal_handler.c:405
 #4  <signal handler called>
 #5  change_reg () at sr_port/change_reg.c:49
 #6  gvzwrite_clnup () at sr_port/gvzwrite_clnup.c:47
 #7  gvzwrite_ch () at sr_port/gvzwrite_ch.c:20
 #8  rts_error_va () at sr_unix/rts_error.c:192
 #9  rts_error_csa () at sr_unix/rts_error.c:99
 #10 dbfilopn () at sr_unix/gvcst_init_sysops.c:613
 #11 gvcst_init () at sr_port/gvcst_init.c:862
 #12 gv_init_reg () at sr_port/gv_init_reg.c:56
 #13 gv_bind_name () at sr_port/gv_bind_name.c:75
 #14 op_gvname_common () at sr_port/op_gvname.c:117
 #15 op_gvname () at sr_port/op_gvname.c:70
 #16 gvzwr_fini () at sr_port/gvzwr_fini.c:76
 #17 op_gvzwrite () at sr_port/op_gvzwrite.c:65

The fixes are two-fold as well.

1) Primary fix is to ensure the out-of-design state is not created by op_gvorder.c (and op_zprevious.c).
   This is done by moving the initialization of the global variable "gv_cur_region" to AFTER the
   gv_init_reg() call. This ensures if a DBFILERR error occurs inside gv_init_reg(), the global
   gv_cur_region still reflects the state it was in before the name level $ORDER started.

2) Secondary fix is to change_reg.c to ensure it handles the out-of-design state (if any other code that
   we don't know of creates that situation) without a SIG-11. This is done by checking
   gv_cur_region->open and, if it is FALSE, setting cs_addrs/cs_data to NULL, just like is already done
   in the TP_CHANGE_REG macro.
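
The two fixes can be sketched together as below. This is a simplified illustration only: `region_t`, `init_region()`, `switch_to_region()` and `change_region()` are hypothetical stand-ins for the real structures and for gv_init_reg()/change_reg(), and the sketch uses return codes where the real code unwinds through rts_error().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced region structure. */
typedef struct
{
    int     open;           /* nonzero once the database file has been opened */
    int     file_exists;    /* stand-in for "the database file can be opened" */
    void    *csa;           /* per-region addresses, valid only when open */
} region_t;

static region_t *cur_region;    /* stand-in for gv_cur_region */
static void     *cur_csa;       /* stand-in for cs_addrs */

/* Stand-in for gv_init_reg(): returns 0 on success, -1 on a DBFILERR-like failure. */
static int init_region(region_t *reg)
{
    assert(NULL != reg);
    if (!reg->file_exists)
        return -1;
    reg->open = 1;
    return 0;
}

/* Primary fix: assign the global only AFTER initialization has succeeded. */
static int switch_to_region(region_t *reg)
{
    if (!reg->open && (0 != init_region(reg)))
        return -1;      /* cur_region still reflects the earlier state */
    cur_region = reg;
    return 0;
}

/* Secondary fix: tolerate a NULL or unopened current region. */
static void change_region(void)
{
    if ((NULL == cur_region) || !cur_region->open)
    {
        cur_csa = NULL;
        return;
    }
    cur_csa = cur_region->csa;
}
```

The first function keeps the invariant intact; the second only limits the damage if some other code path breaks it.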

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] Fix rare longstanding assert in sr_unix/gds_rundown.c

7aa5ed3

Background
----------
* While running the TCK04 bats subtest in the YDBOcto repo using a Debug build of YottaDB
  that was built using `clang` (not `gcc`), I encountered a very rare failure (took hundreds
  of test reruns to reproduce once).

* Below is the stack trace of the core file from the assert using the gdb debugger.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140433852622656) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140433852622656) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140433852622656, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd0ac05210) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1108
  #9  gv_rundown () at sr_port/gv_rundown.c:122
  #10 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #11 signal_exit_handler (exit_handler_name=0x7fb94dc90e5a "deferred_exit_handler", sig=15, info=0x7fb94ddf0ca8 <stapi_signal_handler_oscontext+4424>, context=0x7fb94ddf0d28 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #14 gtm_malloc_main (size=512, stack_level=1) at sr_port/gtm_malloc_src.h:800
  #15 gtm_malloc (size=512) at sr_port/gtm_malloc_src.h:1486
  #16 gvcst_tp_init (greg=0x22b98d8) at sr_port/gvcst_tp_init.c:68
  #17 tp_set_sgm () at sr_port/tp_set_sgm.c:53
  #18 change_reg () at sr_port/change_reg.c:57
  #19 gv_bind_name (addr=0x22b94e0, gvname=0x7ffd0ac06048) at sr_port/gv_bind_name.c:144
  #20 op_gvname_common (count=8, hash_code=112891184, val_arg=0x7fb94e21c978, var=0x7ffd0ac0cdb0) at sr_port/op_gvname.c:117
  #21 op_gvname_fast (count_arg=10, hash_code=112891184, val_arg=0x7fb94e21c978) at sr_port/op_gvname.c:81

  (gdb) f 8
  #8  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1108
  1108              assert(NULL != si->cr_array);

  (gdb) f 16
  #16 gvcst_tp_init (greg=0x22b98d8) at sr_port/gvcst_tp_init.c:68
  68              si->cr_array = (cache_rec_ptr_ptr_t)malloc(SIZEOF(cache_rec_ptr_t) * si->cr_array_size);
  ```

Issue
-----
* `si->cr_array_size` is initialized at line 67 below and `si->cr_array` is initialized at line 68 below.

  **sr_port/gvcst_tp_init.c**
  ```c
     67    si->cr_array_size = si->cur_tp_hist_size;
     68    si->cr_array = (cache_rec_ptr_ptr_t)malloc(SIZEOF(cache_rec_ptr_t) * si->cr_array_size);
  ```

* But the assert in line 1108 below assumes that if `si->cr_array_size` is set, then `si->cr_array` must
  also have been set. This is not right if a signal (say `SIG-15` aka `SIGTERM`) comes in between lines
  67 and 68 above like it did in the above failure.

  **sr_unix/gds_rundown.c**
  ```c
   1100                         if (NULL != si->blks_in_use)
   1101                         {
   1102                                 free_hashtab_int4(si->blks_in_use);
   1103                                 free(si->blks_in_use);
   1104                                 si->blks_in_use = NULL;
   1105                         }
   1106                         if (si->cr_array_size)
   1107                         {
   1108                                 assert(NULL != si->cr_array);
   1109                                 if (NULL != si->cr_array)
   1110                                         free(si->cr_array);
   1111                         }
  ```

Fix
---
* `si->cr_array` is now checked directly: only if it is non-`NULL` do we invoke `free(si->cr_array)`.
  This is no longer based on the value of `si->cr_array_size`, which is more in line with how we
  already handle `si->blks_in_use` in line 1100.

* In effect the assert at line 1108 is now removed.
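
A minimal sketch of the revised cleanup, assuming a reduced structure: `sgm_sketch_t`, `init_sketch()` and `cleanup_sketch()` are hypothetical names, and the `stop_after_size` flag merely simulates a signal arriving between the two assignments in `gvcst_tp_init.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reduction of the two fields involved. */
typedef struct
{
    size_t  cr_array_size;
    void    **cr_array;
} sgm_sketch_t;

/* Sets the size first and the pointer second, like lines 67 and 68 above. */
static void init_sketch(sgm_sketch_t *si, size_t n, int stop_after_size)
{
    assert(NULL != si);
    si->cr_array_size = n;
    if (stop_after_size)
        return;     /* simulates the interruption: size is set, pointer is still NULL */
    si->cr_array = malloc(sizeof(void *) * n);
}

/* Frees based on the pointer itself, not on the size field. Returns 1 if something was freed. */
static int cleanup_sketch(sgm_sketch_t *si)
{
    int freed = 0;

    assert(NULL != si);
    if (NULL != si->cr_array)
    {
        free(si->cr_array);
        si->cr_array = NULL;
        freed = 1;
    }
    si->cr_array_size = 0;
    return freed;
}
```

Because the cleanup keys off the pointer, a half-initialized structure is simply skipped rather than tripping an assert.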

Notes
-----
* In Release builds, the `assert` had no effect and so there was no issue as we later did an `if` check
  anyways.

nars1 added a commit that referenced this issue


          Fix rare SIG-11 while handling fatal signals like SIGTERM etc.

3d79b89

Background
----------
* While running the TCK04 bats subtest in the YDBOcto repo using a Debug build of YottaDB
  that was built using `clang` (not `gcc`), I encountered a very rare failure (took hundreds
  of test reruns to reproduce once).

* Although the failure happened only with `clang`, the same issue can happen with `gcc` builds
  of YottaDB too given the right timing of events/signals.

* Below is the stack trace of the core file from the assert using the gdb debugger.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140299547846464) at pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140299547846464) at pthread_kill.c:80
  #2  __GI___pthread_kill (threadid=140299547846464, signo=3) at pthread_kill.c:91
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  generic_signal_handler (sig=11, info=0x7f9a08aecca8 <stapi_signal_handler_oscontext+4424>, context=0x7f9a08aecd28 <stapi_signal_handler_oscontext+4552>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:492
  #6  ydb_os_signal_handler (sig=11, info=0x7fff10881b70, context=0x7fff10881a40) at sr_unix/ydb_os_signal_handler.c:85
  #7  <signal handler called>
  #8  cleanup_list (list=0xaf8a40) at sr_port/buddy_list.c:205
  #9  gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:1098
  #10 gv_rundown () at sr_port/gv_rundown.c:122
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #12 signal_exit_handler (exit_handler_name=0x7f9a0898ce5a "deferred_exit_handler", sig=15, info=0x7f9a08aecca8 <stapi_signal_handler_oscontext+4424>, context=0x7f9a08aecd28 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 gtm_malloc_main (size=520, stack_level=1) at sr_port/gtm_malloc_src.h:800
  #16 gtm_malloc (size=520) at sr_port/gtm_malloc_src.h:1486
  #17 initialize_list (list=0xaf8a40, elemSize=192, initAlloc=64) at sr_port/buddy_list.c:52
  #18 gvcst_tp_init (greg=0xaf1a18) at sr_port/gvcst_tp_init.c:55
  #19 tp_set_sgm () at sr_port/tp_set_sgm.c:53
  #20 change_reg () at sr_port/change_reg.c:57
  #21 gv_bind_name (addr=0xaf1470, gvname=0x7fff10882e98) at sr_port/gv_bind_name.c:144
  #22 op_gvname_common (count=4, hash_code=-1391378772, val_arg=0x7f9a08f1f998, var=0x7fff10889c00) at sr_port/op_gvname.c:117
  #23 op_gvname_fast (count_arg=6, hash_code=-1391378772, val_arg=0x7f9a08f1f998) at sr_port/op_gvname.c:81

  (gdb) f 8
  #8  cleanup_list (list=0xaf8a40) at sr_port/buddy_list.c:205
  205             while(*curr)

  (gdb) f 17
  #17 initialize_list (list=0xaf8a40, elemSize=192, initAlloc=64) at sr_port/buddy_list.c:52
  52              list->ptrArray = (char **)malloc((size_t)SIZEOF(char *) * (MAX_MEM_SIZE_IN_BITS + 2));
  ```

Issue
-----
* A SIG-15/SIGTERM signal interrupted the `initialize_list()` call in frame 17. In frame 18, we were
  trying to initialize `si->tlvl_cw_set_list` as the below line of code indicates.

  **sr_port/gvcst_tp_init.c**
  ```c
     55   initialize_list(si->tlvl_cw_set_list, SIZEOF(cw_set_element), TLVL_CW_SET_LIST_INIT_ALLOC);
  ```

* The signal caused us to proceed to exit handling and as part of that we tried to cleanup the
  incompletely set up structure `si->tlvl_cw_set_list` at line 1098 below.

  **sr_unix/gds_rundown.c**
  ```c
   1082                 if (csa->sgm_info_ptr)
   1083                 {
   1084                         si = csa->sgm_info_ptr;
   1085                         /* It is possible we got interrupted before initializing all fields of "si"
   1086                          * completely so account for NULL values while freeing/releasing those fields.
   1087                          */
   1088                         assert((si->tp_csa == csa) || (NULL == si->tp_csa));
   1089                         if (si->jnl_tail)
   1090                         {
   1091                                 PROBE_FREEUP_BUDDY_LIST(si->format_buff_list);
   1092                                 PROBE_FREEUP_BUDDY_LIST(si->jnl_list);
   1093                                 FREE_JBUF_RSRV_STRUCT(si->jbuf_rsrv_ptr);
   1094                         }
   1095                         PROBE_FREEUP_BUDDY_LIST(si->recompute_list);
   1096                         PROBE_FREEUP_BUDDY_LIST(si->new_buff_list);
   1097                         PROBE_FREEUP_BUDDY_LIST(si->tlvl_info_list);
   1098                         PROBE_FREEUP_BUDDY_LIST(si->tlvl_cw_set_list);
   1099                         PROBE_FREEUP_BUDDY_LIST(si->cw_set_list);
  ```

* And that caused the SIG-11.

Fix
---
* A lot of the above cleanup in `sr_unix/gds_rundown.c` happens only if `csa->sgm_info_ptr` is non-NULL.

* But this field gets set to a non-NULL value at the very start of `sr_port/gvcst_tp_init.c` before
  a lot of the individual fields (like `si->tlvl_cw_set_list` etc.) get initialized.

* Therefore, the fix is to set `csa->sgm_info_ptr` to a non-NULL value `AFTER` all the initialization
  of the individual members in that structure has happened.
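
The "publish last" ordering can be sketched as follows. All type and function names here are hypothetical reductions, and the `interrupt` callback only simulates exit handling being driven in the middle of initialization (in the real failure this happened through deferred signal handling inside `gtm_malloc`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { char **ptrArray; } list_sketch_t;
typedef struct { list_sketch_t *tlvl_cw_set_list; } si_sketch_t;
typedef struct { si_sketch_t *sgm_info_ptr; } csa_sketch_t;

static int rundown_saw_partial;     /* set if rundown ever sees a half-built structure */

/* Stand-in for the rundown cleanup: only acts when the structure has been published. */
static void rundown_sketch(csa_sketch_t *csa)
{
    si_sketch_t *si = csa->sgm_info_ptr;

    if (NULL == si)
        return;
    if ((NULL == si->tlvl_cw_set_list) || (NULL == si->tlvl_cw_set_list->ptrArray))
        rundown_saw_partial = 1;
    else
        free(si->tlvl_cw_set_list->ptrArray);
    free(si->tlvl_cw_set_list);
    free(si);
    csa->sgm_info_ptr = NULL;
}

/* Builds the structure in a local variable and assigns csa->sgm_info_ptr only at the very end. */
static int tp_init_sketch(csa_sketch_t *csa, void (*interrupt)(csa_sketch_t *))
{
    si_sketch_t *si;

    assert(NULL != csa);
    si = calloc(1, sizeof(*si));
    if (NULL == si)
        return -1;
    si->tlvl_cw_set_list = calloc(1, sizeof(*si->tlvl_cw_set_list));
    if (NULL == si->tlvl_cw_set_list)
    {
        free(si);
        return -1;
    }
    if (NULL != interrupt)
        interrupt(csa);     /* exit handling runs here but cannot see the partial "si" */
    si->tlvl_cw_set_list->ptrArray = calloc(4, sizeof(char *));
    if (NULL == si->tlvl_cw_set_list->ptrArray)
    {
        free(si->tlvl_cw_set_list);
        free(si);
        return -1;
    }
    csa->sgm_info_ptr = si; /* publish only after every member is initialized */
    return 0;
}
```

In this sketch the interrupted rundown finds `sgm_info_ptr` still NULL and does nothing, so it never walks an incompletely set up list.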

Notes
-----
* Even though the user-visible symptom is a SIG-11, this issue is considered rare enough that a user is
  unlikely to encounter it, so a separate issue is not created for this fix.

nars1 added a commit that referenced this issue


          [#860] Fix SIG-11 when .m file is opened read-write and ZLINK of same .m file is attempted

3896ddd

Background
----------
* Below is a simple test case obtained from a fuzz test failure in in-house testing.

  ```m
  $ cat test.m
   set fn="generated.m"
   open fn:new
   use fn
   write " z"
   Set $ZROUTINES=""
   zlink "generated.m"

  $ $ydb_dist/yottadb -run test
  %YDB-F-KILLBYSIGSINFO1, YottaDB process 55439 has been killed by a signal 11 at address 0x00007F4F4F82EED7 (vaddr 0x0000000000000008)
  %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
  Segmentation fault (core dumped)
  ```

* This is a failure in both Release and Debug builds of YottaDB as well as the upstream GT.M.

Issue
-----
* Below is the stack trace from the core file.

  ```c
  (gdb) where
  #0  ins_errtriple (in_error=150373618) at sr_port/ins_errtriple.c:51
  #1  stx_error_va (in_error=150373618, args=0x7f6559aa53c0) at sr_port/stx_error.c:164
  #2  rts_error_va (csa=0x0, argcnt=1, var=0x7f6559aa54a0) at sr_unix/rts_error.c:179
  #3  rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #4  iorm_wteol (x=1, iod=0x62d000004840) at sr_unix/iorm_wteol.c:87
  #5  iorm_cond_wteol (iod=0x62d000004840) at sr_unix/iorm_flush.c:42
  #6  iorm_close (iod=0x62d000004840, pp=0x7f6559aa63b0) at sr_unix/iorm_close.c:112
  #7  io_dev_close (d=0x62d000005ec0) at sr_port/io_rundown.c:102
  #8  io_rundown (rundown_type=0) at sr_port/io_rundown.c:60
  #9  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:239
  #10 signal_exit_handler (exit_handler_name=0x7f6555366520 "generic_signal_handler", sig=11, info=0x7f6555881948 <stapi_signal_handler_oscontext+4424>, context=0x7f65558819c8 <stapi_signal_handler_oscontext+4552>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #11 generic_signal_handler (sig=11, info=0x7f6555881948 <stapi_signal_handler_oscontext+4424>, context=0x7f65558819c8 <stapi_signal_handler_oscontext+4552>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:500
  #12 ydb_os_signal_handler (sig=11, info=0x7f6559aa6bf0, context=0x7f6559aa6ac0) at sr_unix/ydb_os_signal_handler.c:85
  #13 <signal handler called>
  #14 ins_errtriple (in_error=150373618) at sr_port/ins_errtriple.c:51
  #15 stx_error_va (in_error=150373618, args=0x7ffe77c31f90) at sr_port/stx_error.c:164
  #16 rts_error_va (csa=0x0, argcnt=1, var=0x7ffe77c32070) at sr_unix/rts_error.c:179
  #17 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #18 iorm_wteol (x=1, iod=0x62d000004840) at sr_unix/iorm_wteol.c:87
  #19 iorm_readfl (v=0x7ffe77c33bb0, width=32767, nsec_timeout=<optimized out>) at sr_unix/iorm_readfl.c:229
  #20 op_readfl (v=0x7ffe77c33bb0, length=32767, timeout=0x7f65555111a0 <literal_notimeout>) at sr_port/op_readfl.c:80
  #21 read_source_file () at sr_unix/source_file.c:290
  #22 compiler_startup () at sr_port/compiler_startup.c:159
  #23 zlcompile (len=11 '\v', addr=0x7ffe77c34820 "generated.m") at sr_port/zlcompile.c:45
  #24 op_zlink (v=0x62d0000062e0, quals=0x7f6555fbe6c0) at sr_unix/op_zlink.c:496
  ```

* The SIG-11 happened because we were trying to access `TREF(pos_in_chain)` to get the last triple
  before we started parsing the current line.

  **sr_port/ins_errtriple.c**
  ```c
    49   x = (TREF(pos_in_chain)).exorder.bl;
    50   /* If first error in the current line/cmd, delete all triples and replace them with an OC_RTERROR triple. */
    51   add_rterror_triple = (OC_RTERROR != x->exorder.fl->opcode);
  ```

  But it turns out we are issuing an error even before we started parsing the first line in the M program.
  This is because the `iorm_wteol()` call, made while trying to read from the M source file as part of the
  ZLINK, tried to write an EOL to the source M program. It cannot, because the source is opened read-only,
  and so it issued an ERR_DEVICEREADONLY error.

  And because of this, the contents of `TREF(pos_in_chain)` are not appropriately initialized and so are not
  reliable (they will contain triples left over from the previous compile and can point to freed memory
  or NULL pointers resulting in SIG-11).

Fix
---
* The first fix is to initialize `TREF(pos_in_chain)` to `*TREF(curtchain)` in `sr_port/tripinit.c` right
  after `TREF(curtchain)` is initialized.

  This way any errors in compilation will result in `ins_errtriple()` referencing an initialized
  `TREF(pos_in_chain)`.

* The second fix is in `sr_port/ins_errtriple.c` where we should now account for the possibility that
  `TREF(pos_in_chain).exorder.bl` could be `NULL`. In that case, we should add an `OC_RTERROR` triple
  just like we would if we find that the start of the current M line already has triples and the first
  triple in that chain is not already an `OC_RTERROR` triple. So the change is to set the
  `add_rterror_triple` variable to TRUE in case we find `TREF(pos_in_chain).exorder.bl` is NULL.

* With just the above two fixes, I noticed the simple test case presented above no longer failing with a
  SIG-11. But it still had some extraneous output.

  ```sh
  $ $ydb_dist/yottadb -run test40

                                     ^-----
                  At column 28, line 1, source module generated.m
  %YDB-E-DEVICEREADONLY, Cannot write to read-only device
  ```

  I expected only the `%YDB-E-DEVICEREADONLY` error line, not the 3 lines before it, which are syntax
  highlighting for a non-existent M source line.

  Turns out this is an issue in `sr_port/show_source_line.c` where we issue a sequence of `ERR_SRCLIN`,
  `ERR_SRCLNNTDSP` and `ERR_SRCLOC` messages to take care of the syntax highlighting even if there is
  no M source code to highlight.

  This is now fixed by checking `line_chwidth` and only if it is greater than 0 do we issue those messages.
  Otherwise we skip those messages.

  With that change, the revised output is as follows. This looks a lot cleaner to me.

  ```sh
  $ $ydb_dist/yottadb -run test40
  %YDB-E-DEVICEREADONLY, Cannot write to read-only device
  ```
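
The NULL check described in the second fix above (in `sr_port/ins_errtriple.c`) can be sketched as follows. `triple_sketch`, the two opcode constants and `need_rterror_triple()` are hypothetical reductions used only for illustration; the real triple structure has many more fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reduction of a triple: an opcode plus forward/backward execution-order links. */
typedef struct triple_sketch
{
    int opcode;
    struct
    {
        struct triple_sketch *fl;
        struct triple_sketch *bl;
    } exorder;
} triple_sketch;

enum { OC_NOOP_SKETCH = 0, OC_RTERROR_SKETCH = 1 };    /* illustrative values */

/* Nonzero if an OC_RTERROR triple needs to be added for the current line. */
static int need_rterror_triple(const triple_sketch *pos_in_chain)
{
    const triple_sketch *x;

    assert(NULL != pos_in_chain);
    x = pos_in_chain->exorder.bl;
    /* New: a NULL back link means no line has been parsed yet, so add the triple. */
    if (NULL == x)
        return 1;
    return (OC_RTERROR_SKETCH != x->exorder.fl->opcode);
}
```

The first fix (initializing `TREF(pos_in_chain)` in `tripinit.c`) is what makes the back link reliably NULL or valid; this check then handles the NULL case without dereferencing it.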

nars1 added a commit that referenced this issue


          [#860] Fix SIG-11 when boolean expressions occur inside extended reference using [] syntax

7b3f8ff

Background
----------
* This is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/860#note_1079650087.

* This comment is to track a longstanding issue identified by ongoing fuzz testing.
  This is an issue present even in the upstream GT.M versions.

* Below is a simple test case demonstrating the issue.

  **Release build**
  ```m
  YDB>lock +[(0!^|"x"|a)]x
  %YDB-F-KILLBYSIGSINFO1, YottaDB process 31691 has been killed by a signal 11 at address 0x00007FEB3DDC7E15 (vaddr 0x0000000000000008)
  %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
  ```

  **Debug build**
  ```m
  YDB>lock +[(0!^|"x"|a)]x
  %YDB-F-ASSERT, Assert failed in sr_port/gvn.c line 188 for expression (NULL != TREF(expr_start))
  ```

Issue
-----
* Below is the stack trace from the assert failure.

  ```c
  (gdb) where
  .
  .
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  gvn () at sr_port/gvn.c:188
  #9  glvn (a=0x7ffde466d940) at sr_port/glvn.c:38
  #10 expratom (a=0x7ffde466d940) at sr_port/expratom.c:27
  #11 eval_expr (a=0x7ffde466dc00) at sr_port/eval_expr.c:248
  #12 expritem (a=0x7ffde466dc00) at sr_port/expritem.c:551
  #13 expratom (a=0x7ffde466dc00) at sr_port/expratom.c:29
  #14 expratom_coerce_mval (a=0x7ffde466dc00) at sr_port/expratom_coerce_mval.c:34
  #15 lkglvn (gblvn=0) at sr_port/lkglvn.c:63
  #16 nref () at sr_port/nref.c:40
  #17 m_lock () at sr_port/m_lock.c:93
  #18 cmd () at sr_port/cmd.c:312
  #19 linetail () at sr_port/linetail.c:35
  #20 op_commarg (v=0x5603a5cfe598, argcode=19 '\023') at sr_port/op_commarg.c:84
  #21 op_dmode () at sr_port/op_dmode.c:159

  (gdb) f 8
  #8  gvn () at sr_port/gvn.c:188
  188                             assert(NULL != TREF(expr_start));
  ```

* The issue is that in frame number 8, we saw `TREF(shift_side_effects)` to be TRUE at line 55.

  **sr_port/gvn.c**
  ```c
     55         if (shifting = (TREF(shift_side_effects) && (!TREF(saw_side_effect) || (YDB_BOOL == TREF(ydb_fullbool)
  ```

* This caused the `shifting` variable to be set to TRUE.

* And at the end of that function, we had to insert an `OC_GVSAVTARG` triple but found that `TREF(expr_start)`
  was NULL.

* The issue is that `TREF(expr_start)` and `TREF(shift_side_effects)` were out of sync.

* If `TREF(shift_side_effects)` was non-zero, then `TREF(expr_start)` should also have been non-NULL.

* `TREF(shift_side_effects)` was set to 1 by frame number 11 in the below line.

  **sr_port/eval_expr.c**
  ```c
    104                                 TREF(shift_side_effects) = TRUE;
  ```

* And `TREF(expr_start)` was also set to a non-NULL value around then.

  **sr_port/eval_expr.c**
  ```c
     95                                 TREF(expr_start) = TREF(expr_start_orig) = ref;
  ```

* But the issue was that frame 8 `gvn()` invoked `expr()`.

  **sr_port/gvn.c**
  ```c
     69                         parse_status = expr(sb1++, MUMPS_EXPR);
  ```

  And that in turn did the following.

  **sr_port/expr.c**
  ```c
     29         INCREMENT_EXPR_DEPTH;
  ```

  And this macro found `TREF(expr_depth)` set to 0 and therefore cleared `TREF(expr_start)`.

  **sr_port/compiler.h**
  ```c
    420 #define INCREMENT_EXPR_DEPTH
    424         if (!(TREF(expr_depth))++)
    425                 TREF(expr_start) = TREF(expr_start_orig) = NULL;
  ```

* Therefore, `TREF(expr_start)` was non-NULL when we entered frame 8 `gvn()` but was NULL
  towards the end of that function and that is the issue.

* The real issue is that `TREF(expr_depth)` was 0 even though we were already evaluating a boolean
  expression (and doing shifting operations for global references).

* And the cause of this is that there are 3 callers of `eval_expr()`.
  - sr_port/bool_expr.c
  - sr_port/expr.c
  - sr_port/expritem.c

* The first 2 of the above callers do an `INCREMENT_EXPR_DEPTH` before calling `eval_expr()`.

* But the 3rd caller does not. And that is where the issue lies.

* It is not clear to me why this inconsistency was there all this while. I suspect it is an oversight
  instead of being intentional.

Fix
---
* The fix is very simple and that is to call `INCREMENT_EXPR_DEPTH` (and `DECREMENT_EXPR_DEPTH`) in
  the 3rd caller `sr_port/expritem.c` before calling `eval_expr()`. This ensures `TREF(expr_depth)`
  stays a non-zero value in case `TREF(expr_start)` gets set to a non-NULL value inside `eval_expr()`.

* Additionally, I also added an assert in `sr_port/eval_expr.c` that if ever we set `TREF(expr_start)`
  to a non-NULL value, the `TREF(expr_depth)` global variable better be greater than 0.
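
The interaction between the depth counter and `TREF(expr_start)` can be reproduced in a small model. Everything below is a hypothetical reduction: plain statics replace the `TREF()` variables, the increment macro mirrors only the two lines quoted above, and the decrement macro is assumed to just lower the counter.

```c
#include <assert.h>
#include <stddef.h>

static int          expr_depth;     /* stand-in for TREF(expr_depth) */
static const void   *expr_start;    /* stand-in for TREF(expr_start) */
static int          marker;         /* something non-NULL to point at */

/* Mirrors the quoted macro: the outermost increment clears expr_start. */
#define INCREMENT_EXPR_DEPTH_SKETCH do { if (!expr_depth++) expr_start = NULL; } while (0)
/* Assumed shape: only the counter handling is modeled here. */
#define DECREMENT_EXPR_DEPTH_SKETCH do { assert(0 < expr_depth); expr_depth--; } while (0)

/* Models expr() being reached from gvn() while a boolean expression is being evaluated. */
static void nested_expr_sketch(void)
{
    INCREMENT_EXPR_DEPTH_SKETCH;
    DECREMENT_EXPR_DEPTH_SKETCH;
}

/* Models eval_expr(): sets expr_start, then parses a nested expression. Returns 1 if expr_start survived. */
static int eval_expr_sketch(void)
{
    expr_start = &marker;
    nested_expr_sketch();
    return (NULL != expr_start);
}

/* Old behavior of the 3rd caller: no depth bump around eval_expr(). */
static int caller_without_depth(void)
{
    return eval_expr_sketch();
}

/* Fixed behavior: bump the depth so the nested increment is not the outermost one. */
static int caller_with_depth(void)
{
    int survived;

    INCREMENT_EXPR_DEPTH_SKETCH;
    survived = eval_expr_sketch();
    DECREMENT_EXPR_DEPTH_SKETCH;
    return survived;
}
```

In this model `caller_without_depth()` loses `expr_start` because the nested increment sees a depth of 0, while `caller_with_depth()` keeps it. The additional assert mentioned above is left out of the model so that the old behavior can still be exercised.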

nars1 added a commit that referenced this issue


          [#935] Fix assert failure for main M program that invokes Simple API through an external call

8c8c971

Background
----------
* The `simpleapi/nodenext` subtest in the YDBTest project failed occasionally with the following signature.

  ```c
  %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 1187 for expression (MUMPS_CALLIN == invocation_mode)
  ```

* From the debugger, the value of `invocation_mode` was 4 (i.e. `MUMPS_DIRECT`, not `MUMPS_CALLIN`).
  And hence the assert failure.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140498704824128) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140498704824128) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140498704824128, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffc97bac600) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  sys_canc_timer () at sr_unix/gt_timers.c:1187
  #9  gtm_cancel_timer (tid=94629833760832) at sr_unix/gt_timers.c:558
  #10 wcs_wtstart (region=0x5610b9a07970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:843
  #11 wcs_stale (tid=94629833767280, hd_len=8, region=0x5610b997bb78) at sr_port/t_end_sysops.c:1420
  #12 timer_handler (why=0, info=0x7fc8675d29a8 <stapi_signal_handler_oscontext+11048>, context=0x7fc8675d2a28 <stapi_signal_handler_oscontext+11176>, is_os_signal_handler=0) at sr_unix/gt_timers.c:914
  #13 check_for_deferred_timers () at sr_unix/gt_timers.c:1314
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #15 check_for_timer_pops (sig_handler_changed=1) at sr_unix/gt_timers.c:1365
  #16 op_fnfgncal (n_mvals=6, dst=0x0, package=0x5610b98e21c0, extref=0x5610b98e2060, mask=1, argcnt=1) at sr_unix/op_fnfgncal.c:1278

  (gdb) f 8
  #8  sys_canc_timer () at sr_unix/gt_timers.c:1187
  1187                    assert(MUMPS_CALLIN == invocation_mode);

  (gdb) p invocation_mode
  $1 = 4
  ```

* This assert was introduced in 5989822 (as part of a #935 commit).

Issue
-----
* The code below (originated in 5989822) assumes that if the `if` check at line 1180 succeeds, we are guaranteed
  it is a Simple API application. And hence the assert in line 1187.

  **sr_unix/gt_timers.c**
  ```c
   1180         if (!simpleThreadAPI_active && IS_SIMPLEAPI_MODE)
   1181         {       /* This process uses YottaDB in Simple API mode. In this case, it is possible the Simple API application
   1182                  * spawns multiple threads (but ensures serial access to the YottaDB engine). This was seen when using
   1183                  * the YDBPython wrapper (YDB#935). In this case, we need to not just clear "posix_timer_created" but
   1184                  * also "posix_timer_thread_id" since it could otherwise point to a dead thread-id.
   1185                  * Hence we use the macro call below to clear both fields.
   1186                  */
   1187                 assert(MUMPS_CALLIN == invocation_mode);
   1188                 CLEAR_POSIX_TIMER_FIELDS_IF_APPLICABLE;
   1189         } else
   1190         {       /* This process uses YottaDB in one of the following modes.
   1191                  *   1) Simple Thread API       (i.e. invocation_mode = MUMPS_CALLIN)
   1192                  *      In this case, we want to keep the non-zero "posix_timer_thread_id" as is (points to the MAIN worker
   1193                  *      thread id).
   1194                  *   2) yottadb -direct (i.e. invocation_mode = MUMPS_DIRECT)
  ```

* But it is possible for an M main program to invoke an external call that in turn makes Simple API calls
  and starts and cancels timers while inside the external call.

* It is exactly that which happened in this test failure.

* In that case, `invocation_mode` would be `MUMPS_DIRECT` because the main program was a `yottadb -direct` invocation.
  But because it is in an external call and we did Simple API calls, the `IS_SIMPLEAPI_MODE` macro is TRUE.

Fix
---
* The original intent behind 5989822 was to allow a main program that is YDBPython or C and makes Simple API
  calls to invoke the `CLEAR_POSIX_TIMER_FIELDS_IF_APPLICABLE` macro, and to leave the code flow the same for
  other invocation modes.

* To ensure this is the case, the `MUMPS_CALLIN == invocation_mode` condition of the `assert` in line 1187 is
  moved into the `if` check at line 1180.
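
A minimal sketch of one plausible reading of this change, assuming stand-in definitions. The enum values, the `is_simpleapi_mode` flag (standing in for the `IS_SIMPLEAPI_MODE` macro) and the function `should_clear_posix_timer_fields()` are hypothetical and exist only to make the example self-contained; the real code is in `sr_unix/gt_timers.c`.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for YottaDB internals, for illustration only. */
enum invocation_mode_t { MUMPS_DIRECT, MUMPS_CALLIN };

/* Returns true when the CLEAR_POSIX_TIMER_FIELDS_IF_APPLICABLE path would be taken.
 * The invocation mode is part of the condition itself instead of being asserted
 * inside the block, so an M main program (MUMPS_DIRECT) that makes Simple API calls
 * from an external call falls through to the "else" path.
 */
bool should_clear_posix_timer_fields(bool simple_thread_api_active, bool is_simpleapi_mode,
				     enum invocation_mode_t invocation_mode)
{
	return !simple_thread_api_active && is_simpleapi_mode && (MUMPS_CALLIN == invocation_mode);
}
```

With this shape, the `yottadb -direct` plus external call scenario described above no longer reaches the block that used to contain the assert.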

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] [ASYNCIO] Fix assert failure due to sleep for 0 microseconds while waiting for WIP queue to clear

Background
----------
* This is an internal test failure that happened once. One of the symptoms of the failure was the following
  assert.

  ```diff
  > v53003_0_4/D9I10002706/bkgrnd_d002706.mje1
  > %YDB-F-ASSERT, Assert failed in sr_unix/sleep.c line 28 for expression ((8 == SIZEOF(useconds)) || ((MICROSECS_IN_SEC > useconds) && (0 < useconds)))
  ```

* The gdb analysis of the resulting core file showed the following.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140004321584192) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140004321584192) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140004321584192, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd4c844df0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  m_usleep (useconds=0) at sr_unix/sleep.c:28
  #9  wcs_sleep (sleepfactor=0) at sr_port/wcs_sleep.c:28
  #10 wait_for_wip_queue_to_clear (cnl=0x7f55457cb000, crwipq=0x7f5545a5b000, cr=0x7f5545abe6a8, reg=0x61d0000015b0) at sr_unix/wcs_wt.h:122
  #11 wcs_get_space (reg=0x61d0000015b0, needed=0, cr=0x7f5545abe6a8) at sr_unix/wcs_get_space.c:211
  #12 bt_put (reg=0x61d0000015b0, block=1833) at sr_port/bt_put.c:78
  #13 bg_update_phase1 (cs=0x7f55483e7680 <cw_set+192>, ctn=140737488387042, si=0x0) at sr_port/t_end_sysops.c:471
  #14 t_end (hist1=0x61b000194500, hist2=0x6160000ad480, ctn=18446744071629176832) at sr_port/t_end.c:1664
  #15 gvcst_kill2 (do_subtree=1, span_status=0x0, killing_chunks=0) at sr_port/gvcst_kill.c:781
  #16 gvcst_kill (do_subtree=1) at sr_port/gvcst_kill.c:149
  #17 op_gvkill () at sr_port/op_gvkill.c:83

  (gdb) f 8
  #8  0x00007f55466a5222 in m_usleep (useconds=0) at sr_unix/sleep.c:28
  28              SLEEP_USEC(useconds, TRUE);

  (gdb) p useconds
  $4 = 0

  (gdb) up
  #9  0x00007f5547016fe2 in wcs_sleep (sleepfactor=0) at sr_port/wcs_sleep.c:28
  28              SHORT_SLEEP(slpfctr);

  (gdb) p slpfctr
  $5 = 0

  (gdb) up
  #10 0x00007f5547624a07 in wait_for_wip_queue_to_clear (cnl=0x7f55457cb000, crwipq=0x7f5545a5b000, cr=0x7f5545abe6a8, reg=0x61d0000015b0) at sr_unix/wcs_wt.h:122
  122                     wcs_sleep(lcnt);

  (gdb) p lcnt
  $6 = 0
  ```

Issue
-----
* `wcs_sleep()` is not designed to be invoked for a `0` millisecond sleep. Here `lcnt` was `0`, so `m_usleep()`
  ended up being called with `useconds=0` and its assert failed.

Fix
---
* `lcnt` is reset to `1` in case it becomes `0` after the modulo operation (`%`).
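
The idea can be sketched as follows. The modulus `HYPOTHETICAL_MAX_SLEEP_FACTOR` and the helper name `next_sleep_factor()` are assumptions made for this illustration and are not taken from `sr_unix/wcs_wt.h`.

```c
/* Hypothetical upper bound for the sleep factor; the real value is not shown in this commit message. */
#define HYPOTHETICAL_MAX_SLEEP_FACTOR	8

/* Advance the loop counter that doubles as the sleep factor.
 * A plain modulo wraps to 0, which would request a zero-length sleep,
 * so the counter is reset to 1 whenever the wrap produces 0.
 */
int next_sleep_factor(int lcnt)
{
	lcnt = (lcnt + 1) % HYPOTHETICAL_MAX_SLEEP_FACTOR;
	if (0 == lcnt)
		lcnt = 1;
	return lcnt;
}
```

The returned value is always at least 1 for non-negative input, so the sleep routine is never asked for a zero-length sleep.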

Notes
-----
* Interestingly, the same fix is present in GT.M V7.0-001 in the upstream project, so YottaDB would eventually
  have received it through an upstream merge anyway.

nars1 added a commit that referenced this issue


          [#935] Remove incorrect assert when there is a SimpleAPI application with multiple threads

660a9a8

Background
----------
* While trying to come up with an automated test case for the #935 fixes, we encountered an assert failure
  in the test case (see https://gitlab.com/YottaDB/DB/YDBTest/-/merge_requests/1528#note_1183513413 for details).

  ```c
  %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 798 for expression (gtm_is_main_thread() || gtm_jvm_process || simpleThreadAPI_active)
  ```

* This assert failure happened only with a Debug build. With a Release build, there were no issues.

Issue
-----
* Below is the C-stack from the debugger at the time of the assert failure.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140643673292800) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140643673292800) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140643673292800, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffe12cbf370) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  timer_handler (why=0, info=0x7fea279638b8 <stapi_signal_handler_oscontext+11048>, context=0x7fea27963938 <stapi_signal_handler_oscontext+11176>, is_os_signal_handler=0) at sr_unix/gt_timers.c:798
  #9  check_for_deferred_timers () at sr_unix/gt_timers.c:1313
  #10 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:216
  #12 __run_exit_handlers (status=0, listp=0x7fea28516838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
  #13 __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
  #14 __libc_start_call_main (main=main@entry=0x55bda12f3b80, argc=argc@entry=2, argv=argv@entry=0x7ffe12cbf998) at ../sysdeps/nptl/libc_start_call_main.h:74
  #15 __libc_start_main_impl (main=0x55bda12f3b80, argc=2, argv=0x7ffe12cbf998, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe12cbf988) at ../csu/libc-start.c:392
  #16 _start ()

  (gdb) f 8
  #8  timer_handler (why=0, info=0x7fea279638b8 <stapi_signal_handler_oscontext+11048>, context=0x7fea27963938 <stapi_signal_handler_oscontext+11176>, is_os_signal_handler=0) at /Distrib/YottaDB/V998_R135/sr_unix/gt_timers.c:798
  798             assert(gtm_is_main_thread() || gtm_jvm_process || simpleThreadAPI_active);

  (gdb) p gtm_main_thread_id_set
  $1 = 1

  (gdb) p/x gtm_main_thread_id
  $2 = 0x7fea269f2640

  (gdb)  p/x posix_timer_thread_id
  $3 = 0x1165e

  (gdb) info threads
    Id   Target Id                         Frame
  * 1    Thread 0x7fea282fc000 (LWP 71310) __pthread_kill_implementation (no_tid=0, signo=3, threadid=140643673292800) at ./nptl/pthread_kill.c:44

  (gdb) p simpleThreadAPI_active
  $4 = 0
  ```

* In this case, `gtm_is_main_thread()` returns FALSE because the current thread is not the thread that
  the process started out with. That is why the assert fails.

* This is a case of a Simple API application with multiple threads where the application ensures only one
  thread makes calls to the YottaDB engine. The assert does not handle that case.

Fix
---
* It is not considered worth the effort to make the assert more complicated by adding a test for that case
  and so the assert in `sr_unix/gt_timers.c` is now removed.

* A similar assert in `sr_unix/aio_shim.c` is also removed for the same reason.

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] Fix rare assert failure in read_regions() in sr_unix/gtmsource_readfiles.c

7ef77e4

Background
----------
* The `multisrv_crash_1/M_REORG_CRASH` subtest failed in a rare occurrence in internal testing
  with an assert failure in the source server.

* The core file from the assert failure had the following stack and relevant variables.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffeeef88400) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  read_regions (buff=0x7ffeeef89400, buff_avail=0x7ffeeef89440, attempt_open_oldnew=0, brkn_trans=0x7ffeeef89450, read_jnl_seqno=591774) at sr_unix/gtmsource_readfiles.c:1922
  #7  read_and_merge (buff=0x7f3fdee8e3c0 "", maxbufflen=1281752, read_jnl_seqno=591774) at sr_unix/gtmsource_readfiles.c:1578
  #8  gtmsource_readfiles (buff=0x7f3fdee63968 "\rp", data_len=0x7ffeeef8a4c0, maxbufflen=1281752, read_multiple=1) at sr_unix/gtmsource_readfiles.c:2092
  #9  gtmsource_get_jnlrecs (buff=0x7f3fded9c848 "\rp", data_len=0x7ffeeef8a4c0, maxbufflen=2097144, read_multiple=1) at sr_unix/gtmsource_process_ops.c:1030
  #10 gtmsource_process () at sr_unix/gtmsource_process.c:1565
  #11 gtmsource () at sr_unix/gtmsource.c:525
  #12 mupip_main (argc=10, argv=0x7ffeeef8fd08, envp=0x7ffeeef8fd60) at sr_unix/mupip_main.c:122
  #13 dlopen_libyottadb (argc=10, argv=0x7ffeeef8fd08, envp=0x7ffeeef8fd60, main_func=0x524480 <.str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=10, argv=0x7ffeeef8fd08, envp=0x7ffeeef8fd60) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  read_regions (buff=0x7ffeeef89400, buff_avail=0x7ffeeef89440, attempt_open_oldnew=0, brkn_trans=0x7ffeeef89450, read_jnl_seqno=591774) at sr_unix/gtmsource_readfiles.c:1922
  1922                    assert((0 < read_len) || (repl_errno == EREPL_JNLEARLYEOF));

  (gdb) p read_len
  $1 = 0

  (gdb) p repl_errno
  $2 = 258
  ```

* Below are the numeric values of the EREPL* error codes from the source code.

  **sr_port/repl_errno.h**
  ```c
     25         EREPL_RECV,                             /* 258 */
     33         EREPL_JNLEARLYEOF,                      /* 266 */
  ```

* Therefore, `repl_errno` ended up being `EREPL_RECV` in the assert failure but the assert expected
  it to be `EREPL_JNLEARLYEOF`.

* The corresponding source server log had the following 2 lines at the very end.

  ```
  Thu Dec  8 21:57:38 2022 : Connection reset while attempting to receive from secondary. Status = 104 ; Connection reset by peer
  Thu Dec  8 21:57:39 2022 : State change detected in read_transaction
  ```

* The `State change detected in read_transaction` message indicates we went through line 1070 below
  in the `read_transaction()` function.

  **sr_unix/gtmsource_readfiles.c**
  ```c
       961 static  int read_transaction(repl_ctl_element *ctl, unsigned char **buff, int *bufsiz, seq_num read_jnl_seqno)
         .
         .
      1068      if (gtmsource_recv_ctl_nowait())
      1069      {
  --> 1070              repl_log(gtmsource_log_fp, TRUE, TRUE, "State change detected in read_transaction\n");
      1071              gtmsource_set_lookback();       /* In case we read ahead, enable looking back. */
      1072              return 0;
      1073      }
  ```

* And the actual assert failure happened in line 1922 below.

  **sr_unix/gtmsource_readfiles.c**
  ```c
      1921           read_len = read_transaction(ctl, buff, buff_avail, read_jnl_seqno);
  --> 1922           assert((0 < read_len) || (repl_errno == EREPL_JNLEARLYEOF));
      1923           if (GTMSOURCE_NOW_TRANSITIONAL(gtmsource_state_sav))
      1924                   return 0;
      1925           cumul_read += read_len;
  ```

* Notice that just after the assert there is a check of the `GTMSOURCE_NOW_TRANSITIONAL` macro in
  line 1923.

* This macro and the macros it depends on are pasted below.

  **sr_unix/gtmsource.h**
  ```c
    366 #define GTMSOURCE_CHANGED_STATE(STATEVAR)       (((STATEVAR) != gtmsource_state) && ((STATEVAR) != GTMSOURCE_DUMMY_STATE))

    371 #define GTMSOURCE_IS_TRANSITIONAL_STATE()                                                               \
    372                 ((GTMSOURCE_CHANGING_MODE == gtmsource_state)                                           \
    373                         || (GTMSOURCE_WAITING_FOR_CONNECTION == gtmsource_state)                        \
    374                         || (GTMSOURCE_WAITING_FOR_XON == gtmsource_state))
    375
    376 #define GTMSOURCE_NOW_TRANSITIONAL(STATEVAR)                                                            \
    377                 ((GTMSOURCE_CHANGED_STATE(STATEVAR) && GTMSOURCE_IS_TRANSITIONAL_STATE())               \
    378                         GTMTLS_ONLY(|| (REPLTLS_WAITING_FOR_RENEG_ACK == repl_tls.renegotiate_state))   \
    379                         || (GTMSOURCE_HANDLE_ONLN_RLBK == gtmsource_state))
  ```

* From the debugger, below are the values of relevant variables.

  ```c
  (gdb) p gtmsource_state_sav
  $1 = GTMSOURCE_SENDING_JNLRECS

  (gdb) p gtmsource_state
  $2 = GTMSOURCE_WAITING_FOR_CONNECTION
  ```

* This means the `GTMSOURCE_NOW_TRANSITIONAL(gtmsource_state_sav)` check in line 1923 above would
  have returned `TRUE` and would have caused us to return right away (`return 0` in line 1924) in this
  failing case if the assert in line 1922 were not there.

Fix
---
* One way to fix the assert is to add an `|| (EREPL_RECV == repl_errno)` check to line 1922.

* But I am not sure what other error codes `repl_errno` can hold in all possible cases where the
  `GTMSOURCE_NOW_TRANSITIONAL()` macro returns `TRUE`. All of those would need to be added to the assert
  as `||` cases too.

* Therefore, decided to fix this by moving the assert from line 1922 to line 1925. That way, in case
  the connection with the receiver server gets reset, we return right away and do not execute the assert.

* Note that this is a debug-only issue. A Release build would have returned 0 and continued fine as
  the incorrect assert would not have executed there.
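
A simplified model of the reordering, as a sketch only. The function `accumulate_read()` and its parameters are hypothetical; `now_transitional` stands in for the result of `GTMSOURCE_NOW_TRANSITIONAL(gtmsource_state_sav)`, and the two `EREPL_*` values are the ones quoted from `sr_port/repl_errno.h` above.

```c
#include <assert.h>
#include <stdbool.h>

/* Numeric values as quoted from sr_port/repl_errno.h in this commit message. */
enum { EREPL_RECV = 258, EREPL_JNLEARLYEOF = 266 };

/* Hypothetical model of the statements after the read_transaction() call in read_regions().
 * The state-change check comes first, so a connection reset (repl_errno == EREPL_RECV with
 * read_len == 0) returns before the assert about read_len is evaluated.
 */
int accumulate_read(int read_len, int repl_errno, bool now_transitional, int *cumul_read)
{
	if (now_transitional)
		return 0;	/* state changed: nothing is assumed about read_len or repl_errno */
	assert((0 < read_len) || (EREPL_JNLEARLYEOF == repl_errno));
	*cumul_read += read_len;
	return read_len;
}
```

In this model the failing combination from the core file (`read_len` of 0, `repl_errno` of 258, transitional state) returns 0 without tripping the assert, which matches what a Release build already did.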

nars1 added a commit that referenced this issue


          [#964] Fix ASAN heap-buffer-overflow error in $ZTRIGGER

b1b8e74

Background
----------
Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267198021

* https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267114939 mentions that the ASAN
  `heap-buffer-overflow` error is not reproducible. I was able to later find a reproducible
  test case. Pasting that below. This failure requires a build of YottaDB with ASAN enabled.

  ```m
  $ cat asan.m
   if $ztrigger("item","+^a(1) -xecute=")
   if $ztrigger("item","+^a(1")
  ```

  ```c
  $ yottadb -direct < asan.m
  =================================================================
  ==17210==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62d000008c28 at pc 0x0000004987c7 bp 0x7ffdc8a9d080 sp 0x7ffdc8a9c848
  READ of size 33023 at 0x62d000008c28 thread T0
      #0 __asan_memcpy (dbg/yottadb+0x4987c6)
      #1 gtm_memcpy_validate_and_execute sr_port/gtm_memcpy_validate_and_execute.c:44:9
      #2 cli_str_setup sr_unix/cli_lex.c:202:2
      #3 trigger_parse sr_unix/trigger_parse.c:1423:2
      #4 trigger_update_rec sr_unix/trigger_update.c:1417:7
      #5 trigger_update_rec_helper sr_unix/trigger_update.c:2217:19
      #6 trigger_update sr_unix/trigger_update.c:2270:21
      #7 op_fnztrigger sr_port/op_fnztrigger.c:248:31

  0x62d000008c28 is located 0 bytes to the right of 34856-byte region [0x62d000000400,0x62d000008c28)
  allocated by thread T0 here:
      #0 malloc (dbg/yottadb+0x49936d)
      #1 findStorElem sr_port/gtm_malloc_src.h:598:3
      #2 findStorElem sr_port/gtm_malloc_src.h:571:3
      #3 findStorElem sr_port/gtm_malloc_src.h:571:3
      #4 findStorElem sr_port/gtm_malloc_src.h:571:3
      #5 findStorElem sr_port/gtm_malloc_src.h:571:3
      #6 gtm_malloc_main sr_port/gtm_malloc_src.h:743:6
      #7 gtm_malloc_main sr_port/gtm_malloc_src.h:823:19
      #8 gtm_malloc sr_port/gtm_malloc_src.h:1486:9
      #9 gtm_env_init_sp sr_unix/gtm_env_init_sp.c:175:28
      #10 gtm_env_init sr_port/gtm_env_init.c:385:3
      #11 common_startup_init sr_port/common_startup_init.c:121:2
      #12 gtm_main sr_unix/gtm_main.c:103:2
      #13 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #14 main sr_unix/gtm.c:20:9
      #15 __libc_start_main csu/../csu/libc-start.c:308:16

  SUMMARY: AddressSanitizer: heap-buffer-overflow (dbg/yottadb+0x4987c6) in __asan_memcpy
  ```

Issue
-----
* Notice that in `asan.m`, there are 2 calls to `$ztrigger()`. The second call does not have a closing
  paren after the `^a(1` but the first call had the `)` at the exact same spot.

* Because of this, line 1049 below in the `process_subscripts()` function incorrectly concludes that
  the `)` was seen even in the second call. This is because even though `len` is 0 at that point
  (indicating we are at the end of the input buffer), we check `*ptr` which is `)` from the previous
  `$ztrigger()` call. Therefore the error at line 1051 is not issued.

  **sr_unix/trigger_parse.c**
  ```c
    694 STATICFNDEF boolean_t process_subscripts(char *subscr_str, uint4 *subscr_len, char **next_str, char *out_str, int4 *out_max)
    695 {
      .
    730         while ((0 < len) && (')' != *ptr))
    731         {
      .
   1048         }
   1049         if ((0 == len) && (')' != *ptr))
   1050         {
   1051                 util_out_print_gtmio("Missing \")\" after global subscript", FLUSH);
   1052                 return FALSE;
   1053         }
  ```

* And because no error was issued, line 1409 below (in the caller) continues processing to line 1425
  below where the first parameter ends up being a negative value which when treated as a `uint4` ends
  up being a huge positive quantity.

  **sr_unix/trigger_parse.c**
  ```c
   1409                 if (!process_subscripts(ptr1, &len, &ptr2, values[GVSUBS_SUB], &max_output_len))
   1410                 {
   1411                         ERROR_MSG_RETURN("", input_len, input);
   1412                 }
   1413         } else
   1414                 len = 0;
   1415         if (0 > --max_output_len)
   1416         {
   1417                 util_out_print_gtmio("Error : Trigger definition too long", FLUSH);
   1418                 return TRIG_FAILURE;
   1419         }
   1420         values[GVSUBS_SUB][len] = '\0';
   1421         value_len[GVSUBS_SUB] = (uint4)len;
   1422         save_cmd_ary = cmd_ary;
   1423         cmd_ary = &trigger_cmd_ary[0];
   1424         gtm_cli_interpret_string = FALSE;
   1425         cli_str_setup(input_len - (uint4)(ptr2 - input), ptr2);
  ```

* And later inside `cli_str_setup()` (lines 190 and 201 below) we end up with an `alloclen` and `addrlen`
  that is close to `MAX_LINE` resulting in a `memcpy()` call that goes past the allocated buffer limits
  resulting in the `heap-buffer-overflow` ASAN error.

  **sr_unix/cli_lex.c**
  ```c
    185 void cli_str_setup(uint4 addrlen, char *addr)
    186 {       /* callers trigger_parse and zl_cmd_qlf create command strings with knowledge of their length */
      .
    190         alloclen = ((MAX_LINE <= addrlen) ? MAX_LINE : addrlen) + 1;
      .
    201         addrlen = MIN(addrlen, alloclen - 1);
    202         memcpy(cli_lex_in_ptr->in_str, addr, addrlen);
  ```
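
A small sketch of the unsigned wraparound described above, assuming `uint4` is a 32-bit unsigned type. The helper name `remaining_length()` and the lengths used in the note below are illustrative and are not taken from the failing run.

```c
#include <stdint.h>

typedef uint32_t uint4;	/* assumption: uint4 is a 32-bit unsigned integer */

/* Mirrors the shape of the first argument passed to cli_str_setup() at line 1425.
 * The subtraction is done in unsigned arithmetic, so if the parser has advanced
 * "ptr2" past the end of the input, the result wraps around to a huge positive
 * value instead of becoming negative.
 */
uint4 remaining_length(uint4 input_len, const char *input, const char *ptr2)
{
	return input_len - (uint4)(ptr2 - input);
}
```

For example, with an `input_len` of 5 and `ptr2` pointing 6 bytes past `input`, the result is `UINT32_MAX` rather than `-1`. Lines 190 and 201 of `cli_str_setup()` then clamp that value to `MAX_LINE`, which is still far more than the bytes actually available at `addr`.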

Fix
---
* The fix is simple and is in `process_subscripts()` to fix the incorrect check of whether a `)` was seen.

* Since the `while` loop at line 730 below terminates either if `len` is 0 or if `*ptr` is `)`, all that is
  needed is for line 1049 to just check if `0 == len`. That implies we terminated the while loop without seeing
  a `)`, in which case we should issue an error. There is no need to also check `')' != *ptr` as the code
  currently does below.

  **sr_unix/trigger_parse.c**
  ```c
    694 STATICFNDEF boolean_t process_subscripts(char *subscr_str, uint4 *subscr_len, char **next_str, char *out_str, int4 *out_max)
    695 {
      .
    730         while ((0 < len) && (')' != *ptr))
    731         {
      .
   1048         }
   1049         if ((0 == len) && (')' != *ptr))
  ```
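
The corrected termination check can be modeled with a much smaller loop. This is a sketch under stated assumptions: `saw_closing_paren()` is a hypothetical helper, not the real `process_subscripts()`, and it only models the loop exit condition and the check that follows it.

```c
#include <stdbool.h>
#include <stddef.h>

/* "len" is the number of valid bytes at "ptr". Bytes beyond "len" may hold stale
 * data from an earlier call and must not be examined.
 */
bool saw_closing_paren(const char *ptr, size_t len)
{
	while ((0 < len) && (')' != *ptr))
	{
		ptr++;
		len--;
	}
	/* The loop exits either because the input ran out or because ')' was found.
	 * Checking only "len" avoids looking at a stale byte at "*ptr".
	 */
	return (0 != len);
}
```

If the buffer holds `1)` left over from a previous call but only 1 byte is valid, the helper reports that no `)` was seen, which is the case where the "Missing \")\"" error should be issued.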

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] Remove incorrect assert in sr_port/lvzwr_out.c (missed out in dee9d0c)

5e95519

Background
----------
* The `mem_stress_1/memleak` subtest failed in one rare test run on a slow in-house system with
  various core files. Below is an analysis of the first core file using gdb.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffed66c2ec0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  lvzwr_out_targkey (one=0x7ffed66c30c0) at sr_port/lvzwr_out.c:96
  #7  lvzwr_out (lvp=0x103fb48) at sr_port/lvzwr_out.c:286
  #8  lvzwr_var (lv=0x103fb48, n=3) at sr_port/lvzwr_var.c:233
  #9  lvzwr_var (lv=0x103faf0, n=2) at sr_port/lvzwr_var.c:312
  #10 lvzwr_var (lv=0x103fa98, n=1) at sr_port/lvzwr_var.c:312
  #11 lvzwr_var (lv=0x10c95d0, n=0) at sr_port/lvzwr_var.c:309
  #12 lvzwr_fini (out=0x7ffed66ce590, t=1) at sr_port/lvzwr_fini.c:83
  #13 op_lvpatwrite (count=0, arg1=140732495881408) at sr_port/op_lvpatwrite.c:85
  #14 zshow_zwrite (output=0x7ffed66ce590) at sr_port/zshow_zwrite.c:40
  #15 op_zshow (func=0x7ffed66d66c0, type=1, lvn=0x0) at sr_port/op_zshow.c:220
  #16 jobexam_dump (dump_filename_arg=0x7f28bdf51b60, dump_file_spec=0x10323b8, fatal_file_name_buff=0x7ffed66d7210 "", zshowcodes=0x7f28bdf51b60, dev_in_use=0x7ffed66d67a0) at sr_port/jobexam_process.c:237
  #17 jobexam_process (dump_file_name=0x7f28bdf51b60, zshowcodes=0x7f28bdf51b60, dump_file_spec=0x10323b8) at sr_port/jobexam_process.c:147
  #18 op_fnzjobexam (prelimSpec=0x7f28bdf51b60, zshowcodes=0x7f28bdf51b60, finalSpec=0x10323b8) at sr_port/op_fnzjobexam.c:22

  (gdb) f 6
  #6  lvzwr_out_targkey (one=0x7ffed66c30c0) at sr_port/lvzwr_out.c:96
  96              assert(MAX_STRLEN       /* WARNING assignment below; check in op_putindx should assure this */
  97                      >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len));

  (gdb) p gtm_threadgbl_true->util_outbuff
  $1 = "%YDB-F-ASSERT, Assert failed in sr_port/lvzwr_out.c line 97 for expression (MAX_STRLEN >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len))", '\000' <repeats 5946 times>

  (gdb) p length
  $2 = 1048577

  (gdb) p ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len
  $4 = 1048576
  ```

* Based on this, I was able to come up with a simple test case that demonstrates the same issue.

  ```m
  YDB>set x(1,$justify(2,2**20))="" zwrite x
  %YDB-F-ASSERT, Assert failed in sr_port/lvzwr_out.c line 97 for expression (MAX_STRLEN >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len))
  ```

* This failure happens only in a Debug build. A Release build runs fine and prints a long string
  corresponding to the contents of the subscripted local variable node `x(1,<2**20-long-string>)`
  in the zwrite format.

Issue
-----
* As part of dee9d0c, the following change happened where we started allowing sets of subscripted
  local variable nodes where each subscript is 1MiB long.

* Below is relevant text from the commit message of dee9d0c.

  ```
  Files that had merge conflicts but the V63003 change was discarded
  ------------------------------------------------------------------
  Reason for discard is mentioned below against each module.

  * sr_port/op_fnquery.c & sr_port/op_putindx.c
          --> GTM-6115/GTM-8792 in GT.M V6.3-003 release notes describes that only $QUERY
          --> on lvns with subscripts exceeding 1Mb in total length will be prohibited, not
          --> other operations like SET but the change in this module does the exact opposite.
  ```

* This meant YottaDB allowed SETs of lvns where each subscript was 1MiB long, whereas GT.M did not.

  Below is an example using GT.M V7.0-005.

  GT.M only allows a subscript that is 5 bytes shorter than 1MiB when there are just 2 subscripts in
  the lvn. It does not allow a subscript that is 4 bytes shorter than 1MiB.

  ```m
  GTM>set x($justify(1,2**20-4))=""
  %GTM-E-MAXSTRLEN, Maximum string length exceeded

  GTM>set x($justify(1,2**20-5))=""

  GTM>
  ```

  But if one tries to use 3 subscripts, GT.M only allows a subscript that is 68 bytes short of 1MiB.

  ```m
  GTM>set x($justify(1,2,2**20-5))=""
  %GTM-E-MAXSTRLEN, Maximum string length exceeded

  GTM>set x($justify(1,2,2**20-67))=""
  %GTM-E-MAXSTRLEN, Maximum string length exceeded

  GTM>set x($justify(1,2,2**20-68))=""

  GTM>
  ```

  So the maximum allowed subscript length is dependent on other subscripts in the lvn.

* The assert that failed in `sr_port/lvzwr_out.c` is tied to this logic in GT.M and relies on the
  fact that a SET of such an lvn would have been disallowed in `sr_port/op_putindx.c`.

* But YottaDB allows each subscript to be 1MiB long since dee9d0c, independent of other subscripts
  in the lvn.

* Therefore this assert should have been removed as part of dee9d0c but was missed then.

Fix
---
* The assert is removed in this commit. Along with it, a debug-only variable `length` as well as some
  comments describing the reliance on the obsolete `sr_port/op_putindx.c` behavior also got removed.

nars1 added a commit that referenced this issue


          [#722] MUPIP REPLIC -SOURCE -TRIGUPDATE replicates updates inside triggers too

a2c2efb

Background
----------
* Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/722#description

  _Database updates made by triggers are [not propagated by the replication
  stream](https://docs.yottadb.com/ProgrammersGuide/triggers.html#multisite-database-replication) because
  the design point was that database updates from triggers are derivable from the primary update. However,
  this means that times (and other data such as process ids) are not replicated and hence not easily
  recreated. The proposed enhancement adds an option when starting a Source Server to include trigger
  updates in the replication stream._

Core changes
------------
* `sr_unix/mupip_cmd.c` has a new `TRIGUPDATE` option in the `gtmsource_qual[]` array.

* `-trigupdate` is allowed only if `-secondary` is also specified. This limits the possible commands that
  can specify `-trigupdate` to `mupip replic -source -start` or `mupip replic -source -activate` when they
  also specify `-secondary=...`. Towards this, `sr_unix/mupip_cmd_disallow.c` is modified to disallow
  `TRIGUPDATE` if `SECONDARY` is not also specified.

* An active source server startup (`mupip replic -source -start -secondary=...`) or a source server activation
  (`mupip replic -source -activate -secondary=...`) now supports an optional `-trigupdate` option which when
  specified implies that database updates made inside trigger invocations are also included in the replication
  stream.

* The fact that a `-trigupdate` option was specified is noted down in the boolean valued variable
  `gtmsource_options.trigupdate` (set to FALSE if `-trigupdate` was not specified and TRUE if it was).

  For a `mupip replic -source -start` command, the source server that eventually starts is forked off the
  process that specifies the source server startup command and so `gtmsource_options.trigupdate` is inherited
  implicitly and is therefore usable.

  But for a `mupip replic -source -activate` command, the process that specifies this command is different
  from the concurrently running source server and so `gtmsource_options.trigupdate` is only usable in the
  activate process and not transferred to the running source server. It is therefore necessary to copy over
  this user specified option from the activate command to the corresponding source server specific structure
  in the journal pool (`gtmsource_local` structure which is in shared memory). This copy happens in
  `sr_unix/jnlpool_init.c` (for a `mupip replic -source -start`) and in `sr_unix/gtmsource_mode_change.c`
  (for a `mupip replic -source -activate` command).

  And because of the above, the replication filter can safely use `jnlpool->gtmsource_local->trigupdate`
  when it needs to know whether `-trigupdate` was specified if the caller is a source server.

* The replication filter functions `jnl_v44TOv44()` and `jnl_v44TOv24()` were modified in `sr_port/repl_filter.c`
  to check if `-trigupdate` was specified (checked using `gtmsource_local->trigupdate`). And if so, they
  replicate updates that happen inside triggers (those which have `JS_NOT_REPLICATED_MASK` bit set in the
  `nodeflags` member). The `nodeflags` member is modified to clear the `JS_NOT_REPLICATED_MASK` bit in such
  records.

  Additionally they do not replicate any LGTRIG (trigger definitions) or ZTRIG (ZTRIGGER command) or
  $ZTWORMHOLE jnl records.

  Both these filter functions had code that previously issued an error if no conversion occurred as it was
  an out-of-design scenario. That code has now been replaced with code that generates a NULL record since in
  this case the entire transaction consists of journal records that are not replicated. This NULL record
  logic was copied over from the filter function `jnl_v44TOv22()` which was otherwise unchanged because
  that function is only used in case the receiver side is pre-V6.2-000 in which case it does not support
  LGTRIG records (issues a `%YDB-E-REPLNOHASHTREC` error).

  `jrec_null.bitmask.filler` is also initialized in `INITIALIZE_V44_NULL_RECORD` now that this macro is used
  for `jnl_v44TOv44()` filter conversion too (not just for `jnl_v44TOv22()` conversion like previously).
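
The per-record decision described for the filter functions can be sketched as below. This is an illustration only: the bit value of `HYPOTHETICAL_JS_NOT_REPLICATED_MASK`, the `uint32_t` type used for `nodeflags`, and the helper `filter_update_record()` are all assumptions, and the real logic in `sr_port/repl_filter.c` also handles LGTRIG, ZTRIG, $ZTWORMHOLE and NULL records, which are not modeled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value; the real JS_NOT_REPLICATED_MASK is defined in the YottaDB sources. */
#define HYPOTHETICAL_JS_NOT_REPLICATED_MASK	0x01u

/* Returns true if the update record should be placed in the replication stream.
 * When a record made inside a trigger is let through because of -trigupdate,
 * the "not replicated" bit is cleared in the outgoing record.
 */
bool filter_update_record(bool trigupdate, uint32_t *nodeflags)
{
	if (0 == (*nodeflags & HYPOTHETICAL_JS_NOT_REPLICATED_MASK))
		return true;	/* regular update: always replicated */
	if (!trigupdate)
		return false;	/* update made inside a trigger and -trigupdate not specified */
	*nodeflags &= ~HYPOTHETICAL_JS_NOT_REPLICATED_MASK;
	return true;
}
```

If every record of a transaction is dropped by a check like this, nothing is left to send, which is the situation the NULL record generation described above covers.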

Misc changes
------------
* Since the `gtmsource_local_struct` structure has a new `trigupdate` member, `repl_inst_dump_gtmsourcelocal()`
  in `sr_unix/repl_inst_dump.c` was modified to dump this new field.

* Initialization of default values for `CONNECT_PARMS` was duplicated in `sr_unix/gtmsource_get_opt.c` in
  the `if` and `else` blocks. That is now moved to before the `if` thereby avoiding the duplication.

* A stale comment in `sr_port/jnl.h` was fixed (`align_str` was no longer in use like `ztworm_str` or
  `lgtrig_str` was).

* Regenerated GTMDefinedTypesInit*.m for sr_x86_64,sr_aarch64,sr_armv6l due to `gtmsource_local_struct` and
  `gtmsource_options_t` structure layout/size changes.

* Fixed a pre-existing issue with `gv_target` maintenance in `sr_port/gvcst_jrt_null.c`.

  Some background first. After enhancing the YDBTest test framework to test with `-trigupdate`, I noticed
  a rare test failure. The issue showed up only now because more NULL records are possible with the use
  of `-trigupdate`. The symptom was as follows.

  ```sh
  $ cat ##REMOTE_PATH##/online_rollback_1_10/trestartrootverify/instance2/RCVR_22_01_55_4.log.updproc
  .
  .
  Thu Feb 16 22:02:29 2023 :  ----> TPRETRY for sequence number 11252 [0x2bf4]
  Thu Feb 16 22:02:29 2023 : Jnl seq no : 11252 [0x2bf4];Rectype : 17 - TCOM
  %YDB-F-ASSERT, Assert failed in sr_port/gvcst_root_search.c line 210 for expression (cs_addrs == gv_target->gd_csa)
  %YDB-I-IPCNOTDEL, Thu Feb 16 22:02:29 2023 : Update process did not delete IPC resources for region AREG
  ```

  And below was the debugger analysis.

  ```c
  (gdb) where
  #8  gvcst_redo_root_search () at sr_port/gvcst_root_search.c:210
  #9  t_retry (failure=cdb_sc_gvtrootmod2) at sr_port/t_retry.c:558
  #10 t_end (hist1=0x0, hist2=0x0, ctn=18446744071629176832) at sr_port/t_end.c:1896
  #11 gvcst_jrt_null (salvaged=0) at sr_port/gvcst_jrt_null.c:72
  #12 updproc_actions (gld_db_files=0x62d000001ac0) at sr_port/updproc.c:1046
  #13 updproc () at sr_port/updproc.c:502
  #14 mupip_main (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638) at sr_unix/mupip_main.c:122
  #15 dlopen_libyottadb (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638, main_func=0x563cba3c1020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #16 main (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638) at sr_unix/mupip.c:22

  (gdb) f 8
  #8  gvcst_redo_root_search () at sr_port/gvcst_root_search.c:210
  210             assert(cs_addrs == gv_target->gd_csa);

  (gdb) p cs_addrs
  $1 = (sgmnt_addrs *) 0x62d000026040

  (gdb) p gv_target->gd_csa
  $2 = (sgmnt_addrs *) 0x62d00001e840

  (gdb) p cs_addrs->sgm_info_ptr->gv_cur_region->rname
  $4 = "DREG", '\000' <repeats 27 times>

  (gdb) p $2->sgm_info_ptr->gv_cur_region->rname
  $5 = "DEFAULT", '\000' <repeats 24 times>

  (gdb) p gv_target->gvname
  $6 = {var_name = {char_len = 0, len = 31, addr = 0x62d000057f10 "jrandomvariableinimptpfillprogr"}, hash_code = 1897224776, marked = 0}
  ```

  ```m
  YDB>write $view("region","^jrandomvariableinimptpfillprogr")
  DEFAULT
  ```

  So the assert failure is because `gv_target` pointed to a global variable name that mapped to the `DEFAULT`
  region whereas the current region, where we were writing the NULL record, was `DREG`.

  The cause of the assert was the call to `gvcst_redo_root_search()` in line 558.

  **sr_port/t_retry.c**
  ```c
      545   case cdb_sc_gvtrootmod2:
      546           if (!redo_root_search_done)
      547                   RESET_ALL_GVT_CLUES;
      548           /* It is possible for a read-only transaction to release crit after detecting gvtrootmod2, during
      549            * which time yet another root block could have moved. In that case, the MISMATCH_ROOT_CYCLES check
      550            * would have already done the redo_root_search.
      551            */
      552           assert(!redo_root_search_done || !update_trans);
      553           if (WANT_REDO_ROOT_SEARCH)
      554           {       /* Note: An online rollback can occur DURING gvcst_redo_root_search, which can remove gbls
      555                    * from db, leading to gv_target->root being 0, even though failure code is not
      556                    * cdb_sc_onln_rlbk2
      557                    */
  --> 558                   gvcst_redo_root_search();
      559           }
  ```

  The `cdb_sc_gvtrootmod2` failure code is because of a concurrent online rollback on the receiver side.
  To handle it, we invoke `gvcst_redo_root_search()` if the `WANT_REDO_ROOT_SEARCH` macro
  evaluates to TRUE.

  The macro is defined as follows.

  **sr_port/t_retry.c**
  ```c
      67 /* In mu_reorg if we are in gvcst_bmp_mark_free, we actually have a valid gv_target. Find its root before the next iteration
      68  * in mu_reorg.
      69  */
      70 #define WANT_REDO_ROOT_SEARCH                                                           \
  --> 71                         (       (NULL != gv_target)                                     \
      72                              && (DIR_ROOT != gv_target->root)                           \
      73                              && !redo_root_search_done                                  \
      74                              && !TREF(in_gvcst_redo_root_search)                        \
      75                              && !mu_reorg_upgrd_dwngrd_in_prog                          \
      76                              && !mu_reorg_encrypt_in_prog                               \
      77                              && (!TREF(in_gvcst_bmp_mark_free) || mu_reorg_process)     \
      78                         )
  ```

  Line 71 is the issue. `gv_target` should have been `NULL` in the `gvcst_jrt_null()` case since we are
  not dealing with any global name (the `NULL` journal record is to denote an empty transaction).

  Given the above analysis, the fix is to set `gv_target` to `NULL` in `sr_port/gvcst_jrt_null.c`.

  Additionally, `gv_currkey->base[0]` also had to be cleared to keep it in sync with `gv_target` in DEBUG code
  (just as is already done in the `GVTR_SWITCH_REG_AND_HASHT_BIND_NAME` macro in `sr_unix/gv_trigger.h`).
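
  The effect of the fix can be sketched with a reduced model of the macro. This is only an illustration:
  the `region_model` and `gv_target_model` types, the `DIR_ROOT_MODEL` value and both functions are
  hypothetical stand-ins, and only the first two conditions of `WANT_REDO_ROOT_SEARCH` are modeled.

  ```c
  #include <assert.h>
  #include <stddef.h>

  /* Hypothetical stand-ins for the real structures; only the fields used below. */
  typedef struct { const char *rname; } region_model;
  typedef struct { const region_model *gd_csa; int root; } gv_target_model;

  #define DIR_ROOT_MODEL 1	/* assumed placeholder, not the real DIR_ROOT definition */

  /* Reduced WANT_REDO_ROOT_SEARCH: models only the gv_target conditions (lines 71-72). */
  int want_redo_root_search_model(const gv_target_model *gv_target)
  {
  	return (NULL != gv_target) && (DIR_ROOT_MODEL != gv_target->root);
  }

  /* Returns 1 if the retry would reach the assert with mismatching regions, else 0. */
  int retry_hits_region_mismatch(const region_model *cs_addrs, const gv_target_model *gv_target)
  {
  	assert(NULL != cs_addrs);
  	if (!want_redo_root_search_model(gv_target))
  		return 0;	/* the redo root search is skipped entirely */
  	return cs_addrs != gv_target->gd_csa;	/* models assert(cs_addrs == gv_target->gd_csa) failing */
  }
  ```

  With a stale `gv_target` that maps to `DEFAULT` while the current region is `DREG`, the model reports
  a mismatch. With `gv_target` set to `NULL`, the redo root search is never attempted.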

* Fixed `sr_unix/mupip_cmd_disallow.c` to disallow the `ZEROBACKLOG` option if `SHUTDOWN` is not also specified.
  I noticed this issue while adding disallow code for `TRIGUPDATE`. The `ZEROBACKLOG` option was introduced
  in 60e7e2d (GT.M V6.3-000) many years ago, but it was accepted by various source server commands that
  have nothing to do with it (for example, `mupip replic -source -deactivate -zerobacklog` was allowed
  even though it makes no sense). This misfeature is fixed in the current commit by generating a
  `%YDB-E-CLIERR` error for such meaningless commands.
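
  The rule itself is small enough to sketch. The struct and function below are hypothetical and do not
  reflect how `sr_unix/mupip_cmd_disallow.c` actually queries qualifiers. They only capture the
  condition described above.

  ```c
  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical view of two qualifiers on a `mupip replic -source` command line. */
  typedef struct { bool shutdown; bool zerobacklog; } source_quals_model;

  /* Returns true if the command must be rejected (the real code reports %YDB-E-CLIERR). */
  bool disallow_zerobacklog_model(const source_quals_model *quals)
  {
  	assert(NULL != quals);
  	return quals->zerobacklog && !quals->shutdown;
  }
  ```

  Under this model, `-deactivate -zerobacklog` is rejected while `-shutdown -zerobacklog` is accepted.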

* Made `cstart` and `jstart` variables DEBUG_ONLY in `sr_port/repl_filter.c`. This removed all the
  `clang-tidy` warnings related to these variables in the following reference files.
  - ci/tidy_warnings_release_x86_64.ref
  - ci/tidy_warnings_release_aarch64.ref

  In addition, a `Value stored to 'prefix' is never read [clang-analyzer-deadcode.DeadStores]` warning also
  no longer shows up after all changes to `sr_port/repl_filter.c` in this commit. Not exactly sure where
  the change happened but not spending time on it since the warning has now disappeared. This meant removing
  one line of warning from the following reference files.
  - ci/tidy_warnings_release_x86_64.ref
  - ci/tidy_warnings_release_aarch64.ref
  - ci/tidy_warnings_debug_aarch64.ref
  - ci/tidy_warnings_debug_x86_64.ref

* Enhanced `ci/create_tidy_warnings.sh` to capture detail in case of compilation errors.

  As a background, I had a `clang-tidy-amd64` pipeline job fail because my changes to `sr_port/repl_filter.c`
  had a compilation error. In this case, the `clang-tidy-14` call in `ci/create_tidy_warnings.sh` exited with
  a non-zero status but since we had redirected the stderr to `/dev/null` (i.e. `2>/dev/null`) we had no error
  text to look at to see why it exited abnormally. I had to run the same command locally to determine the
  cause of the error.

  This is fixed by redirecting stderr to `$output_dir/tidy_warnings.err`. If `clang-tidy` exits with
  a non-zero status, we examine this file for any lines containing `error` (case insensitive) and print
  those lines to standard output. That makes it more likely we get helpful information about which C file
  had compilation trouble.

  In my case, I saw the following 2 lines of extra output indicating `sr_port/repl_filter.c` had compilation
  errors.

  ```
  4617 warnings and 20 errors generated.
  Error while processing /builds/nars1/YDB/sr_port/repl_filter.c.
  ```

nars1 added a commit that referenced this issue


[YottaDB/Lang/YDBPython#32] Avoid _exit() if exit_handler_complete as it can cause hang with CLANG/ASAN

4b643ed

Background
----------
* While running the YDBOcto tests with CLANG, I noticed various tests hang. All of them had a
  similar stack-trace.

  ```c
  (gdb) where
  #0  __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) ()
  #1  __sanitizer::Semaphore::Wait() ()
  #2  __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) ()
  #3  __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::PerClass*, __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long) ()
  #4  __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, unsigned long, unsigned long) ()
  #5  __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) ()
  #6  __asan::asan_calloc(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*) ()
  #7  calloc ()
  #8  __pthread_attr_extension (attr=0x7f29af3cee48) at ./nptl/pthread_attr_extension.c:28
  #9  __GI___pthread_attr_setaffinity_np (attr=attr@entry=0x7f29af3cee48, cpusetsize=cpusetsize@entry=32, cpuset=cpuset@entry=0x603000001b40) at ./nptl/pthread_attr_setaffinity.c:45
  #10 __pthread_getattr_np (thread_id=139817006390848, attr=0x7f29af3cee48) at ./nptl/pthread_getattr_np.c:194
  #11 __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) ()
  #12 __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) ()
  #13 __asan::PlatformUnpoisonStacks() ()
  #14 __asan_handle_no_return ()
  #15 generic_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:187
  #16 ydb_os_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0) at sr_unix/ydb_os_signal_handler.c:85
  #17 <signal handler called>
  #18 sched_yield () at ../sysdeps/unix/syscall-template.S:120
  #19 __sanitizer::StopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), void*) ()
  #20 __lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*) ()
  #21 __GI___dl_iterate_phdr (callback=0x55bd48373320 <__lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*)>, data=0x7ffe13010eb8) at ./elf/dl-iteratephdr.c:74
  #22 __lsan::LockStuffAndStopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), __lsan::CheckForLeaksParam*) ()
  #23 __lsan::CheckForLeaks() ()
  #24 __lsan::DoLeakCheck() ()
  #25 __cxa_finalize (d=0x55bd483af128) at ./stdlib/cxa_finalize.c:83
  #26 __do_global_dtors_aux ()
  #27 ?? ()
  #28 _dl_fini () at ./elf/dl-fini.c:142
  ```

Issue
-----
* The YottaDB SIG-15/SIGTERM signal handler got invoked for a SIG-15. But it noticed that all YottaDB
  exit handler code has already been run (`exit_handler_complete` global variable is TRUE). In that
  case, it invoked any non-YottaDB signal handler for SIG-15 and afterwards, it invoked `_exit()` to
  terminate the process (in line 187).

  **sr_unix/generic_signal_handler.c**
  ```c
    182         if (exit_handler_complete)
    183         {
    184                 if (!using_alternate_sighandling)       /* Go does not send us signals so no need to forward */
    185                 {
    186                         drive_non_ydb_signal_handler_if_any("generic_signal_handler1", sig, info, context, TRUE);
    187                         UNDERSCORE_EXIT(-sig);
    188                 }
    189                 return;         /* Nothing we can do if exit handler has run */
    190         }
  ```

* And because of the `_exit()` call, the CLANG/ASAN library ended up doing a `calloc()` call which hung
  waiting for a futex, most likely due to re-entrant invocations of C library functions that are not
  async-signal safe.

* The cause of this is line 187 above in my opinion.

* If YottaDB exit handler has already run (as part of SIGTERM handling) and we are getting the SIGTERM signal
  again, then I don't see any reason to do the `_exit()` call (using the `UNDERSCORE_EXIT` macro in line 187).

* This code has been there for a long time but I don't think it is doing the right thing.

Fix
---
* Lines 184-188 are now removed in this commit. I think the right thing to do is to just return in case the
  YottaDB exit handler has already been invoked.

* With this change, I verified that the CLANG/ASAN tests run fine in YDBOcto. So at least one Simple API
  use case runs fine with the fix in this commit.

* Initially I thought of disabling lines 184-188 above only when ASAN is enabled. But then I realized it
  is a good change for all cases and so removed lines 184-188.
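
The behavior after the fix can be sketched as a stand-alone model. This is not YottaDB code: the
variable name `exit_handler_complete` follows the commit text, while `late_signals_ignored`,
`sigterm_handler_model()` and `deliver_late_sigterm()` are hypothetical names used only for the demo.

```c
#include <assert.h>
#include <signal.h>

/* Set once all exit handling has finished (name taken from the commit text). */
volatile sig_atomic_t exit_handler_complete = 0;
/* Hypothetical counter so the demo can observe that the late signal was ignored. */
volatile sig_atomic_t late_signals_ignored = 0;

/* After the fix: if the exit handler already ran, just return (no forwarding, no _exit()). */
void sigterm_handler_model(int sig)
{
	(void)sig;
	if (exit_handler_complete)
	{
		late_signals_ignored++;
		return;		/* Nothing we can do if exit handler has run */
	}
	/* Normal exit handling would go here; it is omitted in this sketch. */
}

/* Installs the handler, pretends exit handling has finished, then delivers a late SIGTERM.
 * Returns the running count of late signals that were ignored.
 */
int deliver_late_sigterm(void)
{
	void (*prev)(int) = signal(SIGTERM, sigterm_handler_model);

	assert(SIG_ERR != prev);
	exit_handler_complete = 1;
	raise(SIGTERM);		/* returns only after the handler has returned */
	return (int)late_signals_ignored;
}
```

The first call to `deliver_late_sigterm()` returns 1 and the process keeps running, which is the
point: a SIGTERM arriving after exit handling is complete no longer triggers any further work
inside the signal handler.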

nars1 added a commit that referenced this issue


[DEBUG-ONLY] Fix assert in DBG_CHECK_DIO_ALIGNMENT macro to correctly detect signal/timer handling

ae495ea

Background
----------
* We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover`
  subtest failed with the following symptom.

  ```sh
  $ cat ROLLBACK1_3.logx
  mupip journal -ROLLBACK -back -verify -verbose "*"  -noonline -resync=369813 -lost=ROLLBACK1_3.lost
  Sat Sep  9 04:17:18 PM EDT 2023
  .
  .
  %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep  9 16:19:23 2023
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF]
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495]
  %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep  9 16:19:23 2023
  .
  .
  %YDB-I-MUINFOSTR,     Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233
  %YDB-I-MUINFOUINT4,     Record Offset : 65744 [0x000100D0]
  %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP
  %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use))
  Sat Sep  9 04:20:35 PM EDT 2023
  The time the mupip command took:  197
  ```

* The core file corresponding to the above assert failure had the following stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110
  #9  mur_close_files () at sr_port/mur_close_files.c:841
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116
  #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445
  #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913
  #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312
  #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344
  #21 op_tcommit () at sr_port/op_tcommit.c:535
  #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323
  #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362
  #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400
  #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122
  #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158
  #27 mupip_recover () at sr_port/mupip_recover.c:588
  #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122
  #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22

  (gdb) p gtm_threadgbl_true->dio_buff.aligned
  $5 = 0x563c78429000 "GDSDYNUNX04"
  (gdb) p csd
  $6 = (sgmnt_data_ptr_t) 0x563c78429000
  (gdb) p timer_in_handler
  $1 = 1
  (gdb) p multi_thread_in_use
  $2 = 0

  (gdb) p forced_exit
  $3 = 2
  (gdb) p exit_handler_active
  $4 = 1
  (gdb) p in_os_signal_handler
  $1 = 0
  ```

Issue
-----
* The assert failure was in the `db_ipcs_reset()` -> `DB_LSEEKREAD` -> `DBG_CHECK_DIO_ALIGNMENT` call chain.

* The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment.

  ```c
     53         /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer     \
     54          * code or in threaded code (as we have only ONE buffer to use). Assert that.                           \
     55          */                                                                                                     \
     56         assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use));    \
  ```

* In the failure case, even though we are executing in timer code we are actually in exit handler code
  (as can be seen by the `forced_exit` and `exit_handler_active` variables in the gdb analysis above).
  In this case, the exit handler code will not return out of the timer code and so it is okay for the
  assert to not be TRUE.

* The global variable being checked in the assert is `timer_in_handler`. This is where the issue is.
  That global variable being TRUE just means the `timer_handler()` function is in the current call stack.
  It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code.
  The assert is intended to protect against a signal handler interrupting the mainline code. Therefore,
  the correct global variable to check in the assert is `in_os_signal_handler`.

Fix
---
* The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.
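
The two forms of the check can be compared with the values from the core file. This is a minimal
sketch: the function names are hypothetical, and it assumes the fix changes nothing in the
assert other than swapping the variable.

```c
#include <assert.h>
#include <stdbool.h>

/* Condition checked before the fix (mirrors the assert text in the failure log). */
bool dio_check_before_fix(bool uses_global_dio_buff, bool timer_in_handler, bool multi_thread_in_use)
{
	return !uses_global_dio_buff || (!timer_in_handler && !multi_thread_in_use);
}

/* Condition checked after the fix: in_os_signal_handler replaces timer_in_handler. */
bool dio_check_after_fix(bool uses_global_dio_buff, bool in_os_signal_handler, bool multi_thread_in_use)
{
	return !uses_global_dio_buff || (!in_os_signal_handler && !multi_thread_in_use);
}

/* Core file values: dio_buff.aligned == csd, timer_in_handler = 1,
 * multi_thread_in_use = 0, in_os_signal_handler = 0.
 */
void replay_core_values(void)
{
	assert(!dio_check_before_fix(true, true, false));	/* the old condition is violated */
	assert(dio_check_after_fix(true, false, false));	/* the new condition holds */
}
```

Both forms still reject the case the assert is meant to catch: with `in_os_signal_handler` set
while the global buffer is in use, `dio_check_after_fix()` returns false.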

nars1 added a commit that referenced this issue


[#835] [V70001] [GTM-9333] Fix SET_FORCED_EXIT_STATE macro flow to complete deferred state setup before invoking xfer_set_handlers()

d988d72

* After merging GT.M V7.0-001, the following tests failed in rare cases.
  - -t dual_fail_extend -replic -st dual_fail2_mustop_sigquit
  - -t v60000 -replic -st gtm4525b

* The failure symptom was the following.

  ```c
  (gdb) x/s gtm_threadgbl_true->util_outbuff
  0x17d3ed8:  "%YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 38 for expression (GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit))"
  ```

* And the C-stack was the following.

  ```c
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fffa2b5d390) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:38
  #7  set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #8  xfer_set_handlers (event_type=11, param_val=1730866112, popped_entry=0) at sr_port/deferred_events.c:191
  #9  generic_signal_handler (sig=15, info=0x7f7167e24218 <stapi_signal_handler_oscontext+3320>, context=0x7f7167e24298 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305
  #10 ydb_os_signal_handler (sig=15, info=0x7fffa2b5d9f0, context=0x7fffa2b5d8c0) at sr_unix/ydb_os_signal_handler.c:85
  #11 <signal handler called>
  #12 __GI___clock_nanosleep (clock_id=1, flags=1, req=0x7fffa2b5e058, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  #13 wait_for_repl_inst_unfreeze_nocsa_jpl (jpl=0x17ec240) at sr_port/anticipatory_freeze.h:517
  #14 wait_for_repl_inst_unfreeze (csa=0x18f7040) at sr_port/anticipatory_freeze.h:547
  #15 jnl_write_attempt (jpc=0x18f7a40, threshold=29324848) at sr_port/jnl_write_attempt.c:348
  #16 jnl_flush (reg=0x189afe8) at sr_port/jnl_flush.c:57
  #17 tp_tend () at sr_port/tp_tend.c:795
  #18 op_tcommit () at sr_port/op_tcommit.c:497

  (gdb) f 6
  #6  0x00007f71672ae771 in deferred_signal_handler () at sr_port/deferred_signal_handler.c:38
  38              assert(GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit));

  ```

* The `SET_FORCED_EXIT_STATE` macro call (in frame 9 above) is where the issue is.

  **sr_port/have_crit.h**
  ```c
      172 #define SET_FORCED_EXIT_STATE(SIG)                                                                                              \
      173 {                                                                                                                               \
      174         char                    *rname;                                                                                         \
      175                                                                                                                                 \
      176         GBLREF VSIG_ATOMIC_T    forced_exit;                                                                                    \
      177         GBLREF int              forced_exit_sig;                                                                                \
      178         GBLREF boolean_t        (*xfer_set_handlers_fnptr)(int4, void (*callback)(int4), int4 param, boolean_t popped_entry);   \
      179         GBLREF void             (*deferred_signal_set_fnptr)(int4 dummy_val);                                                   \
      180                                                                                                                                 \
      181         /* Below code is not thread safe as it modifies global variables "forced_exit"                                          \
      182          * and "forced_exit_sig".                                                                                               \
      183          */                                                                                                                     \
      184         assert(!INSIDE_THREADED_CODE(rname));                                                                                   \
      185         assert((0 == forced_exit) || (1 == forced_exit));                                                                       \
  --> 186         forced_exit = 1;                                                                                                        \
      187         forced_exit_sig = SIG;          /* Record the signal forcing us to exit */                                              \
      188         if (in_os_signal_handler)                                                                                               \
      189         {       /* If we are inside an OS signal handler and therefore had to defer exit                                        \
      190                  * handling, treat this as an outofband event as this is checked by lots of                                     \
      191                  * potentially long-running commands in the runtime (e.g. HANG etc.) and we                                     \
      192                  * want all of those to automatically trigger process exit handling.                                            \
      193                  * The below invocation takes care of the signal as a deferred outofband event                                  \
      194                  * that gets handled at the earliest safe point.                                                                \
      195                  */                                                                                                             \
      196                 if (NULL != xfer_set_handlers_fnptr)                                                                            \
  --> 197                         (*xfer_set_handlers_fnptr)(deferred_signal, deferred_signal_set_fnptr, 0, FALSE);                       \
      198                 /* else: it is "gtmsecshr" in which case outofband does not apply */                                            \
      199         }                                                                                                                       \
      200         /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */                                    \
  --> 201         SET_DEFERRED_EXIT_CHECK_NEEDED;                                                                                         \
      202         SET_FORCED_THREAD_EXIT;         /* Signal any running threads to stop */                                                \
      203         SET_FORCED_MULTI_PROC_EXIT;     /* Signal any parallel processes to stop */                                             \
      204 }
  ```

* Line 186 sets `forced_exit` and Line 201 sets the corresponding deferred event. But Line 197
  ends up invoking `deferred_signal_handler()` which has an assert that expects Lines 186 and 201 to
  have both already executed (i.e. the two pieces of state must be set together).

* This is fixed by moving lines 188-199 above to execute AFTER lines 200-203. That way the state setup
  of the forced exit is finished first and then the outofband set up happens by the
  `xfer_set_handlers_fnptr` call.

* Now that Lines 186 and 201 are executed BEFORE line 197 in this commit, the assert failure seen
  in the test failure should be automatically fixed.
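
The ordering problem can be reproduced in a small stand-alone model. The variable names follow the
macro, but everything here is a simplified sketch: `deferred_exit_check_needed` stands in for the
`SET_DEFERRED_EXIT_CHECK_NEEDED` / `GET_DEFERRED_EXIT_CHECK_NEEDED` pair, `invariant_held` and the
function names are hypothetical, and the model assumes we are inside an OS signal handler so that
the callback is always invoked.

```c
#include <assert.h>

int forced_exit = 0;
int deferred_exit_check_needed = 0;	/* models the deferred exit event flag */
int invariant_held = 1;			/* hypothetical: what the callback observed */

/* Models the assert in deferred_signal_handler(), reached through the line 197 call. */
void xfer_set_handlers_model(void)
{
	invariant_held = deferred_exit_check_needed || (1 != forced_exit);
}

void reset_state_model(void)
{
	forced_exit = 0;
	deferred_exit_check_needed = 0;
	invariant_held = 1;
}

/* Old flow: line 186, then the line 197 callback, then line 201. */
int set_forced_exit_old_order(void)
{
	reset_state_model();
	forced_exit = 1;
	xfer_set_handlers_model();	/* runs while the state is only half set up */
	deferred_exit_check_needed = 1;
	return invariant_held;
}

/* New flow: lines 186 and 201 first, then the callback. */
int set_forced_exit_new_order(void)
{
	reset_state_model();
	forced_exit = 1;
	deferred_exit_check_needed = 1;
	assert((1 == forced_exit) && deferred_exit_check_needed);	/* state setup is complete */
	xfer_set_handlers_model();
	return invariant_held;
}
```

In this model `set_forced_exit_old_order()` returns 0 (the invariant is violated inside the
callback) and `set_forced_exit_new_order()` returns 1.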

nars1 added a commit that referenced this issue


[#835] [V70001] Fix incorrect reserve_bytes usage in sr_port/mu_split.c

ce5567d

* The `reorg/on_ntp_njnl_reorg` subtest failed once in 10 runs or so with the following assert.

  ```diff
  > reorg_5_6/on_ntp_njnl_reorg/reorg1.out
  > %YDB-F-ASSERT, Assert failed in sr_port/mu_split.c line 689 for expression (*top_off < max_fill)
  ```

* Below is the core analysis using gdb.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140573380031552) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140573380031552) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140573380031552, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd10fd0290) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  locate_block_split_point (blk_stat=0x61d00029eb98, level=1, cur_blk_size=16, max_fill=4, last_rec_size=0x7ffd10fd0950, last_key=0x7ffd10fd0e10 "srqponm", last_keysz=0x7ffd10fd0920, top_off=0x7ffd10fd0940) at sr_port/mu_split.c:689
  #9  mu_split (cur_level=0, i_max_fill=4096, d_max_fill=2908, blks_created=0x7ffd10fd15a0, lvls_increased=0x7ffd10fd15b0) at sr_port/mu_split.c:314
  #10 mu_reorg (gl_ptr=0x62d00021fec0, exclude_glist_ptr=0x7ffd10fd2820, resume=0x7ffd10fd26e0, index_fill_factor=100, data_fill_factor=71, reorg_op=0) at sr_port/mu_reorg.c:356
  #11 mupip_reorg () at sr_port/mupip_reorg.c:334
  #12 mupip_main (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218) at sr_unix/mupip_main.c:121
  #13 dlopen_libyottadb (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218, main_func=0x560122181020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218) at sr_unix/mupip.c:22

  (gdb) p *top_off
  $5 = 16

  (gdb) p max_fill
  $6 = 4

  (gdb) up
  #9  mu_split (cur_level=0, i_max_fill=4096, d_max_fill=2908, blks_created=0x7ffd10fd15a0, lvls_increased=0x7ffd10fd15b0) at sr_port/mu_split.c:314
  314                             status = locate_block_split_point(old_blk1_hist_ptr, level, old_blk1_sz, max_fill,

  (gdb) p old_blk1_sz
  $30 = 16

  (gdb) p delta
  $31 = 56

  (gdb) p blk_size
  $32 = 4096

  (gdb) p reserve_bytes
  $33 = 4096

  (gdb) p max_fill_sav
  $34 = 2908

  (gdb) p bstar_rec_sz
  $35 = 12

  (gdb) p cs_data->reserved_bytes
  $36 = 0
  ```

* Below is the source code corresponding to frame 9.

  **sr_port/mu_split.c**
  ```c
    299    if ((old_blk1_sz + delta) > (blk_size - reserve_bytes))
    300    {
    301            split_required = TRUE;
    302            if (level == gv_target->hist.depth)
    303            {
    304                    create_root = TRUE;
    305                    if ((MAX_BT_DEPTH - 1) <= level)
    306                            return cdb_sc_maxlvl;                           /* maximum level reached */
    307            }
    308            if (max_fill + bstar_rec_sz > old_blk1_sz)
    309            {       /* need more space than what was in the old block, so new block will be "too big" */
    310                    if (((SIZEOF(blk_hdr) + bstar_rec_sz) == old_blk1_sz) && !mu_reorg_upgrd_dwngrd_in_prog)
    311                            return cdb_sc_oprnotneeded;                     /* Improve code to avoid this */
    312                    max_fill = old_blk1_sz - bstar_rec_sz;
    313            }
    314            status = locate_block_split_point(old_blk1_hist_ptr, level, old_blk1_sz, max_fill,
    315                    &old_blk1_last_rec_size, new_blk1_last_key, &new_blk1_last_keysz, &new_leftblk_top_off);
  ```

* As the core analysis indicates, `max_fill` is 4 which is the reason the assert failed. And that happened
  at line 312.

* And this is because `old_blk1_sz` was a small value of `16`.

* But then I was wondering how we got into the `if` at line 299, as we would get into it only if the
  block has a lot of content and needs a split.

* Turns out the issue is that `reserve_bytes` is `4096` which is the same as `blk_size`. And so the
  right hand side of the `>` check in line 299 was 0 which is why the `if` succeeded even though
  `old_blk1_sz + delta` was only `16 + 56` i.e. 72 bytes, which is a lot less than the value that would
  usually require a block split.

* I then started looking at why `reserve_bytes` ended up being such a huge value when the file header
  field it mirrors (i.e. `cs_data->reserved_bytes`) is 0 as indicated in the gdb analysis above.

* That is when I realized this happened in the following line.

  **sr_port/mu_split.c**
  ```c
    233         reserve_bytes = i_max_fill;
  ```

* This is a bug that was introduced in GT.M V7.0-001 changes (52a92df) to `sr_port/mu_split.c`. And I see
  that this incorrect set of `reserve_bytes = i_max_fill` is removed in GT.M V7.1-000 (f9ca5ad).

* f9ca5ad has a lot more changes to support `mupip reorg upgrade` so I cannot cherry-pick that commit.

* Therefore, I am fixing this issue by removing line 233 in this commit.

* In addition, 895c2d3a had fixed some `clang-tidy` warnings in `sr_port/mu_split.c` related to the
  `reserve_bytes` variable. There was a block of code that set `reserve_bytes` BEFORE line 233. And
  that was flagged as a `[clang-analyzer-deadcode.DeadStores]` warning. And so the previous block of
  code that set `reserve_bytes` was removed in 895c2d3a. And along with it, logic related to
  `available_bytes` was made `#ifdef DEBUG`. Both these changes are now removed in this commit as
  that previous block had correctly set `reserve_bytes = cs_data->reserved_bytes` (and is the line
  that remains even in GT.M V7.1-000 f9ca5ad). That said, the update of `available_bytes` inside
  the `#ifdef DEBUG` block continues to be used only by a later assert so that is still kept inside
  a `DEBUG_ONLY()` macro in the new code to avoid the following `clang-tidy` warning.

  ```
  mu_split.c:warning: Value stored to 'available_bytes' is never read [clang-analyzer-deadcode.DeadStores]
  ```
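
The arithmetic from the core analysis can be replayed in isolation. This is a sketch with
hypothetical function names: it mirrors only the comparison at line 299 and the clamp at lines
308-313, and it omits the `cdb_sc_oprnotneeded` early return. The starting `max_fill` of 2908 is
taken from `max_fill_sav` as an assumption (any starting value above 4 gives the same result).

```c
#include <assert.h>

/* Mirrors the check at line 299. */
int split_required_model(int old_blk1_sz, int delta, int blk_size, int reserve_bytes)
{
	return (old_blk1_sz + delta) > (blk_size - reserve_bytes);
}

/* Mirrors lines 308-313, without the cdb_sc_oprnotneeded early return. */
int clamp_max_fill_model(int max_fill, int bstar_rec_sz, int old_blk1_sz)
{
	if (max_fill + bstar_rec_sz > old_blk1_sz)
		max_fill = old_blk1_sz - bstar_rec_sz;
	return max_fill;
}

/* Values from the gdb session: old_blk1_sz = 16, delta = 56, blk_size = 4096, bstar_rec_sz = 12. */
void replay_split_values(void)
{
	/* reserve_bytes == blk_size: 72 > 0, so an unnecessary split is requested */
	assert(split_required_model(16, 56, 4096, 4096));
	/* reserve_bytes == cs_data->reserved_bytes == 0: 72 > 4096 is false, so no split */
	assert(!split_required_model(16, 56, 4096, 0));
	/* the clamp then yields 16 - 12 = 4, which is why (*top_off < max_fill) failed with 16 < 4 */
	assert(4 == clamp_max_fill_model(2908, 12, 16));
}
```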

nars1 added a commit that referenced this issue


[YDBTest#501] [DEBUG-ONLY] Fix ydb_test_4g_db_blks env var handling in is_free_blks_ctr_ok() (regression in ea9950a)

8ef4338

* The resil_4/resil subtest failed in one rare in-house test run with the following symptom.

  ```diff
  > resil_4_6/resil/reorg4.out
  > %YDB-F-ASSERT, Assert failed in sr_port/bm_getfree.c line 401 for expression (FALSE)
  ```

* Below is relevant information from the core file.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (threadid=<optimized out>, signo=3, no_tid=<optimized out>) at ./nptl/pthread_kill.c:44
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fafb799c5a0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  is_free_blks_ctr_ok () at sr_port/bm_getfree.c:401
  #7  gdsfilext (blocks=100, filesize=12559321033, trans_in_prog=1) at sr_unix/gdsfilext.c:335
  #8  bm_getfree (hint_arg=196239399, blk_used=0x7fafb80a4410, cw_work=3, cs=0x7fafb6d841c0 <cw_set>, cw_depth_ptr=0x7fafb80a4540) at sr_port/bm_getfree.c:185
  #9  t_end (hist1=0x62d0001128c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:520
  #10 mu_reorg (gl_ptr=0x62d000112040, exclude_glist_ptr=0x7ffc96944400, resume=0x7ffc96933f60, index_fill_factor=78, data_fill_factor=44, reorg_op=0) at sr_port/mu_reorg.c:367
  #11 mupip_reorg () at sr_port/mupip_reorg.c:334
  #12 mupip_main (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988) at sr_unix/mupip_main.c:121
  #13 dlopen_libyottadb (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988, main_func=0x55aebf0d8420 <str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  0x00007fafb5ce61b2 in is_free_blks_ctr_ok () at sr_port/bm_getfree.c:401
  401                             assert(FALSE);  /* In pro, we will simply skip counting this local bitmap. */

  (gdb) list
  379    for (free_blocks = 0, free_bml = 0; free_bml < local_maps; free_bml++)
  380    {
  381            #ifdef DEBUG
  382            if ((0 != ydb_skip_bml_num)
  383                    && (BLKS_PER_LMAP <= (BLKS_PER_LMAP * free_bml))
  384                    && ((BLKS_PER_LMAP * free_bml) < ydb_skip_bml_num))
  385            {
  386                    free_bml = (ydb_skip_bml_num / BLKS_PER_LMAP) - 1;
  387                            /* - 1 to compensate the "free_bml++" done in "for" loop line */
  388                    free_blocks += (ydb_skip_bml_num - BLKS_PER_LMAP) / BLKS_PER_LMAP * (BLKS_PER_LMAP - 1);
  389                    continue;
  390            }
  391            #endif
  392            bml = bmm_find_free((uint4)free_bml, (sm_uc_ptr_t)MM_ADDR(cs_data), local_maps);
  393            if (bml < free_bml)
  394                    break;
  395            free_bml = bml;
  396            bml *= BLKS_PER_LMAP;
  397            if (!(bmp = t_qread(bml, (sm_int_ptr_t)&cycle, &cr))
  398                            || (BM_SIZE(BLKS_PER_LMAP) != ((blk_hdr_ptr_t)bmp)->bsiz)
  399                            || (LCL_MAP_LEVL != ((blk_hdr_ptr_t)bmp)->levl))
  400            {
  401                    assert(FALSE);  /* In pro, we will simply skip counting this local bitmap. */
  402                    continue;
  403            }

  (gdb) p/x bml
  $4 = 0x200

  (gdb) p/x ydb_skip_bml_num
  $3 = 0x2ec97f600

  (gdb) p free_bml
  $5 = 1
  ```

* The value of `bml` is 512 at line 401 and `ydb_skip_bml_num` is a non-zero value. This means we are
  in the HOLE section of the database file and should never do a `t_qread()` call on such a block. But
  we did it at line 397, which is why we ended up with an assert failure.

* The value of `free_bml` is 1 at line 401.

* If `free_bml` had been 1 at line 382, the `if` block would have kicked in, recognized this as a hole,
  counted the free blocks for the HOLE and then moved on to the non-HOLE section of the database file
  using the `continue` at line 389 (i.e. in a different iteration of the `for` loop).

* But this did not happen. Therefore, `free_bml` must have been 0 at line 382 and become 1 by line 401.

* The only assignment to `free_bml` between those two lines that could have run is at line 395, so that
  is where it got set to 1.

* That is, the `bmm_find_free()` call at line 392 was passed in `free_bml=0` as the first parameter
  and returned `bml=1`. That is, it found all blocks in bitmap block 0 as busy and so returned bitmap
  block 1 as that which has free space.

* In this case, we should redo the `if` check in lines 381-391 right after setting `free_bml` in line 395.
  That would take care of skipping the `t_qread()` call for bitmap blocks in the HOLE section.
  This is exactly what is taken care of in this commit.
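
The effect of redoing the hole check after the free-bitmap lookup can be seen in a small simulation. This is only a sketch under stated assumptions: `find_free_lmap()` is a hypothetical stand-in for `bmm_find_free()`, the `has_free` array stands in for the master map, the `recheck_after_find` flag exists only to compare the two behaviors, and the free-block counting done for the hole is omitted. `BLKS_PER_LMAP` is taken as 512, consistent with `bml = 0x200` for `free_bml = 1` in the gdb output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLKS_PER_LMAP 512

/* True if local bitmap number lmap lies in the hole: after bitmap 0 and before skip_bml_num. */
static bool lmap_in_hole(uint64_t lmap, uint64_t skip_bml_num)
{
    uint64_t blk = lmap * BLKS_PER_LMAP;

    return (0 != skip_bml_num) && (BLKS_PER_LMAP <= blk) && (blk < skip_bml_num);
}

/* Hypothetical stand-in for bmm_find_free(): first map >= start with free space, else local_maps. */
static uint64_t find_free_lmap(const bool *has_free, uint64_t start, uint64_t local_maps)
{
    for (uint64_t i = start; i < local_maps; i++)
        if (has_free[i])
            return i;
    return local_maps;
}

/* Walks the local bitmaps and returns how many bitmaps inside the hole would have been read.
 * recheck_after_find = false models the old flow, true models the flow after the fix.
 */
uint64_t hole_reads(const bool *has_free, uint64_t local_maps, uint64_t skip_bml_num,
                    bool recheck_after_find)
{
    uint64_t reads_in_hole = 0;

    assert(NULL != has_free);
    for (uint64_t lmap = 0; lmap < local_maps; lmap++)
    {
        if (lmap_in_hole(lmap, skip_bml_num))
        {   /* "- 1" compensates for the lmap++ of the loop */
            lmap = (skip_bml_num / BLKS_PER_LMAP) - 1;
            continue;
        }
        lmap = find_free_lmap(has_free, lmap, local_maps);
        if (lmap >= local_maps)
            break;
        if (recheck_after_find && lmap_in_hole(lmap, skip_bml_num))
        {   /* the lookup jumped into the hole, so skip it just like above */
            lmap = (skip_bml_num / BLKS_PER_LMAP) - 1;
            continue;
        }
        if (lmap_in_hole(lmap, skip_bml_num))
            reads_in_hole++;    /* this is where a read of a hole bitmap would happen */
    }
    return reads_in_hole;
}
```

With bitmap 0 completely busy and the hole spanning bitmaps 1 to 3, the old flow reads one bitmap inside the hole while the rechecking flow reads none. If bitmap 0 has free space, neither flow touches the hole, which is consistent with the failure being rare.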

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] [#1029] Fix incorrect assert failure when 1000s of jobin…

e7e6149

…terrupts cause %YDB-E-STACKCRIT

Background
----------
* This is an issue identified by @shabiel while trying to reproduce some other issue. The next
  bullet is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1029#description.

* To reproduce this, start an M process in direct mode and leave it at the `YDB>` prompt. In another
  terminal, send it a lot of repeated job interrupts (i.e. `mupip intrpt`), and then go back to the
  M process, type anything and press enter. Eventually, the process will crash with this:

  ```c
  YDB>%YDB-E-STACKCRIT, Stack space critical
  %YDB-F-ASSERT, Assert failed in sr_port/mdb_condition_handler.c line 1184
          for expression (!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL))
  ```

* One needs to send repeated jobinterrupts to the same process. One easy way to do this is in tcsh
  using the following command from a different terminal. The below issues 10,000 interrupts to
  pid 52545.

  ```
  repeat 10000 $ydb_dist/mupip intrpt 52545
  ```

* One would see the assert show up in the original terminal.

Issue
-----
Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1029#note_1581268407

* The debugger shows it is a `STACKCRIT` error that in turn triggers an assert failure.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140478350915392) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140478350915392) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140478350915392, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffe608dc2d0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  mdb_condition_handler (arg=150373738) at sr_port/mdb_condition_handler.c:1184
  #9  rts_error_va (csa=0x0, argcnt=1, var=0x7ffe608dc7a0) at sr_unix/rts_error.c:198
  #10 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #11 jobintrpt_ztime_process (ztime=0) at sr_port/jobintrpt_ztime_process.c:84
  #12 trans_code () at sr_port/trans_code.c:193
  #13 stkok2 () at sr_x86_64/mum_tstart.s:37

  (gdb) f 11
  #11 jobintrpt_ztime_process (ztime=0) at sr_port/jobintrpt_ztime_process.c:84
  84              PUSH_MV_STENT(MVST_ZINTR);      /* MVST_ZTIMEOUT is identical to MVST_ZINTR with a flag to differentiate in debugging */

  (gdb) f 8
  #8  mdb_condition_handler (arg=150373738) at sr_port/mdb_condition_handler.c:1184
  1184                            assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL));

  (gdb) list
  1182                    if (!(SFT_ZINTR & proc_act_type) && !(SFT_ZTIMEOUT & proc_act_type))    /* ztimeout vector precompiled */
  1183                    {
  1184                            assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL));
  1185                            trans_code_cleanup();
  ```

* I verified that with a Release/PRO build, there is no issue.

  ```c
  YDB>%YDB-E-STACKCRIT, Stack space critical
  %YDB-E-ERRWZINTR, Error while processing $ZINTERRUPT
  %YDB-E-ZINTDIRECT, Attempt to enter direct mode from $ZINTERRUPT
                  At M source location +1^GTM$DMOD
  ```

* The issue is in line 1184 which needs to take a STACKCRIT error into account in the assert.

Fix
---
* The following change should fix the failure in my understanding and is implemented in this commit.

  ```diff
  < assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL));
  > assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL) || ((int)ERR_STACKCRIT == SIGNAL));
  ```

* And it did fix the failure when I manually tested with the above change.

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] Add assert in sr_port/bm_getfree.c to catch rare/incorre…

d366c66

…ct block allocations in previous bitmap

Background
----------
* The `reorg_5/on_ntp_njnl_reorg` subtest failed with the following symptom.

  ```diff
  3c3,5
  < PASS from on_ntp_njnl_reorg
  ---
  > FAIL from on_ntp_njnl_reorg  # Subtest stopped as it has too many core files (Threshold = 10 ; Actual = 13)
  > Killed
  ```

* There were a lot of assert failures and core files. The primary failure was an assert failure
  in a `mupip reorg -fill=39 -index=95 -truncate` process.

  Line 1552 below is the primary assert failure (in `bml_status_check.c`).

  **reorg_5_7/online_reorg_31388.outx.4**
  ```
      1 # Fri 10 Nov 2023 06:16:09 PM EST : cnt = 4 ; ff = 39 ; inff = 95
      2 # Fri 10 Nov 2023 06:16:09 PM EST : nice +19 mupip reorg -fill=39 -index=95 -truncate
      .
   1544 Global: zyxwvu95 (region DEFAULT)
   1545 Blocks processed    : 16
   1546 Blocks coalesced    : 3
   1547 Blocks split        : 9
   1548 Blocks swapped      : 16
   1549 Blocks freed        : 0
   1550 Blocks reused       : 9
   1551 Blocks extended     : 0
   1552 %YDB-F-ASSERT, Assert failed in sr_port/bml_status_check.c line 72 for expression ((gds_t_acquired != cs->mode) || (BLK_BUSY != bml_status))
   1553 %YDB-F-ASSERT, Assert failed in sr_port/sec_shr_map_build.c line 62 for expression (bitnum > prev_bitnum)
   1554 %YDB-F-ASSERT, Assert failed in sr_port/wcs_recover.c line 173 for expression ((!dollar_tlevel && !cr_array_index) || (dollar_tlevel && (!si->cr_array_index || (NULL != si->kip_csa))))
   1555 %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown
  ```

Issue
-----
* The assert failure in `bml_status_check.c` indicates that we found a block that was allocated in this
  transaction (in `bm_getfree.c`) but the local bitmap for that allocated block says the block is already
  marked `BUSY`. This is an out-of-design situation as only blocks marked `FREE` or `RECYCLED` in the
  local bitmap should be allocated.

* The C-stack and relevant variables from the core files using gdb are captured below.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fff32e6efc0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  bml_status_check (cs=0x7f9e639cc4e0 <cw_set>) at sr_port/bml_status_check.c:72
  #7  t_end (hist1=0x2150840, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1637
  #8  mu_swap_root (gl_ptr=0x2151540, root_swap_statistic_ptr=0x7fff32e727ac, upg_mv_block=0) at sr_unix/mu_swap_root.c:233
  #9  mupip_reorg () at sr_port/mupip_reorg.c:399
  #10 mupip_main (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78) at sr_unix/mupip_main.c:117
  #11 dlopen_libyottadb (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78, main_func=0x4014a4 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #12 main (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78) at sr_unix/mupip.c:22

  (gdb) p/x ydb_skip_bml_num
  $4 = 0x320dd1200

  (gdb) p/x cw_set[0].blk
  $33 = 0x320dd1bff
  (gdb) p/x cw_set[1].blk
  $35 = 0x1
  (gdb) p/x cw_set[2].blk
  $34 = 0x320dd1c00

  (gdb) p cw_set[0].mode
  $37 = gds_t_acquired
  (gdb) p cw_set[1].mode
  $39 = gds_t_write
  (gdb) p cw_set[2].mode
  $38 = gds_t_writemap
  ```

* The debug-only `ydb_test_4g_db_blks` env var is enabled in this test run as `ydb_skip_bml_num` is set
  to a non-zero value (seen above).

* This means that after the local bitmap `0`, the next local bitmap that is used would be `0x320dd1200`.
  All blocks from `0x200` till `0x320dd11ff` will not be used by the database logic (i.e. there will be
  a huge hole in the database file) for block allocation.

* The `cw_set[0].blk` indicates the allocated block number is `0x320dd1bff`. But `cw_set[2].blk` indicates
  this allocation happened in the local bitmap block `0x320dd1c00`.

* Notice that the allocated block number actually is 1 less than the bitmap block number.

* This means that the allocated block was actually in the `previous` local bitmap. This is an out-of-design
  situation.

* The allocation happened in `bm_getfree.c` but the point of assert failure is much later.

Changes
-------
* There is logic in `bm_getfree.c` that finds a free bit in the local bitmap and stores it in a variable
  `free_bit`. In the failure case above, `free_bit` must have ended up with a value of `-1` (which also
  happens to be the value of the `NO_FREE_SPACE` macro) to result in the failure.

* It is not clear to me how this happened but I suspect this was possible only because of interactions
  with the debug-only `ydb_skip_bml_num` scheme.

* That is, my suspicion is that it is some issue in the `ydb_skip_bml_num` scheme.

* Towards better understanding how `free_bit` ended up with the `-1` value, this commit adds an assert
  that `free_bit` is never negative.

* In the test failure case above, we would have seen this new assert fail at a much earlier point thereby
  giving us a better core file to analyze.

* This commit enables us to better analyze such failures if/when they happen in the future (the cause is
  still unknown).
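
The arithmetic behind the analysis above can be checked with a short sketch. It assumes the allocated block number is computed as the bitmap block number plus `free_bit` (inferred from the commit message, not copied from `bm_getfree.c`), and the function names are hypothetical. `BLKS_PER_LMAP` is taken as 512.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLKS_PER_LMAP 512    /* 0x200 */
#define NO_FREE_SPACE (-1)   /* value stated in the commit message */

/* Assumed relationship: allocated block = bitmap block number + free bit offset. */
int64_t allocated_blk(int64_t bml, int32_t free_bit)
{
    return bml + free_bit;
}

/* Local bitmap block that owns a given block number. */
int64_t owning_bml(int64_t blk)
{
    return blk - (blk % BLKS_PER_LMAP);
}

/* The kind of condition the new assert checks: free_bit must never be negative. */
bool free_bit_ok(int32_t free_bit)
{
    return 0 <= free_bit;
}
```

Plugging in the core file values, `allocated_blk(0x320dd1c00, -1)` gives `0x320dd1bff`, which is `cw_set[0].blk`, and `owning_bml(0x320dd1bff)` gives `0x320dd1a00`, i.e. the previous local bitmap rather than `0x320dd1c00`.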

nars1 added a commit that referenced this issue


          Fix subtle issue in TPNOTACID_CHECK macro (caused insert_region.c ass…

156c110

…ert failure)

Background
----------
* Below is a first-time failure, when running the `r126/ydb464` subtest (from the YDBTest project), that
  I noticed while trying to reproduce some other failure.

  ```diff
  --- ydb464/ydb464.diff ---
  19a20,73
  > r126_0_31/ydb464/simpleapi2/child98118.log
  > %YDB-F-ASSERT, Assert failed in sr_port/insert_region.c line 110 for expression ((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit))
  ```

* The C-stack and relevant variables from the core file are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffee07f7480) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  #7  mlk_unlock (p=0x1591940) at sr_port/mlk_unlock.c:70
  #8  tp_unwind (newlevel=0, invocation_type=ROLLBACK_INVOCATION, tprestart_rc=0x0) at sr_port/tp_unwind.c:294
  #9  op_trollback (rb_levels=0) at sr_port/op_trollback.c:200
  #10 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:230
  #12 signal_exit_handler (exit_handler_name=0x7ff4913b071e "deferred_exit_handler", sig=2, info=0x7ff491795458 <stapi_signal_handler_oscontext+3224>, context=0x7ff4917954d8 <stapi_signal_handler_oscontext+3352>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 rel_crit (reg=0x14d0170) at sr_unix/rel_crit.c:81
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  #17 op_lock2_common (timeout=0, laflag=64 '@') at sr_port/op_lock2.c:242
  #18 op_incrlock_common (timeout=0) at sr_port/op_incrlock.c:49
  #19 ydb_lock_incr_s (timeout_nsec=0, varname=0x7ffee07f8c30, subs_used=0, subsarray=0x0) at sr_unix/ydb_lock_incr_s.c:91
  #20 runProc (settings=0x7ffee07fab80, curDepth=1) at simpleapi/inref/randomWalk.c:489
  #21 tpHelper (tpfnparm=0x7ffee07fa100) at simpleapi/inref/randomWalk.c:691
  #22 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256
  #23 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #24 runProc (settings=0x7ffee07fab80, curDepth=0) at simpleapi/inref/randomWalk.c:666
  #25 runProc_driver (settings=0x7ffee07fab80) at simpleapi/inref/randomWalk.c:145
  #26 main () at simpleapi/inref/randomWalk.c:93

  (gdb) f 6
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  110                                     assert((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit));

  (gdb) p process_exiting
  $3 = 1

  (gdb) p t_tries
  $4 = 3

  (gdb) p dollar_tlevel
  $5 = 1

  (gdb) p csa->now_crit
  $6 = 0

  (gdb) up
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  120                             TPNOTACID_CHECK(LOCKGCINTP);
  ```

Issue
-----
* The assert that failed in `insert_region()` (frame 6 in above stack trace) indicates that we were in the
  final retry (i.e. `t_tries` is equal to `3` or `CDB_STAGNATE`) but we did not hold crit on the current
  region where we are trying to do an `mlk_unlock()` operation.

* The assert is valid and did expose an issue.

* In frame 16, in `mlk_lock()`, we did a `rel_crit()` call in the `TPNOTACID_CHECK` macro while in the
  final retry.

  **sr_port/mlk_lock.c**
  ```c
    120                         TPNOTACID_CHECK(LOCKGCINTP);
  ```

* Below is the code inside the macro.

  **sr_port/tp.h**
  ```c
     979 #define TPNOTACID_CHECK(CALLER_STR)                                                                                             \
     980 {                                                                                                                               \
     981         GBLREF  boolean_t       mupip_jnl_recover;                                                                              \
     982         mval            zpos;                                                                                                   \
     983                                                                                                                                 \
     984         if (IS_TP_AND_FINAL_RETRY)                                                                                              \
     985         {                                                                                                                       \
  -> 986                 TP_REL_CRIT_ALL_REG;                                                                                            \
     987                 assert(!mupip_jnl_recover);                                                                                     \
     988                 TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK;                                                                         \
  ```

* Line 986 is where the issue is. We do a `rel_crit()` call there but `t_tries` is still not decremented.
  The decrement of `t_tries` happens 2 lines later at line 988.

* Before doing the `rel_crit()` call, we need to decrement `t_tries`. This way, in case `rel_crit()`
  decides to invoke exit handling due to handling a deferred SIGINT signal (sent in the `ydb464` subtest),
  the assert in `insert_region()` would not be confused by seeing this out-of-design state and will not
  attempt to invoke `t_retry()` etc. which is a no-no as we should not transfer control to M code as
  part of a TP restart while the process is about to terminate on receipt of a SIGINT signal.

Fix
---
* Notice that in `sr_port/t_commit_cleanup.c`, the `t_tries` decrement happens BEFORE the `rel_crit()`
  call.

  **sr_port/t_commit_cleanup.c**
  ```c
    288       if (CDB_STAGNATE <= t_tries)
    289               TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK; /* t_tries untouched for rollback and recover */
      .
      .
    303               if (!csa->hold_onto_crit && csa->now_crit)
    304                       rel_crit(tr->reg);      /* Undo Step (CMT01) */
  ```

* In a similar fashion, in the `TPNOTACID_CHECK` macro in `sr_port/tp.h`, the `TP_REL_CRIT_ALL_REG` call
  should happen AFTER the `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK` call. And that is the fix.

* While doing this fix, I noticed a similar ordering issue in `sr_port/gvcst_init.c` and so fixed that too.
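
Why the ordering matters can be sketched with a toy model. Everything below is hypothetical scaffolding: `struct tp_state`, `sim_rel_crit()` and the two ordering functions are not YottaDB code, the decrement is unconditional (unlike `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK`), and the exit handling is reduced to evaluating the `insert_region.c` assert expression at the moment crit is released.

```c
#include <stdbool.h>

#define CDB_STAGNATE 3

/* Hypothetical snapshot of the state relevant to the insert_region.c assert. */
struct tp_state
{
    int  t_tries;
    int  dollar_tlevel;
    bool now_crit;
    bool exit_pending;       /* a deferred signal is waiting for crit to be released */
    bool assert_would_fail;  /* set if exit handling sees an out-of-design state */
};

/* The expression asserted at insert_region.c line 110. */
static bool final_retry_state_ok(const struct tp_state *s)
{
    return (CDB_STAGNATE > s->t_tries) || (s->dollar_tlevel && s->now_crit);
}

/* Releasing crit is the point where deferred exit handling may run. */
static void sim_rel_crit(struct tp_state *s)
{
    s->now_crit = false;
    if (s->exit_pending)
        s->assert_would_fail = !final_retry_state_ok(s);
}

/* Old ordering in the macro: release crit first, decrement t_tries afterwards. */
void release_then_decrement(struct tp_state *s)
{
    sim_rel_crit(s);
    s->t_tries--;
}

/* Fixed ordering: decrement t_tries first, then release crit. */
void decrement_then_release(struct tp_state *s)
{
    s->t_tries--;
    sim_rel_crit(s);
}
```

Starting from `t_tries = 3`, `dollar_tlevel = 1`, holding crit and with an exit pending, the old ordering reaches the exit path with `t_tries = 3` and `now_crit = 0` (the values seen in the core file), whereas the fixed ordering reaches it with `t_tries = 2`, which satisfies the assert.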

Notes
-----
* While this failure happened with a Debug build of YottaDB, I suspect there is an issue in the Release
  build of YottaDB too. I am not sure exactly what the user-visible implications are, though. Even if
  there are any, the issue is unlikely to be encountered in practice and so no user-visible issue is
  created for this.

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] [V71001] Remove incorrect outofband related assert in sr…

eaa1da1

…_port/deferred_events.c

Background
----------
* The `v61000/intrpt_wcs_wtstart` subtest (in the YDBTest project) failed on a few rare occasions
  during internal testing with the following symptom.

  ```diff
  12a13,299
  > v61000_0_22/intrpt_wcs_wtstart/mumps-wb.out
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events.c line 114 for expression (no_event == outofband || (event_type == outofband))
  ```

Issue
-----
* The stack trace and relevant details from the gdb core analysis are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcc56fd8c0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  #7  jobinterrupt_event (sig=10, info=0x7fb372b8a518 <stapi_signal_handler_oscontext+5528>, context=0x7fb372b8a598 <stapi_signal_handler_oscontext+5656>) at sr_port/jobinterrupt_event.c:61
  #8  <signal handler called>
  #9  clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #10 m_usleep (useconds=10000) at sr_unix/sleep.c:37
  #11 wcs_sleep (sleepfactor=6310) at sr_port/wcs_sleep.c:28
  #12 wcs_flu (options=519) at sr_unix/wcs_flu.c:571
  #13 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:632
  #14 gv_rundown () at sr_port/gv_rundown.c:122
  #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #16 signal_exit_handler (exit_handler_name=0x7fb372a19ecf "generic_signal_handler", sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #17 generic_signal_handler (sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:502
  #18 ydb_os_signal_handler (sig=15, info=0x7ffcc56ffd30, context=0x7ffcc56ffc00) at sr_unix/ydb_os_signal_handler.c:88
  #19 <signal handler called>
  #20 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #21 m_usleep (useconds=999000) at sr_unix/sleep.c:37
  #22 wcs_wtstart (region=0xc30970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:216
  #23 wcs_timer_start (reg=0xc30970, io_ok=1) at sr_port/t_end_sysops.c:1346
  #24 t_end (hist1=0xcfe798, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1848
  #25 gvcst_put2 (val=0xc928b8, parms=0x7ffcc5709a80) at sr_port/gvcst_put.c:2796
  #26 gvcst_put (val=0xc928b8) at sr_port/gvcst_put.c:302
  #27 op_gvput (var=0xc928b8) at sr_port/op_gvput.c:79

  (gdb) f 6
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  114                     assert(no_event == outofband || (event_type == outofband));

  (gdb) p (enum outofbands)no_event
  $2 = no_event

  (gdb) p (enum outofbands)outofband
  $1 = deferred_signal

  (gdb) p (enum outofbands)event_type
  $3 = jobinterrupt
  ```

* The test sends a SIGTERM (i.e. SIG-15) signal. This caused `outofband` variable to be set to
  `deferred_signal` in frame 17 above (`generic_signal_handler.c` inside the `SET_FORCED_EXIT_STATE` macro).

* And then the process was sleeping (due to a white-box test case in the test).

* At that point, it was holding crit and another process was waiting for this and so was about to send
  a `MUTEXLCKALERT` message. At this point, since the test framework had set the `gtm_procstuckexec` env
  var to `com/gtmprocstuck_get_stack_trace.csh`, that was invoked and it in turn invoked `^%YDBPROCSTUCKEXEC`
  which in turn sent a `SIGUSR1` signal (i.e. a `mupip intrpt`) to this very same process that was sleeping
  while holding crit.

* And at this point, the process got the assert failure because the `outofband` variable indicated that
  a `SIG-15` signal needs to be handled whereas the `event_type` variable indicated that the current
  out of band event is a `jobinterrupt` event.

Fix
---
* This seems like a valid scenario and I suspect the assert is invalid.

* I noticed that this very same assert has been removed in a later GT.M release V7.1-001.

  ```diff
  $ cd YDB
  $ git show tags/V7.1-001 sr_port/deferred_events.c | head -35 | tail -8
  @@ -127,7 +127,6 @@ boolean_t xfer_set_handlers(int4  event_type, int4 param_val, boolean_t popped_e
          }
          if (!already_ev_handling)
          {
  -               assert(no_event == outofband || (event_type == outofband));
                  assert(!dollar_zininterrupt || (jobinterrupt != event_type));
                  if (entry != (TREF(save_xfer_root_ptr))->ev_que.fl)
                  {       /* no event in play so pend this one by jiggeriing the xfer_table */
  ```

* I assume GT.M noticed a similar issue, not while releasing V7.0-001 (which is what YottaDB master
  currently has merged) but while releasing the much later V7.1-001 version, and fixed it then.

* Therefore, I am removing the assert that failed.

* This should let the `v61000/intrpt_wcs_wtstart` test run fine until GT.M V7.1-001 gets merged into
  the YottaDB master branch.
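
The failing condition can be reproduced in isolation with a tiny sketch. The enum below is a hypothetical three-member subset whose numeric values are not meant to match the YottaDB headers; only the shape of the removed assert expression is taken from the source.

```c
#include <stdbool.h>

/* Hypothetical subset of the out-of-band event types; values are arbitrary but distinct. */
enum outofbands_sketch
{
    no_event,
    jobinterrupt,
    deferred_signal
};

/* The expression of the removed assert at deferred_events.c line 114. */
bool removed_assert_holds(enum outofbands_sketch outofband, enum outofbands_sketch event_type)
{
    return (no_event == outofband) || (event_type == outofband);
}
```

With `outofband` already set to `deferred_signal` by the SIGTERM and a `jobinterrupt` arriving afterwards, `removed_assert_holds(deferred_signal, jobinterrupt)` is false, which matches the core file.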

nars1 added a commit that referenced this issue


          Fix rare case of multiple threads executing YottaDB engine code in a …

5d7db9c

…Simple Thread API application

Background
----------
* The `r126/ydb464` subtest failed in one rare run with the following failure symptom.

  ```diff
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events_queue.c line 48 for expression (INTRPT_IN_EVENT_HANDLING == intrpt_ok_state)
  ```

* When this specific test was rerun around 10000 times, we saw around a dozen failures (with differing assert
  failures but all pointing to the same underlying issue) so this failure was reproducible but not easily.

Issue
-----
* Relevant details from the core file analysis are pasted below.

  ```c
  (gdb) thread apply all bt

  Thread 6 (Thread 0x7fa62a6c0640 (LWP 99885)):
    .
  #6  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #7  set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:37
  #8  xfer_set_handlers (event_type=11, param_val=939582496, popped_entry=0) at sr_port/deferred_events.c:190
  #9  generic_signal_handler (sig=2, info=0x7fa63a1b6fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fa63a1b7058 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305
  #10 ydb_os_signal_handler (sig=2, info=0x7fa625096bf0, context=0x7fa625096ac0) at sr_unix/ydb_os_signal_handler.c:88
  #11 <signal handler called>
  #12 __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at ./nptl/pthread_create.c:835
  #13 pthread_create ()
  #14 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=6) at simplethreadapi/inref/randomWalk.c:662
  #15 threadHelper (args=0x7fa62a6ba880) at simplethreadapi/inref/randomWalk.c:723
  #16 tpHelper (tptoken=..., errstr=0x7fa62a6ba850, tpfnparm=0x7fa62a6ba880) at simplethreadapi/inref/randomWalk.c:712
  #17 ydb_tp_st (tptoken=..., errstr=0x7fa62a6ba850, tpfn=0x55fa406e2d20 <tpHelper>, tpfnparm=0x7fa62a6ba880, transid=0x55fa406f69ea "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:100
  #18 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=5) at simplethreadapi/inref/randomWalk.c:642
  #19 threadHelper (args=0x7fa62a6bb7e0) at simplethreadapi/inref/randomWalk.c:723
    .
  #41 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  Thread 1 (Thread 0x7fa61ef2e640 (LWP 7158)):
    .
  #6  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #7  xfer_reset_handlers (event_type=11) at sr_port/deferred_events.c:235
  #8  outofband_clear () at sr_port/outofband_clear.c:41
  #9  outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:55
  #10 ydb_zwr2str_s (zwr=0x7fa61ef2d550, str=0x7fa61ef2d560) at sr_unix/ydb_zwr2str_s.c:55
  #11 ydb_zwr2str_st (tptoken=..., errstr=0x7fa61ef2d530, zwr=0x7fa61ef2d550, str=0x7fa61ef2d560) at sr_unix/ydb_zwr2str_st.c:40
  #12 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=7) at simplethreadapi/inref/randomWalk.c:545
  #13 threadHelper (args=0x7fa62a6b9940) at simplethreadapi/inref/randomWalk.c:723
  #14 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
  #15 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  (gdb) p dollar_tlevel
  $4 = 6

  (gdb) p/x ydb_engine_threadsafe_mutex_holder[0]
  $14 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[1]
  $15 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[2]
  $16 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[3]
  $17 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[4]
  $18 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[5]
  $19 = 0x7fa62a6c0640
  (gdb) p/x ydb_engine_threadsafe_mutex_holder[6]
  $20 = 0x7fa61ef2e640
  ```

* This is a case where signals are sent (SIGINT aka SIG-2 in this case) to a Simple Thread API process
  and one thread (`Thread 1` above) is already running under the YottaDB engine lock but the signal
  gets delivered to another thread (`Thread 6` above), which incorrectly starts executing the signal
  handler, which in turn invokes `xfer_set_handlers()` etc. So at the same time, 2 threads are
  executing YottaDB engine/runtime code although only one holds the lock. This is a no-no since YottaDB
  runtime logic is not multi-thread safe.

* From the above analysis, it is clear that the process was executing a TP transaction with `dollar_tlevel`
  equal to `6`.

  `Thread 6` had invoked `ydb_tp_st()` (in frame 17) which in turn invoked a callback function that created
  a new thread `Thread 1`.

  `Thread 6` held the YottaDB engine multi-thread lock for tlevels 0, 1, 2, 3, 4, 5.

  For tlevel 6, `Thread 1` held the YottaDB engine multi-thread lock.

* But the `SIGINT` signal (sent by the test) got sent to `Thread 6`. Therefore, it should have realized,
  while in `generic_signal_handler()`, that `dollar_tlevel` is 6 and it does not own the tlevel=6 lock
  (`Thread 1` owns it) and therefore should have done a `return` at line 404 below.

  ```c
     315 #define FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED(SIGHNDLRTYPE, SIG, IS_EXI_SIGNAL, INFO, CONTEXT)                                   \
       .
     332     if (simpleThreadAPI_active)                                                                                     \
     333     {                                                                                                               \
       .
     355             thisThreadId = pthread_self();                                                                          \
     356             assert(thisThreadId);                                                                                   \
     357             SET_YDB_ENGINE_MUTEX_HOLDER_THREAD_ID(mutexHolderThreadId, tLevel);                                     \
       .
     374             thisThreadIsMutexHolder = pthread_equal(mutexHolderThreadId, thisThreadId);                             \
       .
     386             if (!thisThreadIsMutexHolder                                                                            \
     387                             || (!IS_EXI_SIGNAL && (tLevel && (!isSigThreadDirected || signalForwarded))))           \
     388             {       /* Two possibilities.                                                                           \
       .
  -> 404                     return;                                                                                         \
     405             } else                                                                                                  \
  ```

* But clearly that did not happen (as seen from the core file). Therefore, `thisThreadIsMutexHolder`
  (set at line 374 above) must have been `TRUE`.

* How that happened can be seen in line 286 below inside the macro (invoked from line 357 above).

  ```c
    268 #define SET_YDB_ENGINE_MUTEX_HOLDER_THREAD_ID(HOLDER_THREAD_ID, TLEVEL)                                         \
    269 {                                                                                                               \
    270    GBLREF  uint4           dollar_tlevel;                                                                  \
    271    GBLREF  pthread_t       ydb_engine_threadsafe_mutex_holder[];                                           \
    272                                                                                                            \
    273    /* If not in TP, the YottaDB engine lock index is 0 (i.e. ydb_engine_threadsafe_mutex_holder[0] is      \
    274     * current lock holder thread if it is non-zero). But if we are in TP, then lock index could be         \
    275     * "dollar_tlevel"     : e.g. if a "ydb_get_st" call occurs inside of the "ydb_tp_st" call OR           \
    276     * "dollar_tlevel - 1" : if control is in the TP callback function inside "ydb_tp_st" but not a         \
    277     *      SimpleThreadAPI call like "ydb_get_st" etc.                                                     \
    278     */                                                                                                     \
    279    TLEVEL = dollar_tlevel; /* take a local copy of global variable as it could be concurrently changing */ \
    280    if (!TLEVEL)                                                                                            \
    281            HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[0];                                       \
    282    else                                                                                                    \
    283    {                                                                                                       \
    284            HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[TLEVEL];                                  \
    285            if (!HOLDER_THREAD_ID)                                                                          \
    286                    HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[TLEVEL - 1];                      \
    287    }                                                                                                       \
    288 }
  ```

* Line 284 must have returned a value of 0 for `HOLDER_THREAD_ID` and so we went to line 286 and
  used the thread owner of tlevel=5 which was `Thread 6`.

* In the core file, we see that the owner of the tlevel=6 lock is `Thread 1`. But at the time line 284
  got executed, `Thread 1` did not yet own the lock.

* That can be explained if `Thread 1` had not yet done the `ydb_zwr2str_st()` call when line 284 got
  executed.

* The issue then is that when we found no one holding the tlevel=6 lock, we went to see who holds the
  tlevel=5 lock and returned that thread id as the current YottaDB engine multi-thread lock holder.

* This is where the issue is. `Thread 1`, even though it had not yet attempted to get the lock,
  effectively owns the lock at this point, since `Thread 6` has invoked the callback function and has
  no control over what calls the callback function makes (including creating new threads that in turn
  make Simple Thread API calls of their own, as happened with `Thread 1`).

* Treating `Thread 6` as owning the lock ended up with a situation where 2 threads think they each
  own the engine lock and run YottaDB code at the same time causing the assert failures.

* This issue is long standing (started in 2afcbd2, which was committed 2019/03/25) but it manifests
  as assert failures only after the GT.M V7.0-001 code merge. That is because the deferred event queue
  handling got reworked in V7.0-001 making it possible for more logic to execute while in the
  signal handler thereby exposing this long standing issue.

* Note that even then it has taken a few months of testing to show this one failure in a C program that
  invokes multiple threads. So it is really a rare issue.

Fix
---
* The fix is thankfully simple and is to remove lines 285-286 above. That is, check the lock
  holder only for the tlevel that the `dollar_tlevel` global currently indicates. Do not fall back to
  the previous tlevel if the lock for the current tlevel is not held by any thread.

* With this change, `Thread 6` will not incorrectly conclude it is the owner. This is because it will
  find that the owner of the YottaDB engine lock is no thread in this case and since that does not
  match its own thread id, it will `return` if it gets delivered the SIGINT (after noting down the
  fact that this signal handling was deferred) and the next thread that runs YottaDB runtime logic
  will notice this happened and handle the signal while it holds the engine lock.
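
As a rough illustration of the fix (a sketch, not the actual YottaDB macro), the two lookups can be modelled on a plain array. The names `holder_before_fix`, `holder_after_fix`, `may_handle_signal`, `thread_id_t` and `MAX_TLEVEL` are hypothetical; a value of 0 stands for "no holder", matching the `!HOLDER_THREAD_ID` check in the macro above.

```c
#include <assert.h>

/* Hypothetical stand-ins for the engine's globals; 0 means "no holder". */
#define MAX_TLEVEL 8
typedef unsigned long thread_id_t;

/* Pre-fix lookup: falls back to tlevel-1 when nobody holds the lock for tlevel. */
thread_id_t holder_before_fix(const thread_id_t holders[], unsigned tlevel)
{
    thread_id_t id;

    assert(tlevel < MAX_TLEVEL);
    id = holders[tlevel];
    if (tlevel && !id)
        id = holders[tlevel - 1];   /* the fallback that the fix removes */
    return id;
}

/* Post-fix lookup: only the slot for the current tlevel is consulted. */
thread_id_t holder_after_fix(const thread_id_t holders[], unsigned tlevel)
{
    assert(tlevel < MAX_TLEVEL);
    return holders[tlevel];
}

/* A thread may run the signal handler only if it is the reported lock holder. */
int may_handle_signal(thread_id_t holder, thread_id_t self)
{
    return (0 != holder) && (holder == self);
}
```

With the slots for tlevels 0 through 5 set to one thread id and the slot for tlevel 6 still empty (the state when line 284 ran), the pre-fix lookup reports the tlevel 5 holder, whereas the post-fix lookup reports no holder, so the signalled thread defers the signal instead of handling it.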

Notes
-----
* Since this issue is very unlikely to be seen in practice (needs a Simple Thread API application that
  creates threads while inside a `ydb_tp_st()` call and also sends SIGINT signals), no YDB issue is
  created for this.

nars1 added a commit that referenced this issue


          [#1041] Fix rare bug in MUPIP REORG -TRUNCATE that can cause database…

c11ab5b

… damage

Background
----------
Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1041#description

* The `reorg_5/on_ntp_njnl_reorg` subtest (from the YDBTest project) failed with the following symptom
  in a rare in-house test run.

  ```
  %YDB-F-ASSERT, Assert failed in sr_port/bml_status_check.c line 72 for expression ((gds_t_acquired != cs->mode) || (BLK_BUSY != bml_status))
  ```

* The assert failure indicates that we found a block that was allocated in this transaction (in
  `bm_getfree.c`) but the local bitmap for that allocated block says the block is already marked
  `BUSY`. This is an out-of-design situation as only blocks marked `FREE` or `RECYCLED` in the local
  bitmap should be allocated.

* The cause of this failure was not ascertained when it first happened and so !1409 (d366c66) added
  an assert to try to catch the failure early in `bm_getfree.c` (pre-commit) rather than much later in
  `bml_status_check.c` (in-commit).

* But even after !1409 got merged, we encountered another rare test failure with the same symptom. The
  failure still happened in `bml_status_check.c` and not in the expected `bm_getfree.c`.

* Therefore, the new failure was further analyzed and this time around the cause was finally identified. It
  turned out to be a `mupip reorg -truncate` issue that can result in assert failures in Debug builds
  and database damage in Release builds.

* But for this subtle timing issue to happen, a database file extension needs to happen concurrently,
  among other things.

* Below is an example mupip integ report from the damaged database after the `mupip reorg -truncate`
  in a manual test that I did.

  ```
  Integ of region DEFAULT

  Block:Offset Level
  %YDB-E-DBINCLVL,         Nature: #DANGER***
               3FF:0      2  Block at incorrect level
                             Directory Path:  1:10, 2:23
                             Path:  202:104, 3E6:17E, 3FF:0
  Keys from ^y(493.1) to the end are suspect.
  %YDB-E-DBBDBALLOC,         Nature: #DANGER***
               3FF:0      B  Block doubly allocated
                             Directory Path:  1:10, 2:36
                             Path:  3FF:0
  Keys from ^z to the end are suspect.
  ```

Issue
-----
* Below is the stack trace from the latest failure.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=<optimized out>, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffef29217e0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  bml_status_check (cs=0x7fe6935dfa40 <cw_set>) at sr_port/bml_status_check.c:72
  #9  t_end (hist1=0x62d0000a2840, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1637
  #10 mu_swap_root (gl_ptr=0x62d0000a3fc0, root_swap_statistic_ptr=0x7ffef29256a0, upg_mv_block=0) at sr_unix/mu_swap_root.c:233
  #11 mupip_reorg () at sr_port/mupip_reorg.c:399
  #12 mupip_main (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178) at sr_unix/mupip_main.c:117
  #13 dlopen_libyottadb (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178, main_func=0x55c02b55a020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178) at sr_unix/mupip.c:22
  ```

* Notice `mu_swap_root()` in the stack. It was seen as a caller in all failures so far where the symptom
  was an assert failure in `bml_status_check.c`.

* This specific function (which is called by `mupip reorg -truncate`) turned out to be where the issue
  is, and not the general purpose allocation logic in `bm_getfree.c` (as was suspected in d366c66).

* This function does block allocation, just like `bm_getfree.c`. But there is a subtle difference between
  the two which is what led me to the issue.

  **sr_port/bm_getfree.c**
  ```c
    306            free_bit = bm_find_blk((int4)offset, (sm_uc_ptr_t)bmp + SIZEOF(blk_hdr), map_size, blk_used);
      .
    311    if (NO_FREE_SPACE != free_bit)
    312            break;
  ```

  **sr_unix/mu_swap_root.c**
  ```c
    341         master_bit = bmm_find_free((hint_blk_num / BLKS_PER_LMAP), csa->bmm, num_local_maps);
    342         if ((NO_FREE_SPACE == master_bit))
      .
    360     free_bit = bm_find_blk(hint_bit, bmlhist.buffaddr + SIZEOF(blk_hdr), maxbitsthismap, &free_blk_recycled);
    361     free_blk_id = bmlhist.blk_num + free_bit;
  ```

* `bm_getfree.c` invokes `bm_find_blk()` to find a free block in the local bitmap and stores the result in
  `free_bit`. It then checks whether that indicates no free space (`NO_FREE_SPACE`) and if so it moves
  on to another local bitmap.

* `mu_swap_root.c` does a call to `bm_find_blk()` as well to find a free block but does no `NO_FREE_SPACE`
  check.

* This is where the issue is.

* If `free_bit` turns out to be `NO_FREE_SPACE`, we will end up setting `free_blk_id` (line 361 above)
  as 1 block BEFORE the current local bitmap block. This would mean we would consider a block in the
  previous bitmap as the FREE block and use it for the swap operation. If that block happens to be
  already used in some other global variable tree, then we would end up with integrity errors once the
  incorrect swap operation completes.

* In order to end up in this situation, what is needed is that we find the local bitmap has free space
  in the master bitmap (line 341 above) but when we look inside the local bitmap block we find that
  it has no free space.

* I initially tried to see if I can concurrently have a process do a transaction that allocates a block
  and ends up marking this local bitmap as full in the master map (in `bm_update()`). But in that case,
  I noticed that the `mu_swap_root()` function ended up restarting the transaction because it noticed
  the local bitmap block contents changed concurrently (restart code `cdb_sc_bmlmod`).

* After some trial and error, I finally landed on the last link needed in this puzzle: a database
  file extension that happens concurrently.

* It is possible the last local bitmap in a database file could have its status marked as `Full` even
  though it is only a partial bitmap (because the total block count stops midway in the local bitmap).

* When a database file extension happens on this partial last bitmap, it would mark this bitmap from
  `Full` status to having `Free space` (line 561 below).

  **sr_unix/gdsfilext.c**
  ```c
    556         cs_addrs->ti->free_blocks += blocks;
    557         cs_addrs->total_blks = cs_addrs->ti->total_blks = new_total;
    558         blocks = old_total;
    559         if (blocks / bplmap * bplmap != blocks)
    560         {
    561                 bit_set(blocks / bplmap, MM_ADDR(cs_data)); /* Mark old last local map as having space */
  ```

  **sr_unix/mu_swap_root.c**
  ```c
    339         total_blks = csa->ti->total_blks;
      .
    341         master_bit = bmm_find_free((hint_blk_num / BLKS_PER_LMAP), csa->bmm, num_local_maps);
    342         if ((NO_FREE_SPACE == master_bit))
      .
    360     free_bit = bm_find_blk(hint_bit, bmlhist.buffaddr + SIZEOF(blk_hdr), maxbitsthismap, &free_blk_recycled);
    361     free_blk_id = bmlhist.blk_num + free_bit;
  ```

* Given that, the timing of the `mupip reorg -truncate` (in `mu_swap_root.c`) and `mupip extend`
  (in `gdsfilext.c`) processes has to be such that line 339 above in `mu_swap_root.c` takes a note of
  `total_blks` BEFORE line 556 is reached by `gdsfilext.c` but line 341 in `mu_swap_root.c`
  happens AFTER line 561 is done by `gdsfilext.c`.

  When this occurs, `mu_swap_root.c` will incorrectly proceed at line 360 with a `free_bit` value of `-1`
  which would then cause a block from the previous bitmap to be allocated and the commit validation
  logic would not restart the transaction in this situation ending up in database damage.

Fix
---
* The fix is simple and is to check the return value of `bm_find_blk()` (stored in `free_bit` variable)
  for `NO_FREE_SPACE` and if so restart the transaction.
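
The missing check and its effect on the block number can be sketched as follows. This is a simplified model, not the `mu_swap_root.c` code: `find_free_bit` is a hypothetical stand-in for `bm_find_blk()` that works on a byte-per-block array, `pick_free_block` is a hypothetical wrapper, and `NO_FREE_SPACE` is assumed to be `-1` (the value of `free_bit` mentioned above).

```c
#include <assert.h>

#define NO_FREE_SPACE (-1)  /* assumed value, per the free_bit value of -1 noted above */

/* Hypothetical stand-in for bm_find_blk(): index of the first non-busy block at or
 * after "hint" in a local bitmap of "nbits" blocks, or NO_FREE_SPACE if none.
 */
int find_free_bit(const unsigned char *busy, int nbits, int hint)
{
    int i;

    assert((0 <= hint) && (hint <= nbits));
    for (i = hint; i < nbits; i++)
        if (!busy[i])
            return i;
    return NO_FREE_SPACE;
}

/* Returns 1 and sets *free_blk_id on success. Returns 0 (leaving *free_blk_id
 * untouched) when the local bitmap has no free block, which is the caller's cue
 * to restart the transaction.
 */
int pick_free_block(long bml_blk_num, const unsigned char *busy, int nbits, int hint, long *free_blk_id)
{
    int free_bit = find_free_bit(busy, nbits, hint);

    if (NO_FREE_SPACE == free_bit)
        return 0;                       /* the check that was missing */
    *free_blk_id = bml_blk_num + free_bit;
    return 1;
}
```

Without the guard, a full local bitmap at block `0x400` would yield `0x400 + (-1) = 0x3FF`, a block that belongs to the previous local bitmap. With the guard, the caller never sees that value.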

nars1 added a commit that referenced this issue


          [DEBUG-ONLY] Remove incorrect assert in sr_port/tp_restart.c (STATSDB…

64c8296

… restart related)

Background
----------
* While trying to come up with the `v70001/gtm9131` subtest (in the YDBTest project), I encountered
  an assert failure. Running that subtest with the below changes reproduces the assert.

  ```diff
  $ git diff -U1 v70001/inref/gtm9131.m
  diff --git a/v70001/inref/gtm9131.m b/v70001/inref/gtm9131.m
  index 14094d4c..92641556 100644
  --- a/v70001/inref/gtm9131.m
  +++ b/v70001/inref/gtm9131.m
  @@ -23,3 +23,4 @@ gtm9131       ;
          set jobid=1
  -       zsystem:$trestart=0 "$gtm_dist/mumps -run job^gtm9131 "_njobs_" "_jobid
  +       set $zcmdline=njobs_" "_jobid
  +       do job
          tcommit                                ; Commit TP transaction where we expect a TP restart due to concurrent statsdb extension
  ```

* Below is what I saw in the subtest output (with the above change).

  ```
  $ cat gtm9131.log
  .
  .
  # Execute [mumps -run gtm9131] which will create a TPRESTART message due to a statsdb database file extension restart
  %YDB-F-ASSERT, Assert failed in sr_port/tp_restart.c line 288 for expression (IS_STATSDB_REG(restart_reg) ? !memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN) : memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN))
  ```

Issue
-----
* The core file in that case had the following stack trace and interesting variables.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140359692803712) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140359692803712) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140359692803712, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff9c3d9330) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  tp_restart (newlevel=1, handle_errors_internally=1) at sr_port/tp_restart.c:288
  #9  mdb_condition_handler (arg=150376098) at sr_port/mdb_condition_handler.c:345
  #10 rts_error_va (csa=0x0, argcnt=1, var=0x7fff9c3dc110) at sr_unix/rts_error.c:198
  #11 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #12 op_tcommit () at sr_port/op_tcommit.c:514

  (gdb) f 8
  #8  0x00007fa805089766 in tp_restart (newlevel=1, handle_errors_internally=1) at sr_port/tp_restart.c:288
  288        assert(IS_STATSDB_REG(restart_reg)      /* global ^%YGS if, and only if, statsDB */

  (gdb) list
  286          if (NULL != restart_reg)
  287          {
  288                  assert(IS_STATSDB_REG(restart_reg)      /* global ^%YGS if, and only if, statsDB */
  289                         ? !memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN)
  290                         : memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN));
  291                  reg_mstr.len = restart_reg->dyn.addr->fname_len;
  292                  reg_mstr.addr = (char *)restart_reg->dyn.addr->fname;

  (gdb) x/s restart_reg->rname
  0x6220000023e2: "default"

  (gdb) x/s gv_currkey->base
  0x62d000005046: "%jobwait"
  ```

* The assert in lines 288-290 checks that if the region where the restart happened is a statsdb region,
  then the current `$reference` (i.e. `gv_currkey`) should point to `^%YGS`, the global name that maps
  to the statsdb region. And vice versa.

* But in this case, the global name is `^%jobwait` which is not `^%YGS`.

* This is a case where the statsdb region had a tp restart due to a statsdb file extension scenario
  and the restart status code was `cdb_sc_helpedout` (see first occurrence in sr_port/tp_tend.c).
  In this case, the restart did not occur as part of a global reference, but as part of TCOMMIT and so
  we are not guaranteed that the most recent global reference (`gv_currkey`) would point to a statsdb
  global name. Therefore, this assert is not necessarily correct.

* Since restarts are possible in various scenarios other than as part of the current global reference,
  this assert could fail in other scenarios too.

* This assert was introduced in GT.M V6.3-008.
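
The "if and only if" shape of the assert can be sketched in isolation. This is an illustrative model only: `statsdb_assert_holds` is a hypothetical helper, the region check is reduced to a plain flag, and the `STATSDB_GBLNAME` definitions are assumed from the `^%YGS` global name mentioned above.

```c
#include <assert.h>
#include <string.h>

/* Assumed definitions, based on the ^%YGS global name referenced in the text. */
#define STATSDB_GBLNAME     "%YGS"
#define STATSDB_GBLNAME_LEN 4

/* Evaluates the same two-way condition as the assert at lines 288-290:
 * the restart region is a statsdb region if, and only if, the current key
 * starts with the statsdb global name. Returns 1 if the condition holds.
 */
int statsdb_assert_holds(int restart_reg_is_statsdb, const char *currkey_base)
{
    int key_is_statsdb;

    assert(NULL != currkey_base);
    key_is_statsdb = (0 == strncmp(currkey_base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN));
    return restart_reg_is_statsdb ? key_is_statsdb : !key_is_statsdb;
}
```

For the values in the core file (a statsdb restart region with a current key of `%jobwait`), `statsdb_assert_holds(1, "%jobwait")` returns 0, which corresponds to the assert failure.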

Fix
---
* I am not sure the assert serves much purpose, so I am removing this inaccurate assert in this commit.

nars1 added a commit that referenced this issue


          [YDBTest#550] Fix MUPIP REORG to prevent MUPIP STOP from causing KILL…

57006c9

…ABANDONED (fixes GTM-9400 for real)

Background
----------
The below is pasted from https://gitlab.com/YottaDB/DB/YDBTest/-/issues/550#note_1733171439

* While trying to test YDBTest#550, I noticed that the KILLABANDONED error happens even with V7.0-001
  whereas the GTM-9400 release note in GT.M V7.0-001 indicates this as being fixed in V7.0-001.

* Below is the test case (using `tcsh`, not `sh`) that stops after a few seconds with a `KILLABANDONED`
  error with V7.0-000 as well as with V7.0-001.

  ```sh
  cat > kill.csh << CAT_EOF
  while (1)
          k15 reorg
          if (-e STOP) then
                  break
          endif
  end
  CAT_EOF

  rm -f STOP
  unsetenv ydb_gbldir
  setenv gtmgbldir mumps.gld
  rm -f mumps.gld mumps.dat
  $gtm_dist/mumps -run GDE exit
  $gtm_dist/mupip create
  $gtm_dist/mumps -run %XCMD 'for i=1:1:100000 set ^x(i)=$j(i,200)'
  source kill.csh &
  while (1)
          foreach fillfactor (50 10 90)
                  $gtm_dist/mupip reorg -fill=$fillfactor -region DEFAULT
                  $gtm_dist/mupip integ -reg "*"
                  if ($status) then
                          touch STOP
                          break
                  endif
          end
          if (-e STOP) then
                  break
          endif
  end
  ```

Issue
-----
* I added an assert in `secshr_db_clnup.c` where it invoked the `INCR_ABANDONED_KILLS` macro and ran
  the above test to see the cause of the above `KILLABANDONED` error.

* With that change, I got an assert failure (in line 446 below) and the core file showed the below
  stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140185231378240, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcb98a5b50) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446
  #9  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:124
  #10 signal_exit_handler (exit_handler_name=0x7f7f6ac987be "deferred_exit_handler", sig=15, info=0x7f7f6ae3ddf8 <stapi_signal_handler_oscontext+3320>, context=0x7f7f6ae3de78 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #11 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #12 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813
  #14 mu_reorg (gl_ptr=0x5617ac865bc0, exclude_glist_ptr=0x7ffcb98aa450, resume=0x7ffcb98aa344, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:572
  #15 mupip_reorg () at sr_port/mupip_reorg.c:334
  #16 mupip_main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip_main.c:117
  #17 dlopen_libyottadb (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948, main_func=0x5617ab37c004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #18 main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip.c:22

  (gdb) f 8
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446
  446                assert(!mu_reorg_process);

  (gdb) list
  441             } else if (!dollar_tlevel)
  442             {
  443                     if ((NULL != kip_csa) && (csa == kip_csa))
  444                     {
  445                             /* Assert that MUPIP REORG never leaves the database with an abandoned kill */
  446                             assert(!mu_reorg_process);
  447                             assert(0 < kip_csa->hdr->kill_in_prog);
  448                             DECR_KIP(csd, csa, kip_csa);
  449                             INCR_ABANDONED_KILLS(csd, csa);
  450                     }

  (gdb) f 13
  #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813
  1813            REVERT; /* no need for t_ch to be invoked if any errors occur after this point */
  ```

* Below is the macro sequence by which the `REVERT` in frame 13 ends up in the frame 12 `deferred_signal_handler` call.

  ```
  REVERT -> ENABLE_INTERRUPTS -> DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED -> deferred_signal_handler
  ```

* GT.M V7.0-001 fixed GTM-9400 by adding a `DEFERRED_EXIT_REORG_CHECK` macro at logical points in
  the mupip reorg code flow where we are guaranteed the kill-in-progress condition has been cleared
  (i.e. DECR_KIP has been called).

  ```diff
  $ git show -U1 tags/V7.0-001 sr_port/mu_reorg.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK
  @@ -449,2 +449,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                                          DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                                                       DEFERRED_EXIT_REORG_CHECK;
                                                          if (detailed_log)
  --
  @@ -579,2 +580,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                                  DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                                               DEFERRED_EXIT_REORG_CHECK;
                                                  if (detailed_log)
  @@ -677,2 +679,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                  DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                               DEFERRED_EXIT_REORG_CHECK;
                                  if (detailed_log)

  $ git show -U1 tags/V7.0-001 sr_unix/mu_swap_root.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK
  @@ -271,2 +248,3 @@ void        mu_swap_root(glist *gl_ptr, int *root_swap_statistic_ptr)
          }
  +       DEFERRED_EXIT_REORG_CHECK;      /* a single directory tree has to be quick, so check at end, rather than each DECR_KIP  */
          return;
  ```

* But what it did not realize is that even before those logical points are reached, it is possible
  for `t_end.c` to invoke the `REVERT` macro which in turn would invoke `deferred_signal_handler` like
  is seen in the above stack trace.

* Not sure how this did not get caught during the GT.M testing.

Fix
---
* In any case, the fix is simple and is to enhance `sr_port/deferred_signal_handler.c` to not invoke
  `deferred_exit_handler()` but instead `return` in case we are a `mupip reorg` process (indicated by
  the boolean_t typed `mu_reorg_process` global variable being TRUE) and we are in the middle of a
  kill-in-progress (indicated by `cs_data->kill_in_prog` being TRUE).

* This way, we delay the deferred signal handling of the `MUPIP STOP` (aka `SIG-15`/SIGTERM) a little
  more until the logical point in `sr_port/mu_reorg.c` or `sr_unix/mu_swap_root.c` is reached where
  the `DEFERRED_EXIT_REORG_CHECK` macro is invoked.
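
The intended control flow can be modelled with a few flags. This is a sketch under assumptions, not the actual `deferred_signal_handler.c` change: `struct reorg_state`, `deferred_signal_handler_model` and `decr_kip_then_check_model` are hypothetical, and the second function only approximates a `DECR_KIP` followed by `DEFERRED_EXIT_REORG_CHECK`.

```c
#include <assert.h>

/* Hypothetical model of the globals involved in the decision. */
struct reorg_state
{
    int mu_reorg_process;   /* non-zero if this is a mupip reorg process */
    int kill_in_prog;       /* non-zero while a kill-in-progress is outstanding */
    int forced_exit;        /* non-zero once a MUPIP STOP has been deferred */
    int exited;             /* set when exit handling actually runs */
};

/* Model of the fixed handler: a reorg process with a kill in progress
 * returns without running exit handling, leaving forced_exit pending.
 */
void deferred_signal_handler_model(struct reorg_state *s)
{
    if (!s->forced_exit)
        return;
    if (s->mu_reorg_process && s->kill_in_prog)
        return;             /* defer further, to avoid an abandoned kill */
    s->exited = 1;
}

/* Model of the logical point in the reorg code: the kill-in-progress is
 * cleared first, and only then is the pending exit re-checked.
 */
void decr_kip_then_check_model(struct reorg_state *s)
{
    assert(0 < s->kill_in_prog);
    s->kill_in_prog--;
    deferred_signal_handler_model(s);
}
```

In this model, a call that arrives while `kill_in_prog` is non-zero (as via `REVERT` in `t_end.c`) leaves `exited` at 0, and the exit is handled only once the kill-in-progress has been cleared.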

nars1 added a commit that referenced this issue


          [YDBTest#550] Fix assert in deferred_signal_handler.c (broken due to …

02571cf

…GTM-9400 in V7.0-001)

Background
----------
* The `v70001/gtm9400` subtest failed in one rare in-house run with the following symptom.

  ```diff
  > ##TEST_PATH##/v70001_0_4/gtm9400/reorg_1_90.out
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 40 for expression (INTRPT_OK_TO_INTERRUPT == intrpt_ok_state)
  ```

* Below are the relevant details from the core file created by the `mupip reorg` process.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffe32ae6b80) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:40
  #7  t_end (hist1=0x7ffe32ae9d10, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:655
  #8  gvcst_bmp_mark_free (ks=0x7ffe32aeaa10) at sr_port/gvcst_bmp_mark_free.c:214
  #9  mu_reorg (gl_ptr=0x14562c0, exclude_glist_ptr=0x7ffe32aeb368, resume=0x7ffe32aeb3bc, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:452
  #10 mupip_reorg () at sr_port/mupip_reorg.c:334
  #11 mupip_main (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878) at sr_unix/mupip_main.c:117
  #12 dlopen_libyottadb (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878, main_func=0x4015f4 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #13 main (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:40
  40              assert(INTRPT_OK_TO_INTERRUPT == intrpt_ok_state);      /* DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED ensures this */
  (gdb) p forced_exit
  $2 = 1
  (gdb) p mu_reorg_process
  $3 = 1
  (gdb) p cs_data->kill_in_prog
  $4 = 1
  (gdb) p intrpt_ok_state
  $5 = INTRPT_IN_KILL_CLEANUP
  ```

Issue
-----
* The reorg process that failed the assert had been sent a `mupip stop` by the `gtm9400` subtest.

* From the failure symptoms noted above, what happened is that the `mupip stop` got delivered AFTER
  line 321 below (when `intrpt_ok_state` was `INTRPT_OK_TO_INTERRUPT`) but before line 322.

  **sr_port/have_crit.h**
  ```c
      314 /* Restore deferrable interrupts back to the state it was at time of corresponding DEFER_INTERRUPTS call */
      315 #define ENABLE_INTERRUPTS(OLDSTATE, NEWSTATE)                                                   \
      316 {                                                                                               \
      317         if (!multi_thread_in_use)                                                               \
      318         {                                                                                       \
      319                 assert(OLDSTATE == intrpt_ok_state);                                            \
      320                 intrpt_ok_state = NEWSTATE;                                                     \
      321                 if (INTRPT_OK_TO_INTERRUPT == intrpt_ok_state)                                  \
      322                         DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED;                                 \
      323                                 /* check if signals were deferred in deferred zone */           \
      324         }                                                                                       \
      325 }
  ```

* And as part of processing the `mupip stop`, the below code (newly introduced as part of incorporating
  GTM-9400 in GT.M V7.0-001 in commit 5ae98ef) set `intrpt_ok_state` to `INTRPT_IN_KILL_CLEANUP`.

  **sr_unix/generic_signal_handler.c**
  ```c
      320                                 /* If nothing pending AND we have crit or in wcs_wtstart() or already in exit processing, wait to
      321                                  * invoke shutdown. wcs_wtstart() manipulates the active queue that a concurrent process in crit
      322                                  * in bt_put() might be waiting for. interrupting it can cause deadlocks (see C9C11-002178).
      323                                  */
      324                                 if (mu_reorg_process && OK_TO_INTERRUPT && cs_data && cs_data->kill_in_prog)
      325                                         DEFER_INTERRUPTS(INTRPT_IN_KILL_CLEANUP, prev_intrpt_state);    /* avoid ABANDONEDKILL */
  ```

* Therefore, when the `DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED` macro (in line 322 of the `ENABLE_INTERRUPTS`
  macro) was invoked, `intrpt_ok_state` had a value that was not `INTRPT_OK_TO_INTERRUPT` and so failed
  the assert at line 40 below.

  **sr_port/deferred_signal_handler.c**
  ```c
       40         assert(INTRPT_OK_TO_INTERRUPT == intrpt_ok_state);      /* DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED ensures this */
  ```

Fix
---
* Commit 57006c9 fixed `GTM-9400` for real (see commit message there for detail).

* All that is needed is for the failing assert to account for this new possibility. And that is done
  by adding a `||` condition in this commit.
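
A minimal sketch of what such a relaxed assert could look like. The enum is a simplified stand-in for the real `intrpt_state_t`: only `INTRPT_OK_TO_INTERRUPT` and `INTRPT_IN_KILL_CLEANUP` come from this commit message, while `INTRPT_IN_OTHER_ZONE_SKETCH` and both function names are hypothetical. The exact condition added by the commit is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the real intrpt_state_t enum. */
typedef enum
{
    INTRPT_OK_TO_INTERRUPT,
    INTRPT_IN_KILL_CLEANUP,
    INTRPT_IN_OTHER_ZONE_SKETCH     /* hypothetical: any other deferred zone */
} intrpt_state_sketch_t;

/* Hypothetical helper: the condition the relaxed assert is assumed to check. */
static bool deferred_handler_state_ok(intrpt_state_sketch_t state)
{
    return (INTRPT_OK_TO_INTERRUPT == state) || (INTRPT_IN_KILL_CLEANUP == state);
}

/* Stand-in for the entry of the deferred signal handler. */
static void deferred_signal_handler_sketch(intrpt_state_sketch_t state)
{
    assert(deferred_handler_state_ok(state));
    /* ... deferred signal processing would follow here ... */
}
```

With this shape, the state set by the `mupip stop` processing is accepted, while any other deferred zone would still trip the assert in a Debug build.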

nars1 added a commit that referenced this issue


          [#863] [V70002] Fix r130/ydb560 subtest failure (assert failed in outofband_clear.c)

Background
----------
* After GT.M V7.0-002 changes were merged, the `r130/ydb560` subtest started failing with the
  following symptom.

  ```
  %YDB-F-ASSERT, Assert failed in sr_port/outofband_clear.c line 43 for expression (TRUE == status)
  ```

* A simple way to reproduce this issue is to run the following and in a parallel terminal send
  a `kill -4` to the `mumps` process (that is stuck in the `hang` command).

  ```sh
  $ cat test.m
   set x=1
   hang 100

  $ mumps -run test
  ```

* Before the V7.0-002 merge, one would see just 1 core file (due to the `kill -4`). But after the
  merge, one would see 3 core files. And the 2nd core file had the following stack trace.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140112165532736, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd024690b0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  #9  outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:58
  #10 async_action (lnfetch_or_start=false) at sr_port/deferred_events.c:394
  #11 lvzwr_var (lv=0x60f0000005f0, n=0) at sr_port/lvzwr_var.c:184
  #12 lvzwr_fini (out=0x7ffd02471dc0, t=1) at sr_port/lvzwr_fini.c:84
  #13 op_lvpatwrite (count=0, arg1=140724641668224) at sr_port/op_lvpatwrite.c:85
  #14 zshow_zwrite (output=0x7ffd02471dc0) at sr_port/zshow_zwrite.c:40
  #15 op_zshow (func=0x7ffd0247a0e0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
  #16 jobexam_dump (dump_filename_arg=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fatal_file_name_buff=0x7ffd0247ae20 "/extra4/testarea1/nars/V998/tst_V998_R201_dbg_28_240320_111309/r130_0/ydb560/YDB_FATAL_ERROR.ZSHOW_DMP_89246_1.txt", fmt=0x0, dev_in_use=0x7ffd0247a240) at sr_port/jobexam_process.c:238
  #17 jobexam_process (dump_file_name=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fmt=0x0) at sr_port/jobexam_process.c:147
  #18 create_fatal_error_zshow_dmp (signal=4) at sr_port/create_fatal_error_zshow_dmp.c:66
  #19 signal_exit_handler (exit_handler_name=0x7f6e64c43140 "deferred_exit_handler", sig=4, info=0x7f6e6519f938 <stapi_signal_handler_oscontext+3320>, context=0x7f6e6519f9b8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:59
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #23 async_action (lnfetch_or_start=true) at sr_port/deferred_events.c:380
  #24 l1 () at sr_x86_64/op_startintrrpt.s:40

  (gdb) f 8
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  43                      assert(TRUE == status);

  (gdb) list
  41              {
  42                      status = xfer_reset_if_setter(outofband);
  43                      assert(TRUE == status);
  44              }
  45      }

  (gdb) p outofband
  $1 = 11

  (gdb) p (enum outofbands)outofband
  $2 = deferred_signal
  ```

Issue
-----
* The issue was that `xfer_reset_if_setter()` had been reworked in GT.M V7.0-002. As a result, the
  `deferred_signal` type of outofband (which is a YottaDB-only value, unknown to the GT.M code base)
  was no longer handled correctly.

* The reason why `xfer_reset_if_setter()` returned FALSE in line 42 above is that the `event_state`
  for the `deferred_signal` event_type at line 249 below was `pending`, not `active`, and so the call
  at line 250 got skipped. That call would have done the real reset that was needed.

  **sr_port/deferred_events.c**
  ```c
    212 boolean_t xfer_reset_if_setter(int4 event_type)
      .
    249     if (res = (active == TAREF1(save_xfer_root, event_type).event_state))   /* WARNING: assignment */
    250             res = (real_xfer_reset(event_type));
  ```

Fix
---
* The fix was to set the event_state for `deferred_signal` outofband to `active` in `deferred_signal_set()`
  just like it is done for `jobinterrupt` outofband in `jobinterrupt_set()`.

* After this change though, an assert in line 370 below (in the `async_action()` function) failed.

  **sr_port/deferred_events.c**
  ```c
    350 void async_action(bool lnfetch_or_start)
      .
    358         if (jobinterrupt == outofband)
    359         {
      .
    367                 TAREF1(save_xfer_root, jobinterrupt).event_state = pending;     /* jobinterrupt gets a pass from the assert below */
    368         } else if (!lnfetch_or_start)
    369         {       /* something other than a new line caugth this, so  */
    370                 assert(pending >= TAREF1(save_xfer_root, outofband).event_state);
    371                 TAREF1(save_xfer_root, outofband).event_state = pending;        /* make it pending in case it was not there yet */
    372         }
  ```

  `jobinterrupt` gets special handling in line 367, so `deferred_signal` now gets special handling
  as well. The special handling is different, though: for the `deferred_signal` case we do not modify
  the `event_state` (as is done for `jobinterrupt` in line 367 above); we just skip lines 370-371.

* With the changes in the above 2 bullets, the simple test case shown above started working fine in that
  it only generated 1 core file (not 3 core files).
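
The two changes can be modeled with a small standalone sketch. Everything here is a simplification: the enum names, their ordering, and the function names are hypothetical stand-ins for the real `event_state` values, `xfer_reset_if_setter()`, and `async_action()`, and the state is passed as a plain pointer rather than through `TAREF1(save_xfer_root, ...)`.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified event states; the ordering (pending < active) is an assumption. */
typedef enum { ev_not_in_play, ev_pending, ev_active } event_state_sketch_t;
typedef enum { ev_jobinterrupt, ev_deferred_signal, ev_other } event_type_sketch_t;

/* Models lines 249-250: the real reset only runs if the event is active. */
static bool reset_if_setter_sketch(event_state_sketch_t *state)
{
    if (ev_active != *state)
        return false;           /* pending (or idle) event: nothing is reset */
    *state = ev_not_in_play;    /* stands in for real_xfer_reset() */
    return true;
}

/* Models lines 358-372 with the extra deferred_signal case described above. */
static void async_action_sketch(event_type_sketch_t type, bool lnfetch_or_start,
                                event_state_sketch_t *state)
{
    if (ev_jobinterrupt == type)
        *state = ev_pending;    /* jobinterrupt: state is forced to pending */
    else if (ev_deferred_signal == type)
        return;                 /* deferred_signal: leave the state untouched */
    else if (!lnfetch_or_start)
    {
        assert(ev_pending >= *state);
        *state = ev_pending;
    }
}
```

In this model, a `deferred_signal` event left in the `pending` state makes the reset return false (the original assert failure), whereas an `active` event survives `async_action_sketch()` unchanged and is then reset successfully.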

nars1 added a commit that referenced this issue


          [#936] [V70004] Address merge conflicts involving deleted files

c660618

* This commit addresses merge conflicts involving deleted files.

* The list of deleted files was identified using the following commands as part of the prior
commit (i.e. after the `git cherry-pick` command was run but before the `git commit` command
was run).

```sh
$ git status | grep 'deleted'
deleted by us: README
deleted by us: sr_unix/Makefile.mk
deleted by us: sr_unix/gtm_logicals.h
deleted by us: sr_unix/gtm_tls_impl.c
```

* All files that show up as `deleted by us:` are deleted in this commit since they had already been
deleted in the YDB git repository. That is a straightforward change. But each of these files
needed to be reviewed to see if the GT.M changes to the file need to be incorporated somewhere
else in the YDB git repository. The results of that review are described below.

- README : This file only has cosmetic changes in every GT.M release. No need to incorporate this
into the README.md file (corresponding file in the YDB repository).

- sr_unix/Makefile.mk : The GT.M changes to this file have been incorporated into `YDBEncrypt/Makefile`
which is where the YottaDB version of this file lies. There was a failure in `Hunk #2` (out of a
total of 5 Hunks). This failure was manually resolved by picking the GT.M change and then
incorporating the pre-existing YottaDB change of `gtm_tls_interface.h` -> `ydb_tls_interface.h`.

- sr_unix/gtm_logicals.h : GT.M added a `GTM_DB_CREATE_VER` macro which in turn held the value
`$gtm_db_create_ver`. This was incorporated into YottaDB side by adding a new env var line in
`sr_port/ydb_logicals_tab.h` with the index name `YDBENVINDX_DB_CREATE_VER`. The env var names
in the YottaDB and GT.M side for this are respectively `ydb_db_create_ver` and `gtm_db_create_ver`.

In `sr_unix/gtm_env_init_sp.c`, replaced logic that was introduced in the GT.M side which used the
`GTM_DB_CREATE_VER` macro to instead use `ydb_trans_log_name()` with `YDBENVINDX_DB_CREATE_VER` and
allow for the values `6` or `V6` to imply a `V6` format database. I did think about allowing a value
of `1` or `R1` for the `ydb_db_create_ver` env var since YottaDB releases have r1.x or r2.x numbering
but decided the extra effort is not worth it as it might confuse the user more than help.

While at this, noticed that the line for `ydb_dollartest` was not in sorted order like the rest of
the lines. So fixed it by moving it to the sorted position.

- sr_unix/gtm_tls_impl.c : The GT.M changes to this file have been incorporated into
`YDBEncrypt/gtm_tls_impl.c` which is where the YottaDB version of this file lies. There were 12 Hunks
out of which there was a failure in `Hunk #5`, `Hunk #6` and `Hunk #12`.
- `Hunk #5` was related to `Initialize OpenSSL library`. The GT.M side had added a lot of logic inside
a `#if OPENSSL_VERSION_NUMBER >= 0x10100000L` define. But the YottaDB side which had added TLS 3
support in a prior commit had a `#if OPENSSL_VERSION_MAJOR < 3` logic in a smaller code block. It
was not clear which was a better approach but to keep conflicts to a minimum, I picked the GT.M
approach and discarded the YottaDB side of the conflict.
- `Hunk #6` was related to GT.M moving a call to `gc_load_gtmshr_symbols()` to a few lines earlier.
This call was replaced by `gc_load_yottadb_symbols()` call in a prior YottaDB commit unrelated to
the TLS 3 support changes so that YottaDB change was retained but the GT.M move of the call was
also picked. GT.M also added a `SSL_CTX_new(TLS_method())` call and that was picked as well.
- `Hunk #12` was related to GT.M side adding a `#if OPENSSL_VERSION_NUMBER >= 0x10101000L` check in
a `switch/case` block around a `case TLS1_3_VERSION:` in the `gtm_tls_get_conn_info()` function. But
this switch/case block was removed in a prior YottaDB commit and replaced with a `SSL_get_version()`
call and so the GT.M side of this conflict was discarded.

nars1 added a commit that referenced this issue


          [#963] [V70005] Fix assert failure in MUPIP INTEG (jnl_file_close_timer call in ss_initiate)

7af2c3a

Background
----------
* When run with a Debug build of YottaDB, a `mupip integ` failed as follows.

  ```
  %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 504 for expression ((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state) || (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state))
  ```

Issue
-----
* The issue is that GT.M V7.0-005 added a `START_JNL_FILE_CLOSE_TIMER_IF_NEEDED` call to
  the `SET_SNAPSHOTS_IN_PROG` macro as can be seen in the below diff.

  ```diff
  $ git show -U0 tags/V7.0-005 sr_port/gdsfhead.h
  @@ -4187 +4187 @@ MBSTART {
  -#define SET_SNAPSHOTS_IN_PROG(X)	((X)->snapshot_in_prog = TRUE)
  +#define SET_SNAPSHOTS_IN_PROG(X)	MBSTART { (X)->snapshot_in_prog = TRUE; START_JNL_FILE_CLOSE_TIMER_IF_NEEDED; } MBEND
  ```

* This caused the assert failure in the following code (which is present only in YottaDB, not in GT.M).

  **sr_unix/gt_timers.c**
  ```c
    501         } else if (jnl_file_close_timer_fptr == handler)
    502         {       /* Account for known instances of the above function being called from within a deferred zone. */
    503                 assert((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state)
    504                         || (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state));
    505                 safe_to_add = TRUE;
  ```

* Below are details from the gdb analysis of the core file.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140357926831936) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140357926831936) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140357926831936, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff19e828f0) at sr_unix/rts_error.c:199
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:100
  #8  start_timer (tid=140357912091280, time_to_expir=60000000000, handler=0x7fa79f7dda90 <jnl_file_close_timer>, hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:503
  #9  ss_initiate (reg=0x55b37821b170, util_ss_ptr=0x55b37821a9c0, ss_ctx=0x55b378219110, preserve_snapshot=0, calling_utility=0x7fa7a03c8506 "MUPIP INTEG") at sr_unix/ss_initiate.c:666
  #10 mu_int_reg (reg=0x55b37821b170, return_value=0x7fff19e8560c, return_after_open=0) at sr_port/mu_int_reg.c:192
  #11 mupip_integ () at sr_port/mupip_integ.c:438
  #12 mupip_main (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630) at sr_unix/mupip_main.c:130
  #13 dlopen_libyottadb (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630, main_func=0x55b37723f004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630) at sr_unix/mupip.c:21

  (gdb) f 8
  #8  start_timer (tid=140357912091280, time_to_expir=60000000000, handler=0x7fa79f7dda90 <jnl_file_close_timer>, hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:503
  503                     assert((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state)

  (gdb) p intrpt_ok_state
  $1 = INTRPT_IN_SS_INITIATE
  ```

Fix
---
* Since `INTRPT_IN_SS_INITIATE` is now a known and expected state at this point, it is added as an
  accepted value in the assert.

nars1 added a commit that referenced this issue


          Fix rare double free in MUPIP BACKUP in case of MUPIP STOP

454f2e7

Background
----------
* In internal testing, we noticed a rare failure in the `v51000/mu_bkup_stop` subtest
  where a `mupip backup` process that was sent a `SIGTERM` (by the test) ended up
  creating a core file because ASAN detected a double free and aborted.

* Below are relevant details from the core file.

  ```
  Core was generated by `mupip backup -online -dbg * ./49181_online1'.
  Program terminated with signal SIGSEGV, Segmentation fault.

  (gdb) where
  #0  ydb_os_signal_handler (sig=11, info=0x7fd09968c3f0, context=0x7fd09968c2c0) at sr_unix/ydb_os_signal_handler.c:57
  #1  <signal handler called>
  #2  ydb_os_signal_handler (sig=6, info=0x7fd09968caf0, context=0x7fd09968c9c0) at sr_unix/ydb_os_signal_handler.c:57
  #3  <signal handler called>
  #4  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
  #5  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
  #7  __GI_abort () at ./stdlib/abort.c:79
  #8  __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:143
  #9  __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
  #10 __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffda6de6ebe, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:190
  #11 __asan::ReportDoubleFree (addr=140533757257728, free_stack=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:224
  #12 __asan::Allocator::ReportInvalidFree (this=<optimized out>, stack=0x7ffda6de79f0, chunk_state=<optimized out>, ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:757
  #13 __interceptor_free (ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53
  #14 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1485
  #15 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #16 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #18 mupip_backup_call_on_signal () at sr_port/mupip_backup.c:208
  #19 signal_exit_handler (exit_handler_name=0x7fd097f1dda0 "deferred_exit_handler", sig=15, info=0x7fd098480fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fd098481058 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:67
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #26 mupip_backup () at sr_port/mupip_backup.c:1585
  #27 mupip_main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip_main.c:130
  #28 dlopen_libyottadb (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50, main_func=0x55af49fd9020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip.c:21

  (gdb) f 25
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) f 17
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) down
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  1501            gtm_free_main(addr, TAIL_CALL_LEVEL);

  (gdb) down
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  854                     system_free(addr);

  (gdb) down
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);

  (gdb) list
  1481    {
  1482            intrpt_state_t  prev_intrpt_state;
  1483
  1484            DEFER_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1485            free(addr);
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1487            return;
  1488    }
  ```

Issue
-----
* We did a `free(ptr->backup_hdr)` at line 103. And that in turn ended up using the system `free()`
  function because the test framework had randomly set the `gtmdbglvl` env var to a value of
  `0x80000000`.

* So at line 1485 above, the system free finished but at line 1486 we noticed the SIGTERM that was
  deferred and so decided to handle it. But the `ptr->backup_hdr` variable was still set to a
  non-NULL value so as part of the deferred exit handler, we tried to free this again resulting
  in the double free.

Fix
---
* The fix is to save `ptr->backup_hdr` in a local variable, clear the former, and then attempt
  the `free()` on the local variable. This way if we decide to do deferred exit handling after the
  `free()` occurred, we will notice a NULL value of `ptr->backup_hdr` and so avoid the double free.
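
The pattern described above can be sketched as follows. This is an illustration only: `backup_entry_sketch`, `release_backup_hdr()` and `free_calls` are hypothetical names, and the plain libc `free()` stands in for the `gtm_free()` path seen in the stack trace.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the structure that owns backup_hdr. */
typedef struct
{
    void *backup_hdr;
} backup_entry_sketch;

static int free_calls;  /* counts real frees, for illustration only */

/* Save the pointer locally, clear the shared field, and only then free it.
 * A cleanup routine that re-enters after the free() (for example from a
 * deferred exit handler) sees NULL and does nothing.
 */
static void release_backup_hdr(backup_entry_sketch *ptr)
{
    void *hdr = ptr->backup_hdr;

    if (NULL == hdr)
        return;                 /* already released (or never allocated) */
    ptr->backup_hdr = NULL;     /* clear before freeing */
    free(hdr);
    free_calls++;
}
```

Calling `release_backup_hdr()` a second time on the same entry is a no-op, which is the property the deferred exit handler relies on.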

Notes
-----
* This is considered too rare a race condition to be encountered in practice and so it is not
  expected to be noticed by a user. Therefore no YDB issue is created for this.

nars1 added a commit that referenced this issue


          [#1084] Fix ASAN heap-buffer-overflow error in ydb_shebang.c

7fe1c16

Background
----------
* This is an issue in c087690 that was found by YDBTest@a8eaadc4 (YDBTest!2075).

* When YottaDB was built with ASAN, running the `r202/shebang-ydb1084` subtest gave the
  following error.

  ```
  =================================================================
  ==29909==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020000001b6 at pc 0x559522eb8f9d bp 0x7ffd2cb5aa60 sp 0x7ffd2cb5a1e8
  READ of size 7 at 0x6020000001b6 thread T0
      #0 0x559522eb8f9c in printf_common(void*, char const*, __va_list_tag*) asan_interceptors.cpp.o
      #1 0x7f9d29cc83d3 in gtm_snprintf sr_unix/gtm_stdio.c:102:2
      #2 0x7f9d2ac7629a in ydb_shebang sr_port/ydb_shebang.c:186:3
      #3 0x7f9d29e41973 in jobchild_init sr_unix/jobchild_init.c:138:16
      #4 0x7f9d29cc5db5 in gtm_startup sr_unix/gtm_startup.c:287:3
      #5 0x7f9d29e04dd8 in init_gtm sr_unix/init_gtm.c:203:2
      #6 0x7f9d29b62e82 in gtm_main sr_unix/gtm_main.c:194:2
      #7 0x559522f55eee in dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #8 0x559522f54f1d in main sr_unix/gtm.c:20:9
      #9 0x7f9d2efa1d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
      #10 0x7f9d2efa1e3f in __libc_start_main csu/../csu/libc-start.c:392:3
      #11 0x559522e97344 in _start (/usr/library/V998_R201/dbg/yottadb+0x1e344) (BuildId: a420a9112acf24ed4fac1e0bcc38f1947ddde8e8)

  0x6020000001b6 is located 0 bytes to the right of 6-byte region [0x6020000001b0,0x6020000001b6)
  allocated by thread T0 here:
      #0 0x559522f1a18e in __interceptor_malloc (/usr/library/V998_R201/dbg/yottadb+0xa118e) (BuildId: a420a9112acf24ed4fac1e0bcc38f1947ddde8e8)
      #1 0x7f9d2a42cd8f in system_malloc sr_port/gtm_malloc_src.h:1470:9
      #2 0x7f9d2a4274f9 in gtm_malloc_main sr_port/gtm_malloc_src.h:673:10
      #3 0x7f9d2a436de6 in gtm_malloc sr_port/gtm_malloc_src.h:1496:9
      #4 0x7f9d2b4d28f2 in zro_load sr_unix/zro_load.c:388:42
      #5 0x7f9d2ac7d169 in zro_init sr_port/zro_init.c:95:2
      #6 0x7f9d2b4d301f in zro_search sr_unix/zro_search.c:79:3
      #7 0x7f9d2ac75b53 in ydb_shebang sr_port/ydb_shebang.c:152:2
      #8 0x7f9d29e41973 in jobchild_init sr_unix/jobchild_init.c:138:16
      #9 0x7f9d29cc5db5 in gtm_startup sr_unix/gtm_startup.c:287:3
      #10 0x7f9d29e04dd8 in init_gtm sr_unix/init_gtm.c:203:2
      #11 0x7f9d29b62e82 in gtm_main sr_unix/gtm_main.c:194:2
      #12 0x559522f55eee in dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #13 0x559522f54f1d in main sr_unix/gtm.c:20:9
      #14 0x7f9d2efa1d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  ```

Issue
-----
* `TREF(dollar_zroutines)` is an mstr. That is, it has `.addr` and `.len` fields. It is not a
  null-terminated string. But it was used as one in the following line, where the last `%s` in the
  `SNPRINTF` call corresponds to it.

  **sr_port/ydb_shebang.c**
  ```c
  186     SNPRINTF(new_zro, space_needed, "%s(%s) %s", buf, rtn_path, (TREF(dollar_zroutines)).addr);
  ```

* This caused `snprintf` to treat it as a null-terminated string and read past the end of the
  allocated buffer while looking for the terminating null byte. And that caused ASAN to fail with
  the `heap-buffer-overflow` error.

Fix
---
* The fix is simple and is to not treat it as a null-terminated string in the `snprintf` call.
  And that is easily achieved by using a `%.*s` instead of a `%s` (it now requires passing the
  length in addition to the char pointer).
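
A standalone sketch of the `%.*s` technique, assuming a simplified `mstr_sketch` type with only `len` and `addr` (the real `mstr` may carry other fields) and a hypothetical `format_zro()` wrapper around the standard `snprintf()`:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Simplified stand-in for mstr: a length plus a pointer, no null terminator. */
typedef struct
{
    int  len;
    char *addr;
} mstr_sketch;

/* Hypothetical wrapper: formats "<buf>(<rtn_path>) <zro>" where zro is NOT
 * null-terminated. The precision passed via "%.*s" (an int) caps how many
 * bytes snprintf reads from zro->addr, so it never scans past the buffer.
 */
static int format_zro(char *out, size_t out_size, const char *buf,
                      const char *rtn_path, const mstr_sketch *zro)
{
    return snprintf(out, out_size, "%s(%s) %.*s", buf, rtn_path, zro->len, zro->addr);
}
```

Note that the precision argument must be of type `int`, so a length stored in a wider or unsigned type would need a cast.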
