Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Enhancement) A way for application code to determine whether it is running on YottaDB or GT.M #2

Closed
YottaDB opened this issue Jul 3, 2017 · 0 comments

Comments

@YottaDB
Copy link
Collaborator

YottaDB commented Jul 3, 2017

Final Release Note

The $zyre[lease] intrinsic special variable contains a string value that application code can use to determine the YottaDB release it is running on. $zyrelease is a space delimited string with four pieces as follows:

<product> <release> <os> <architecture>

  • <product> is always "YottaDB"
  • <release> always begins with an "r" and is a number structured as follows:
    • The number before the decimal point (".") is the major release number.
    • A decimal point.
    • The first digit after the decimal point is a minor release number.
    • The second (and last) digit after the decimal point is a bug fix level (so "r3.21" would the first bug fix level of the "r3.20" release).
  • <os> is the operating system, e.g., "Linux"
  • <architecture> is the underlying CPU architecture, e.g., "x86_64"

$zyrelease is a read-only intrinsic special variable. As the code generator treats $zyrelease as a string constant known at compile time, and optimizes accordingly, ensure that you run object code only on the same YottaDB release on which you compiled it.

Summary of need

Application code needs a way to determine whether it is running on YottaDB or GT.M. Each YottaDB release is built on a GT.M version, and each YottaDB release should be upward compatible with the underlying GT.M version. As application code may check $zversion, either the entire string or pieces of it, YottaDB cannot change $zversion.

Draft release note of proposed solution

The $zyre[lease] intrinsic special variable contains a string value that application code can use to determine the YottaDB release it is running on. $zyrelease is a space delimited string with four pieces as follows:

<product> <release> <os> <architecture>

  • <product> is always "YottaDB"
  • <release> always begins with an "r" and is a number structured as follows:
    • The number before the decimal point (".") is the major release number.
    • A decimal point.
    • The first digit after the decimal point is a minor release number.
    • The second (and last) digit after the decimal point is a bug fix level (so "r3.21" would the first bug fix level of the "r3.20" release).
  • <os> is the operating system, e.g., "Linux"
  • <architecture> is the underlying CPU architecture, e.g., "x86_64"

$zyrelease is a read-only intrinsic special variable. As the code generator treats $zyrelease as a string constant known at compile time, and optimizes accordingly, ensure that you run object code only on the same YottaDB release on which you compiled it.

@YottaDB YottaDB closed this as completed Jul 5, 2017
nars1 referenced this issue in nars1/YottaDB Sep 25, 2017
…h an external filter followed by a deactivate and shutdown

Release Note
-------------
The replication source server terminates normally in case it was started with an external filter and later
transitioned from an active to a passive source server before being asked to shut down. Previously, the source
server would terminate abnormally with a SIG-11. (YDB#30)

Test
-----
* New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix.
* E_ALL run many times to ensure no regressions.

README
-------
When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also
causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c).

    517                 if (gtmsource_filter & EXTERNAL_FILTER)
    518                         repl_stop_filter();

While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown
signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be
invoked more than once.  That is because it sends a message to the other side of the filter (the receiver side)
to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including
the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to
send a message to the other side because "extract_buff" is NULL. Below is the stack trace.

(gdb) where
    at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374
    at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660

(gdb) where
 #0  0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85
 #1  0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80
 #2  0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323
 YottaDB#3  0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333
 YottaDB#4  0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580)
    at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374
 YottaDB#5  <signal handler called>
 YottaDB#6  0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1)
    at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660
 YottaDB#7  0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128
 YottaDB#8  0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136
 YottaDB#9  0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148
 YottaDB#10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520
 YottaDB#11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123

The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter().
Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED
was introduced to take care of this.

Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various
places and might have a similar issue. So all those callers too were fixed to use this new macro. But they
manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").
nars1 referenced this issue in nars1/YottaDB Sep 25, 2017
…h an external filter followed by a deactivate and shutdown

Release Note
-------------
The replication source server terminates normally in case it was started with an external filter and later
transitioned from an active to a passive source server before being asked to shut down. Previously, the source
server would terminate abnormally with a SIG-11. (YDB#30)

Test
-----
* New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix.
* E_ALL run many times to ensure no regressions.

README
-------
When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also
causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c).

    517                 if (gtmsource_filter & EXTERNAL_FILTER)
    518                         repl_stop_filter();

While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown
signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be
invoked more than once.  That is because it sends a message to the other side of the filter (the receiver side)
to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including
the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to
send a message to the other side because "extract_buff" is NULL. Below is the stack trace.

(gdb) where
 #0  0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85
 #1  0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80
 #2  0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323
 YottaDB#3  0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333
 YottaDB#4  0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580)
    at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374
 YottaDB#5  <signal handler called>
 YottaDB#6  0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1)
    at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660
 YottaDB#7  0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128
 YottaDB#8  0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136
 YottaDB#9  0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148
 YottaDB#10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520
 YottaDB#11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123

The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter().
Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED
was introduced to take care of this.

Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various
places and might have a similar issue. So all those callers too were fixed to use this new macro. But they
manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").
nars1 referenced this issue in nars1/YottaDB Sep 27, 2017
…h an external filter followed by a deactivate and shutdown

Release Note
-------------
The replication source server terminates normally in case it was started with an external filter and later
transitioned from an active to a passive source server before being asked to shut down. Previously, the source
server would terminate abnormally with a SIG-11. (YDB#30)

Test
-----
* New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix.
* E_ALL run many times to ensure no regressions.

README
-------
When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also
causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c).

    517                 if (gtmsource_filter & EXTERNAL_FILTER)
    518                         repl_stop_filter();

While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown
signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be
invoked more than once.  That is because it sends a message to the other side of the filter (the receiver side)
to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including
the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to
send a message to the other side because "extract_buff" is NULL. Below is the stack trace.

(gdb) where
 #0  0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85
 #1  0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80
 #2  0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323
 YottaDB#3  0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333
 YottaDB#4  0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580)
    at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374
 YottaDB#5  <signal handler called>
 YottaDB#6  0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1)
    at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660
 YottaDB#7  0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128
 YottaDB#8  0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136
 YottaDB#9  0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148
 YottaDB#10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520
 YottaDB#11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123

The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter().
Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED
was introduced to take care of this.

Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various
places and might have a similar issue. So all those callers too were fixed to use this new macro. But they
manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").
nars1 added a commit that referenced this issue Sep 27, 2017
…h an external filter followed by a deactivate and shutdown

Release Note
-------------
The replication source server terminates normally in case it was started with an external filter and later
transitioned from an active to a passive source server before being asked to shut down. Previously, the source
server would terminate abnormally with a SIG-11. (YDB#30)

Test
-----
* New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix.
* E_ALL run many times to ensure no regressions.

README
-------
When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also
causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c).

    517                 if (gtmsource_filter & EXTERNAL_FILTER)
    518                         repl_stop_filter();

While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown
signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be
invoked more than once.  That is because it sends a message to the other side of the filter (the receiver side)
to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including
the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to
send a message to the other side because "extract_buff" is NULL. Below is the stack trace.

(gdb) where
 #0  0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85
 #1  0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80
 #2  0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323
 #3  0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333
 #4  0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580)
    at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374
 #5  <signal handler called>
 #6  0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1)
    at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660
 #7  0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128
 #8  0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136
 #9  0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148
 #10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520
 #11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123

The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter().
Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED
was introduced to take care of this.

Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various
places and might have a similar issue. So all those callers too were fixed to use this new macro. But they
manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").
estess added a commit to estess/YottaDB that referenced this issue Oct 12, 2017
…nperf)

Files changes in this commit:

- sr_i386/g_msf.si
  1. Remove SFF_CI flag and resequence flags

- sr_port/alias_funcs.c
  1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI
     frame no longer being there.

  A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it
  has to do with the elimination of the GTM$CI frame. The frame chain used to look like this:

    CI-base-frame <- GTM$CI <- M routine

  Now it looks like:

    CI-base-frame <- M routine

  In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame.
  In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame.
  The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to
  back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS
  the base frame.

- sr_port/dollar_zlevel.c
  1. Spruce up some comments to make more sense.

- sr_port/f_text.c
  1. Remove GTM$CI from the list of supressed values (no longer exists).

- sr_port/fgncal.h
  1. Remove declaration for fgncal_lookup.c (no longer used).

- sr_port/fgncal_lookup.c
  1. Deleted - no longer used

- sr_port/gbldefs.c
  1. Remove param_list global (no longer used).

- sr_port/mdb_condition_handler.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.
  4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0
     instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1)
     and also the mumps -run base frame.

- sr_port/mdef.h
  1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of
    requiring it to be a compiler option.

- sr_port/mprof_funcs.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. When running the stack, skip over both TRIGGER and CALLIN base frames.

- sr_port/op_bindparm.c
  1. Minor cosmetic changes and comment clarifications.

- sr_port/op_clralsvals.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer
     need to unwind one more frame to the base frame as the flagged frame IS the base frame).

- sr_port/op_halt.c
  2. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/op_unwind.c
  2. Add include of gtmio.h for debugging macro(s).

- sr_port/op_zg1.c
  2. If call-in mode, restart the frame we unwind to. This returns to the caller.

- sr_port/op_zgoto.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.

- sr_port/op_zhalt.c
  1. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/parm_pool.c
  1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in
     two forms:
     a. Form that gets its arguments from the varargs parameter list.
     b. Form that gets its arguments from a parmblk_struct.

- sr_port/stack_frame.h
  1. Remove SFF_CI flag and resequence flags to fill the gap.
  2. Add SFT_CI type id.
  3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames.

- sr_port/tp_unwind.c
  1.  Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too.

- sr_port/unw_mv_ent.c
  1. Change debugging output to be more clear which path is being taken.

- sr_port/zlput_rname.c
  1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer
     frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the
     M stack so it is more effectively utilized.

- sr_port/zshow_stack.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Add marker for when call-in base frame shows up in stack.

- sr_unix/ci_ret_code.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed.  The ci_ret_code()
     routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind
     procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in
     base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp()
     which we are trying to avoid. That condition is now special cased in ZGOTO.
  3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments
     added to better describe the dependencies

- sr_unix/error_return.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Don't let CALL-IN mode do an EXIT here - force return to caller.

- sr_unix/fgncalsp.h
  1. Get rid of no longer needed pointer fields (now passed directly)

- sr_unix/gtm_startup.c
  1. Add comments on odd happenings.

- sr_unix/gtm_trigger.c
  1. Change an M stack running loop to work without GTM$CI frame.

- sr_unix/gtm_unlink_all.c
  1. Remove check for GTM$CI executable.

- sr_unix/gtmci.c
  1. Removed use of longjmp() to effect return from call-ins.
  2. Change SFF_CI flag usage to SFT_CI type usage.
  3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters
     appropriately for call-in to op_bindparm in generated M code.
  4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call
     push_parm to set up called-routine parameters ourselves.

- sr_unix/gtmci.h
  1. Removed GTM$CI as an intermediate call-in frame.
  2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame.

- sr_unix/gtmci_isv.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Some comment cleanup.

- sr_unix/gtmci_signals.c
  1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere.

- sr_unix/invocation_mode.h
  1. Remove MUMPS_GTMCI_OFF macro as unused.
  2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set).

- sr_unix/jobchild_init.c
  1. Don't create GTM$CI anymore (make_cimode() not needed and removed).
  2. Create single base CI frame instead of base + GTM$CI frame.

- sr_unix/make_cimode.c
  1. Removed - no longer needed.

- sr_unix/make_mode.c
  1. Removed all capability to create GTM$CI as this "program" is no longer needed.
  2. Removed all #ifdef of __ia64, __hpux, and _AIX.

- sr_unix/ojchildparms.c
  1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't
     matter what type of frame this is so long as there is one that can be used to return.

- sr_unix/op_fnfgncal.c
  1. Change SFF_CI flag usage to SFT_CI type usage.

- sr_unix/relinkctl.c
  1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame
     now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame).

- sr_unix/rtnhdr.h
  1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a
     base frame doesn't screw things up and cause that NULL to be dereferenced.

- sr_x86_64/ci_restart.s
  1. Remove this parameter massaging routine as being unnecessary anymore.

- sr_x86_64/g_msf.si
  1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble).

[9/2017 SEE]

Files changes in this (squashed) commit:

sr_port/op_fnview.c
  1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER".
     Multiple comma separated tokens may be returned.

sr_port/viewtab.h
  1. Add "ENVIRONMENT" option.
  2. Remove non-UNIX environment #ifdef code.

Files changes in this (squashed) commit:

- sr_linux/release_name.h - Change to R1.10

Files changes in this (squashed) commit:

sr_port/alias.h:
  1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not
     have to specify a compiler option.

sr_port/alias_funcs.c:
  1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables
     for when they get changed and there were the LVMON* view commands that help to debug aliases and are active
     only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG
     error since the view commands that followed it in the table were LVMON*. This update changes the name of the
     VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON*
     in the test system due to it requiring a compiler flag to be incorporated so no tests were affected.

sr_port/gbldefs.c:
sr_port/lv_val.h:
sr_port/op_view.h
sr_port/tp_unwind.c
sr_port/viewtab.h
  1. Rename lvmon* for lvamon*.

sr_port/mdef.h
  1. Comment out define for DEBUG_ALIAS

-----------------------------------

This project significantly changes how call-ins works. To properly describe it, we need to first describe the
current (original) operation of callins. The following write-up between the ------ lines below describes the
current (r100 and previous versions) call-in operation:

-----------------------------------

How call-ins work:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine
               is called GTM$CI.
  9.         - gtm_init_env(base_addr, transfer_addr)
 10.           - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
                 of gtm_ret_code() but with rvector set to addr of GTM$CI.
 11.           - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the
                 GTM$CI routine.
 12.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by gtm_init_env()).
     - rvector is for GTM$CI routine.
     - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M).
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.
  2. Executable frame (created by new_stack_frame() as called by gtm_init_env()).
     - rvector pointer is set to GTM$CI routine
     - Frame is flagged as as call-in

gtm_ci[p]() - Perform call-in:
  1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine.
  2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which
     will add a new stack frame and have the call-in parameters set up.
  3. Drives dm_start() to invoke the callin. The actual call path goes like this:
       a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI).
       b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var
          param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it.
       c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are
          calling into is invoked.
       d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where
          mdb_condition_handler was setup in dm_start().
       e. dm_start() then returns to gtm_ci[p] for return value processing.

The old parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms.
  4. When op_extexfun() is driven, parms are shifted around for call to push_parm().
  5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms.
  6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local
     vars in the called routine.

Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its
parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit
in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the
parameter conversion necessary at the time. The new support described next has made some changes that make this extra
conversion step unnecessary.

-----------------------------------

Following (between the *********** lines) is a description of how call-ins work with this new support in place:

***********************************

How the new call-ins functionality works:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
               of gtm_ret_code() but with rvector set to addr of GTM$CI.
  9.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by jobchild_init()).
     - rvector is NULL as this frame is just for return.
     - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame.
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.

gtm_ci[p]() - Perform call-in:
  1. Create new executable frame (new_stack_frame()) for the called M routine to run in.
  2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter.
  3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by
     op_bindparm() which gets called at the top of the called target routine. See parameter flow below.
  4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven.
  5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp().
  6. dm_start() then returns to gtm_ci[p] for return value processing.

The parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used
     for buffering parameters).
  4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty)
     op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in
     the called routine then releases the parameter space for reuse.

***********************************

-----------------------------------

Operational changes with this project in the execution of a call-in:

- No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8).
- Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus
  the GTM$CI frame (step gtm_init YottaDB#9).
- No longer setup GTM$CI as part of gtm_ci[p] (gtm_ci[p] YottaDB#1). Instead, we allocate the frame using new_stack_frame()
  directly (gtm_ci[p]() YottaDB#1).
- After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes
  the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can
  be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in
  the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3).
- The execution flow used to look like:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive GTM$CI "build" routine.
          * op_extexfun()
	  * Drive target M routine
	  * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start).
        * return to gtm_ci[p]
    * Return to C caller
- The flow has similar steps but one less layer and returns without the use of a (longjmp) system call:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive target M routine
	* M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start.
        * return to gtm_ci[p]
      * return to C caller.

-----------------------------------

Notes on the removal of GTM$CI.

- A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked
  a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller.
- A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine"
  was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before
  calling the call-in routine.
- Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but
  was created by the make_mode() routine. Here's the operation:
    a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall
       depending on whether has args or return value). The parm_blk also contains the parameters and other things
       needed to effect a call.
    b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to
       be GTM$CI.
    c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized
       version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the
       input and output parameters for the call-in routine and drives it to create the stack.
    d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame.
    e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return.
- So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that
  made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the
  system call flavor of return is unnecessary. Wanted to avoid that too.
- Two changes allowed us to be rid of the extra GTM$CI overhead:
    1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return
       without being unwound. We make use of that so we can do the simple unwind of call-in levels.
    2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine
       directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them
       back off in the glue routine. This avoided one of the argument restructurings that were happening.
- Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and
  it was moved to the base frame itself which is how triggers also does it.

-----------------------------------

User visible changes in this project (for specification in release note):

1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings.
2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in
   environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine
   is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1.
3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A
   stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text
   "(Call-In Level Entry)" appears in the stack list.
4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used.
   You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first
   call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in
   base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame.
5. Similar issue when gtm_trigger() is verifying a trigger is not in use anywhere on the stack.
6. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't
   find it.
7. $VIEW(ENVIRONMENT) description.
8. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously.

```
estess added a commit to estess/YottaDB that referenced this issue Oct 12, 2017
…nperf)

Files changes in this commit:

- sr_i386/g_msf.si
  1. Remove SFF_CI flag and resequence flags

- sr_port/alias_funcs.c
  1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI
     frame no longer being there.

  A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it
  has to do with the elimination of the GTM$CI frame. The frame chain used to look like this:

    CI-base-frame <- GTM$CI <- M routine

  Now it looks like:

    CI-base-frame <- M routine

  In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame.
  In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame.
  The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to
  back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS
  the base frame.

- sr_port/dollar_zlevel.c
  1. Spruce up some comments to make more sense.

- sr_port/f_text.c
  1. Remove GTM$CI from the list of supressed values (no longer exists).

- sr_port/fgncal.h
  1. Remove declaration for fgncal_lookup.c (no longer used).

- sr_port/fgncal_lookup.c
  1. Deleted - no longer used

- sr_port/gbldefs.c
  1. Remove param_list global (no longer used).

- sr_port/mdb_condition_handler.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.
  4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0
     instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1)
     and also the mumps -run base frame.

- sr_port/mdef.h
  1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of
    requiring it to be a compiler option.

- sr_port/mprof_funcs.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. When running the stack, skip over both TRIGGER and CALLIN base frames.

- sr_port/op_bindparm.c
  1. Minor cosmetic changes and comment clarifications.

- sr_port/op_clralsvals.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer
     need to unwind one more frame to the base frame as the flagged frame IS the base frame).

- sr_port/op_halt.c
  2. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/op_unwind.c
  2. Add include of gtmio.h for debugging macro(s).

- sr_port/op_zg1.c
  2. If call-in mode, restart the frame we unwind to. This returns to the caller.

- sr_port/op_zgoto.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.

- sr_port/op_zhalt.c
  1. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/parm_pool.c
  1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in
     two forms:
     a. Form that gets its arguments from the varargs parameter list.
     b. Form that gets its arguments from a parmblk_struct.

- sr_port/stack_frame.h
  1. Remove SFF_CI flag and resequence flags to fill the gap.
  2. Add SFT_CI type id.
  3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames.

- sr_port/tp_unwind.c
  1.  Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too.

- sr_port/unw_mv_ent.c
  1. Change debugging output to be more clear which path is being taken.

- sr_port/zlput_rname.c
  1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer
     frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the
     M stack so it is more effectively utilized.

- sr_port/zshow_stack.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Add marker for when call-in base frame shows up in stack.

- sr_unix/ci_ret_code.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed.  The ci_ret_code()
     routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind
     procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in
     base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp()
     which we are trying to avoid. That condition is now special cased in ZGOTO.
  3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments
     added to better describe the dependencies

- sr_unix/error_return.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Don't let CALL-IN mode do an EXIT here - force return to caller.

- sr_unix/fgncalsp.h
  1. Get rid of no longer needed pointer fields (now passed directly)

- sr_unix/gtm_startup.c
  1. Add comments on odd happenings.

- sr_unix/gtm_trigger.c
  1. Change an M stack running loop to work without GTM$CI frame.

- sr_unix/gtm_unlink_all.c
  1. Remove check for GTM$CI executable.

- sr_unix/gtmci.c
  1. Removed use of longjmp() to effect return from call-ins.
  2. Change SFF_CI flag usage to SFT_CI type usage.
  3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters
     appropriately for call-in to op_bindparm in generated M code.
  4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call
     push_parm to set up called-routine parameters ourselves.

- sr_unix/gtmci.h
  1. Removed GTM$CI as an intermediate call-in frame.
  2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame.

- sr_unix/gtmci_isv.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Some comment cleanup.

- sr_unix/gtmci_signals.c
  1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere.

- sr_unix/invocation_mode.h
  1. Remove MUMPS_GTMCI_OFF macro as unused.
  2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set).

- sr_unix/jobchild_init.c
  1. Don't create GTM$CI anymore (make_cimode() not needed and removed).
  2. Create single base CI frame instead of base + GTM$CI frame.

- sr_unix/make_cimode.c
  1. Removed - no longer needed.

- sr_unix/make_mode.c
  1. Removed all capability to create GTM$CI as this "program" is no longer needed.
  2. Removed all #ifdef of __ia64, __hpux, and _AIX.

- sr_unix/ojchildparms.c
  1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't
     matter what type of frame this is so long as there is one that can be used to return.

- sr_unix/op_fnfgncal.c
  1. Change SFF_CI flag usage to SFT_CI type usage.

- sr_unix/relinkctl.c
  1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame
     now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame).

- sr_unix/rtnhdr.h
  1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a
     base frame doesn't screw things up and cause that NULL to be dereferenced.

- sr_x86_64/ci_restart.s
  1. Remove this parameter massaging routine as being unnecessary anymore.

- sr_x86_64/g_msf.si
  1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble).

[9/2017 SEE]

Files changes in this (squashed) commit:

sr_port/op_fnview.c
  1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER".
     Multiple comma separated tokens may be returned.

sr_port/viewtab.h
  1. Add "ENVIRONMENT" option.
  2. Remove non-UNIX environment #ifdef code.

Files changes in this (squashed) commit:

- sr_linux/release_name.h - Change to R1.10

Files changes in this (squashed) commit:

sr_port/alias.h:
  1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not
     have to specify a compiler option.

sr_port/alias_funcs.c:
  1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables
     for when they get changed and there were the LVMON* view commands that help to debug aliases and are active
     only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG
     error since the view commands that followed it in the table were LVMON*. This update changes the name of the
     VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON*
     in the test system due to it requiring a compiler flag to be incorporated so no tests were affected.

sr_port/gbldefs.c:
sr_port/lv_val.h:
sr_port/op_view.h
sr_port/tp_unwind.c
sr_port/viewtab.h
  1. Rename lvmon* for lvamon*.

sr_port/mdef.h
  1. Comment out define for DEBUG_ALIAS

-----------------------------------

This project significantly changes how call-ins works. To properly describe it, we need to first describe the
current (original) operation of callins. The following write-up between the ------ lines below describes the
current (r100 and previous versions) call-in operation:

-----------------------------------

How call-ins work:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine
               is called GTM$CI.
  9.         - gtm_init_env(base_addr, transfer_addr)
 10.           - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
                 of gtm_ret_code() but with rvector set to addr of GTM$CI.
 11.           - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the
                 GTM$CI routine.
 12.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by gtm_init_env()).
     - rvector is for GTM$CI routine.
     - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M).
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.
  2. Executable frame (created by new_stack_frame() as called by gtm_init_env()).
     - rvector pointer is set to GTM$CI routine
     - Frame is flagged as as call-in

gtm_ci[p]() - Perform call-in:
  1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine.
  2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which
     will add a new stack frame and have the call-in parameters set up.
  3. Drives dm_start() to invoke the callin. The actual call path goes like this:
       a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI).
       b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var
          param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it.
       c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are
          calling into is invoked.
       d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where
          mdb_condition_handler was setup in dm_start().
       e. dm_start() then returns to gtm_ci[p] for return value processing.

The old parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms.
  4. When op_extexfun() is driven, parms are shifted around for call to push_parm().
  5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms.
  6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local
     vars in the called routine.

Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its
parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit
in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the
parameter conversion necessary at the time. The new support described next has made some changes that make this extra
conversion step unnecessary.

-----------------------------------

Following (between the *********** lines) is a description of how call-ins work with this new support in place:

***********************************

How the new call-ins functionality works:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
               of gtm_ret_code() but with rvector set to addr of GTM$CI.
  9.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by jobchild_init()).
     - rvector is NULL as this frame is just for return.
     - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame.
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.

gtm_ci[p]() - Perform call-in:
  1. Create new executable frame (new_stack_frame()) for the called M routine to run in.
  2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter.
  3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by
     op_bindparm() which gets called at the top of the called target routine. See parameter flow below.
  4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven.
  5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp().
  6. dm_start() then returns to gtm_ci[p] for return value processing.

The parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used
     for buffering parameters).
  4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty)
     op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in
     the called routine then releases the parameter space for reuse.

***********************************

-----------------------------------

Operational changes with this project in the execution of a call-in (note steps are from original code):

- No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8).
- Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus
  the GTM$CI frame (step gtm_init YottaDB#9).
- No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called
  directly (gtm_ci[p]() YottaDB#1).
- After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes
  the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can
  be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in
  the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3).
- The execution flow used to look like:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive GTM$CI "build" routine.
          * op_extexfun()
	  * Drive target M routine
	  * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start).
        * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there.
    * Return to C caller
- The flow has similar steps but one less layer and returns without the use of a (longjmp) system call:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive target M routine
	* M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start.
        * return to gtm_ci[p]
      * return to C caller.

-----------------------------------

Notes on the removal of GTM$CI.

- A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked
  a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller.
- A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine"
  was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before
  calling the call-in routine.
- Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but
  was created by the make_mode() routine. Here's the operation:
    a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall
       depending on whether has args or return value). The parm_blk also contains the parameters and other things
       needed to effect a call.
    b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to
       be GTM$CI.
    c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized
       version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the
       input and output parameters for the call-in routine and drives it to create the stack.
    d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame.
    e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return.
- So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that
  made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the
  system call flavor of return is unnecessary. Wanted to avoid that too.
- Two changes allowed us to be rid of the extra GTM$CI overhead:
    1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return
       without being unwound. We make use of that so we can do the simple unwind of call-in levels.
    2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine
       directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them
       back off in the glue routine. This avoided one of the argument restructurings that were happening.
- Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and
  it was moved to the base frame itself which is how triggers also does it.

-----------------------------------

User visible changes in this project (for specification in release note):

1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings.
2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in
   environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine
   is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1.
3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A
   stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text
   "(Call-In Level Entry)" appears in the stack list.
4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used.
   You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first
   call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in
   base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame.
5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't
   find it.
6. $VIEW(ENVIRONMENT) description.
7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously.

```
estess added a commit to estess/YottaDB that referenced this issue Oct 12, 2017
```
Files changes in this commit:

- sr_i386/g_msf.si
  1. Remove SFF_CI flag and resequence flags

- sr_port/alias_funcs.c
  1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI
     frame no longer being there.

  A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it
  has to do with the elimination of the GTM$CI frame. The frame chain used to look like this:

    CI-base-frame <- GTM$CI <- M routine

  Now it looks like:

    CI-base-frame <- M routine

  In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame.
  In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame.
  The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to
  back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS
  the base frame.

- sr_port/dollar_zlevel.c
  1. Spruce up some comments to make more sense.

- sr_port/f_text.c
  1. Remove GTM$CI from the list of supressed values (no longer exists).

- sr_port/fgncal.h
  1. Remove declaration for fgncal_lookup.c (no longer used).

- sr_port/fgncal_lookup.c
  1. Deleted - no longer used

- sr_port/gbldefs.c
  1. Remove param_list global (no longer used).

- sr_port/mdb_condition_handler.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.
  4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0
     instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1)
     and also the mumps -run base frame.

- sr_port/mdef.h
  1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of
    requiring it to be a compiler option.

- sr_port/mprof_funcs.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. When running the stack, skip over both TRIGGER and CALLIN base frames.

- sr_port/op_bindparm.c
  1. Minor cosmetic changes and comment clarifications.

- sr_port/op_clralsvals.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer
     need to unwind one more frame to the base frame as the flagged frame IS the base frame).

- sr_port/op_halt.c
  2. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/op_unwind.c
  2. Add include of gtmio.h for debugging macro(s).

- sr_port/op_zg1.c
  2. If call-in mode, restart the frame we unwind to. This returns to the caller.

- sr_port/op_zgoto.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.

- sr_port/op_zhalt.c
  1. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/parm_pool.c
  1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in
     two forms:
     a. Form that gets its arguments from the varargs parameter list.
     b. Form that gets its arguments from a parmblk_struct.

- sr_port/stack_frame.h
  1. Remove SFF_CI flag and resequence flags to fill the gap.
  2. Add SFT_CI type id.
  3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames.

- sr_port/tp_unwind.c
  1.  Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too.

- sr_port/unw_mv_ent.c
  1. Change debugging output to be more clear which path is being taken.

- sr_port/zlput_rname.c
  1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer
     frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the
     M stack so it is more effectively utilized.

- sr_port/zshow_stack.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Add marker for when call-in base frame shows up in stack.

- sr_unix/ci_ret_code.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed.  The ci_ret_code()
     routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind
     procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in
     base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp()
     which we are trying to avoid. That condition is now special cased in ZGOTO.
  3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments
     added to better describe the dependencies

- sr_unix/error_return.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Don't let CALL-IN mode do an EXIT here - force return to caller.

- sr_unix/fgncalsp.h
  1. Get rid of no longer needed pointer fields (now passed directly)

- sr_unix/gtm_startup.c
  1. Add comments on odd happenings.

- sr_unix/gtm_trigger.c
  1. Change an M stack running loop to work without GTM$CI frame.

- sr_unix/gtm_unlink_all.c
  1. Remove check for GTM$CI executable.

- sr_unix/gtmci.c
  1. Removed use of longjmp() to effect return from call-ins.
  2. Change SFF_CI flag usage to SFT_CI type usage.
  3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters
     appropriately for call-in to op_bindparm in generated M code.
  4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call
     push_parm to set up called-routine parameters ourselves.

- sr_unix/gtmci.h
  1. Removed GTM$CI as an intermediate call-in frame.
  2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame.

- sr_unix/gtmci_isv.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Some comment cleanup.

- sr_unix/gtmci_signals.c
  1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere.

- sr_unix/invocation_mode.h
  1. Remove MUMPS_GTMCI_OFF macro as unused.
  2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set).

- sr_unix/jobchild_init.c
  1. Don't create GTM$CI anymore (make_cimode() not needed and removed).
  2. Create single base CI frame instead of base + GTM$CI frame.

- sr_unix/make_cimode.c
  1. Removed - no longer needed.

- sr_unix/make_mode.c
  1. Removed all capability to create GTM$CI as this "program" is no longer needed.
  2. Removed all #ifdef of __ia64, __hpux, and _AIX.

- sr_unix/ojchildparms.c
  1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't
     matter what type of frame this is so long as there is one that can be used to return.

- sr_unix/op_fnfgncal.c
  1. Change SFF_CI flag usage to SFT_CI type usage.

- sr_unix/relinkctl.c
  1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame
     now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame).

- sr_unix/rtnhdr.h
  1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a
     base frame doesn't screw things up and cause that NULL to be dereferenced.

- sr_x86_64/ci_restart.s
  1. Remove this parameter massaging routine as being unnecessary anymore.

- sr_x86_64/g_msf.si
  1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble).

[9/2017 SEE]

Files changes in this (squashed) commit:

sr_port/op_fnview.c
  1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER".
     Multiple comma separated tokens may be returned.

sr_port/viewtab.h
  1. Add "ENVIRONMENT" option.
  2. Remove non-UNIX environment #ifdef code.

Files changes in this (squashed) commit:

- sr_linux/release_name.h - Change to R1.10

Files changes in this (squashed) commit:

sr_port/alias.h:
  1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not
     have to specify a compiler option.

sr_port/alias_funcs.c:
  1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables
     for when they get changed and there were the LVMON* view commands that help to debug aliases and are active
     only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG
     error since the view commands that followed it in the table were LVMON*. This update changes the name of the
     VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON*
     in the test system due to it requiring a compiler flag to be incorporated so no tests were affected.

sr_port/gbldefs.c:
sr_port/lv_val.h:
sr_port/op_view.h
sr_port/tp_unwind.c
sr_port/viewtab.h
  1. Rename lvmon* for lvamon*.

sr_port/mdef.h
  1. Comment out define for DEBUG_ALIAS

-----------------------------------

This project significantly changes how call-ins works. To properly describe it, we need to first describe the
current (original) operation of callins. The following write-up between the ------ lines below describes the
current (r100 and previous versions) call-in operation:

-----------------------------------

How call-ins work:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine
               is called GTM$CI.
  9.         - gtm_init_env(base_addr, transfer_addr)
 10.           - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
                 of gtm_ret_code() but with rvector set to addr of GTM$CI.
 11.           - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the
                 GTM$CI routine.
 12.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by gtm_init_env()).
     - rvector is for GTM$CI routine.
     - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M).
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.
  2. Executable frame (created by new_stack_frame() as called by gtm_init_env()).
     - rvector pointer is set to GTM$CI routine
     - Frame is flagged as as call-in

gtm_ci[p]() - Perform call-in:
  1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine.
  2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which
     will add a new stack frame and have the call-in parameters set up.
  3. Drives dm_start() to invoke the callin. The actual call path goes like this:
       a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI).
       b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var
          param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it.
       c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are
          calling into is invoked.
       d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where
          mdb_condition_handler was setup in dm_start().
       e. dm_start() then returns to gtm_ci[p] for return value processing.

The old parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms.
  4. When op_extexfun() is driven, parms are shifted around for call to push_parm().
  5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms.
  6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local
     vars in the called routine.

Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its
parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit
in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the
parameter conversion necessary at the time. The new support described next has made some changes that make this extra
conversion step unnecessary.

-----------------------------------

Following (between the *********** lines) is a description of how call-ins work with this new support in place:

***********************************

How the new call-ins functionality works:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
               of gtm_ret_code() but with rvector set to addr of GTM$CI.
  9.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by jobchild_init()).
     - rvector is NULL as this frame is just for return.
     - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame.
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.

gtm_ci[p]() - Perform call-in:
  1. Create new executable frame (new_stack_frame()) for the called M routine to run in.
  2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter.
  3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by
     op_bindparm() which gets called at the top of the called target routine. See parameter flow below.
  4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven.
  5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp().
  6. dm_start() then returns to gtm_ci[p] for return value processing.

The parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used
     for buffering parameters).
  4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty)
     op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in
     the called routine then releases the parameter space for reuse.

***********************************

-----------------------------------

Operational changes with this project in the execution of a call-in (note steps are from original code):

- No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8).
- Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus
  the GTM$CI frame (step gtm_init YottaDB#9).
- No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called
  directly (gtm_ci[p]() YottaDB#1).
- After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes
  the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can
  be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in
  the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3).
- The execution flow used to look like:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive GTM$CI "build" routine.
          * op_extexfun()
	  * Drive target M routine
	  * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start).
        * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there.
    * Return to C caller
- The flow has similar steps but one less layer and returns without the use of a (longjmp) system call:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive target M routine
	* M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start.
        * return to gtm_ci[p]
      * return to C caller.

-----------------------------------

Notes on the removal of GTM$CI.

- A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked
  a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller.
- A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine"
  was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before
  calling the call-in routine.
- Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but
  was created by the make_mode() routine. Here's the operation:
    a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall
       depending on whether has args or return value). The parm_blk also contains the parameters and other things
       needed to effect a call.
    b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to
       be GTM$CI.
    c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized
       version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the
       input and output parameters for the call-in routine and drives it to create the stack.
    d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame.
    e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return.
- So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that
  made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the
  system call flavor of return is unnecessary. Wanted to avoid that too.
- Two changes allowed us to be rid of the extra GTM$CI overhead:
    1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return
       without being unwound. We make use of that so we can do the simple unwind of call-in levels.
    2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine
       directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them
       back off in the glue routine. This avoided one of the argument restructurings that were happening.
- Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and
  it was moved to the base frame itself which is how triggers also does it.

-----------------------------------

User visible changes in this project (for specification in release note):

1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings.
2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in
   environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine
   is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1.
3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A
   stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text
   "(Call-In Level Entry)" appears in the stack list.
4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used.
   You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first
   call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in
   base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame.
5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't
   find it.
6. $VIEW(ENVIRONMENT) description.
7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously.

```
estess added a commit to estess/YottaDB that referenced this issue Oct 12, 2017
```
Files changes in this commit:

- sr_i386/g_msf.si
  1. Remove SFF_CI flag and resequence flags

- sr_port/alias_funcs.c
  1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI
     frame no longer being there.

  A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it
  has to do with the elimination of the GTM$CI frame. The frame chain used to look like this:

    CI-base-frame <- GTM$CI <- M routine

  Now it looks like:

    CI-base-frame <- M routine

  In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame.
  In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame.
  The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to
  back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS
  the base frame.

- sr_port/dollar_zlevel.c
  1. Spruce up some comments to make more sense.

- sr_port/f_text.c
  1. Remove GTM$CI from the list of supressed values (no longer exists).

- sr_port/fgncal.h
  1. Remove declaration for fgncal_lookup.c (no longer used).

- sr_port/fgncal_lookup.c
  1. Deleted - no longer used

- sr_port/gbldefs.c
  1. Remove param_list global (no longer used).

- sr_port/mdb_condition_handler.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.
  4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0
     instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1)
     and also the mumps -run base frame.

- sr_port/mdef.h
  1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of
    requiring it to be a compiler option.

- sr_port/mprof_funcs.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. When running the stack, skip over both TRIGGER and CALLIN base frames.

- sr_port/op_bindparm.c
  1. Minor cosmetic changes and comment clarifications.

- sr_port/op_clralsvals.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer
     need to unwind one more frame to the base frame as the flagged frame IS the base frame).

- sr_port/op_halt.c
  2. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/op_unwind.c
  2. Add include of gtmio.h for debugging macro(s).

- sr_port/op_zg1.c
  2. If call-in mode, restart the frame we unwind to. This returns to the caller.

- sr_port/op_zgoto.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.

- sr_port/op_zhalt.c
  1. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/parm_pool.c
  1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in
     two forms:
     a. Form that gets its arguments from the varargs parameter list.
     b. Form that gets its arguments from a parmblk_struct.

- sr_port/stack_frame.h
  1. Remove SFF_CI flag and resequence flags to fill the gap.
  2. Add SFT_CI type id.
  3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames.

- sr_port/tp_unwind.c
  1.  Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too.

- sr_port/unw_mv_ent.c
  1. Change debugging output to be more clear which path is being taken.

- sr_port/zlput_rname.c
  1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer
     frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the
     M stack so it is more effectively utilized.

- sr_port/zshow_stack.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Add marker for when call-in base frame shows up in stack.

- sr_unix/ci_ret_code.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed.  The ci_ret_code()
     routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind
     procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in
     base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp()
     which we are trying to avoid. That condition is now special cased in ZGOTO.
  3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments
     added to better describe the dependencies

- sr_unix/error_return.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Don't let CALL-IN mode do an EXIT here - force return to caller.

- sr_unix/fgncalsp.h
  1. Get rid of no longer needed pointer fields (now passed directly)

- sr_unix/gtm_startup.c
  1. Add comments on odd happenings.

- sr_unix/gtm_trigger.c
  1. Change an M stack running loop to work without GTM$CI frame.

- sr_unix/gtm_unlink_all.c
  1. Remove check for GTM$CI executable.

- sr_unix/gtmci.c
  1. Removed use of longjmp() to effect return from call-ins.
  2. Change SFF_CI flag usage to SFT_CI type usage.
  3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters
     appropriately for call-in to op_bindparm in generated M code.
  4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call
     push_parm to set up called-routine parameters ourselves.

- sr_unix/gtmci.h
  1. Removed GTM$CI as an intermediate call-in frame.
  2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame.

- sr_unix/gtmci_isv.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Some comment cleanup.

- sr_unix/gtmci_signals.c
  1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere.

- sr_unix/invocation_mode.h
  1. Remove MUMPS_GTMCI_OFF macro as unused.
  2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set).

- sr_unix/jobchild_init.c
  1. Don't create GTM$CI anymore (make_cimode() not needed and removed).
  2. Create single base CI frame instead of base + GTM$CI frame.

- sr_unix/make_cimode.c
  1. Removed - no longer needed.

- sr_unix/make_mode.c
  1. Removed all capability to create GTM$CI as this "program" is no longer needed.
  2. Removed all #ifdef of __ia64, __hpux, and _AIX.

- sr_unix/ojchildparms.c
  1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't
     matter what type of frame this is so long as there is one that can be used to return.

- sr_unix/op_fnfgncal.c
  1. Change SFF_CI flag usage to SFT_CI type usage.

- sr_unix/relinkctl.c
  1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame
     now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame).

- sr_unix/rtnhdr.h
  1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a
     base frame doesn't screw things up and cause that NULL to be dereferenced.

- sr_x86_64/ci_restart.s
  1. Remove this parameter massaging routine as being unnecessary anymore.

- sr_x86_64/g_msf.si
  1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble).

sr_port/op_fnview.c
  1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER".
     Multiple comma separated tokens may be returned.

sr_port/viewtab.h
  1. Add "ENVIRONMENT" option.
  2. Remove non-UNIX environment #ifdef code.

- sr_linux/release_name.h - Change to R1.10

sr_port/alias.h:
  1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not
     have to specify a compiler option.

sr_port/alias_funcs.c:
  1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables
     for when they get changed and there were the LVMON* view commands that help to debug aliases and are active
     only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG
     error since the view commands that followed it in the table were LVMON*. This update changes the name of the
     VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON*
     in the test system due to it requiring a compiler flag to be incorporated so no tests were affected.

sr_port/gbldefs.c:
sr_port/lv_val.h:
sr_port/op_view.h
sr_port/tp_unwind.c
sr_port/viewtab.h
  1. Rename lvmon* for lvamon*.

sr_port/mdef.h
  1. Comment out define for DEBUG_ALIAS

-----------------------------------

This project significantly changes how call-ins works. To properly describe it, we need to first describe the
current (original) operation of callins. The following write-up between the ------ lines below describes the
current (r100 and previous versions) call-in operation:

-----------------------------------

How call-ins work:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine
               is called GTM$CI.
  9.         - gtm_init_env(base_addr, transfer_addr)
 10.           - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
                 of gtm_ret_code() but with rvector set to addr of GTM$CI.
 11.           - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the
                 GTM$CI routine.
 12.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by gtm_init_env()).
     - rvector is for GTM$CI routine.
     - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M).
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.
  2. Executable frame (created by new_stack_frame() as called by gtm_init_env()).
     - rvector pointer is set to GTM$CI routine
     - Frame is flagged as as call-in

gtm_ci[p]() - Perform call-in:
  1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine.
  2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which
     will add a new stack frame and have the call-in parameters set up.
  3. Drives dm_start() to invoke the callin. The actual call path goes like this:
       a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI).
       b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var
          param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it.
       c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are
          calling into is invoked.
       d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where
          mdb_condition_handler was setup in dm_start().
       e. dm_start() then returns to gtm_ci[p] for return value processing.

The old parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms.
  4. When op_extexfun() is driven, parms are shifted around for call to push_parm().
  5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms.
  6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local
     vars in the called routine.

Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its
parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit
in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the
parameter conversion necessary at the time. The new support described next has made some changes that make this extra
conversion step unnecessary.

-----------------------------------

Following (between the *********** lines) is a description of how call-ins work with this new support in place:

***********************************

How the new call-ins functionality works:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
               of gtm_ret_code() but with rvector set to addr of GTM$CI.
  9.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by jobchild_init()).
     - rvector is NULL as this frame is just for return.
     - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame.
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.

gtm_ci[p]() - Perform call-in:
  1. Create new executable frame (new_stack_frame()) for the called M routine to run in.
  2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter.
  3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by
     op_bindparm() which gets called at the top of the called target routine. See parameter flow below.
  4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven.
  5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp().
  6. dm_start() then returns to gtm_ci[p] for return value processing.

The parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used
     for buffering parameters).
  4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty)
     op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in
     the called routine then releases the parameter space for reuse.

***********************************

-----------------------------------

Operational changes with this project in the execution of a call-in (note steps are from original code):

- No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8).
- Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus
  the GTM$CI frame (step gtm_init YottaDB#9).
- No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called
  directly (gtm_ci[p]() YottaDB#1).
- After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes
  the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can
  be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in
  the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3).
- The execution flow used to look like:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive GTM$CI "build" routine.
          * op_extexfun()
	  * Drive target M routine
	  * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start).
        * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there.
    * Return to C caller
- The flow has similar steps but one less layer and returns without the use of a (longjmp) system call:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive target M routine
	* M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start.
        * return to gtm_ci[p]
      * return to C caller.

-----------------------------------

Notes on the removal of GTM$CI.

- A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked
  a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller.
- A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine"
  was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before
  calling the call-in routine.
- Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but
  was created by the make_mode() routine. Here's the operation:
    a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall
       depending on whether has args or return value). The parm_blk also contains the parameters and other things
       needed to effect a call.
    b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to
       be GTM$CI.
    c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized
       version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the
       input and output parameters for the call-in routine and drives it to create the stack.
    d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame.
    e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return.
- So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that
  made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the
  system call flavor of return is unnecessary. Wanted to avoid that too.
- Two changes allowed us to be rid of the extra GTM$CI overhead:
    1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return
       without being unwound. We make use of that so we can do the simple unwind of call-in levels.
    2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine
       directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them
       back off in the glue routine. This avoided one of the argument restructurings that were happening.
- Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and
  it was moved to the base frame itself which is how triggers also does it.

-----------------------------------

User visible changes in this project (for specification in release note):

1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings.
2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in
   environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine
   is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1.
3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A
   stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text
   "(Call-In Level Entry)" appears in the stack list.
4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used.
   You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first
   call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in
   base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame.
5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't
   find it.
6. $VIEW(ENVIRONMENT) description.
7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously.

```
estess added a commit that referenced this issue Oct 12, 2017
```
Files changes in this commit:

- sr_i386/g_msf.si
  1. Remove SFF_CI flag and resequence flags

- sr_port/alias_funcs.c
  1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI
     frame no longer being there.

  A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it
  has to do with the elimination of the GTM$CI frame. The frame chain used to look like this:

    CI-base-frame <- GTM$CI <- M routine

  Now it looks like:

    CI-base-frame <- M routine

  In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame.
  In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame.
  The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to
  back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS
  the base frame.

- sr_port/dollar_zlevel.c
  1. Spruce up some comments to make more sense.

- sr_port/f_text.c
  1. Remove GTM$CI from the list of supressed values (no longer exists).

- sr_port/fgncal.h
  1. Remove declaration for fgncal_lookup.c (no longer used).

- sr_port/fgncal_lookup.c
  1. Deleted - no longer used

- sr_port/gbldefs.c
  1. Remove param_list global (no longer used).

- sr_port/mdb_condition_handler.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.
  4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0
     instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1)
     and also the mumps -run base frame.

- sr_port/mdef.h
  1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of
    requiring it to be a compiler option.

- sr_port/mprof_funcs.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. When running the stack, skip over both TRIGGER and CALLIN base frames.

- sr_port/op_bindparm.c
  1. Minor cosmetic changes and comment clarifications.

- sr_port/op_clralsvals.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer
     need to unwind one more frame to the base frame as the flagged frame IS the base frame).

- sr_port/op_halt.c
  2. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/op_unwind.c
  2. Add include of gtmio.h for debugging macro(s).

- sr_port/op_zg1.c
  2. If call-in mode, restart the frame we unwind to. This returns to the caller.

- sr_port/op_zgoto.c
  1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now).
  2. Cleanups - remove VMS conditional code.
  3. Change SFF_CI flag usage to SFT_CI type usage.

- sr_port/op_zhalt.c
  1. If call-in mode, do NOT just exit - make sure to return to the call-in caller.

- sr_port/parm_pool.c
  1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in
     two forms:
     a. Form that gets its arguments from the varargs parameter list.
     b. Form that gets its arguments from a parmblk_struct.

- sr_port/stack_frame.h
  1. Remove SFF_CI flag and resequence flags to fill the gap.
  2. Add SFT_CI type id.
  3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames.

- sr_port/tp_unwind.c
  1.  Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too.

- sr_port/unw_mv_ent.c
  1. Change debugging output to be more clear which path is being taken.

- sr_port/zlput_rname.c
  1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer
     frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the
     M stack so it is more effectively utilized.

- sr_port/zshow_stack.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Add marker for when call-in base frame shows up in stack.

- sr_unix/ci_ret_code.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed.  The ci_ret_code()
     routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind
     procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in
     base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp()
     which we are trying to avoid. That condition is now special cased in ZGOTO.
  3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments
     added to better describe the dependencies

- sr_unix/error_return.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Don't let CALL-IN mode do an EXIT here - force return to caller.

- sr_unix/fgncalsp.h
  1. Get rid of no longer needed pointer fields (now passed directly)

- sr_unix/gtm_startup.c
  1. Add comments on odd happenings.

- sr_unix/gtm_trigger.c
  1. Change an M stack running loop to work without GTM$CI frame.

- sr_unix/gtm_unlink_all.c
  1. Remove check for GTM$CI executable.

- sr_unix/gtmci.c
  1. Removed use of longjmp() to effect return from call-ins.
  2. Change SFF_CI flag usage to SFT_CI type usage.
  3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters
     appropriately for call-in to op_bindparm in generated M code.
  4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call
     push_parm to set up called-routine parameters ourselves.

- sr_unix/gtmci.h
  1. Removed GTM$CI as an intermediate call-in frame.
  2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame.

- sr_unix/gtmci_isv.c
  1. Change SFF_CI flag usage to SFT_CI type usage.
  2. Some comment cleanup.

- sr_unix/gtmci_signals.c
  1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere.

- sr_unix/invocation_mode.h
  1. Remove MUMPS_GTMCI_OFF macro as unused.
  2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set).

- sr_unix/jobchild_init.c
  1. Don't create GTM$CI anymore (make_cimode() not needed and removed).
  2. Create single base CI frame instead of base + GTM$CI frame.

- sr_unix/make_cimode.c
  1. Removed - no longer needed.

- sr_unix/make_mode.c
  1. Removed all capability to create GTM$CI as this "program" is no longer needed.
  2. Removed all #ifdef of __ia64, __hpux, and _AIX.

- sr_unix/ojchildparms.c
  1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't
     matter what type of frame this is so long as there is one that can be used to return.

- sr_unix/op_fnfgncal.c
  1. Change SFF_CI flag usage to SFT_CI type usage.

- sr_unix/relinkctl.c
  1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame
     now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame).

- sr_unix/rtnhdr.h
  1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a
     base frame doesn't screw things up and cause that NULL to be dereferenced.

- sr_x86_64/ci_restart.s
  1. Remove this parameter massaging routine as being unnecessary anymore.

- sr_x86_64/g_msf.si
  1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble).

sr_port/op_fnview.c
  1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER".
     Multiple comma separated tokens may be returned.

sr_port/viewtab.h
  1. Add "ENVIRONMENT" option.
  2. Remove non-UNIX environment #ifdef code.

- sr_linux/release_name.h - Change to R1.10

sr_port/alias.h:
  1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not
     have to specify a compiler option.

sr_port/alias_funcs.c:
  1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables
     for when they get changed and there were the LVMON* view commands that help to debug aliases and are active
     only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG
     error since the view commands that followed it in the table were LVMON*. This update changes the name of the
     VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON*
     in the test system due to it requiring a compiler flag to be incorporated so no tests were affected.

sr_port/gbldefs.c:
sr_port/lv_val.h:
sr_port/op_view.h
sr_port/tp_unwind.c
sr_port/viewtab.h
  1. Rename lvmon* for lvamon*.

sr_port/mdef.h
  1. Comment out define for DEBUG_ALIAS

-----------------------------------

This project significantly changes how call-ins works. To properly describe it, we need to first describe the
current (original) operation of callins. The following write-up between the ------ lines below describes the
current (r100 and previous versions) call-in operation:

-----------------------------------

How call-ins work:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine
               is called GTM$CI.
  9.         - gtm_init_env(base_addr, transfer_addr)
 10.           - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
                 of gtm_ret_code() but with rvector set to addr of GTM$CI.
 11.           - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the
                 GTM$CI routine.
 12.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by gtm_init_env()).
     - rvector is for GTM$CI routine.
     - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M).
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.
  2. Executable frame (created by new_stack_frame() as called by gtm_init_env()).
     - rvector pointer is set to GTM$CI routine
     - Frame is flagged as as call-in

gtm_ci[p]() - Perform call-in:
  1. Executable frame set up in step #11 of gtm_init() above has its execution field modified to point to the GTM$CI routine.
  2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which
     will add a new stack frame and have the call-in parameters set up.
  3. Drives dm_start() to invoke the callin. The actual call path goes like this:
       a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI).
       b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var
          param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it.
       c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are
          calling into is invoked.
       d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where
          mdb_condition_handler was setup in dm_start().
       e. dm_start() then returns to gtm_ci[p] for return value processing.

The old parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms.
  4. When op_extexfun() is driven, parms are shifted around for call to push_parm().
  5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms.
  6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local
     vars in the called routine.

Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its
parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit
in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the
parameter conversion necessary at the time. The new support described next has made some changes that make this extra
conversion step unnecessary.

-----------------------------------

Following (between the *********** lines) is a description of how call-ins work with this new support in place:

***********************************

How the new call-ins functionality works:

gtm_init() - Initializes YDB runtime
  1.   - image_type set to GTM_IMAGE
  2.   - invocation mode set to MUMPS_CALLIN
  3.   - init_gtm()
  4.     - gtm_startup()
  5.       - allocate M stack
  6.       - create (dummy) base frame with return addr of gtm_ret_code()
  7.       - jobchild_init()
  8.         - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr
               of gtm_ret_code() but with rvector set to addr of GTM$CI.
  9.         - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and
               and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when
               there's a ZGOTO 0.

Stack after gtm_init():
  0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails.
     - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there
       an unwind frame pointer hiding behind the frame like most base frames have
     - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this.
     - In practise, we won't see this frame as it isn't used past initialization.
  1. BaseFrame (created by base_frame() as called by jobchild_init()).
     - rvector is NULL as this frame is just for return.
     - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame.
     - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0.

gtm_ci[p]() - Perform call-in:
  1. Create new executable frame (new_stack_frame()) for the called M routine to run in.
  2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter.
  3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by
     op_bindparm() which gets called at the top of the called target routine. See parameter flow below.
  4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven.
  5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp().
  6. dm_start() then returns to gtm_ci[p] for return value processing.

The parameter flow is:
  1. Parms are passed in as part of the call to gtm_ci[p]().
  2. gtm_ci[p]() reformats the parameters into a "parm_blk".
  3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used
     for buffering parameters).
  4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty)
     op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in
     the called routine then releases the parameter space for reuse.

***********************************

-----------------------------------

Operational changes with this project in the execution of a call-in (note steps are from original code):

- No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init #8).
- Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus
  the GTM$CI frame (step gtm_init #9).
- No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called
  directly (gtm_ci[p]() #1).
- After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes
  the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can
  be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in
  the M source code (gtm_ci[p]() #2 and #3).
- The execution flow used to look like:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive GTM$CI "build" routine.
          * op_extexfun()
	  * Drive target M routine
	  * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start).
        * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there.
    * Return to C caller
- The flow has similar steps but one less layer and returns without the use of a (longjmp) system call:
    * C [possibly main] routine
    * call gtm_ci[p]()
      * drive dm_start() to kick off transition to M mode execution
        * drive target M routine
	* M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start.
        * return to gtm_ci[p]
      * return to C caller.

-----------------------------------

Notes on the removal of GTM$CI.

- A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked
  a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller.
- A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine"
  was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before
  calling the call-in routine.
- Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but
  was created by the make_mode() routine. Here's the operation:
    a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall
       depending on whether has args or return value). The parm_blk also contains the parameters and other things
       needed to effect a call.
    b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to
       be GTM$CI.
    c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized
       version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the
       input and output parameters for the call-in routine and drives it to create the stack.
    d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame.
    e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return.
- So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that
  made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the
  system call flavor of return is unnecessary. Wanted to avoid that too.
- Two changes allowed us to be rid of the extra GTM$CI overhead:
    1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return
       without being unwound. We make use of that so we can do the simple unwind of call-in levels.
    2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine
       directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them
       back off in the glue routine. This avoided one of the argument restructurings that were happening.
- Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and
  it was moved to the base frame itself which is how triggers also does it.

-----------------------------------

User visible changes in this project (for specification in release note):

1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings.
2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in
   environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine
   is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1.
3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A
   stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text
   "(Call-In Level Entry)" appears in the stack list.
4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used.
   You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first
   call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in
   base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame.
5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't
   find it.
6. $VIEW(ENVIRONMENT) description.
7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously.

```
nars1 referenced this issue in nars1/YottaDB Oct 24, 2017
…andler in threaded code

```
The v63000/gtm8394 subtest failed an assert with the following stack trace.

 #0  0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6
 #1  0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69
 #2  0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211
 YottaDB#3  0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59
 YottaDB#4  0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153
 YottaDB#5  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85
 YottaDB#6  0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33
 YottaDB#7  0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153
 YottaDB#8  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85
 YottaDB#9  0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074
 YottaDB#10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724
 YottaDB#11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392
 YottaDB#12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436
 YottaDB#13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499
 YottaDB#14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483
 YottaDB#15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064
 YottaDB#16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535
 YottaDB#17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228
 YottaDB#18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0
 YottaDB#19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6

This is a test where a memory-error is forced (using limit vmemorysize). And various rollbacks are run. One of them runs with multiple threads and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit and in turn that signals other threads to exit which is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided an out-of-memory situation implies we will abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). And the UNWIND macro had an assert that we better not be inside multi-threaded code. But that is exactly where we were in this failure.

The reason why the UNWIND has that logic is because in pro it would return control to the erroring thread and let it continue processing but we would not have released the pthread-mutex-lock that we obtained in rts_error_va() for this thread. That means all other threads will not be able to get this lock for various actions they do until the erroring thread tries to obtain the lock again (at which point we would check that we already hold the lock and not try to get the lock again) and later when we release it, other threads will be able to get the thread lock.

The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg).

The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error in some cases could in the worst case transform a multi-threaded recovery to a non-threaded recovery command thereby slowing it down. No other user-visible implications are expected out of this.

```
nars1 added a commit that referenced this issue Oct 24, 2017
…andler in threaded code

```
The v63000/gtm8394 subtest failed an assert with the following stack trace.

 #0  0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6
 #1  0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69
 #2  0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211
 #3  0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59
 #4  0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153
 #5  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85
 #6  0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33
 #7  0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153
 #8  0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85
 #9  0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074
 #10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724
 #11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392
 #12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436
 #13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499
 #14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483
 #15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064
 #16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535
 #17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228
 #18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0
 #19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6

This is a test where a memory-error is forced (using limit vmemorysize). And various rollbacks are run. One of them runs with multiple threads and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit and in turn that signals other threads to exit which is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided an out-of-memory situation implies we will abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). And the UNWIND macro had an assert that we better not be inside multi-threaded code. But that is exactly where we were in this failure.

The reason why the UNWIND has that logic is because in pro it would return control to the erroring thread and let it continue processing but we would not have released the pthread-mutex-lock that we obtained in rts_error_va() for this thread. That means all other threads will not be able to get this lock for various actions they do until the erroring thread tries to obtain the lock again (at which point we would check that we already hold the lock and not try to get the lock again) and later when we release it, other threads will be able to get the thread lock.

The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg).

The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error in some cases could in the worst case transform a multi-threaded recovery to a non-threaded recovery command thereby slowing it down. No other user-visible implications are expected out of this.

```
nars1 referenced this issue in nars1/YottaDB Nov 14, 2017
…me errors as they could cause SIG-11 YottaDB#90

A few issues related to compile-time errors were discovered.

1) The below M program correctly issues a PATNOTFOUND error when compiling. But if one tries to run the compiled object code (which should be okay since the execution does not reach the portions of the M code where the compiler error was found), a SIG-11 is observed. This happens only with GT.M V63002 (and in turn YottaDB r1.10) but not with V63001A (and in turn YottaDB r1.00).

```
> cat test.m
main    ;
        do good
        quit
bad     ;
        if 1?1B
        quit
good    ;
        write "hello",!
        quit
```

Related to the above, the below M program test1.m produces a GTMASSERT when run with a debug build. Unlike the previous test case (test.m), the production build did not have problems with test1.m.

```
> cat test1.m
        if 1?1B

> $gtm_dist/mumps -run test1
%GTM-F-GTMASSERT, GT.M V6.3-002 Linux x86_64 - Assert failed /Distrib/GT.M/V63002/sr_port/chktchain.c line 28
```

The primary issue in both the above tests was in bx_boollit() which noticed a pattern match operator usage with both operands being literals and hence invoked do_patfixed() which encountered a PATNOTFOUND error. That caused ins_errtriple() to be invoked which in turn removed all triples corresponding to the current M line (dqdelchain() call) and returned back to bx_boollit() which did not realize this and went ahead with manipulating the triple chains (dqrins() call etc.) and returned to its caller bool_expr() which also did triple chain manipulation (dqdel() call etc.) all the while operating on triples that were no longer part of the execution chain (due to the prior delqchain() call). This caused a corruption in the doubly-linked triple list in "t_orig" which resulted in incorrect object code being generated that later ended up as the SIG-11 when one tried running this M program.

In GT.M V63002, boolean expression evaluation and literal optimization got a significant rework. As part of that change, the macros RETURN_IF_RTS_ERROR and RETURN_EXPR_IF_RTS_ERROR were introduced to check for compile-time errors and if so return from functions right away instead of manipulating triple chains. These safety checks needed to be added in a few more places. That fixed the primary issue.

2) In addition, it was noticed that the following M program fails an assert when run with the debug build.

```
> cat test2.m
        xecute "if ""a""?1B"

> mumps -run test2
%GTM-F-ASSERT, Assert failed in /Distrib/GT.M/V63002/sr_port/zlcompile.c line 81 for expression ((FALSE == run_time) && (TRUE == TREF(compile_time)))
```

Below is the corresponding C-stack.

```
 #0  0x00007ff2e6988767 in kill () at ../sysdeps/unix/syscall-template.S:84
 #1  0x00007ff2e6014a5c in gtm_dump_core () at /Distrib/GT.M/V63002/sr_unix/gtm_dump_core.c:69
 #2  0x00007ff2e5f1de97 in gtm_fork_n_core () at /Distrib/GT.M/V63002/sr_unix/gtm_fork_n_core.c:211
 YottaDB#3  0x00007ff2e6007f2b in ch_cond_core () at /Distrib/GT.M/V63002/sr_unix/ch_cond_core.c:59
 YottaDB#4  0x00007ff2e5f443a2 in rts_error_va (csa=0x0, argcnt=7, var=0x7ffffc4b0a90) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:153
 YottaDB#5  0x00007ff2e5f439b8 in rts_error_csa (csa=0x0, argcnt=7) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:85
 YottaDB#6  0x00007ff2e636d64c in zlcompile (len=48 '0', addr=0x7ffffc4b0e30 "/extra1/testarea1/nars/test/temp/tmp/tmp/test2.m") at /Distrib/GT.M/V63002/sr_port/zlcompile.c:81
 YottaDB#7  0x00007ff2e60e6f1c in op_zlink (v=0x7ffffc4b14a0, quals=0x7ffffc4b0cf0) at /Distrib/GT.M/V63002/sr_unix/op_zlink.c:443
 YottaDB#8  0x00007ff2e5f6a2d7 in job_addr (rtn=0x7ffffc4b1590, label=0x7ffffc4b15a0, offset=0, hdr=0x7ffffc4b1518, labaddr=0x7ffffc4b1510, need_rtnobj_shm_free=0x7ffffc4b14e4) at /Distrib/GT.M/V63002/sr_port/job_addr.c:41
 YottaDB#9  0x00007ff2e5f40b48 in jobchild_init () at /Distrib/GT.M/V63002/sr_unix/jobchild_init.c:146
 YottaDB#10 0x00007ff2e5f3835d in gtm_startup (svec=0x7ffffc4b1d30) at /Distrib/GT.M/V63002/sr_unix/gtm_startup.c:252
 YottaDB#11 0x00007ff2e5f3b2f6 in init_gtm () at /Distrib/GT.M/V63002/sr_unix/init_gtm.c:201
 YottaDB#12 0x00007ff2e5f072ea in gtm_main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm_main.c:162
 YottaDB#13 0x0000000000400cbe in main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm.c:131
```

In this case, run_time was TRUE and caused the assert failure. Turns out this was due to m_xecute() function (invoked by zlcompile()) temporarily setting run_time to FALSE but when a PATNOTFOUND error was encountered, the condition handler compiler_ch() was invoked which did an UNWIND back to zlcompile() incorrectly persisting the global variable changes done by the interim function call of m_xecute().

The fix for this was to reset the run_time and TREF(xecute_literal_parse) global variables just like is being done in mdb_condition_handler().
nars1 added a commit that referenced this issue Nov 15, 2017
…me errors as they could cause SIG-11 #90

A few issues related to compile-time errors were discovered.

1) The below M program correctly issues a PATNOTFOUND error when compiling. But if one tries to run the compiled object code (which should be okay since the execution does not reach the portions of the M code where the compiler error was found), a SIG-11 is observed. This happens only with GT.M V63002 (and in turn YottaDB r1.10) but not with V63001A (and in turn YottaDB r1.00).

```
> cat test.m
main    ;
        do good
        quit
bad     ;
        if 1?1B
        quit
good    ;
        write "hello",!
        quit
```

Related to the above, the below M program test1.m produces a GTMASSERT when run with a debug build. Unlike the previous test case (test.m), the production build did not have problems with test1.m.

```
> cat test1.m
        if 1?1B

> $gtm_dist/mumps -run test1
%GTM-F-GTMASSERT, GT.M V6.3-002 Linux x86_64 - Assert failed /Distrib/GT.M/V63002/sr_port/chktchain.c line 28
```

The primary issue in both the above tests was in bx_boollit() which noticed a pattern match operator usage with both operands being literals and hence invoked do_patfixed() which encountered a PATNOTFOUND error. That caused ins_errtriple() to be invoked which in turn removed all triples corresponding to the current M line (dqdelchain() call) and returned back to bx_boollit() which did not realize this and went ahead with manipulating the triple chains (dqrins() call etc.) and returned to its caller bool_expr() which also did triple chain manipulation (dqdel() call etc.) all the while operating on triples that were no longer part of the execution chain (due to the prior delqchain() call). This caused a corruption in the doubly-linked triple list in "t_orig" which resulted in incorrect object code being generated that later ended up as the SIG-11 when one tried running this M program.

In GT.M V63002, boolean expression evaluation and literal optimization got a significant rework. As part of that change, the macros RETURN_IF_RTS_ERROR and RETURN_EXPR_IF_RTS_ERROR were introduced to check for compile-time errors and if so return from functions right away instead of manipulating triple chains. These safety checks needed to be added in a few more places. That fixed the primary issue.

2) In addition, it was noticed that the following M program fails an assert when run with the debug build.

```
> cat test2.m
        xecute "if ""a""?1B"

> mumps -run test2
%GTM-F-ASSERT, Assert failed in /Distrib/GT.M/V63002/sr_port/zlcompile.c line 81 for expression ((FALSE == run_time) && (TRUE == TREF(compile_time)))
```

Below is the corresponding C-stack.

```
 #0  0x00007ff2e6988767 in kill () at ../sysdeps/unix/syscall-template.S:84
 #1  0x00007ff2e6014a5c in gtm_dump_core () at /Distrib/GT.M/V63002/sr_unix/gtm_dump_core.c:69
 #2  0x00007ff2e5f1de97 in gtm_fork_n_core () at /Distrib/GT.M/V63002/sr_unix/gtm_fork_n_core.c:211
 #3  0x00007ff2e6007f2b in ch_cond_core () at /Distrib/GT.M/V63002/sr_unix/ch_cond_core.c:59
 #4  0x00007ff2e5f443a2 in rts_error_va (csa=0x0, argcnt=7, var=0x7ffffc4b0a90) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:153
 #5  0x00007ff2e5f439b8 in rts_error_csa (csa=0x0, argcnt=7) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:85
 #6  0x00007ff2e636d64c in zlcompile (len=48 '0', addr=0x7ffffc4b0e30 "/extra1/testarea1/nars/test/temp/tmp/tmp/test2.m") at /Distrib/GT.M/V63002/sr_port/zlcompile.c:81
 #7  0x00007ff2e60e6f1c in op_zlink (v=0x7ffffc4b14a0, quals=0x7ffffc4b0cf0) at /Distrib/GT.M/V63002/sr_unix/op_zlink.c:443
 #8  0x00007ff2e5f6a2d7 in job_addr (rtn=0x7ffffc4b1590, label=0x7ffffc4b15a0, offset=0, hdr=0x7ffffc4b1518, labaddr=0x7ffffc4b1510, need_rtnobj_shm_free=0x7ffffc4b14e4) at /Distrib/GT.M/V63002/sr_port/job_addr.c:41
 #9  0x00007ff2e5f40b48 in jobchild_init () at /Distrib/GT.M/V63002/sr_unix/jobchild_init.c:146
 #10 0x00007ff2e5f3835d in gtm_startup (svec=0x7ffffc4b1d30) at /Distrib/GT.M/V63002/sr_unix/gtm_startup.c:252
 #11 0x00007ff2e5f3b2f6 in init_gtm () at /Distrib/GT.M/V63002/sr_unix/init_gtm.c:201
 #12 0x00007ff2e5f072ea in gtm_main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm_main.c:162
 #13 0x0000000000400cbe in main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm.c:131
```

In this case, run_time was TRUE and caused the assert failure. Turns out this was due to m_xecute() function (invoked by zlcompile()) temporarily setting run_time to FALSE but when a PATNOTFOUND error was encountered, the condition handler compiler_ch() was invoked which did an UNWIND back to zlcompile() incorrectly persisting the global variable changes done by the interim function call of m_xecute().

The fix for this was to reset the run_time and TREF(xecute_literal_parse) global variables just like is being done in mdb_condition_handler().
nars1 added a commit that referenced this issue May 1, 2018
…-11 otherwise)

Below is a test case that demonstrates the SIG-11 TERM1 and TERM2 are two terminals.

TERM1: > rm mumps.gld mumps.dat
TERM1: > setenv ydb_gbldir mumps.gld
TERM1: > $ydb_dist/gde exit
TERM1: > $ydb_dist/mupip create
TERM1: > $ydb_dist/mumps -run x
TERM1: > gdb $ydb_dist/mumps
TERM1: (gdb) b gvcst_expand_prev_key
       Function "gvcst_expand_prev_key" not defined.
       Make breakpoint pending on future shared library load? (y or [n]) y
       Breakpoint 1 (gvcst_expand_prev_key) pending.
TERM1: (gdb) r -run onemore^x
       Starting program: mumps -run onemore^x
       [Thread debugging using libthread_db enabled]
       Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
       Breakpoint 1, gvcst_expand_prev_key (pStat=0x6ca8c8, srch_key=0x62c040, exp_key=0x62b840) at sr_port/gvcst_expand_key.h:40
       40 {

Now switch to TERM2 terminal

TERM2: > setenv ydb_gbldir mumps.gld
TERM2: > mumps -run kill^x

Now switch back to TERM1 terminal

TERM1: (gdb) cont
       Continuing.
       Program received signal SIGSEGV, Segmentation fault.
       0x00007ffff6bf9d55 in gvcst_put2 (val=0x62e860, parms=0x7fffffff6c00) at sr_port/gvcst_put.c:2374
       2374                                    n = (int)((sm_long_t)*cp2 - (sm_long_t)*cp1);
       (gdb) where
       #0  0x00007ffff6bf9d55 in gvcst_put2 (val=0x62e860, parms=0x7fffffff6c00) at sr_port/gvcst_put.c:2374
       #1  0x00007ffff6be5cd7 in gvcst_put (val=0x62e860) at sr_port/gvcst_put.c:300
       #2  0x00007ffff6cfc930 in op_gvput (var=0x62e860) at sr_port/op_gvput.c:74
       #3  0x00007ffff7ff61c8 in ?? ()
       #4  0x00007ffff7373490 in ?? () from libyottadb.so
       #5  0x00007fffffff9860 in ?? ()
       #6  0x0000000000000000 in ?? ()

And you see the SIG-11 without the code fixes in this commit.

> cat x.m
init    ;
        for i=1:1:825 set ^x(i)=i
        quit
kill    ;
        kill ^x(825)
        set i=826,^x(i)=""
        quit
onemore ;
        tstart ():serial
        set ^x(826)=""
        tcommit
        quit
estess added a commit to estess/YottaDB that referenced this issue Aug 13, 2018
Modules changes:

- sr_port/bm_getfree.c - Issues: free_bit (line 216), map_size (line 216), cs1 (line 223),
                         and depth (line 221) are all uninitialized. Additionally,
			 blkhist.cr and blkhist.cycle undefined at line 238.
  - Loop changed so always do one iteration. Will do a new scan and check to see which of
    these issues are fixed by the loop change and which remain.

- sr_port/dse_m_rest.c - Issue: The variable r_top was flagged as uninitialized in line 87.
  - The variable r_top was re-initialized a few lines further down so line 87 was not
    necessary and was removed.
  - Some reformats due to non-standard formatting pre-existing.

- sr_port/dse_shift.c - Issue: lbp undefined in line 99
  - Initialize 'lbp' to NULL earlier so no paths to it are left for it to be undefined.

- sr_port/gtmsource_ctl_init.c - Issue: tmp_ctl resource leak at line 182.
  - Free 'tmp_ctl' when returning or erroring out.

- sr_port/is_canonic_name.c - Issue: Issue was noted via an issue in a different module.
                              In the parse_gv_name_and_subscripts() routine, near the top,
			      seq is set to *subscripts though is never referenced again.
  - Remove variable 'seq' which is not used and may have issues dereferencing 'subscripts'
    when not provided.

- sr_port/mu_int_blk.c - Issue: free_blk_base resource leak going out of scope line 323.
  - Free malloc'd storage (in 'free_blk_base') before return.

- sr_port/mupip_set_journal.c - Issue: Resource leak rewriting gds_info, gds_info not
                                cleaned up.
  - Remove unused gds_info allocation (not plugged into anything) and changed entire clause
    to FILE_CNTL_INIT_IF_NULL() macro to do initialization correctly.

- sr_port/mur_output_show.c - Issue: first_time uninitialized in line 328.
  - Initialize 'first_time' to TRUE so no path to it being uninitialized.

- sr_port/op_fnzsocket.c - Issue: tls_options_mask uninitialized at 598.
  - Move initialization of 'tls_options_mask' up earlier so no paths to uninitialized use.

- sr_port/op_indincr.c - Issue: s is uninitialized at line 101
  - Only do put_tref() if something to put. While this leaves v undefined if there was an
    error, this does not matter because comp_fini() ignores v in that case.

- sr_port/stp_gcol_src.h - Issue: uninitialized value of cstr at line 1063.
  - Make sure cstr is initialized whether expansion succeeds or fails by moving point
    where cstr is set.

- sr_port/tcp_open.c - Issues: 1. resource leak - ai_ptr going out of scope at line 190.
                               2. remote_ai_head uninitialized at line 302.
			       3. calling close (gtm_close) without checking return value at
			          line 355.
			       4. resource leak - remote_ai_head out of scope at line 340.
  - Remove some old Tru64 stuff.
  - Remove unnecessary initialization of remote_ai_ptr.
  - Remove some unused variables.
  - Move hostname validation to client section and remove the conditional parsing of the
    host (required for client) which solves YottaDB#2.
  - Change close() to CLOSEFILE() macros which solves YottaDB#3.
  - Make sure to release linked list at ai_ptr which solves YottaDB#1.
  - Remove assert when erroring with timeout (now tested by online_bkup/online6). This was
    required since the tests in online6 now timeout instead getting GETADDRINFO for invalid
    passive hosts being specified.
  - Make sure release linked list at remote_ai_head (which solves YottaDB#4).

- sr_port/ydberrors.msg
  - Added TCPCONNTIMEOUT message as the other timeout messages didn't quite do what was
    needed. There were a couple util_out_print() timeout message in tcp_open() but it
    really needed to be a real error - especially since it was now being tested in a test
    (online_bkup/online6).

- sr_unix/anticipatory_freeze.c - Issue: Resource leak - handle goes out of scope at line 358.
  - Add an FCLOSE() and error check for same prior to module returns.

- sr_unix/bin_load.c - Issue: Several allocated blocks and buffers need cleanup when leave
  - Implement mechanism to track all of the allocated buffers and free any that are
    allocated if an error occurs or on normal routine return.

- sr_unix/cli_lex.c - Issue: retptr uninitialized at line 488 (return from module).
  - Initialize return value to null so is initialized when nothing is read.

- sr_unix/gtm_getpwuid.c - Originally had changes for an issue but backed those out but left
                           these changes intact anyway.
  - Removed an unneeded return value check from malloc (gtm_malloc does not return if
    no storage).
  - Reformatted an #ifdef block.

- sr_unix/gtmcrypt_entry.c - Issue: Resource leak - handle goes out of scope at line 135.
  - release/close the handle before return

- sr_unix/gtmrecv_end.c - Issue: jnlpool_strm_seqno[idx] uninitialized at line 194.
  - Only do the loop to dump the streams if the journal pool exists.

- sr_unix/gtmsecshr.c - Issue: Resource leak - procstrm goes out of scope at line 1026.
  - Close 'procstrm' before return on error.

- sr_unix/gtmsource_shutdown.c - Issue: maxindex is uninitialized at line 317.
  - Add '!auto_shutdown' to clause as 'maxindex' is only set when !auto_shutdown.

- sr_unix/iosocket_tls.c - Issue: 1. errlen2 is uninitialized at line 416.
                                  2. errlen is unintialized at line 466.
  - Errors aren't using the correct method of getting error out there. The 'errlen'
    var is only seldomly set. Change to use LEN_AND_STR() macro to provide length.

- sr_unix/mu_all_version_standalone.c - Issue: save_errno is unintialized at line 172.
  - Save 'errno' to 'save_errno' to initialize it for the error message.

- sr_unix/mucblkini.c - Issue: Vars bp1, bp2, and bmp were not released before errors.
  - Free these vars appropriately before leaving routine.

- sr_unix/op_zlink.c - Issue: srcnamelen uninitialized line 227.
  - Move initialization of 'srcstr' closer to where used and only for those options
    where srcnamelen is set.

- sr_unix/relinkctl.c - Issue: shm_hdr uninitialized in line 1069
  - Initialize 'shm_hdr' to prevent usage when is still uninitialized.
  - Add error checks and messages for SHMDT invocations.

- sr_unix/ss_anal_shdw_file.c - Issue: bitmap_buffer and bp allocations not freed prior
                                to error returns (lines 118 and 123).
  - Add frees for bitmap_buffer and bp before error returns.

- sr_unix/trigger_source_read_andor_verify.c - Issue: rttabent uninitialized on line 356.
  - Initialize 'rttabent' to NULL
  - If not initialized, set 'rttabent' to our rtn_names list to begin search.

- sr_unix/util_output.c - Issue: Origin of this change is lost. It was from an issue
                          raised for another module and digging down through the calls,
			  we ended up here where a var as uninitialized after this loop
			  because of the lack of a default clause such that 'length'
			  didn't get set IIRC (the chose switch value in the scan's
			  simulation did not match a case).
  - Add default clause to switch with an assert in it. Also added code to keep static
    scan from complaining about 'length'.

- sr_unix/wait_for_disk_space.c - Issue: freeze_cleared uninitialized at line 161
  - Add a default for 'freeze_cleared' so is initialized.

- sr_unix_cm/gtcm_bgn_net.c - Issue: Resource leak - ai_ptr goes out of scope at 135, 148.
  - Release 'ai_ptr' linked list before normal and returns.

- sr_unix_cm/omi_prc_conn.c - Issue: Neither agname or ag_pass allocated memory is
                              released before normal or error return.
  - Define OMI_FREE() macro for better cleanup. Add to any place before error or normal
    return.
  - Initialize 'agname' and 'ag_pass' so know if they are allocated.
  - Remove clauses that test memory allocation result. Note gtm_malloc() does not return
    if allocation fails.

- sr_unix_cm/rc_srvc_xct.c - Issue: Resource leak - elst var goes out of scope at line 198.
  - Added a macro to do the cleanups and added it before all return points.

- sr_unix_gnp/cmi_init.c - Issue: Resource leak - local_ai_ptr out of scope lines
  			   67, 74, 85, 95,122.
  - Some minor formatting changes for code standards.
  - Make sure local_ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/cmi_open.c - Issue: Resource leak - ai_head out of scope line 123.
  - Some minor formatting changes for code standards.
  - Make sure ai_head linked list is freed before leave routine.

- sr_unix_gnp/cmj_get_port.c - Issue: Resource leak - ai_ptr out of scope line 123.
  - Removed - no users of this routine were found.

- sr_unix_gnp/cmj_getsockaddr.c - Issue: Resource leak - ai_ptr out of scope line 152.
  - Make sure ai_ptr linked list is freed before leave routine with this error return.

- sr_unix_gnp/cmu_getclb.c - Issue: Resource leak - ai_ptr out of scope lines 52, 57.
  - Minor changes for coding standards.
  - Make sure ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/gtcm_gnp_server_main.c - Issue: status uninitialized line 287.
  - Change static routines to STATICFN{DCL,DEF} (coding standards)
  - The 'status' variable was not set until after it was referenced. Removed the test
    of the uninitialized status after the fetch.

- sr_unix_gnp/gtcm_open_cmerrlog.c - Issue: Potential to overrun end of lfn_path at
  				     line 80.
  - Change max usable length before truncation to make room for a null terminator.
estess pushed a commit to estess/YottaDB that referenced this issue Aug 14, 2018
Modules changes:

- sr_port/bm_getfree.c - Issues: free_bit (line 216), map_size (line 216), cs1 (line 223),
                         and depth (line 221) are all uninitialized. Additionally,
			 blkhist.cr and blkhist.cycle undefined at line 238.
  - Loop changed so always do one iteration. Will do a new scan and check to see which of
    these issues are fixed by the loop change and which remain.

- sr_port/dse_m_rest.c - Issue: The variable r_top was flagged as uninitialized in line 87.
  - The variable r_top was re-initialized a few lines further down so line 87 was not
    necessary and was removed.
  - Some reformats due to non-standard formatting pre-existing.

- sr_port/dse_shift.c - Issue: lbp undefined in line 99
  - Initialize 'lbp' to NULL earlier so no paths to it are left for it to be undefined.

- sr_port/gtmsource_ctl_init.c - Issue: tmp_ctl resource leak at line 182.
  - Free 'tmp_ctl' when returning or erroring out.

- sr_port/is_canonic_name.c - Issue: Issue was noted via an issue in a different module.
                              In the parse_gv_name_and_subscripts() routine, near the top,
			      seq is set to *subscripts though is never referenced again.
  - Remove variable 'seq' which is not used and may have issues dereferencing 'subscripts'
    when not provided.

- sr_port/mu_int_blk.c - Issue: free_blk_base resource leak going out of scope line 323.
  - Free malloc'd storage (in 'free_blk_base') before return.

- sr_port/mupip_set_journal.c - Issue: Resource leak rewriting gds_info, gds_info not
                                cleaned up.
  - Remove unused gds_info allocation (not plugged into anything) and changed entire clause
    to FILE_CNTL_INIT_IF_NULL() macro to do initialization correctly.

- sr_port/mur_output_show.c - Issue: first_time uninitialized in line 328.
  - Initialize 'first_time' to TRUE so no path to it being uninitialized.

- sr_port/op_fnzsocket.c - Issue: tls_options_mask uninitialized at 598.
  - Move initialization of 'tls_options_mask' up earlier so no paths to uninitialized use.

- sr_port/op_indincr.c - Issue: s is uninitialized at line 101
  - Only do put_tref() if something to put. While this leaves v undefined if there was an
    error, this does not matter because comp_fini() ignores v in that case.

- sr_port/stp_gcol_src.h - Issue: uninitialized value of cstr at line 1063.
  - Make sure cstr is initialized whether expansion succeeds or fails by moving point
    where cstr is set.

- sr_port/tcp_open.c - Issues: 1. resource leak - ai_ptr going out of scope at line 190.
                               2. remote_ai_head uninitialized at line 302.
			       3. calling close (gtm_close) without checking return value at
			          line 355.
			       4. resource leak - remote_ai_head out of scope at line 340.
  - Remove some old Tru64 stuff.
  - Remove unnecessary initialization of remote_ai_ptr.
  - Remove some unused variables.
  - Move hostname validation to client section and remove the conditional parsing of the
    host (required for client) which solves YottaDB#2.
  - Change close() to CLOSEFILE() macros which solves YottaDB#3.
  - Make sure to release linked list at ai_ptr which solves YottaDB#1.
  - Remove assert when erroring with timeout (now tested by online_bkup/online6). This was
    required since the tests in online6 now timeout instead getting GETADDRINFO for invalid
    passive hosts being specified.
  - Make sure release linked list at remote_ai_head (which solves YottaDB#4).

- sr_port/ydberrors.msg
  - Added TCPCONNTIMEOUT message as the other timeout messages didn't quite do what was
    needed. There were a couple util_out_print() timeout message in tcp_open() but it
    really needed to be a real error - especially since it was now being tested in a test
    (online_bkup/online6).

- sr_unix/anticipatory_freeze.c - Issue: Resource leak - handle goes out of scope at line 358.
  - Add an FCLOSE() and error check for same prior to module returns.

- sr_unix/bin_load.c - Issue: Several allocated blocks and buffers need cleanup when leave
  - Implement mechanism to track all of the allocated buffers and free any that are
    allocated if an error occurs or on normal routine return.

- sr_unix/cli_lex.c - Issue: retptr uninitialized at line 488 (return from module).
  - Initialize return value to null so is initialized when nothing is read.

- sr_unix/gtm_getpwuid.c - Originally had changes for an issue but backed those out but left
                           these changes intact anyway.
  - Removed an unneeded return value check from malloc (gtm_malloc does not return if
    no storage).
  - Reformatted an #ifdef block.

- sr_unix/gtmcrypt_entry.c - Issue: Resource leak - handle goes out of scope at line 135.
  - release/close the handle before return

- sr_unix/gtmrecv_end.c - Issue: jnlpool_strm_seqno[idx] uninitialized at line 194.
  - Only do the loop to dump the streams if the journal pool exists.

- sr_unix/gtmsecshr.c - Issue: Resource leak - procstrm goes out of scope at line 1026.
  - Close 'procstrm' before return on error.

- sr_unix/gtmsource_shutdown.c - Issue: maxindex is uninitialized at line 317.
  - Add '!auto_shutdown' to clause as 'maxindex' is only set when !auto_shutdown.

- sr_unix/iosocket_tls.c - Issue: 1. errlen2 is uninitialized at line 416.
                                  2. errlen is unintialized at line 466.
  - Errors aren't using the correct method of getting error out there. The 'errlen'
    var is only seldomly set. Change to use LEN_AND_STR() macro to provide length.

- sr_unix/mu_all_version_standalone.c - Issue: save_errno is unintialized at line 172.
  - Save 'errno' to 'save_errno' to initialize it for the error message.

- sr_unix/mucblkini.c - Issue: Vars bp1, bp2, and bmp were not released before errors.
  - Free these vars appropriately before leaving routine.

- sr_unix/op_zlink.c - Issue: srcnamelen uninitialized line 227.
  - Move initialization of 'srcstr' closer to where used and only for those options
    where srcnamelen is set.

- sr_unix/relinkctl.c - Issue: shm_hdr uninitialized in line 1069
  - Initialize 'shm_hdr' to prevent usage when is still uninitialized.
  - Add error checks and messages for SHMDT invocations.

- sr_unix/ss_anal_shdw_file.c - Issue: bitmap_buffer and bp allocations not freed prior
                                to error returns (lines 118 and 123).
  - Add frees for bitmap_buffer and bp before error returns.

- sr_unix/trigger_source_read_andor_verify.c - Issue: rttabent uninitialized on line 356.
  - Initialize 'rttabent' to NULL
  - If not initialized, set 'rttabent' to our rtn_names list to begin search.

- sr_unix/util_output.c - Issue: Origin of this change is lost. It was from an issue
                          raised for another module and digging down through the calls,
			  we ended up here where a var as uninitialized after this loop
			  because of the lack of a default clause such that 'length'
			  didn't get set IIRC (the chose switch value in the scan's
			  simulation did not match a case).
  - Add default clause to switch with an assert in it. Also added code to keep static
    scan from complaining about 'length'.

- sr_unix/wait_for_disk_space.c - Issue: freeze_cleared uninitialized at line 161
  - Add a default for 'freeze_cleared' so is initialized.

- sr_unix_cm/gtcm_bgn_net.c - Issue: Resource leak - ai_ptr goes out of scope at 135, 148.
  - Release 'ai_ptr' linked list before normal and returns.

- sr_unix_cm/omi_prc_conn.c - Issue: Neither agname or ag_pass allocated memory is
                              released before normal or error return.
  - Define OMI_FREE() macro for better cleanup. Add to any place before error or normal
    return.
  - Initialize 'agname' and 'ag_pass' so know if they are allocated.
  - Remove clauses that test memory allocation result. Note gtm_malloc() does not return
    if allocation fails.

- sr_unix_cm/rc_srvc_xct.c - Issue: Resource leak - elst var goes out of scope at line 198.
  - Added a macro to do the cleanups and added it before all return points.

- sr_unix_gnp/cmi_init.c - Issue: Resource leak - local_ai_ptr out of scope lines
  			   67, 74, 85, 95,122.
  - Some minor formatting changes for code standards.
  - Make sure local_ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/cmi_open.c - Issue: Resource leak - ai_head out of scope line 123.
  - Some minor formatting changes for code standards.
  - Make sure ai_head linked list is freed before leave routine.

- sr_unix_gnp/cmj_get_port.c - Issue: Resource leak - ai_ptr out of scope line 123.
  - Removed - no users of this routine were found.

- sr_unix_gnp/cmj_getsockaddr.c - Issue: Resource leak - ai_ptr out of scope line 152.
  - Make sure ai_ptr linked list is freed before leave routine with this error return.

- sr_unix_gnp/cmu_getclb.c - Issue: Resource leak - ai_ptr out of scope lines 52, 57.
  - Minor changes for coding standards.
  - Make sure ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/gtcm_gnp_server_main.c - Issue: status uninitialized line 287.
  - Change static routines to STATICFN{DCL,DEF} (coding standards)
  - The 'status' variable was not set until after it was referenced. Removed the test
    of the uninitialized status after the fetch.

- sr_unix_gnp/gtcm_open_cmerrlog.c - Issue: Potential to overrun end of lfn_path at
  				     line 80.
  - Change max usable length before truncation to make room for a null terminator.
estess added a commit to estess/YottaDB that referenced this issue Aug 14, 2018
Modules changes:

- sr_port/bm_getfree.c - Issues: free_bit (line 216), map_size (line 216), cs1 (line 223),
                         and depth (line 221) are all uninitialized. Additionally,
			 blkhist.cr and blkhist.cycle undefined at line 238.
  - Loop changed so always do one iteration. This has fixed all of the above issues (no
    longer reported by scan).

- sr_port/dse_m_rest.c - Issue: The variable r_top was flagged as uninitialized in line 87.
  - The variable r_top was re-initialized a few lines further down so line 87 was not
    necessary and was removed.
  - Some reformats due to non-standard formatting pre-existing.

- sr_port/dse_shift.c - Issue: lbp undefined in line 99
  - Initialize 'lbp' to NULL earlier so no paths to it are left for it to be undefined.

- sr_port/gtmsource_ctl_init.c - Issue: tmp_ctl resource leak at line 182.
  - Free 'tmp_ctl' when returning or erroring out.

- sr_port/is_canonic_name.c - Issue: Issue was noted via an issue in a different module.
                              In the parse_gv_name_and_subscripts() routine, near the top,
			      seq is set to *subscripts though is never referenced again.
  - Remove variable 'seq' which is not used and may have issues dereferencing 'subscripts'
    when not provided.

- sr_port/mu_int_blk.c - Issue: free_blk_base resource leak going out of scope line 323.
  - Free malloc'd storage (in 'free_blk_base') before return.

- sr_port/mupip_set_journal.c - Issue: Resource leak rewriting gds_info, gds_info not
                                cleaned up.
  - Remove unused gds_info allocation (not plugged into anything) and changed entire clause
    to FILE_CNTL_INIT_IF_NULL() macro to do initialization correctly.

- sr_port/mur_output_show.c - Issue: first_time uninitialized in line 328.
  - Initialize 'first_time' to TRUE so no path to it being uninitialized.

- sr_port/op_fnzsocket.c - Issue: tls_options_mask uninitialized at 598.
  - Move initialization of 'tls_options_mask' up earlier so no paths to uninitialized use.

- sr_port/op_indincr.c - Issue: s is uninitialized at line 101
  - Only do put_tref() if something to put. While this leaves v undefined if there was an
    error, this does not matter because comp_fini() ignores v in that case.

- sr_port/stp_gcol_src.h - Issue: uninitialized value of cstr at line 1063.
  - Make sure cstr is initialized whether expansion succeeds or fails by moving point
    where cstr is set.

- sr_port/tcp_open.c - Issues: 1. resource leak - ai_ptr going out of scope at line 190.
                               2. remote_ai_head uninitialized at line 302.
			       3. calling close (gtm_close) without checking return value at
			          line 355.
			       4. resource leak - remote_ai_head out of scope at line 340.
  - Remove some old Tru64 stuff.
  - Remove unnecessary initialization of remote_ai_ptr.
  - Remove some unused variables.
  - Move hostname validation to client section and remove the conditional parsing of the
    host (required for client) which solves YottaDB#2.
  - Change close() to CLOSEFILE() macros which solves YottaDB#3.
  - Make sure to release linked list at ai_ptr which solves YottaDB#1.
  - Remove assert when erroring with timeout (now tested by online_bkup/online6). This was
    required since the tests in online6 now timeout instead getting GETADDRINFO for invalid
    passive hosts being specified.
  - Make sure release linked list at remote_ai_head (which solves YottaDB#4).

- sr_port/ydberrors.msg
  - Added TCPCONNTIMEOUT message as the other timeout messages didn't quite do what was
    needed. There were a couple util_out_print() timeout message in tcp_open() but it
    really needed to be a real error - especially since it was now being tested in a test
    (online_bkup/online6).

- sr_unix/anticipatory_freeze.c - Issue: Resource leak - handle goes out of scope at line 358.
  - Add an FCLOSE() and error check for same prior to module returns.

- sr_unix/bin_load.c - Issue: Several allocated blocks and buffers need cleanup when leave
  - Implement mechanism to track all of the allocated buffers and free any that are
    allocated if an error occurs or on normal routine return.

- sr_unix/cli_lex.c - Issue: retptr uninitialized at line 488 (return from module).
  - Initialize return value to null so is initialized when nothing is read.

- sr_unix/gtm_getpwuid.c
  - Removed an unneeded return value check from malloc (gtm_malloc does not return if
    no storage).
  - Reformatted an #ifdef block.

- sr_unix/gtmcrypt_entry.c - Issue: Resource leak - handle goes out of scope at line 135.
  - release/close the handle before return

- sr_unix/gtmrecv_end.c - Issue: jnlpool_strm_seqno[idx] uninitialized at line 194.
  - Only do the loop to dump the streams if the journal pool exists.

- sr_unix/gtmsecshr.c - Issue: Resource leak - procstrm goes out of scope at line 1026.
  - Close 'procstrm' before return on error.

- sr_unix/gtmsource_shutdown.c - Issue: maxindex is uninitialized at line 317.
  - Add '!auto_shutdown' to clause as 'maxindex' is only set when !auto_shutdown.

- sr_unix/iosocket_tls.c - Issue: 1. errlen2 is uninitialized at line 416.
                                  2. errlen is unintialized at line 466.
  - Errors aren't using the correct method of getting error out there. The 'errlen'
    var is only seldomly set. Change to use LEN_AND_STR() macro to provide length.

- sr_unix/mu_all_version_standalone.c - Issue: save_errno is unintialized at line 172.
  - Save 'errno' to 'save_errno' to initialize it for the error message.

- sr_unix/mucblkini.c - Issue: Vars bp1, bp2, and bmp were not released before errors.
  - Free these vars appropriately before leaving routine.

- sr_unix/op_zlink.c - Issue: srcnamelen uninitialized line 227.
  - Move initialization of 'srcstr' closer to where used and only for those options
    where srcnamelen is set.

- sr_unix/relinkctl.c - Issue: shm_hdr uninitialized in line 1069
  - Initialize 'shm_hdr' to prevent usage when is still uninitialized.
  - Add error checks and messages for SHMDT invocations.

- sr_unix/ss_anal_shdw_file.c - Issue: bitmap_buffer and bp allocations not freed prior
                                to error returns (lines 118 and 123).
  - Add frees for bitmap_buffer and bp before error returns.

- sr_unix/trigger_source_read_andor_verify.c - Issue: rttabent uninitialized on line 356.
  - Initialize 'rttabent' to NULL
  - If not initialized, set 'rttabent' to our rtn_names list to begin search.

- sr_unix/util_output.c - Issue: Origin of this change is lost. It was from an issue
                          raised for another module and digging down through the calls,
			  we ended up here where a var as uninitialized after this loop
			  because of the lack of a default clause such that 'length'
			  didn't get set IIRC (the chose switch value in the scan's
			  simulation did not match a case).
  - Add default clause to switch with an assert in it. Also added code to keep static
    scan from complaining about 'length'.

- sr_unix/wait_for_disk_space.c - Issue: freeze_cleared uninitialized at line 161
  - Add a default for 'freeze_cleared' so is initialized.

- sr_unix_cm/gtcm_bgn_net.c - Issue: Resource leak - ai_ptr goes out of scope at 135, 148.
  - Release 'ai_ptr' linked list before normal and returns.

- sr_unix_cm/omi_prc_conn.c - Issue: Neither agname or ag_pass allocated memory is
                              released before normal or error return.
  - Define OMI_FREE() macro for better cleanup. Add to any place before error or normal
    return.
  - Initialize 'agname' and 'ag_pass' so know if they are allocated.
  - Remove clauses that test memory allocation result. Note gtm_malloc() does not return
    if allocation fails.

- sr_unix_cm/rc_srvc_xct.c - Issue: Resource leak - elst var goes out of scope at line 198.
  - Added a macro to do the cleanups and added it before all return points.

- sr_unix_gnp/cmi_init.c - Issue: Resource leak - local_ai_ptr out of scope lines
  			   67, 74, 85, 95,122.
  - Some minor formatting changes for code standards.
  - Make sure local_ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/cmi_open.c - Issue: Resource leak - ai_head out of scope line 123.
  - Some minor formatting changes for code standards.
  - Make sure ai_head linked list is freed before leave routine.

- sr_unix_gnp/cmj_get_port.c - Issue: Resource leak - ai_ptr out of scope line 123.
  - Removed - no users of this routine were found.

- sr_unix_gnp/cmj_getsockaddr.c - Issue: Resource leak - ai_ptr out of scope line 152.
  - Make sure ai_ptr linked list is freed before leave routine with this error return.

- sr_unix_gnp/cmu_getclb.c - Issue: Resource leak - ai_ptr out of scope lines 52, 57.
  - Minor changes for coding standards.
  - Make sure ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/gtcm_gnp_server_main.c - Issue: status uninitialized line 287.
  - Change static routines to STATICFN{DCL,DEF} (coding standards)
  - The 'status' variable was not set until after it was referenced. Removed the test
    of the uninitialized status after the fetch.

- sr_unix_gnp/gtcm_open_cmerrlog.c - Issue: Potential to overrun end of lfn_path at
  				     line 80.
  - Change max usable length before truncation to make room for a null terminator.
estess added a commit that referenced this issue Aug 14, 2018
Modules changes:

- sr_port/bm_getfree.c - Issues: free_bit (line 216), map_size (line 216), cs1 (line 223), and depth (line 221) are all uninitialized. Additionally, blkhist.cr and blkhist.cycle undefined at line 238.
  - Loop changed so always do one iteration. This has fixed all of the above issues (no longer reported by scan).

- sr_port/dse_m_rest.c - Issue: The variable r_top was flagged as uninitialized in line 87.
  - The variable r_top was re-initialized a few lines further down so line 87 was not necessary and was removed.
  - Some reformats due to non-standard formatting pre-existing.

- sr_port/dse_shift.c - Issue: lbp undefined in line 99
  - Initialize 'lbp' to NULL earlier so no paths to it are left for it to be undefined.

- sr_port/gtmsource_ctl_init.c - Issue: tmp_ctl resource leak at line 182.
  - Free 'tmp_ctl' when returning or erroring out.

- sr_port/is_canonic_name.c - Issue: Issue was noted via an issue in a different module. In the parse_gv_name_and_subscripts() routine, near the top, seq is set to *subscripts though is never referenced again.
  - Remove variable 'seq' which is not used and may have issues dereferencing 'subscripts' when not provided.

- sr_port/mu_int_blk.c - Issue: free_blk_base resource leak going out of scope line 323.
  - Free malloc'd storage (in 'free_blk_base') before return.

- sr_port/mupip_set_journal.c - Issue: Resource leak rewriting gds_info, gds_info not cleaned up.
  - Remove unused gds_info allocation (not plugged into anything) and changed entire clause to FILE_CNTL_INIT_IF_NULL() macro to do initialization correctly.

- sr_port/mur_output_show.c - Issue: first_time uninitialized in line 328.
  - Initialize 'first_time' to TRUE so no path to it being uninitialized.

- sr_port/op_fnzsocket.c - Issue: tls_options_mask uninitialized at 598.
  - Move initialization of 'tls_options_mask' up earlier so no paths to uninitialized use.

- sr_port/op_indincr.c - Issue: s is uninitialized at line 101
  - Only do put_tref() if something to put. While this leaves v undefined if there was an error, this does not matter because comp_fini() ignores v in that case.

- sr_port/stp_gcol_src.h - Issue: uninitialized value of cstr at line 1063.
  - Make sure cstr is initialized whether expansion succeeds or fails by moving point where cstr is set.

- sr_port/tcp_open.c - Issues: 1. resource leak - ai_ptr going out of scope at line 190.
                               2. remote_ai_head uninitialized at line 302.
                               3. calling close (gtm_close) without checking return value at line 355.
                               4. resource leak - remote_ai_head out of scope at line 340.
  - Remove some old Tru64 stuff.
  - Remove unnecessary initialization of remote_ai_ptr.
  - Remove some unused variables.
  - Move hostname validation to client section and remove the conditional parsing of the host (required for client) which solves #2.
  - Change close() to CLOSEFILE() macros which solves #3.
  - Make sure to release linked list at ai_ptr which solves #1.
  - Remove assert when erroring with timeout (now tested by online_bkup/online6). This was required since the tests in online6 now timeout instead getting GETADDRINFO for invalid passive hosts being specified.
  - Make sure release linked list at remote_ai_head (which solves #4).

- sr_port/ydberrors.msg
  - Added TCPCONNTIMEOUT message as the other timeout messages didn't quite do what was needed. There were a couple util_out_print() timeout message in tcp_open() but it really needed to be a real error - especially since it was now being tested in a test (online_bkup/online6).

- sr_unix/anticipatory_freeze.c - Issue: Resource leak - handle goes out of scope at line 358.
  - Add an FCLOSE() and error check for same prior to module returns.
  - Modify CLEAR_ANTICIPATORY_FREEZE() macro to set FREEZE_CLEARED to FALSE if it doesn't set it to TRUE so FREEZE_CLEARED is ALWAYS set by this macro.

- sr_unix/bin_load.c - Issue: Several allocated blocks and buffers need cleanup when leave.
  - Implement mechanism to track all of the allocated buffers and free any that are allocated if an error occurs or on normal routine return.

- sr_unix/cli_lex.c - Issue: retptr uninitialized at line 488 (return from module).
  - Initialize return value to null so is initialized when nothing is read.

- sr_unix/gtm_getpwuid.c
  - Removed an unneeded return value check from malloc (gtm_malloc does not return if no storage).
  - Reformatted an #ifdef block.

- sr_unix/gtmcrypt_entry.c - Issue: Resource leak - handle goes out of scope at line 135.
  - release/close the handle before return

- sr_unix/gtmrecv_end.c - Issue: jnlpool_strm_seqno[idx] uninitialized at line 194.
  - Only do the loop to dump the streams if the journal pool exists.

- sr_unix/gtmsecshr.c - Issue: Resource leak - procstrm goes out of scope at line 1026.
  - Close 'procstrm' before return on error.

- sr_unix/gtmsource_shutdown.c - Issue: maxindex is uninitialized at line 317.
  - Add '!auto_shutdown' to clause as 'maxindex' is only set when !auto_shutdown.

- sr_unix/iosocket_tls.c - Issue: 1. errlen2 is uninitialized at line 416.
                                  2. errlen is unintialized at line 466.
  - Errors aren't using the correct method of getting error out there. The 'errlen' var is only seldomly set. Change to use LEN_AND_STR() macro to provide length.

- sr_unix/mu_all_version_standalone.c - Issue: save_errno is unintialized at line 172.
  - Save 'errno' to 'save_errno' to initialize it for the error message.

- sr_unix/mucblkini.c - Issue: Vars bp1, bp2, and bmp were not released before errors.
  - Free these vars appropriately before leaving routine.

- sr_unix/op_zlink.c - Issue: srcnamelen uninitialized line 227.
  - Move initialization of 'srcstr' closer to where used and only for those options where srcnamelen is set.

- sr_unix/relinkctl.c - Issue: shm_hdr uninitialized in line 1069
  - Initialize 'shm_hdr' to prevent usage when is still uninitialized.
  - Add error checks and messages for SHMDT invocations.

- sr_unix/ss_anal_shdw_file.c - Issue: bitmap_buffer and bp allocations not freed prior to error returns (lines 118 and 123).
  - Add frees for bitmap_buffer and bp before error returns.

- sr_unix/trigger_source_read_andor_verify.c - Issue: rttabent uninitialized on line 356.
  - Initialize 'rttabent' to NULL
  - If not initialized, set 'rttabent' to our rtn_names list to begin search.

- sr_unix/util_output.c - Issue: Origin of this change is lost. It was from an issue raised for another module and digging down through the calls, we ended up here where a var as uninitialized after this loop because of the lack of a default clause such that 'length' didn't get set IIRC (the chose switch value in the scan's simulation did not match a case).
  - Add default clause to switch with an assert in it. Also added code to keep static scan from complaining about 'length'.

- sr_unix/wait_for_disk_space.c - Issue: freeze_cleared uninitialized at line 161.
 - Add a default for 'freeze_cleared' so is initialized.

- sr_unix_cm/gtcm_bgn_net.c - Issue: Resource leak - ai_ptr goes out of scope at 135, 148.
  - Release 'ai_ptr' linked list before normal and returns.

- sr_unix_cm/omi_prc_conn.c - Issue: Neither agname or ag_pass allocated memory is released before normal or error return.
  - Define OMI_FREE() macro for better cleanup. Add to any place before error or normal
    return.
  - Initialize 'agname' and 'ag_pass' so know if they are allocated.
  - Remove clauses that test memory allocation result. Note gtm_malloc() does not return
    if allocation fails.

- sr_unix_cm/rc_srvc_xct.c - Issue: Resource leak - elst var goes out of scope at line 198.
  - Added a macro to do the cleanups and added it before all return points.

- sr_unix_gnp/cmi_init.c - Issue: Resource leak - local_ai_ptr out of scope lines 67, 74, 85, 95,122.
  - Some minor formatting changes for code standards.
  - Make sure local_ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/cmi_open.c - Issue: Resource leak - ai_head out of scope line 123.
  - Some minor formatting changes for code standards.
  - Make sure ai_head linked list is freed before leave routine.

- sr_unix_gnp/cmj_get_port.c - Issue: Resource leak - ai_ptr out of scope line 123.
  - Removed - no users of this routine were found.

- sr_unix_gnp/cmj_getsockaddr.c - Issue: Resource leak - ai_ptr out of scope line 152.
  - Make sure ai_ptr linked list is freed before leave routine with this error return.

- sr_unix_gnp/cmu_getclb.c - Issue: Resource leak - ai_ptr out of scope lines 52, 57.
  - Minor changes for coding standards.
  - Make sure ai_ptr linked list is freed before leave routine.

- sr_unix_gnp/gtcm_gnp_server_main.c - Issue: status uninitialized line 287.
  - Change static routines to STATICFN{DCL,DEF} (coding standards)
  - The 'status' variable was not set until after it was referenced. Removed the test of the uninitialized status after the fetch.

- sr_unix_gnp/gtcm_open_cmerrlog.c - Issue: Potential to overrun end of lfn_path at line 80.
  - Change max usable length before truncation to make room for a null terminator.
chathaway-codes pushed a commit that referenced this issue Sep 12, 2018
…le entries so prior actions like ZSTEP continue to work after the job interrupt is handled"

This reverts commit dc0b865.

The reason is that jobinterrupt_set() (which changes entries in the transfer table) is interrupt code
that can be called at arbitrary points in time, including if interrupted code is actually in the process
of changing the transfer table entries itself (for example while compiling a $ZSTEP action like the
C-stack below indicates).

 #0  jobinterrupt_set (dummy_val=0) at sr_port/jobinterrupt_event.c:92
 #1  0x00007ffff715a67a in xfer_set_handlers (event_type=1, set_fn=0x7ffff71bbe86 <jobinterrupt_set>, param_val=0) at sr_port/deferred_events.c:209
 #2  0x00007ffff71bbe70 in jobinterrupt_event (sig=10, info=0x7fffffff8d30, context=0x7fffffff8c00) at sr_port/jobinterrupt_event.c:65
 #3  <signal handler called>
 #4  op_zst_break () at sr_port/op_zst_break.c:38
 #5  0x00007ffff6ececcf in stkok16 () at sr_x86_64/op_bkpt.s:93

In this case, if we save the current transfer table (to later restore in xfer_reset_handlers()),
we could be restoring an inconsistent transfer table later which might not be a safe thing to do.
For example, if 5 transfer table entries are in the process of being changed and the SIGUSR1 interrupt
occurs after 2 entries have been changed but the remaining 3 have not, then we would save (and in turn
later restore) a transfer table state that is inconsistent.

The previous code always reset the transfer table to hardcoded normal values which effectively lose the
ZSTEP action but that was still consistent in itself. So it is considered an unsafe change and undone.

As for the original issue of #359, I found a workaround and that is to redo the ZBREAK/$ZSTEP action
while inside the $ZINTERRUPT code. That ensures the ZSTEP action continues to be active even after the
$ZINTERRUPT. That is considered good enough for now.
chathaway-codes pushed a commit that referenced this issue Oct 12, 2018
…back process

This is an issue identified based on a rare ideminter_rolrec/interrupted_rollback_or_recover subtest
failure. The test ran 4 rollbacks and killed all but the last one using kill -9 before they finished.
The last rollback failed an assert.

%YDB-F-ASSERT, Assert failed in sr_unix/mutex.c line 1045 for expression (((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno)

Below is the C-stack

(gdb) where
 #0  kill () at ../sysdeps/unix/syscall-template.S:84
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:69
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fff2eebdeb0) at sr_unix/rts_error.c:159
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:92
 #6  mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045
 #7  gtm_mutex_lock (reg=0x106f960, mutex_spin_parms=0x7f7690440080, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE) at sr_unix/mutex.c:703
 #8  grab_lock (reg=0x106f960, is_blocking_wait=1, onln_rlbk_action=1) at sr_unix/grab_lock.c:83
 #9  mur_open_files () at sr_port/mur_open_files.c:492
 #10 mupip_recover () at sr_port/mupip_recover.c:195
 #11 mupip_main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip_main.c:124
 #12 dlopen_libyottadb (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688, main_func=0x401424 "mupip_main") at sr_unix/dlopen_libyottadb.c:148
 #13 main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip.c:19

(gdb) f 6
 #6  0x00007f76931b75bd in mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045
1045            assert(((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno);

(gdb) p lastJplCmt->jnl_seqno
$1 = 323033

(gdb) p jpl->jnl_seqno
$2 = 293120

We were expecting the two seqnos to be 1 apart but they are way apart.

This is because had killed a prior rollback (the first rollback) just before it had finished the rollback.
Below is its log.

> cat ROLLBACK2_1.logx
.
.
%YDB-I-RLBKJNSEQ, Journal seqno of the instance after rollback is 293120 [0x0000000000047900]
.
%YDB-I-FILERENAME, File ideminter_rolrec_0_7/interrupted_rollback_or_recover/g.mjl_2018284231355 is renamed to ideminter_rolrec_0_7/interrupted_rollback_or_recover/rolled_bak_g.mjl_2018284231355
Killed

The fact that it printed the RLBKJNSEQ and FILERENAME messages implies it was in mur_close_files()
when it was killed.

There is code in mur_close_files() where we reset various fields in "jpl" to reset the state of
the journal pool based on the post-rollback instance seqno. This code also needs to clear a few
2-phase-jnl-commit related fields so "mutex_salvage" when it comes in later (in this test, it came in
as part of the later rollback) after the kill -9 does not fail the above assert.

This failure can be easily reproduced by running an online rollback with the -resync qualifier to take back
the state of the instance to a prior seqno, setting a break point in "rel_lock()" and quitting from the
debugger once that break point is hit (this simulates kill -9 of online rollback). Reissuing the same online
rollback out of the debugger should show the assert failure. Reissuing the online rollback with a production
build did not show any issues so the suspicion is that this is a dbg-only issue hence no tracking is done
as a separate issue at gitlab.
chathaway-codes pushed a commit that referenced this issue Oct 18, 2018
…mer taking more than the allowed 50 seconds to pop

This fixes occasional test failures we see where a core file gets generated because the timer pop
gets delayed by around 75 seconds non a loaded system. Below is an example core trace.

(gdb) where
 #0  kill () at ../sysdeps/unix/syscall-template.S:84
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:69
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  timer_handler (why=14) at sr_unix/gt_timers.c:724
 #4  <signal handler called>
 #5  __fsync_nocancel () at ../sysdeps/unix/syscall-template.S:84
 #6  wcs_timer_start (reg=0x11e3970, io_ok=1) at sr_port/t_end_sysops.c:1281
 #7  t_end (hist1=0x12a20c0, hist2=0x0, ctn=18446744071629176832) at sr_port/t_end.c:1818
 #8  gvcst_put2 (val=0x122d008, parms=0x7ffcdda9aef0) at sr_port/gvcst_put.c:2641
 #9  gvcst_put (val=0x122d008) at sr_port/gvcst_put.c:299
 #10 op_gvput (var=0x122d008) at sr_port/op_gvput.c:74
chathaway-codes pushed a commit that referenced this issue Oct 25, 2018
…ure)

In one run of the v53002/C9E04002596 subtest, the following assert failed.

%YDB-F-ASSERT, Assert failed in sr_port/mutex_deadlock_check.c line 102 for expression
	(!jnlpool || !jnlpool->jnlpool_dummy_reg || jnlpool->jnlpool_dummy_reg->open || (repl_csa->critical != criticalPtr) || (NULL == cs_addrs))

And below was the C-stack at the time of the assert failure.

(gdb) where
 #0  kill () at ../sysdeps/unix/syscall-template.S:84
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:69
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va () at sr_unix/rts_error.c:159
 #5  rts_error_csa () at sr_unix/rts_error.c:92
 #6  mutex_deadlock_check () at sr_port/mutex_deadlock_check.c:101
 #7  mutex_long_sleep () at sr_unix/mutex.c:511
 #8  gtm_mutex_lock () at sr_unix/mutex.c:856
 #9  grab_lock () at sr_unix/grab_lock.c:83
 #10 repl_inst_ftok_counter_halted () at sr_unix/repl_inst_ftok_counter_halted.c:45
 #11 jnlpool_init () at sr_unix/jnlpool_init.c:764
 #12 gvcst_init () at sr_port/gvcst_init.c:917
 #13 gv_init_reg () at sr_port/gv_init_reg.c:56
 #14 gv_bind_name () at sr_port/gv_bind_name.c:75
 #15 op_gvname_common () at sr_port/op_gvname.c:117
 #16 op_gvname_fast () at sr_port/op_gvname.c:81

The assert at line 102 expects that if ever we are in mutex_deadlock_check() for the jnlpool, we better have
not opened any database file (i.e. non-NULL cs_addrs). Whereas in this call sequence clearly, we have a case
where while opening the database, we need to open the journal pool (because ydb_custom_errors env var is set)
and while opening the journal pool we notice a 32K semaphore counter overflow (artificially created by the test
system in this debug-run using ydb_db_counter_sem_incr env var) which results in getting a lock on the jnlpool
before making changes to the instance file and while trying to get the lock we notice it is being held by
some other process which is taking a long time so we go to mutex_deadlock_check() eventually for the jnlpool
and fail the assert while in the middle of a database open (so cs_addrs is non-NULL).

The assert is clearly wrong for this case.

The assert was only introduced in V6.3-001A in the below state.

assert((NULL == jnlpool.jnlpool_dummy_reg) || jnlpool.jnlpool_dummy_reg->open || (repl_csa->critical != criticalPtr));

And it was modified in V6.3-003 and V6.3-005 to add || conditions. Most likely to account for exceptions to
the assert as they were encountered.

I don't see the value in this assert so removing it.

Also corrected a pre-existing comment a few lines after the removed assert to reflect our renewed understanding
of the current failure possibility.
chathaway-codes pushed a commit that referenced this issue Nov 18, 2018
…secondary errors if primary error is out-of-memory

If already exiting, do not open any object/source directories (which could include relinkctl files)
as part of $ZROUTINES initialization. This avoids potentially nasty codepaths particulary if the
reason we are exiting is an out-of-memory.

We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered
necessary to create a user-visible issue for this.

For example, below are two C-stacks that showed up in core dumps while running the
simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid
such cores.

Core1
------
Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no
space allocating the C-stack in this core.

(gdb) where
 #0  ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41
 #1  util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731
 #2  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871
 #3  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #4  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261
 #5  gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #6  rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159
 #7  rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #8  relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679
 #9  relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159
 #12 zro_init () at sr_port/zro_init.c:51
 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694
 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232
 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152
 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159
 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145
 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698
 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192
 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60
 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108
 #29 gvnset () at fatalerror.c:56
 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #31 main () at fatalerror.c:32

Core2
-----
In this case there is a SIG-11 deep inside syslog(). Most likely due to an out-of-memory situation.

Program terminated with signal SIGSEGV, Segmentation fault.
 #0  vfprintf () from /usr/lib64/libc.so.6
 #1  fprintf () from /usr/lib64/libc.so.6
 #2  __vsyslog_chk () from /usr/lib64/libc.so.6
 #3  syslog () from /usr/lib64/libc.so.6
 #4  util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761
 #5  util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871
 #6  util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904
 #7  send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149
 #8  send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79
 #9  jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264
 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36
 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159
 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92
 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679
 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333
 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188
 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159
 #17 zro_init () at sr_port/zro_init.c:51
 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694
 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166
 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232
 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152
 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66
 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224
 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159
 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92
 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114
 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748
 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82
 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145
 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698
 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192
 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60
 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108
 #34 gvnset () at fatalerror.c:56
 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193
 #36 main () at fatalerror.c:32
chathaway-codes pushed a commit that referenced this issue Nov 21, 2018
…CK being called during exit handling

When a C program that spawned off multiple threads that used the SimpleThreadAPI (e.g. ydb_tp_st() etc.)
was deadlocked (due to a code issue), pressing Ctrl-C (SIGINT) did nothing so pressing Ctrl-\ (SIGQUIT)
to terminate the C program caused a MAXRTSERRDEPTH fatal error and resulted in a core dump.

Below is the actual output.

^C^\%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with coreQuit (core dumped)

The corresponding C-stack follows.

(gdb) where
 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52090) at sr_unix/rts_error.c:144
 #3  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52270) at sr_unix/rts_error.c:146
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52450) at sr_unix/rts_error.c:146
 #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #8  rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52630) at sr_unix/rts_error.c:146
 #9  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #10 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52810) at sr_unix/rts_error.c:146
 #11 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #12 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df529f0) at sr_unix/rts_error.c:146
 #13 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #14 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52bd0) at sr_unix/rts_error.c:146
 #15 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #16 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52db0) at sr_unix/rts_error.c:146
 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #18 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52f90) at sr_unix/rts_error.c:146
 #19 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101
 #20 send_msg_va (csa=0x0, arg_count=8, var=0x7fb28df53570) at sr_unix/send_msg.c:125
 #21 send_msg_csa (csa=0x0, arg_count=8) at sr_unix/send_msg.c:84
 #22 generic_signal_handler (sig=3, info=0x7fb28df53830, context=0x7fb28df53700) at sr_unix/generic_signal_handler.c:244
 #23 <signal handler called>
 #24 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb2880180a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #25 __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb288018040, cond=0x7fb288018080) at pthread_cond_wait.c:502
 #26 __pthread_cond_wait (cond=0x7fb288018080, mutex=0x7fb288018040) at pthread_cond_wait.c:655
 #27 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:80
 #28 start_thread (arg=0x7fb28df54700) at pthread_create.c:463
 #29 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The primary error was at #20 in send_msg_va() inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro.
The actual assert that failed inside the macro was the following.

sr_unix/gtm_multi_thread.h
---------------------------
     99                 /* We should never use pthread_* calls inside a signal/timer handler. Assert that */                    \
    100                 assert(!in_nondeferrable_signal_handler);                                                               \

We were in a signal handler handling a non-deferrable signal (Ctrl-\ aka SIGQUIT) and are about to do
a pthread_mutex_lock() library call which is a no-no.

If we are in an exit handler, it is possible for send_msg() to be needed (to log the signal that was received
etc.) but it is safer to not do any pthread activity since we cannot be sure if we are exiting while inside
a signal handler or not. Therefore the fix for this is to check if "process_exiting" global variable is TRUE
and if so, we skip all pthread* calls in the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED
macros.
chathaway-codes pushed a commit that referenced this issue Nov 21, 2018
A test C program with one or more threads each of which use ydb_tp_st() to do TP transactions
deadlocked once in a while when the database has concurrent udpates that caused the TP
transaction to restart.

Below is the C-stack of the hung TP worker thread.

Thread 5 (Thread 0x7f5d1d7ad700 (LWP 94507)):
 #0  __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
 #1  __GI___pthread_mutex_lock (mutex=0x7f5d222252c0 <thread_mutex>) at ../nptl/pthread_mutex_lock.c:78
 #2  rts_error_va (csa=0x0, argcnt=1, var=0x7f5d1d7ac5a0) at sr_unix/rts_error.c:146
 #3  rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:101
 #4  ydb_tp_s_common (stapi=1, tptoken=15090, tpfn=0x55c68a782bc0 <gvnset>, tpfnparm=0x7f5d2000fed0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:223
 #5  ydb_tp_sst (tptoken=15090, tpfn=0x55c68a782bc0 <gvnset>, tpfnparm=0x7f5d2000fed0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_sst.c:32
 #6  ydb_stm_tpthreadq_process (curTPWorkQHead=0x7f5d1001ba40) at sr_unix/ydb_stm_tpthread.c:125
 #7  ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:80
 #8  start_thread (arg=0x7f5d1d7ad700) at pthread_create.c:463
 #9  clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(gdb) f 4
 #4  ydb_tp_s_common (stapi=1, tptoken=15090, tpfn=0x55c68a782bc0 <gvnset>, tpfnparm=0x7f5d2000fed0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:223
223                     INVOKE_RESTART;

It notices a restartable situation and therefore wants to signal that but in order to do that it
goes through the function rts_error_va() (frame #2) which tries to do a PTHREAD_MUTEX_LOCK_IF_NEEDED
and that is hung because the pthread mutex lock is held by some other thread.

Turns out this thread was obtained by this same thread in a prior restart (in rts_error_va) but
it was never released. Since all SimpleAPI and SimpleThreadAPI calls to rts_error_va() go through
ydb_simpleapi_ch(), that is where we should be releasing the pthread lock. We do that already but
towards the end of the function.  And in case of a TP restart (ERR_TPRETRY error code), we return
before reaching that point.

The main fix is to sr_unix/ydb_simpleapi_ch.c to move the PTHREAD_MUTEX_UNLOCK_IF_NEEDED macro to
the beginning of ydb_simpleapi_ch(). The pthread unlock logic that was already there did not use
the macro (code duplication) and is now removed.

While at this, various other issues were noticed and fixed.

* sr_unix/ydb_stm_thread.c
  1) It did not initialize the posix_timer_thread_id to be the worker thread.
  2) It did not initialize the thread_mutex global variable (which is what is used by the
     PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED macros). This is now done
     using a new INITIALIZE_THREAD_MUTEX_IF_NEEDED macro.

* sr_unix/gtm_multi_thread.c : This is the only place where the thread_mutex global variable was
  previously initialized. This now invokes the new INITIALIZE_THREAD_MUTEX_IF_NEEDED macro.

* sr_port/gtm_malloc_src.h : was_holder was being initialized incorrectly before invoking the
  PTHREAD_MUTEX_UNLOCK_IF_NEEDED macro just before issuing a MEMORY error. The consequences of this
  issue are that the pthread mutex lock would not be released by gtm_malloc() in case of a MEMORY error
  in case the call is done by when multi-threading is turned on. Since multi-threading was turned on
  only for MUPIP JOURNAL ROLLBACK/RECOVER, this is a non-issue until now when multi-threading is used
  due to support of ydb_malloc_t() in the SimpleThreadAPI.

* sr_unix/gtm_multi_thread.h : Defines the new INITIALIZE_THREAD_MUTEX_IF_NEEDED macro.
chathaway-codes pushed a commit that referenced this issue Nov 27, 2018
…ops while a V4 block is dirty and SimpleThreadAPI is active

The simplethreadapi/lockst subtest failed as follows when the test framework randomly chose V4 format blocks.

%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with core

And produced a core with the following C-stack

 #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:72
 #2  send_msg_va (csa=0x0, arg_count=9, var=0x7f25a0f7ff40) at sr_unix/send_msg.c:123
 #3  send_msg (arg_count=9) at sr_unix/send_msg.c:74
 #4  gtm_assert2 (condlen=17, condtext=0x7f25a62dc0d6 "!timer_in_handler", file_name_len=45, file_name=0x7f25a62dc0a8 "sr_unix/send_msg.c", line_no=125) at sr_port/gtm_assert2.c:34
 #5  send_msg_va (csa=0x0, arg_count=9, var=0x7f25a0f80570) at sr_unix/send_msg.c:125
 #6  send_msg (arg_count=9) at sr_unix/send_msg.c:74
 #7  gtm_assert2 (condlen=17, condtext=0x7f25a62dc0d6 "!timer_in_handler", file_name_len=45, file_name=0x7f25a62dc0a8 "sr_unix/send_msg.c", line_no=125) at sr_port/gtm_assert2.c:34
 .
 .
 #32 send_msg_va (csa=0x0, arg_count=9, var=0x7f25a0f83d20) at sr_unix/send_msg.c:125
 #33 send_msg (arg_count=9) at sr_unix/send_msg.c:74
 #34 gtm_assert2 (condlen=17, condtext=0x7f25a6305658 "!timer_in_handler", file_name_len=51, file_name=0x7f25a6305320 "sr_port/gtm_malloc_src.h", line_no=685) at sr_port/gtm_assert2.c:34
 #35 gtm_malloc (size=4096) at sr_port/gtm_malloc_src.h:685
 #36 wcs_wtstart (region=0x5558a88e1170, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:538
 #37 wcs_stale (tid=93839273365872, hd_len=8, region=0x5558a8979628) at sr_port/t_end_sysops.c:1387
 #38 timer_handler (why=14) at sr_unix/gt_timers.c:815
 #39 <signal handler called>
 #40 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5558a88e2ea8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88
 #41 __pthread_cond_wait_common (abstime=0x0, mutex=0x5558a88e2e40, cond=0x5558a88e2e80) at pthread_cond_wait.c:502
 #42 __pthread_cond_wait (cond=0x5558a88e2e80, mutex=0x5558a88e2e40) at pthread_cond_wait.c:655
 #43 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:92
 #44 start_thread (arg=0x7f25a0f85700) at pthread_create.c:463
 #45 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The primary issue is failure in the assert "assert(!timer_in_handler)" inside the PTHREAD_MUTEX_LOCK_IF_NEEDED
macro at frame # 35 (gtm_malloc). This is because we are about to make a pthread_mutex_lock() call which should
not be done while inside a signal handler (according to the man pages). To fix this issue, we skip flushing
that particular cache-record in wcs_wtstart() thereby avoiding the need to call gtm_malloc() (and in turn
the pthread_mutex_lock() call) while inside the timer handler.
nars1 added a commit that referenced this issue Jul 28, 2023
Background
----------
Below is pasted from  https://gitlab.com/YottaDB/DB/YDB/-/issues/994#note_1385299862

* While fuzz testing YDBOcto, I noticed this YDB issue.

* The following query caused a SIG-11 in a Release build of YottaDB and an assert failure in a Debug build
  of YottaDB.

  ```sql
  OCTO> SELECT * from names where id = 1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111;
  ```

* A much simpler test case is provided below that does not rely on Octo and has the same issue.

  ```m
  $ cat gvsuboflow.m
  test	;
  	set $etrap="write $extract($zstatus,1,103),! set $ecode="""""
  	for max=1020:1:1050 do gvsuboflow(max)
  	quit

  gvsuboflow(max)	;
  	set x=$tr($justify(1,max)," ",1) w ^names(x)
  	quit
  ```

  **Release build output**
  ```c
  $ yottadb -run gvsuboflow
  .
  .
  150372986,gvsuboflow+1^x,%YDB-E-GVSUBOFLOW, Maximum combined length of subscripts exceeded,%YDB-I-GVIS,
  %YDB-F-KILLBYSIGSINFO1, YottaDB process 16656 has been killed by a signal 11 at address 0x00007FD1F5253819 (vaddr 0x000055A400313131)
  ```

  **Debug build output**
  ```c
  $ yottadb -run gvsuboflow
  .
  .
  150372986,gvsuboflow+1^x,%YDB-E-GVSUBOFLOW, Maximum combined length of subscripts exceeded,%YDB-I-GVIS,
  %YDB-F-ASSERT, Assert failed in sr_port/format2zwr.c line 47 for expression (ZWR_EXP_RATIO(src_len) <= max_len)
  ```

Issue
-----
* Running the above test case with a Release build of YottaDB, that also is built with ASAN, clearly showed
  where the memory overflow was happening. It was a `stack-buffer-overflow` in `gvsub2str.c`.

  ```
  =================================================================
  ==66729==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7fd9e9dee900 at pc 0x7fd9e8bd7175 bp 0x7ffff20985a0 sp 0x7ffff2098590
  WRITE of size 1 at 0x7fd9e9dee900 thread T0
  150372986,gvsuboflow+1^gvsuboflow,%YDB-E-GVSUBOFLOW, Maximum combined length of subscripts exceeded,%YD
      #0 0x7fd9e8bd7174 in gvsub2str sr_port/gvsub2str.c:77
      #1 0x7fd9e8b5cab2 in format_targ_key sr_port/format_targ_key.c:63
      #2 0x7fd9e8d14198 in mval2subsc sr_port/mval2subsc.c:309
      #3 0x7fd9e8d3ca14 in op_gvname_common sr_port/op_gvname.c:133
      #4 0x7fd9e8d3d937 in op_gvname_fast sr_port/op_gvname.c:81
  ```

Fix
---
* The overflow in `gvsub2str.c` is fixed by ensuring that we never go past the buffer bounds by the use
  of a new `ptr_top` local variable. This is done in both cases where `xlat_flag` is TRUE and FALSE.

* Additionally, before calling `format2zwr()` in `gvsub2str.c`, we check if the zwrite expansion of
  the input (`in_length`) would be greater than the output length (`targ_len`). If so, we reduce `in_length`
  by enough so its zwrite expansion would not exceed `targ_len`. We use a new `ZWR_EXP_RATIO_INVERT` macro
  to get this value (it is the converse of the ZWR_EXP_RATIO macro). This is necessary to avoid failing
  a pre-existing assert in `format2zwr.c` that verifies all callers have allocated enough space in the
  output for the given input. In this case in `gvsub2str.c`, we cannot increase `targ_len` and so we
  truncate `in_length` enough so they are accepted by `format2zwr()`. This means we would see a truncated
  output but that is okay considering this usually would be part of a GVSUBOFLOW error which would show
  a huge subscript. Seeing the first few subscripts should be enough in that case for the user to figure
  out the problem and correct it. No need to see the entire input subscripts.
nars1 added a commit that referenced this issue Aug 15, 2023
…INT/SIGQUIT handler more than once (fixes Ctrl-C to terminate Flask)

Background
----------
* We started seeing test failures in the `r140/ydbpython32` subtest (in the YDBTest project) on some
  in-house systems with the following diff.

  ```diff
  --- ydbpython32/ydbpython32.diff ---
  13d12
  < # %YDB-SUCCESS: Test successful
  ```

* Analysis revealed the following connection between the version of python on that system and whether
  the test passed on that system.

  ```
  Python 3.6.15    PASS
  Python 3.6.8     PASS
  Python 3.8.10    PASS
  Python 3.10.12   FAIL
  Python 3.11.2    FAIL
  Python 3.11.4    FAIL
  ```

* The above established that any system that had Python version later than `3.8.10` failed.

Issue
-----
* 773e989 and a few later commits made some fixes so a Ctrl-C terminates an interactive Simple API
  Flask application.

* It did this by invoking the `drive_non_ydb_signal_handler_if_any()` function which in turn invoked
  the user-defined signal handler for a `SIGINT` (the signal that gets sent on a Ctrl-C).

* But this function was being invoked `twice`.

* First with the following stack trace.

  ```c
  (gdb) where
  #0  drive_non_ydb_signal_handler_if_any (caller=0x7f13c4a256d3 "generic_signal_handler2", sig=2, info=0x7f13c4bbb878 <stapi_signal_handler_oscontext+3320>, context=0x7f13c4bbb8f8 <stapi_signal_handler_oscontext+3448>, drive_exit=0) at sr_unix/drive_non_ydb_signal_handler_if_any.c:32
  #1  generic_signal_handler (sig=2, info=0x7f13c4bbb878 <stapi_signal_handler_oscontext+3320>, context=0x7f13c4bbb8f8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:297
  #2  ydb_os_signal_handler (sig=2, info=0x7fff4ec19df0, context=0x7fff4ec19cc0) at sr_unix/ydb_os_signal_handler.c:85
  #3  <signal handler called>
  #4  __GI___poll (fds=0x7f13c5fe5a70, nfds=1, timeout=500) at ../sysdeps/unix/sysv/linux/poll.c:29
  #5  ?? ()
  #6  ?? ()
  #7  _PyEval_EvalFrameDefault ()
  .
  .
  ```

* And another (later invocation) with the following stack trace.

  ```c
  (gdb) where
  #0  drive_non_ydb_signal_handler_if_any (caller=0x7f13c4a4e58e "deferred_exit_handler", sig=2, info=0x7f13c4bbb878 <stapi_signal_handler_oscontext+3320>, context=0x7f13c4bbb8f8 <stapi_signal_handler_oscontext+3448>, drive_exit=1) at sr_unix/drive_non_ydb_signal_handler_if_any.c:32
  #1  signal_exit_handler (exit_handler_name=0x7f13c4a1e92e "deferred_exit_handler", sig=2, info=0x7f13c4bbb878 <stapi_signal_handler_oscontext+3320>, context=0x7f13c4bbb8f8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:81
  #2  deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #3  deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #4  gtm_exit_handler () at sr_unix/gtm_exit_handler.c:216
  #5  __run_exit_handlers (status=0, listp=0x7f13c6276838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
  #6  __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
  .
  .
  ```

* This meant the user defined signal handler for `SIGINT` was invoked twice before the process terminated.
  While this was okay with the older python version, with the newer python version, this caused a
  SIGSEGV (observed in the debugger when the second invocation happened below at line 86).

  **sr_unix/drive_non_ydb_signal_handler_if_any.c**
  ```c
     82         /* Invoke signal handler based on whether SA_SIGINFO was specified as it means calling the handler with 3 or 1 arguments. */
     83         if (NULL != sighandler3)
     84                 (*sighandler3)(sig, info, context);     /* Note - likely to NOT return */
     85         else
  -> 86                 (*sighandler1)(sig);                    /* Note - likely to NOT return */
  ```

* In general it is not safe to invoke the same signal handler again for the same signal event.

Fix
---
* The fix is simple and is to avoid duplicate invocation of the user-defined signal handler for SIGINT
  (and in addition for SIGTERM and SIGQUIT too).

* This is done by resetting the original signal handler (stored at `ydb_init()` time) to `SIG_IGN`
  thereby a second invocation of `drive_non_ydb_signal_handler_if_any()` skips invoking the user-defined
  signal handler.
nars1 added a commit that referenced this issue Sep 11, 2023
… detect signal/timer handling

Background
----------
* We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover`
  subtest failed with the following symptom.

  ```sh
  $ cat ROLLBACK1_3.logx
  mupip journal -ROLLBACK -back -verify -verbose "*"  -noonline -resync=369813 -lost=ROLLBACK1_3.lost
  Sat Sep  9 04:17:18 PM EDT 2023
  .
  .
  %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep  9 16:19:23 2023
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF]
  %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495]
  %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep  9 16:19:23 2023
  .
  .
  %YDB-I-MUINFOSTR,     Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233
  %YDB-I-MUINFOUINT4,     Record Offset : 65744 [0x000100D0]
  %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP
  %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use))
  Sat Sep  9 04:20:35 PM EDT 2023
  The time the mupip command took:  197
  ```

* The core file corresponding to the above assert failure had the following stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110
  #9  mur_close_files () at sr_port/mur_close_files.c:841
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116
  #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445
  #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913
  #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312
  #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862
  #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344
  #21 op_tcommit () at sr_port/op_tcommit.c:535
  #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323
  #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362
  #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400
  #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122
  #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158
  #27 mupip_recover () at sr_port/mupip_recover.c:588
  #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122
  #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22

  (gdb) p gtm_threadgbl_true->dio_buff.aligned
  $5 = 0x563c78429000 "GDSDYNUNX04"
  (gdb) p csd
  $6 = (sgmnt_data_ptr_t) 0x563c78429000
  (gdb) p timer_in_handler
  $1 = 1
  (gdb) p multi_thread_in_use
  $2 = 0

  (gdb) p forced_exit
  $3 = 2
  (gdb) p exit_handler_active
  $4 = 1
  (gdb) p in_os_signal_handler
  $1 = 0
  ```

Issue
-----
* The assert failure was in the db_ipcs_reset() -> DB_LSEEKREAD -> DBG_CHECK_DIO_ALIGNMENT.

* The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment.

  ```c
     53         /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer     \
     54          * code or in threaded code (as we have only ONE buffer to use). Assert that.                           \
     55          */                                                                                                     \
     56         assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use));    \
  ```

* In the failure case, even though we are executing in timer code we are actually in exit handler code
  (as can be seen by the `forced_exit` and `exit_handler_active` variables in the gdb analysis above).
  In this case, the exit handler code will not return out of the timer code and so it is okay for the
  assert to not be TRUE.

* The global variable being checked in the assert is `timer_in_handler`. This is where the issue is.
  That global variable being TRUE just means the `timer_handler()` function is in the current call stack.
  It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code.
  The assert is intended to protect against signal handler interrupting the mainline code. Therefore,
  the correct global variable to check in the assert is `in_os_signal_handler`.

Fix
---
* The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.
nars1 added a commit that referenced this issue Sep 25, 2023
…mplete deferred state setup before invoking xfer_set_handlers()

* After merging GT.M V7.0-001, the following tests failed in rare cases.
  - -t dual_fail_extend -replic -st dual_fail2_mustop_sigquit
  - -t v60000 -replic -st gtm4525b

* The failure symptom was the following.

  ```c
  (gdb) x/s gtm_threadgbl_true->util_outbuff
  0x17d3ed8:  "%YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 38 for expression (GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit))"
  ```

* And the C-stack was the following.

  ```c
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fffa2b5d390) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:38
  #7  set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #8  xfer_set_handlers (event_type=11, param_val=1730866112, popped_entry=0) at sr_port/deferred_events.c:191
  #9  generic_signal_handler (sig=15, info=0x7f7167e24218 <stapi_signal_handler_oscontext+3320>, context=0x7f7167e24298 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305
  #10 ydb_os_signal_handler (sig=15, info=0x7fffa2b5d9f0, context=0x7fffa2b5d8c0) at sr_unix/ydb_os_signal_handler.c:85
  #11 <signal handler called>
  #12 __GI___clock_nanosleep (clock_id=1, flags=1, req=0x7fffa2b5e058, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78
  #13 wait_for_repl_inst_unfreeze_nocsa_jpl (jpl=0x17ec240) at sr_port/anticipatory_freeze.h:517
  #14 wait_for_repl_inst_unfreeze (csa=0x18f7040) at sr_port/anticipatory_freeze.h:547
  #15 jnl_write_attempt (jpc=0x18f7a40, threshold=29324848) at sr_port/jnl_write_attempt.c:348
  #16 jnl_flush (reg=0x189afe8) at sr_port/jnl_flush.c:57
  #17 tp_tend () at sr_port/tp_tend.c:795
  #18 op_tcommit () at sr_port/op_tcommit.c:497

  (gdb) f 6
  #6  0x00007f71672ae771 in deferred_signal_handler () at sr_port/deferred_signal_handler.c:38
  38              assert(GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit));

  ```

* The `SET_FORCED_EXIT_STATE` macro call (in frame 9 above) is where the issue is.

  **sr_port/have_crit.h**
  ```c
      172 #define SET_FORCED_EXIT_STATE(SIG)                                                                                              \
      173 {                                                                                                                               \
      174         char                    *rname;                                                                                         \
      175                                                                                                                                 \
      176         GBLREF VSIG_ATOMIC_T    forced_exit;                                                                                    \
      177         GBLREF int              forced_exit_sig;                                                                                \
      178         GBLREF boolean_t        (*xfer_set_handlers_fnptr)(int4, void (*callback)(int4), int4 param, boolean_t popped_entry);   \
      179         GBLREF void             (*deferred_signal_set_fnptr)(int4 dummy_val);                                                   \
      180                                                                                                                                 \
      181         /* Below code is not thread safe as it modifies global variables "forced_exit"                                          \
      182          * and "forced_exit_sig".                                                                                               \
      183          */                                                                                                                     \
      184         assert(!INSIDE_THREADED_CODE(rname));                                                                                   \
      185         assert((0 == forced_exit) || (1 == forced_exit));                                                                       \
  --> 186         forced_exit = 1;                                                                                                        \
      187         forced_exit_sig = SIG;          /* Record the signal forcing us to exit */                                              \
      188         if (in_os_signal_handler)                                                                                               \
      189         {       /* If we are inside an OS signal handler and therefore had to defer exit                                        \
      190                  * handling, treat this as an outofband event as this is checked by lots of                                     \
      191                  * potentially long-running commands in the runtime (e.g. HANG etc.) and we                                     \
      192                  * want all of those to automatically trigger process exit handling.                                            \
      193                  * The below invocation takes care of the signal as a deferred outofband event                                  \
      194                  * that gets handled at the earliest safe point.                                                                \
      195                  */                                                                                                             \
      196                 if (NULL != xfer_set_handlers_fnptr)                                                                            \
  --> 197                         (*xfer_set_handlers_fnptr)(deferred_signal, deferred_signal_set_fnptr, 0, FALSE);                       \
      198                 /* else: it is "gtmsecshr" in which case outofband does not apply */                                            \
      199         }                                                                                                                       \
      200         /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */                                    \
  --> 201         SET_DEFERRED_EXIT_CHECK_NEEDED;                                                                                         \
      202         SET_FORCED_THREAD_EXIT;         /* Signal any running threads to stop */                                                \
      203         SET_FORCED_MULTI_PROC_EXIT;     /* Signal any parallel processes to stop */                                             \
      204 }
  ```

* Line 186 sets `forced_exit` and Line 201 sets the corresponding deferred event. But Line 197
  ends up invoking `deferred_signal_handler()` which has an assert that expects Line 186 and 201 to
  have happened at the same time.

* This is fixed by moving lines 188-199 above to execute AFTER lines 200-203. That way the state setup
  of the forced exit is finished first and then the outofband set up happens by the
  `xfer_set_handlers_fnptr` call.

* Now that Lines 186 and 201 are executed BEFORE line 197 in this commit, the assert failure seen
  in the test failure should be automatically fixed.
nars1 added a commit that referenced this issue Sep 25, 2023
* The `reorg/on_ntp_njnl_reorg` subtest failed once in 10 runs or so with the following assert.

  ```diff
  > reorg_5_6/on_ntp_njnl_reorg/reorg1.out
  > %YDB-F-ASSERT, Assert failed in sr_port/mu_split.c line 689 for expression (*top_off < max_fill)
  ```

* Below is the core analysis using gdb.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140573380031552) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140573380031552) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140573380031552, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd10fd0290) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  locate_block_split_point (blk_stat=0x61d00029eb98, level=1, cur_blk_size=16, max_fill=4, last_rec_size=0x7ffd10fd0950, last_key=0x7ffd10fd0e10 "srqponm", last_keysz=0x7ffd10fd0920, top_off=0x7ffd10fd0940) at sr_port/mu_split.c:689
  #9  mu_split (cur_level=0, i_max_fill=4096, d_max_fill=2908, blks_created=0x7ffd10fd15a0, lvls_increased=0x7ffd10fd15b0) at sr_port/mu_split.c:314
  #10 mu_reorg (gl_ptr=0x62d00021fec0, exclude_glist_ptr=0x7ffd10fd2820, resume=0x7ffd10fd26e0, index_fill_factor=100, data_fill_factor=71, reorg_op=0) at sr_port/mu_reorg.c:356
  #11 mupip_reorg () at sr_port/mupip_reorg.c:334
  #12 mupip_main (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218) at sr_unix/mupip_main.c:121
  #13 dlopen_libyottadb (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218, main_func=0x560122181020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218) at sr_unix/mupip.c:22

  (gdb) p *top_off
  $5 = 16

  (gdb) p max_fill
  $6 = 4

  (gdb) up
  #9  mu_split (cur_level=0, i_max_fill=4096, d_max_fill=2908, blks_created=0x7ffd10fd15a0, lvls_increased=0x7ffd10fd15b0) at sr_port/mu_split.c:314
  314                             status = locate_block_split_point(old_blk1_hist_ptr, level, old_blk1_sz, max_fill,

  (gdb) p old_blk1_sz
  $30 = 16

  (gdb) p delta
  $31 = 56

  (gdb) p blk_size
  $32 = 4096

  (gdb) p reserve_bytes
  $33 = 4096

  (gdb) p max_fill_sav
  $34 = 2908

  (gdb) p bstar_rec_sz
  $35 = 12

  (gdb) p cs_data->reserved_bytes
  $36 = 0
  ```

* Below is the source code corresponding to frame 9.

  **sr_port/mu_split.c**
  ```c
    299    if ((old_blk1_sz + delta) > (blk_size - reserve_bytes))
    300    {
    301            split_required = TRUE;
    302            if (level == gv_target->hist.depth)
    303            {
    304                    create_root = TRUE;
    305                    if ((MAX_BT_DEPTH - 1) <= level)
    306                            return cdb_sc_maxlvl;                           /* maximum level reached */
    307            }
    308            if (max_fill + bstar_rec_sz > old_blk1_sz)
    309            {       /* need more space than what was in the old block, so new block will be "too big" */
    310                    if (((SIZEOF(blk_hdr) + bstar_rec_sz) == old_blk1_sz) && !mu_reorg_upgrd_dwngrd_in_prog)
    311                            return cdb_sc_oprnotneeded;                     /* Improve code to avoid this */
    312                    max_fill = old_blk1_sz - bstar_rec_sz;
    313            }
    314            status = locate_block_split_point(old_blk1_hist_ptr, level, old_blk1_sz, max_fill,
    315                    &old_blk1_last_rec_size, new_blk1_last_key, &new_blk1_last_keysz, &new_leftblk_top_off);
  ```

* As the core analysis indicates, `max_fill` is 4 which is the reason the assert failed. And that happened
  at line 312.

* And this is because `old_blk1_sz` was a small value of `16`.

* But then I was wondering how come we got into the `if` at line 299 as we would get into it only if the
  block has lot of content and needs a split.

* Turns out the issue is that `reserve_bytes` is `4096` which is the same as `blk_size`. And so the
  right hand side of the `>` check in line 299 was 0 which is why the `if` succeeded even though
  `old_blk1_sz + delta` was only `16 + 56` i.e. 72 bytes, which is a lot less than the value that would
  usually require a block split.

* I then started looking at why `reserve_bytes` ended up being such a huge value when the file header
  field it mirrors (i.e. `cs_data->reserved_bytes`) is 0 as indicated in the gdb analysis above.

* That is when I realized this happened in the following line.

  **sr_port/mu_split.c**
  ```c
    233         reserve_bytes = i_max_fill;
  ```

* This is a bug that was introduced in GT.M V7.0-001 changes (52a92df) to `sr_port/mu_split.c`. And I see
  that this incorrect set of `reserve_bytes = i_max_fill` is removed in GT.M V7.1-000 (f9ca5ad).

* f9ca5ad has a lot more changes to support `mupip reorg upgrade` so I cannot cherry-pick that commit.

* Therefore, I am fixing this issue by removing line 233 in this commit.

* In addition, 895c2d3a had fixed some `clang-tidy` warnings in `sr_port/mu_split.c` related to the
  `reserve_bytes` variable. There was a block of code that set `reserve_bytes` BEFORE line 233. And
  that was flagged as a `[clang-analyzer-deadcode.DeadStores]` warning. And so the previous block of
  code that set `reserve_bytes` was removed in 895c2d3a. And along with it, logic related to
  `available_bytes` was made `#ifdef DEBUG`. Both these changes are now removed in this commit as
  that previous block had correctly set `reserve_bytes = cs_data->reserved_bytes` (and is the line
  that remains even in GT.M V7.1-000 f9ca5ad). That said, the update of `available_bytes` inside
  the `#ifdef DEBUG` block continues to be used only by a later assert so that is still kept inside
  a `DEBUG_ONLY()` macro in the new code to avoid the following `clang-tidy` warning.

  ```
  mu_split.c:warning: Value stored to 'available_bytes' is never read [clang-analyzer-deadcode.DeadStores]
  ```
nars1 added a commit that referenced this issue Sep 25, 2023
…n is_free_blks_ctr_ok() (regression in ea9950a)

* The resil_4/resil subtest failed in one rare in-house test run with the following symptom.

  ```diff
  > resil_4_6/resil/reorg4.out
  > %YDB-F-ASSERT, Assert failed in sr_port/bm_getfree.c line 401 for expression (FALSE)
  ```

* Below is relevant information from the core file.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (threadid=<optimized out>, signo=3, no_tid=<optimized out>) at ./nptl/pthread_kill.c:44
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fafb799c5a0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  is_free_blks_ctr_ok () at sr_port/bm_getfree.c:401
  #7  gdsfilext (blocks=100, filesize=12559321033, trans_in_prog=1) at sr_unix/gdsfilext.c:335
  #8  bm_getfree (hint_arg=196239399, blk_used=0x7fafb80a4410, cw_work=3, cs=0x7fafb6d841c0 <cw_set>, cw_depth_ptr=0x7fafb80a4540) at sr_port/bm_getfree.c:185
  #9  t_end (hist1=0x62d0001128c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:520
  #10 mu_reorg (gl_ptr=0x62d000112040, exclude_glist_ptr=0x7ffc96944400, resume=0x7ffc96933f60, index_fill_factor=78, data_fill_factor=44, reorg_op=0) at sr_port/mu_reorg.c:367
  #11 mupip_reorg () at sr_port/mupip_reorg.c:334
  #12 mupip_main (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988) at sr_unix/mupip_main.c:121
  #13 dlopen_libyottadb (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988, main_func=0x55aebf0d8420 <str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  0x00007fafb5ce61b2 in is_free_blks_ctr_ok () at sr_port/bm_getfree.c:401
  401                             assert(FALSE);  /* In pro, we will simply skip counting this local bitmap. */

  (gdb) list
  379    for (free_blocks = 0, free_bml = 0; free_bml < local_maps; free_bml++)
  380    {
  381            #ifdef DEBUG
  382            if ((0 != ydb_skip_bml_num)
  383                    && (BLKS_PER_LMAP <= (BLKS_PER_LMAP * free_bml))
  384                    && ((BLKS_PER_LMAP * free_bml) < ydb_skip_bml_num))
  385            {
  386                    free_bml = (ydb_skip_bml_num / BLKS_PER_LMAP) - 1;
  387                            /* - 1 to compensate the "free_bml++" done in "for" loop line */
  388                    free_blocks += (ydb_skip_bml_num - BLKS_PER_LMAP) / BLKS_PER_LMAP * (BLKS_PER_LMAP - 1);
  389                    continue;
  390            }
  391            #endif
  392            bml = bmm_find_free((uint4)free_bml, (sm_uc_ptr_t)MM_ADDR(cs_data), local_maps);
  393            if (bml < free_bml)
  394                    break;
  395            free_bml = bml;
  396            bml *= BLKS_PER_LMAP;
  397            if (!(bmp = t_qread(bml, (sm_int_ptr_t)&cycle, &cr))
  398                            || (BM_SIZE(BLKS_PER_LMAP) != ((blk_hdr_ptr_t)bmp)->bsiz)
  399                            || (LCL_MAP_LEVL != ((blk_hdr_ptr_t)bmp)->levl))
  400            {
  401                    assert(FALSE);  /* In pro, we will simply skip counting this local bitmap. */
  402                    continue;
  403            }

  (gdb) p/x bml
  $4 = 0x200

  (gdb) p/x ydb_skip_bml_num
  $3 = 0x2ec97f600

  (gdb) p free_bml
  $5 = 1
  ```

* The value of `bml` is 512 at line 401 and `ydb_skip_bml_num` is a non-zero value. This means we are
  in the HOLE section of the database file and should never do a `t_qread()` call on such a block. But
  we did it at line 397 and is why we ended up with an assert failure.

* The value of `free_bml` is 1 at line 401.

* If `free_bml` was 1 at line 382, the `if` block would have kicked in and recognized this is a hole
  and counted the free blocks for the HOLE and then move on to the non-HOLE section of the database file
  using the `continue` at line 389 in a different iteration of the for loop.

* But this did not happen. Therefore, `free_bml` was 0 at line 382 but became 1 at line 401.

* This means that `free_bml` was 0 at line 382 and got set to 1 at line 395.

* That is, the `bmm_find_free()` call at line 392 was passed in `free_bml=0` as the first parameter
  and returned `bml=1`. That is, it found all blocks in bitmap block 0 as busy and so returned bitmap
  block 1 as that which has free space.

* In this case, we should redo the `if` check in lines 381-391 right after setting `free_bml` in line 395.
  That would take care of skipping the `t_qread()` call for bitmap blocks in the HOLE section.
  This is exactly what is taken care of in this commit.
nars1 added a commit that referenced this issue Sep 25, 2023
…ctl() (sr_unix/gtmsource_process.c)

Background
----------
* The merge_5/tp_stress subtest failed on an in-house AARCH64 system in one rare run with the
  following symptom.

  ```diff
  > ##HOST##:##TEST_OUTPUT##/SRC_17_37_44.log
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtmsource_process.c line 400 for expression ((GTMSOURCE_SENDING_JNLRECS != gtmsource_state) || ((0 == poll_time) || (GTMSOURCE_IDLE_POLL_WAIT == poll_time)) GTMTLS_ONLY(DEBUG_ONLY(|| renegotiation_pending)))
  ```

* Below is the code surrounding the failed assert.

  **sr_unix/gtmsource_process.c**
  ```c
      392         /* Check if receiver sent us any control message. Typically, the traffic from receiver to source is very
      393          * low compared to traffic in the other direction. More often than not, there will be nothing on the pipe
      394          * to receive. Ideally, we should let TCP notify us when there is data on the pipe (async I/O on Unix and
      395          * VMS). But, we are not there yet. Since we do a select() before a recv(), we won't block if there is
      396          * nothing in the pipe. So, it shouldn't be an expensive operation even if done before every send. Also,
      397          * in doing so, we react to an XOFF sooner than later.
      398          */
      399         /* Make sure we don't sleep for an extended period of time if there is something to be sent across */
      400         assert((GTMSOURCE_SENDING_JNLRECS != gtmsource_state)
      401                         || ((0 == poll_time) || (GTMSOURCE_IDLE_POLL_WAIT == poll_time))
      402                         GTMTLS_ONLY(DEBUG_ONLY(|| renegotiation_pending)));
  ```

* Below are relevant details from the gdb anaylsis of the resulting core file.

  ```c
  (gdb) where
  #0  __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  gtmsource_recv_ctl () at sr_unix/gtmsource_process.c:400
  #7  gtmsource_process () at sr_unix/gtmsource_process.c:1620
  #8  gtmsource () at sr_unix/gtmsource.c:618
  #9  mupip_main (argc=13, argv=0xffffc0f44a08, envp=0xffffc0f44a78) at sr_unix/mupip_main.c:121
  #10 dlopen_libyottadb (argc=13, argv=0xffffc0f44a08, envp=0xffffc0f44a78, main_func=0xaaaacb921650 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #11 0x0000aaaacb920f34 in main (argc=13, argv=0xffffc0f44a08, envp=0xffffc0f44a78) at sr_unix/mupip.c:22

  (gdb) p gtmsource_state
  $1 = GTMSOURCE_SENDING_JNLRECS

  (gdb) p poll_time
  $2 = 999 == REPL_POLL_WAIT

  (gdb) p renegotiation_pending
  $3 = 0

  sr_port/repl_comm.h
  88:#define GTMSOURCE_IDLE_POLL_WAIT     10                      /* 10ms sleep in case nothing sent to the other side */
  89:#define REPL_POLL_WAIT                       (MILLISECS_IN_SEC - 1)  /* Maximum time (in ms) for select()/poll() to timeout */

  (gdb) p repl_tls.enabled
  $5 = 1

  (gdb) p repl_tls.renegotiate_state
  $6 = REPLTLS_WAITING_FOR_RENEG_TIMEOUT

  (gdb) p next_renegotiate_hrtbt
  $7 = 1

  (gdb) p heartbeat_stalled
  $8 = 0

  (gdb) p hrtbt_cnt
  $10 = 12

  (gdb) p renegotiate_factor
  $11 = 12

  (gdb) p next_renegotiate_hrtbt
  $12 = 1

  (gdb) p gtmsource_options.renegotiate_interval
  $13 = 180
  ```

Issue
-----
* As indicated in line 399 in the below code, the assert at line 400 is to make sure that we don't wait for
  an extended period of time in the REPL_RECV_LOOP at line 433. Hence the check for `poll_time` being 0
  or 10 milliseconds. Or that a TLS renegotiation is pending.

  **sr_unix/gtmsource_process.c**
  ```c
      392         /* Check if receiver sent us any control message. Typically, the traffic from receiver to source is very
      393          * low compared to traffic in the other direction. More often than not, there will be nothing on the pipe
      394          * to receive. Ideally, we should let TCP notify us when there is data on the pipe (async I/O on Unix and
      395          * VMS). But, we are not there yet. Since we do a select() before a recv(), we won't block if there is
      396          * nothing in the pipe. So, it shouldn't be an expensive operation even if done before every send. Also,
      397          * in doing so, we react to an XOFF sooner than later.
      398          */
  --> 399         /* Make sure we don't sleep for an extended period of time if there is something to be sent across */
  --> 400         assert((GTMSOURCE_SENDING_JNLRECS != gtmsource_state)
      401                         || ((0 == poll_time) || (GTMSOURCE_IDLE_POLL_WAIT == poll_time))
      402                         GTMTLS_ONLY(DEBUG_ONLY(|| renegotiation_pending)));
      403         if (GTMSOURCE_CHANGING_MODE == gtmsource_state)
      404                 return;
      405         if (GTMSOURCE_WAITING_FOR_CONNECTION == gtmsource_state)
      406                 return;
      407 #       ifdef GTM_TLS
      408         if (repl_tls.enabled && (REPLTLS_WAITING_FOR_RENEG_TIMEOUT == repl_tls.renegotiate_state) && next_renegotiate_hrtbt)
      409         {       /* Time to renegotiate the TLS/SSL parameters. */
      410                 heartbeat_stalled = TRUE;       /* Defer heartbeats until renegotiation is done. */
      411                 DEBUG_ONLY(renegotiation_pending = TRUE);
      412                 /* Send REPL_RENEG_ACK_ME message to the receiver. */
      413                 renegotiate_msg.type = REPL_RENEG_ACK_ME;
      414                 renegotiate_msg.len = MIN_REPL_MSGLEN;
      415                 gtmsource_repl_send((repl_msg_ptr_t)&renegotiate_msg, "REPL_RENEG_ACK_ME",
      416                                                 MAX_SEQNO, INVALID_SUPPL_STRM);
      417                 if (GTMSOURCE_CHANGING_MODE == gtmsource_state)
      418                         return;
      419                 if (GTMSOURCE_WAITING_FOR_CONNECTION == gtmsource_state)
      420                         return;
      421                 /* We now have to wait for REPL_RENEG_ACK from the receiver. Until then we defer sending journal
      422                  * records to the other side. This way, we don't end up having outbound data in the TCP/IP pipe
      423                  * during the time of renegotiation. TLS/SSL protocol doesn't handle application data when it is
      424                  * in the middle of renegotiation. Similarly, the receiver on receipt of the REPL_RENEG_ACK_ME
      425                  * message will defer sending any more messages to us until the renegotiation is completed.
      426                  */
      427                 repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_ACK;
      428                 repl_log(gtmsource_log_fp, TRUE, TRUE, "Waiting for REPL_RENEG_ACK\n");
      429                 poll_time = REPL_POLL_WAIT; /* because we are waiting for a REPL_RENEG_ACK */
      430         }
      431 #       endif
      432         recv_msgp = &recv_msg;
  --> 433         REPL_RECV_LOOP(gtmsource_sock_fd, recv_msgp, MIN_REPL_MSGLEN, poll_time)
  ```

* But in the failure case, as the gdb analysis indicates, all variables in the `if` check at line 408 above
  evaluated to TRUE. That means `renegotiation_pending` would be set to `TRUE` at line 411 and so the
  assert at line 400 would have been TRUE if only we had delayed it a bit.

Fix
---
* The fix is simple and is to move lines 399-402 to AFTER line 431. That is, it just fixes the placement
  of this longstanding assert to address this timing scenario where a TLS renegotiation is pending.

Notes
-----
* The reason this started failing only after GT.M V7.0-001 code changes were merged is because in that
  version, lines 564-573 were added below which caused `next_renegotiate_hrtbt` to be set to TRUE. This
  variable was never set to TRUE in the prior version and so the `if` check at line 408 above never
  evaluated to TRUE.

  **sr_unix/gtmsource_process.c**
  ```c
    551                         case REPL_HEARTBEAT:
    552                                 if (msg_is_cross_endian)
    553                                 {
    554                                         heartbeat_msg = (repl_heartbeat_msg_ptr_t)recv_msgp;
    555                                         tmp_seqno = *(seq_num *)&heartbeat_msg->ack_seqno[0];
    556                                         tmp_seqno = GTM_BYTESWAP_64(tmp_seqno);
    557                                         *(seq_num *)&heartbeat_msg->ack_seqno[0] = tmp_seqno;
    558                                         tmp_time4 = *(gtm_time4_t *)&heartbeat_msg->ack_time[0];
    559                                         tmp_time4 = GTM_BYTESWAP_32(tmp_time4);
    560                                         *(gtm_time4_t *)&heartbeat_msg->ack_time[0] = tmp_time4;
    561                                 }
    562                                 if ((!gtmsource_is_heartbeat_stalled) && (SHUTDOWN != gtmsource_local->shutdown))
    563                                         gtmsource_process_heartbeat((repl_heartbeat_msg_ptr_t)recv_msgp);
    564 #                               ifdef GTM_TLS
    565                                 hrtbt_cnt++;
    566                                 if ((0 < renegotiate_factor) && (0 == (hrtbt_cnt % renegotiate_factor))
    567                                         && (SHUTDOWN != gtmsource_local->shutdown))
    568                                 {
    569                                         next_renegotiate_hrtbt = TRUE;
    570                                         repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT;
    571                                         poll_time = REPL_POLL_WAIT;
    572                                 }
    573 #                               endif
    574                                 break;
  ```

* This failure was fairly reproducible on the AARCH64 systems (not on the x86_64 systems) with the
  following options passed to `com/gtmtest.csh`. The test failed at least once in 10 runs and likely
  more the slower the system is.

  ```
  -t merge -replic -reorg -st tp_stress -env gtm_test_tls=TRUE -env gtm_test_tls_renegotiate=1 -num_runs 10
  ```
nars1 added a commit that referenced this issue Sep 28, 2023
…terrupts cause %YDB-E-STACKCRIT

Background
----------
* This is an issue identified by @shabiel while trying to reproduce some other issue. The next
  bullet is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1029#description.

* To reproduce this, start an M process in direct mode while it is at the `YDB>` prompt. And in
  another terminal send it a lot of repeated job interrupts (i.e. `mupip intrpt`), and then go
  back to the M process and type anything and press enter. Eventually, the process will crash
  with this:

  ```c
  YDB>%YDB-E-STACKCRIT, Stack space critical
  %YDB-F-ASSERT, Assert failed in sr_port/mdb_condition_handler.c line 1184
          for expression (!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL))
  ```

* One needs to send repeated jobinterrupts to the same process. One easy way to do this is in tcsh
  using the following command from a different terminal. The below issues 10,000 interrupts to
  pid 52545.

  ```
  repeat 10000 $ydb_dist/mupip intrpt 52545
  ```

* One would see the assert show up in the original terminal.

Issue
-----
Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1029#note_1581268407

* The debugger shows it is a `STACKCRIT` error that in turn triggers an assert failure.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140478350915392) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140478350915392) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140478350915392, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffe608dc2d0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  mdb_condition_handler (arg=150373738) at sr_port/mdb_condition_handler.c:1184
  #9  rts_error_va (csa=0x0, argcnt=1, var=0x7ffe608dc7a0) at sr_unix/rts_error.c:198
  #10 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #11 jobintrpt_ztime_process (ztime=0) at sr_port/jobintrpt_ztime_process.c:84
  #12 trans_code () at sr_port/trans_code.c:193
  #13 stkok2 () at sr_x86_64/mum_tstart.s:37

  (gdb) f 11
  #11 jobintrpt_ztime_process (ztime=0) at sr_port/jobintrpt_ztime_process.c:84
  84              PUSH_MV_STENT(MVST_ZINTR);      /* MVST_ZTIMEOUT is identical to MVST_ZINTR with a flag to differentiate in debugging */

  (gdb) f 8
  #8  mdb_condition_handler (arg=150373738) at sr_port/mdb_condition_handler.c:1184
  1184                            assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL));

  (gdb) list
  1182                    if (!(SFT_ZINTR & proc_act_type) && !(SFT_ZTIMEOUT & proc_act_type))    /* ztimeout vector precompiled */
  1183                    {
  1184                            assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL));
  1185                            trans_code_cleanup();
  ```

* I verified that with a Release/PRO build, there is no issue.

  ```c
  YDB>%YDB-E-STACKCRIT, Stack space critical
  %YDB-E-ERRWZINTR, Error while processing $ZINTERRUPT
  %YDB-E-ZINTDIRECT, Attempt to enter direct mode from $ZINTERRUPT
                  At M source location +1^GTM$DMOD
  ```

* The issue is in line 1184 which needs to take a STACKCRIT error into account in the assert.

Fix
---
* The following change should fix the failure in my understanding and is implemented in this commit.

  ```diff
  < assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL));
  > assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL) || ((int)ERR_STACKCRIT == SIGNAL));
  ```

* And it did fix the failure when I manually tested with the above change.
nars1 added a commit that referenced this issue Oct 24, 2023
… by multiple threads

Background
----------
* The `simplethreadapi/ydb550` subtest in the YDBTest project failed in a very rare test run with
  the following symptom.

  ```diff
  7a8,10
  > %YDB-F-KILLBYSIGSINFO1, YottaDB process 30596 has been killed by a signal 11 at address 0x00007F75FAE7A6A6 (vaddr 0x00000000000010C8)
  > %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object
  > Quit (core dumped)
  ```

* Below is the C-stack of the SIG-11.

  ```c
  Thread 1 (Thread 0x7f46d05ff6c0 (LWP 29944)):
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=<optimized out>, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:91
  #5  generic_signal_handler (sig=11, info=0x7f46d1749e18 <stapi_signal_handler_oscontext+3320>, context=0x7f46d1749e98 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:494
  #6  ydb_os_signal_handler (sig=11, info=0x7f46d05fe530, context=0x7f46d05fe400) at sr_unix/ydb_os_signal_handler.c:85
  #7  <signal handler called>
  #8  ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x562be5c2f2a3 <callback1>, tpfnparm=0x7f46d05feea0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:65
  #9  start_thread (args=0x7ffd66c04a80) at simplethreadapi/inref/ydb550.c:116
  #10 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
  #11 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

  (gdb) f 8
  #8  ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x562be5c2f2a3 <callback1>, tpfnparm=0x7f46d05feea0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:65
  65                      assert(LYDB_RTN_NONE == TREF(libyottadb_active_rtn));

  (gdb) p lcl_gtm_threadgbl
  $1 = (void *) 0x0

  (gdb) p gtm_threadgbl_true
  $2 = (gtm_threadgbl_true_t *) 0x7f46b4001b80
  ```

Issue
-----
* The SIG-11 is because `TREF(libyottadb_active_rtn)` tries to dereference the local copy of the global
  variable `gtm_threadgbl_true` which is stored in `lcl_gtm_threadgbl` but that is NULL.

* The reason it is NULL is because of a longstanding timing issue in `ydb_tp_st.c` as well as potentially
  other functions in the Simple Thread API.

* In the failing test, multiple threads are created in `YDBTest/simplethreadapi/inref/ydb550.c` and each
  thread does a call to `ydb_tp_st()` as the first call (it does not do a `ydb_init()`).

* The `SETUP_THREADGBL_ACCESS` at line 43 takes a copy of `gtm_threadgbl_true` into `lcl_gtm_threadgbl`.
  And then invokes the `LIBYOTTADB_RUNTIME_CHECK` macro at line 44.

  **sr_unix/ydb_tp_st.c**
  ```c
     41         DCL_THREADGBL_ACCESS;
     42
     43         SETUP_THREADGBL_ACCESS;
     44         LIBYOTTADB_RUNTIME_CHECK((int), errstr);
  ```

* But since `ydb_init()` has not been called yet, `lcl_gtm_threadgbl` would be NULL after line 43.

* Now inside the `LIBYOTTADB_RUNTIME_CHECK` macro, it is possible for 2 threads T1 and T2 to reach line 108
  below at the same time with `lcl_gtm_threadgbl` set to NULL in the `ydb_tp_st()` calls of both threads.

  **sr_unix/libyottadb_int.h**
  ```c
     96 #define LIBYOTTADB_RUNTIME_CHECK(RETTYPE, ERRSTR)                                                               \
     97 MBSTART {                                                                                                       \
      .
      .
    107         /* No threadgbl usage in this macro until the following block completes */                              \
    108         if (!ydb_init_complete)                                                                                 \
    109         {       /* Have to initialize things before we can establish an error handler */                        \
    110                 assert(!USING_ALTERNATE_SIGHANDLING);   /* Go uses ydb_main_lang_init() instead */              \
    111                 if (0 != (status = ydb_init()))         /* Note - sets fgncal_stack */                          \
    112                 {                                                                                               \
    113                         SET_STAPI_ERRSTR_MULTI_THREAD_SAFE(-status, (ydb_buffer_t *)ERRSTR);                    \
    114                         return RETTYPE -status;                                                                 \
    115                 }                                                                                               \
    116                 /* Since we called "ydb_init" above, "gtm_threadgbl" would have been set to a non-null VALUE    \
    117                  * and so any call to SETUP_THREADGBL_ACCESS done by the function that called this macro        \
    118                  * needs to be redone to set "lcl_gtm_threadgbl" to point to this new "gtm_threadgbl".          \
    119                  */                                                                                             \
    120                 SETUP_THREADGBL_ACCESS;                                                                         \
    121         }                                                                                                       \
    122 } MBEND
  ```

* Let us say one thread T1 finds `ydb_init_complete` as 0 and goes into the `if` block at line 108
  and does the `ydb_init()` call and later does the `SETUP_THREADGBL_ACCESS` call at line 120.
  This means that T1 would have updated `lcl_gtm_threadgbl` to a non-NULL value at line 120.

* Let us say the other thread T2 finds `ydb_init_complete` as non-zero at line 108 (because in the
  mean time thread T1 had done the `ydb_init()` call at line 111).  This means it would skip the
  `SETUP_THREADGBL_ACCESS` call at line 120 as well. And so it would continue to have a NULL value
  of `lcl_gtm_threadgbl`.

  When the macro returns from T2, this NULL value would cause the SIG-11 in a `TREF` access that is
  there in later portions of `ydb_tp_st.c`.

* Fortunately, all usages of `TREF` that I see in the normal code path are in debug-only code in
  `ydb_tp_st.c` (i.e. in asserts) and so this does not look like a Release build issue.

Fix
---
* The fix is to move the `SETUP_THREADGBL_ACCESS` call at line 120 from inside the `if` at line 108
  above to AFTER the `if` at line 108. That way both threads T1 and T2 will call `SETUP_THREADGBL_ACCESS`.

* While doing this, I noticed that various callers of the `LIBYOTTADB_RUNTIME_CHECK` macro currently do
  a `SETUP_THREADGBL_ACCESS` call already and those calls can be removed now that it is anyways done
  inside the `LIBYOTTADB_RUNTIME_CHECK` macro. This meant changes to various callers all of which were
  files implementing the Simple Thread API functions (i.e. ydb_*_t.c OR ydb_*_st.c OR
  `ydb_sig_dispatch()` which was invoked by the YDBGo wrapper also a Simple Thread API application).

* Additionally, a similar move of the `SETUP_THREADGBL_ACCESS` call was done inside the macro
  `LIBYOTTADB_RUNTIME_CHECK_NORETVAL` which is very similar to the `LIBYOTTADB_RUNTIME_CHECK` macro.
nars1 added a commit that referenced this issue Nov 10, 2023
…SRVF error from INGEST^%YDBJNLF

Background
----------
Below is pasted from https://gitlab.com/YottaDB/DBMS/YDBOcto/-/issues/915#note_1401714173

* Recording another rare failure symptom. The error is a `%YDB-E-GTMSECSHRSRVF`.

* This is the first time it has happened in the pipeline.

* Below are details from the pipeline failure.

  https://gitlab.com/YottaDB/DBMS/YDBOcto/-/jobs/4328647867

  ```sh
  $ cd build/bats-test.RPiMQ7

  $ cat bats_test.out
   --> Running test [/builds/YottaDB/DBMS/YDBOcto/build/bats_tests/test_ydbjnlf_in_octo.bats] : subtest [TYIO01 : Test %YDBJNLF Octo functionality] : in /builds/YottaDB/DBMS/YDBOcto/build/bats-test.RPiMQ7 : build type Debug

  $ cat stdout.txt
  .
  .
  DROP TABLE
  CREATE TABLE
  # Extract journals
  Error occurred: 150379506,INGEST+22^%YDBJNLF,%YDBJNLF-F-FILEFMTERR MUPIP JOURNAL EXTRACT stderr line #2 is: YDB-MUPIP[48666]: %YDB-E-GTMSECSHRSRVF, Client - 48666 : Attempt to service request failed (retry = 0), %YDB-I-TEXT, sendto to gtmsecshr failed, %SYSTEM-E-ENO2, No such file or directory -- generated from 0x00007F7DC8A0747F.
  ```

* The suspicion is that running tests concurrently is the cause of this failure.

* That said, we have seen this failure in-house even before when Octo tests were run concurrently
  and/or YDBTest tests and YDBOcto tests were running on the same system.

* It has always happened in a test that does a `mupip journal -extract`.

* In one occasion, the TC036 subtest reliably failed. In that case, doing a `kill -9` of the root-owned
  `gtmsecshr` process made the test reliably pass. So this suggests some issue with communicating to
  that particular gtmsecshr.

* The reason why `mupip journal -extract` shows up the problem is because it requires `gtmsecshr`
  because it opens the database file as READ_ONLY and so cannot flush the ipcs (semid/shmid) to the
  database file header and has to use gtmsecshr for that.

* In some cases, gtmsecshr socket might be deleted (I suspect) and so one gets a failure trying to
  communicate to gtmsecshr.

Issue
-----
* The above suspicion that the gtmsecshr socket got deleted turned out to be incorrect.

* In the pipeline, the `TYIO01` subtest is the first subtest that requires the use of `gtmsecshr` (this
  was verified by noting down the `ps -ef` listing at various stages of the YDBOcto pipeline). Therefore
  this was not a case of `gtmsecshr` already running and having its socket getting deleted by someone else
  just before the `TYIO01` subtest needed it for the `mupip journal -extract` that it does inside the
  `INGEST^%YDBJNLF` invocation.

* This instead turned out to be a different issue.

* This subtest failed almost every day once in the YDBOcto nightly pipeline. But this subtest never failed
  when run manually in an interactive environment.

* Additionally, the `%YDB-E-GTMSECSHRSRVF` message (with the `sendto to gtmsecshr failed` additional detail)
  seen in the failure is actually sent to the syslog (see below code). Note the `send_msg_csa()` call which
  sends it to syslog.

  **sr_unix/secshr_client.c**
  ```c
     294        if (0 < loop_count)
     295                /* No message unless attempted server start at least once */
  -> 296                send_msg_csa(CSA_ARG(NULL) VARLSTCNT(11) ERR_GTMSECSHRSRVF, 4,
     297                                RTS_ERROR_TEXT("Client"), process_id,
     298                                loop_count - 1, ERR_TEXT, 2,
     299                                RTS_ERROR_TEXT("sendto to gtmsecshr failed"), save_errno);
     300        START_SERVER;
  ```

* And `sr_unix/ydbjnlf.mpt` reads `stderr` for messages from `mupip journal -extract` and any message
  that it finds which is not of the form `YDB-I-` or `YDB-W-` it assumes is an error and stops processing
  at line 98 below.

  **sr_unix/ydbjnlf.mpt**
  ```m
     89     set mupipcmd="ydb_extract_nocol=1 $ydb_dist/mupip journal -extract=-stdout -forward -show=all -fences=none -noverify -detail -full "_jnlfile
     90     open "mupip":(shell="/bin/sh":command=mupipcmd:stderr="mupip_err":readonly:recordsize=1048576)::"pipe"
     91    ; Read the first three lines of stderr to confirm journal file format
     92    use "mupip_err"
     93    set prevline=""
     94    for linenum=1:1 do  quit:line?@("1""%""1"""_prefix_"""1""-I-MUJNLSTAT, Forward processing started"".E")
     95    . read line
     96    . set:$zeof $ecode=",U253,"
     97    . set prevline=line
  -> 98    . set:line'?@("1""%""1"""_prefix_"""1""-""1(1""W"",1""I"")1""-"".E") $ecode=",U250,"
     99    ; Purge old records with this label once MUPIP reports it has started forward processing
    100    ; Starting forward processing also confirms that the journal file format is correct
  ```

* Given the above 3 bullets, I came up with a theory as to the cause of the issue.

  Processes running inside a docker container (like the pipeline jobs) send syslog messages to `stderr`.

  In case the system is loaded (made more likely by YottaDB/DBMS/YDBOcto#915 where the Octo bats tests
  now run concurrently in the pipeline), it is possible for line 300 to have forked off the `gtmsecshr`
  server (using the `START_SERVER` macro) in the first iteration (when `loop_count` is 0).

  Since inside the macro we sleep only 0.5 seconds before returning, it is possible the server has still
  not created the socket file (a file named `ydb_secshr2E11FD23=` under the directory `$ydb_tmp`) when
  we come to line 294 in the second iteration (when `loop_count` is 1).

  Therefore, it is possible for us to send the `%YDB-E-GTMSECSHRSRVF` message to the syslog but given
  that this runs inside a docker container, the syslog message goes to stderr and that gets seen by
  line 98 of `sr_unix/ydbjnlf.mpt` above.

* And that causes `YDBJNLF` to terminate with an error because it saw a `%YDB-E-` message which is not
  a `%YDB-I-` or `%YDB-W-`.

* That explains why this test failed only in the pipeline and that too not so reliably. It depended on
  the system load.

Primary fix
-----------
* I realized that the syslog message `%YDB-E-GTMSECSHRVF` is not an error.  It is just a warning type
  message indicating that one try did not work and that we are going to try again (actually
  a few more times/iterations). Therefore, a `%YDB-E-GTMSECSHRSRVF` is not right in that case.

* It is better recorded as `%YDB-W-GTMSECSHRSRVF` to indicate the warning.

* But then once `MAX_COMM_ATTEMPTS` attempts have been done and the server is still not up, we send a
  `%YDB-E-GTMSECSHRVF` error (in the below code).

  **sr_unix/secshr_client.c**
  ```c
    444         if (MAX_COMM_ATTEMPTS < loop_count)
    445         {
    446                 gtm_putmsg_csa(CSA_ARG(NULL) VARLSTCNT(10) ERR_GTMSECSHRSRVF, 4,
    447                                 RTS_ERROR_TEXT("Client"), process_id, loop_count - 1,
    448                                 ERR_TEXT, 2, RTS_ERROR_TEXT("Unable to communicate with gtmsecshr"));
  ```

* Therefore, there is no reason that I see to send a syslog message for each iteration that failed when
  we are sending a syslog message after the final iteration too failed.

* Therefore, I am removing lines 294-299 as part of this commit as they just mislead the user (for example
  in case the server got started fine by the 3rd iteration which is BEFORE the final iteration, we would
  see a GTMSECSHRSRVF message in the syslog that would confuse the user as to something is wrong when there
  is nothing to worry about except that the system was loaded at some point in time).

  **sr_unix/secshr_client.c**
  ```c
    294        if (0 < loop_count)
    295                /* No message unless attempted server start at least once */
    296                send_msg_csa(CSA_ARG(NULL) VARLSTCNT(11) ERR_GTMSECSHRSRVF, 4,
    297                                RTS_ERROR_TEXT("Client"), process_id,
    298                                loop_count - 1, ERR_TEXT, 2,
    299                                RTS_ERROR_TEXT("sendto to gtmsecshr failed"), save_errno);
    300        START_SERVER;
  ```

Secondary Fix
-------------
* With the `Primary Fix` above, there are no more changes needed to `sr_unix/ydbjnlf.mpt` as it will not
  see the `%YDB-E-GTMSECSHRSRVF` syslog messages from intermediate iterations anymore in stderr.

* But the above issue exposes a general issue in `%YDBJNLF` in that it parses the `stderr` output and can
  be run in processes inside a docker container, and the `mupip journal extract` it runs can potentially
  send other syslog messages (not sure exactly but it is a possibility). Therefore, it must be ready to
  accept such messages and not abruptly terminate processing on seeing something unexpected.

* All syslog messages would start with a `YDB-MUPIP` line since `mupip journal -extract` is the command
  that is invoked by `%YDBJNLF`. Therefore, `sr_unix/ydbjnlf.mpt` is modified to check for lines starting
  with `YDB-MUPIP` and if so skip them as these are syslog messages (while running inside docker) and not
  errors.

* Note that this fix is not necessary at this point since the only syslog message we have seen so far is
  already removed by the `Primary Fix` but this `Secondary Fix` might prove useful in the future when
  other syslog messages end up being possible in the `mupip journal -extract` output.
nars1 added a commit that referenced this issue Nov 15, 2023
…ct block allocations in previous bitmap

Background
----------
* The `reorg_5/on_ntp_njnl_reorg` subtest failed with the following symptom.

  ```diff
  3c3,5
  < PASS from on_ntp_njnl_reorg
  ---
  > FAIL from on_ntp_njnl_reorg  # Subtest stopped as it has too many core files (Threshold = 10 ; Actual = 13)
  > Killed
  ```

* There were a lot of assert failures and core files. The primary failure was an assert failures
  in a `mupip reorg -fill=39 -index=95 -truncate` process.

  Line 1552 below is the primary assert failure (in `bml_status_check.c`).

  **reorg_5_7/online_reorg_31388.outx.4**
  ```
      1 # Fri 10 Nov 2023 06:16:09 PM EST : cnt = 4 ; ff = 39 ; inff = 95
      2 # Fri 10 Nov 2023 06:16:09 PM EST : nice +19 mupip reorg -fill=39 -index=95 -truncate
      .
   1544 Global: zyxwvu95 (region DEFAULT)
   1545 Blocks processed    : 16
   1546 Blocks coalesced    : 3
   1547 Blocks split        : 9
   1548 Blocks swapped      : 16
   1549 Blocks freed        : 0
   1550 Blocks reused       : 9
   1551 Blocks extended     : 0
   1552 %YDB-F-ASSERT, Assert failed in sr_port/bml_status_check.c line 72 for expression ((gds_t_acquired != cs->mode) || (BLK_BUSY != bml_status))
   1553 %YDB-F-ASSERT, Assert failed in sr_port/sec_shr_map_build.c line 62 for expression (bitnum > prev_bitnum)
   1554 %YDB-F-ASSERT, Assert failed in sr_port/wcs_recover.c line 173 for expression ((!dollar_tlevel && !cr_array_index) || (dollar_tlevel && (!si->cr_array_index || (NULL != si->kip_csa))))
   1555 %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown
  ```

Issue
-----
* The assert failure in `bml_status_check.c` indicates that we found a block that was allocated in this
  transaction (in `bm_getfree.c`) but the local bitmap for that allocated block says the block is already
  marked `BUSY`. This is an out-of-design situation as only blocks marked `FREE` or `RECYCLED` in the
  local bitmap should be allocated.

* The C-stack and relevant variables from the core files using gdb are captured below.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fff32e6efc0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  bml_status_check (cs=0x7f9e639cc4e0 <cw_set>) at sr_port/bml_status_check.c:72
  #7  t_end (hist1=0x2150840, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1637
  #8  mu_swap_root (gl_ptr=0x2151540, root_swap_statistic_ptr=0x7fff32e727ac, upg_mv_block=0) at sr_unix/mu_swap_root.c:233
  #9  mupip_reorg () at sr_port/mupip_reorg.c:399
  #10 mupip_main (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78) at sr_unix/mupip_main.c:117
  #11 dlopen_libyottadb (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78, main_func=0x4014a4 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #12 main (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78) at sr_unix/mupip.c:22

  (gdb) p/x ydb_skip_bml_num
  $4 = 0x320dd1200

  (gdb) p/x cw_set[0].blk
  $33 = 0x320dd1bff
  (gdb) p/x cw_set[1].blk
  $35 = 0x1
  (gdb) p/x cw_set[2].blk
  $34 = 0x320dd1c00

  (gdb) p cw_set[0].mode
  $37 = gds_t_acquired
  (gdb) p cw_set[1].mode
  $39 = gds_t_write
  (gdb) p cw_set[2].mode
  $38 = gds_t_writemap
  ```

* The debug-only `ydb_test_4g_db_blks` env var is enabled in this test run as `ydb_skip_bml_num` is set
  to a non-zero value (seen above).

* This means that after the local bitmap `0`, the next local bitmap that is used would be `0x320dd1200`.
  All blocks from `0x200` till `0x320dd11ff` will not be used by the database logic (i.e. there will be
  a huge hole in the database file) for block allocation.

* The `cw_set[0].blk` indicates the allocated block number is `0x320dd1bff`. But `cw_set[2].blk` indicates
  this allocation happened in the local bitmap block `0x320dd1c00`.

* Notice that the allocated block number actually is 1 less than the bitmap block number.

* This means that the allocated block was actually in the `previous` local bitmap. This is an out-of-design
  situation.

* The allocation happened in `bm_getfree.c` but the point of assert failure is much later.

Changes
-------
* There is logic in `bm_getfree.c` that finds a free bit in the local bitmap and stores it in a variable
  `free_bit`. In the failure case above, `free_bit` must have ended up with a value of `-1` (which also
  happens to be the value of the `NO_FREE_SPACE` macro) to result in the failure.

* It is not clear to me how this happened but I suspect this was possible only because of interactions
  with the debug-only `ydb_skip_bml_num` scheme.

* That is, my suspicion is that it is some issue in the `ydb_skip_bml_num` scheme.

* Towards better understanding how `free_bit` ended up with the `-1` value, this commit adds an assert
  that `free_bit` is never negative.

* In the test failure case above, we would have seen this new assert fail at a much earlier point thereby
  giving us a better core file to analyze.

* This commit enables us to better analyze such failures if/when they happen in the future (the cause is
  still unknown).
nars1 added a commit that referenced this issue Nov 15, 2023
…ert failure)

Background
----------
* Below is a first-time failure, when running the `r126/ydb464` subtest (from the YDBTest project), that
  I noticed while trying to reproduce some other failure.

  ```diff
  --- ydb464/ydb464.diff ---
  19a20,73
  > r126_0_31/ydb464/simpleapi2/child98118.log
  > %YDB-F-ASSERT, Assert failed in sr_port/insert_region.c line 110 for expression ((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit))
  ```

* The C-stack and relevant variables from the core file are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffee07f7480) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  #7  mlk_unlock (p=0x1591940) at sr_port/mlk_unlock.c:70
  #8  tp_unwind (newlevel=0, invocation_type=ROLLBACK_INVOCATION, tprestart_rc=0x0) at sr_port/tp_unwind.c:294
  #9  op_trollback (rb_levels=0) at sr_port/op_trollback.c:200
  #10 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569
  #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:230
  #12 signal_exit_handler (exit_handler_name=0x7ff4913b071e "deferred_exit_handler", sig=2, info=0x7ff491795458 <stapi_signal_handler_oscontext+3224>, context=0x7ff4917954d8 <stapi_signal_handler_oscontext+3352>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #15 rel_crit (reg=0x14d0170) at sr_unix/rel_crit.c:81
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  #17 op_lock2_common (timeout=0, laflag=64 '@') at sr_port/op_lock2.c:242
  #18 op_incrlock_common (timeout=0) at sr_port/op_incrlock.c:49
  #19 ydb_lock_incr_s (timeout_nsec=0, varname=0x7ffee07f8c30, subs_used=0, subsarray=0x0) at sr_unix/ydb_lock_incr_s.c:91
  #20 runProc (settings=0x7ffee07fab80, curDepth=1) at simpleapi/inref/randomWalk.c:489
  #21 tpHelper (tpfnparm=0x7ffee07fa100) at simpleapi/inref/randomWalk.c:691
  #22 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256
  #23 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38
  #24 runProc (settings=0x7ffee07fab80, curDepth=0) at simpleapi/inref/randomWalk.c:666
  #25 runProc_driver (settings=0x7ffee07fab80) at simpleapi/inref/randomWalk.c:145
  #26 main () at simpleapi/inref/randomWalk.c:93

  (gdb) f 6
  #6  insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110
  110                                     assert((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit));

  (gdb) p process_exiting
  $3 = 1

  (gdb) p t_tries
  $4 = 3

  (gdb) p dollar_tlevel
  $5 = 1

  (gdb) p csa->now_crit
  $6 = 0

  (gdb) up
  #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120
  120                             TPNOTACID_CHECK(LOCKGCINTP);
  ```

Issue
-----
* The assert that failed in `insert_region()` (frame 6 in above stack trace) indicates that we were in the
  final retry (i.e. `t_tries` is equal to `3` or `CDB_STAGNATE`) but we did not hold crit on the current
  region where we are trying to do an `mlk_unlock()` operation.

* The assert is valid and did expose an issue.

* In frame 16, in `mlk_lock()`, we did a `rel_crit()` call in the `TPNOTACID_CHECK` macro while in the
  final retry.

  **sr_port/mlk_lock.c**
  ```c
    120                         TPNOTACID_CHECK(LOCKGCINTP);
  ```

* Below is the code inside the macro.

  **sr_port/tp.h**
  ```c
     979 #define TPNOTACID_CHECK(CALLER_STR)                                                                                             \
     980 {                                                                                                                               \
     981         GBLREF  boolean_t       mupip_jnl_recover;                                                                              \
     982         mval            zpos;                                                                                                   \
     983                                                                                                                                 \
     984         if (IS_TP_AND_FINAL_RETRY)                                                                                              \
     985         {                                                                                                                       \
  -> 986                 TP_REL_CRIT_ALL_REG;                                                                                            \
     987                 assert(!mupip_jnl_recover);                                                                                     \
     988                 TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK;                                                                         \
  ```

* Line 986 is where the issue is. We do a `rel_crit()` call there but `t_tries` is still not decremented.
  The decrement of `t_tries` happens 2 lines later at line 988.

* Before doing the `rel_crit()` call, we need to decrement `t_tries`. This way, in case `rel_crit()`
  decides to invoke exit handling due to handling a deferred SIGINT signal (sent in the `ydb464` subtest),
  the assert in `insert_region()` would not be confused by seeing this out-of-design state and will not
  attempt to invoke `t_retry()` etc. which is a no-no as we should not transfer control to M code as
  part of a TP restart while the process is about to terminate on receipt of a SIGINT signal.

Fix
---
* Notice that in `sr_port/t_commit_cleanup.c`, the `t_tries` decrement happens BEFORE the `rel_crit()`
  call.

  **sr_port/t_commit_cleanup.c**
  ```c
    288       if (CDB_STAGNATE <= t_tries)
    289               TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK; /* t_tries untouched for rollback and recover */
      .
      .
    303               if (!csa->hold_onto_crit && csa->now_crit)
    304                       rel_crit(tr->reg);      /* Undo Step (CMT01) */
  ```

* In a similar fashion, in the `TPNOTACID_CHECK` macro in `sr_port/tp.h`, the `TP_REL_CRIT_ALL_REG` call
  should happen AFTER the `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK` call. And that is the fix.

* While doing this fix, I noticed a similar ordering issue in `sr_port/gvcst_init.c` and so fixed that too.

Notes
-----
* While this failure happened with a Debug build of YottaDB, I suspect there is an issue in the Release
  build of YottaDB too. But not sure exactly what the user-visible implications are. Even if so, it is
  likely to be not encountered in practice and so no user-visible issue is created for this.
nars1 added a commit that referenced this issue Nov 15, 2023
…_port/deferred_events.c

Background
----------
* The `v61000/intrpt_wcs_wtstart` subtest (in the YDBTest project) failed a few rare occasions
  during internal testing with the following symptom.

  ```diff
  12a13,299
  > v61000_0_22/intrpt_wcs_wtstart/mumps-wb.out
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events.c line 114 for expression (no_event == outofband || (event_type == outofband))
  ```

Issue
-----
* The stack trace and relevant details from the gdb core analysis are pasted below.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcc56fd8c0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  #7  jobinterrupt_event (sig=10, info=0x7fb372b8a518 <stapi_signal_handler_oscontext+5528>, context=0x7fb372b8a598 <stapi_signal_handler_oscontext+5656>) at sr_port/jobinterrupt_event.c:61
  #8  <signal handler called>
  #9  clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #10 m_usleep (useconds=10000) at sr_unix/sleep.c:37
  #11 wcs_sleep (sleepfactor=6310) at sr_port/wcs_sleep.c:28
  #12 wcs_flu (options=519) at sr_unix/wcs_flu.c:571
  #13 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:632
  #14 gv_rundown () at sr_port/gv_rundown.c:122
  #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233
  #16 signal_exit_handler (exit_handler_name=0x7fb372a19ecf "generic_signal_handler", sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78
  #17 generic_signal_handler (sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:502
  #18 ydb_os_signal_handler (sig=15, info=0x7ffcc56ffd30, context=0x7ffcc56ffc00) at sr_unix/ydb_os_signal_handler.c:88
  #19 <signal handler called>
  #20 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6
  #21 m_usleep (useconds=999000) at sr_unix/sleep.c:37
  #22 wcs_wtstart (region=0xc30970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:216
  #23 wcs_timer_start (reg=0xc30970, io_ok=1) at sr_port/t_end_sysops.c:1346
  #24 t_end (hist1=0xcfe798, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1848
  #25 gvcst_put2 (val=0xc928b8, parms=0x7ffcc5709a80) at sr_port/gvcst_put.c:2796
  #26 gvcst_put (val=0xc928b8) at sr_port/gvcst_put.c:302
  #27 op_gvput (var=0xc928b8) at sr_port/op_gvput.c:79

  (gdb) f 6
  #6  xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114
  114                     assert(no_event == outofband || (event_type == outofband));

  (gdb) p (enum outofbands)no_event
  $2 = no_event

  (gdb) p (enum outofbands)outofband
  $1 = deferred_signal

  (gdb) p (enum outofbands)event_type
  $3 = jobinterrupt
  ```

* The test sends a SIGTERM (i.e. SIG-15) signal. This caused `outofband` variable to be set to
  `deferred_signal` in frame 17 above (`generic_signal_handler.c` inside the `SET_FORCED_EXIT_STATE` macro).

* And then the process was sleeping (due to a white-box test case in the test).

* At that point, it was holding crit and another process was waiting for this and so was about to send
  a `MUTEXLCKALERT` message. At this point, since the test framework had set the `gtm_procstuckexec` env
  var to `com/gtmprocstuck_get_stack_trace.csh`, that was invoked and it in turn invoked `^%YDBPROCSTUCKEXEC`
  which in turn sent a `SIGUSR1` signal (i.e. a `mupip intrpt`) to this very same process that was sleeping
  while holding crit.

* And at this point, the process got the assert failure because the `outofband` variable indicated that
  a `SIG-15` signal needs to be handled whereas the `event_type` variable indicated that the current
  out of band event is a `jobinterrupt` event.

Fix
---
* This seems like a valid scenario and I suspect the assert is invalid.

* I noticed that this very same assert has been removed in a later GT.M release V7.1-001.

  ```diff
  $ cd YDB
  $ git show tags/V7.1-001 sr_port/deferred_events.c | head -35 | tail -8
  @@ -127,7 +127,6 @@ boolean_t xfer_set_handlers(int4  event_type, int4 param_val, boolean_t popped_e
          }
          if (!already_ev_handling)
          {
  -               assert(no_event == outofband || (event_type == outofband));
                  assert(!dollar_zininterrupt || (jobinterrupt != event_type));
                  if (entry != (TREF(save_xfer_root_ptr))->ev_que.fl)
                  {       /* no event in play so pend this one by jiggeriing the xfer_table */
  ```

* I assume GT.M noticed a similar issue but not while releasing V7.0-001 (which is what YottaDB master
  currently has merged) but when releasing a much later V7.1-001 version and fixed it then.

* Therefore, I am removing the assert that failed.

* This should let the `v61000/intrpt_wcs_wtstart` test run fine until GT.M V7.1-001 gets merged into
  the YottaDB master branch.
nars1 added a commit that referenced this issue Nov 17, 2023
… damage

Background
----------
Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1041#description

* The `reorg_5/on_ntp_njnl_reorg` subtest (from the YDBTest project) failed with the following symptom
  in a rare in-house test run.

  ```
  %YDB-F-ASSERT, Assert failed in sr_port/bml_status_check.c line 72 for expression ((gds_t_acquired != cs->mode) || (BLK_BUSY != bml_status))
  ```

* The assert failure indicates that we found a block that was allocated in this transaction (in
  `bm_getfree.c`) but the local bitmap for that allocated block says the block is already marked
  `BUSY`. This is an out-of-design situation as only blocks marked `FREE` or `RECYCLED` in the local
  bitmap should be allocated.

* The cause of this failure was not ascertained when it first happened and so !1409 (d366c66) added
  an assert to try catch the failure early in `bm_getfree.c` (pre-commit) rather than much later in
  `bml_status_check.c` (in-commit).

* But even after !1409 got merged, we encountered another rare test failure with the same symptom. The
  failure still happened in `bml_status_check.c` and not in the expected `bm_getfree.c`.

* Therefore, the new failure was further analyzed and this time around the cause was finally identified. It
  turned out to be a `mupip reorg -truncate` issue that can result in assert failures in Debug builds
  and database damage in Release builds.

* But for this subtle timing issue to happen, a database file extension needs to happen concurrently
  among many things.

* Below is an example mupip integ report from the damaged database after the `mupip reorg -truncate`
  in a manual test that I did.

  ```
  Integ of region DEFAULT

  Block:Offset Level
  %YDB-E-DBINCLVL,         Nature: #DANGER***
               3FF:0      2  Block at incorrect level
                             Directory Path:  1:10, 2:23
                             Path:  202:104, 3E6:17E, 3FF:0
  Keys from ^y(493.1) to the end are suspect.
  %YDB-E-DBBDBALLOC,         Nature: #DANGER***
               3FF:0      B  Block doubly allocated
                             Directory Path:  1:10, 2:36
                             Path:  3FF:0
  Keys from ^z to the end are suspect.
  ```

Issue
-----
* Below is the stack trace from the latest failure.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=<optimized out>, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffef29217e0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  bml_status_check (cs=0x7fe6935dfa40 <cw_set>) at sr_port/bml_status_check.c:72
  #9  t_end (hist1=0x62d0000a2840, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1637
  #10 mu_swap_root (gl_ptr=0x62d0000a3fc0, root_swap_statistic_ptr=0x7ffef29256a0, upg_mv_block=0) at sr_unix/mu_swap_root.c:233
  #11 mupip_reorg () at sr_port/mupip_reorg.c:399
  #12 mupip_main (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178) at sr_unix/mupip_main.c:117
  #13 dlopen_libyottadb (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178, main_func=0x55c02b55a020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178) at sr_unix/mupip.c:22
  ```

* Notice `mu_swap_root()` in the stack. It was seen as a caller in all failures so far where the symptom
  was an assert failure in `bml_status_check.c`.

* This specific function (that is called by `mupip reorg -truncate`) turned out to be where the issue is
  and not the general purpose function `bm_getfree.c` (like was suspected in d366c66).

* This function does block allocation, just like `bm_getfree.c`. But there is a subtle difference between
  the two which is what led me to the issue.

  **sr_port/bm_getfree.c**
  ```c
    306            free_bit = bm_find_blk((int4)offset, (sm_uc_ptr_t)bmp + SIZEOF(blk_hdr), map_size, blk_used);
      .
    311    if (NO_FREE_SPACE != free_bit)
    312            break;
  ```

  **sr_unix/mu_swap_root.c**
  ```c
    341         master_bit = bmm_find_free((hint_blk_num / BLKS_PER_LMAP), csa->bmm, num_local_maps);
    342         if ((NO_FREE_SPACE == master_bit))
      .
    360     free_bit = bm_find_blk(hint_bit, bmlhist.buffaddr + SIZEOF(blk_hdr), maxbitsthismap, &free_blk_recycled);
    361     free_blk_id = bmlhist.blk_num + free_bit;
  ```

* `bm_getfree.c` invokes `bm_find_blk()` to find a free block in the local bitmap and stores the result in
  `free_bit`. It then checks whether that indicates no free space (`NO_FREE_SPACE`) and if so it moves
  on to another local bitmap.

* `mu_swap_root.c` does a call to `bm_find_blk()` as well to find a free block but does no `NO_FREE_SPACE`
  check.

* This is where the issue is.

* If `free_bit` turns out to be `NO_FREE_SPACE`, we will end up setting `free_blk_id` (line 361 above)
  as 1 block BEFORE the current local bitmap block. This would mean we would consider a block in the
  previous bitmap as the FREE block and use it for the swap operation. If that block happens to be
  already used in some other global variable tree, then we would end up with integrity errors once the
  incorrect swap operation completes.

* In order to end up in this situation, what is needed is that we find the local bitmap has free space
  in the master bitmap (line 341 above) but when we look inside the local bitmap block we find that
  it has no free space.

* I initially tried to see if I can concurrently have a process do a transaction that allocates a block
  and ends up marking this local bitmap as full in the master map (in `bm_update()`). But in that case,
  I noticed that the `mu_swap_root()` function ended up restarting the transaction because it noticed
  the local bitmap block contents changed concurrently (restart code `cdb_sc_bmlmod`).

* After some trial and error, finally landed on the needed last link in this puzzle. And that is a
  database file extension which concurrently happens.

* It is possible the last local bitmap in a database file could have its status marked as `Full` even
  though it is only a partial bitmap (because the total block count stops midway in the local bitmap).

* When a database file extension happens on this partial last bitmap, it would mark this bitmap from
  `Full` status to having `Free space` (line 561 below).

  **sr_unix/gdsfilext.c**
  ```c
    556         cs_addrs->ti->free_blocks += blocks;
    557         cs_addrs->total_blks = cs_addrs->ti->total_blks = new_total;
    558         blocks = old_total;
    559         if (blocks / bplmap * bplmap != blocks)
    560         {
    561                 bit_set(blocks / bplmap, MM_ADDR(cs_data)); /* Mark old last local map as having space */
  ```

  **sr_unix/mu_swap_root.c**
  ```c
    339         total_blks = csa->ti->total_blks;
      .
    341         master_bit = bmm_find_free((hint_blk_num / BLKS_PER_LMAP), csa->bmm, num_local_maps);
    342         if ((NO_FREE_SPACE == master_bit))
      .
    360     free_bit = bm_find_blk(hint_bit, bmlhist.buffaddr + SIZEOF(blk_hdr), maxbitsthismap, &free_blk_recycled);
    361     free_blk_id = bmlhist.blk_num + free_bit;
  ```

* Given that, the timing of the `mupip reorg -truncate` (in `mu_swap_root.c`) and `mupip extend`
  (in `gdsfilext.c`) processes has to such that line 339 above in `mu_swap_root.c` takes a note of
  `total_blks` BEFORE line 556 is reached by `gdsfilext.c` but line 341 in `mu_swap_root.c` should
  happen AFTER line 561 is done by `gdsfilext.c`.

  When this occurs, `mu_swap_root.c` will incorrectly proceed at line 360 with a `free_bit` value of `-1`
  which would then cause a block from the previous bitmap to be allocated and the commit validation
  logic would not restart the transaction in this situation ending up in database damage.

Fix
---
* The fix is simple and is to check the return value of `bm_find_blk()` (stored in `free_bit` variable)
  for `NO_FREE_SPACE` and if so restart the transaction.
nars1 added a commit that referenced this issue Nov 17, 2023
…ess.c to fix test failure

Background
----------
* In internal testing, we noticed a few rare test failures where the source server failed an assert.

  ```diff
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtmsource_process.c line 1653 for expression (!renegotiation_pending)
  ```

Issue
-----
* Pasted below are relevant details from the core file analysis.

  ```c
  (gdb) where
  #0  pthread_kill () from /usr/lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fff37d618e0) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  gtmsource_process () at sr_unix/gtmsource_process.c:1653
  #7  gtmsource () at sr_unix/gtmsource.c:618
  #8  mupip_main (argc=11, argv=0x7fff37d68508, envp=0x7fff37d68568) at sr_unix/mupip_main.c:117
  #9  dlopen_libyottadb (argc=11, argv=0x7fff37d68508, envp=0x7fff37d68568, main_func=0x5270c0 <.str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #10 main (argc=11, argv=0x7fff37d68508, envp=0x7fff37d68568) at sr_unix/mupip.c:22
  (gdb) f 6
  #6  gtmsource_process () at sr_unix/gtmsource_process.c:1653
  1653                            GTMTLS_ONLY(assert(!renegotiation_pending));
  (gdb) p repl_tls
  $1 = {id = "INSTANCE1", '\000' <repeats 22 times>, plaintext_fallback = 0, enabled = 1, notls_retry = 0, renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT, sock = 0x60700000a3c0}
  (gdb) down
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  99              return rts_error_va(csa, argcnt, var);
  (gdb) up
  #6  gtmsource_process () at sr_unix/gtmsource_process.c:1653
  1653                            GTMTLS_ONLY(assert(!renegotiation_pending));
  (gdb) p hrtbt_cnt
  $2 = 8
  (gdb) p renegotiate_factor
  $3 = 4
  (gdb) p gtmsource_options.renegotiate_interval
  $4 = 60
  (gdb) p gtmsource_local->connect_parms[GTMSOURCE_CONN_HEARTBEAT_PERIOD]
  $5 = 15
  (gdb) p gtmsource_state
  $6 = GTMSOURCE_SENDING_JNLRECS
  (gdb) p poll_time
  $7 = 999
  (gdb) p repl_tls.renegotiate_state
  $1 = REPLTLS_WAITING_FOR_RENEG_TIMEOUT
  (gdb) p next_renegotiate_hrtbt
  $1 = 1
  ```

* This turned out to be an interesting failure. The underlying issue is that a TLS renegotiation was
  pending for more than a minute (due to the receiver not replying to the REPL_RENEG_ACK_ME message
  from the source server for more than a minute) so the next TLS renegotiation became pending since
  the renegotiate_interval was 1 minute. And that scenario was not handled correctly resulting in the
  assert failure.

Fix
---
* This turns out to be an issue that is fixed in GT.M V7.0-002.

  http://tinco.pair.com/bhaskar/gtm/doc/articles/GTM_V7.0-002_Release_Notes.html#GTM-DE197637428

  _The Source Server defers a TLS renegotiation when a prior TLS renegotiation is pending. Previously,
  the Source Server logged the REPLWARN message, closed the replication connection, and attempted to
  re-establish the replication connection with the Receiver Server.(GTM-F135415)_

* I noticed the below code changes in GT.M V7.0-002 implements the above fix.

  ```diff
  $ cd YDB
  $ git show -U0 tags/V7.0-002 sr_unix/gtmsource_process.c
  .
  .
  @@ -568,0 +569 @@ void gtmsource_recv_ctl(void)
  +               /* Check whether it is time for renegotiation */
  @@ -572,3 +573,16 @@ void gtmsource_recv_ctl(void)
  -                      next_renegotiate_hrtbt = TRUE;
  -                      repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT;
  -                      poll_time = REPL_POLL_WAIT;
  +                      switch(repl_tls.renegotiate_state)
  +                      {
  +                              case REPLTLS_RENEG_STATE_NONE:
  +                              case REPLTLS_WAITING_FOR_RENEG_TIMEOUT:
  +                                      next_renegotiate_hrtbt = TRUE;
  +                                      repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT;
  +                                      poll_time = REPL_POLL_WAIT;
  +                                      break;
  +                              case REPLTLS_WAITING_FOR_RENEG_ACK:
  +                              /* On slower systems, heartbeat responses may arrive late.
  +                                 In such a case, defer renegotiation */
  +                                      break;
  +                              default:
  +                                      assert(FALSE);
  +                                      break;
  +                      }
  @@ -582 +596 @@ void gtmsource_recv_ctl(void)
  -               poll_time = REPL_POLL_WAIT; /* because we are back to sending data */
  +               poll_time = REPL_POLL_NOWAIT; /* because we are back to sending data */
  @@ -593 +607 @@ void gtmsource_recv_ctl(void)
  -                                               MAX_SEQNO, INVALID_SUPPL_STRM);
  +                                       MAX_SEQNO, INVALID_SUPPL_STRM);
  ```

* The `case REPLTLS_WAITING_FOR_RENEG_ACK:` block above is what implements the real fix.

* Therefore, this commit cherry-picks the above GT.M change and merges it into the YottaDB code base.

* I expect the test failure to no longer happen after this change.
nars1 added a commit that referenced this issue Dec 8, 2023
… restart related)

Background
----------
* While trying to come up with the `v70001/gtm9131` subtest (in the YDBTest project), I encountered
  an assert failure. Running that subtest with the below changes reproduces the assert.

  ```diff
  $ git diff -U1 v70001/inref/gtm9131.m
  diff --git a/v70001/inref/gtm9131.m b/v70001/inref/gtm9131.m
  index 14094d4c..92641556 100644
  --- a/v70001/inref/gtm9131.m
  +++ b/v70001/inref/gtm9131.m
  @@ -23,3 +23,4 @@ gtm9131       ;
          set jobid=1
  -       zsystem:$trestart=0 "$gtm_dist/mumps -run job^gtm9131 "_njobs_" "_jobid
  +       set $zcmdline=njobs_" "_jobid
  +       do job
          tcommit                                ; Commit TP transaction where we expect a TP restart due to concurrent statsdb extension
  ```

* Below is what I saw in the subtest output (with the above change).

  ```
  $ cat gtm9131.log
  .
  .
  # Execute [mumps -run gtm9131] which will create a TPRESTART message due to a statsdb database file extension restart
  %YDB-F-ASSERT, Assert failed in sr_port/tp_restart.c line 288 for expression (IS_STATSDB_REG(restart_reg) ? !memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN) : memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN))
  ```

Issue
-----
* The core file in that case had the following stack trace and interesting variables.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140359692803712) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140359692803712) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140359692803712, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff9c3d9330) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  tp_restart (newlevel=1, handle_errors_internally=1) at sr_port/tp_restart.c:288
  #9  mdb_condition_handler (arg=150376098) at sr_port/mdb_condition_handler.c:345
  #10 rts_error_va (csa=0x0, argcnt=1, var=0x7fff9c3dc110) at sr_unix/rts_error.c:198
  #11 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99
  #12 op_tcommit () at sr_port/op_tcommit.c:514

  (gdb) f 8
  #8  0x00007fa805089766 in tp_restart (newlevel=1, handle_errors_internally=1) at sr_port/tp_restart.c:288
  288        assert(IS_STATSDB_REG(restart_reg)      /* global ^%YGS if, and only if, statsDB */

  (gdb) list
  286          if (NULL != restart_reg)
  287          {
  288                  assert(IS_STATSDB_REG(restart_reg)      /* global ^%YGS if, and only if, statsDB */
  289                         ? !memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN)
  290                         : memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN));
  291                  reg_mstr.len = restart_reg->dyn.addr->fname_len;
  292                  reg_mstr.addr = (char *)restart_reg->dyn.addr->fname;

  (gdb) x/s restart_reg->rname
  0x6220000023e2: "default"

  (gdb) x/s gv_currkey->base
  0x62d000005046: "%jobwait"
  ```

* The assert in lines 288-290 checks that if the region where the restart happened is a statsdb region,
  then the current `$reference` (i.e. `gv_currkey`) should point to `^%YGS`, the global name that maps
  to the statsdb region. And vice versa.

* But in this case, the global name is `^%jobwait` which is not `^%YGS`.

* This is a case where the statsdb region had a tp restart due to a statsdb file extension scenario
  and the restart status code was `cdb_sc_helpedout` (see first occurrence in sr_port/tp_tend.c).
  In this case, the restart did not occur as part of a global reference, but as part of TCOMMIT and so
  we are not guaranteed that the most recent global reference (`gv_currkey`) would point to a statsdb
  global name. Therefore, this assert is not necessarily correct.

* Since restarts are possible in various scenarios other than as part of the current global reference,
  this assert could fail in other scenarios too.

* This assert was introduced in GT.M V6.3-008.

Fix
---
* Not sure the assert serves much purpose so am removing this inaccurate assert in this commit.
nars1 added a commit that referenced this issue Dec 11, 2023
…re=1 and DBNAMEMISMATCH

Background
----------
* While testing with `ydb_statshare=1`, I noticed the following failure in the `v54003/dbnammismatch1`
  subtest when YottaDB was built with ASAN.

  ```sh
  $ cat dbnammismatch1.log
  .
  .
  mupip rundown -file backup.dat
  %YDB-I-DBNAMEMISMATCH, Database file backup.dat points to shared memory (id = 447152162) which in turn points to an inaccessible database file v54003_0/dbnammismatch1/mumps.dat
  =================================================================
  ==74407==ERROR: AddressSanitizer: heap-use-after-free on address 0x625000006b44 at pc 0x7fe941b32e8d bp 0x7fffea717770 sp 0x7fffea717768
  READ of size 4 at 0x625000006b44 thread T0
      #0 mu_rndwn_file sr_unix/mu_rndwn_file.c:1484:3
      #1 mupip_rundown sr_unix/mupip_rundown.c:215:8
      #2 mupip_main sr_unix/mupip_main.c:117:4
      #3 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #4 main sr_unix/mupip.c:22:9

  0x625000006b44 is located 6724 bytes inside of 8192-byte region [0x625000005100,0x625000007100)
  freed by thread T0 here:
      #1 system_free sr_port/gtm_malloc_src.h:1477:2
      #2 gtm_free_main sr_port/gtm_malloc_src.h:850:3
      #3 gtm_free sr_port/gtm_malloc_src.h:1493:2
      #4 mu_rndwn_file sr_unix/mu_rndwn_file.c:1457:4
      #5 mupip_rundown sr_unix/mupip_rundown.c:215:8
      #6 mupip_main sr_unix/mupip_main.c:117:4
      #7 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #8 main sr_unix/mupip.c:22:9

  previously allocated by thread T0 here:
      #1 system_malloc sr_port/gtm_malloc_src.h:1462:9
      #2 gtm_malloc_main sr_port/gtm_malloc_src.h:669:10
      #3 gtm_malloc sr_port/gtm_malloc_src.h:1488:9
      #4 mu_rndwn_file sr_unix/mu_rndwn_file.c:457:27
      #5 mupip_rundown sr_unix/mupip_rundown.c:215:8
      #6 mupip_main sr_unix/mupip_main.c:117:4
      #7 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #8 main sr_unix/mupip.c:22:9
      #9 __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  ```

Issue
-----
* The issue is in `sr_unix/mu_rndwn_file.c` where we `free(tsd)` and later make a call to the
  `UNLINK_STATSDB_AT_BASEDB_RUNDOWN` macro which is passed the parameter `csd` which is the same as
  the copy of `tsd`.

Fix
---
* The fix is simple and is to move the `free()` call to AFTER the macro call.
nars1 added a commit that referenced this issue Dec 12, 2023
…ded assert (in dadf82f)

Background
----------
* While trying out something else, I entered an invalid command inside the `DSE>` prompt and got
  the following error.

  ```
  DSE> cd
  =================================================================
  ==50375==ERROR: AddressSanitizer: global-buffer-overflow on address 0x7f52da6b8b30 at pc 0x7f52d80dc6b6 bp 0x7ffc654076b0 sp 0x7ffc654076a8
  READ of size 1 at 0x7f52da6b8b30 thread T0
      #0 parse_cmd sr_unix/cli_parse.c:825:2
      #1 dse_process sr_unix/dse_main.c:176:20
      #2 dse_main sr_unix/dse_main.c:159:8
      #3 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #4 main sr_unix/dse.c:22:9
      #5 __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
      #6 __libc_start_main csu/../csu/libc-start.c:392:3
      #7 _start (/usr/library/V999_R139/dbg/dse+0x1e344) (BuildId: 07e14e3ee26867c2d614846a8c70000a0d1000f9)
  ```

Issue
-----
* The issue is in 2 newly added asserts in dadf82f. They did not take into account that `cmd_ind`
  could be `-1` in case of an invalid command.

Fix
---
* The fix is simple and is to add an `||` condition to handle the case when `cmd_ind` is `-1`.
nars1 added a commit that referenced this issue Jan 19, 2024
…ABANDONED (fixes GTM-9400 for real)

Background
----------
The below is pasted from https://gitlab.com/YottaDB/DB/YDBTest/-/issues/550#note_1733171439

* While trying to test YDBTest#550, I noticed that the KILLABANDONED error happens even with V7.0-001
  whereas the GTM-9400 release note in GT.M V7.0-001 indicates this as being fixed in V7.0-001.

* Below is the test case (using `tcsh`, not `sh`) that stops after a few seconds with a `KILLABANDONED`
  error with V7.0-000 as well as with V7.0-001.

  ```sh
  cat > kill.csh << CAT_EOF
  while (1)
          k15 reorg
          if (-e STOP) then
                  break
          endif
  end
  CAT_EOF

  rm -f STOP
  unsetenv ydb_gbldir
  setenv gtmgbldir mumps.gld
  rm -f mumps.gld mumps.dat
  $gtm_dist/mumps -run GDE exit
  $gtm_dist/mupip create
  $gtm_dist/mumps -run %XCMD 'for i=1:1:100000 set ^x(i)=$j(i,200)'
  source kill.csh &
  while (1)
          foreach fillfactor (50 10 90)
                  $gtm_dist/mupip reorg -fill=$fillfactor -region DEFAULT
                  $gtm_dist/mupip integ -reg "*"
                  if ($status) then
                          touch STOP
                          break
                  endif
          end
          if (-e STOP) then
                  break
          endif
  end
  ```

Issue
-----
* I added an assert in `secshr_db_clnup.c` where it invoked the `INCR_ABANDONED_KILLS` macro and ran
  the above test to see the cause of the above `KILLABANDONED` error.

* With that change, I got an assert failure (in line 446 below) and the core file showed the below
  stack trace.

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140185231378240, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffcb98a5b50) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446
  #9  mupip_exit_handler () at sr_unix/mupip_exit_handler.c:124
  #10 signal_exit_handler (exit_handler_name=0x7f7f6ac987be "deferred_exit_handler", sig=15, info=0x7f7f6ae3ddf8 <stapi_signal_handler_oscontext+3320>, context=0x7f7f6ae3de78 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78
  #11 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #12 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74
  #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813
  #14 mu_reorg (gl_ptr=0x5617ac865bc0, exclude_glist_ptr=0x7ffcb98aa450, resume=0x7ffcb98aa344, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:572
  #15 mupip_reorg () at sr_port/mupip_reorg.c:334
  #16 mupip_main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip_main.c:117
  #17 dlopen_libyottadb (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948, main_func=0x5617ab37c004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #18 main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip.c:22

  (gdb) f 8
  #8  secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446
  446                assert(!mu_reorg_process);

  (gdb) list
  441             } else if (!dollar_tlevel)
  442             {
  443                     if ((NULL != kip_csa) && (csa == kip_csa))
  444                     {
  445                             /* Assert that MUPIP REORG never leaves the database with an abandoned kill */
  446                             assert(!mu_reorg_process);
  447                             assert(0 < kip_csa->hdr->kill_in_prog);
  448                             DECR_KIP(csd, csa, kip_csa);
  449                             INCR_ABANDONED_KILLS(csd, csa);
  450                     }

  (gdb) f 13
  #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813
  1813            REVERT; /* no need for t_ch to be invoked if any errors occur after this point */
  ```

* And below is the macro sequence how frame 13 `REVERT` ends up in frame 12 `deferred_signal_handler` call.

  ```
  REVERT -> ENABLE_INTERRUPTS -> DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED -> deferred_signal_handler
  ```

* GT.M V7.0-001 fixed GTM-9400 by adding a `DEFERRED_EXIT_REORG_CHECK` macro at logical points in
  the mupip reorg code flow where we are guaranteed the kill-in-progress condition has been cleared
  (i.e. DECR_KIP has been called).

  ```diff
  $ git show -U1 tags/V7.0-001 sr_port/mu_reorg.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK
  @@ -449,2 +449,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                                          DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                                                       DEFERRED_EXIT_REORG_CHECK;
                                                          if (detailed_log)
  --
  @@ -579,2 +580,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                                  DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                                               DEFERRED_EXIT_REORG_CHECK;
                                                  if (detailed_log)
  @@ -677,2 +679,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume,
                                  DECR_KIP(cs_data, cs_addrs, kip_csa);
  +                               DEFERRED_EXIT_REORG_CHECK;
                                  if (detailed_log)

  $ git show -U1 tags/V7.0-001 sr_unix/mu_swap_root.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK
  @@ -271,2 +248,3 @@ void        mu_swap_root(glist *gl_ptr, int *root_swap_statistic_ptr)
          }
  +       DEFERRED_EXIT_REORG_CHECK;      /* a single directory tree has to be quick, so check at end, rather than each DECR_KIP  */
          return;
  ```

* But what it did not realize is that even before those logical points are reached, it is possible
  for `t_end.c` to invoke the `REVERT` macro which in turn would invoke `deferred_signal_handler` like
  is seen in the above stack trace.

* Not sure how this did not get caught during the GT.M testing.

Fix
---
* In any case, the fix is simple and is to enhance `sr_port/deferred_signal_handler.c` to not invoke
  `deferred_exit_handler()` but instead `return` in case we are a `mupip reorg` process (indicated by
  the boolean_t typed `mu_reorg_process` global variable being TRUE) and we are in the middle of a
  kill-in-progress (indicated by `cs_data->kill_in_prog` being TRUE).

* This way, we delay the deferred signal handling of the `MUPIP STOP` (aka `SIG-15`/SIGTERM) a little
  more until the logical point in `sr_port/mu_reorg.c` or `sr_unix/mu_swap_root.c` is reached where
  the `DEFERRED_EXIT_REORG_CHECK` macro is invoked.
nars1 added a commit that referenced this issue Jan 24, 2024
Background
----------
* The `v60001/concbkup` subtest in the YDBTest project tests that a `BKUPRUNNING` message is issued
  by a concurrent mupip backup.

* This test starts 1000 `mupip backup` commands in the background and 1000 `mupip backup` commands in
  the foreground.

* The foreground backup is started with the `gtm_white_box_test_case_number` env var set to
  `WBTEST_CONCBKUP_RUNNING` (i.e. `83`). This causes it to sleep for 1 second thereby increasing
  the likelihood of the background backup finding the foreground backup running and therefore issuing
  the `BKUPRUNNING` message.

* But once in a while, this test failed with the following diff.

  ```diff
  concbkup took 00:00:25:00 from 2024.01.22.16.51.06.EST to 2024.01.22.17.16.06.EST : 31428 FAIL
  --- concbkup/concbkup.diff ---
  6d5
  < %YDB-E-BKUPRUNNING, Process ##PID## is currently backing up region DEFAULT. Cannot start another backup.
  ```

Issue
-----
* After an elaborate analysis, I found that the background backup process (which does not run with the
  white-box env var set) spends the time waiting for one of the below 2 locks whenever the test fails.
  - db crit
  - shmpool hdr lock

* Below were the corresponding stack traces.
  - **db crit**
    ```c
    #2  __pthread_mutex_clocklock_common (mutex=0x7f9059a4d008, clockid=0, abstime=0x7ffe0befbe70) at ./nptl/pthread_mutex_timedlock.c:253
    #3  gtm_mutex_lock (reg=0x555882b94170, mutex_spin_parms=0x7f9059fdf054, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE, state=WS_10) at sr_unix/mutex.c:777
    #4  grab_crit (reg=0x555882b94170, state=WS_10) at sr_unix/grab_crit.c:121
    #5  mupip_backup () at sr_port/mupip_backup.c:960
    ```
  - **shmpool hdr lock**
    ```c
    #1  m_usleep (useconds=1000) at sr_unix/sleep.c:37
    #2  wcs_sleep (sleepfactor=1) at sr_port/wcs_sleep.c:28
    #3  shmpool_lock_hdr (reg=0x564318e17170) at sr_port/shmpool.c:348
    #4  mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:124
    #5  mupip_backup () at sr_port/mupip_backup.c:1567
    ```

* That is when I realized that the placement of the `LONG_SLEEP` macro invocation (to sleep 1 second in
  the `WBTEST_CONCBKUP_RUNNING` white-box case) is incorrect.

* It is placed in a section of the code where we hold the `db crit` and the `shmpool hdr` lock. This greatly
  increases the chance of the background backup process needing the lock and in turn sleeping instead of
  issuing the BKUPRUNNING error like the test wants.

Fix
---
* The white-box code is now moved from the current location (in `sr_port/mupip_backup.c`) to a few lines
  later where we hold neither of the 2 locks. This should let the sleep continue to happen in the foreground
  backup process and at the same time increases the chances of the background backup process not needing
  either of the 2 locks and therefore issuing the BKUPRUNNING error.

Notes
-----
* Without the fix, running 6 parallel set of 10 tests of the `v60001/concbkup` subtest on all the in-house
  systems resulted in around a dozen failures.

* With the fix, there were NO failures. And the test finished a lot faster too implying the BKUPRUNNING
  error was encountered a lot earlier and needed only a small fraction of the maximum of 1000
  foreground/background backup process invocations.
nars1 added a commit that referenced this issue Feb 19, 2024
…GTM-9400 in V7.0-001)

Background
----------
* The `v70001/gtm9400` subtest failed in one rare in-house run with the following symptom.

  ```diff
  > ##TEST_PATH##/v70001_0_4/gtm9400/reorg_1_90.out
  > %YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 40 for expression (INTRPT_OK_TO_INTERRUPT == intrpt_ok_state)
  ```

* Below are the relevant details from the core file created by the `mupip reorg` process.

  ```c
  (gdb) where
  #0  pthread_kill () from /lib64/libpthread.so.0
  #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #3  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #4  rts_error_va (csa=0x0, argcnt=7, var=0x7ffe32ae6b80) at sr_unix/rts_error.c:198
  #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:40
  #7  t_end (hist1=0x7ffe32ae9d10, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:655
  #8  gvcst_bmp_mark_free (ks=0x7ffe32aeaa10) at sr_port/gvcst_bmp_mark_free.c:214
  #9  mu_reorg (gl_ptr=0x14562c0, exclude_glist_ptr=0x7ffe32aeb368, resume=0x7ffe32aeb3bc, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:452
  #10 mupip_reorg () at sr_port/mupip_reorg.c:334
  #11 mupip_main (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878) at sr_unix/mupip_main.c:117
  #12 dlopen_libyottadb (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878, main_func=0x4015f4 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #13 main (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878) at sr_unix/mupip.c:22

  (gdb) f 6
  #6  deferred_signal_handler () at sr_port/deferred_signal_handler.c:40
  40              assert(INTRPT_OK_TO_INTERRUPT == intrpt_ok_state);      /* DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED ensures this */
  (gdb) p forced_exit
  $2 = 1
  (gdb) p mu_reorg_process
  $3 = 1
  (gdb) p cs_data->kill_in_prog
  $4 = 1
  (gdb) p intrpt_ok_state
  $5 = INTRPT_IN_KILL_CLEANUP
  ```

Issue
-----
* The reorg process that assert failed was sent a `mupip stop` by the `gtm9400` subtest.

* From the failure symptoms noted above, what happened is that the `mupip stop` got delivered AFTER
  line 321 below (when `intrpt_ok_state` was `INTRPT_OK_TO_INTERRUPT`) but before line 322.

  **sr_port/have_crit.h**
  ```c
      314 /* Restore deferrable interrupts back to the state it was at time of corresponding DEFER_INTERRUPTS call */
      315 #define ENABLE_INTERRUPTS(OLDSTATE, NEWSTATE)                                                   \
      316 {                                                                                               \
      317         if (!multi_thread_in_use)                                                               \
      318         {                                                                                       \
      319                 assert(OLDSTATE == intrpt_ok_state);                                            \
      320                 intrpt_ok_state = NEWSTATE;                                                     \
      321                 if (INTRPT_OK_TO_INTERRUPT == intrpt_ok_state)                                  \
      322                         DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED;                                 \
      323                                 /* check if signals were deferred in deferred zone */           \
      324         }                                                                                       \
      325 }
  ```

* And as part of processing the `mupip stop`, the below code (newly introduced as part of incorporating
  GTM-9400 in GT.M V7.0-001 in commit 5ae98ef) set `intrpt_ok_state` to `INTRPT_IN_KILL_CLEANUP`.

  **sr_unix/generic_signal_handler.c**
  ```c
      320                                 /* If nothing pending AND we have crit or in wcs_wtstart() or already in exit processing, wait to
      321                                  * invoke shutdown. wcs_wtstart() manipulates the active queue that a concurrent process in crit
      322                                  * in bt_put() might be waiting for. interrupting it can cause deadlocks (see C9C11-002178).
      323                                  */
      324                                 if (mu_reorg_process && OK_TO_INTERRUPT && cs_data && cs_data->kill_in_prog)
      325                                         DEFER_INTERRUPTS(INTRPT_IN_KILL_CLEANUP, prev_intrpt_state);    /* avoid ABANDONEDKILL */
  ```

* Therefore, when the `DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED` macro (in line 322 of the `ENABLE_INTERRUPTS`
  macro was invoked, `intrpt_ok_state` had a value that was not `INTRPT_OK_TO_INTERRUPT` and failed
  the assert at line 40 below.

  **sr_port/deferred_signal_handler.c**
  ```c
       40         assert(INTRPT_OK_TO_INTERRUPT == intrpt_ok_state);      /* DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED ensures this */
  ```

Fix
---
* Commit 57006c9 fixed `GTM-9400` for real (see commit message there for detail).

* All that is needed is for the failing assert to account for this new possibility. And that is done
  by adding a `||` condition in this commit.
nars1 added a commit that referenced this issue Feb 29, 2024
…ontrol.c due to memcmp() usage

Background
----------
* The `v70001/gtm8843_31` subtest failed with the following symptom when run with a YottaDB build that
  had ASAN enabled.

  ```
  command: write /block("attack-decay-sustain-release")
  =================================================================
  ==8725==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffc4433d3c6 at pc 0x55bc80d298a6 bp 0x7ffc4433cde0 sp 0x7ffc4433c590
  READ of size 28 at 0x7ffc4433d3c6 thread T0
      #0 0x55bc80d298a5 in MemcmpInterceptorCommon(void*, int (*)(void const*, void const*, unsigned long), void const*, void const*, unsigned long) (dbg/yottadb+0x388a5)
      #1 0x55bc80d29d99 in memcmp (dbg/yottadb+0x38d99)
      #2 0x7f382b92c262 in iosocket_block_iocontrol sr_port/iosocket_iocontrol.c:502:12
      #3 0x7f382b9295d5 in iosocket_iocontrol sr_port/iosocket_iocontrol.c:268:3
      #4 0x7f382bbcb01c in op_iocontrol sr_port/op_iocontrol.c:54:3
  ```

Issue
-----
* The issue is very similar to that described in the commit message of 03ef86c.

* The new usage of `MEMCMP` happens in `sr_port/iosocket_iocontrol.c` line 502 (frame 2 in above listing)
  and is newly introduced as part of merging GT.M V7.0-001 code changes.

Fix
---
* The fix is very similar to that in 03ef86c and is to ensure `option_buf` is a null-terminated string
  and use `strcmp()` instead. According to coding convention, `STRCMP()` is used instead of `strcmp()`.
nars1 added a commit that referenced this issue Mar 26, 2024
* This commit addresses merge conflicts involving deleted files.

* The list of deleted files was identified using the following commands as part of the prior
  commit (i.e. after the `git cherry-pick` command was run but before the `git commit` command
  was run).

  ```sh
  $ git status | grep 'deleted'
        deleted by us:   README
        deleted by us:   sr_i386/cmerrors_ctl.c
        deleted by us:   sr_i386/cmierrors_ctl.c
        deleted by us:   sr_i386/gdeerrors_ctl.c
        deleted by them: sr_port/outofband_action.c
        deleted by them: sr_port/start_fetches.c
        deleted by us:   sr_unix/gtm_logicals.h
        deleted by us:   sr_unix/gtm_tls_impl.c
        deleted by us:   sr_x86_64/merrors_ansi.h
        deleted by us:   sr_x86_64/merrors_ctl.c
  ```

* All files that show up as `deleted by us:` are deleted in this commit since those were deleted
  even before in the YDB git repository. That is a straightforward change. But each of these files
  needed to be reviewed to see if the GT.M changes to each file needs to be incorporated somewhere
  else in the YDB git repository. They are described below.

  - README : This file only has cosmetic changes in every GT.M release. No need to incorporate this
    into the README.md file (corresponding file in the YDB repository).

  - The following are all generated files and will be regenerated on the YDB side in a later commit
    so no need to incorporate any changes. The `sr_i386` files won't be regenerated as YottaDB does
    not support that architecture. Only the `sr_x86_64` files will be regenerated.
    - sr_i386/cmerrors_ctl.c
    - sr_i386/cmierrors_ctl.c
    - sr_i386/gdeerrors_ctl.c
    - sr_x86_64/merrors_ansi.h
    - sr_x86_64/merrors_ctl.c

  - sr_unix/gtm_tls_impl.c : This file had been moved to YDBEncrypt/gtm_tls_impl.c (a different git
    repository) so I had to use the following commands to incorporate the changes.

    The following command (run in the YDB git repo) creates a patch file containing the GT.M V7.0-002 changes.

    ```sh
    cd YDB
    git show tags/V7.0-002 -- sr_unix/gtm_tls_impl.c > /tmp/patch
    ```

    The following command (run in the YDBEncrypt git repo) applies the patch to the YDBEncrypt repo file.

    ```sh
    cd YDBEncrypt
    patch -i /tmp/patch gtm_tls_impl.c
    ```

    The patch command produced the following output.

    ```
    Hunk #1 succeeded at 1 with fuzz 1.
    Hunk #2 FAILED at 849.
    1 out of 2 hunks FAILED -- saving rejects to file gtm_tls_impl.c.rej
    ```

    The rejected Hunk 2 corresponded to the following V7.0-002 diff.

    ```diff
    @@ -849,14 +849,12 @@ int gtm_tls_store_passwd(gtm_tls_ctx_t *tls_ctx, const char *tlsid, const char *
     	if (!(GTMTLS_OP_INTERACTIVE_MODE & tls_ctx->flags))
     		return 0;	/* Not running in an interactive mode. */
     	assert(NULL != tlsid);
    -	env_name_len = strlen(tlsid);
    -	assert(PASSPHRASE_ENVNAME_MAX > (sizeof(GTMTLS_PASSWD_ENV_PREFIX) + env_name_len)); /* Includes null */
    +	env_name_len = strnlen(tlsid, (PASSPHRASE_ENVNAME_MAX - sizeof(GTMTLS_PASSWD_ENV_PREFIX))); /* Includes null */
     	env_name_idx = (sizeof(GTMTLS_PASSWD_ENV_PREFIX) - 1);
     	memcpy(env_name, GTMTLS_PASSWD_ENV_PREFIX, env_name_idx);
    -	assert((sizeof(env_name) - env_name_idx) > (env_name_len + 1));
    -	assert(PASSPHRASE_ENVNAME_MAX > env_name_len + env_name_idx);
     	memcpy(&env_name[env_name_idx], tlsid, env_name_len);
     	env_name_idx += env_name_len;
    +	assert(PASSPHRASE_ENVNAME_MAX > env_name_idx);
     	env_name[env_name_idx] = '\0';
     	obs_len = strlen(obs_passwd);
     	pwent_node = gtm_tls_find_pwent(env_name);
    ```

    In YDBEncrypt, the above logic had been reworked a while ago to not require `env_name[]` array at all.
    Therefore, the `assert(PASSPHRASE_ENVNAME_MAX > ...)` assert that was removed in V7.0-001 was correspondingly
    removed from YDB code in 2 places.
    - gtm_tls_find_pwent()
    - gtm_tls_store_passwd()

    The rest of the V7.0-002 GT.M changes did not make sense so were not retrofitted into the YDBEncrypt side.

    And the following files were then removed from YDBEncrypt
    - gtm_tls_impl.c.rej
    - gtm_tls_impl.c.orig

    And the following file was removed from YDB now that it served its purpose
    - sr_unix/gtm_tls_impl.c

  - The GT.M V7.0-002 changes to `sr_unix/gtm_logicals.h` were to change the `gtm_max_storalloc` env var
    name to a `gtm_malloc_limit` env var name. And the macro `GTM_MAX_STORALLOC` was renamed to
    `GTM_MALLOC_LIMIT`.

    The changes to `sr_unix/gtm_logicals.h` were retrofitted to the equivalent file in the YottaDB code base
    `sr_port/ydb_logicals_tab.h` and the copy of this file in the YDBEncrypt repo. The change to the YDB
    version is made in this commit. The YDBEncrypt change is made in that repo.

    The corresponding YottaDB change was to make the following changes
    - YDBENVINDX_MAX_STORALLOC -> YDBENVINDX_MALLOC_LIMIT
    - ydb_max_storalloc -> ydb_malloc_limit
    - gtm_max_storalloc -> gtm_malloc_limit

    The changes were needed not just in `sr_port/ydb_logicals_tab.h` but also in `sr_port/gtm_env_init.c`.

    And the following file was removed from YDB now that it served its purpose
    - sr_unix/gtm_logicals.h

* All files that show up as `deleted by them:` are deleted in this commit since those were deleted
  in the GT.M repository. That is a straightforward change as we just need to delete them from the YottaDB side
  too.  Below is the list of those files.
  - sr_port/outofband_action.c
  - sr_port/start_fetches.c
nars1 added a commit that referenced this issue Mar 26, 2024
…item.c (regression in GT.M V7.0-002)

Background
----------
* After merging GT.M V7.0-002, I noticed various tests failed with the following symptom when
  run with a YottaDB that was built with ASAN.

  ```diff
  > ==48680==ERROR: AddressSanitizer: global-buffer-overflow on address 0x7ffa2d551598 at pc 0x7ffa2c15671f bp 0x7ffcddfd11c0 sp 0x7ffcddfd11b0
  > READ of size 4 at 0x7ffa2d551598 thread T0
  >     #0 0x7ffa2c15671e in expritem sr_port/expritem.c:639
  >     #1 0x7ffa2ceb8ca1 in expratom sr_port/expratom.c:29
  >     #2 0x7ffa2ceb11e3 in eval_expr sr_port/eval_expr.c:68
  >     #3 0x7ffa2ceb7b66 in expr sr_port/expr.c:31
  >     #4 0x7ffa2cfe878b in m_write sr_port/m_write.c:73
  >     #5 0x7ffa2d37d59c in cmd sr_port/cmd.c:312
  >     #6 0x7ffa2cfc243f in linetail sr_port/linetail.c:35
  >     #7 0x7ffa2c4ea23c in op_commarg sr_port/op_commarg.c:91
  ```

Issue
-----
* GT.M V7.0-002 added an `if` block to `sr_port/expritem.c` that checked if the opcode was `$zchar`
  or `$char`. And did some special handling for `NUMOFLOW`.

* But this logic assumed `index` pointed to a valid intrinsic function.

* It is possible the user specified an invalid intrinsic function like is done in various tests in
  the test system.

  For example `write $zysufff(...)` is done in `YDBTest/r132/u_inref/ydb391.csh` where `$zysufff`
  is an invalid function.

* In such cases, `index` would have the value `-1`. And so using this as an index into an array
  is not valid.

* While this might work without issues in practice, ASAN is picky and issues an error.

Fix
---
* The fix is simple and is to check for valid `index` (i.e. `0 <= index`) and only if so do the
  `if` check that GT.M V7.0-002 introduced.
nars1 added a commit that referenced this issue Mar 26, 2024
…ofband_clear.c)

Background
----------
* After GT.M V7.0-002 changes were merged, the `r130/ydb560` subtest started failing with the
  following symptom.

  ```
  %YDB-F-ASSERT, Assert failed in sr_port/outofband_clear.c line 43 for expression (TRUE == status)
  ```

* A simple way to reproduce this issue is to run the following and in a parallel terminal send
  a `kill -4` to the `mumps` process (that is stuck in the `hang` command).

  ```sh
  $ cat test.m
   set x=1
   hang 100

  $ mumps -run test
  ```

* Before V7.0-002 merge, one would see just 1 core file (due to the `kill -4`). But after the
  merge, one would see 3 core files. And the 2nd core file had the following stack trace.

  ```
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140112165532736, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7ffd024690b0) at sr_unix/rts_error.c:198
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  #9  outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:58
  #10 async_action (lnfetch_or_start=false) at sr_port/deferred_events.c:394
  #11 lvzwr_var (lv=0x60f0000005f0, n=0) at sr_port/lvzwr_var.c:184
  #12 lvzwr_fini (out=0x7ffd02471dc0, t=1) at sr_port/lvzwr_fini.c:84
  #13 op_lvpatwrite (count=0, arg1=140724641668224) at sr_port/op_lvpatwrite.c:85
  #14 zshow_zwrite (output=0x7ffd02471dc0) at sr_port/zshow_zwrite.c:40
  #15 op_zshow (func=0x7ffd0247a0e0, type=1, lvn=0x0) at sr_port/op_zshow.c:166
  #16 jobexam_dump (dump_filename_arg=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fatal_file_name_buff=0x7ffd0247ae20 "/extra4/testarea1/nars/V998/tst_V998_R201_dbg_28_240320_111309/r130_0/ydb560/YDB_FATAL_ERROR.ZSHOW_DMP_89246_1.txt", fmt=0x0, dev_in_use=0x7ffd0247a240) at sr_port/jobexam_process.c:238
  #17 jobexam_process (dump_file_name=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fmt=0x0) at sr_port/jobexam_process.c:147
  #18 create_fatal_error_zshow_dmp (signal=4) at sr_port/create_fatal_error_zshow_dmp.c:66
  #19 signal_exit_handler (exit_handler_name=0x7f6e64c43140 "deferred_exit_handler", sig=4, info=0x7f6e6519f938 <stapi_signal_handler_oscontext+3320>, context=0x7f6e6519f9b8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:59
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48
  #23 async_action (lnfetch_or_start=true) at sr_port/deferred_events.c:380
  #24 l1 () at sr_x86_64/op_startintrrpt.s:40

  (gdb) f 8
  #8  outofband_clear () at sr_port/outofband_clear.c:43
  43                      assert(TRUE == status);

  (gdb) list
  41              {
  42                      status = xfer_reset_if_setter(outofband);
  43                      assert(TRUE == status);
  44              }
  45      }

  (gdb) p outofband
  $1 = 11

  (gdb) p (enum outofbands)outofband
  $2 = deferred_signal
  ```

Issue
-----
* The issue was that `xfer_reset_if_setter()` had been reworked in GT.M V7.0-002. And that caused the
  handling of the `deferred_signal` type of outofband (which is a YottaDB-only value, unknown to the
  GT.M code base) not be handled correctly.

* The reason why `xfer_reset_if_setter()` returned FALSE in line 42 above is that the `event_state`
  for `deferred_signal` event_type at line 249 below was `pending`. Not `active` and so the call to
  line 250 got skipped. That would have done the real reset that was needed.

  **sr_port/deferred_events.c**
  ```c
    212 boolean_t xfer_reset_if_setter(int4 event_type)
      .
    249     if (res = (active == TAREF1(save_xfer_root, event_type).event_state))   /* WARNING: assignment */
    250             res = (real_xfer_reset(event_type));
  ```

Fix
---
* The fix was to set the event_state for `deferred_signal` outofband to `active` in `deferred_signal_set()`
  just like it is done for `jobinterrupt` outofband in `jobinterrupt_set()`.

* After this change though, an assert in line 370 below (in the `async_action()` function) failed.

  **sr_port/deferred_events.c**
  ```c
    350 void async_action(bool lnfetch_or_start)
      .
    358         if (jobinterrupt == outofband)
    359         {
      .
    367                 TAREF1(save_xfer_root, jobinterrupt).event_state = pending;     /* jobinterrupt gets a pass from the assert below */
    368         } else if (!lnfetch_or_start)
    369         {       /* something other than a new line caugth this, so  */
    370                 assert(pending >= TAREF1(save_xfer_root, outofband).event_state);
    371                 TAREF1(save_xfer_root, outofband).event_state = pending;        /* make it pending in case it was not there yet */
    372         }
  ```

  I noticed that `jobinterrupt` gets special handling in line 367. So decided to have special handling
  for `deferred_signal` as well. But the special handling is different here in that we do not modify
  the `event_state` (like is done for `jobinterrupt` in line 367 above) for the `deferred_signal` case.
  Just that we skip lines 370-371.

* With the changes in the above 2 bullets, the simple test case shown above started working fine in that
  it only generated 1 core file (not 3 core files).
nars1 added a commit that referenced this issue Apr 8, 2024
* This commit addresses merge conflicts involving deleted files.

* The list of deleted files was identified using the following commands as part of the prior
  commit (i.e. after the `git cherry-pick` command was run but before the `git commit` command
  was run).

  ```sh
  $ git status | grep 'deleted'
        deleted by us:   README
        deleted by us:   sr_unix/Makefile.mk
        deleted by us:   sr_unix/gtm_logicals.h
        deleted by us:   sr_unix/gtm_tls_impl.c
  ```

* All files that show up as `deleted by us:` are deleted in this commit since those were deleted
  even before in the YDB git repository. That is a straightforward change. But each of these files
  needed to be reviewed to see if the GT.M changes to each file needs to be incorporated somewhere
  else in the YDB git repository. They are described below.

  - README : This file only has cosmetic changes in every GT.M release. No need to incorporate this
    into the README.md file (corresponding file in the YDB repository).

  - sr_unix/Makefile.mk : The GT.M changes to this file have been incorporated into `YDBEncrypt/Makefile`
    which is where the YottaDB version of this file lies. There was a failure in `Hunk #2` (out of a
    total of 5 Hunks). This failure was manually resolved by picking the GT.M change and then
    incorporating the pre-existing YottaDB change of `gtm_tls_interface.h` -> `ydb_tls_interface.h`.

  - sr_unix/gtm_logicals.h : GT.M added a `GTM_DB_CREATE_VER` macro which in turn held the value
    `$gtm_db_create_ver`. This was incorporated into YottaDB side by adding a new env var line in
    `sr_port/ydb_logicals_tab.h` with the index name `YDBENVINDX_DB_CREATE_VER`. The env var names
    in the YottaDB and GT.M side for this are respectively `ydb_db_create_ver` and `gtm_db_create_ver`.

    In `sr_unix/gtm_env_init_sp.c`, replaced logic that was introduced in the GT.M side which used the
    `GTM_DB_CREATE_VER` macro to instead use `ydb_trans_log_name()` with `YDBENVINDX_DB_CREATE_VER` and
    allow for the values `6` or `V6` to imply a `V6` format database. I did think about allowing a value
    of `1` or `R1` for the `ydb_db_create_ver` env var since YottaDB releases have r1.x or r2.x numbering
    but decided the extra effort is not worth it as it might confuse the user more than help.

    While at this, noticed that the line for `ydb_dollartest` was not in sorted order like the rest of
    the lines. So fixed it by moving it to the sorted position.

  - sr_unix/gtm_tls_impl.c : The GT.M changes to this file have been incorporated into
    `YDBEncrypt/gtm_tls_impl.c` which is where the YottaDB version of this file lies. There were 12 Hunks
    out of which there was a failure in `Hunk #5`, `Hunk #6` and `Hunk #12`.
    - `Hunk #5` was related to `Initialize OpenSSL library`. The GT.M side had added a lot of logic inside
      a `#if OPENSSL_VERSION_NUMBER >= 0x10100000L` define. But the YottaDB side which had added TLS 3
      support in a prior commit had a `#if OPENSSL_VERSION_MAJOR < 3` logic in a smaller code block. It
      was not clear which was a better approach but to keep conflicts to a minimum, I picked the GT.M
      approach and discarded the YottaDB side of the conflict.
    - `Hunk #6` was related to GT.M moving a call to `gc_load_gtmshr_symbols()` to a few lines earlier.
      This call was replaced by `gc_load_yottadb_symbols()` call in a prior YottaDB commit unrelated to
      the TLS 3 support changes so that YottaDB change was retained but the GT.M move of the call was
      also picked. GT.M also added a `SSL_CTX_new(TLS_method())` call and that was picked as well.
    - `Hunk #12` was related to GT.M side adding a `#if OPENSSL_VERSION_NUMBER >= 0x10101000L` check in
      a `switch/case` block around a `case TLS1_3_VERSION:` in the `gtm_tls_get_conn_info()` function. But
      this switch/case block was removed in a prior YottaDB commit and replaced with a `SSL_get_version()`
      call and so the GT.M side of this conflict was discarded.
nars1 added a commit that referenced this issue Apr 17, 2024
…er call in ss_initiate)

Background
----------
* When run with a Debug build of YottaDB, a `mupip integ` failed as follows.

  ```c
  %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 504 for expression ((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state) || (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state))
  ```

Issue
-----
* The issue is that GT.M V7.0-005 added a `START_JNL_FILE_CLOSE_TIMER_IF_NEEDED` call to
  the `SET_SNAPSHOTS_IN_PROG` macro as can be seen in the below diff.

  ```diff
  $ git show -U0 tags/V7.0-005 sr_port/gdsfhead.h
  @@ -4187 +4187 @@ MBSTART {
  -#define SET_SNAPSHOTS_IN_PROG(X)	((X)->snapshot_in_prog = TRUE)
  +#define SET_SNAPSHOTS_IN_PROG(X)	MBSTART { (X)->snapshot_in_prog = TRUE; START_JNL_FILE_CLOSE_TIMER_IF_NEEDED; } MBEND
  ```

* This caused the assert failure in the following code (which is present only in YottaDB, not in GT.M).

  **sr_unix/gt_timers.c**
  ```c
    501         } else if (jnl_file_close_timer_fptr == handler)
    502         {       /* Account for known instances of the above function being called from within a deferred zone. */
    503                 assert((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state)
    504                         || (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state));
    505                 safe_to_add = TRUE;
  ```

* Below are details from the gdb analysis of the core file

  ```c
  (gdb) where
  #0  __pthread_kill_implementation (no_tid=0, signo=3, threadid=140357926831936) at ./nptl/pthread_kill.c:44
  #1  __pthread_kill_internal (signo=3, threadid=140357926831936) at ./nptl/pthread_kill.c:78
  #2  __GI___pthread_kill (threadid=140357926831936, signo=3) at ./nptl/pthread_kill.c:89
  #3  gtm_dump_core () at sr_unix/gtm_dump_core.c:74
  #4  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163
  #5  ch_cond_core () at sr_unix/ch_cond_core.c:80
  #6  rts_error_va (csa=0x0, argcnt=7, var=0x7fff19e828f0) at sr_unix/rts_error.c:199
  #7  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:100
  #8  start_timer (tid=140357912091280, time_to_expir=60000000000, handler=0x7fa79f7dda90 <jnl_file_close_timer>, hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:503
  #9  ss_initiate (reg=0x55b37821b170, util_ss_ptr=0x55b37821a9c0, ss_ctx=0x55b378219110, preserve_snapshot=0, calling_utility=0x7fa7a03c8506 "MUPIP INTEG") at sr_unix/ss_initiate.c:666
  #10 mu_int_reg (reg=0x55b37821b170, return_value=0x7fff19e8560c, return_after_open=0) at sr_port/mu_int_reg.c:192
  #11 mupip_integ () at sr_port/mupip_integ.c:438
  #12 mupip_main (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630) at sr_unix/mupip_main.c:130
  #13 dlopen_libyottadb (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630, main_func=0x55b37723f004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #14 main (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630) at sr_unix/mupip.c:21

  (gdb) f 8
  #8  start_timer (tid=140357912091280, time_to_expir=60000000000, handler=0x7fa79f7dda90 <jnl_file_close_timer>, hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:503
  503                     assert((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state)

  (gdb) p intrpt_ok_state
  $1 = INTRPT_IN_SS_INITIATE
  ```

Fix
---
* Now that we know this is expected, the above value is also added as an accepted value in the assert.
nars1 added a commit that referenced this issue Jul 10, 2024
Background
----------
* In internal testing, we noticed a rare failure in the `v51000/mu_bkup_stop` subtest
  where a `mupip backup` process that was sent a `SIGTERM` (by the test) ended up
  creating a core file due to ASAN assert failing on a double free.

* Below are relevant details from the core file.

  ```c
  Core was generated by `mupip backup -online -dbg * ./49181_online1'.
  Program terminated with signal SIGSEGV, Segmentation fault.

  (gdb) where
  #0  ydb_os_signal_handler (sig=11, info=0x7fd09968c3f0, context=0x7fd09968c2c0) at sr_unix/ydb_os_signal_handler.c:57
  #1  <signal handler called>
  #2  ydb_os_signal_handler (sig=6, info=0x7fd09968caf0, context=0x7fd09968c9c0) at sr_unix/ydb_os_signal_handler.c:57
  #3  <signal handler called>
  #4  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
  #5  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
  #6  __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
  #7  __GI_abort () at ./stdlib/abort.c:79
  #8  __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:143
  #9  __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58
  #10 __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffda6de6ebe, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:190
  #11 __asan::ReportDoubleFree (addr=140533757257728, free_stack=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:224
  #12 __asan::Allocator::ReportInvalidFree (this=<optimized out>, stack=0x7ffda6de79f0, chunk_state=<optimized out>, ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:757
  #13 __interceptor_free (ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53
  #14 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1485
  #15 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #16 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #18 mupip_backup_call_on_signal () at sr_port/mupip_backup.c:208
  #19 signal_exit_handler (exit_handler_name=0x7fd097f1dda0 "deferred_exit_handler", sig=15, info=0x7fd098480fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fd098481058 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:67
  #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120
  #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  #26 mupip_backup () at sr_port/mupip_backup.c:1585
  #27 mupip_main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip_main.c:130
  #28 dlopen_libyottadb (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50, main_func=0x55af49fd9020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip.c:21

  (gdb) f 25
  #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) f 17
  #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103
  103                                     free(ptr->backup_hdr);

  (gdb) down
  #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501
  1501            gtm_free_main(addr, TAIL_CALL_LEVEL);

  (gdb) down
  #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854
  854                     system_free(addr);

  (gdb) down
  #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);

  (gdb) list
  1481    {
  1482            intrpt_state_t  prev_intrpt_state;
  1483
  1484            DEFER_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1485            free(addr);
  1486            ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state);
  1487            return;
  1488    }
  ```

Issue
-----
* We did a `free(ptr->backup_hdr)` at line 103. And that in turn ended up using the system `free()`
  function because the test framework had randomly set the `gtmdbglvl` env var to a value of
  `0x80000000`.

* So at line 1485 above, the system free finished but at line 1486 we noticed the SIGTERM that was
  deferred and so decided to handle it. But the `ptr->backup_hdr` variable was still set to a
  non-NULL value so as part of the deferred exit handler, we tried to free this again resulting
  in the double free.

Fix
---
* The fix is to note `ptr->backup_hdr` in a local variable and clear the former and then attempting
  the `free()` on the local variable. This way if we decide to do deferred exit handling after the
  `free()` occurred, we will notice a NULL value of `ptr->backup_hdr` and so avoid the double free.

Notes
-----
* This is considered a too rare a race condition to be encountered in practice and so it is expected
  to not be noticed by a user. Therefore no YDB issue is created for this.
nars1 added a commit that referenced this issue Aug 29, 2024
Background
----------
* This is an issue in c087690 that was found by YDBTest@a8eaadc4 (YDBTest!2075).

* When YottaDB was built with ASAN, running the `r202/shebang-ydb1084` subtest gave the
  following error.

  ```c
  =================================================================
  ==29909==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020000001b6 at pc 0x559522eb8f9d bp 0x7ffd2cb5aa60 sp 0x7ffd2cb5a1e8
  READ of size 7 at 0x6020000001b6 thread T0
      #0 0x559522eb8f9c in printf_common(void*, char const*, __va_list_tag*) asan_interceptors.cpp.o
      #1 0x7f9d29cc83d3 in gtm_snprintf sr_unix/gtm_stdio.c:102:2
      #2 0x7f9d2ac7629a in ydb_shebang sr_port/ydb_shebang.c:186:3
      #3 0x7f9d29e41973 in jobchild_init sr_unix/jobchild_init.c:138:16
      #4 0x7f9d29cc5db5 in gtm_startup sr_unix/gtm_startup.c:287:3
      #5 0x7f9d29e04dd8 in init_gtm sr_unix/init_gtm.c:203:2
      #6 0x7f9d29b62e82 in gtm_main sr_unix/gtm_main.c:194:2
      #7 0x559522f55eee in dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #8 0x559522f54f1d in main sr_unix/gtm.c:20:9
      #9 0x7f9d2efa1d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
      #10 0x7f9d2efa1e3f in __libc_start_main csu/../csu/libc-start.c:392:3
      #11 0x559522e97344 in _start (/usr/library/V998_R201/dbg/yottadb+0x1e344) (BuildId: a420a9112acf24ed4fac1e0bcc38f1947ddde8e8)

  0x6020000001b6 is located 0 bytes to the right of 6-byte region [0x6020000001b0,0x6020000001b6)
  allocated by thread T0 here:
      #0 0x559522f1a18e in __interceptor_malloc (/usr/library/V998_R201/dbg/yottadb+0xa118e) (BuildId: a420a9112acf24ed4fac1e0bcc38f1947ddde8e8)
      #1 0x7f9d2a42cd8f in system_malloc sr_port/gtm_malloc_src.h:1470:9
      #2 0x7f9d2a4274f9 in gtm_malloc_main sr_port/gtm_malloc_src.h:673:10
      #3 0x7f9d2a436de6 in gtm_malloc sr_port/gtm_malloc_src.h:1496:9
      #4 0x7f9d2b4d28f2 in zro_load sr_unix/zro_load.c:388:42
      #5 0x7f9d2ac7d169 in zro_init sr_port/zro_init.c:95:2
      #6 0x7f9d2b4d301f in zro_search sr_unix/zro_search.c:79:3
      #7 0x7f9d2ac75b53 in ydb_shebang sr_port/ydb_shebang.c:152:2
      #8 0x7f9d29e41973 in jobchild_init sr_unix/jobchild_init.c:138:16
      #9 0x7f9d29cc5db5 in gtm_startup sr_unix/gtm_startup.c:287:3
      #10 0x7f9d29e04dd8 in init_gtm sr_unix/init_gtm.c:203:2
      #11 0x7f9d29b62e82 in gtm_main sr_unix/gtm_main.c:194:2
      #12 0x559522f55eee in dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11
      #13 0x559522f54f1d in main sr_unix/gtm.c:20:9
      #14 0x7f9d2efa1d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16
  ```

Issue
-----
* `TREF(dollar_zroutines)` is an mstr. That is, it has a `.addr` and `.len` field. It is not a
  null-terminated string. But it was used as one in the following line where the last `%s` in the
  SNPRINTF command corresponds to that.

  **sr_port/ydb_shebang.c**
  ```
  186     SNPRINTF(new_zro, space_needed, "%s(%s) %s", buf, rtn_path, (TREF(dollar_zroutines)).addr);
  ```

* This caused `snprintf` to treat it as a null terminated string and access the null byte as part
  of its processing. And that caused ASAN to fail with the `heap-buffer-overflow` error.

Fix
---
* The fix is simple and is to not treat it as a null terminated string in the `snprintf` command.
  And that is easily achieved by using a `%.*s` instead of a `%s` (it now requires passing the
  length in addition to the char pointer).
nars1 added a commit that referenced this issue Sep 18, 2024
…or; Fix pre-existing error in sr_unix/ttt.txt

Background
----------
* As part of implementing 3c1955f, I noticed that new opcodes in `sr_unix/ttt.txt` which had
  more than 2 underscores (e.g. `OC_SORTS_AFTER_RETBOOL`) issued a `TTTGEN-F` message when
  invoking `mumps -run tttgen` (to generate `ttt.c`) as part of the pre-commit script but still
  exited the `mumps -run tttgen` invocation with a 0 exit status. This caused the pre-commit
  script to proceed with a broken copy of `ttt.c` and later cause a runtime assert failure
  in `emit_code.c` which was hard to trace back to the `TTTGEN-F` message.

Issue
-----
* While 3c1955f worked around this issue by renaming the opcode `OC_SORTS_AFTER_RETBOOL` with
  3 underscores to `OC_SORTSAFTER_RETBOOL` with 2 underscores, it would be nice to fix the
  pre-commit script to not create a broken `ttt.c` but instead issue an error.

* I took a look at the `mumps -run tttgen` output to see all the `TTTGEN-F` messages. And
  found 2 of them.

  ```sh
  $ $gtm_dist/mumps -run tttgen ../sr_unix/ttt.txt ../sr_port/opcode_def.h ../sr_port/vxi.h
       .
     798  ; Note if OC_RTERROR call changes, linetail.c and eval_expr.c will also need
     799  ; to change due to it dereferencing the backpoints to get to the opcode.
  %TTTGEN-F vax opcode not defined at line 798 character 9999
       .
       .
    1158  OC_SORTS_AFTER_RETMVAL: calls   #0,xfer.xf_gvzwithdraw
  %TTTGEN-F invalid opcode name at line 1158 character 23
  ```

Fix
---
* The `vax opcode not defined` error is due to a comment line showing up on its own instead of
  being inside line 801 that starts the `OC_RTERROR:` opcode definition.

  **sr_unix/ttt.txt**
  ```
    793 OC_RHDADDR1:    pushl   #-1
    794                 pushab  val.1
    795                 calls   #2,xfer.xf_rhdaddr
    796                 movl    r0,val.0
    797 ;
    798 ; Note if OC_RTERROR call changes, linetail.c and eval_expr.c will also need
    799 ; to change due to it dereferencing the backpoints to get to the opcode.
    800 ;
    801 OC_RTERROR:     pushl   val.2
    802                 pushl   val.1
    803                 jsb     xfer.xf_rterror
  ```

  The `tttscan.m` parser is very simplistic and does not recognize a comment line as its own
  and skip it. Instead of figuring out how to make it do so, I instead moved lines 797-800
  as comments in lines 801 and 802. And that made the `%TTTGEN-F vax opcode not defined' message
  go away.

* The `%TTTGEN-F invalid opcode name` error is due to another limitation of the simplistic
  `tttscan.m` parser in that it only allows 2 underscores. While I worked around that by
  renaming the opcode to only have 2 underscores, the bigger issue is to ensure the script
  exits with a non-zero exit status in case an opcode with more than 2 underscores is seen.

  Towards that `sr_port/tttscan.m` has been modified to do a `zhalt 1` which would terminate
  the process with a non-zero exit status right when an error is detected instead of just
  printing an error message and continuing.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

0 participants