The method of printing SVNs in zshow_svn is sub-optimal #9

estess · 2017-07-14T16:15:34Z

Final Release Note

Description

Adding a new SVN requires updates in a number of places. One of those places is updating zshow_svn.c which is not well organized. The SVNs and their related fields and/or mini-routines to dump themselves should be in a table to reduce the complexity. Adding some comments in the other update places of the other places that need to be updated would be advisable also.

Draft Release Note

While this change not user visible and is likely not on a performance path (usually used in an error path or during debugging), making it more maintainable is always preferable.

…h an external filter followed by a deactivate and shutdown Release Note ------------- The replication source server terminates normally in case it was started with an external filter and later transitioned from an active to a passive source server before being asked to shut down. Previously, the source server would terminate abnormally with a SIG-11. (YDB#30) Test ----- * New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix. * E_ALL run many times to ensure no regressions. README ------- When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c). 517 if (gtmsource_filter & EXTERNAL_FILTER) 518 repl_stop_filter(); While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be invoked more than once. That is because it sends a message to the other side of the filter (the receiver side) to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to send a message to the other side because "extract_buff" is NULL. Below is the stack trace. (gdb) where at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374 at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660 (gdb) where #0 0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85 #1 0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80 #2 0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323 YottaDB#3 0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333 YottaDB#4 0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580) at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374 YottaDB#5 <signal handler called> YottaDB#6 0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1) at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660 YottaDB#7 0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128 YottaDB#8 0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136 YottaDB#9 0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148 YottaDB#10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520 YottaDB#11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123 The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter(). Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED was introduced to take care of this. Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various places and might have a similar issue. So all those callers too were fixed to use this new macro. But they manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").

…h an external filter followed by a deactivate and shutdown Release Note ------------- The replication source server terminates normally in case it was started with an external filter and later transitioned from an active to a passive source server before being asked to shut down. Previously, the source server would terminate abnormally with a SIG-11. (YDB#30) Test ----- * New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix. * E_ALL run many times to ensure no regressions. README ------- When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c). 517 if (gtmsource_filter & EXTERNAL_FILTER) 518 repl_stop_filter(); While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be invoked more than once. That is because it sends a message to the other side of the filter (the receiver side) to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to send a message to the other side because "extract_buff" is NULL. Below is the stack trace. (gdb) where #0 0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85 #1 0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80 #2 0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323 YottaDB#3 0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333 YottaDB#4 0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580) at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374 YottaDB#5 <signal handler called> YottaDB#6 0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1) at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660 YottaDB#7 0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128 YottaDB#8 0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136 YottaDB#9 0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148 YottaDB#10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520 YottaDB#11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123 The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter(). Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED was introduced to take care of this. Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various places and might have a similar issue. So all those callers too were fixed to use this new macro. But they manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").

…h an external filter followed by a deactivate and shutdown Release Note ------------- The replication source server terminates normally in case it was started with an external filter and later transitioned from an active to a passive source server before being asked to shut down. Previously, the source server would terminate abnormally with a SIG-11. (YDB#30) Test ----- * New r110/srcsrv_extfilter_sig11 subtest fails reliably without the fixes and passes reliably with the fix. * E_ALL run many times to ensure no regressions. README ------- When an active replication source server is deactivated, it changes mode from ACTIVE to PASSIVE mode. This also causes it to close external filters (if it had them open) by a call to repl_stop_filter() (in gtmsource.c). 517 if (gtmsource_filter & EXTERNAL_FILTER) 518 repl_stop_filter(); While it is in passive mode (and waiting to be activated again or asked to shutdown), if it encounters a shutdown signal, it goes to gtmsource_end() which in turn ends up invoking replstop_filter(). But this function cannot be invoked more than once. That is because it sends a message to the other side of the filter (the receiver side) to stop and then closes the source side of the filter and frees up and nullifies the associated buffers (including the global variable "extract_buff"). The second invocation of this function ends up with a SIG-11 when trying to send a message to the other side because "extract_buff" is NULL. Below is the stack trace. (gdb) where #0 0x000000000082200a in rel_lock (reg=0x3044958) at /Distrib/GT.M/V63002/sr_unix/rel_lock.c:85 #1 0x0000000000474602 in gtmsource_end1 (auto_shutdown=1) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:80 #2 0x000000000049d514 in gtmsource_stop (exit=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:323 #3 0x000000000049d55b in gtmsource_sigstop () at /Distrib/GT.M/V63002/sr_unix/gtmsource_shutdown.c:333 #4 0x00000000006d5377 in generic_signal_handler (sig=11, info=0x7ffd3122d6b0, context=0x7ffd3122d580) at /Distrib/GT.M/V63002/sr_unix/generic_signal_handler.c:374 #5 <signal handler called> #6 0x000000000057b776 in repl_filter_send (tr_num=0, tr=0x0, tr_len=0, first_send=1) at /Distrib/GT.M/V63002/sr_port/repl_filter.c:660 #7 0x0000000000581342 in repl_stop_filter () at /Distrib/GT.M/V63002/sr_port/repl_filter.c:1128 #8 0x000000000047504f in gtmsource_end1 (auto_shutdown=0) at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:136 #9 0x000000000047509e in gtmsource_end () at /Distrib/GT.M/V63002/sr_unix/gtmsource_end.c:148 #10 0x000000000046d968 in gtmsource () at /Distrib/GT.M/V63002/sr_unix/gtmsource.c:520 #11 0x000000000044a022 in main (argc=11, argv=0x7ffd3122e7d8) at /Distrib/GT.M/V63002/sr_unix/mupip.c:123 The fix is to turn off the EXTERNAL_FILTER bit in "gtmsource_filter" right after a call to repl_stop_filter(). Since repl_stop_filter() is called from various places in the source server a macro STOP_EXTERNAL_FILTER_IF_NEEDED was introduced to take care of this. Although the receiver server does not suffer from this exact issue, it also invokes repl_stop_filter() in various places and might have a similar issue. So all those callers too were fixed to use this new macro. But they manipulate the global variable "gtmrecv_filter" (instead of "gtmsource_filter").

…nperf) Files changes in this commit: - sr_i386/g_msf.si 1. Remove SFF_CI flag and resequence flags - sr_port/alias_funcs.c 1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI frame no longer being there. A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it has to do with the elimination of the GTM$CI frame. The frame chain used to look like this: CI-base-frame <- GTM$CI <- M routine Now it looks like: CI-base-frame <- M routine In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame. In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame. The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS the base frame. - sr_port/dollar_zlevel.c 1. Spruce up some comments to make more sense. - sr_port/f_text.c 1. Remove GTM$CI from the list of supressed values (no longer exists). - sr_port/fgncal.h 1. Remove declaration for fgncal_lookup.c (no longer used). - sr_port/fgncal_lookup.c 1. Deleted - no longer used - sr_port/gbldefs.c 1. Remove param_list global (no longer used). - sr_port/mdb_condition_handler.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. 4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0 instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1) and also the mumps -run base frame. - sr_port/mdef.h 1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of requiring it to be a compiler option. - sr_port/mprof_funcs.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. When running the stack, skip over both TRIGGER and CALLIN base frames. - sr_port/op_bindparm.c 1. Minor cosmetic changes and comment clarifications. - sr_port/op_clralsvals.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer need to unwind one more frame to the base frame as the flagged frame IS the base frame). - sr_port/op_halt.c 2. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/op_unwind.c 2. Add include of gtmio.h for debugging macro(s). - sr_port/op_zg1.c 2. If call-in mode, restart the frame we unwind to. This returns to the caller. - sr_port/op_zgoto.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. - sr_port/op_zhalt.c 1. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/parm_pool.c 1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in two forms: a. Form that gets its arguments from the varargs parameter list. b. Form that gets its arguments from a parmblk_struct. - sr_port/stack_frame.h 1. Remove SFF_CI flag and resequence flags to fill the gap. 2. Add SFT_CI type id. 3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames. - sr_port/tp_unwind.c 1. Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too. - sr_port/unw_mv_ent.c 1. Change debugging output to be more clear which path is being taken. - sr_port/zlput_rname.c 1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the M stack so it is more effectively utilized. - sr_port/zshow_stack.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Add marker for when call-in base frame shows up in stack. - sr_unix/ci_ret_code.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed. The ci_ret_code() routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp() which we are trying to avoid. That condition is now special cased in ZGOTO. 3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments added to better describe the dependencies - sr_unix/error_return.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Don't let CALL-IN mode do an EXIT here - force return to caller. - sr_unix/fgncalsp.h 1. Get rid of no longer needed pointer fields (now passed directly) - sr_unix/gtm_startup.c 1. Add comments on odd happenings. - sr_unix/gtm_trigger.c 1. Change an M stack running loop to work without GTM$CI frame. - sr_unix/gtm_unlink_all.c 1. Remove check for GTM$CI executable. - sr_unix/gtmci.c 1. Removed use of longjmp() to effect return from call-ins. 2. Change SFF_CI flag usage to SFT_CI type usage. 3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters appropriately for call-in to op_bindparm in generated M code. 4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call push_parm to set up called-routine parameters ourselves. - sr_unix/gtmci.h 1. Removed GTM$CI as an intermediate call-in frame. 2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame. - sr_unix/gtmci_isv.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Some comment cleanup. - sr_unix/gtmci_signals.c 1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere. - sr_unix/invocation_mode.h 1. Remove MUMPS_GTMCI_OFF macro as unused. 2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set). - sr_unix/jobchild_init.c 1. Don't create GTM$CI anymore (make_cimode() not needed and removed). 2. Create single base CI frame instead of base + GTM$CI frame. - sr_unix/make_cimode.c 1. Removed - no longer needed. - sr_unix/make_mode.c 1. Removed all capability to create GTM$CI as this "program" is no longer needed. 2. Removed all #ifdef of __ia64, __hpux, and _AIX. - sr_unix/ojchildparms.c 1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't matter what type of frame this is so long as there is one that can be used to return. - sr_unix/op_fnfgncal.c 1. Change SFF_CI flag usage to SFT_CI type usage. - sr_unix/relinkctl.c 1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame). - sr_unix/rtnhdr.h 1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a base frame doesn't screw things up and cause that NULL to be dereferenced. - sr_x86_64/ci_restart.s 1. Remove this parameter massaging routine as being unnecessary anymore. - sr_x86_64/g_msf.si 1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble). [9/2017 SEE] Files changes in this (squashed) commit: sr_port/op_fnview.c 1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER". Multiple comma separated tokens may be returned. sr_port/viewtab.h 1. Add "ENVIRONMENT" option. 2. Remove non-UNIX environment #ifdef code. Files changes in this (squashed) commit: - sr_linux/release_name.h - Change to R1.10 Files changes in this (squashed) commit: sr_port/alias.h: 1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not have to specify a compiler option. sr_port/alias_funcs.c: 1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables for when they get changed and there were the LVMON* view commands that help to debug aliases and are active only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG error since the view commands that followed it in the table were LVMON*. This update changes the name of the VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON* in the test system due to it requiring a compiler flag to be incorporated so no tests were affected. sr_port/gbldefs.c: sr_port/lv_val.h: sr_port/op_view.h sr_port/tp_unwind.c sr_port/viewtab.h 1. Rename lvmon* for lvamon*. sr_port/mdef.h 1. Comment out define for DEBUG_ALIAS ----------------------------------- This project significantly changes how call-ins works. To properly describe it, we need to first describe the current (original) operation of callins. The following write-up between the ------ lines below describes the current (r100 and previous versions) call-in operation: ----------------------------------- How call-ins work: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine is called GTM$CI. 9. - gtm_init_env(base_addr, transfer_addr) 10. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 11. - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the GTM$CI routine. 12. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by gtm_init_env()). - rvector is for GTM$CI routine. - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M). - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. 2. Executable frame (created by new_stack_frame() as called by gtm_init_env()). - rvector pointer is set to GTM$CI routine - Frame is flagged as as call-in gtm_ci[p]() - Perform call-in: 1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine. 2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which will add a new stack frame and have the call-in parameters set up. 3. Drives dm_start() to invoke the callin. The actual call path goes like this: a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI). b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it. c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are calling into is invoked. d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where mdb_condition_handler was setup in dm_start(). e. dm_start() then returns to gtm_ci[p] for return value processing. The old parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms. 4. When op_extexfun() is driven, parms are shifted around for call to push_parm(). 5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms. 6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local vars in the called routine. Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the parameter conversion necessary at the time. The new support described next has made some changes that make this extra conversion step unnecessary. ----------------------------------- Following (between the *********** lines) is a description of how call-ins work with this new support in place: *********************************** How the new call-ins functionality works: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 9. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by jobchild_init()). - rvector is NULL as this frame is just for return. - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame. - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. gtm_ci[p]() - Perform call-in: 1. Create new executable frame (new_stack_frame()) for the called M routine to run in. 2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter. 3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by op_bindparm() which gets called at the top of the called target routine. See parameter flow below. 4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven. 5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp(). 6. dm_start() then returns to gtm_ci[p] for return value processing. The parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used for buffering parameters). 4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty) op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in the called routine then releases the parameter space for reuse. *********************************** ----------------------------------- Operational changes with this project in the execution of a call-in: - No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8). - Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus the GTM$CI frame (step gtm_init YottaDB#9). - No longer setup GTM$CI as part of gtm_ci[p] (gtm_ci[p] YottaDB#1). Instead, we allocate the frame using new_stack_frame() directly (gtm_ci[p]() YottaDB#1). - After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3). - The execution flow used to look like: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive GTM$CI "build" routine. * op_extexfun() * Drive target M routine * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start). * return to gtm_ci[p] * Return to C caller - The flow has similar steps but one less layer and returns without the use of a (longjmp) system call: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive target M routine * M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start. * return to gtm_ci[p] * return to C caller. ----------------------------------- Notes on the removal of GTM$CI. - A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller. - A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine" was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before calling the call-in routine. - Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but was created by the make_mode() routine. Here's the operation: a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall depending on whether has args or return value). The parm_blk also contains the parameters and other things needed to effect a call. b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to be GTM$CI. c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the input and output parameters for the call-in routine and drives it to create the stack. d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame. e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return. - So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the system call flavor of return is unnecessary. Wanted to avoid that too. - Two changes allowed us to be rid of the extra GTM$CI overhead: 1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return without being unwound. We make use of that so we can do the simple unwind of call-in levels. 2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them back off in the glue routine. This avoided one of the argument restructurings that were happening. - Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and it was moved to the base frame itself which is how triggers also does it. ----------------------------------- User visible changes in this project (for specification in release note): 1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings. 2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1. 3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text "(Call-In Level Entry)" appears in the stack list. 4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used. You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame. 5. Similar issue when gtm_trigger() is verifying a trigger is not in use anywhere on the stack. 6. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't find it. 7. $VIEW(ENVIRONMENT) description. 8. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously. ```

…nperf) Files changes in this commit: - sr_i386/g_msf.si 1. Remove SFF_CI flag and resequence flags - sr_port/alias_funcs.c 1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI frame no longer being there. A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it has to do with the elimination of the GTM$CI frame. The frame chain used to look like this: CI-base-frame <- GTM$CI <- M routine Now it looks like: CI-base-frame <- M routine In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame. In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame. The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS the base frame. - sr_port/dollar_zlevel.c 1. Spruce up some comments to make more sense. - sr_port/f_text.c 1. Remove GTM$CI from the list of supressed values (no longer exists). - sr_port/fgncal.h 1. Remove declaration for fgncal_lookup.c (no longer used). - sr_port/fgncal_lookup.c 1. Deleted - no longer used - sr_port/gbldefs.c 1. Remove param_list global (no longer used). - sr_port/mdb_condition_handler.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. 4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0 instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1) and also the mumps -run base frame. - sr_port/mdef.h 1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of requiring it to be a compiler option. - sr_port/mprof_funcs.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. When running the stack, skip over both TRIGGER and CALLIN base frames. - sr_port/op_bindparm.c 1. Minor cosmetic changes and comment clarifications. - sr_port/op_clralsvals.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer need to unwind one more frame to the base frame as the flagged frame IS the base frame). - sr_port/op_halt.c 2. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/op_unwind.c 2. Add include of gtmio.h for debugging macro(s). - sr_port/op_zg1.c 2. If call-in mode, restart the frame we unwind to. This returns to the caller. - sr_port/op_zgoto.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. - sr_port/op_zhalt.c 1. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/parm_pool.c 1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in two forms: a. Form that gets its arguments from the varargs parameter list. b. Form that gets its arguments from a parmblk_struct. - sr_port/stack_frame.h 1. Remove SFF_CI flag and resequence flags to fill the gap. 2. Add SFT_CI type id. 3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames. - sr_port/tp_unwind.c 1. Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too. - sr_port/unw_mv_ent.c 1. Change debugging output to be more clear which path is being taken. - sr_port/zlput_rname.c 1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the M stack so it is more effectively utilized. - sr_port/zshow_stack.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Add marker for when call-in base frame shows up in stack. - sr_unix/ci_ret_code.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed. The ci_ret_code() routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp() which we are trying to avoid. That condition is now special cased in ZGOTO. 3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments added to better describe the dependencies - sr_unix/error_return.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Don't let CALL-IN mode do an EXIT here - force return to caller. - sr_unix/fgncalsp.h 1. Get rid of no longer needed pointer fields (now passed directly) - sr_unix/gtm_startup.c 1. Add comments on odd happenings. - sr_unix/gtm_trigger.c 1. Change an M stack running loop to work without GTM$CI frame. - sr_unix/gtm_unlink_all.c 1. Remove check for GTM$CI executable. - sr_unix/gtmci.c 1. Removed use of longjmp() to effect return from call-ins. 2. Change SFF_CI flag usage to SFT_CI type usage. 3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters appropriately for call-in to op_bindparm in generated M code. 4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call push_parm to set up called-routine parameters ourselves. - sr_unix/gtmci.h 1. Removed GTM$CI as an intermediate call-in frame. 2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame. - sr_unix/gtmci_isv.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Some comment cleanup. - sr_unix/gtmci_signals.c 1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere. - sr_unix/invocation_mode.h 1. Remove MUMPS_GTMCI_OFF macro as unused. 2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set). - sr_unix/jobchild_init.c 1. Don't create GTM$CI anymore (make_cimode() not needed and removed). 2. Create single base CI frame instead of base + GTM$CI frame. - sr_unix/make_cimode.c 1. Removed - no longer needed. - sr_unix/make_mode.c 1. Removed all capability to create GTM$CI as this "program" is no longer needed. 2. Removed all #ifdef of __ia64, __hpux, and _AIX. - sr_unix/ojchildparms.c 1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't matter what type of frame this is so long as there is one that can be used to return. - sr_unix/op_fnfgncal.c 1. Change SFF_CI flag usage to SFT_CI type usage. - sr_unix/relinkctl.c 1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame). - sr_unix/rtnhdr.h 1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a base frame doesn't screw things up and cause that NULL to be dereferenced. - sr_x86_64/ci_restart.s 1. Remove this parameter massaging routine as being unnecessary anymore. - sr_x86_64/g_msf.si 1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble). [9/2017 SEE] Files changes in this (squashed) commit: sr_port/op_fnview.c 1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER". Multiple comma separated tokens may be returned. sr_port/viewtab.h 1. Add "ENVIRONMENT" option. 2. Remove non-UNIX environment #ifdef code. Files changes in this (squashed) commit: - sr_linux/release_name.h - Change to R1.10 Files changes in this (squashed) commit: sr_port/alias.h: 1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not have to specify a compiler option. sr_port/alias_funcs.c: 1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables for when they get changed and there were the LVMON* view commands that help to debug aliases and are active only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG error since the view commands that followed it in the table were LVMON*. This update changes the name of the VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON* in the test system due to it requiring a compiler flag to be incorporated so no tests were affected. sr_port/gbldefs.c: sr_port/lv_val.h: sr_port/op_view.h sr_port/tp_unwind.c sr_port/viewtab.h 1. Rename lvmon* for lvamon*. sr_port/mdef.h 1. Comment out define for DEBUG_ALIAS ----------------------------------- This project significantly changes how call-ins works. To properly describe it, we need to first describe the current (original) operation of callins. The following write-up between the ------ lines below describes the current (r100 and previous versions) call-in operation: ----------------------------------- How call-ins work: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine is called GTM$CI. 9. - gtm_init_env(base_addr, transfer_addr) 10. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 11. - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the GTM$CI routine. 12. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by gtm_init_env()). - rvector is for GTM$CI routine. - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M). - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. 2. Executable frame (created by new_stack_frame() as called by gtm_init_env()). - rvector pointer is set to GTM$CI routine - Frame is flagged as as call-in gtm_ci[p]() - Perform call-in: 1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine. 2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which will add a new stack frame and have the call-in parameters set up. 3. Drives dm_start() to invoke the callin. The actual call path goes like this: a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI). b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it. c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are calling into is invoked. d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where mdb_condition_handler was setup in dm_start(). e. dm_start() then returns to gtm_ci[p] for return value processing. The old parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms. 4. When op_extexfun() is driven, parms are shifted around for call to push_parm(). 5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms. 6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local vars in the called routine. Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the parameter conversion necessary at the time. The new support described next has made some changes that make this extra conversion step unnecessary. ----------------------------------- Following (between the *********** lines) is a description of how call-ins work with this new support in place: *********************************** How the new call-ins functionality works: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 9. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by jobchild_init()). - rvector is NULL as this frame is just for return. - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame. - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. gtm_ci[p]() - Perform call-in: 1. Create new executable frame (new_stack_frame()) for the called M routine to run in. 2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter. 3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by op_bindparm() which gets called at the top of the called target routine. See parameter flow below. 4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven. 5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp(). 6. dm_start() then returns to gtm_ci[p] for return value processing. The parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used for buffering parameters). 4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty) op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in the called routine then releases the parameter space for reuse. *********************************** ----------------------------------- Operational changes with this project in the execution of a call-in (note steps are from original code): - No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8). - Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus the GTM$CI frame (step gtm_init YottaDB#9). - No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called directly (gtm_ci[p]() YottaDB#1). - After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3). - The execution flow used to look like: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive GTM$CI "build" routine. * op_extexfun() * Drive target M routine * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start). * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there. * Return to C caller - The flow has similar steps but one less layer and returns without the use of a (longjmp) system call: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive target M routine * M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start. * return to gtm_ci[p] * return to C caller. ----------------------------------- Notes on the removal of GTM$CI. - A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller. - A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine" was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before calling the call-in routine. - Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but was created by the make_mode() routine. Here's the operation: a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall depending on whether has args or return value). The parm_blk also contains the parameters and other things needed to effect a call. b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to be GTM$CI. c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the input and output parameters for the call-in routine and drives it to create the stack. d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame. e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return. - So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the system call flavor of return is unnecessary. Wanted to avoid that too. - Two changes allowed us to be rid of the extra GTM$CI overhead: 1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return without being unwound. We make use of that so we can do the simple unwind of call-in levels. 2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them back off in the glue routine. This avoided one of the argument restructurings that were happening. - Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and it was moved to the base frame itself which is how triggers also does it. ----------------------------------- User visible changes in this project (for specification in release note): 1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings. 2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1. 3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text "(Call-In Level Entry)" appears in the stack list. 4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used. You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame. 5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't find it. 6. $VIEW(ENVIRONMENT) description. 7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously. ```

``` Files changes in this commit: - sr_i386/g_msf.si 1. Remove SFF_CI flag and resequence flags - sr_port/alias_funcs.c 1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI frame no longer being there. A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it has to do with the elimination of the GTM$CI frame. The frame chain used to look like this: CI-base-frame <- GTM$CI <- M routine Now it looks like: CI-base-frame <- M routine In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame. In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame. The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS the base frame. - sr_port/dollar_zlevel.c 1. Spruce up some comments to make more sense. - sr_port/f_text.c 1. Remove GTM$CI from the list of supressed values (no longer exists). - sr_port/fgncal.h 1. Remove declaration for fgncal_lookup.c (no longer used). - sr_port/fgncal_lookup.c 1. Deleted - no longer used - sr_port/gbldefs.c 1. Remove param_list global (no longer used). - sr_port/mdb_condition_handler.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. 4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0 instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1) and also the mumps -run base frame. - sr_port/mdef.h 1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of requiring it to be a compiler option. - sr_port/mprof_funcs.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. When running the stack, skip over both TRIGGER and CALLIN base frames. - sr_port/op_bindparm.c 1. Minor cosmetic changes and comment clarifications. - sr_port/op_clralsvals.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer need to unwind one more frame to the base frame as the flagged frame IS the base frame). - sr_port/op_halt.c 2. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/op_unwind.c 2. Add include of gtmio.h for debugging macro(s). - sr_port/op_zg1.c 2. If call-in mode, restart the frame we unwind to. This returns to the caller. - sr_port/op_zgoto.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. - sr_port/op_zhalt.c 1. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/parm_pool.c 1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in two forms: a. Form that gets its arguments from the varargs parameter list. b. Form that gets its arguments from a parmblk_struct. - sr_port/stack_frame.h 1. Remove SFF_CI flag and resequence flags to fill the gap. 2. Add SFT_CI type id. 3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames. - sr_port/tp_unwind.c 1. Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too. - sr_port/unw_mv_ent.c 1. Change debugging output to be more clear which path is being taken. - sr_port/zlput_rname.c 1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the M stack so it is more effectively utilized. - sr_port/zshow_stack.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Add marker for when call-in base frame shows up in stack. - sr_unix/ci_ret_code.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed. The ci_ret_code() routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp() which we are trying to avoid. That condition is now special cased in ZGOTO. 3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments added to better describe the dependencies - sr_unix/error_return.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Don't let CALL-IN mode do an EXIT here - force return to caller. - sr_unix/fgncalsp.h 1. Get rid of no longer needed pointer fields (now passed directly) - sr_unix/gtm_startup.c 1. Add comments on odd happenings. - sr_unix/gtm_trigger.c 1. Change an M stack running loop to work without GTM$CI frame. - sr_unix/gtm_unlink_all.c 1. Remove check for GTM$CI executable. - sr_unix/gtmci.c 1. Removed use of longjmp() to effect return from call-ins. 2. Change SFF_CI flag usage to SFT_CI type usage. 3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters appropriately for call-in to op_bindparm in generated M code. 4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call push_parm to set up called-routine parameters ourselves. - sr_unix/gtmci.h 1. Removed GTM$CI as an intermediate call-in frame. 2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame. - sr_unix/gtmci_isv.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Some comment cleanup. - sr_unix/gtmci_signals.c 1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere. - sr_unix/invocation_mode.h 1. Remove MUMPS_GTMCI_OFF macro as unused. 2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set). - sr_unix/jobchild_init.c 1. Don't create GTM$CI anymore (make_cimode() not needed and removed). 2. Create single base CI frame instead of base + GTM$CI frame. - sr_unix/make_cimode.c 1. Removed - no longer needed. - sr_unix/make_mode.c 1. Removed all capability to create GTM$CI as this "program" is no longer needed. 2. Removed all #ifdef of __ia64, __hpux, and _AIX. - sr_unix/ojchildparms.c 1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't matter what type of frame this is so long as there is one that can be used to return. - sr_unix/op_fnfgncal.c 1. Change SFF_CI flag usage to SFT_CI type usage. - sr_unix/relinkctl.c 1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame). - sr_unix/rtnhdr.h 1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a base frame doesn't screw things up and cause that NULL to be dereferenced. - sr_x86_64/ci_restart.s 1. Remove this parameter massaging routine as being unnecessary anymore. - sr_x86_64/g_msf.si 1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble). [9/2017 SEE] Files changes in this (squashed) commit: sr_port/op_fnview.c 1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER". Multiple comma separated tokens may be returned. sr_port/viewtab.h 1. Add "ENVIRONMENT" option. 2. Remove non-UNIX environment #ifdef code. Files changes in this (squashed) commit: - sr_linux/release_name.h - Change to R1.10 Files changes in this (squashed) commit: sr_port/alias.h: 1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not have to specify a compiler option. sr_port/alias_funcs.c: 1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables for when they get changed and there were the LVMON* view commands that help to debug aliases and are active only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG error since the view commands that followed it in the table were LVMON*. This update changes the name of the VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON* in the test system due to it requiring a compiler flag to be incorporated so no tests were affected. sr_port/gbldefs.c: sr_port/lv_val.h: sr_port/op_view.h sr_port/tp_unwind.c sr_port/viewtab.h 1. Rename lvmon* for lvamon*. sr_port/mdef.h 1. Comment out define for DEBUG_ALIAS ----------------------------------- This project significantly changes how call-ins works. To properly describe it, we need to first describe the current (original) operation of callins. The following write-up between the ------ lines below describes the current (r100 and previous versions) call-in operation: ----------------------------------- How call-ins work: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine is called GTM$CI. 9. - gtm_init_env(base_addr, transfer_addr) 10. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 11. - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the GTM$CI routine. 12. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by gtm_init_env()). - rvector is for GTM$CI routine. - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M). - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. 2. Executable frame (created by new_stack_frame() as called by gtm_init_env()). - rvector pointer is set to GTM$CI routine - Frame is flagged as as call-in gtm_ci[p]() - Perform call-in: 1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine. 2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which will add a new stack frame and have the call-in parameters set up. 3. Drives dm_start() to invoke the callin. The actual call path goes like this: a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI). b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it. c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are calling into is invoked. d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where mdb_condition_handler was setup in dm_start(). e. dm_start() then returns to gtm_ci[p] for return value processing. The old parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms. 4. When op_extexfun() is driven, parms are shifted around for call to push_parm(). 5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms. 6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local vars in the called routine. Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the parameter conversion necessary at the time. The new support described next has made some changes that make this extra conversion step unnecessary. ----------------------------------- Following (between the *********** lines) is a description of how call-ins work with this new support in place: *********************************** How the new call-ins functionality works: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 9. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by jobchild_init()). - rvector is NULL as this frame is just for return. - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame. - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. gtm_ci[p]() - Perform call-in: 1. Create new executable frame (new_stack_frame()) for the called M routine to run in. 2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter. 3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by op_bindparm() which gets called at the top of the called target routine. See parameter flow below. 4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven. 5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp(). 6. dm_start() then returns to gtm_ci[p] for return value processing. The parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used for buffering parameters). 4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty) op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in the called routine then releases the parameter space for reuse. *********************************** ----------------------------------- Operational changes with this project in the execution of a call-in (note steps are from original code): - No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8). - Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus the GTM$CI frame (step gtm_init YottaDB#9). - No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called directly (gtm_ci[p]() YottaDB#1). - After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3). - The execution flow used to look like: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive GTM$CI "build" routine. * op_extexfun() * Drive target M routine * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start). * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there. * Return to C caller - The flow has similar steps but one less layer and returns without the use of a (longjmp) system call: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive target M routine * M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start. * return to gtm_ci[p] * return to C caller. ----------------------------------- Notes on the removal of GTM$CI. - A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller. - A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine" was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before calling the call-in routine. - Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but was created by the make_mode() routine. Here's the operation: a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall depending on whether has args or return value). The parm_blk also contains the parameters and other things needed to effect a call. b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to be GTM$CI. c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the input and output parameters for the call-in routine and drives it to create the stack. d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame. e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return. - So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the system call flavor of return is unnecessary. Wanted to avoid that too. - Two changes allowed us to be rid of the extra GTM$CI overhead: 1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return without being unwound. We make use of that so we can do the simple unwind of call-in levels. 2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them back off in the glue routine. This avoided one of the argument restructurings that were happening. - Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and it was moved to the base frame itself which is how triggers also does it. ----------------------------------- User visible changes in this project (for specification in release note): 1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings. 2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1. 3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text "(Call-In Level Entry)" appears in the stack list. 4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used. You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame. 5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't find it. 6. $VIEW(ENVIRONMENT) description. 7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously. ```

``` Files changes in this commit: - sr_i386/g_msf.si 1. Remove SFF_CI flag and resequence flags - sr_port/alias_funcs.c 1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI frame no longer being there. A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it has to do with the elimination of the GTM$CI frame. The frame chain used to look like this: CI-base-frame <- GTM$CI <- M routine Now it looks like: CI-base-frame <- M routine In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame. In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame. The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS the base frame. - sr_port/dollar_zlevel.c 1. Spruce up some comments to make more sense. - sr_port/f_text.c 1. Remove GTM$CI from the list of supressed values (no longer exists). - sr_port/fgncal.h 1. Remove declaration for fgncal_lookup.c (no longer used). - sr_port/fgncal_lookup.c 1. Deleted - no longer used - sr_port/gbldefs.c 1. Remove param_list global (no longer used). - sr_port/mdb_condition_handler.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. 4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0 instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1) and also the mumps -run base frame. - sr_port/mdef.h 1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of requiring it to be a compiler option. - sr_port/mprof_funcs.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. When running the stack, skip over both TRIGGER and CALLIN base frames. - sr_port/op_bindparm.c 1. Minor cosmetic changes and comment clarifications. - sr_port/op_clralsvals.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer need to unwind one more frame to the base frame as the flagged frame IS the base frame). - sr_port/op_halt.c 2. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/op_unwind.c 2. Add include of gtmio.h for debugging macro(s). - sr_port/op_zg1.c 2. If call-in mode, restart the frame we unwind to. This returns to the caller. - sr_port/op_zgoto.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. - sr_port/op_zhalt.c 1. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/parm_pool.c 1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in two forms: a. Form that gets its arguments from the varargs parameter list. b. Form that gets its arguments from a parmblk_struct. - sr_port/stack_frame.h 1. Remove SFF_CI flag and resequence flags to fill the gap. 2. Add SFT_CI type id. 3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames. - sr_port/tp_unwind.c 1. Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too. - sr_port/unw_mv_ent.c 1. Change debugging output to be more clear which path is being taken. - sr_port/zlput_rname.c 1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the M stack so it is more effectively utilized. - sr_port/zshow_stack.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Add marker for when call-in base frame shows up in stack. - sr_unix/ci_ret_code.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed. The ci_ret_code() routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp() which we are trying to avoid. That condition is now special cased in ZGOTO. 3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments added to better describe the dependencies - sr_unix/error_return.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Don't let CALL-IN mode do an EXIT here - force return to caller. - sr_unix/fgncalsp.h 1. Get rid of no longer needed pointer fields (now passed directly) - sr_unix/gtm_startup.c 1. Add comments on odd happenings. - sr_unix/gtm_trigger.c 1. Change an M stack running loop to work without GTM$CI frame. - sr_unix/gtm_unlink_all.c 1. Remove check for GTM$CI executable. - sr_unix/gtmci.c 1. Removed use of longjmp() to effect return from call-ins. 2. Change SFF_CI flag usage to SFT_CI type usage. 3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters appropriately for call-in to op_bindparm in generated M code. 4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call push_parm to set up called-routine parameters ourselves. - sr_unix/gtmci.h 1. Removed GTM$CI as an intermediate call-in frame. 2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame. - sr_unix/gtmci_isv.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Some comment cleanup. - sr_unix/gtmci_signals.c 1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere. - sr_unix/invocation_mode.h 1. Remove MUMPS_GTMCI_OFF macro as unused. 2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set). - sr_unix/jobchild_init.c 1. Don't create GTM$CI anymore (make_cimode() not needed and removed). 2. Create single base CI frame instead of base + GTM$CI frame. - sr_unix/make_cimode.c 1. Removed - no longer needed. - sr_unix/make_mode.c 1. Removed all capability to create GTM$CI as this "program" is no longer needed. 2. Removed all #ifdef of __ia64, __hpux, and _AIX. - sr_unix/ojchildparms.c 1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't matter what type of frame this is so long as there is one that can be used to return. - sr_unix/op_fnfgncal.c 1. Change SFF_CI flag usage to SFT_CI type usage. - sr_unix/relinkctl.c 1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame). - sr_unix/rtnhdr.h 1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a base frame doesn't screw things up and cause that NULL to be dereferenced. - sr_x86_64/ci_restart.s 1. Remove this parameter massaging routine as being unnecessary anymore. - sr_x86_64/g_msf.si 1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble). sr_port/op_fnview.c 1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER". Multiple comma separated tokens may be returned. sr_port/viewtab.h 1. Add "ENVIRONMENT" option. 2. Remove non-UNIX environment #ifdef code. - sr_linux/release_name.h - Change to R1.10 sr_port/alias.h: 1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not have to specify a compiler option. sr_port/alias_funcs.c: 1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables for when they get changed and there were the LVMON* view commands that help to debug aliases and are active only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG error since the view commands that followed it in the table were LVMON*. This update changes the name of the VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON* in the test system due to it requiring a compiler flag to be incorporated so no tests were affected. sr_port/gbldefs.c: sr_port/lv_val.h: sr_port/op_view.h sr_port/tp_unwind.c sr_port/viewtab.h 1. Rename lvmon* for lvamon*. sr_port/mdef.h 1. Comment out define for DEBUG_ALIAS ----------------------------------- This project significantly changes how call-ins works. To properly describe it, we need to first describe the current (original) operation of callins. The following write-up between the ------ lines below describes the current (r100 and previous versions) call-in operation: ----------------------------------- How call-ins work: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine is called GTM$CI. 9. - gtm_init_env(base_addr, transfer_addr) 10. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 11. - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the GTM$CI routine. 12. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by gtm_init_env()). - rvector is for GTM$CI routine. - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M). - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. 2. Executable frame (created by new_stack_frame() as called by gtm_init_env()). - rvector pointer is set to GTM$CI routine - Frame is flagged as as call-in gtm_ci[p]() - Perform call-in: 1. Executable frame set up in step YottaDB#11 of gtm_init() above has its execution field modified to point to the GTM$CI routine. 2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which will add a new stack frame and have the call-in parameters set up. 3. Drives dm_start() to invoke the callin. The actual call path goes like this: a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI). b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it. c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are calling into is invoked. d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where mdb_condition_handler was setup in dm_start(). e. dm_start() then returns to gtm_ci[p] for return value processing. The old parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms. 4. When op_extexfun() is driven, parms are shifted around for call to push_parm(). 5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms. 6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local vars in the called routine. Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the parameter conversion necessary at the time. The new support described next has made some changes that make this extra conversion step unnecessary. ----------------------------------- Following (between the *********** lines) is a description of how call-ins work with this new support in place: *********************************** How the new call-ins functionality works: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 9. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by jobchild_init()). - rvector is NULL as this frame is just for return. - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame. - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. gtm_ci[p]() - Perform call-in: 1. Create new executable frame (new_stack_frame()) for the called M routine to run in. 2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter. 3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by op_bindparm() which gets called at the top of the called target routine. See parameter flow below. 4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven. 5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp(). 6. dm_start() then returns to gtm_ci[p] for return value processing. The parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used for buffering parameters). 4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty) op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in the called routine then releases the parameter space for reuse. *********************************** ----------------------------------- Operational changes with this project in the execution of a call-in (note steps are from original code): - No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init YottaDB#8). - Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus the GTM$CI frame (step gtm_init YottaDB#9). - No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called directly (gtm_ci[p]() YottaDB#1). - After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in the M source code (gtm_ci[p]() YottaDB#2 and YottaDB#3). - The execution flow used to look like: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive GTM$CI "build" routine. * op_extexfun() * Drive target M routine * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start). * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there. * Return to C caller - The flow has similar steps but one less layer and returns without the use of a (longjmp) system call: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive target M routine * M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start. * return to gtm_ci[p] * return to C caller. ----------------------------------- Notes on the removal of GTM$CI. - A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller. - A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine" was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before calling the call-in routine. - Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but was created by the make_mode() routine. Here's the operation: a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall depending on whether has args or return value). The parm_blk also contains the parameters and other things needed to effect a call. b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to be GTM$CI. c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the input and output parameters for the call-in routine and drives it to create the stack. d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame. e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return. - So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the system call flavor of return is unnecessary. Wanted to avoid that too. - Two changes allowed us to be rid of the extra GTM$CI overhead: 1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return without being unwound. We make use of that so we can do the simple unwind of call-in levels. 2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them back off in the glue routine. This avoided one of the argument restructurings that were happening. - Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and it was moved to the base frame itself which is how triggers also does it. ----------------------------------- User visible changes in this project (for specification in release note): 1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings. 2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1. 3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text "(Call-In Level Entry)" appears in the stack list. 4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used. You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame. 5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't find it. 6. $VIEW(ENVIRONMENT) description. 7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously. ```

``` Files changes in this commit: - sr_i386/g_msf.si 1. Remove SFF_CI flag and resequence flags - sr_port/alias_funcs.c 1. Change SFF_CI flag usage to SFT_CI type usage & restructure checks due to GTM$CI frame no longer being there. A review question asked why this change removed an "fp = fp->old-frame_pointer" statement. The answer is that it has to do with the elimination of the GTM$CI frame. The frame chain used to look like this: CI-base-frame <- GTM$CI <- M routine Now it looks like: CI-base-frame <- M routine In the previous scheme the GTM$CI frame had a "flag value" SFF_CI that indicated the M call was a call-in frame. In the new scheme the the base frame has a "type value" with SFT_CI set that indicates this is a call-in frame. The removed statement was because once we have noted the frame with the type set for a call-in frame, we had to back one more frame to get to its base frame. We no longer have to do that as the frame with the type flag IS the base frame. - sr_port/dollar_zlevel.c 1. Spruce up some comments to make more sense. - sr_port/f_text.c 1. Remove GTM$CI from the list of supressed values (no longer exists). - sr_port/fgncal.h 1. Remove declaration for fgncal_lookup.c (no longer used). - sr_port/fgncal_lookup.c 1. Deleted - no longer used - sr_port/gbldefs.c 1. Remove param_list global (no longer used). - sr_port/mdb_condition_handler.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. 4. Fix loop looking for $ETRAP defined when $ZTRAP set to explicit NULL to stop at frame 0 instead of frame 1. This allows it to find call-in base frame (now frame 0 instead of 1) and also the mumps -run base frame. - sr_port/mdef.h 1. Move DBGALS macro here from alias.h so can define DEBUG_ALIAS right here instead of requiring it to be a compiler option. - sr_port/mprof_funcs.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. When running the stack, skip over both TRIGGER and CALLIN base frames. - sr_port/op_bindparm.c 1. Minor cosmetic changes and comment clarifications. - sr_port/op_clralsvals.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Modify the loop to no longer assume a GTM$CI frame when call-ins are in use (no longer need to unwind one more frame to the base frame as the flagged frame IS the base frame). - sr_port/op_halt.c 2. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/op_unwind.c 2. Add include of gtmio.h for debugging macro(s). - sr_port/op_zg1.c 2. If call-in mode, restart the frame we unwind to. This returns to the caller. - sr_port/op_zgoto.c 1. Cleanups - remove #ifdef for UNIX (leaving code in place since all is UNIX now). 2. Cleanups - remove VMS conditional code. 3. Change SFF_CI flag usage to SFT_CI type usage. - sr_port/op_zhalt.c 1. If call-in mode, do NOT just exit - make sure to return to the call-in caller. - sr_port/parm_pool.c 1. Move the push_parm() routine out to sr_port/push_parm_src.h so it can be include in two forms: a. Form that gets its arguments from the varargs parameter list. b. Form that gets its arguments from a parmblk_struct. - sr_port/stack_frame.h 1. Remove SFF_CI flag and resequence flags to fill the gap. 2. Add SFT_CI type id. 3. Modify SKIP_BASE_FRAME() macro to skip both TRIGGER and CALL-IN base frames. - sr_port/tp_unwind.c 1. Remove #ifdef DEBUG wrapper on debugging includes so they work with a pro build too. - sr_port/unw_mv_ent.c 1. Change debugging output to be more clear which path is being taken. - sr_port/zlput_rname.c 1. Since SKIP_BASE_FRAME macro now handles call-in base frames and because there's one fewer frames than there used to be when GTM$CI was used, reorganize the loop(s) traversing the M stack so it is more effectively utilized. - sr_port/zshow_stack.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Add marker for when call-in base frame shows up in stack. - sr_unix/ci_ret_code.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Remove the ci_ret_code() and ci_ret_code_exit() routines as not needed. The ci_ret_code() routine did a longjmp() to "return" from the call-in - now replaced by a normal unwind procedure with no system calls. The ci_ret_code_exit() routine was set into the call-in base-frame and driven when ZGOTO 0 was done. It is removed because it too did a longjmp() which we are trying to avoid. That condition is now special cased in ZGOTO. 3. The ci_ret_code_quit() routine (called to unwind the CALL-IN base frame) had comments added to better describe the dependencies - sr_unix/error_return.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Don't let CALL-IN mode do an EXIT here - force return to caller. - sr_unix/fgncalsp.h 1. Get rid of no longer needed pointer fields (now passed directly) - sr_unix/gtm_startup.c 1. Add comments on odd happenings. - sr_unix/gtm_trigger.c 1. Change an M stack running loop to work without GTM$CI frame. - sr_unix/gtm_unlink_all.c 1. Remove check for GTM$CI executable. - sr_unix/gtmci.c 1. Removed use of longjmp() to effect return from call-ins. 2. Change SFF_CI flag usage to SFT_CI type usage. 3. Changed parameter handling to eliminate need for GTM$CI to reformat parameters appropriately for call-in to op_bindparm in generated M code. 4. Eliminated need to drive op_extcall/op_extexfun to start M routine. We now call push_parm to set up called-routine parameters ourselves. - sr_unix/gtmci.h 1. Removed GTM$CI as an intermediate call-in frame. 2. Modified SET_CI_ENV() macro to setup actual base frame instead of intermediate frame. - sr_unix/gtmci_isv.c 1. Change SFF_CI flag usage to SFT_CI type usage. 2. Some comment cleanup. - sr_unix/gtmci_signals.c 1. Remove reference to invocation flag MUMPS_GTMCI which was never being set anywhere. - sr_unix/invocation_mode.h 1. Remove MUMPS_GTMCI_OFF macro as unused. 2. Remove MUMPS_GTMCI mode and resequence modes to fill gap (MUMPS_GTMCI never set). - sr_unix/jobchild_init.c 1. Don't create GTM$CI anymore (make_cimode() not needed and removed). 2. Create single base CI frame instead of base + GTM$CI frame. - sr_unix/make_cimode.c 1. Removed - no longer needed. - sr_unix/make_mode.c 1. Removed all capability to create GTM$CI as this "program" is no longer needed. 2. Removed all #ifdef of __ia64, __hpux, and _AIX. - sr_unix/ojchildparms.c 1. Change the "dummy" base frame generated from GTM$CI frame to GTM$DMOD frame. It doesn't matter what type of frame this is so long as there is one that can be used to return. - sr_unix/op_fnfgncal.c 1. Change SFF_CI flag usage to SFT_CI type usage. - sr_unix/relinkctl.c 1. Change M stack frame loop to compensate for the lack of the GTM$CI frame and which frame now has the CI marker (was flag field in GTM$CI frame, now type field in CI base frame). - sr_unix/rtnhdr.h 1. Modify CLEANUP_COPIED_RECURSIVE_RTN() macro such that a null routine header address in a base frame doesn't screw things up and cause that NULL to be dereferenced. - sr_x86_64/ci_restart.s 1. Remove this parameter massaging routine as being unnecessary anymore. - sr_x86_64/g_msf.si 1. Remove SFF_CI (no longer used) and SFF_ETRAP_ERR (still defined but not used in assemble). sr_port/op_fnview.c 1. Add "ENVIRONMENT" returns possible environment tokens "MUMPS", "MUPIP", "CALLIN", or "TRIGGER". Multiple comma separated tokens may be returned. sr_port/viewtab.h 1. Add "ENVIRONMENT" option. 2. Remove non-UNIX environment #ifdef code. - sr_linux/release_name.h - Change to R1.10 sr_port/alias.h: 1. Move DBGALS related macros to mdef.h so we can put the define for DEBUG_ALIAS right there and not have to specify a compiler option. sr_port/alias_funcs.c: 1. The LVMON* view was overloaded. There was the LVMON command that monitors given/specified local variables for when they get changed and there were the LVMON* view commands that help to debug aliases and are active only when DEBUG_ALIAS is defined. When this define was enabled, the LVMON command was giving a VIEWAMBIG error since the view commands that followed it in the table were LVMON*. This update changes the name of the VIEW commands, the related routines and fields to be lvamon* instead of lvmon*. There are no tests of LVMON* in the test system due to it requiring a compiler flag to be incorporated so no tests were affected. sr_port/gbldefs.c: sr_port/lv_val.h: sr_port/op_view.h sr_port/tp_unwind.c sr_port/viewtab.h 1. Rename lvmon* for lvamon*. sr_port/mdef.h 1. Comment out define for DEBUG_ALIAS ----------------------------------- This project significantly changes how call-ins works. To properly describe it, we need to first describe the current (original) operation of callins. The following write-up between the ------ lines below describes the current (r100 and previous versions) call-in operation: ----------------------------------- How call-ins work: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - make_cimode() with returned addr of created routine set into "base_addr". This dynamically created routine is called GTM$CI. 9. - gtm_init_env(base_addr, transfer_addr) 10. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 11. - new_stack_frame(base_addr, PTEXT_ADR(base_addr), transfer_addr) - creates an executable frame for the GTM$CI routine. 12. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by gtm_init_env()). - rvector is for GTM$CI routine. - Execution address is ci_ret_code_exit() (again, only used for ZGOTO 0 aka process exit call from M). - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. 2. Executable frame (created by new_stack_frame() as called by gtm_init_env()). - rvector pointer is set to GTM$CI routine - Frame is flagged as as call-in gtm_ci[p]() - Perform call-in: 1. Executable frame set up in step #11 of gtm_init() above has its execution field modified to point to the GTM$CI routine. 2. Creates parameter block including address of call-in routine and address of either op_extcall() or op_extexfun() which will add a new stack frame and have the call-in parameters set up. 3. Drives dm_start() to invoke the callin. The actual call path goes like this: a. dm_start() - establishes mdb_condition_handler() and sets up to drive M code. Invokes top stack frame (GTM$CI). b. ci_restart() - this is the first thing done in GTM$CI. This takes the parmblk_struct passed in the global var param_list and restructures the inputs for a call to op_extcall() or op_extexfun() and jumps to it. c. op_extcall() or op_extexfun() allocate a new stack frame and "return" to it such that the actual routine we are calling into is invoked. d. When the M code returns, it returns to GTM$CI and runs ci_ret_code() which drives a longjmp back to where mdb_condition_handler was setup in dm_start(). e. dm_start() then returns to gtm_ci[p] for return value processing. The old parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. When ci_restart is driven, parms are converted from parm_blk to register/stack parms. 4. When op_extexfun() is driven, parms are shifted around for call to push_parm(). 5. When push_parm() is driven from op_extexfun(), register/stack parms are converted to parm_blk parms. 6. When called-in routine is driven, its parms are picked up from parm_blk by op_bindparm() and bound to local vars in the called routine. Note, the entire purpose of the GTM$CI routine is to create the stack frame the call-in will run in and to set up its parameters in a fashion that op_bindparm() can read them. Since the op_bindparm of the day expected a varargs list (albeit in a different fashion from the varargs list passed into gtm_ci[p]()), at that time, the GTM$CI was written to provide the parameter conversion necessary at the time. The new support described next has made some changes that make this extra conversion step unnecessary. ----------------------------------- Following (between the *********** lines) is a description of how call-ins work with this new support in place: *********************************** How the new call-ins functionality works: gtm_init() - Initializes YDB runtime 1. - image_type set to GTM_IMAGE 2. - invocation mode set to MUMPS_CALLIN 3. - init_gtm() 4. - gtm_startup() 5. - allocate M stack 6. - create (dummy) base frame with return addr of gtm_ret_code() 7. - jobchild_init() 8. - base_frame(base_addr transfer_addr) - creates "another" base frame with an initial execution addr of gtm_ret_code() but with rvector set to addr of GTM$CI. 9. - Runs SET_CI_ENV() which changes the executable frame created so it has the SF_CI call-in flag set, and and changes the base frame's execution address to ci_ret_code_exit(). This base frame is only used when there's a ZGOTO 0. Stack after gtm_init(): 0. BaseFrame (created by gtm_startup()) - probably just to have *something* on the stack in case later initialization fails. - This baseframe has a zeroed rvector with an execution address of gtm_ret_code(). There is no previous pointer nor is there an unwind frame pointer hiding behind the frame like most base frames have - This is the true bottom of the stack in terms of stack frames though an mv_stent is on the stack before this. - In practise, we won't see this frame as it isn't used past initialization. 1. BaseFrame (created by base_frame() as called by jobchild_init()). - rvector is NULL as this frame is just for return. - Execution address is gtm_levl_ret_code() - an entry point in dm_start() used to return without unwinding the frame. - The address field on the stack immediately prior to this base frame pointer is an actual unwind address to base frame 0. gtm_ci[p]() - Perform call-in: 1. Create new executable frame (new_stack_frame()) for the called M routine to run in. 2. Creates parameter block which contains a newly constructed lv_val for each incoming parameter. 3. Cal push_parm_ci() which takes the parameter block built by gtm_ci[p]() and moves the parms to the the parameter area used by op_bindparm() which gets called at the top of the called target routine. See parameter flow below. 4. Drives dm_start() to invoke the callin. Since the M stack frame for the called routine is on top, it gets immediately driven. 5. When the M code returns, it returns to gtm_levl_ret_code which does a simple return to dm_start() without using longjmp(). 6. dm_start() then returns to gtm_ci[p] for return value processing. The parameter flow is: 1. Parms are passed in as part of the call to gtm_ci[p](). 2. gtm_ci[p]() reformats the parameters into a "parm_blk". 3. gtm_ci[p]() drives push_parm_ci() to setup the parms for processing by op_bindparm() (moves them into a specific area used for buffering parameters). 4. When called-in routine is driven (op_bindparm() gets driven at all entryrefs with parameter lists defined - even if empty) op_bindparm pulls the parms from the parm pool parameter space created by push_parm_ci() and binds them to local vars in the called routine then releases the parameter space for reuse. *********************************** ----------------------------------- Operational changes with this project in the execution of a call-in (note steps are from original code): - No longer call make_cimode() to generate the internal GTM$CI routine (step gtm_init #8). - Call base_frame() instead of gtm_init_env() to create ONLY the call-in base frame instead of the base frame plus the GTM$CI frame (step gtm_init #9). - No longer setup GTM$CI as part of gtm_ci[p](). Instead, we allocate the frame using new_stack_frame() called directly (gtm_ci[p]() #1). - After the call-in execution frame is created by gtm_ci[p](), a variant of the push_parm routine is called that takes the parameters from the parm block that gtm_ci[p] created from the incoming arguments and converts them so they can be processed by op_bindparm() which is a routine called by generated code anytime a label with arguments occurs in the M source code (gtm_ci[p]() #2 and #3). - The execution flow used to look like: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive GTM$CI "build" routine. * op_extexfun() * Drive target M routine * Target routine returns to GTM$CI which then drives ci_ret_code (longjmp back to dm_start). * return to gtm_ci[p] via longjmp() in ci_ret_code() which returns to dm_start() and returns from there. * Return to C caller - The flow has similar steps but one less layer and returns without the use of a (longjmp) system call: * C [possibly main] routine * call gtm_ci[p]() * drive dm_start() to kick off transition to M mode execution * drive target M routine * M returns to gtm_levl_ret_code which returns from an M routine without unwinding it which unwinds to dm_start. * return to gtm_ci[p] * return to C caller. ----------------------------------- Notes on the removal of GTM$CI. - A call-in previously did not return "normally" by a return statement that unwound the stack. Instead it invoked a system call (longjmp()) to unwind the stack and return to dm_start() which then returned "normally" to the caller. - A much faster way to return would be to just - return but GTM$CI complicated that. The purpose of the GTM$CI "routine" was to set up arguments and call an assembler "glue code" routine to put the parms where they needed to be before calling the call-in routine. - Specifically, GTM$CI was a "constructed" routine (much like GTM$DMOD) in that it was not built from M source but was created by the make_mode() routine. Here's the operation: a. gtm_ci[p]() builds a parm_blk that contains the name of the glue routine to call (op_extexfun or op_extcall depending on whether has args or return value). The parm_blk also contains the parameters and other things needed to effect a call. b. gtm_ci[p]() drives dm_start() which enters M mode and drives the top routine on the M stack which happens to be GTM$CI. c. GTM$CI's first function is to drive ci_restart() which is an assembler routine sort of like a specialized version of callg() that takes the routine to call (e.g. op_extexfun), the routine/label to call and all the input and output parameters for the call-in routine and drives it to create the stack. d. When op_extexfun() completes, it drives the top routine on the M stack which is now the call-in frame. e. When the callin frame returns to GTM$CI, it drives ci_ret_code() which does a longjmp() to return. - So GTM$CI's primary purpose was to allocate the stack frame, setup the arguments and drive the glue code that made the actual call. This means the arguments were reformatted at least twice. Wanted to avoid that. Also, the system call flavor of return is unnecessary. Wanted to avoid that too. - Two changes allowed us to be rid of the extra GTM$CI overhead: 1. With the routine gtm_levl_ret_code added to GT.M for triggers, it became possible for a stackframe to return without being unwound. We make use of that so we can do the simple unwind of call-in levels. 2. By changing the push_parm() routine (in parm pool) so we it can pull the arguments for the call-in routine directly out of the parm block created by gtm_ci[p] instead of having to push them on the stack and pull them back off in the glue routine. This avoided one of the argument restructurings that were happening. - Because GTM$CI went away, the SFF_CI flag was changed to a type flag instead (since that is what it actually is) and it was moved to the base frame itself which is how triggers also does it. ----------------------------------- User visible changes in this project (for specification in release note): 1. There is no GTM$CI level anymore (no longer needed). So this name no longer shows up in stack listings. 2. Because there is no GTM$CI level anymore, the $STACK and $ZLEVEL SVNs show one less than they used to in a call-in environment. These SVNs now mirror the levels one would get by using mumps -run. The first level executing routine is $ZLEVEL=1 and $STACK=0 instead of the previous $ZLEVEL=2 and $STACK=1. 3. ZSHOW "S" shows the entire stack. It used to stop at the first call-in frame and not report any further back. A stack marker shows where call-in base frames are located in the stack. Where a call-in frame is detected, the text "(Call-In Level Entry)" appears in the stack list. 4. When replacing a routine that is active on the stack, we ran the stack backwards to verify the routine was not being used. You can still replace an active routine but it is a special case. Unfortunately the loop was again stopping at the first call-in frame and not looking further back. If a routine being replaced was on the stack further back than that call-in base-frame, ugly stuff was likely to occur when we unwound back to that earlier frame. 5. Similar issue when M-Profiling is looking up an entry ref on the stack. Need to figure out what happens when it doesn't find it. 6. $VIEW(ENVIRONMENT) description. 7. [Z]HALT in a call-in do not halt but return to the caller as they should but did not previously. ```

…andler in threaded code ``` The v63000/gtm8394 subtest failed an assert with the following stack trace. #0 0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6 #1 0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69 #2 0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211 YottaDB#3 0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59 YottaDB#4 0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153 YottaDB#5 0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85 YottaDB#6 0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33 YottaDB#7 0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153 YottaDB#8 0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85 YottaDB#9 0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074 YottaDB#10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724 YottaDB#11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392 YottaDB#12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436 YottaDB#13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499 YottaDB#14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483 YottaDB#15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064 YottaDB#16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535 YottaDB#17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228 YottaDB#18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0 YottaDB#19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6 This is a test where a memory-error is forced (using limit vmemorysize). And various rollbacks are run. One of them runs with multiple threads and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit and in turn that signals other threads to exit which is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided an out-of-memory situation implies we will abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). And the UNWIND macro had an assert that we better not be inside multi-threaded code. But that is exactly where we were in this failure. The reason why the UNWIND has that logic is because in pro it would return control to the erroring thread and let it continue processing but we would not have released the pthread-mutex-lock that we obtained in rts_error_va() for this thread. That means all other threads will not be able to get this lock for various actions they do until the erroring thread tries to obtain the lock again (at which point we would check that we already hold the lock and not try to get the lock again) and later when we release it, other threads will be able to get the thread lock. The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg). The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error in some cases could in the worst case transform a multi-threaded recovery to a non-threaded recovery command thereby slowing it down. No other user-visible implications are expected out of this. ```

…andler in threaded code ``` The v63000/gtm8394 subtest failed an assert with the following stack trace. #0 0x00007f3f038734c7 in kill () from /usr/lib64/libc.so.6 #1 0x00000000006c2413 in gtm_dump_core () at R110/sr_unix/gtm_dump_core.c:69 #2 0x00000000006d5dd0 in gtm_fork_n_core () at R110/sr_unix/gtm_fork_n_core.c:211 #3 0x0000000000695b5b in ch_cond_core () at R110/sr_unix/ch_cond_core.c:59 #4 0x000000000087e6ba in rts_error_va (csa=0x0, argcnt=7, var=0x7f3ef864e178) at R110/sr_unix/rts_error.c:153 #5 0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=7) at R110/sr_unix/rts_error.c:85 #6 0x0000000000916610 in hashtab_rehash_ch (arg=150373340) at R110/sr_port/hashtab_rehash_ch.c:33 #7 0x000000000087ec12 in rts_error_va (csa=0x0, argcnt=5, var=0x7f3ef864e438) at R110/sr_unix/rts_error.c:153 #8 0x000000000087dca4 in rts_error_csa (csa=0x0, argcnt=5) at R110/sr_unix/rts_error.c:85 #9 0x00000000008fa778 in raise_gtmmemory_error () at R110/sr_port/gtm_malloc_src.h:1074 #10 0x00000000008f5ee2 in gtm_malloc (size=835672) at R110/sr_port/gtm_malloc_src.h:724 #11 0x0000000000978722 in init_hashtab_intl_int8 (table=0x7f3ef864e780, minsize=24594, old_table=0x10e8718 <murgbl+88>) at R110/sr_port/hashtab_implementation.h:392 #12 0x000000000097971e in expand_hashtab_int8 (table=0x10e8718 <murgbl+88>, minsize=24594) at R110/sr_port/hashtab_implementation.h:436 #13 0x000000000097a063 in add_hashtab_intl_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0, changing_table_size=0) at R110/sr_port/hashtab_implementation.h:499 #14 0x000000000097a005 in add_hashtab_int8 (table=0x10e8718 <murgbl+88>, key=0x7f3f04b32190, value=0x7f3f04b32190, tabentptr=0x7f3ef864eaa0) at R110/sr_port/hashtab_implementation.h:483 #15 0x000000000052a9cc in mur_back_processing_one_region (mur_back_options=0x7f3ef864ee40) at R110/sr_port/mur_back_process.c:1064 #16 0x0000000000523e09 in mur_back_phase1 (rctl=0x2e8fc20) at R110/sr_port/mur_back_process.c:535 #17 0x00000000006e75b8 in gtm_multi_thread_helper (tparm=0x7ffe5753ef30) at R110/sr_unix/gtm_multi_thread.c:228 #18 0x00007f3f03629e25 in start_thread () from /usr/lib64/libpthread.so.0 #19 0x00007f3f0393634d in clone () from /usr/lib64/libc.so.6 This is a test where a memory-error is forced (using limit vmemorysize). And various rollbacks are run. One of them runs with multiple threads and one thread gets a memory error during hashtable expansion. Normally a memory error causes the thread to exit and in turn that signals other threads to exit which is handled fine. But in this case, the condition handler hashtab_rehash_ch() did an UNWIND because it decided an out-of-memory situation implies we will abort the expansion and continue with the previous hashtable (this was a good-to-expand call, not a need-to-expand call). And the UNWIND macro had an assert that we better not be inside multi-threaded code. But that is exactly where we were in this failure. The reason why the UNWIND has that logic is because in pro it would return control to the erroring thread and let it continue processing but we would not have released the pthread-mutex-lock that we obtained in rts_error_va() for this thread. That means all other threads will not be able to get this lock for various actions they do until the erroring thread tries to obtain the lock again (at which point we would check that we already hold the lock and not try to get the lock again) and later when we release it, other threads will be able to get the thread lock. The fix is to make sure we release the thread-level lock in the UNWIND macro (and assert that we do hold the lock in dbg). The pro implication of this issue is that a MUPIP JOURNAL command that encounters a memory error in some cases could in the worst case transform a multi-threaded recovery to a non-threaded recovery command thereby slowing it down. No other user-visible implications are expected out of this. ```

…me errors as they could cause SIG-11 YottaDB#90 A few issues related to compile-time errors were discovered. 1) The below M program correctly issues a PATNOTFOUND error when compiling. But if one tries to run the compiled object code (which should be okay since the execution does not reach the portions of the M code where the compiler error was found), a SIG-11 is observed. This happens only with GT.M V63002 (and in turn YottaDB r1.10) but not with V63001A (and in turn YottaDB r1.00). ``` > cat test.m main ; do good quit bad ; if 1?1B quit good ; write "hello",! quit ``` Related to the above, the below M program test1.m produces a GTMASSERT when run with a debug build. Unlike the previous test case (test.m), the production build did not have problems with test1.m. ``` > cat test1.m if 1?1B > $gtm_dist/mumps -run test1 %GTM-F-GTMASSERT, GT.M V6.3-002 Linux x86_64 - Assert failed /Distrib/GT.M/V63002/sr_port/chktchain.c line 28 ``` The primary issue in both the above tests was in bx_boollit() which noticed a pattern match operator usage with both operands being literals and hence invoked do_patfixed() which encountered a PATNOTFOUND error. That caused ins_errtriple() to be invoked which in turn removed all triples corresponding to the current M line (dqdelchain() call) and returned back to bx_boollit() which did not realize this and went ahead with manipulating the triple chains (dqrins() call etc.) and returned to its caller bool_expr() which also did triple chain manipulation (dqdel() call etc.) all the while operating on triples that were no longer part of the execution chain (due to the prior delqchain() call). This caused a corruption in the doubly-linked triple list in "t_orig" which resulted in incorrect object code being generated that later ended up as the SIG-11 when one tried running this M program. In GT.M V63002, boolean expression evaluation and literal optimization got a significant rework. As part of that change, the macros RETURN_IF_RTS_ERROR and RETURN_EXPR_IF_RTS_ERROR were introduced to check for compile-time errors and if so return from functions right away instead of manipulating triple chains. These safety checks needed to be added in a few more places. That fixed the primary issue. 2) In addition, it was noticed that the following M program fails an assert when run with the debug build. ``` > cat test2.m xecute "if ""a""?1B" > mumps -run test2 %GTM-F-ASSERT, Assert failed in /Distrib/GT.M/V63002/sr_port/zlcompile.c line 81 for expression ((FALSE == run_time) && (TRUE == TREF(compile_time))) ``` Below is the corresponding C-stack. ``` #0 0x00007ff2e6988767 in kill () at ../sysdeps/unix/syscall-template.S:84 #1 0x00007ff2e6014a5c in gtm_dump_core () at /Distrib/GT.M/V63002/sr_unix/gtm_dump_core.c:69 #2 0x00007ff2e5f1de97 in gtm_fork_n_core () at /Distrib/GT.M/V63002/sr_unix/gtm_fork_n_core.c:211 YottaDB#3 0x00007ff2e6007f2b in ch_cond_core () at /Distrib/GT.M/V63002/sr_unix/ch_cond_core.c:59 YottaDB#4 0x00007ff2e5f443a2 in rts_error_va (csa=0x0, argcnt=7, var=0x7ffffc4b0a90) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:153 YottaDB#5 0x00007ff2e5f439b8 in rts_error_csa (csa=0x0, argcnt=7) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:85 YottaDB#6 0x00007ff2e636d64c in zlcompile (len=48 '0', addr=0x7ffffc4b0e30 "/extra1/testarea1/nars/test/temp/tmp/tmp/test2.m") at /Distrib/GT.M/V63002/sr_port/zlcompile.c:81 YottaDB#7 0x00007ff2e60e6f1c in op_zlink (v=0x7ffffc4b14a0, quals=0x7ffffc4b0cf0) at /Distrib/GT.M/V63002/sr_unix/op_zlink.c:443 YottaDB#8 0x00007ff2e5f6a2d7 in job_addr (rtn=0x7ffffc4b1590, label=0x7ffffc4b15a0, offset=0, hdr=0x7ffffc4b1518, labaddr=0x7ffffc4b1510, need_rtnobj_shm_free=0x7ffffc4b14e4) at /Distrib/GT.M/V63002/sr_port/job_addr.c:41 YottaDB#9 0x00007ff2e5f40b48 in jobchild_init () at /Distrib/GT.M/V63002/sr_unix/jobchild_init.c:146 YottaDB#10 0x00007ff2e5f3835d in gtm_startup (svec=0x7ffffc4b1d30) at /Distrib/GT.M/V63002/sr_unix/gtm_startup.c:252 YottaDB#11 0x00007ff2e5f3b2f6 in init_gtm () at /Distrib/GT.M/V63002/sr_unix/init_gtm.c:201 YottaDB#12 0x00007ff2e5f072ea in gtm_main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm_main.c:162 YottaDB#13 0x0000000000400cbe in main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm.c:131 ``` In this case, run_time was TRUE and caused the assert failure. Turns out this was due to m_xecute() function (invoked by zlcompile()) temporarily setting run_time to FALSE but when a PATNOTFOUND error was encountered, the condition handler compiler_ch() was invoked which did an UNWIND back to zlcompile() incorrectly persisting the global variable changes done by the interim function call of m_xecute(). The fix for this was to reset the run_time and TREF(xecute_literal_parse) global variables just like is being done in mdb_condition_handler().

…me errors as they could cause SIG-11 #90 A few issues related to compile-time errors were discovered. 1) The below M program correctly issues a PATNOTFOUND error when compiling. But if one tries to run the compiled object code (which should be okay since the execution does not reach the portions of the M code where the compiler error was found), a SIG-11 is observed. This happens only with GT.M V63002 (and in turn YottaDB r1.10) but not with V63001A (and in turn YottaDB r1.00). ``` > cat test.m main ; do good quit bad ; if 1?1B quit good ; write "hello",! quit ``` Related to the above, the below M program test1.m produces a GTMASSERT when run with a debug build. Unlike the previous test case (test.m), the production build did not have problems with test1.m. ``` > cat test1.m if 1?1B > $gtm_dist/mumps -run test1 %GTM-F-GTMASSERT, GT.M V6.3-002 Linux x86_64 - Assert failed /Distrib/GT.M/V63002/sr_port/chktchain.c line 28 ``` The primary issue in both the above tests was in bx_boollit() which noticed a pattern match operator usage with both operands being literals and hence invoked do_patfixed() which encountered a PATNOTFOUND error. That caused ins_errtriple() to be invoked which in turn removed all triples corresponding to the current M line (dqdelchain() call) and returned back to bx_boollit() which did not realize this and went ahead with manipulating the triple chains (dqrins() call etc.) and returned to its caller bool_expr() which also did triple chain manipulation (dqdel() call etc.) all the while operating on triples that were no longer part of the execution chain (due to the prior delqchain() call). This caused a corruption in the doubly-linked triple list in "t_orig" which resulted in incorrect object code being generated that later ended up as the SIG-11 when one tried running this M program. In GT.M V63002, boolean expression evaluation and literal optimization got a significant rework. As part of that change, the macros RETURN_IF_RTS_ERROR and RETURN_EXPR_IF_RTS_ERROR were introduced to check for compile-time errors and if so return from functions right away instead of manipulating triple chains. These safety checks needed to be added in a few more places. That fixed the primary issue. 2) In addition, it was noticed that the following M program fails an assert when run with the debug build. ``` > cat test2.m xecute "if ""a""?1B" > mumps -run test2 %GTM-F-ASSERT, Assert failed in /Distrib/GT.M/V63002/sr_port/zlcompile.c line 81 for expression ((FALSE == run_time) && (TRUE == TREF(compile_time))) ``` Below is the corresponding C-stack. ``` #0 0x00007ff2e6988767 in kill () at ../sysdeps/unix/syscall-template.S:84 #1 0x00007ff2e6014a5c in gtm_dump_core () at /Distrib/GT.M/V63002/sr_unix/gtm_dump_core.c:69 #2 0x00007ff2e5f1de97 in gtm_fork_n_core () at /Distrib/GT.M/V63002/sr_unix/gtm_fork_n_core.c:211 #3 0x00007ff2e6007f2b in ch_cond_core () at /Distrib/GT.M/V63002/sr_unix/ch_cond_core.c:59 #4 0x00007ff2e5f443a2 in rts_error_va (csa=0x0, argcnt=7, var=0x7ffffc4b0a90) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:153 #5 0x00007ff2e5f439b8 in rts_error_csa (csa=0x0, argcnt=7) at /Distrib/GT.M/V63002/sr_unix/rts_error.c:85 #6 0x00007ff2e636d64c in zlcompile (len=48 '0', addr=0x7ffffc4b0e30 "/extra1/testarea1/nars/test/temp/tmp/tmp/test2.m") at /Distrib/GT.M/V63002/sr_port/zlcompile.c:81 #7 0x00007ff2e60e6f1c in op_zlink (v=0x7ffffc4b14a0, quals=0x7ffffc4b0cf0) at /Distrib/GT.M/V63002/sr_unix/op_zlink.c:443 #8 0x00007ff2e5f6a2d7 in job_addr (rtn=0x7ffffc4b1590, label=0x7ffffc4b15a0, offset=0, hdr=0x7ffffc4b1518, labaddr=0x7ffffc4b1510, need_rtnobj_shm_free=0x7ffffc4b14e4) at /Distrib/GT.M/V63002/sr_port/job_addr.c:41 #9 0x00007ff2e5f40b48 in jobchild_init () at /Distrib/GT.M/V63002/sr_unix/jobchild_init.c:146 #10 0x00007ff2e5f3835d in gtm_startup (svec=0x7ffffc4b1d30) at /Distrib/GT.M/V63002/sr_unix/gtm_startup.c:252 #11 0x00007ff2e5f3b2f6 in init_gtm () at /Distrib/GT.M/V63002/sr_unix/init_gtm.c:201 #12 0x00007ff2e5f072ea in gtm_main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm_main.c:162 #13 0x0000000000400cbe in main (argc=3, argv=0x7ffffc4b4048, envp=0x7ffffc4b4068) at /Distrib/GT.M/V63002/sr_unix/gtm.c:131 ``` In this case, run_time was TRUE and caused the assert failure. Turns out this was due to m_xecute() function (invoked by zlcompile()) temporarily setting run_time to FALSE but when a PATNOTFOUND error was encountered, the condition handler compiler_ch() was invoked which did an UNWIND back to zlcompile() incorrectly persisting the global variable changes done by the interim function call of m_xecute(). The fix for this was to reset the run_time and TREF(xecute_literal_parse) global variables just like is being done in mdb_condition_handler().

…back process This is an issue identified based on a rare ideminter_rolrec/interrupted_rollback_or_recover subtest failure. The test ran 4 rollbacks and killed all but the last one using kill -9 before they finished. The last rollback failed an assert. %YDB-F-ASSERT, Assert failed in sr_unix/mutex.c line 1045 for expression (((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno) Below is the C-stack (gdb) where #0 kill () at ../sysdeps/unix/syscall-template.S:84 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:69 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 ch_cond_core () at sr_unix/ch_cond_core.c:64 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fff2eebdeb0) at sr_unix/rts_error.c:159 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:92 #6 mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045 #7 gtm_mutex_lock (reg=0x106f960, mutex_spin_parms=0x7f7690440080, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE) at sr_unix/mutex.c:703 #8 grab_lock (reg=0x106f960, is_blocking_wait=1, onln_rlbk_action=1) at sr_unix/grab_lock.c:83 #9 mur_open_files () at sr_port/mur_open_files.c:492 #10 mupip_recover () at sr_port/mupip_recover.c:195 #11 mupip_main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip_main.c:124 #12 dlopen_libyottadb (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688, main_func=0x401424 "mupip_main") at sr_unix/dlopen_libyottadb.c:148 #13 main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip.c:19 (gdb) f 6 #6 0x00007f76931b75bd in mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045 1045 assert(((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno); (gdb) p lastJplCmt->jnl_seqno $1 = 323033 (gdb) p jpl->jnl_seqno $2 = 293120 We were expecting the two seqnos to be 1 apart but they are way apart. This is because had killed a prior rollback (the first rollback) just before it had finished the rollback. Below is its log. > cat ROLLBACK2_1.logx . . %YDB-I-RLBKJNSEQ, Journal seqno of the instance after rollback is 293120 [0x0000000000047900] . %YDB-I-FILERENAME, File ideminter_rolrec_0_7/interrupted_rollback_or_recover/g.mjl_2018284231355 is renamed to ideminter_rolrec_0_7/interrupted_rollback_or_recover/rolled_bak_g.mjl_2018284231355 Killed The fact that it printed the RLBKJNSEQ and FILERENAME messages implies it was in mur_close_files() when it was killed. There is code in mur_close_files() where we reset various fields in "jpl" to reset the state of the journal pool based on the post-rollback instance seqno. This code also needs to clear a few 2-phase-jnl-commit related fields so "mutex_salvage" when it comes in later (in this test, it came in as part of the later rollback) after the kill -9 does not fail the above assert. This failure can be easily reproduced by running an online rollback with the -resync qualifier to take back the state of the instance to a prior seqno, setting a break point in "rel_lock()" and quitting from the debugger once that break point is hit (this simulates kill -9 of online rollback). Reissuing the same online rollback out of the debugger should show the assert failure. Reissuing the online rollback with a production build did not show any issues so the suspicion is that this is a dbg-only issue hence no tracking is done as a separate issue at gitlab.

When ydb_chset env var is set to "M", compiling the following line set c=$PIECE("Hello "_$ZCH(190)_" world!",$ZCH(191),1,2) Failed an assert %YDB-F-ASSERT, Assert failed in sr_unix/gtm_utf8.c line 273 for expression (gtm_utf8_mode) with the following C-stack #6 utf8_badchar_real () at sr_unix/gtm_utf8.c:273 #7 utf8_badchar_dec () at sr_unix/gtm_utf8.c:249 #8 valid_utf_string () at sr_unix/gtm_utf8.c:410 #9 op_fnzpiece () at sr_port/op_fnzpiece.c:53 #10 f_piece () at sr_unix/f_piece.c:171 #11 expritem () at sr_port/expritem.c:619 #12 expratom () at sr_port/expratom.c:29 #13 eval_expr () at sr_port/eval_expr.c:63 #14 expr () at sr_port/expr.c:29 #15 m_write () at sr_port/m_write.c:71 #16 cmd () at sr_port/cmd.c:302 #17 linetail () at sr_port/linetail.c:35 #18 line () at sr_port/line.c:230 #19 compiler_startup () at sr_port/compiler_startup.c:144 #20 compile_source_file () at sr_unix/source_file.c:132 #21 gtm_compile () at sr_unix/gtm_compile.c:120 The assert that failed is correct. The issue is that we called the utf8_badchar_real() function in non-UTF8 mode (i.e. when "gtm_utf8_mode" is 0). The issue is in op_fnzpiece() where we invoke the valid_utf_string() function only if we are in UTF-8 mode (indicated by "gtm_utf8_mode == 1"). The assert (likely introduced as part of GTM-7762 in GT.M V6.3-000) is now fixed to take care of this.

…mer taking more than the allowed 50 seconds to pop This fixes occasional test failures we see where a core file gets generated because the timer pop gets delayed by around 75 seconds non a loaded system. Below is an example core trace. (gdb) where #0 kill () at ../sysdeps/unix/syscall-template.S:84 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:69 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 timer_handler (why=14) at sr_unix/gt_timers.c:724 #4 <signal handler called> #5 __fsync_nocancel () at ../sysdeps/unix/syscall-template.S:84 #6 wcs_timer_start (reg=0x11e3970, io_ok=1) at sr_port/t_end_sysops.c:1281 #7 t_end (hist1=0x12a20c0, hist2=0x0, ctn=18446744071629176832) at sr_port/t_end.c:1818 #8 gvcst_put2 (val=0x122d008, parms=0x7ffcdda9aef0) at sr_port/gvcst_put.c:2641 #9 gvcst_put (val=0x122d008) at sr_port/gvcst_put.c:299 #10 op_gvput (var=0x122d008) at sr_port/op_gvput.c:74

…ure) In one run of the v53002/C9E04002596 subtest, the following assert failed. %YDB-F-ASSERT, Assert failed in sr_port/mutex_deadlock_check.c line 102 for expression (!jnlpool || !jnlpool->jnlpool_dummy_reg || jnlpool->jnlpool_dummy_reg->open || (repl_csa->critical != criticalPtr) || (NULL == cs_addrs)) And below was the C-stack at the time of the assert failure. (gdb) where #0 kill () at ../sysdeps/unix/syscall-template.S:84 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:69 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 ch_cond_core () at sr_unix/ch_cond_core.c:64 #4 rts_error_va () at sr_unix/rts_error.c:159 #5 rts_error_csa () at sr_unix/rts_error.c:92 #6 mutex_deadlock_check () at sr_port/mutex_deadlock_check.c:101 #7 mutex_long_sleep () at sr_unix/mutex.c:511 #8 gtm_mutex_lock () at sr_unix/mutex.c:856 #9 grab_lock () at sr_unix/grab_lock.c:83 #10 repl_inst_ftok_counter_halted () at sr_unix/repl_inst_ftok_counter_halted.c:45 #11 jnlpool_init () at sr_unix/jnlpool_init.c:764 #12 gvcst_init () at sr_port/gvcst_init.c:917 #13 gv_init_reg () at sr_port/gv_init_reg.c:56 #14 gv_bind_name () at sr_port/gv_bind_name.c:75 #15 op_gvname_common () at sr_port/op_gvname.c:117 #16 op_gvname_fast () at sr_port/op_gvname.c:81 The assert at line 102 expects that if ever we are in mutex_deadlock_check() for the jnlpool, we better have not opened any database file (i.e. non-NULL cs_addrs). Whereas in this call sequence clearly, we have a case where while opening the database, we need to open the journal pool (because ydb_custom_errors env var is set) and while opening the journal pool we notice a 32K semaphore counter overflow (artificially created by the test system in this debug-run using ydb_db_counter_sem_incr env var) which results in getting a lock on the jnlpool before making changes to the instance file and while trying to get the lock we notice it is being held by some other process which is taking a long time so we go to mutex_deadlock_check() eventually for the jnlpool and fail the assert while in the middle of a database open (so cs_addrs is non-NULL). The assert is clearly wrong for this case. The assert was only introduced in V6.3-001A in the below state. assert((NULL == jnlpool.jnlpool_dummy_reg) || jnlpool.jnlpool_dummy_reg->open || (repl_csa->critical != criticalPtr)); And it was modified in V6.3-003 and V6.3-005 to add || conditions. Most likely to account for exceptions to the assert as they were encountered. I don't see the value in this assert so removing it. Also corrected a pre-existing comment a few lines after the removed assert to reflect our renewed understanding of the current failure possibility.

… failures showing up after tls handshake in TLS 1.3 In the tls/errors subtest, the "TEST CASE 5" section tests that expired certificates issue appropriate errors. This test fails with an assert failure in the source server. %YDB-F-ASSERT, Assert failed in sr_port/repl_comm.c line 352 for expression (FALSE) Below is the C-stack of the source server at the time of the failure. #4 rts_error_va () at sr_unix/rts_error.c:194 #5 rts_error_csa () at sr_unix/rts_error.c:103 #6 repl_recv () at sr_port/repl_comm.c:352 #7 gtmsource_recv_ctl () at sr_unix/gtmsource_process.c:438 #8 gtmsource_process () at sr_unix/gtmsource_process.c:1488 #9 gtmsource () at sr_unix/gtmsource.c:528 The source server fails a SSL_read() call after the initial handshake due to an expired certificate. In TLS 1.2 and before, the initial handshake used to detect the expired certificate. That used to happen in the function repl_do_tls_handshake() and if a failure is detected there, the function gtmsource_exchange_tls_info() used to fall back to plaintext mode of replication thereby keeping the source server alive. But with TLS 1.3, if the failing assert in repl_comm.c line 352 is removed, a TLSIOERROR error would be issued and the source server would still terminate with an error instead of falling back to plaintext mode. This is because the TLS error was detected after the initial handshake and the replication code was not written to handle plaintext fallback after the initial handshake. With TLS 1.3, the initial handshake has been cut down a lot and so I guess the certificate expiry case cannot be made to issue an error as part of the initial handshake by using any controls in the SSL* functions. Therefore, handling plaintext fallback in case of TLS errors after the initial handshake effectively becomes a necessity with TLS 1.3. Towards this, repl_recv() and repl_send() have been enhanced to return with a status of ERR_TLSIOERROR when repl_errno is set to EREPL_RECV and EREPL_SEND respectively. And all places that check for EREPL_RECV and/or EREPL_SEND have been enhanced to handle the new ERR_TLSIOERROR status code by invoking new macros GTMSOURCE_HANDLE_TLSIOERROR or GTMRECV_HANDLE_TLSIOERROR depending on whether the caller is the source or receiver server respectively. These macros take care of checking if plaintext fallback was specified at the source/receiver server startup and if so do the plaintext fallback without terminating the server (but issuing a warning-type YDB-W-TLSIOERROR message). If no fallback was specified, a error-type YDB-E-TLSIOERROR message is issued and the server terminates.

…secondary errors if primary error is out-of-memory If already exiting, do not open any object/source directories (which could include relinkctl files) as part of $ZROUTINES initialization. This avoids potentially nasty codepaths particulary if the reason we are exiting is an out-of-memory. We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered necessary to create a user-visible issue for this. For example, below are two C-stacks that showed up in core dumps while running the simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid such cores. Core1 ------ Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no space allocating the C-stack in this core. (gdb) where #0 ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41 #1 util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731 #2 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871 #3 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #4 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261 #5 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #6 rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159 #7 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #8 relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679 #9 relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159 #12 zro_init () at sr_port/zro_init.c:51 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108 #29 gvnset () at fatalerror.c:56 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #31 main () at fatalerror.c:32 Core2 ----- In this case there is a SIG-11 deep inside syslog(). Most likely due to an out-of-memory situation. Program terminated with signal SIGSEGV, Segmentation fault. #0 vfprintf () from /usr/lib64/libc.so.6 #1 fprintf () from /usr/lib64/libc.so.6 #2 __vsyslog_chk () from /usr/lib64/libc.so.6 #3 syslog () from /usr/lib64/libc.so.6 #4 util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761 #5 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871 #6 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #7 send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149 #8 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79 #9 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159 #17 zro_init () at sr_port/zro_init.c:51 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108 #34 gvnset () at fatalerror.c:56 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #36 main () at fatalerror.c:32

…CK being called during exit handling When a C program that spawned off multiple threads that used the SimpleThreadAPI (e.g. ydb_tp_st() etc.) was deadlocked (due to a code issue), pressing Ctrl-C (SIGINT) did nothing so pressing Ctrl-\ (SIGQUIT) to terminate the C program caused a MAXRTSERRDEPTH fatal error and resulted in a core dump. Below is the actual output. ^C^\%YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with coreQuit (core dumped) The corresponding C-stack follows. (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52090) at sr_unix/rts_error.c:144 #3 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52270) at sr_unix/rts_error.c:146 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52450) at sr_unix/rts_error.c:146 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #8 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52630) at sr_unix/rts_error.c:146 #9 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #10 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52810) at sr_unix/rts_error.c:146 #11 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #12 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df529f0) at sr_unix/rts_error.c:146 #13 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #14 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52bd0) at sr_unix/rts_error.c:146 #15 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #16 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52db0) at sr_unix/rts_error.c:146 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #18 rts_error_va (csa=0x0, argcnt=7, var=0x7fb28df52f90) at sr_unix/rts_error.c:146 #19 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #20 send_msg_va (csa=0x0, arg_count=8, var=0x7fb28df53570) at sr_unix/send_msg.c:125 #21 send_msg_csa (csa=0x0, arg_count=8) at sr_unix/send_msg.c:84 #22 generic_signal_handler (sig=3, info=0x7fb28df53830, context=0x7fb28df53700) at sr_unix/generic_signal_handler.c:244 #23 <signal handler called> #24 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x7fb2880180a8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #25 __pthread_cond_wait_common (abstime=0x0, mutex=0x7fb288018040, cond=0x7fb288018080) at pthread_cond_wait.c:502 #26 __pthread_cond_wait (cond=0x7fb288018080, mutex=0x7fb288018040) at pthread_cond_wait.c:655 #27 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:80 #28 start_thread (arg=0x7fb28df54700) at pthread_create.c:463 #29 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The primary error was at #20 in send_msg_va() inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro. The actual assert that failed inside the macro was the following. sr_unix/gtm_multi_thread.h --------------------------- 99 /* We should never use pthread_* calls inside a signal/timer handler. Assert that */ \ 100 assert(!in_nondeferrable_signal_handler); \ We were in a signal handler handling a non-deferrable signal (Ctrl-\ aka SIGQUIT) and are about to do a pthread_mutex_lock() library call which is a no-no. If we are in an exit handler, it is possible for send_msg() to be needed (to log the signal that was received etc.) but it is safer to not do any pthread activity since we cannot be sure if we are exiting while inside a signal handler or not. Therefore the fix for this is to check if "process_exiting" global variable is TRUE and if so, we skip all pthread* calls in the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED macros.

A test C program with one or more threads each of which use ydb_tp_st() to do TP transactions deadlocked once in a while when the database has concurrent udpates that caused the TP transaction to restart. Below is the C-stack of the hung TP worker thread. Thread 5 (Thread 0x7f5d1d7ad700 (LWP 94507)): #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 __GI___pthread_mutex_lock (mutex=0x7f5d222252c0 <thread_mutex>) at ../nptl/pthread_mutex_lock.c:78 #2 rts_error_va (csa=0x0, argcnt=1, var=0x7f5d1d7ac5a0) at sr_unix/rts_error.c:146 #3 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:101 #4 ydb_tp_s_common (stapi=1, tptoken=15090, tpfn=0x55c68a782bc0 <gvnset>, tpfnparm=0x7f5d2000fed0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:223 #5 ydb_tp_sst (tptoken=15090, tpfn=0x55c68a782bc0 <gvnset>, tpfnparm=0x7f5d2000fed0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_sst.c:32 #6 ydb_stm_tpthreadq_process (curTPWorkQHead=0x7f5d1001ba40) at sr_unix/ydb_stm_tpthread.c:125 #7 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:80 #8 start_thread (arg=0x7f5d1d7ad700) at pthread_create.c:463 #9 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 (gdb) f 4 #4 ydb_tp_s_common (stapi=1, tptoken=15090, tpfn=0x55c68a782bc0 <gvnset>, tpfnparm=0x7f5d2000fed0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:223 223 INVOKE_RESTART; It notices a restartable situation and therefore wants to signal that but in order to do that it goes through the function rts_error_va() (frame #2) which tries to do a PTHREAD_MUTEX_LOCK_IF_NEEDED and that is hung because the pthread mutex lock is held by some other thread. Turns out this thread was obtained by this same thread in a prior restart (in rts_error_va) but it was never released. Since all SimpleAPI and SimpleThreadAPI calls to rts_error_va() go through ydb_simpleapi_ch(), that is where we should be releasing the pthread lock. We do that already but towards the end of the function. And in case of a TP restart (ERR_TPRETRY error code), we return before reaching that point. The main fix is to sr_unix/ydb_simpleapi_ch.c to move the PTHREAD_MUTEX_UNLOCK_IF_NEEDED macro to the beginning of ydb_simpleapi_ch(). The pthread unlock logic that was already there did not use the macro (code duplication) and is now removed. While at this, various other issues were noticed and fixed. * sr_unix/ydb_stm_thread.c 1) It did not initialize the posix_timer_thread_id to be the worker thread. 2) It did not initialize the thread_mutex global variable (which is what is used by the PTHREAD_MUTEX_LOCK_IF_NEEDED and PTHREAD_MUTEX_UNLOCK_IF_NEEDED macros). This is now done using a new INITIALIZE_THREAD_MUTEX_IF_NEEDED macro. * sr_unix/gtm_multi_thread.c : This is the only place where the thread_mutex global variable was previously initialized. This now invokes the new INITIALIZE_THREAD_MUTEX_IF_NEEDED macro. * sr_port/gtm_malloc_src.h : was_holder was being initialized incorrectly before invoking the PTHREAD_MUTEX_UNLOCK_IF_NEEDED macro just before issuing a MEMORY error. The consequences of this issue are that the pthread mutex lock would not be released by gtm_malloc() in case of a MEMORY error in case the call is done by when multi-threading is turned on. Since multi-threading was turned on only for MUPIP JOURNAL ROLLBACK/RECOVER, this is a non-issue until now when multi-threading is used due to support of ydb_malloc_t() in the SimpleThreadAPI. * sr_unix/gtm_multi_thread.h : Defines the new INITIALIZE_THREAD_MUTEX_IF_NEEDED macro.

…xiting and exit handler is being invoked in a thread other than the MAIN worker thread We got a test failure in the simplethreadapi/tp subtest where a SimpleThreadAPI process was exiting and as part of the exit handler, we ended up checking for deferred timers and that failed the following assert in timer_handler(). assert(gtm_is_main_thread() || gtm_jvm_process); In this case, we were exiting as the below C-stack shows. (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 ch_cond_core () at sr_unix/ch_cond_core.c:64 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7ffe7c683120) at sr_unix/rts_error.c:194 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #6 timer_handler (why=0) at sr_unix/gt_timers.c:724 #7 check_for_deferred_timers () at sr_unix/gt_timers.c:1178 #8 deferred_signal_handler () at sr_port/deferred_signal_handler.c:49 #9 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:191 #10 __run_exit_handlers (status=0, listp=0x7fb6d8d2f5f8 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true) at exit.c:82 #11 __GI_exit (status=<optimized out>) at exit.c:104 #12 __libc_start_main (main=0x400f76 <main>, argc=1, argv=0x7ffe7c683648, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe7c683638) at ../csu/libc-start.c:325 #13 _start () So the assert is enhanced to reflect this.

…o issue an error); Nix NOTSUPSTAPI message The simplethreadapi/tp subtest failed once with the following signature in the tp5_TPTIMEOUT.c section of the test. %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 725 for expression (gtm_is_main_thread() || gtm_jvm_process || exit_handler_active && (DUMMY_SIG_NUM == why)) with the following C-stack (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:62 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 ch_cond_core () at sr_unix/ch_cond_core.c:64 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fe3a13888b0) at sr_unix/rts_error.c:194 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #6 timer_handler (why=0) at sr_unix/gt_timers.c:725 #7 check_for_deferred_timers () at sr_unix/gt_timers.c:1179 #8 deferred_signal_handler () at sr_port/deferred_signal_handler.c:49 #9 rts_error_va (csa=0x0, argcnt=4, var=0x7fe3a1388c70) at sr_unix/rts_error.c:194 #10 rts_error_csa (csa=0x0, argcnt=4) at sr_unix/rts_error.c:101 #11 ydb_hiber_start (sleep_nsec=1000000) at sr_unix/ydb_hiber_start.c:46 #12 gvnset (tptoken=1) at tp5_TPTIMEOUT.c:77 #13 ydb_stm_tpthreadq_process (curTPWorkQHead=0x13dac40) at sr_unix/ydb_stm_tpthread.c:197 #14 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:78 #15 start_thread (arg=0x7fe3a1389700) at pthread_create.c:333 #16 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 This was a process that had already made SimpleThreadAPI calls but is now making a SimpleAPI call (ydb_hiber_start()) and so a NOTSUPSTAPI error is about to be issued. But since this call is happening in the user-defined callback function inside a TP transaction, it is the TP worker thread (not the MAIN worker thread) that is executing the "ydb_hiber_start". This means the rts_error_csa invocation is running in the TP worker thread while the YottaDB engine is concurrently being modified by the MAIN worker thread. A no-no since the YottaDB engine is not multi-threaded. To fix this issue, the VERIFY_NON_THREADED_API macro is now fixed to do a "return YDB_ERR_INVAPIMODE". This means ydb_hiber_start() would return a lot sooner thereby not requiring an "rts_error" invocation. But while doing this change, noticed a few issues. The VERIFY_NON_THREADED_API is used from a few functions that do not return any value (sr_unix/ydb_free.c and sr_unix/ydb_timer_cancel.c) so a new macro VERIFY_NON_THREADED_API_NORETVAL is created which is very similar except it does a plain "return". sr_unix/ydb_malloc.c needed special handling since it was returning a "void *" and so a new VERIFY_NON_THREADED_API_RETNULL macro is created for that purpose. While at this, noticed that the VERIFY_NON_THREADED_API macro was not resetting TREF(libyottadb_active_rtn) in case of an INVAPIMODE return (since this macro is usually invoked after a LIBYOTTADB_INIT) so fixed it to do so. Note that the VERIFY_THREADED_API macro stayed the same in that it did not do this reset since it is usually invoked before the LIBYOTTADB_INIT macro. But two exceptions to this rule were found, sr_unix/ydb_cip_helper.c and sr_unix/ydb_tp_s_common.c. They are now fixed so the VERIFY_THREADED_API macro invocation happens before the LIBYOTTADB_INIT macro. Another issue that was noticed is that "ydb_ci" and "ydb_cip" were not doing a VERIFY_NON_THREADED_API check like other SimpleAPI calls do so a new sr_unix/ydb_ci.c and sr_unix/ydb_cip.c were created to do this before invoking ydb_ci_exec(). And the existing ydb_ci() and ydb_cip() function definitions in sr_unix/gtmci.c were removed. A new VERIFY_NON_THREADED_API_DO_NOT_SHUTOFF_ACTIVE_RTN macro was introduced for this purpose since we do not want to do a LIBYOTTADB_INIT in these functions (to avoid unnecessary SIMPLEAPINEST errors). With all these changes, the NOTSUPSTAPI error (currently issued in sr_unix/ydb_hiber_start_wait_any.c and sr_unix/ydb_hiber_start.c) was no longer necessary since an INVAPIMODE error would have been issued before this error codepath is reached in all callers. So this error message is now removed.

…ThreadAPI is active This issue was exposed by a failure in the dual_fail_extend/dual_fail2_mustop_sigquit subtest. This test terminates processes by sending them a SIGQUIT/SIG-3 or SIGTERM/SIG-15 signal. But since one of the threads (the MAIN worker thread) in this multi-threaded process was inside wcs_wtstart() in a non-interruptable code zone (DEFER_INTERRUPTS had been done), the exit handler invoked in another concurrently running thread decided to defer the exit until the ENABLE_INTERRUPTS happened in the worker thread. When the ENABLE_INTERRUPTS did happen, the worker thread invoked exit handling code while it was already inside a timer handler. And since this particular test was running with GDSV4 format blocks, wcs_wtstart() could not flush such blocks (since it required a call to gtm_malloc() which meant a pthread_mutex_lock() call while inside a timer handler which is a no-no) and so wcs_flu() was not able to flush any blocks as part of exit handling causing it to fail an assert. Below is the C-stack corresponding to the assert failure. (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 ch_cond_core () at sr_unix/ch_cond_core.c:64 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7f59dccc02a0) at sr_unix/rts_error.c:194 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #6 wcs_flu (options=519) at sr_unix/wcs_flu.c:587 #7 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608 #8 gv_rundown () at sr_port/gv_rundown.c:123 #9 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:204 #10 __run_exit_handlers (status=-3, listp=0x7f59e2319718 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108 #11 __GI_exit (status=<optimized out>) at exit.c:139 #12 gtm_image_exit (status=-3) at sr_unix/gtm_image_exit.c:27 #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:111 #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:45 #15 wcs_wtstart (region=0x55b9581d66d8, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:829 #16 wcs_stale (tid=94254535632600, hd_len=8, region=0x55b9581d62a8) at sr_port/t_end_sysops.c:1387 #17 timer_handler (why=14) at sr_unix/gt_timers.c:821 #18 <signal handler called> #19 __memmove_avx_unaligned_erms () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:277 #20 gtm_memcpy_validate_and_execute (target=0x7f59dccc25c0, src=0x7f59e32fd6c6, len=0) at sr_port/gtm_memcpy_validate_and_execute.c:42 #21 gvcst_put2 (val=0x7f59e30c7440 <increment_delta_mval>, parms=0x7f59dccc4be0) at sr_port/gvcst_put.c:626 #22 gvcst_put (val=0x7f59e30c7440 <increment_delta_mval>) at sr_port/gvcst_put.c:299 #23 gvcst_incr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/gvcst_incr.c:56 #24 op_gvincr (increment=0x55b9581a05a0, result=0x7f59d8009410) at sr_port/op_gvincr.c:58 The fix for this issue is to not invoke exit handling while inside the timer handler if we know SimpleThreadAPI is active. In that case, finish the timer handler first and invoke exit handling a little later in mainline code where it is safe to invoke exit handling.

…eads to reach logical point before starting exit handler processing We had a test failure (in the dual_fail_extend/dual_fail2_mustop_sigquit subtest) where a SimpleThreadAPI process was sent a SIG-15 by the test and the signal got delivered to the MAIN worker thread but it went ahead with exit handler processing (including rolling back an active TP transaction) while a TP worker thread was concurrently running the TP callback function without realizing all of this going on. The TP worker thread effectively got an INVTPTRANS error since it was using a non-zero tptoken in a ydb_set_st() call when there was no active TP transaction (due to the exit handler doing an op_trollback()). The fix is to defer exit processing in generic_signal_handler.c if we find out that we are the MAIN worker thread. This way the MAIN worker thread will invoke the exit handler gtm_exit_handler() inside ydb_stm_thread() when it knows it is a logical/safe point to do so. In addition, deferred_signal_handler() is now fixed to skip invoking the exit handler in case we are the MAIN worker thread. This is because ydb_stm_thread() has an already established scheme (using "forced_simplethreadapi_exit" global variable) to determine the logical point and then invoke gtm_exit_handler(). Below is the C-stack of all threads at the time of the core for the record. (gdb) thread apply all bt Thread 3 (Thread 0x7fde4cb67700 (LWP 14698)): #0 fsync () from /usr/lib64/libc.so.6 #1 jnl_fsync (reg=0x55af6c90e7b8, fsync_addr=38517184) at sr_unix/jnl_fsync.c:134 #2 wcs_flu (options=519) at sr_unix/wcs_flu.c:413 #3 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:608 #4 gv_rundown () at sr_port/gv_rundown.c:123 #5 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:216 #6 __run_exit_handlers () from /usr/lib64/libc.so.6 #7 exit () from /usr/lib64/libc.so.6 #8 gtm_image_exit (status=-15) at sr_unix/gtm_image_exit.c:27 #9 generic_signal_handler (sig=15, info=0x7fde4cb66830, context=0x7fde4cb66700) at sr_unix/generic_signal_handler.c:380 #10 <signal handler called> #11 pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0 #12 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:123 #13 start_thread () from /usr/lib64/libpthread.so.0 #14 clone () from /usr/lib64/libc.so.6 Thread 2 (Thread 0x7fde510c6dc0 (LWP 14695)): #0 do_futex_wait.constprop () from /usr/lib64/libpthread.so.0 #1 __new_sem_wait_slow.constprop.0 () from /usr/lib64/libpthread.so.0 #2 ydb_stm_args (callblk=0x55af6c96b550) at sr_unix/ydb_stm_args.c:183 #3 ydb_stm_args5 (tptoken=0, errstr=0x0, calltyp=16, p1=94211928230125, p2=140733677288928, p3=94211928265280, p4=1, p5=140733677288912) at sr_unix/ydb_stm_args.c:320 #4 ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x55af6c8408ed <tpfn_stage1>, tpfnparm=0x7fff1cd7bde0, transid=0x55af6c849240 <tptypebuff> "BATCH", namecount=1, varnames=0x7fff1cd7bdd0) at sr_unix/ydb_tp_st.c:33 #5 impjob (childnum=2) at simplethreadapi_imptp.c:1148 #6 main (argc=1, argv=0x7fff1cd7c198) at simplethreadapi_imptp.c:602 Thread 1 (Thread 0x7fde47fff700 (LWP 14705)): #0 pthread_kill () from /usr/lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 ch_cond_core () at sr_unix/ch_cond_core.c:76 #3 rts_error_va (csa=0x0, argcnt=7, var=0x7fde47ffeaa0) at sr_unix/rts_error.c:194 #4 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:101 #5 ydb_stm_args (callblk=0x7fde40000b20) at sr_unix/ydb_stm_args.c:126 #6 ydb_stm_args4 (tptoken=7085, errstr=0x0, calltyp=12, p1=94211928265184, p2=2, p3=94211928261632, p4=94211928263568) at sr_unix/ydb_stm_args.c:298 #7 ydb_set_st (tptoken=7085, errstr=0x0, varname=0x55af6c8491e0 <ygbl_arandom>, subs_used=2, subsarray=0x55af6c848400 <subscr>, value=0x55af6c848b90 <ybuff_val>) at sr_unix/ydb_set_st.c:33 #8 tpfn_stage1 (tptoken=7085, errstr=0x0, parm_array=0x7fff1cd7bde0) at simplethreadapi_imptp.c:1384 #9 ydb_stm_tpthreadq_process (curTPWorkQHead=0x7fde48024c40, forced_simplethreadapi_exit_seen=0x7fde47ffeea8) at sr_unix/ydb_stm_tpthread.c:225 #10 ydb_stm_tpthread (parm=0x0) at sr_unix/ydb_stm_tpthread.c:84 #11 start_thread () from /usr/lib64/libpthread.so.0 #12 clone () from /usr/lib64/libc.so.6

…l buffers while instance freeze is ON DO_JNL_FLUSH_IF_POSSIBLE macro is invoked as a desire to flush if possible. If the act of flushing is going to hang due to a frozen instance, it is better to skip the jnl flush and avoid the hang. That is what is done as the fix in this commit. This addresses a hang seen in the v62000/gtm8086 subtest where the source server was stuck waiting for the instance to be unfrozen (while trying to flush the journal file using the DO_JNL_FLUSH_IF_POSSIBLE macro) while the test script (which does the unfreeze) was waiting for the source server to clear some backlog. Below is the C-stack of the stuck source server for the record. (gdb) where #0 clock_nanosleep () from /usr/lib64/libc.so.6 #1 m_usleep () at sr_unix/sleep.c:25 #2 wait_for_repl_inst_unfreeze_nocsa_jpl () at sr_port/anticipatory_freeze.h:490 #3 wait_for_repl_inst_unfreeze () at sr_port/anticipatory_freeze.h:513 #4 jnl_write_attempt () at sr_port/jnl_write_attempt.c:335 #5 jnl_flush () at sr_port/jnl_flush.c:57 #6 update_max_seqno_info () at sr_unix/gtmsource_readfiles.c:741 #7 first_read () at sr_unix/gtmsource_readfiles.c:881 #8 read_regions () at sr_unix/gtmsource_readfiles.c:1711 #9 read_and_merge () at sr_unix/gtmsource_readfiles.c:1544 #10 gtmsource_readfiles () at sr_unix/gtmsource_readfiles.c:1974 #11 gtmsource_get_jnlrecs () at sr_unix/gtmsource_process_ops.c:980 #12 gtmsource_process () at sr_unix/gtmsource_process.c:1544 #13 gtmsource () at sr_unix/gtmsource.c:528 #14 mupip_main () at sr_unix/mupip_main.c:124 #15 dlopen_libyottadb () at sr_unix/dlopen_libyottadb.c:148 #16 main () at sr_unix/mupip.c:19

…r ends in a newline Background ---------- * Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267101642 * The below SIG-11 in `trigger_update.c` happens in YottaDB and GT.M. In Release and Debug builds. But the stack trace is slightly different than the one in https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267095573. ```m YDB>set trigstr="+^a4 -commands=S -xecute=<<"_$c(10) YDB>write $ztrigger("item",trigstr) %YDB-F-KILLBYSIGSINFO1, YottaDB process 41235 has been killed by a signal 11 at address 0x00007FE0AB393D97 (vaddr 0x00007FE1A73A68F2) %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object ``` ```c #8 trigger_update_rec (trigger_rec=0x62d0000080a0, noprompt=1, trig_stats=0x7fe0ac428920, trigfile_device=0x0, record_num=0x0) at sr_unix/trigger_update.c:1576 #9 trigger_update_rec_helper (trigger_rec=0x62d0000080a0, noprompt=1, trig_stats=0x7fe0ac428920) at sr_unix/trigger_update.c:2217 #10 trigger_update (trigger_rec=0x62d0000080a0) at sr_unix/trigger_update.c:2270 #11 op_fnztrigger (func=0x7fe0a97cf490, arg1=0x62d0000080a0, arg2=0x7fe0a97cf470, dst=0x7fe0a742f5c8) at sr_port/op_fnztrigger.c:248 (gdb) f 8 #8 trigger_update_rec (trigger_rec=0x62d0000080a0, noprompt=1, trig_stats=0x7fe0ac428920, trigfile_device=0x0, record_num=0x0) at sr_unix/trigger_update.c:1576 1576 if ('\n' != ptr[len - 1]) (gdb) p len $1 = 0 ``` Issue ----- * When a multi-line trigger specification after the `-xecute=<<` ends in a `\n` character, but is otherwise empty, `len` variable in `trigger_update_rec()` (in `sr_unix/trigger_update.c`) is 0 but in order to set the `newline_found` variable, we check `ptr[len - 1]` which would result in an out-of-bounds access and a SIG-11. Fix --- * Inside the `if (NULL != trigjrecptr)` logic (that surrounds the `ptr[len - 1]` access logic) we now check if `len` is 0 and if so we set `newline_found` to TRUE (because `len` denotes the length of the M code after the first `\n` in the xecute string and so if this is 0 it means the last character we found in the xecute string was a `\n`) thereby avoiding the `ptr[len - 1]` access in that case.

…ecute string Background ---------- * Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267098363 * The below SIG-11 in `trigger_parse.c` happens in YottaDB and GT.M. In Release and Debug builds. ```m YDB>if $ztrigger("item","+","PAGFIbal(1) -commands=S -xecute=""") %YDB-F-KILLBYSIGSINFO1, YottaDB process 28864 has been killed by a signal 11 at address 0x00007FA4E345D3E0 (vaddr 0x00000000 00000000) %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object ``` ```c #7 trigger_parse (input=0x0, input_len=0, trigvn=0x7ffe6146bcc0 "", values=0x7ffe6146be70, value_len=0x7ffe6146bd90, max_len=0x7ffe6146bbdc, multi_line_xecute=0x7ffe6146bbd4) at sr_unix/trigger_parse.c:1368 #8 trigger_update_rec (trigger_rec=0x7fa4e38e62a8, noprompt=1, trig_stats=0x7ffe61480bf0, trigfile_device=0x0, record_num=0x0) at sr_unix/trigger_update.c:1417 #9 trigger_update_rec_helper.constprop.0 (trigger_rec=trigger_rec@entry=0x7fa4e38e62a8, trig_stats=trig_stats@entry=0x7ffe61480bf0, noprompt=1) at sr_unix/trigger_update.c:2217 #10 trigger_update (trigger_rec=0x7fa4e38e62a8) at sr_unix/trigger_update.c:2270 #11 op_fnztrigger (func=<optimized out>, arg1=0x7fa4e38e62a8, arg2=<optimized out>, dst=0x55e1696279e0) at sr_port/op_fnztrigger.c:248 (gdb) f 7 #7 trigger_parse (input=0x0, input_len=0, trigvn=0x7ffe6146bcc0 "", values=0x7ffe6146be70, value_len=0x7ffe6146bd90, max_len=0x7ffe6146bbdc, multi_line_xecute=0x7ffe6146bbd4) at sr_unix/trigger_parse.c:1368 1368 if ('^' != *ptr1++) (gdb) p ptr1 $1 = 0x1 <error: Cannot access memory at address 0x1> ``` Issue ----- * The function `trigger_parse()` is presented with an input buffer `input` and a length `input_len`. But while processing the input buffer, it does not ensure all references are within the buffer limits. This means that if `-xecute=` specification is incomplete (in the example above, the string starts with a double-quote but ends abruptly without an ending double-quote), we could enter this function with an `input_len` value of 0. In that case, we would try to access the very first byte resulting in a SIG-11. Fix --- * The fix to the above issue is to check if `input_len` is 0 and if so return right away with an error from the `trigger_parse()` function. * But while examining this function, I realized similar issues existed in various other error code paths. All of them scanned past the input buffer (for a global nam, subscripts etc.) without honoring the input buffer bounds. * So all these issues are now fixed in this commit by introducing a new variable `ptr1_top` which is set to the end of the valid input buffer and every access of the input buffer (mostly done through the variable `*ptr1`) is validated to be within bounds by comparing it against `ptr1_top` and if not an error is issued. This is done by adding `||` conditions involving `ptr1_top` to pre-existing error code paths.

Background ---------- Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267198021 * https://gitlab.com/YottaDB/DB/YDB/-/issues/964#note_1267114939 mentions that the ASAN `heap-buffer-overflow` error is not reproducible. I was able to later find a reproducible test case. Pasting that below. This failure requires a build of YottaDB with ASAN enabled. ```m $ cat asan.m if $ztrigger("item","+^a(1) -xecute=") if $ztrigger("item","+^a(1") ``` ```c $ yottadb -direct < asan.m ================================================================= ==17210==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x62d000008c28 at pc 0x0000004987c7 bp 0x7ffdc8a9d080 sp 0x7ffdc8a9c848 READ of size 33023 at 0x62d000008c28 thread T0 #0 __asan_memcpy (dbg/yottadb+0x4987c6) #1 gtm_memcpy_validate_and_execute sr_port/gtm_memcpy_validate_and_execute.c:44:9 #2 cli_str_setup sr_unix/cli_lex.c:202:2 #3 trigger_parse sr_unix/trigger_parse.c:1423:2 #4 trigger_update_rec sr_unix/trigger_update.c:1417:7 #5 trigger_update_rec_helper sr_unix/trigger_update.c:2217:19 #6 trigger_update sr_unix/trigger_update.c:2270:21 #7 op_fnztrigger sr_port/op_fnztrigger.c:248:31 0x62d000008c28 is located 0 bytes to the right of 34856-byte region [0x62d000000400,0x62d000008c28) allocated by thread T0 here: #0 malloc (dbg/yottadb+0x49936d) #1 findStorElem sr_port/gtm_malloc_src.h:598:3 #2 findStorElem sr_port/gtm_malloc_src.h:571:3 #3 findStorElem sr_port/gtm_malloc_src.h:571:3 #4 findStorElem sr_port/gtm_malloc_src.h:571:3 #5 findStorElem sr_port/gtm_malloc_src.h:571:3 #6 gtm_malloc_main sr_port/gtm_malloc_src.h:743:6 #7 gtm_malloc_main sr_port/gtm_malloc_src.h:823:19 #8 gtm_malloc sr_port/gtm_malloc_src.h:1486:9 #9 gtm_env_init_sp sr_unix/gtm_env_init_sp.c:175:28 #10 gtm_env_init sr_port/gtm_env_init.c:385:3 #11 common_startup_init sr_port/common_startup_init.c:121:2 #12 gtm_main sr_unix/gtm_main.c:103:2 #13 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11 #14 main sr_unix/gtm.c:20:9 #15 __libc_start_main csu/../csu/libc-start.c:308:16 SUMMARY: AddressSanitizer: heap-buffer-overflow (dbg/yottadb+0x4987c6) in __asan_memcpy ``` Issue ----- * Notice that in `asan.m`, there are 2 calls to `ztrigger()`. The second call does not have a closing paren after the `^a(1` but the first call had the `)` at the exact same spot. * Because of this, line 1049 below in the `process_subscripts()` function incorrectly concludes that the `)` was seen even in the second call. This is because even though `len` is 0 at that point (indicating we are at the end of the input buffer), we check `*ptr` which is `)` from the previous `$ztrigger()` call. Therefore the error at line 1051 is not issued. **sr_unix/trigger_parse.c** ```c 694 STATICFNDEF boolean_t process_subscripts(char *subscr_str, uint4 *subscr_len, char **next_str, char *out_str, int4 *out_max) 695 { . 730 while ((0 < len) && (')' != *ptr)) 731 { . 1048 } 1049 if ((0 == len) && (')' != *ptr)) 1050 { 1051 util_out_print_gtmio("Missing \")\" after global subscript", FLUSH); 1052 return FALSE; 1053 } ``` * And because no error was issued, line 1409 below (in the caller) continues processing to line 1425 below where the first parameter ends up being a negative value which when treated as a `uint4` ends up being a huge positive quantity. **sr_unix/trigger_parse.c** ```c 1409 if (!process_subscripts(ptr1, &len, &ptr2, values[GVSUBS_SUB], &max_output_len)) 1410 { 1411 ERROR_MSG_RETURN("", input_len, input); 1412 } 1413 } else 1414 len = 0; 1415 if (0 > --max_output_len) 1416 { 1417 util_out_print_gtmio("Error : Trigger definition too long", FLUSH); 1418 return TRIG_FAILURE; 1419 } 1420 values[GVSUBS_SUB][len] = '\0'; 1421 value_len[GVSUBS_SUB] = (uint4)len; 1422 save_cmd_ary = cmd_ary; 1423 cmd_ary = &trigger_cmd_ary[0]; 1424 gtm_cli_interpret_string = FALSE; 1425 cli_str_setup(input_len - (uint4)(ptr2 - input), ptr2); ``` * And later inside `cli_str_setup()` (lines 190 and 201 below) we end up with an `alloclen` and `addrlen` that is close to `MAX_LINE` resulting in a `memcpy()` call that goes past the allocated buffer limits resulting in the `heap-buffer-overflow` ASAN error. **sr_unix/cli_lex.c** ```c 185 void cli_str_setup(uint4 addrlen, char *addr) 186 { /* callers trigger_parse and zl_cmd_qlf create command strings with knowledge of their length */ . 190 alloclen = ((MAX_LINE <= addrlen) ? MAX_LINE : addrlen) + 1; . 201 addrlen = MIN(addrlen, alloclen - 1); 202 memcpy(cli_lex_in_ptr->in_str, addr, addrlen); ``` Fix --- * The fix is simple and is in `process_subscripts()` to fix the incorrect check of whether a `)` was seen. * Since the `while` loop at line 730 below terminates if `len` is either 0 or if `*ptr` is `)`, all that is needed is for line 1049 to just check if `0 == len`. That implies we terminated the while loop without seeing a `)` and if so we should issue an error. No need to also check `')' != *ptr` like it currently does below. **sr_unix/trigger_parse.c** ```c 694 STATICFNDEF boolean_t process_subscripts(char *subscr_str, uint4 *subscr_len, char **next_str, char *out_str, int4 *out_max) 695 { . 730 while ((0 < len) && (')' != *ptr)) 731 { . 1048 } 1049 if ((0 == len) && (')' != *ptr)) ```

…ut in dee9d0c) Background ---------- * The `mem_stress_1/memleak` subtest failed in one rare test run on a slow in-house system with various core files. Below is an analysis of the first core file using gdb. ```c (gdb) where #0 pthread_kill () from /lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7ffed66c2ec0) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 lvzwr_out_targkey (one=0x7ffed66c30c0) at sr_port/lvzwr_out.c:96 #7 lvzwr_out (lvp=0x103fb48) at sr_port/lvzwr_out.c:286 #8 lvzwr_var (lv=0x103fb48, n=3) at sr_port/lvzwr_var.c:233 #9 lvzwr_var (lv=0x103faf0, n=2) at sr_port/lvzwr_var.c:312 #10 lvzwr_var (lv=0x103fa98, n=1) at sr_port/lvzwr_var.c:312 #11 lvzwr_var (lv=0x10c95d0, n=0) at sr_port/lvzwr_var.c:309 #12 lvzwr_fini (out=0x7ffed66ce590, t=1) at sr_port/lvzwr_fini.c:83 #13 op_lvpatwrite (count=0, arg1=140732495881408) at sr_port/op_lvpatwrite.c:85 #14 zshow_zwrite (output=0x7ffed66ce590) at sr_port/zshow_zwrite.c:40 #15 op_zshow (func=0x7ffed66d66c0, type=1, lvn=0x0) at sr_port/op_zshow.c:220 #16 jobexam_dump (dump_filename_arg=0x7f28bdf51b60, dump_file_spec=0x10323b8, fatal_file_name_buff=0x7ffed66d7210 "", zshowcodes=0x7f28bdf51b60, dev_in_use=0x7ffed66d67a0) at sr_port/jobexam_process.c:237 #17 jobexam_process (dump_file_name=0x7f28bdf51b60, zshowcodes=0x7f28bdf51b60, dump_file_spec=0x10323b8) at sr_port/jobexam_process.c:147 #18 op_fnzjobexam (prelimSpec=0x7f28bdf51b60, zshowcodes=0x7f28bdf51b60, finalSpec=0x10323b8) at sr_port/op_fnzjobexam.c:22 (gdb) f 6 #6 lvzwr_out_targkey (one=0x7ffed66c30c0) at sr_port/lvzwr_out.c:96 96 assert(MAX_STRLEN /* WARNING assignment below; check in op_putindx should assure this */ 97 >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len)); (gdb) p gtm_threadgbl_true->util_outbuff $1 = "%YDB-F-ASSERT, Assert failed in sr_port/lvzwr_out.c line 97 for expression (MAX_STRLEN >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len))", '\000' <repeats 5946 times> (gdb) p length $2 = 1048577 (gdb) p ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len $4 = 1048576 ``` * Based on this, I was able to come up with a simple test case that demonstrates the same issue. ```m YDB>set x(1,$justify(2,2**20))="" zwrite x %YDB-F-ASSERT, Assert failed in sr_port/lvzwr_out.c line 97 for expression (MAX_STRLEN >= (length += ((zwr_sub_lst *)lvzwrite_block->sub)->subsc_list[n].actual->str.len)) ``` * This failure happens only in a Debug build. A Release build runs fine and prints a long string corresponding to the contents of the subscripted local variable node `x(1,<2**20-long-string>)` in the zwrite format. Issue ----- * As part of dee9d0c, the following change happened where we started allowing sets of subscripted local variable nodes where each subscript is 1Mib long. * Below is relevant text from the commit message of dee9d0c. ``` Files that had merge conflicts but the V63003 change was discarded ------------------------------------------------------------------ Reason for discard is mentioned below against each module. * sr_port/op_fnquery.c & sr_port/op_putindx.c --> GTM-6115/GTM-8792 in GT.M V6.3-003 release notes describes that only $QUERY --> on lvns with subscripts exceeding 1Mb in total length will be prohibited, not --> other operations like SET but the change in this module does the exact opposite. ``` * This meant YottaDB allowed SETs of lvns where each subscript was 1MiB long. Whereas GT.M did not. Below is an example using GT.M V7.0-005. GT.M only allows a subscript that is 5 bytes shorter than 1MiB when there is just 2 subscripts in the lvn. It does not allow a subscript that is 4 bytes shorter than 1MiB. ```m GTM>set x($justify(1,2**20-4))="" %GTM-E-MAXSTRLEN, Maximum string length exceeded GTM>set x($justify(1,2**20-5))="" GTM> ``` But if one tries to use 3 subscripts, GT.M only allows a subscript that is 68 bytes short of 1MiB. ```m GTM>set x($justify(1,2,2**20-5))="" %GTM-E-MAXSTRLEN, Maximum string length exceeded GTM>set x($justify(1,2,2**20-67))="" %GTM-E-MAXSTRLEN, Maximum string length exceeded GTM>set x($justify(1,2,2**20-68))="" GTM> ``` So the maximum allowed subscript length is dependent on other subscripts in the lvn. * The assert that failed in `sr_port/lvzwr_out.c` is tied to this logic in GT.M and relies on the fact that a SET of such a lvn would have been disallowed in `sr_port/op_putindx.c`. * But YottaDB allows each subscript to be 1MiB long since dee9d0c. Independent of other subscripts in the lvn. * Therefore this assert should have been removed as part of dee9d0c but was missed out then. Fix --- * The assert is removed in this commit. Along with it, a debug-only variable `length` as well as some comments describing the reliance on the obsolete `sr_port/op_putindx.c` behavior also got removed.

…ggers too Background ---------- * Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/722#description _Database updates made by triggers are [not propagated by the replication stream](https://docs.yottadb.com/ProgrammersGuide/triggers.html#multisite-database-replication) because the design point was that database updates from triggers are derivable from the primary update. However, this means that times (and other data such as process ids) are not replicated and hence not easily recreated. The proposed enhancement adds an option when starting a Source Server to include trigger updates in the replication stream._ Core changes ------------ * `sr_unix/mupip_cmd.c` has a new `TRIGUPDATE` option in the `gtmsource_qual[]` array. * `-trigupdate` is allowed only if `-secondary` is also specified. This limits the possible commands that can specify `-trigupdate` to `mupip replic -source -start` or `mupip replic -source -activate` when they also specify `-secondary=...`. Towards this, `sr_unix/mupip_cmd_disallow.c` is modified to disallow `TRIGUPDATE` if `SECONDARY` is not also specified. * An active source server startup (`mupip replic -source -start -secondary=...`) or a source server activation (`mupip replic -source -activate -secondary=...`) now supports an optional `-trigupdate` option which when specified implies that database updates made inside trigger invocations are also included in the replication stream. * The fact that a `-trigupdate` option was specified is noted down in the boolean valued variable `gtmsource_options.trigupdate` (set to FALSE if `-trigupdate` was not specified and TRUE if it was). For a `mupip replic -source -start` command, the source server that eventually starts is forked off the process that specifies the source server startup command and so `gtmsource_options.trigupdate` is inherited implicitly and is therefore usable. But for a `mupip replic -source -activate` command, the process that specifies this command is different from the concurrently running source server and so `gtmsource_options.trigupdate` is only usable in the activate process and not transferred to the running source server. It is therefore necessary to copy over this user specified option from the activate command to the corresponding source server specific structure in the journal pool (`gtmsource_local` structure which is in shared memory). This copy happens in `sr_unix/jnlpool_init.c` (for a `mupip replic -source -start`) and in `sr_unix/gtmsource_mode_change.c` (for a `mupip replic -source -activate` command). And because of the above, the replication filter can safely use `jnlpool->gtmsource_local->trigupdate` when it needs to know whether `-trigupdate` was specified if the caller is a source server. * The replication filter functions `jnl_v44TOv44()` and `jnl_v44TOv24()` were modified in `sr_port/repl_filter.c` to check if `-trigupdate` was specified (checked using `gtmsource_local->trigupdate`). And if so, they replicate updates that happen inside triggers (those which have `JS_NOT_REPLICATED_MASK` bit set in the `nodeflags` member). The `nodeflags` member is modified to clear the `JS_NOT_REPLICATED_MASK` bit in such records. Additionally they do not replicate any LGTRIG (trigger definitions) or ZTRIG (ZTRIGGER command) or $ZTWORMHOLE jnl records. Both these filter functions had code that previously issued an error if no conversion occurred as it was an out-of-design scenario. That code has now been replaced with code that generates a NULL record since in this case the entire transaction consists of journal records that are not replicated. This NULL record logic was copied over from the filter function `jnl_v44TOv22()` which was otherwise unchanged because that function is only used in case the receiver side is pre-V6.2-000 in which case it does not support LGTRIG records (issues a `%YDB-E-REPLNOHASHTREC` error). `jrec_null.bitmask.filler` is also initialized in `INITIALIZE_V44_NULL_RECORD` now that this macro is used for `jnl_v44TOv44()` filter conversion too (not just for `jnl_v44TOv22()` conversion like previously). Misc changes ------------ * Since the `gtmsource_local_struct` structure has a new `trigupdate` member, `repl_inst_dump_gtmsourcelocal()` in `sr_unix/repl_inst_dump.c` was modified to dump this new field. * Initialization of default values for `CONNECT_PARMS` was duplicated in `sr_unix/gtmsource_get_opt.c` in the `if` and `else` blocks. That is now moved to before the `if` thereby avoiding the duplication. * A stale comment in `sr_port/jnl.h` was fixed (`align_str` was no longer in use like `ztworm_str` or `lgtrig_str` was). * Regenerated GTMDefinedTypesInit*.m for sr_x86_64,sr_aarch64,sr_armv6l due to `gtmsource_local_struct` and `gtmsource_options_t` structure layout/size changes. * Fixed a pre-existing issue with `gv_target` maintenance in `sr_port/gvcst_jrt_null.c`. Some background first. After enhancing the YDBTest test framework to test with `-trigupdate`, I noticed a rare test failure with the following symptom. This issue showed up only now due to more NULL records possible with the use of `-trigupdate`. ```sh $ cat ##REMOTE_PATH##/online_rollback_1_10/trestartrootverify/instance2/RCVR_22_01_55_4.log.updproc . . Thu Feb 16 22:02:29 2023 : ----> TPRETRY for sequence number 11252 [0x2bf4] Thu Feb 16 22:02:29 2023 : Jnl seq no : 11252 [0x2bf4];Rectype : 17 - TCOM %YDB-F-ASSERT, Assert failed in sr_port/gvcst_root_search.c line 210 for expression (cs_addrs == gv_target->gd_csa) %YDB-I-IPCNOTDEL, Thu Feb 16 22:02:29 2023 : Update process did not delete IPC resources for region AREG ``` And below was the debugger analysis. ```c (gdb) where #8 gvcst_redo_root_search () at sr_port/gvcst_root_search.c:210 #9 t_retry (failure=cdb_sc_gvtrootmod2) at sr_port/t_retry.c:558 #10 t_end (hist1=0x0, hist2=0x0, ctn=18446744071629176832) at sr_port/t_end.c:1896 #11 gvcst_jrt_null (salvaged=0) at sr_port/gvcst_jrt_null.c:72 #12 updproc_actions (gld_db_files=0x62d000001ac0) at sr_port/updproc.c:1046 #13 updproc () at sr_port/updproc.c:502 #14 mupip_main (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638) at sr_unix/mupip_main.c:122 #15 dlopen_libyottadb (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638, main_func=0x563cba3c1020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #16 main (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638) at sr_unix/mupip.c:22 (gdb) f 8 #8 gvcst_redo_root_search () at sr_port/gvcst_root_search.c:210 210 assert(cs_addrs == gv_target->gd_csa); (gdb) p cs_addrs $1 = (sgmnt_addrs *) 0x62d000026040 (gdb) p gv_target->gd_csa $2 = (sgmnt_addrs *) 0x62d00001e840 (gdb) p cs_addrs->sgm_info_ptr->gv_cur_region->rname $4 = "DREG", '\000' <repeats 27 times> (gdb) p $2->sgm_info_ptr->gv_cur_region->rname $5 = "DEFAULT", '\000' <repeats 24 times> (gdb) p gv_target->gvname $6 = {var_name = {char_len = 0, len = 31, addr = 0x62d000057f10 "jrandomvariableinimptpfillprogr"}, hash_code = 1897224776, marked = 0} ``` ```m YDB>write $view("region","^jrandomvariableinimptpfillprogr") DEFAULT ``` So the assert failure is because `gv_target` pointed to a global variable name that mapped to the `DEFAULT` region whereas the current region were writing the NULL record was for `DREG`. The cause of the assert was the call to `gvcst_redo_root_search()` in line 558. **sr_port/t_retry.c** ```c 545 case cdb_sc_gvtrootmod2: 546 if (!redo_root_search_done) 547 RESET_ALL_GVT_CLUES; 548 /* It is possible for a read-only transaction to release crit after detecting gvtrootmod2, during 549 * which time yet another root block could have moved. In that case, the MISMATCH_ROOT_CYCLES check 550 * would have already done the redo_root_search. 551 */ 552 assert(!redo_root_search_done || !update_trans); 553 if (WANT_REDO_ROOT_SEARCH) 554 { /* Note: An online rollback can occur DURING gvcst_redo_root_search, which can remove gbls 555 * from db, leading to gv_target->root being 0, even though failure code is not 556 * cdb_sc_onln_rlbk2 557 */ --> 558 gvcst_redo_root_search(); 559 } ``` The `cdb_sc_gvtrootmod2` failure code is because of a concurrent online rollback on the receiver side. But in order to handle it, we invoke the `gvcst_redo_root_search()` if the `WANT_REDO_ROOT_SEARCH` macro returned TRUE. The macro is defined as follows. **sr_port/t_retry.c** ```c 67 /* In mu_reorg if we are in gvcst_bmp_mark_free, we actually have a valid gv_target. Find its root before the next iteration 68 * in mu_reorg. 69 */ 70 #define WANT_REDO_ROOT_SEARCH \ --> 71 ( (NULL != gv_target) \ 72 && (DIR_ROOT != gv_target->root) \ 73 && !redo_root_search_done \ 74 && !TREF(in_gvcst_redo_root_search) \ 75 && !mu_reorg_upgrd_dwngrd_in_prog \ 76 && !mu_reorg_encrypt_in_prog \ 77 && (!TREF(in_gvcst_bmp_mark_free) || mu_reorg_process) \ 78 ) ``` Line 71 is the issue. `gv_target` should have been `NULL` in the `gvcst_jrt_null()` case since we are not dealing with any global name (the `NULL` journal record is to denote an empty transaction). Given the above analysis, the fix is to set `gv_target` to `NULL` in `sr_port/gvcst_jrt_null.c`. Additionally, `gv_currkey->base[0]` had to also be cleared to keep it in sync with gv_target in DEBUG code (just like is being already done in the `GVTR_SWITCH_REG_AND_HASHT_BIND_NAME` macro in `sr_unix/gv_trigger.h`). * Fixed `sr_unix/mupip_cmd_disallow.c` to disallow `ZEROBACKLOG` option if `SHUTDOWN` is not also specified. I noticed this issue while adding disallow code for `TRIGUPDATE`. The `ZEROBACKLOG` option was introduced in 60e7e2d (GT.M V6.3-000) which was many years ago but this option was allowed to be specified in various source server commands that had nothing to do with this option (for example, the command `mupip replic -source -deactivate -zerobacklog` was allowed but it did not make any sense). This misfeature is fixed in the current commit by generating a `%YDB-E-CLIERR` error for such meaningless commands. * Made `cstart` and `jstart` variables DEBUG_ONLY in `sr_port/repl_filter.c`. This removed all the `clang-tidy` warnings related to these variables in the following reference files. - ci/tidy_warnings_release_x86_64.ref - ci/tidy_warnings_release_aarch64.ref In addition, a `Value stored to 'prefix' is never read [clang-analyzer-deadcode.DeadStores]` warning also no longer shows up after all changes to `sr_port/repl_filter.c` in this commit. Not exactly sure where the change happened but not spending time on it since the warning has now disappeared. This meant removing one line of warning from the following reference files. - ci/tidy_warnings_release_x86_64.ref - ci/tidy_warnings_release_aarch64.ref - ci/tidy_warnings_debug_aarch64.ref - ci/tidy_warnings_debug_x86_64.ref * Enhanced `ci/create_tidy_warnings.sh` to capture detail in case of compilation errors. As a background, I had a `clang-tidy-amd64` pipeline job fail because my changes to `sr_port/repl_filter.c` had a compilation error. In this case, the `clang-tidy-14` call in `ci/create_tidy_warnings.sh` exited with a non-zero status but since we had redirected the stderr to `/dev/null` (i.e. `2>/dev/null`) we had no error text to look at to see why it exited abnormally. I had to run the same command locally to determine the cause of the error. This is fixed by redirecting stderr to `$output_dir/tidy_warnings.err`. And if the `clang-tidy` exited with a non-zero status, we examine this file for any lines containing `error` (case insensitive). We print those lines in the standard output. That way it is more likely to give helpful information as to which C file had compilation troubles. In my case, I saw the following 2 lines of extra output indicating `sr_port/repl_filter.c` had compilation errors. ``` 4617 warnings and 20 errors generated. Error while processing /builds/nars1/YDB/sr_port/repl_filter.c. ```

…ctly returning FALSE Background ---------- * The `stress_1/concurr_small` subtest failed in a rare test run on an AARCH64 system with the following symptom. ```diff --- concurr_small/concurr_small.diff --- 118a119,164 > host:stress_1/concurr_small/instance2/stress_oli.out > %YDB-F-ASSERT, Assert failed in sr_unix/ss_release.c line 184 for expression (FALSE) ``` * Below is the C-stack of the core file. ``` (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 ss_release (ss_ctx=0xffffeb411fc8) at sr_unix/ss_release.c:184 #7 mupip_integ () at sr_port/mupip_integ.c:813 #8 mupip_main (argc=7, argv=0xffffeb416008, envp=0xffffeb416048) at sr_unix/mupip_main.c:122 #9 dlopen_libyottadb (argc=7, argv=0xffffeb416008, envp=0xffffeb416048, main_func=0xaaaad2db1650 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #10 main (argc=7, argv=0xffffeb416008, envp=0xffffeb416048) at sr_unix/mupip.c:22 ``` Issue ----- * The assert that failed was in line 184 below. **sr_unix/ss_release.c** ```c 173 if (dba_bg == cs_data->acc_meth) 174 { 175 /* Now that we have crit, wait for any pending phase2 updates to finish. Since phase2 updates happen 176 * outside of crit, we dont want them to keep writing to the snapshot file even after the snapshot 177 * is complete. This is needed as otherwise a GT.M process might see the value of csa->snapshot_in_prog 178 * as TRUE and before it can proceed any further(starvation, maybe), we went ahead and removed the 179 * snapshot file(below). Now, if the GT.M process resumes execution, it might end up writing 180 * the before image to a temporary file which is no longer available. 181 */ 182 if (cnl->wcs_phase2_commit_pidcnt && !wcs_phase2_commit_wait(csa, NULL)) 183 { 184 assert(FALSE); 185 gtm_putmsg_csa(CSA_ARG(csa) VARLSTCNT(7) ERR_COMMITWAITSTUCK, 5, process_id, 1, 186 cnl->wcs_phase2_commit_pidcnt, DB_LEN_STR(gv_cur_region)); 187 } 188 } ``` * The reason it failed was that `cnl->wcs_phase2_commit_pidcnt` was a non-zero value at line 182 above but the call to `wcs_phase2_commit_wait()` returned FALSE at line 182. * Below is the relevant code flow inside that function call. **sr_port/wcs_phase2_commit_wait.c** ```c 153 if (NULL == cr) 154 { 155 value = cnl->wcs_phase2_commit_pidcnt; 156 if (!value) 157 return TRUE; 158 if (lcnt == lcnt_isprcalv_next) 159 { /* Do "is_proc_alive" check. This section is very similar to the "NULL == cr" section 160 * at the end of this module in terms of book-keeping array maintenance. 161 */ 162 crarray_index = 0; 163 any_blocked = FALSE; 164 for (curcr = cr_lo; curcr < cr_top; curcr++) 165 { 166 blocking_pid = curcr->in_tend; 167 if (!blocking_pid || (blocking_pid == process_id)) 168 continue; 169 any_blocked = TRUE; . . 210 } 211 if (was_crit && (crarray_index || !any_blocked) && (curcr == cr_top)) 212 { /* We hold crit, found no alive pids in phase2 commit, and found either at least one 213 * dead pid or no blocking pids, the latter indicating that the wcs_phase2_commit_pidcnt 214 * is wrong. 215 * No need to wait any more. Return FALSE right away. Caller will invoke "wcs_recover" 216 * to fix the situation. 217 */ 218 return FALSE; 219 } ``` * What I suspect happened is that at line 155, we found `value` to be a non-zero value. This means a concurrent process is in phase2 of commit and some cache-record has a non-zero value of `cr->in_tend` pointing to that process id. But before we went to examine that particular cache-record in line 167, the concurrent process that was in the phase2 commit finished its phase2 commit and reset `cr->in_tend` back to 0. Therefore, we found no cache-record with a non-zero `cr->in_tend` in the `for` loop (lines 164 to 210). This means `any_blocked` was still set to FALSE when we came to line 211. And since we were holding crit when we entered `wcs_phase2_commit_wait()`, the variable `was_crit` was set to TRUE. Therefore, we were in a situation where `was_crit` was TRUE, `any_blocked` was FALSE and `curcr` was equal to `cr_top` because we scanned all the cache-records in the `for` loop of line 164. And because of this, we went inside the `if` at line 211 above and returned FALSE at line 218. And this caused the assert failure in the caller function. In this case, returning FALSE was incorrect. The comment at line 213 indicates that we found `no blocking pids` and that this indicates `that the wcs_phase2_commit_pidcnt is wrong`. But there is nothing wrong with the count. The count changed concurrently in between line 155 and 211 and we did not take that into account. Fix --- * The fix is simple and is to `re-check` the value of `cnl->wcs_phase2_commit_pidcnt` in line 211 after checking `any_blocked` and if it is still non-zero, only then return FALSE as it then clearly indicates something wrong with the `wcs_phase2_commit_pidcnt` counter. * So we would continue to the next iteration of the outer `for` loop (based on the `lcnt` variable) and that iteration would see `value` to be 0 at line 155 and return TRUE at line 157 thereby avoiding this incorrect return. Notes ----- * In Debug builds, one would see an assert failure like in this case. * In Release builds, this particular caller `ss_release()` does not do that but most other callers of `wcs_phase2_commit_wait()` would call `wcs_recover()` in case the process is already holding crit and the function returns FALSE. And that would effectively be a no-op since by that time the `wcs_phase2_commit_pidcnt` is already 0. One would see some syslog messages indicating the unnecessary cache recovery occurred but otherwise no user-visible symptoms should happen. Therefore, no issue number is assigned to this commit.

* This commit addresses merge conflicts involving deleted files. * The list of deleted files was identified using the following commands as part of the prior commit (i.e. after the `git cherry-pick` command was run but before the `git commit` command was run). ```sh $ git status | grep 'deleted' deleted: sr_unix/buildaux_dbcertify.csh deleted: sr_unix/buildaux_ftok.csh deleted: sr_unix/buildaux_semstat2.csh deleted: sr_unix/dbcertify_cmd.c deleted: sr_unix/dbcertify_dbfilop.c deleted: sr_unix/dbcertify_parse_and_dispatch.c deleted by us: README deleted by us: sr_i386/cmerrors_ctl.c deleted by us: sr_i386/cmierrors_ctl.c deleted by us: sr_i386/gdeerrors_ctl.c deleted by them: sr_port/dbcertify.c deleted by them: sr_port/dbcertify.h deleted by them: sr_port/dbcertify_base_ch.c deleted by them: sr_port/dbcertify_certify_phase.c deleted by them: sr_port/dbcertify_funcs.c deleted by them: sr_port/dbcertify_scan_phase.c deleted by them: sr_port/dsewrap.mpt deleted by us: sr_port/get_spec.c deleted by them: sr_port/v5cbsu.m deleted by us: sr_unix/Makefile.mk deleted by them: sr_unix/dbcertify_deferred_exit_handler.c deleted by them: sr_unix/dbcertify_signal_handler.c deleted by them: sr_unix/ftok.c deleted by us: sr_unix/gtcmstub.c deleted by us: sr_unix/gtm_startup_chk.c deleted by us: sr_unix/gtm_test_install.csh deleted by us: sr_unix/gtm_tls_impl.c deleted by us: sr_unix/gtmcrypt_dbk_ref.c deleted by us: sr_unix/gtmcrypt_dbk_ref.h deleted by us: sr_unix/gtmcrypt_pk_ref.c deleted by us: sr_unix/gtmcrypt_ref.c deleted by us: sr_unix/gtmcrypt_util.c deleted by us: sr_unix/gtmcrypt_util.h deleted by us: sr_unix/gtminstall.sh deleted by us: sr_unix/gtmprofile.gtc deleted by us: sr_unix/kitstart.csh deleted by them: sr_unix/semstat2.c deleted by us: sr_unix_cm/rc_mval2subsc.c deleted by us: sr_unix_gnp/libgnpserver.list deleted by us: sr_x86_64/merrors_ansi.h deleted by us: sr_x86_64/merrors_ctl.c ``` * Those files that show up as just `deleted:` in the output above need no changes in this commit. This is because they would have been automatically deleted in the prior commit itself as those files were already deleted in the YottaDB side and were deleted in GT.M V7.0-000 in the GT.M side. * All files that show up as `deleted by us:` are deleted in this commit since those were deleted even before in the YDB git repository. That is a straightforward change. But each of these files needed to be reviewed to see if the GT.M changes to each file needs to be incorporated somewhere else in the YDB git repository. They are described below. - README : This file only has cosmetic changes in every GT.M release. No need to incorporate this into the README.md file (corresponding file in the YDB repository). - The following are all generated files and will be regenerated on the YDB side in a later commit so no need to incorporate any changes. The `sr_i386` files won't be regenerated as YottaDB does not support that architecture. Only the `sr_x86_64` files will be regenerated. - sr_i386/cmerrors_ctl.c - sr_i386/cmierrors_ctl.c - sr_i386/gdeerrors_ctl.c - sr_x86_64/merrors_ansi.h - sr_x86_64/merrors_ctl.c - sr_port/get_spec.c : This file was changed in the GT.M side to replace a `rts_error_csa()` call that issued a ERR_INVSPECREC error with the new `RTS_ERROR_CSA_ABT()` call. But as part of 0b74bdc, we had moved this error to only be issued from `mu_int_blk.c` and we use the `mu_int_err()` function to issue the error there. So no need to incorporate the GT.M changes. - sr_unix/Makefile.mk : This file was changed to issue a `UTF-8 mode library installation may fail if gtm_icu_version is not set` warning in the GT.M side. That warning had already been added to `YDBEncrypt/Makefile` in the YottaDB side as part of the first commit. So no need to incorporate the GT.M changes. - sr_unix/gtcmstub.c : This file was changed in the GT.M side to replace a `rts_error()` call with `RTS_ERROR_ABT()`. No other real change. Since this file itself has been deleted on the YottaDB side, no need to incorporate this otherwise cosmetic GT.M change. - sr_unix/gtm_startup_chk.c : The GT.M changes were - In the `gtm_chk_dist()` function to change `rts_error_csa()` with `RTS_ERROR_CSA_ABT` and remove the `CSA_ARG` from `CSA_ARG(NULL)` usages (as the first parameter). Those changes are now incorporated into `sr_unix/ydb_chk_dist.c` which is the corresponding YottaDB file. - `gtm_dist_len` was moved from being a local variable to a global variable. This change was not incorporated because I think it has a bug hidden inside. The `gtm_chk_dist()` function does a `realpath()` call on the string stored in `gtm_dist[]` global buffer and stores the result in a local variable `real_gtm_dist_path[]`. And then it does a `STRNLEN(real_dist, GTM_PATH_MAX, gtm_dist_len);` call where `real_dist` is a pointer to `real_gtm_dist_path[]`. This call will reset the global variable `gtm_dist_len` to the real path length of the path stored in `gtm_dist[]`. In case that happens to be a soft link that points to a much longer actual path name, `gtm_dist_len` after this point will contain the length of the much longer actual path name whereas `gtm_dist[]` will continue to point to the shorter soft link path name. The two will become out of sync. This can cause problems later when they are used for example in `sr_unix/secshr_client.c` where one copies this buffer using the out-of-sync length into the `gtmsecshr_path` buffer (e.g. `memcpy(gtmsecshr_path, gtm_dist, gtm_dist_len);`). In this case, `gtmsecshr_pathname.len` is set to include `gtm_dist_len` and so would hold an inflated value. And can cause confusing error messages. All of this only if `$gtm_dist` is a soft link to another path. In any case, all of this complexity can be avoided by not updating the global variable `gtm_dist_len` in this file and is why the GT.M change is not incorporated into `sr_unix/ydb_chk_dist.c`. - sr_unix/gtm_test_install.csh : This file was removed in the YottaDB side a while ago. So no need to incorporate any GT.M changes. - sr_unix/gtm_tls_impl.c : This file had been moved to YDBEncrypt/gtm_tls_impl.c (a different git repository) so I had to use the following commands to incorporate the changes. The following command (run in the YDB git repo) creates a patch file containing the GT.M V7.0-000 changes. ```sh cd YDB git show tags/V7.0-000 -- sr_unix/gtm_tls_impl.c > /tmp/patch ``` The following command (run in the YDBEncrypt git repo) applies the patch to the YDBEncrypt repo file. ```sh cd YDBEncrypt patch -i /tmp/patch gtm_tls_impl.c ``` The patch command produced the following output. ``` Hunk #1 succeeded at 1 with fuzz 1. Hunk #2 succeeded at 711 (offset 37 lines). Hunk #3 FAILED at 827. Hunk #4 FAILED at 836. Hunk #5 succeeded at 1175 (offset 38 lines). Hunk #6 succeeded at 1386 (offset 41 lines). Hunk #7 FAILED at 1358. Hunk #8 FAILED at 1430. Hunk #9 succeeded at 1753 (offset 83 lines). 4 out of 9 hunks FAILED -- saving rejects to file gtm_tls_impl.c.rej ``` The 5 Hunks that succeeded needed no intervention. The 4 Hunks that FAILED needed more review/intervention. Let us take `Hunk #3` as an example. I looked at `gtm_tls_impl.c.rej` (created by the patch command) and found the first hunk there. This would be `Hunk #3` since that was the first FAILED Hunk in the above output. Below is the corresponding diff. ```diff @@ -827,8 +827,9 @@ STATICFNDEF gtmtls_passwd_list_t *gtm_tls_find_pwent(const char *input_env_name) int gtm_tls_store_passwd(gtm_tls_ctx_t *tls_ctx, const char *tlsid, const char *obs_passwd) { - char *env_name_ptr, env_name[PASSPHRASE_ENVNAME_MAX]; - int env_name_len, obs_len; + char env_name[PASSPHRASE_ENVNAME_MAX]; + size_t env_name_idx; + size_t env_name_len, obs_len; gtmtls_passwd_list_t *pwent_node; passwd_entry_t *pwent; ``` The reason the patch failed is because the variables `env_name_ptr`, `env_name` and `env_name_len` have been removed in YDBEncrypt/gtm_tls_impl.c. So I discarded that part of the diff. But found that the type of `obs_len` has been changed from `int` to `size_t`. Therefore manually incorporated just that change into YDBEncrypt/gtm_tls_impl.c. Similar analysis/incorporation was done for each FAILED Hunk in the above output. Once all FAILED Hunks were addressed, the following files were removed as they were no longer needed. - gtm_tls_impl.c.orig - gtm_tls_impl.c.rej At a high level, all GT.M changes were picked up from the FAILED Hunks (`Hunk 4`, `Hunk 7` and `Hunk 8`). - sr_unix/gtmcrypt_dbk_ref.c : Used the same procedure as described in the `sr_unix/gtm_tls_impl.c` bullet above, merged the GT.M changes from YDB to YDBEncrypt/gtmcrypt_dbk_ref.c. - sr_unix/gtmcrypt_dbk_ref.h : This one was also a YDB -> YDBEncrypt patch merge. This one succeeded without any FAILED Hunks. - sr_unix/gtmcrypt_pk_ref.c : This one was also a YDB -> YDBEncrypt patch merge. This one has 1 FAILED Hunk. Manually retrofitted GT.M change on the FAILED Hunk. - sr_unix/gtmcrypt_ref.c : This one was also a YDB -> YDBEncrypt patch merge. This one succeeded without any FAILED Hunks. - sr_unix/gtmcrypt_util.c : This one was also a YDB -> YDBEncrypt patch merge. This one has 1 FAILED Hunk. Manually retrofitted GT.M change on the FAILED Hunk. - sr_unix/gtmcrypt_util.h : This one was also a YDB -> YDBEncrypt patch merge. This one has 1 FAILED Hunk. Manually retrofitted GT.M change on the FAILED Hunk. - sr_unix/gtminstall.sh : This one was a YDB -> YDB patch merge. From sr_unix/gtminstall.sh to sr_unix/ydbinstall.sh. This one was complicated because in the GT.M side, they removed support for a specific ICU version when specifying `--utf8`. Previously one needed to specify `--utf8 default` or `--utf8 7.1` (to indicate ICU version 7.1 needs to be used etc.). Since the GT.M change to no longer require `default` made sense, that was picked to be retrofitted onto the YottaDB side. Since `--utf8 default` is used in a lot of places across various YDB repositories (see list below) and could be relied upon by some user, it was decided to continue supporting it but treat it as if `--utf8` was specified (i.e. without `default`). But if the user specified `--utf8 7.1`, that will issue an error even in ydbinstall.sh as it is no longer a supported usage. In addition, found 8fe2516 had made a change to set `tmpdir` variable but the immediately next line used `$tmp` instead of `$tmpdir` to set `ydb_routines` env var. This clearly will result in a broken `ydb_routines` env var since `$tmp` will be the empty string in that case. But this did not break `ydbinstall.sh --plugins-only` so far so I concluded that setting `ydb_routines` was unnecessary altogether. So I removed that line instead of fixing the `$tmp` occurrence to `$tmpdir`. Additionally, I found a lot of `--utf8 default` usages across various repositories. Fixed all of those as well. The 2 `YDB` occurrences below were fixed in this commit. The occurrences in other repos were fixed in commits in those repos. ```sh YDBDoc/Plugins/ydbaim.rst: sudo ./ydbinstall.sh --utf8 default --verbose --aim YDBDoc/Plugins/ydbzlib.rst: sudo ./ydbinstall.sh --utf8 default --verbose --zlib YDBDoc/AdminOpsGuide/installydb.rst: sudo --preserve-env=ydb_icu_version ./ydbinstall.sh --installdir /opt/yottadb/ --utf8 default --verbose YDBDoc/AdminOpsGuide/installydb.rst: sudo --preserve-env=ydb_icu_version ./ydbinstall.sh --utf8 default --verbose YDBDoc/AdminOpsGuide/installydb.rst:For example, :code:`ydbinstall --from-source https://gitlab.com/ydbuser/YDB.git --branch working --utf8 default --aim --install-directory /usr/local/lib/yottadb/devel_$(date +%Y%m%d)` will checkout, build, and install the :code:`working` branch of YottaDB from the YDB repository of GitLab user :code:`ydbuser` in a date-stamped directory, along with the `Application Independent Metadata plugin <https://gitlab.com/YottaDB/Util/YDBAIM/>`_. YDBDoc/Plugins/ydbposix.rst: sudo ./ydbinstall.sh --utf8 default --verbose --posix YDBDoc/MultiLangProgGuide/MultiLangProgGuide.rst: - Run it with your choice of directory where you want it installed (omit the :code:`--verbose` option for less output): :code:`sudo ./ydbinstall.sh --utf8 default --verbose`. If you do not specify an installation directory with :code:`--installdir`, the script installs YottaDB in :code:`/usr/local/lib/yottadb/r###` where :code:`r###` is the release, e.g., :code:`r130`. YDBDoc/AcculturationGuide/acculturation.rst:- Run it (omit the :code:`--verbose` option if you want less output): :code:`sudo ./ydbinstall.sh --utf8 default --verbose` (This command installs YottaDB under :code:`/usr/local/lib/`.) YDBTest/docker/build_and_install_yottadb.csh:./ydbinstall --installdir=$gtm_root/$verno/dbg --utf8 default --keep-obj --ucaseonly-utils --prompt-for-group YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --linkexec --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --linkexec --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --nolinkexec --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --linkenv --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --linkenv --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --nolinkenv --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --linkexec $5 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --linkexec $5 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --linkenv $6 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --linkenv $6 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --copyexec --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --copyenv --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --copyexec $5 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb880.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $4 --copyenv $6 --utf8 default --user $USER $7 --nopkg-config --overwrite-existing YDBTest/sudo/u_inref/ydb894.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --nopkg-config --overwrite-existing --utf8 default --user $USER $4 YDBTest/sudo/u_inref/ydb910.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --from-source --overwrite-existing --utf8 default --user $USER $4 > ydbinstall_fromsource.txt 2>&1 YDBTest/sudo/u_inref/ydb924.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --overwrite-existing --utf8 default --dry-run $4 YDBTest/sudo/u_inref/ydb924.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --overwrite-existing --utf8 default $4 YDBTest/sudo/u_inref/ydb924.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --overwrite-existing --utf8 default --dry-run $4 YDBTest/sudo/u_inref/ydb924.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --overwrite-existing --utf8 default $4 YDBTest/sudo/u_inref/ydb783.csh:$sudostr /Distrib/YottaDB/$gtm_verno/$tst_image/yottadb_r*/ydbinstall --utf8 default --installdir $PWD/YDBUTF8 $installoptions YDBTest/sudo/u_inref/ydb783.csh:$sudostr /Distrib/YottaDB/$gtm_verno/$tst_image/yottadb_r*/ydbinstall --utf8 default --keep-obj --installdir $PWD/YDBUTF8NOSO $installoptions YDBTest/sudo/u_inref/ydb306.csh:#Tests that ydbinstall will not give an error when run with --zlib and --utf8 default YDBTest/sudo/u_inref/ydb306.sh:/Distrib/YottaDB/$1/$2/yottadb_r*/ydbinstall --installdir $3 --zlib --overwrite-existing --utf8 default --user $USER $4 YDBOcto/README.md:sudo ./ydbinstall.sh --utf8 default --verbose --octo YDBOcto/doc/developer_doc.rst: ./ydbinstall.sh --overwrite-existing --utf8 default --aim --posix YDBPython/tools/vm_scripts/clean_vm_test.sh: sudo ./ydbinstall.sh --force-install --utf8 default --verbose YDBPython/tools/vm_scripts/clean_vm_test.sh: sudo ./ydbinstall.sh --utf8 default --verbose YDBOctoVistA/OctoVistA.md:./ydbinstall --force-install --ucaseonly-utils --utf8 default --installdir /opt/yottadb/r1.25_x86_64 YDB/Dockerfile: /tmp/ydb-release/ydbinstall --utf8 default --installdir /opt/yottadb/current && \ YDB/Dockerfile-rocky: /tmp/ydb-release/ydbinstall --utf8 default --installdir /opt/yottadb/current && \ ``` - sr_unix/gtmprofile.gtc : This file was deleted on the YottaDB side a while ago (by 52822ee). The file `ydb_env_set` takes care of this functionality on the YottaDB side. The changes in the GT.M side for V7.0-000 to this file were to enhance error handling while trying to install GT.M with UTF-8 support. Such issues have already been ironed out in `ydb_env_set` for the YottaDB side so no need to incorporate any of this GT.M change into the YottaDB side. - sr_unix/kitstart.csh : This file deals with building the binary kits and has long been deleted on the YottaDB side. So no need to incorporate the changes to this file on the GT.M side (which are to remove `dbcertify` from the binary kit and to take the `configure.gtc` interface change into account where it now expects one less answer due to icu version no longer being supplied to it). - sr_unix_cm/rc_mval2subsc.c : This file is almost a duplicate of `mval2subsc.c` and so was removed on the YottaDB side a while ago. The changes to this file in the GT.M side are to change `rts_error_csa` calls with `RTS_ERROR_CSA_ABT`. Those changes will also show up in `mval2subsc.c` and will be addressed there. So no need to incorporate any of the `rc_mval2subsc.c` changes on the YottaDB side. - sr_unix_gnp/libgnpserver.list : This file was removed on the YottaDB side (as part of bf577ff) as there was no need to maintain a list of files related to individual utilities. The GT.M side added a new function `gtcmtr_get_key` to this file. New files will automatically get linked in to `libyottadb.so` which is what all utilities `dlopen()` so there is no need to incorporate the GT.M change to the YottaDB side. * All files that show up as `deleted by them:` are deleted in this commit since those were deleted in the GT.M repository. That is a straightforward change as we just need to delete them from the YottaDB side too. Below is the list of those files. - sr_port/dbcertify.c - sr_port/dbcertify.h - sr_port/dbcertify_base_ch.c - sr_port/dbcertify_certify_phase.c - sr_port/dbcertify_funcs.c - sr_port/dbcertify_scan_phase.c - sr_port/dsewrap.mpt - sr_port/v5cbsu.m - sr_unix/dbcertify_deferred_exit_handler.c - sr_unix/dbcertify_signal_handler.c - sr_unix/ftok.c - sr_unix/semstat2.c

… it can cause hang with CLANG/ASAN Background ---------- * While running the YDBOcto tests with CLANG, I noticed various tests hang. All of them had a similar stack-trace. ```c (gdb) where #0 __sanitizer::FutexWait(__sanitizer::atomic_uint32_t*, unsigned int) () #1 __sanitizer::Semaphore::Wait() () #2 __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >::GetFromAllocator(__sanitizer::AllocatorStats*, unsigned long, unsigned int*, unsigned long) () #3 __sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::Refill(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >::PerClass*, __sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >*, unsigned long) () #4 __sanitizer::CombinedAllocator<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> >, __sanitizer::LargeMmapAllocatorPtrArrayDynamic>::Allocate(__sanitizer::SizeClassAllocator64LocalCache<__sanitizer::SizeClassAllocator64<__asan::AP64<__sanitizer::LocalAddressSpaceView> > >*, unsigned long, unsigned long) () #5 __asan::Allocator::Allocate(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*, __asan::AllocType, bool) () #6 __asan::asan_calloc(unsigned long, unsigned long, __sanitizer::BufferedStackTrace*) () #7 calloc () #8 __pthread_attr_extension (attr=0x7f29af3cee48) at ./nptl/pthread_attr_extension.c:28 #9 __GI___pthread_attr_setaffinity_np (attr=attr@entry=0x7f29af3cee48, cpusetsize=cpusetsize@entry=32, cpuset=cpuset@entry=0x603000001b40) at ./nptl/pthread_attr_setaffinity.c:45 #10 __pthread_getattr_np (thread_id=139817006390848, attr=0x7f29af3cee48) at ./nptl/pthread_getattr_np.c:194 #11 __sanitizer::GetThreadStackTopAndBottom(bool, unsigned long*, unsigned long*) () #12 __sanitizer::GetThreadStackAndTls(bool, unsigned long*, unsigned long*, unsigned long*, unsigned long*) () #13 __asan::PlatformUnpoisonStacks() () #14 __asan_handle_no_return () #15 generic_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:187 #16 ydb_os_signal_handler (sig=15, info=0x7f29af3cfbf0, context=0x7f29af3cfac0) at sr_unix/ydb_os_signal_handler.c:85 #17 <signal handler called> #18 sched_yield () at ../sysdeps/unix/syscall-template.S:120 #19 __sanitizer::StopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), void*) () #20 __lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*) () #21 __GI___dl_iterate_phdr (callback=0x55bd48373320 <__lsan::LockStuffAndStopTheWorldCallback(dl_phdr_info*, unsigned long, void*)>, data=0x7ffe13010eb8) at ./elf/dl-iteratephdr.c:74 #22 __lsan::LockStuffAndStopTheWorld(void (*)(__sanitizer::SuspendedThreadsList const&, void*), __lsan::CheckForLeaksParam*) () #23 __lsan::CheckForLeaks() () #24 __lsan::DoLeakCheck() () #25 __cxa_finalize (d=0x55bd483af128) at ./stdlib/cxa_finalize.c:83 #26 __do_global_dtors_aux () #27 ?? () #28 _dl_fini () at ./elf/dl-fini.c:142 ``` Issue ----- * The YottaDB SIG-15/SIGTERM signal handler got invoked for a SIG-15. But it noticed that all YottaDB exit handler code has already been run (`exit_handler_complete` global variable is TRUE). In that case, it invoked any non-YottaDB signal handler for SIG-15 and afterwards, it invoked `_exit()` to terminate the process (in line 187). **sr_unix/generic_signal_handler.c** ```c 182 if (exit_handler_complete) 183 { 184 if (!using_alternate_sighandling) /* Go does not send us signals so no need to forward */ 185 { 186 drive_non_ydb_signal_handler_if_any("generic_signal_handler1", sig, info, context, TRUE); 187 UNDERSCORE_EXIT(-sig); 188 } 189 return; /* Nothing we can do if exit handler has run */ 190 } ``` * And because of the `_exit()` all, the CLANG/ASAN library ended up doing a `calloc()` call which hung waiting for a futex. Most likely due to re-entrant invocations of C library functions that are not async-signal safe. * The cause of this is line 187 above in my opinion. * If YottaDB exit handler has already run (as part of SIGTERM handling) and we are getting the SIGTERM signal again, then I don't see any reason to do the `_exit()` call (using the `UNDERSCORE_EXIT` macro in line 187). * This code has been there for a long time but I don't think it is doing the right thing. Fix --- * Lines 184-188 are now removed in this commit. I think the right thing to do is to just return in case the YottaDB exit handler has already been invoked. * With this change, I verified that the CLANG/ASAN tests run fine in YDBOcto. So at least one Simple API use case runs fine with the fix in this commit. * Initially I thought of disabling lines 184-188 above only when ASAN is enabled. But then I realized it is a good change for all cases and so removed lines 184-188.

… detect signal/timer handling Background ---------- * We had one rare test failure during in-house testing. The `ideminter_rolrec/mupipstop_rollback_or_recover` subtest failed with the following symptom. ```sh $ cat ROLLBACK1_3.logx mupip journal -ROLLBACK -back -verify -verbose "*" -noonline -resync=369813 -lost=ROLLBACK1_3.lost Sat Sep 9 04:17:18 PM EDT 2023 . . %YDB-I-MUJNLSTAT, Forward processing started at Sat Sep 9 16:19:23 2023 %YDB-I-MUINFOUINT8, mur_process_seqno_table returns min_broken_seqno : 18446744073709551615 [0xFFFFFFFFFFFFFFFF] %YDB-I-MUINFOUINT8, mur_process_seqno_table returns losttn_seqno : 369813 [0x000000000005A495] %YDB-I-MUINFOSTR, Module : mur_forward:at the start at Sat Sep 9 16:19:23 2023 . . %YDB-I-MUINFOSTR, Journal file : ideminter_rolrec_0/mupipstop_rollback_or_recover/g.mjl_2023252161233 %YDB-I-MUINFOUINT4, Record Offset : 65744 [0x000100D0] %YDB-F-FORCEDHALT, Image HALTed by MUPIP STOP %YDB-F-ASSERT, Assert failed in sr_unix/db_ipcs_reset.c line 110 for expression (((TREF(dio_buff)).aligned != (char *)(csd)) || (!timer_in_handler && !multi_thread_in_use)) Sat Sep 9 04:20:35 PM EDT 2023 The time the mupip command took: 197 ``` * The core file corresponding to the above assert failure had the following stack trace. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140217990231872) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140217990231872, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fff160fdc00) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 db_ipcs_reset (reg=0x563c77a1c0b0) at sr_unix/db_ipcs_reset.c:110 #9 mur_close_files () at sr_port/mur_close_files.c:841 #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:116 #11 signal_exit_handler (exit_handler_name=0x7f870b624acc "deferred_exit_handler", sig=15, info=0x7f870b7856a8 <stapi_signal_handler_oscontext+3320>, context=0x7f870b785728 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78 #12 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #13 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74 #14 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862 #15 wcs_stale (tid=94817705118848, hd_len=8, region=0x563c77924b08) at sr_port/t_end_sysops.c:1445 #16 timer_handler (why=0, info=0x7f870b787088 <stapi_signal_handler_oscontext+9944>, context=0x7f870b787108 <stapi_signal_handler_oscontext+10072>, is_os_signal_handler=0) at sr_unix/gt_timers.c:913 #17 check_for_deferred_timers () at sr_unix/gt_timers.c:1312 #18 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78 #19 wcs_wtstart (region=0x563c77a1cc80, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:862 #20 wcs_timer_start (reg=0x563c77a1cc80, io_ok=1) at sr_port/t_end_sysops.c:1344 #21 op_tcommit () at sr_port/op_tcommit.c:535 #22 mur_output_record (rctl=0x563c77a28a40) at sr_port/mur_output_record.c:323 #23 mur_forward_play_cur_jrec (rctl=0x563c77a28a40) at sr_port/mur_forward_play_cur_jrec.c:362 #24 mur_forward_multi_proc (rctl=0x563c77a28a40) at sr_port/mur_forward.c:400 #25 gtm_multi_proc (fnptr=0x7f870ae20f00 <mur_forward_multi_proc>, ntasks=1, max_procs=1, ret_array=0x563c7cb21a40, parm_array=0x563c77a27c40, parmElemSize=512, extra_shm_size=2640, init_fnptr=0x7f870ae2b9f0 <mur_forward_multi_proc_init>, finish_fnptr=0x7f870ae2bc10 <mur_forward_multi_proc_finish>) at sr_unix/gtm_multi_proc.c:122 #26 mur_forward (min_broken_time=4294967295, min_broken_seqno=18446744073709551615, losttn_seqno=369813) at sr_port/mur_forward.c:158 #27 mupip_recover () at sr_port/mupip_recover.c:588 #28 mupip_main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip_main.c:122 #29 dlopen_libyottadb (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0, main_func=0x563c761b1004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #30 main (argc=10, argv=0x7fff1610a958, envp=0x7fff1610a9b0) at sr_unix/mupip.c:22 (gdb) p gtm_threadgbl_true->dio_buff.aligned $5 = 0x563c78429000 "GDSDYNUNX04" (gdb) p csd $6 = (sgmnt_data_ptr_t) 0x563c78429000 (gdb) p timer_in_handler $1 = 1 (gdb) p multi_thread_in_use $2 = 0 (gdb) p forced_exit $3 = 2 (gdb) p exit_handler_active $4 = 1 (gdb) p in_os_signal_handler $1 = 0 ``` Issue ----- * The assert failure was in the db_ipcs_reset() -> DB_LSEEKREAD -> DBG_CHECK_DIO_ALIGNMENT. * The `DBG_CHECK_DIO_ALIGNMENT` macro had the following comment. ```c 53 /* If we are using the global variable "dio_buff.aligned", then we better not be executing in timer \ 54 * code or in threaded code (as we have only ONE buffer to use). Assert that. \ 55 */ \ 56 assert(((TREF(dio_buff)).aligned != (char *)(buff)) || (!timer_in_handler && !multi_thread_in_use)); \ ``` * In the failure case, even though we are executing in timer code we are actually in exit handler code (as can be seen by the `forced_exit` and `exit_handler_active` variables in the gdb analysis above). In this case, the exit handler code will not return out of the timer code and so it is okay for the assert to not be TRUE. * The global variable being checked in the assert is `timer_in_handler`. This is where the issue is. That global variable being TRUE just means the `timer_handler()` function is in the current call stack. It does not mean that we are handling a SIGALRM/timer signal and interrupting the mainline code. The assert is intended to protect against signal handler interrupting the mainline code. Therefore, the correct global variable to check in the assert is `in_os_signal_handler`. Fix --- * The fix is simple and is to use `in_os_signal_handler` instead of `timer_in_handler` in the assert.

… to a silent merge conflict in sr_port/jobinterrupt_event.c * The `v52000/D9G12002636` subtest in the YDBTest project failed with the following symptom. ``` > %YDB-F-ASSERT, Assert failed in sr_port/jobinterrupt_event.c line 85 for expression ((SIGUSR1 == sig_num) || (SIGUSR2 == sig_num)) ``` * The below simple M program demonstrates the same issue. ```m $ cat test.m do ^sstep for do . hang 0.1 $ $ydb_dist/yottadb -run test . . ``` * If one tries to send a `mupip intrpt` to the `yottadb -run test` PID above, one would see the assert failure. * The reason for the assert failure is that in YottaDB, we can come to `jobinterrupt_set()` because of either a `SIGUSR1` or `SIGUSR2` signal. Whereas in GT.M side, we can come here only because of a `SIGUSR1`. Hence the YottaDB side had an assert that the signal number passed in better be one of the 2 signals. Whereas in GT.M side, there was no such assert. * In GT.M V7.0-001, as part of GTM-9333, outofband/asynchronous event processing was reworked to better handle various type of outofband events like Ctrl-C, MUPIP INTRPT etc. See http://tinco.pair.com/bhaskar/gtm/doc/articles/GTM_V7.0-001_Release_Notes.html#GTM-9333 for more details. As part of that work, it is possible for a jobinterrupt outofband event to be moved into a pending state and activated at a later point when no other competing outofband events (like $ETRAP or $ZSTEP) are active. The `TRY_EVENT_POP` macro (defined in `sr_port/deferred_events_queue.h` and invoked from routines like `op_unwind.c`) would trigger the jobinterrupt handling when it is a safe/logical point. In that case, `jobinterrupt_set()` would be invoked with the signal number that was saved when the original job interrupt signal was delivered. But no signal number was stored at that time and so the deferred invocation used a `sig_num` value of 0 as the following stack trace corresponding to the above assert failure indicates. ```c > #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 > #8 jobinterrupt_set (sig_num=0) at sr_port/jobinterrupt_event.c:85 > #9 xfer_set_handlers (event_type=3, param_val=0, popped_entry=1) at sr_port/deferred_events.c:120 > #10 op_unwind () at sr_port/op_unwind.c:158 > #11 stkok2 () at sr_x86_64/opp_ret.s:29 ``` The `param_val` parameter passed into frame 9 is 0 (from the TRY_EVENT_POP macro in op_unwind.c in frame 10). And that is where the issue is. * The issue is that the `param_val` was not properly stored originally. This is fixed by storing the value in the `jobinterrupt_set()` function in this commit. * Additionally, a few asserts were added in `save_xfer_queue_entry()` and `pop_real_xfer_queue_entry()` functions in `sr_port/deferred_events_queue.c` as they serve as good check points.

…mplete deferred state setup before invoking xfer_set_handlers() * After merging GT.M V7.0-001, the following tests failed in rare cases. - -t dual_fail_extend -replic -st dual_fail2_mustop_sigquit - -t v60000 -replic -st gtm4525b * The failure symptom was the following. ```c (gdb) x/s gtm_threadgbl_true->util_outbuff 0x17d3ed8: "%YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 38 for expression (GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit))" ``` * And the C-stack was the following. ```c (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fffa2b5d390) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 deferred_signal_handler () at sr_port/deferred_signal_handler.c:38 #7 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48 #8 xfer_set_handlers (event_type=11, param_val=1730866112, popped_entry=0) at sr_port/deferred_events.c:191 #9 generic_signal_handler (sig=15, info=0x7f7167e24218 <stapi_signal_handler_oscontext+3320>, context=0x7f7167e24298 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305 #10 ydb_os_signal_handler (sig=15, info=0x7fffa2b5d9f0, context=0x7fffa2b5d8c0) at sr_unix/ydb_os_signal_handler.c:85 #11 <signal handler called> #12 __GI___clock_nanosleep (clock_id=1, flags=1, req=0x7fffa2b5e058, rem=0x0) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:78 #13 wait_for_repl_inst_unfreeze_nocsa_jpl (jpl=0x17ec240) at sr_port/anticipatory_freeze.h:517 #14 wait_for_repl_inst_unfreeze (csa=0x18f7040) at sr_port/anticipatory_freeze.h:547 #15 jnl_write_attempt (jpc=0x18f7a40, threshold=29324848) at sr_port/jnl_write_attempt.c:348 #16 jnl_flush (reg=0x189afe8) at sr_port/jnl_flush.c:57 #17 tp_tend () at sr_port/tp_tend.c:795 #18 op_tcommit () at sr_port/op_tcommit.c:497 (gdb) f 6 #6 0x00007f71672ae771 in deferred_signal_handler () at sr_port/deferred_signal_handler.c:38 38 assert(GET_DEFERRED_EXIT_CHECK_NEEDED || (1 != forced_exit)); ``` * The `SET_FORCED_EXIT_STATE` macro call (in frame 9 above) is where the issue is. **sr_port/have_crit.h** ```c 172 #define SET_FORCED_EXIT_STATE(SIG) \ 173 { \ 174 char *rname; \ 175 \ 176 GBLREF VSIG_ATOMIC_T forced_exit; \ 177 GBLREF int forced_exit_sig; \ 178 GBLREF boolean_t (*xfer_set_handlers_fnptr)(int4, void (*callback)(int4), int4 param, boolean_t popped_entry); \ 179 GBLREF void (*deferred_signal_set_fnptr)(int4 dummy_val); \ 180 \ 181 /* Below code is not thread safe as it modifies global variables "forced_exit" \ 182 * and "forced_exit_sig". \ 183 */ \ 184 assert(!INSIDE_THREADED_CODE(rname)); \ 185 assert((0 == forced_exit) || (1 == forced_exit)); \ --> 186 forced_exit = 1; \ 187 forced_exit_sig = SIG; /* Record the signal forcing us to exit */ \ 188 if (in_os_signal_handler) \ 189 { /* If we are inside an OS signal handler and therefore had to defer exit \ 190 * handling, treat this as an outofband event as this is checked by lots of \ 191 * potentially long-running commands in the runtime (e.g. HANG etc.) and we \ 192 * want all of those to automatically trigger process exit handling. \ 193 * The below invocation takes care of the signal as a deferred outofband event \ 194 * that gets handled at the earliest safe point. \ 195 */ \ 196 if (NULL != xfer_set_handlers_fnptr) \ --> 197 (*xfer_set_handlers_fnptr)(deferred_signal, deferred_signal_set_fnptr, 0, FALSE); \ 198 /* else: it is "gtmsecshr" in which case outofband does not apply */ \ 199 } \ 200 /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */ \ --> 201 SET_DEFERRED_EXIT_CHECK_NEEDED; \ 202 SET_FORCED_THREAD_EXIT; /* Signal any running threads to stop */ \ 203 SET_FORCED_MULTI_PROC_EXIT; /* Signal any parallel processes to stop */ \ 204 } ``` * Line 186 sets `forced_exit` and Line 201 sets the corresponding deferred event. But Line 197 ends up invoking `deferred_signal_handler()` which has an assert that expects Line 186 and 201 to have happened at the same time. * This is fixed by moving lines 188-199 above to execute AFTER lines 200-203. That way the state setup of the forced exit is finished first and then the outofband set up happens by the `xfer_set_handlers_fnptr` call. * Now that Lines 186 and 201 are executed BEFORE line 197 in this commit, the assert failure seen in the test failure should be automatically fixed.

* The `reorg/on_ntp_njnl_reorg` subtest failed once in 10 runs or so with the following assert. ```diff > reorg_5_6/on_ntp_njnl_reorg/reorg1.out > %YDB-F-ASSERT, Assert failed in sr_port/mu_split.c line 689 for expression (*top_off < max_fill) ``` * Below is the core analysis using gdb. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140573380031552) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140573380031552) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140573380031552, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7ffd10fd0290) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 locate_block_split_point (blk_stat=0x61d00029eb98, level=1, cur_blk_size=16, max_fill=4, last_rec_size=0x7ffd10fd0950, last_key=0x7ffd10fd0e10 "srqponm", last_keysz=0x7ffd10fd0920, top_off=0x7ffd10fd0940) at sr_port/mu_split.c:689 #9 mu_split (cur_level=0, i_max_fill=4096, d_max_fill=2908, blks_created=0x7ffd10fd15a0, lvls_increased=0x7ffd10fd15b0) at sr_port/mu_split.c:314 #10 mu_reorg (gl_ptr=0x62d00021fec0, exclude_glist_ptr=0x7ffd10fd2820, resume=0x7ffd10fd26e0, index_fill_factor=100, data_fill_factor=71, reorg_op=0) at sr_port/mu_reorg.c:356 #11 mupip_reorg () at sr_port/mupip_reorg.c:334 #12 mupip_main (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218) at sr_unix/mupip_main.c:121 #13 dlopen_libyottadb (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218, main_func=0x560122181020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #14 main (argc=5, argv=0x7ffd10fe51e8, envp=0x7ffd10fe5218) at sr_unix/mupip.c:22 (gdb) p *top_off $5 = 16 (gdb) p max_fill $6 = 4 (gdb) up #9 mu_split (cur_level=0, i_max_fill=4096, d_max_fill=2908, blks_created=0x7ffd10fd15a0, lvls_increased=0x7ffd10fd15b0) at sr_port/mu_split.c:314 314 status = locate_block_split_point(old_blk1_hist_ptr, level, old_blk1_sz, max_fill, (gdb) p old_blk1_sz $30 = 16 (gdb) p delta $31 = 56 (gdb) p blk_size $32 = 4096 (gdb) p reserve_bytes $33 = 4096 (gdb) p max_fill_sav $34 = 2908 (gdb) p bstar_rec_sz $35 = 12 (gdb) p cs_data->reserved_bytes $36 = 0 ``` * Below is the source code corresponding to frame 9. **sr_port/mu_split.c** ```c 299 if ((old_blk1_sz + delta) > (blk_size - reserve_bytes)) 300 { 301 split_required = TRUE; 302 if (level == gv_target->hist.depth) 303 { 304 create_root = TRUE; 305 if ((MAX_BT_DEPTH - 1) <= level) 306 return cdb_sc_maxlvl; /* maximum level reached */ 307 } 308 if (max_fill + bstar_rec_sz > old_blk1_sz) 309 { /* need more space than what was in the old block, so new block will be "too big" */ 310 if (((SIZEOF(blk_hdr) + bstar_rec_sz) == old_blk1_sz) && !mu_reorg_upgrd_dwngrd_in_prog) 311 return cdb_sc_oprnotneeded; /* Improve code to avoid this */ 312 max_fill = old_blk1_sz - bstar_rec_sz; 313 } 314 status = locate_block_split_point(old_blk1_hist_ptr, level, old_blk1_sz, max_fill, 315 &old_blk1_last_rec_size, new_blk1_last_key, &new_blk1_last_keysz, &new_leftblk_top_off); ``` * As the core analysis indicates, `max_fill` is 4 which is the reason the assert failed. And that happened at line 312. * And this is because `old_blk1_sz` was a small value of `16`. * But then I was wondering how come we got into the `if` at line 299 as we would get into it only if the block has lot of content and needs a split. * Turns out the issue is that `reserve_bytes` is `4096` which is the same as `blk_size`. And so the right hand side of the `>` check in line 299 was 0 which is why the `if` succeeded even though `old_blk1_sz + delta` was only `16 + 56` i.e. 72 bytes, which is a lot less than the value that would usually require a block split. * I then started looking at why `reserve_bytes` ended up being such a huge value when the file header field it mirrors (i.e. `cs_data->reserved_bytes`) is 0 as indicated in the gdb analysis above. * That is when I realized this happened in the following line. **sr_port/mu_split.c** ```c 233 reserve_bytes = i_max_fill; ``` * This is a bug that was introduced in GT.M V7.0-001 changes (52a92df) to `sr_port/mu_split.c`. And I see that this incorrect set of `reserve_bytes = i_max_fill` is removed in GT.M V7.1-000 (f9ca5ad). * f9ca5ad has a lot more changes to support `mupip reorg upgrade` so I cannot cherry-pick that commit. * Therefore, I am fixing this issue by removing line 233 in this commit. * In addition, 895c2d3a had fixed some `clang-tidy` warnings in `sr_port/mu_split.c` related to the `reserve_bytes` variable. There was a block of code that set `reserve_bytes` BEFORE line 233. And that was flagged as a `[clang-analyzer-deadcode.DeadStores]` warning. And so the previous block of code that set `reserve_bytes` was removed in 895c2d3a. And along with it, logic related to `available_bytes` was made `#ifdef DEBUG`. Both these changes are now removed in this commit as that previous block had correctly set `reserve_bytes = cs_data->reserved_bytes` (and is the line that remains even in GT.M V7.1-000 f9ca5ad). That said, the update of `available_bytes` inside the `#ifdef DEBUG` block continues to be used only by a later assert so that is still kept inside a `DEBUG_ONLY()` macro in the new code to avoid the following `clang-tidy` warning. ``` mu_split.c:warning: Value stored to 'available_bytes' is never read [clang-analyzer-deadcode.DeadStores] ```

…n is_free_blks_ctr_ok() (regression in ea9950a) * The resil_4/resil subtest failed in one rare in-house test run with the following symptom. ```diff > resil_4_6/resil/reorg4.out > %YDB-F-ASSERT, Assert failed in sr_port/bm_getfree.c line 401 for expression (FALSE) ``` * Below is relevant information from the core file. ```c (gdb) where #0 __pthread_kill_implementation (threadid=<optimized out>, signo=3, no_tid=<optimized out>) at ./nptl/pthread_kill.c:44 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fafb799c5a0) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 is_free_blks_ctr_ok () at sr_port/bm_getfree.c:401 #7 gdsfilext (blocks=100, filesize=12559321033, trans_in_prog=1) at sr_unix/gdsfilext.c:335 #8 bm_getfree (hint_arg=196239399, blk_used=0x7fafb80a4410, cw_work=3, cs=0x7fafb6d841c0 <cw_set>, cw_depth_ptr=0x7fafb80a4540) at sr_port/bm_getfree.c:185 #9 t_end (hist1=0x62d0001128c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:520 #10 mu_reorg (gl_ptr=0x62d000112040, exclude_glist_ptr=0x7ffc96944400, resume=0x7ffc96933f60, index_fill_factor=78, data_fill_factor=44, reorg_op=0) at sr_port/mu_reorg.c:367 #11 mupip_reorg () at sr_port/mupip_reorg.c:334 #12 mupip_main (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988) at sr_unix/mupip_main.c:121 #13 dlopen_libyottadb (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988, main_func=0x55aebf0d8420 <str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #14 main (argc=5, argv=0x7ffc96944958, envp=0x7ffc96944988) at sr_unix/mupip.c:22 (gdb) f 6 #6 0x00007fafb5ce61b2 in is_free_blks_ctr_ok () at sr_port/bm_getfree.c:401 401 assert(FALSE); /* In pro, we will simply skip counting this local bitmap. */ (gdb) list 379 for (free_blocks = 0, free_bml = 0; free_bml < local_maps; free_bml++) 380 { 381 #ifdef DEBUG 382 if ((0 != ydb_skip_bml_num) 383 && (BLKS_PER_LMAP <= (BLKS_PER_LMAP * free_bml)) 384 && ((BLKS_PER_LMAP * free_bml) < ydb_skip_bml_num)) 385 { 386 free_bml = (ydb_skip_bml_num / BLKS_PER_LMAP) - 1; 387 /* - 1 to compensate the "free_bml++" done in "for" loop line */ 388 free_blocks += (ydb_skip_bml_num - BLKS_PER_LMAP) / BLKS_PER_LMAP * (BLKS_PER_LMAP - 1); 389 continue; 390 } 391 #endif 392 bml = bmm_find_free((uint4)free_bml, (sm_uc_ptr_t)MM_ADDR(cs_data), local_maps); 393 if (bml < free_bml) 394 break; 395 free_bml = bml; 396 bml *= BLKS_PER_LMAP; 397 if (!(bmp = t_qread(bml, (sm_int_ptr_t)&cycle, &cr)) 398 || (BM_SIZE(BLKS_PER_LMAP) != ((blk_hdr_ptr_t)bmp)->bsiz) 399 || (LCL_MAP_LEVL != ((blk_hdr_ptr_t)bmp)->levl)) 400 { 401 assert(FALSE); /* In pro, we will simply skip counting this local bitmap. */ 402 continue; 403 } (gdb) p/x bml $4 = 0x200 (gdb) p/x ydb_skip_bml_num $3 = 0x2ec97f600 (gdb) p free_bml $5 = 1 ``` * The value of `bml` is 512 at line 401 and `ydb_skip_bml_num` is a non-zero value. This means we are in the HOLE section of the database file and should never do a `t_qread()` call on such a block. But we did it at line 397 and is why we ended up with an assert failure. * The value of `free_bml` is 1 at line 401. * If `free_bml` was 1 at line 382, the `if` block would have kicked in and recognized this is a hole and counted the free blocks for the HOLE and then move on to the non-HOLE section of the database file using the `continue` at line 389 in a different iteration of the for loop. * But this did not happen. Therefore, `free_bml` was 0 at line 382 but became 1 at line 401. * This means that `free_bml` was 0 at line 382 and got set to 1 at line 395. * That is, the `bmm_find_free()` call at line 392 was passed in `free_bml=0` as the first parameter and returned `bml=1`. That is, it found all blocks in bitmap block 0 as busy and so returned bitmap block 1 as that which has free space. * In this case, we should redo the `if` check in lines 381-391 right after setting `free_bml` in line 395. That would take care of skipping the `t_qread()` call for bitmap blocks in the HOLE section. This is exactly what is taken care of in this commit.

…ctl() (sr_unix/gtmsource_process.c) Background ---------- * The merge_5/tp_stress subtest failed on an in-house AARCH64 system in one rare run with the following symptom. ```diff > ##HOST##:##TEST_OUTPUT##/SRC_17_37_44.log > %YDB-F-ASSERT, Assert failed in sr_unix/gtmsource_process.c line 400 for expression ((GTMSOURCE_SENDING_JNLRECS != gtmsource_state) || ((0 == poll_time) || (GTMSOURCE_IDLE_POLL_WAIT == poll_time)) GTMTLS_ONLY(DEBUG_ONLY(|| renegotiation_pending))) ``` * Below is the code surrounding the failed assert. **sr_unix/gtmsource_process.c** ```c 392 /* Check if receiver sent us any control message. Typically, the traffic from receiver to source is very 393 * low compared to traffic in the other direction. More often than not, there will be nothing on the pipe 394 * to receive. Ideally, we should let TCP notify us when there is data on the pipe (async I/O on Unix and 395 * VMS). But, we are not there yet. Since we do a select() before a recv(), we won't block if there is 396 * nothing in the pipe. So, it shouldn't be an expensive operation even if done before every send. Also, 397 * in doing so, we react to an XOFF sooner than later. 398 */ 399 /* Make sure we don't sleep for an extended period of time if there is something to be sent across */ 400 assert((GTMSOURCE_SENDING_JNLRECS != gtmsource_state) 401 || ((0 == poll_time) || (GTMSOURCE_IDLE_POLL_WAIT == poll_time)) 402 GTMTLS_ONLY(DEBUG_ONLY(|| renegotiation_pending))); ``` * Below are relevant details from the gdb anaylsis of the resulting core file. ```c (gdb) where #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 gtmsource_recv_ctl () at sr_unix/gtmsource_process.c:400 #7 gtmsource_process () at sr_unix/gtmsource_process.c:1620 #8 gtmsource () at sr_unix/gtmsource.c:618 #9 mupip_main (argc=13, argv=0xffffc0f44a08, envp=0xffffc0f44a78) at sr_unix/mupip_main.c:121 #10 dlopen_libyottadb (argc=13, argv=0xffffc0f44a08, envp=0xffffc0f44a78, main_func=0xaaaacb921650 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #11 0x0000aaaacb920f34 in main (argc=13, argv=0xffffc0f44a08, envp=0xffffc0f44a78) at sr_unix/mupip.c:22 (gdb) p gtmsource_state $1 = GTMSOURCE_SENDING_JNLRECS (gdb) p poll_time $2 = 999 == REPL_POLL_WAIT (gdb) p renegotiation_pending $3 = 0 sr_port/repl_comm.h 88:#define GTMSOURCE_IDLE_POLL_WAIT 10 /* 10ms sleep in case nothing sent to the other side */ 89:#define REPL_POLL_WAIT (MILLISECS_IN_SEC - 1) /* Maximum time (in ms) for select()/poll() to timeout */ (gdb) p repl_tls.enabled $5 = 1 (gdb) p repl_tls.renegotiate_state $6 = REPLTLS_WAITING_FOR_RENEG_TIMEOUT (gdb) p next_renegotiate_hrtbt $7 = 1 (gdb) p heartbeat_stalled $8 = 0 (gdb) p hrtbt_cnt $10 = 12 (gdb) p renegotiate_factor $11 = 12 (gdb) p next_renegotiate_hrtbt $12 = 1 (gdb) p gtmsource_options.renegotiate_interval $13 = 180 ``` Issue ----- * As indicated in line 399 in the below code, the assert at line 400 is to make sure that we don't wait for an extended period of time in the REPL_RECV_LOOP at line 433. Hence the check for `poll_time` being 0 or 10 milliseconds. Or that a TLS renegotiation is pending. **sr_unix/gtmsource_process.c** ```c 392 /* Check if receiver sent us any control message. Typically, the traffic from receiver to source is very 393 * low compared to traffic in the other direction. More often than not, there will be nothing on the pipe 394 * to receive. Ideally, we should let TCP notify us when there is data on the pipe (async I/O on Unix and 395 * VMS). But, we are not there yet. Since we do a select() before a recv(), we won't block if there is 396 * nothing in the pipe. So, it shouldn't be an expensive operation even if done before every send. Also, 397 * in doing so, we react to an XOFF sooner than later. 398 */ --> 399 /* Make sure we don't sleep for an extended period of time if there is something to be sent across */ --> 400 assert((GTMSOURCE_SENDING_JNLRECS != gtmsource_state) 401 || ((0 == poll_time) || (GTMSOURCE_IDLE_POLL_WAIT == poll_time)) 402 GTMTLS_ONLY(DEBUG_ONLY(|| renegotiation_pending))); 403 if (GTMSOURCE_CHANGING_MODE == gtmsource_state) 404 return; 405 if (GTMSOURCE_WAITING_FOR_CONNECTION == gtmsource_state) 406 return; 407 # ifdef GTM_TLS 408 if (repl_tls.enabled && (REPLTLS_WAITING_FOR_RENEG_TIMEOUT == repl_tls.renegotiate_state) && next_renegotiate_hrtbt) 409 { /* Time to renegotiate the TLS/SSL parameters. */ 410 heartbeat_stalled = TRUE; /* Defer heartbeats until renegotiation is done. */ 411 DEBUG_ONLY(renegotiation_pending = TRUE); 412 /* Send REPL_RENEG_ACK_ME message to the receiver. */ 413 renegotiate_msg.type = REPL_RENEG_ACK_ME; 414 renegotiate_msg.len = MIN_REPL_MSGLEN; 415 gtmsource_repl_send((repl_msg_ptr_t)&renegotiate_msg, "REPL_RENEG_ACK_ME", 416 MAX_SEQNO, INVALID_SUPPL_STRM); 417 if (GTMSOURCE_CHANGING_MODE == gtmsource_state) 418 return; 419 if (GTMSOURCE_WAITING_FOR_CONNECTION == gtmsource_state) 420 return; 421 /* We now have to wait for REPL_RENEG_ACK from the receiver. Until then we defer sending journal 422 * records to the other side. This way, we don't end up having outbound data in the TCP/IP pipe 423 * during the time of renegotiation. TLS/SSL protocol doesn't handle application data when it is 424 * in the middle of renegotiation. Similarly, the receiver on receipt of the REPL_RENEG_ACK_ME 425 * message will defer sending any more messages to us until the renegotiation is completed. 426 */ 427 repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_ACK; 428 repl_log(gtmsource_log_fp, TRUE, TRUE, "Waiting for REPL_RENEG_ACK\n"); 429 poll_time = REPL_POLL_WAIT; /* because we are waiting for a REPL_RENEG_ACK */ 430 } 431 # endif 432 recv_msgp = &recv_msg; --> 433 REPL_RECV_LOOP(gtmsource_sock_fd, recv_msgp, MIN_REPL_MSGLEN, poll_time) ``` * But in the failure case, as the gdb analysis indicates, all variables in the `if` check at line 408 above evaluated to TRUE. That means `renegotiation_pending` would be set to `TRUE` at line 411 and so the assert at line 400 would have been TRUE if only we had delayed it a bit. Fix --- * The fix is simple and is to move lines 399-402 to AFTER line 431. That is, it just fixes the placement of this longstanding assert to address this timing scenario where a TLS renegotiation is pending. Notes ----- * The reason this started failing only after GT.M V7.0-001 code changes were merged is because in that version, lines 564-573 were added below which caused `next_renegotiate_hrtbt` to be set to TRUE. This variable was never set to TRUE in the prior version and so the `if` check at line 408 above never evaluated to TRUE. **sr_unix/gtmsource_process.c** ```c 551 case REPL_HEARTBEAT: 552 if (msg_is_cross_endian) 553 { 554 heartbeat_msg = (repl_heartbeat_msg_ptr_t)recv_msgp; 555 tmp_seqno = *(seq_num *)&heartbeat_msg->ack_seqno[0]; 556 tmp_seqno = GTM_BYTESWAP_64(tmp_seqno); 557 *(seq_num *)&heartbeat_msg->ack_seqno[0] = tmp_seqno; 558 tmp_time4 = *(gtm_time4_t *)&heartbeat_msg->ack_time[0]; 559 tmp_time4 = GTM_BYTESWAP_32(tmp_time4); 560 *(gtm_time4_t *)&heartbeat_msg->ack_time[0] = tmp_time4; 561 } 562 if ((!gtmsource_is_heartbeat_stalled) && (SHUTDOWN != gtmsource_local->shutdown)) 563 gtmsource_process_heartbeat((repl_heartbeat_msg_ptr_t)recv_msgp); 564 # ifdef GTM_TLS 565 hrtbt_cnt++; 566 if ((0 < renegotiate_factor) && (0 == (hrtbt_cnt % renegotiate_factor)) 567 && (SHUTDOWN != gtmsource_local->shutdown)) 568 { 569 next_renegotiate_hrtbt = TRUE; 570 repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT; 571 poll_time = REPL_POLL_WAIT; 572 } 573 # endif 574 break; ``` * This failure was fairly reproducible on the AARCH64 systems (not on the x86_64 systems) with the following options passed to `com/gtmtest.csh`. The test failed at least once in 10 runs and likely more the slower the system is. ``` -t merge -replic -reorg -st tp_stress -env gtm_test_tls=TRUE -env gtm_test_tls_renegotiate=1 -num_runs 10 ```

@shabiel

…terrupts cause %YDB-E-STACKCRIT Background ---------- * This is an issue identified by @shabiel while trying to reproduce some other issue. The next bullet is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1029#description. * To reproduce this, start an M process in direct mode while it is at the `YDB>` prompt. And in another terminal send it a lot of repeated job interrupts (i.e. `mupip intrpt`), and then go back to the M process and type anything and press enter. Eventually, the process will crash with this: ```c YDB>%YDB-E-STACKCRIT, Stack space critical %YDB-F-ASSERT, Assert failed in sr_port/mdb_condition_handler.c line 1184 for expression (!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL)) ``` * One needs to send repeated jobinterrupts to the same process. One easy way to do this is in tcsh using the following command from a different terminal. The below issues 10,000 interrupts to pid 52545. ``` repeat 10000 $ydb_dist/mupip intrpt 52545 ``` * One would see the assert show up in the original terminal. Issue ----- Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1029#note_1581268407 * The debugger shows it is a `STACKCRIT` error that in turn triggers an assert failure. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140478350915392) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140478350915392) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140478350915392, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7ffe608dc2d0) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 mdb_condition_handler (arg=150373738) at sr_port/mdb_condition_handler.c:1184 #9 rts_error_va (csa=0x0, argcnt=1, var=0x7ffe608dc7a0) at sr_unix/rts_error.c:198 #10 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99 #11 jobintrpt_ztime_process (ztime=0) at sr_port/jobintrpt_ztime_process.c:84 #12 trans_code () at sr_port/trans_code.c:193 #13 stkok2 () at sr_x86_64/mum_tstart.s:37 (gdb) f 11 #11 jobintrpt_ztime_process (ztime=0) at sr_port/jobintrpt_ztime_process.c:84 84 PUSH_MV_STENT(MVST_ZINTR); /* MVST_ZTIMEOUT is identical to MVST_ZINTR with a flag to differentiate in debugging */ (gdb) f 8 #8 mdb_condition_handler (arg=150373738) at sr_port/mdb_condition_handler.c:1184 1184 assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL)); (gdb) list 1182 if (!(SFT_ZINTR & proc_act_type) && !(SFT_ZTIMEOUT & proc_act_type)) /* ztimeout vector precompiled */ 1183 { 1184 assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL)); 1185 trans_code_cleanup(); ``` * I verified that with a Release/PRO build, there is no issue. ```c YDB>%YDB-E-STACKCRIT, Stack space critical %YDB-E-ERRWZINTR, Error while processing $ZINTERRUPT %YDB-E-ZINTDIRECT, Attempt to enter direct mode from $ZINTERRUPT At M source location +1^GTM$DMOD ``` * The issue is in line 1184 which needs to take a STACKCRIT error into account in the assert. Fix --- * The following change should fix the failure in my understanding and is implemented in this commit. ```diff < assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL)); > assert(!dollar_zininterrupt || ((int)ERR_ZINTRECURSEIO == SIGNAL) || ((int)ERR_STACKCRIT == SIGNAL)); ``` * And it did fix the failure when I manually tested with the above change.

… by multiple threads Background ---------- * The `simplethreadapi/ydb550` subtest in the YDBTest project failed in a very rare test run with the following symptom. ```diff 7a8,10 > %YDB-F-KILLBYSIGSINFO1, YottaDB process 30596 has been killed by a signal 11 at address 0x00007F75FAE7A6A6 (vaddr 0x00000000000010C8) > %YDB-F-SIGMAPERR, Signal was caused by an address not mapped to an object > Quit (core dumped) ``` * Below is the C-stack of the SIG-11. ```c Thread 1 (Thread 0x7f46d05ff6c0 (LWP 29944)): #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=<optimized out>, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:91 #5 generic_signal_handler (sig=11, info=0x7f46d1749e18 <stapi_signal_handler_oscontext+3320>, context=0x7f46d1749e98 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:494 #6 ydb_os_signal_handler (sig=11, info=0x7f46d05fe530, context=0x7f46d05fe400) at sr_unix/ydb_os_signal_handler.c:85 #7 <signal handler called> #8 ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x562be5c2f2a3 <callback1>, tpfnparm=0x7f46d05feea0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:65 #9 start_thread (args=0x7ffd66c04a80) at simplethreadapi/inref/ydb550.c:116 #10 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444 #11 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) f 8 #8 ydb_tp_st (tptoken=0, errstr=0x0, tpfn=0x562be5c2f2a3 <callback1>, tpfnparm=0x7f46d05feea0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:65 65 assert(LYDB_RTN_NONE == TREF(libyottadb_active_rtn)); (gdb) p lcl_gtm_threadgbl $1 = (void *) 0x0 (gdb) p gtm_threadgbl_true $2 = (gtm_threadgbl_true_t *) 0x7f46b4001b80 ``` Issue ----- * The SIG-11 is because `TREF(libyottadb_active_rtn)` tries to dereference the local copy of the global variable `gtm_threadgbl_true` which is stored in `lcl_gtm_threadgbl` but that is NULL. * The reason it is NULL is because of a longstanding timing issue in `ydb_tp_st.c` as well as potentially other functions in the Simple Thread API. * In the failing test, multiple threads are created in `YDBTest/simplethreadapi/inref/ydb550.c` and each thread does a call to `ydb_tp_st()` as the first call (it does not do a `ydb_init()`). * The `SETUP_THREADGBL_ACCESS` at line 43 takes a copy of `gtm_threadgbl_true` into `lcl_gtm_threadgbl`. And then invokes the `LIBYOTTADB_RUNTIME_CHECK` macro at line 44. **sr_unix/ydb_tp_st.c** ```c 41 DCL_THREADGBL_ACCESS; 42 43 SETUP_THREADGBL_ACCESS; 44 LIBYOTTADB_RUNTIME_CHECK((int), errstr); ``` * But since `ydb_init()` has not been called yet, `lcl_gtm_threadgbl` would be NULL after line 43. * Now inside the `LIBYOTTADB_RUNTIME_CHECK` macro, it is possible for 2 threads T1 and T2 to reach line 108 below at the same time with `lcl_gtm_threadgbl` set to NULL in the `ydb_tp_st()` calls of both threads. **sr_unix/libyottadb_int.h** ```c 96 #define LIBYOTTADB_RUNTIME_CHECK(RETTYPE, ERRSTR) \ 97 MBSTART { \ . . 107 /* No threadgbl usage in this macro until the following block completes */ \ 108 if (!ydb_init_complete) \ 109 { /* Have to initialize things before we can establish an error handler */ \ 110 assert(!USING_ALTERNATE_SIGHANDLING); /* Go uses ydb_main_lang_init() instead */ \ 111 if (0 != (status = ydb_init())) /* Note - sets fgncal_stack */ \ 112 { \ 113 SET_STAPI_ERRSTR_MULTI_THREAD_SAFE(-status, (ydb_buffer_t *)ERRSTR); \ 114 return RETTYPE -status; \ 115 } \ 116 /* Since we called "ydb_init" above, "gtm_threadgbl" would have been set to a non-null VALUE \ 117 * and so any call to SETUP_THREADGBL_ACCESS done by the function that called this macro \ 118 * needs to be redone to set "lcl_gtm_threadgbl" to point to this new "gtm_threadgbl". \ 119 */ \ 120 SETUP_THREADGBL_ACCESS; \ 121 } \ 122 } MBEND ``` * Let us say one thread T1 finds `ydb_init_complete` as 0 and goes into the `if` block at line 108 and does the `ydb_init()` call and later does the `SETUP_THREADGBL_ACCESS` call at line 120. This means that T1 would have updated `lcl_gtm_threadgbl` to a non-NULL value at line 120. * Let us say the other thread T2 finds `ydb_init_complete` as non-zero at line 108 (because in the mean time thread T1 had done the `ydb_init()` call at line 111). This means it would skip the `SETUP_THREADGBL_ACCESS` call at line 120 as well. And so it would continue to have a NULL value of `lcl_gtm_threadgbl`. When the macro returns from T2, this NULL value would cause the SIG-11 in a `TREF` access that is there in later portions of `ydb_tp_st.c`. * Fortunately, all usages of `TREF` that I see in the normal code path are in debug-only code in `ydb_tp_st.c` (i.e. in asserts) and so this does not look like a Release build issue. Fix --- * The fix is to move the `SETUP_THREADGBL_ACCESS` call at line 120 from inside the `if` at line 108 above to AFTER the `if` at line 108. That way both threads T1 and T2 will call `SETUP_THREADGBL_ACCESS`. * While doing this, I noticed that various callers of the `LIBYOTTADB_RUNTIME_CHECK` macro currently do a `SETUP_THREADGBL_ACCESS` call already and those calls can be removed now that it is anyways done inside the `LIBYOTTADB_RUNTIME_CHECK` macro. This meant changes to various callers all of which were files implementing the Simple Thread API functions (i.e. ydb_*_t.c OR ydb_*_st.c OR `ydb_sig_dispatch()` which was invoked by the YDBGo wrapper also a Simple Thread API application). * Additionally, a similar move of the `SETUP_THREADGBL_ACCESS` call was done inside the macro `LIBYOTTADB_RUNTIME_CHECK_NORETVAL` which is very similar to the `LIBYOTTADB_RUNTIME_CHECK` macro.

…ct block allocations in previous bitmap Background ---------- * The `reorg_5/on_ntp_njnl_reorg` subtest failed with the following symptom. ```diff 3c3,5 < PASS from on_ntp_njnl_reorg --- > FAIL from on_ntp_njnl_reorg # Subtest stopped as it has too many core files (Threshold = 10 ; Actual = 13) > Killed ``` * There were a lot of assert failures and core files. The primary failure was an assert failures in a `mupip reorg -fill=39 -index=95 -truncate` process. Line 1552 below is the primary assert failure (in `bml_status_check.c`). **reorg_5_7/online_reorg_31388.outx.4** ``` 1 # Fri 10 Nov 2023 06:16:09 PM EST : cnt = 4 ; ff = 39 ; inff = 95 2 # Fri 10 Nov 2023 06:16:09 PM EST : nice +19 mupip reorg -fill=39 -index=95 -truncate . 1544 Global: zyxwvu95 (region DEFAULT) 1545 Blocks processed : 16 1546 Blocks coalesced : 3 1547 Blocks split : 9 1548 Blocks swapped : 16 1549 Blocks freed : 0 1550 Blocks reused : 9 1551 Blocks extended : 0 1552 %YDB-F-ASSERT, Assert failed in sr_port/bml_status_check.c line 72 for expression ((gds_t_acquired != cs->mode) || (BLK_BUSY != bml_status)) 1553 %YDB-F-ASSERT, Assert failed in sr_port/sec_shr_map_build.c line 62 for expression (bitnum > prev_bitnum) 1554 %YDB-F-ASSERT, Assert failed in sr_port/wcs_recover.c line 173 for expression ((!dollar_tlevel && !cr_array_index) || (dollar_tlevel && (!si->cr_array_index || (NULL != si->kip_csa)))) 1555 %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown ``` Issue ----- * The assert failure in `bml_status_check.c` indicates that we found a block that was allocated in this transaction (in `bm_getfree.c`) but the local bitmap for that allocated block says the block is already marked `BUSY`. This is an out-of-design situation as only blocks marked `FREE` or `RECYCLED` in the local bitmap should be allocated. * The C-stack and relevant variables from the core files using gdb are captured below. ```c (gdb) where #0 pthread_kill () from /lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fff32e6efc0) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 bml_status_check (cs=0x7f9e639cc4e0 <cw_set>) at sr_port/bml_status_check.c:72 #7 t_end (hist1=0x2150840, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1637 #8 mu_swap_root (gl_ptr=0x2151540, root_swap_statistic_ptr=0x7fff32e727ac, upg_mv_block=0) at sr_unix/mu_swap_root.c:233 #9 mupip_reorg () at sr_port/mupip_reorg.c:399 #10 mupip_main (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78) at sr_unix/mupip_main.c:117 #11 dlopen_libyottadb (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78, main_func=0x4014a4 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #12 main (argc=5, argv=0x7fff32e84d48, envp=0x7fff32e84d78) at sr_unix/mupip.c:22 (gdb) p/x ydb_skip_bml_num $4 = 0x320dd1200 (gdb) p/x cw_set[0].blk $33 = 0x320dd1bff (gdb) p/x cw_set[1].blk $35 = 0x1 (gdb) p/x cw_set[2].blk $34 = 0x320dd1c00 (gdb) p cw_set[0].mode $37 = gds_t_acquired (gdb) p cw_set[1].mode $39 = gds_t_write (gdb) p cw_set[2].mode $38 = gds_t_writemap ``` * The debug-only `ydb_test_4g_db_blks` env var is enabled in this test run as `ydb_skip_bml_num` is set to a non-zero value (seen above). * This means that after the local bitmap `0`, the next local bitmap that is used would be `0x320dd1200`. All blocks from `0x200` till `0x320dd11ff` will not be used by the database logic (i.e. there will be a huge hole in the database file) for block allocation. * The `cw_set[0].blk` indicates the allocated block number is `0x320dd1bff`. But `cw_set[2].blk` indicates this allocation happened in the local bitmap block `0x320dd1c00`. * Notice that the allocated block number actually is 1 less than the bitmap block number. * This means that the allocated block was actually in the `previous` local bitmap. This is an out-of-design situation. * The allocation happened in `bm_getfree.c` but the point of assert failure is much later. Changes ------- * There is logic in `bm_getfree.c` that finds a free bit in the local bitmap and stores it in a variable `free_bit`. In the failure case above, `free_bit` must have ended up with a value of `-1` (which also happens to be the value of the `NO_FREE_SPACE` macro) to result in the failure. * It is not clear to me how this happened but I suspect this was possible only because of interactions with the debug-only `ydb_skip_bml_num` scheme. * That is, my suspicion is that it is some issue in the `ydb_skip_bml_num` scheme. * Towards better understanding how `free_bit` ended up with the `-1` value, this commit adds an assert that `free_bit` is never negative. * In the test failure case above, we would have seen this new assert fail at a much earlier point thereby giving us a better core file to analyze. * This commit enables us to better analyze such failures if/when they happen in the future (the cause is still unknown).

…ert failure) Background ---------- * Below is a first-time failure, when running the `r126/ydb464` subtest (from the YDBTest project), that I noticed while trying to reproduce some other failure. ```diff --- ydb464/ydb464.diff --- 19a20,73 > r126_0_31/ydb464/simpleapi2/child98118.log > %YDB-F-ASSERT, Assert failed in sr_port/insert_region.c line 110 for expression ((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit)) ``` * The C-stack and relevant variables from the core file are pasted below. ```c (gdb) where #0 pthread_kill () from /usr/lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7ffee07f7480) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110 #7 mlk_unlock (p=0x1591940) at sr_port/mlk_unlock.c:70 #8 tp_unwind (newlevel=0, invocation_type=ROLLBACK_INVOCATION, tprestart_rc=0x0) at sr_port/tp_unwind.c:294 #9 op_trollback (rb_levels=0) at sr_port/op_trollback.c:200 #10 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:569 #11 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:230 #12 signal_exit_handler (exit_handler_name=0x7ff4913b071e "deferred_exit_handler", sig=2, info=0x7ff491795458 <stapi_signal_handler_oscontext+3224>, context=0x7ff4917954d8 <stapi_signal_handler_oscontext+3352>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78 #13 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #14 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74 #15 rel_crit (reg=0x14d0170) at sr_unix/rel_crit.c:81 #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120 #17 op_lock2_common (timeout=0, laflag=64 '@') at sr_port/op_lock2.c:242 #18 op_incrlock_common (timeout=0) at sr_port/op_incrlock.c:49 #19 ydb_lock_incr_s (timeout_nsec=0, varname=0x7ffee07f8c30, subs_used=0, subsarray=0x0) at sr_unix/ydb_lock_incr_s.c:91 #20 runProc (settings=0x7ffee07fab80, curDepth=1) at simpleapi/inref/randomWalk.c:489 #21 tpHelper (tpfnparm=0x7ffee07fa100) at simpleapi/inref/randomWalk.c:691 #22 ydb_tp_s_common (lydbrtn=LYDB_RTN_TP, tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s_common.c:256 #23 ydb_tp_s (tpfn=0x4037c2 <tpHelper>, tpfnparm=0x7ffee07fa100, transid=0x4041f9 "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:38 #24 runProc (settings=0x7ffee07fab80, curDepth=0) at simpleapi/inref/randomWalk.c:666 #25 runProc_driver (settings=0x7ffee07fab80) at simpleapi/inref/randomWalk.c:145 #26 main () at simpleapi/inref/randomWalk.c:93 (gdb) f 6 #6 insert_region (reg=0x14d0170, reg_list=0x7ff49179f158 <tp_reg_list>, reg_free_list=0x7ff49179f078 <tp_reg_free_list>, size=40) at sr_port/insert_region.c:110 110 assert((CDB_STAGNATE > t_tries) || (dollar_tlevel && csa->now_crit)); (gdb) p process_exiting $3 = 1 (gdb) p t_tries $4 = 3 (gdb) p dollar_tlevel $5 = 1 (gdb) p csa->now_crit $6 = 0 (gdb) up #16 mlk_lock (p=0x1591940, auxown=0, new=1) at sr_port/mlk_lock.c:120 120 TPNOTACID_CHECK(LOCKGCINTP); ``` Issue ----- * The assert that failed in `insert_region()` (frame 6 in above stack trace) indicates that we were in the final retry (i.e. `t_tries` is equal to `3` or `CDB_STAGNATE`) but we did not hold crit on the current region where we are trying to do an `mlk_unlock()` operation. * The assert is valid and did expose an issue. * In frame 16, in `mlk_lock()`, we did a `rel_crit()` call in the `TPNOTACID_CHECK` macro while in the final retry. **sr_port/mlk_lock.c** ```c 120 TPNOTACID_CHECK(LOCKGCINTP); ``` * Below is the code inside the macro. **sr_port/tp.h** ```c 979 #define TPNOTACID_CHECK(CALLER_STR) \ 980 { \ 981 GBLREF boolean_t mupip_jnl_recover; \ 982 mval zpos; \ 983 \ 984 if (IS_TP_AND_FINAL_RETRY) \ 985 { \ -> 986 TP_REL_CRIT_ALL_REG; \ 987 assert(!mupip_jnl_recover); \ 988 TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK; \ ``` * Line 986 is where the issue is. We do a `rel_crit()` call there but `t_tries` is still not decremented. The decrement of `t_tries` happens 2 lines later at line 988. * Before doing the `rel_crit()` call, we need to decrement `t_tries`. This way, in case `rel_crit()` decides to invoke exit handling due to handling a deferred SIGINT signal (sent in the `ydb464` subtest), the assert in `insert_region()` would not be confused by seeing this out-of-design state and will not attempt to invoke `t_retry()` etc. which is a no-no as we should not transfer control to M code as part of a TP restart while the process is about to terminate on receipt of a SIGINT signal. Fix --- * Notice that in `sr_port/t_commit_cleanup.c`, the `t_tries` decrement happens BEFORE the `rel_crit()` call. **sr_port/t_commit_cleanup.c** ```c 288 if (CDB_STAGNATE <= t_tries) 289 TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK; /* t_tries untouched for rollback and recover */ . . 303 if (!csa->hold_onto_crit && csa->now_crit) 304 rel_crit(tr->reg); /* Undo Step (CMT01) */ ``` * In a similar fashion, in the `TPNOTACID_CHECK` macro in `sr_port/tp.h`, the `TP_REL_CRIT_ALL_REG` call should happen AFTER the `TP_FINAL_RETRY_DECREMENT_T_TRIES_IF_OK` call. And that is the fix. * While doing this fix, I noticed a similar ordering issue in `sr_port/gvcst_init.c` and so fixed that too. Notes ----- * While this failure happened with a Debug build of YottaDB, I suspect there is an issue in the Release build of YottaDB too. But not sure exactly what the user-visible implications are. Even if so, it is likely to be not encountered in practice and so no user-visible issue is created for this.

…_port/deferred_events.c Background ---------- * The `v61000/intrpt_wcs_wtstart` subtest (in the YDBTest project) failed a few rare occasions during internal testing with the following symptom. ```diff 12a13,299 > v61000_0_22/intrpt_wcs_wtstart/mumps-wb.out > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events.c line 114 for expression (no_event == outofband || (event_type == outofband)) ``` Issue ----- * The stack trace and relevant details from the gdb core analysis are pasted below. ```c (gdb) where #0 pthread_kill () from /lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7ffcc56fd8c0) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114 #7 jobinterrupt_event (sig=10, info=0x7fb372b8a518 <stapi_signal_handler_oscontext+5528>, context=0x7fb372b8a598 <stapi_signal_handler_oscontext+5656>) at sr_port/jobinterrupt_event.c:61 #8 <signal handler called> #9 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 #10 m_usleep (useconds=10000) at sr_unix/sleep.c:37 #11 wcs_sleep (sleepfactor=6310) at sr_port/wcs_sleep.c:28 #12 wcs_flu (options=519) at sr_unix/wcs_flu.c:571 #13 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:632 #14 gv_rundown () at sr_port/gv_rundown.c:122 #15 gtm_exit_handler () at sr_unix/gtm_exit_handler.c:233 #16 signal_exit_handler (exit_handler_name=0x7fb372a19ecf "generic_signal_handler", sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=0) at sr_unix/signal_exit_handler.c:78 #17 generic_signal_handler (sig=15, info=0x7fb372b89c78 <stapi_signal_handler_oscontext+3320>, context=0x7fb372b89cf8 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:502 #18 ydb_os_signal_handler (sig=15, info=0x7ffcc56ffd30, context=0x7ffcc56ffc00) at sr_unix/ydb_os_signal_handler.c:88 #19 <signal handler called> #20 clock_nanosleep@GLIBC_2.2.5 () from /lib64/libc.so.6 #21 m_usleep (useconds=999000) at sr_unix/sleep.c:37 #22 wcs_wtstart (region=0xc30970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:216 #23 wcs_timer_start (reg=0xc30970, io_ok=1) at sr_port/t_end_sysops.c:1346 #24 t_end (hist1=0xcfe798, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1848 #25 gvcst_put2 (val=0xc928b8, parms=0x7ffcc5709a80) at sr_port/gvcst_put.c:2796 #26 gvcst_put (val=0xc928b8) at sr_port/gvcst_put.c:302 #27 op_gvput (var=0xc928b8) at sr_port/op_gvput.c:79 (gdb) f 6 #6 xfer_set_handlers (event_type=3, param_val=10, popped_entry=0) at sr_port/deferred_events.c:114 114 assert(no_event == outofband || (event_type == outofband)); (gdb) p (enum outofbands)no_event $2 = no_event (gdb) p (enum outofbands)outofband $1 = deferred_signal (gdb) p (enum outofbands)event_type $3 = jobinterrupt ``` * The test sends a SIGTERM (i.e. SIG-15) signal. This caused `outofband` variable to be set to `deferred_signal` in frame 17 above (`generic_signal_handler.c` inside the `SET_FORCED_EXIT_STATE` macro). * And then the process was sleeping (due to a white-box test case in the test). * At that point, it was holding crit and another process was waiting for this and so was about to send a `MUTEXLCKALERT` message. At this point, since the test framework had set the `gtm_procstuckexec` env var to `com/gtmprocstuck_get_stack_trace.csh`, that was invoked and it in turn invoked `^%YDBPROCSTUCKEXEC` which in turn sent a `SIGUSR1` signal (i.e. a `mupip intrpt`) to this very same process that was sleeping while holding crit. * And at this point, the process got the assert failure because the `outofband` variable indicated that a `SIG-15` signal needs to be handled whereas the `event_type` variable indicated that the current out of band event is a `jobinterrupt` event. Fix --- * This seems like a valid scenario and I suspect the assert is invalid. * I noticed that this very same assert has been removed in a later GT.M release V7.1-001. ```diff $ cd YDB $ git show tags/V7.1-001 sr_port/deferred_events.c | head -35 | tail -8 @@ -127,7 +127,6 @@ boolean_t xfer_set_handlers(int4 event_type, int4 param_val, boolean_t popped_e } if (!already_ev_handling) { - assert(no_event == outofband || (event_type == outofband)); assert(!dollar_zininterrupt || (jobinterrupt != event_type)); if (entry != (TREF(save_xfer_root_ptr))->ev_que.fl) { /* no event in play so pend this one by jiggeriing the xfer_table */ ``` * I assume GT.M noticed a similar issue but not while releasing V7.0-001 (which is what YottaDB master currently has merged) but when releasing a much later V7.1-001 version and fixed it then. * Therefore, I am removing the assert that failed. * This should let the `v61000/intrpt_wcs_wtstart` test run fine until GT.M V7.1-001 gets merged into the YottaDB master branch.

…Simple Thread API application Background ---------- * The `r126/ydb464` subtest failed in one rare run with the following failure symptom. ```diff > %YDB-F-ASSERT, Assert failed in sr_port/deferred_events_queue.c line 48 for expression (INTRPT_IN_EVENT_HANDLING == intrpt_ok_state) ``` * When this specific test was rerun around 10000 times, we saw around a dozen failures (with differing assert failures but all pointing to the same underlying issue) so this failure was reproducible but not easily. Issue ----- * Relevant details from the core file analysis is pasted below. ```c (gdb) thread apply all bt Thread 6 (Thread 0x7fa62a6c0640 (LWP 99885)): . #6 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #7 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:37 #8 xfer_set_handlers (event_type=11, param_val=939582496, popped_entry=0) at sr_port/deferred_events.c:190 #9 generic_signal_handler (sig=2, info=0x7fa63a1b6fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fa63a1b7058 <stapi_signal_handler_oscontext+3448>, is_os_signal_handler=1) at sr_unix/generic_signal_handler.c:305 #10 ydb_os_signal_handler (sig=2, info=0x7fa625096bf0, context=0x7fa625096ac0) at sr_unix/ydb_os_signal_handler.c:88 #11 <signal handler called> #12 __pthread_create_2_1 (newthread=<optimized out>, attr=<optimized out>, start_routine=<optimized out>, arg=<optimized out>) at ./nptl/pthread_create.c:835 #13 pthread_create () #14 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=6) at simplethreadapi/inref/randomWalk.c:662 #15 threadHelper (args=0x7fa62a6ba880) at simplethreadapi/inref/randomWalk.c:723 #16 tpHelper (tptoken=..., errstr=0x7fa62a6ba850, tpfnparm=0x7fa62a6ba880) at simplethreadapi/inref/randomWalk.c:712 #17 ydb_tp_st (tptoken=..., errstr=0x7fa62a6ba850, tpfn=0x55fa406e2d20 <tpHelper>, tpfnparm=0x7fa62a6ba880, transid=0x55fa406f69ea "BATCH", namecount=0, varnames=0x0) at sr_unix/ydb_tp_st.c:100 #18 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=5) at simplethreadapi/inref/randomWalk.c:642 #19 threadHelper (args=0x7fa62a6bb7e0) at simplethreadapi/inref/randomWalk.c:723 . #41 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 Thread 1 (Thread 0x7fa61ef2e640 (LWP 7158)): . #6 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #7 xfer_reset_handlers (event_type=11) at sr_port/deferred_events.c:235 #8 outofband_clear () at sr_port/outofband_clear.c:41 #9 outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:55 #10 ydb_zwr2str_s (zwr=0x7fa61ef2d550, str=0x7fa61ef2d560) at sr_unix/ydb_zwr2str_s.c:55 #11 ydb_zwr2str_st (tptoken=..., errstr=0x7fa61ef2d530, zwr=0x7fa61ef2d550, str=0x7fa61ef2d560) at sr_unix/ydb_zwr2str_st.c:40 #12 runProc (tptoken=..., errstr=0x0, settings=..., curDepth=7) at simplethreadapi/inref/randomWalk.c:545 #13 threadHelper (args=0x7fa62a6b9940) at simplethreadapi/inref/randomWalk.c:723 #14 start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442 #15 clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81 (gdb) p dollar_tlevel $4 = 6 (gdb) p/x ydb_engine_threadsafe_mutex_holder[0] $14 = 0x7fa62a6c0640 (gdb) p/x ydb_engine_threadsafe_mutex_holder[1] $15 = 0x7fa62a6c0640 (gdb) p/x ydb_engine_threadsafe_mutex_holder[2] $16 = 0x7fa62a6c0640 (gdb) p/x ydb_engine_threadsafe_mutex_holder[3] $17 = 0x7fa62a6c0640 (gdb) p/x ydb_engine_threadsafe_mutex_holder[4] $18 = 0x7fa62a6c0640 (gdb) p/x ydb_engine_threadsafe_mutex_holder[5] $19 = 0x7fa62a6c0640 (gdb) p/x ydb_engine_threadsafe_mutex_holder[6] $20 = 0x7fa61ef2e640 ``` * This is a case when signals are sent (SIGINT aka SIG-2 in this case) to a Simple Thread API process and one thread (`Thread 1` below) is running under the YottaDB engine lock already but the signal gets delivered to another thread (`Thread 6` below) and that incorrectly starts executing the signal handler which in turn invokes `xfer_set_handlers()` etc.. And so at the same time, 2 threads are executing YottaDB engine/runtime code although only one holds the lock. This is a no-no since YottaDB runtime logic is not multi-thread safe. * From the above analysis, it is clear that the process was executing a TP transaction with `dollar_tlevel` equal to `6`. `Thread 6` had invoked `ydb_tp_st()` (in frame 17) which in turn invoked a callback function that created a new thread `Thread 1`. `Thread 6` held the YottaDB engine multi-thread lock for tlevels 0, 1, 2, 3, 4, 5. For tlevel 6, `Thread 1` held the YottaDB engine multi-thread lock. * But the `SIGINT` signal (sent by the test) got sent to `Thread 6`. Therefore, it should have realized, while in `generic_signal_handler()`, that `dollar_tlevel` is 6 and it does not own the tlevel=6 lock (`Thread 1` owns it) and therefore should have done a `return` at line 404 below. ```c 315 #define FORWARD_SIG_TO_MAIN_THREAD_IF_NEEDED(SIGHNDLRTYPE, SIG, IS_EXI_SIGNAL, INFO, CONTEXT) \ . 332 if (simpleThreadAPI_active) \ 333 { \ . 355 thisThreadId = pthread_self(); \ 356 assert(thisThreadId); \ 357 SET_YDB_ENGINE_MUTEX_HOLDER_THREAD_ID(mutexHolderThreadId, tLevel); \ . 374 thisThreadIsMutexHolder = pthread_equal(mutexHolderThreadId, thisThreadId); \ . 386 if (!thisThreadIsMutexHolder \ 387 || (!IS_EXI_SIGNAL && (tLevel && (!isSigThreadDirected || signalForwarded)))) \ 388 { /* Two possibilities. \ . -> 404 return; \ 405 } else \ ``` * But clearly that did not happen (from the core file). Therefore, `thisThreadIsMutexHolder` (set at line 374 above) should have been `TRUE`. * How that happened can be seen in line 286 below inside the macro (invoked from line 357 above). ```c 268 #define SET_YDB_ENGINE_MUTEX_HOLDER_THREAD_ID(HOLDER_THREAD_ID, TLEVEL) \ 269 { \ 270 GBLREF uint4 dollar_tlevel; \ 271 GBLREF pthread_t ydb_engine_threadsafe_mutex_holder[]; \ 272 \ 273 /* If not in TP, the YottaDB engine lock index is 0 (i.e. ydb_engine_threadsafe_mutex_holder[0] is \ 274 * current lock holder thread if it is non-zero). But if we are in TP, then lock index could be \ 275 * "dollar_tlevel" : e.g. if a "ydb_get_st" call occurs inside of the "ydb_tp_st" call OR \ 276 * "dollar_tlevel - 1" : if control is in the TP callback function inside "ydb_tp_st" but not a \ 277 * SimpleThreadAPI call like "ydb_get_st" etc. \ 278 */ \ 279 TLEVEL = dollar_tlevel; /* take a local copy of global variable as it could be concurrently changing */ \ 280 if (!TLEVEL) \ 281 HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[0]; \ 282 else \ 283 { \ 284 HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[TLEVEL]; \ 285 if (!HOLDER_THREAD_ID) \ 286 HOLDER_THREAD_ID = ydb_engine_threadsafe_mutex_holder[TLEVEL - 1]; \ 287 } \ 288 } ``` * Line 284 must have returned a value of 0 for `HOLDER_THREAD_ID` and so we went to line 286 and used the thread owner of tlevel=5 which was `Thread 6`. * In the core file, we see that tlevel=6 lock owned is `Thread 1`. But at the time line 284 got executed, `Thread 1` was not owning the lock. * That can be explained if `Thread 1` had not yet done the `ydb_zwr2str_st()` call when line 284 got executed. * The issue then is that when we found no one holding the tlevel=6 lock, we went to see who holds the tlevel=5 lock and returned that thread is as the current YottaDB engine multi-thread lock holder. * This is where the issue is. `Thread 1` even though it had not yet attempted to get the lock, owns the lock at this point since `Thread 6` has invoked the callback function and has no control of what calls the callback function can invoke (including creating new threads that in turn do Simple Thread API calls on their own like happened with `Thread 1`). * Treating `Thread 6` as owning the lock ended up with a situation where 2 threads think they each own the engine lock and run YottaDB code at the same time causing the assert failures. * This issue is long standing (started in 2afcbd2, which was committed 2019/03/25) but it manifests as assert failures only after the GT.M V7.0-001 code merge. That is because the deferred event queue handling got reworked in V7.0-001 making it possible for more logic to execute while in the signal handler thereby exposing this long standing issue. * Note that even then it has taken a few months of testing to show this one failure in a C program that invokes multiple threads. So it is really a rare issue. Fix --- * The fix is thankfully simple and is to remove lines 285-286 above. That is, check the lock holder for the tlevel which `dollar_tlevel` global currently points to. Do not go one before that if we find the top level not being held by any thread. * With this change, `Thread 6` will not incorrectly conclude it is the owner. This is because it will find that the owner of the YottaDB engine lock is no thread in this case and since that does not match its own thread id, it will `return` if it gets delivered the SIGINT (after noting down the fact that this signal handling was deferred) and the next thread that runs YottaDB runtime logic will notice this happened and handle the signal while it holds the engine lock. Notes ----- * Since this issue is very unlikely to be seen in practice (needs a Simple Thread API application that creates threads while inside a `ydb_tp_st()` call and also sends SIGINT signals), no YDB issue is created for this.

… damage Background ---------- Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/1041#description * The `reorg_5/on_ntp_njnl_reorg` subtest (from the YDBTest project) failed with the following symptom in a rare in-house test run. ``` %YDB-F-ASSERT, Assert failed in sr_port/bml_status_check.c line 72 for expression ((gds_t_acquired != cs->mode) || (BLK_BUSY != bml_status)) ``` * The assert failure indicates that we found a block that was allocated in this transaction (in `bm_getfree.c`) but the local bitmap for that allocated block says the block is already marked `BUSY`. This is an out-of-design situation as only blocks marked `FREE` or `RECYCLED` in the local bitmap should be allocated. * The cause of this failure was not ascertained when it first happened and so !1409 (d366c66) added an assert to try catch the failure early in `bm_getfree.c` (pre-commit) rather than much later in `bml_status_check.c` (in-commit). * But even after !1409 got merged, we encountered another rare test failure with the same symptom. The failure still happened in `bml_status_check.c` and not in the expected `bm_getfree.c`. * Therefore, the new failure was further analyzed and this time around the cause was finally identified. It turned out to be a `mupip reorg -truncate` issue that can result in assert failures in Debug builds and database damage in Release builds. * But for this subtle timing issue to happen, a database file extension needs to happen concurrently among many things. * Below is an example mupip integ report from the damaged database after the `mupip reorg -truncate` in a manual test that I did. ``` Integ of region DEFAULT Block:Offset Level %YDB-E-DBINCLVL, Nature: #DANGER*** 3FF:0 2 Block at incorrect level Directory Path: 1:10, 2:23 Path: 202:104, 3E6:17E, 3FF:0 Keys from ^y(493.1) to the end are suspect. %YDB-E-DBBDBALLOC, Nature: #DANGER*** 3FF:0 B Block doubly allocated Directory Path: 1:10, 2:36 Path: 3FF:0 Keys from ^z to the end are suspect. ``` Issue ----- * Below is the stack trace from the latest failure. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=<optimized out>, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7ffef29217e0) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 bml_status_check (cs=0x7fe6935dfa40 <cw_set>) at sr_port/bml_status_check.c:72 #9 t_end (hist1=0x62d0000a2840, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1637 #10 mu_swap_root (gl_ptr=0x62d0000a3fc0, root_swap_statistic_ptr=0x7ffef29256a0, upg_mv_block=0) at sr_unix/mu_swap_root.c:233 #11 mupip_reorg () at sr_port/mupip_reorg.c:399 #12 mupip_main (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178) at sr_unix/mupip_main.c:117 #13 dlopen_libyottadb (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178, main_func=0x55c02b55a020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #14 main (argc=5, argv=0x7ffef2938148, envp=0x7ffef2938178) at sr_unix/mupip.c:22 ``` * Notice `mu_swap_root()` in the stack. It was seen as a caller in all failures so far where the symptom was an assert failure in `bml_status_check.c`. * This specific function (that is called by `mupip reorg -truncate`) turned out to be where the issue is and not the general purpose function `bm_getfree.c` (like was suspected in d366c66). * This function does block allocation, just like `bm_getfree.c`. But there is a subtle difference between the two which is what led me to the issue. **sr_port/bm_getfree.c** ```c 306 free_bit = bm_find_blk((int4)offset, (sm_uc_ptr_t)bmp + SIZEOF(blk_hdr), map_size, blk_used); . 311 if (NO_FREE_SPACE != free_bit) 312 break; ``` **sr_unix/mu_swap_root.c** ```c 341 master_bit = bmm_find_free((hint_blk_num / BLKS_PER_LMAP), csa->bmm, num_local_maps); 342 if ((NO_FREE_SPACE == master_bit)) . 360 free_bit = bm_find_blk(hint_bit, bmlhist.buffaddr + SIZEOF(blk_hdr), maxbitsthismap, &free_blk_recycled); 361 free_blk_id = bmlhist.blk_num + free_bit; ``` * `bm_getfree.c` invokes `bm_find_blk()` to find a free block in the local bitmap and stores the result in `free_bit`. It then checks whether that indicates no free space (`NO_FREE_SPACE`) and if so it moves on to another local bitmap. * `mu_swap_root.c` does a call to `bm_find_blk()` as well to find a free block but does no `NO_FREE_SPACE` check. * This is where the issue is. * If `free_bit` turns out to be `NO_FREE_SPACE`, we will end up setting `free_blk_id` (line 361 above) as 1 block BEFORE the current local bitmap block. This would mean we would consider a block in the previous bitmap as the FREE block and use it for the swap operation. If that block happens to be already used in some other global variable tree, then we would end up with integrity errors once the incorrect swap operation completes. * In order to end up in this situation, what is needed is that we find the local bitmap has free space in the master bitmap (line 341 above) but when we look inside the local bitmap block we find that it has no free space. * I initially tried to see if I can concurrently have a process do a transaction that allocates a block and ends up marking this local bitmap as full in the master map (in `bm_update()`). But in that case, I noticed that the `mu_swap_root()` function ended up restarting the transaction because it noticed the local bitmap block contents changed concurrently (restart code `cdb_sc_bmlmod`). * After some trial and error, finally landed on the needed last link in this puzzle. And that is a database file extension which concurrently happens. * It is possible the last local bitmap in a database file could have its status marked as `Full` even though it is only a partial bitmap (because the total block count stops midway in the local bitmap). * When a database file extension happens on this partial last bitmap, it would mark this bitmap from `Full` status to having `Free space` (line 561 below). **sr_unix/gdsfilext.c** ```c 556 cs_addrs->ti->free_blocks += blocks; 557 cs_addrs->total_blks = cs_addrs->ti->total_blks = new_total; 558 blocks = old_total; 559 if (blocks / bplmap * bplmap != blocks) 560 { 561 bit_set(blocks / bplmap, MM_ADDR(cs_data)); /* Mark old last local map as having space */ ``` **sr_unix/mu_swap_root.c** ```c 339 total_blks = csa->ti->total_blks; . 341 master_bit = bmm_find_free((hint_blk_num / BLKS_PER_LMAP), csa->bmm, num_local_maps); 342 if ((NO_FREE_SPACE == master_bit)) . 360 free_bit = bm_find_blk(hint_bit, bmlhist.buffaddr + SIZEOF(blk_hdr), maxbitsthismap, &free_blk_recycled); 361 free_blk_id = bmlhist.blk_num + free_bit; ``` * Given that, the timing of the `mupip reorg -truncate` (in `mu_swap_root.c`) and `mupip extend` (in `gdsfilext.c`) processes has to such that line 339 above in `mu_swap_root.c` takes a note of `total_blks` BEFORE line 556 is reached by `gdsfilext.c` but line 341 in `mu_swap_root.c` should happen AFTER line 561 is done by `gdsfilext.c`. When this occurs, `mu_swap_root.c` will incorrectly proceed at line 360 with a `free_bit` value of `-1` which would then cause a block from the previous bitmap to be allocated and the commit validation logic would not restart the transaction in this situation ending up in database damage. Fix --- * The fix is simple and is to check the return value of `bm_find_blk()` (stored in `free_bit` variable) for `NO_FREE_SPACE` and if so restart the transaction.

…ess.c to fix test failure Background ---------- * In internal testing, we noticed a few rare test failures where the source server failed an assert. ```diff > %YDB-F-ASSERT, Assert failed in sr_unix/gtmsource_process.c line 1653 for expression (!renegotiation_pending) ``` Issue ----- * Pasted below are relevant details from the core file analysis. ```c (gdb) where #0 pthread_kill () from /usr/lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fff37d618e0) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 gtmsource_process () at sr_unix/gtmsource_process.c:1653 #7 gtmsource () at sr_unix/gtmsource.c:618 #8 mupip_main (argc=11, argv=0x7fff37d68508, envp=0x7fff37d68568) at sr_unix/mupip_main.c:117 #9 dlopen_libyottadb (argc=11, argv=0x7fff37d68508, envp=0x7fff37d68568, main_func=0x5270c0 <.str> "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #10 main (argc=11, argv=0x7fff37d68508, envp=0x7fff37d68568) at sr_unix/mupip.c:22 (gdb) f 6 #6 gtmsource_process () at sr_unix/gtmsource_process.c:1653 1653 GTMTLS_ONLY(assert(!renegotiation_pending)); (gdb) p repl_tls $1 = {id = "INSTANCE1", '\000' <repeats 22 times>, plaintext_fallback = 0, enabled = 1, notls_retry = 0, renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT, sock = 0x60700000a3c0} (gdb) down #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 99 return rts_error_va(csa, argcnt, var); (gdb) up #6 gtmsource_process () at sr_unix/gtmsource_process.c:1653 1653 GTMTLS_ONLY(assert(!renegotiation_pending)); (gdb) p hrtbt_cnt $2 = 8 (gdb) p renegotiate_factor $3 = 4 (gdb) p gtmsource_options.renegotiate_interval $4 = 60 (gdb) p gtmsource_local->connect_parms[GTMSOURCE_CONN_HEARTBEAT_PERIOD] $5 = 15 (gdb) p gtmsource_state $6 = GTMSOURCE_SENDING_JNLRECS (gdb) p poll_time $7 = 999 (gdb) p repl_tls.renegotiate_state $1 = REPLTLS_WAITING_FOR_RENEG_TIMEOUT (gdb) p next_renegotiate_hrtbt $1 = 1 ``` * This turned out to be an interesting failure. The underlying issue is that a TLS renegotiation was pending for more than a minute (due to the receiver not replying to the REPL_RENEG_ACK_ME message from the source server for more than a minute) so the next TLS renegotiation became pending since the renegotiate_interval was 1 minute. And that scenario was not handled correctly resulting in the assert failure. Fix --- * This turns out to be an issue that is fixed in GT.M V7.0-002. http://tinco.pair.com/bhaskar/gtm/doc/articles/GTM_V7.0-002_Release_Notes.html#GTM-DE197637428 _The Source Server defers a TLS renegotiation when a prior TLS renegotiation is pending. Previously, the Source Server logged the REPLWARN message, closed the replication connection, and attempted to re-establish the replication connection with the Receiver Server.(GTM-F135415)_ * I noticed the below code changes in GT.M V7.0-002 implements the above fix. ```diff $ cd YDB $ git show -U0 tags/V7.0-002 sr_unix/gtmsource_process.c . . @@ -568,0 +569 @@ void gtmsource_recv_ctl(void) + /* Check whether it is time for renegotiation */ @@ -572,3 +573,16 @@ void gtmsource_recv_ctl(void) - next_renegotiate_hrtbt = TRUE; - repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT; - poll_time = REPL_POLL_WAIT; + switch(repl_tls.renegotiate_state) + { + case REPLTLS_RENEG_STATE_NONE: + case REPLTLS_WAITING_FOR_RENEG_TIMEOUT: + next_renegotiate_hrtbt = TRUE; + repl_tls.renegotiate_state = REPLTLS_WAITING_FOR_RENEG_TIMEOUT; + poll_time = REPL_POLL_WAIT; + break; + case REPLTLS_WAITING_FOR_RENEG_ACK: + /* On slower systems, heartbeat responses may arrive late. + In such a case, defer renegotiation */ + break; + default: + assert(FALSE); + break; + } @@ -582 +596 @@ void gtmsource_recv_ctl(void) - poll_time = REPL_POLL_WAIT; /* because we are back to sending data */ + poll_time = REPL_POLL_NOWAIT; /* because we are back to sending data */ @@ -593 +607 @@ void gtmsource_recv_ctl(void) - MAX_SEQNO, INVALID_SUPPL_STRM); + MAX_SEQNO, INVALID_SUPPL_STRM); ``` * The `case REPLTLS_WAITING_FOR_RENEG_ACK:` block above is what implements the real fix. * Therefore, this commit cherry-picks the above GT.M change and merges it into the YottaDB code base. * I expect the test failure to no longer happen after this change.

… restart related) Background ---------- * While trying to come up with the `v70001/gtm9131` subtest (in the YDBTest project), I encountered an assert failure. Running that subtest with the below changes reproduces the assert. ```diff $ git diff -U1 v70001/inref/gtm9131.m diff --git a/v70001/inref/gtm9131.m b/v70001/inref/gtm9131.m index 14094d4c..92641556 100644 --- a/v70001/inref/gtm9131.m +++ b/v70001/inref/gtm9131.m @@ -23,3 +23,4 @@ gtm9131 ; set jobid=1 - zsystem:$trestart=0 "$gtm_dist/mumps -run job^gtm9131 "_njobs_" "_jobid + set $zcmdline=njobs_" "_jobid + do job tcommit ; Commit TP transaction where we expect a TP restart due to concurrent statsdb extension ``` * Below is what I saw in the subtest output (with the above change). ``` $ cat gtm9131.log . . # Execute [mumps -run gtm9131] which will create a TPRESTART message due to a statsdb database file extension restart %YDB-F-ASSERT, Assert failed in sr_port/tp_restart.c line 288 for expression (IS_STATSDB_REG(restart_reg) ? !memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN) : memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN)) ``` Issue ----- * The core file in that case had the following stack trace and interesting variables. ``` (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140359692803712) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140359692803712) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140359692803712, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fff9c3d9330) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 tp_restart (newlevel=1, handle_errors_internally=1) at sr_port/tp_restart.c:288 #9 mdb_condition_handler (arg=150376098) at sr_port/mdb_condition_handler.c:345 #10 rts_error_va (csa=0x0, argcnt=1, var=0x7fff9c3dc110) at sr_unix/rts_error.c:198 #11 rts_error_csa (csa=0x0, argcnt=1) at sr_unix/rts_error.c:99 #12 op_tcommit () at sr_port/op_tcommit.c:514 (gdb) f 8 #8 0x00007fa805089766 in tp_restart (newlevel=1, handle_errors_internally=1) at sr_port/tp_restart.c:288 288 assert(IS_STATSDB_REG(restart_reg) /* global ^%YGS if, and only if, statsDB */ (gdb) list 286 if (NULL != restart_reg) 287 { 288 assert(IS_STATSDB_REG(restart_reg) /* global ^%YGS if, and only if, statsDB */ 289 ? !memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN) 290 : memcmp(&gv_currkey->base, STATSDB_GBLNAME, STATSDB_GBLNAME_LEN)); 291 reg_mstr.len = restart_reg->dyn.addr->fname_len; 292 reg_mstr.addr = (char *)restart_reg->dyn.addr->fname; (gdb) x/s restart_reg->rname 0x6220000023e2: "default" (gdb) x/s gv_currkey->base 0x62d000005046: "%jobwait" ``` * The assert in lines 288-290 checks that if the region where the restart happened is a statsdb region, then the current `$reference` (i.e. `gv_currkey`) should point to `^%YGS`, the global name that maps to the statsdb region. And vice versa. * But in this case, the global name is `^%jobwait` which is not `^%YGS`. * This is a case where the statsdb region had a tp restart due to a statsdb file extension scenario and the restart status code was `cdb_sc_helpedout` (see first occurrence in sr_port/tp_tend.c). In this case, the restart did not occur as part of a global reference, but as part of TCOMMIT and so we are not guaranteed that the most recent global reference (`gv_currkey`) would point to a statsdb global name. Therefore, this assert is not necessarily correct. * Since restarts are possible in various scenarios other than as part of the current global reference, this assert could fail in other scenarios too. * This assert was introduced in GT.M V6.3-008. Fix --- * Not sure the assert serves much purpose so am removing this inaccurate assert in this commit.

…re=1 and DBNAMEMISMATCH Background ---------- * While testing with `ydb_statshare=1`, I noticed the following failure in the `v54003/dbnammismatch1` subtest when YottaDB was built with ASAN. ```sh $ cat dbnammismatch1.log . . mupip rundown -file backup.dat %YDB-I-DBNAMEMISMATCH, Database file backup.dat points to shared memory (id = 447152162) which in turn points to an inaccessible database file v54003_0/dbnammismatch1/mumps.dat ================================================================= ==74407==ERROR: AddressSanitizer: heap-use-after-free on address 0x625000006b44 at pc 0x7fe941b32e8d bp 0x7fffea717770 sp 0x7fffea717768 READ of size 4 at 0x625000006b44 thread T0 #0 mu_rndwn_file sr_unix/mu_rndwn_file.c:1484:3 #1 mupip_rundown sr_unix/mupip_rundown.c:215:8 #2 mupip_main sr_unix/mupip_main.c:117:4 #3 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11 #4 main sr_unix/mupip.c:22:9 0x625000006b44 is located 6724 bytes inside of 8192-byte region [0x625000005100,0x625000007100) freed by thread T0 here: #1 system_free sr_port/gtm_malloc_src.h:1477:2 #2 gtm_free_main sr_port/gtm_malloc_src.h:850:3 #3 gtm_free sr_port/gtm_malloc_src.h:1493:2 #4 mu_rndwn_file sr_unix/mu_rndwn_file.c:1457:4 #5 mupip_rundown sr_unix/mupip_rundown.c:215:8 #6 mupip_main sr_unix/mupip_main.c:117:4 #7 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11 #8 main sr_unix/mupip.c:22:9 previously allocated by thread T0 here: #1 system_malloc sr_port/gtm_malloc_src.h:1462:9 #2 gtm_malloc_main sr_port/gtm_malloc_src.h:669:10 #3 gtm_malloc sr_port/gtm_malloc_src.h:1488:9 #4 mu_rndwn_file sr_unix/mu_rndwn_file.c:457:27 #5 mupip_rundown sr_unix/mupip_rundown.c:215:8 #6 mupip_main sr_unix/mupip_main.c:117:4 #7 dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11 #8 main sr_unix/mupip.c:22:9 #9 __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 ``` Issue ----- * The issue is in `sr_unix/mu_rndwn_file.c` where we `free(tsd)` and later make a call to the `UNLINK_STATSDB_AT_BASEDB_RUNDOWN` macro which is passed the parameter `csd` which is the same as the copy of `tsd`. Fix --- * The fix is simple and is to move the `free()` call to AFTER the macro call.

…ABANDONED (fixes GTM-9400 for real) Background ---------- The below is pasted from https://gitlab.com/YottaDB/DB/YDBTest/-/issues/550#note_1733171439 * While trying to test YDBTest#550, I noticed that the KILLABANDONED error happens even with V7.0-001 whereas the GTM-9400 release note in GT.M V7.0-001 indicates this as being fixed in V7.0-001. * Below is the test case (using `tcsh`, not `sh`) that stops after a few seconds with a `KILLABANDONED` error with V7.0-000 as well as with V7.0-001. ```sh cat > kill.csh << CAT_EOF while (1) k15 reorg if (-e STOP) then break endif end CAT_EOF rm -f STOP unsetenv ydb_gbldir setenv gtmgbldir mumps.gld rm -f mumps.gld mumps.dat $gtm_dist/mumps -run GDE exit $gtm_dist/mupip create $gtm_dist/mumps -run %XCMD 'for i=1:1:100000 set ^x(i)=$j(i,200)' source kill.csh & while (1) foreach fillfactor (50 10 90) $gtm_dist/mupip reorg -fill=$fillfactor -region DEFAULT $gtm_dist/mupip integ -reg "*" if ($status) then touch STOP break endif end if (-e STOP) then break endif end ``` Issue ----- * I added an assert in `secshr_db_clnup.c` where it invoked the `INCR_ABANDONED_KILLS` macro and ran the above test to see the cause of the above `KILLABANDONED` error. * With that change, I got an assert failure (in line 446 below) and the core file showed the below stack trace. ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140185231378240) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140185231378240, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7ffcb98a5b50) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446 #9 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:124 #10 signal_exit_handler (exit_handler_name=0x7f7f6ac987be "deferred_exit_handler", sig=15, info=0x7f7f6ae3ddf8 <stapi_signal_handler_oscontext+3320>, context=0x7f7f6ae3de78 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:78 #11 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #12 deferred_signal_handler () at sr_port/deferred_signal_handler.c:74 #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813 #14 mu_reorg (gl_ptr=0x5617ac865bc0, exclude_glist_ptr=0x7ffcb98aa450, resume=0x7ffcb98aa344, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:572 #15 mupip_reorg () at sr_port/mupip_reorg.c:334 #16 mupip_main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip_main.c:117 #17 dlopen_libyottadb (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948, main_func=0x5617ab37c004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #18 main (argc=5, argv=0x7ffcb98bc918, envp=0x7ffcb98bc948) at sr_unix/mupip.c:22 (gdb) f 8 #8 secshr_db_clnup (secshr_state=NORMAL_TERMINATION) at sr_port/secshr_db_clnup.c:446 446 assert(!mu_reorg_process); (gdb) list 441 } else if (!dollar_tlevel) 442 { 443 if ((NULL != kip_csa) && (csa == kip_csa)) 444 { 445 /* Assert that MUPIP REORG never leaves the database with an abandoned kill */ 446 assert(!mu_reorg_process); 447 assert(0 < kip_csa->hdr->kill_in_prog); 448 DECR_KIP(csd, csa, kip_csa); 449 INCR_ABANDONED_KILLS(csd, csa); 450 } (gdb) f 13 #13 t_end (hist1=0x5617ac86b8c8, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:1813 1813 REVERT; /* no need for t_ch to be invoked if any errors occur after this point */ ``` * And below is the macro sequence how frame 13 `REVERT` ends up in frame 12 `deferred_signal_handler` call. ``` REVERT -> ENABLE_INTERRUPTS -> DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED -> deferred_signal_handler ``` * GT.M V7.0-001 fixed GTM-9400 by adding a `DEFERRED_EXIT_REORG_CHECK` macro at logical points in the mupip reorg code flow where we are guaranteed the kill-in-progress condition has been cleared (i.e. DECR_KIP has been called). ```diff $ git show -U1 tags/V7.0-001 sr_port/mu_reorg.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK @@ -449,2 +449,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume, DECR_KIP(cs_data, cs_addrs, kip_csa); + DEFERRED_EXIT_REORG_CHECK; if (detailed_log) -- @@ -579,2 +580,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume, DECR_KIP(cs_data, cs_addrs, kip_csa); + DEFERRED_EXIT_REORG_CHECK; if (detailed_log) @@ -677,2 +679,3 @@ boolean_t mu_reorg(glist *gl_ptr, glist *exclude_glist_ptr, boolean_t *resume, DECR_KIP(cs_data, cs_addrs, kip_csa); + DEFERRED_EXIT_REORG_CHECK; if (detailed_log) $ git show -U1 tags/V7.0-001 sr_unix/mu_swap_root.c | grep -B2 -A1 DEFERRED_EXIT_REORG_CHECK @@ -271,2 +248,3 @@ void mu_swap_root(glist *gl_ptr, int *root_swap_statistic_ptr) } + DEFERRED_EXIT_REORG_CHECK; /* a single directory tree has to be quick, so check at end, rather than each DECR_KIP */ return; ``` * But what it did not realize is that even before those logical points are reached, it is possible for `t_end.c` to invoke the `REVERT` macro which in turn would invoke `deferred_signal_handler` like is seen in the above stack trace. * Not sure how this did not get caught during the GT.M testing. Fix --- * In any case, the fix is simple and is to enhance `sr_port/deferred_signal_handler.c` to not invoke `deferred_exit_handler()` but instead `return` in case we are a `mupip reorg` process (indicated by the boolean_t typed `mu_reorg_process` global variable being TRUE) and we are in the middle of a kill-in-progress (indicated by `cs_data->kill_in_prog` being TRUE). * This way, we delay the deferred signal handling of the `MUPIP STOP` (aka `SIG-15`/SIGTERM) a little more until the logical point in `sr_port/mu_reorg.c` or `sr_unix/mu_swap_root.c` is reached where the `DEFERRED_EXIT_REORG_CHECK` macro is invoked.

…GTM-9400 in V7.0-001) Background ---------- * The `v70001/gtm9400` subtest failed in one rare in-house run with the following symptom. ```diff > ##TEST_PATH##/v70001_0_4/gtm9400/reorg_1_90.out > %YDB-F-ASSERT, Assert failed in sr_port/deferred_signal_handler.c line 40 for expression (INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) ``` * Below are the relevant details from the core file created by the `mupip reorg` process. ```c (gdb) where #0 pthread_kill () from /lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7ffe32ae6b80) at sr_unix/rts_error.c:198 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #6 deferred_signal_handler () at sr_port/deferred_signal_handler.c:40 #7 t_end (hist1=0x7ffe32ae9d10, hist2=0x0, ctn=18446744073709551614) at sr_port/t_end.c:655 #8 gvcst_bmp_mark_free (ks=0x7ffe32aeaa10) at sr_port/gvcst_bmp_mark_free.c:214 #9 mu_reorg (gl_ptr=0x14562c0, exclude_glist_ptr=0x7ffe32aeb368, resume=0x7ffe32aeb3bc, index_fill_factor=90, data_fill_factor=90, reorg_op=0) at sr_port/mu_reorg.c:452 #10 mupip_reorg () at sr_port/mupip_reorg.c:334 #11 mupip_main (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878) at sr_unix/mupip_main.c:117 #12 dlopen_libyottadb (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878, main_func=0x4015f4 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #13 main (argc=5, argv=0x7ffe32afd848, envp=0x7ffe32afd878) at sr_unix/mupip.c:22 (gdb) f 6 #6 deferred_signal_handler () at sr_port/deferred_signal_handler.c:40 40 assert(INTRPT_OK_TO_INTERRUPT == intrpt_ok_state); /* DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED ensures this */ (gdb) p forced_exit $2 = 1 (gdb) p mu_reorg_process $3 = 1 (gdb) p cs_data->kill_in_prog $4 = 1 (gdb) p intrpt_ok_state $5 = INTRPT_IN_KILL_CLEANUP ``` Issue ----- * The reorg process that assert failed was sent a `mupip stop` by the `gtm9400` subtest. * From the failure symptoms noted above, what happened is that the `mupip stop` got delivered AFTER line 321 below (when `intrpt_ok_state` was `INTRPT_OK_TO_INTERRUPT`) but before line 322. **sr_port/have_crit.h** ```c 314 /* Restore deferrable interrupts back to the state it was at time of corresponding DEFER_INTERRUPTS call */ 315 #define ENABLE_INTERRUPTS(OLDSTATE, NEWSTATE) \ 316 { \ 317 if (!multi_thread_in_use) \ 318 { \ 319 assert(OLDSTATE == intrpt_ok_state); \ 320 intrpt_ok_state = NEWSTATE; \ 321 if (INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) \ 322 DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED; \ 323 /* check if signals were deferred in deferred zone */ \ 324 } \ 325 } ``` * And as part of processing the `mupip stop`, the below code (newly introduced as part of incorporating GTM-9400 in GT.M V7.0-001 in commit 5ae98ef) set `intrpt_ok_state` to `INTRPT_IN_KILL_CLEANUP`. **sr_unix/generic_signal_handler.c** ```c 320 /* If nothing pending AND we have crit or in wcs_wtstart() or already in exit processing, wait to 321 * invoke shutdown. wcs_wtstart() manipulates the active queue that a concurrent process in crit 322 * in bt_put() might be waiting for. interrupting it can cause deadlocks (see C9C11-002178). 323 */ 324 if (mu_reorg_process && OK_TO_INTERRUPT && cs_data && cs_data->kill_in_prog) 325 DEFER_INTERRUPTS(INTRPT_IN_KILL_CLEANUP, prev_intrpt_state); /* avoid ABANDONEDKILL */ ``` * Therefore, when the `DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED` macro (in line 322 of the `ENABLE_INTERRUPTS` macro was invoked, `intrpt_ok_state` had a value that was not `INTRPT_OK_TO_INTERRUPT` and failed the assert at line 40 below. **sr_port/deferred_signal_handler.c** ```c 40 assert(INTRPT_OK_TO_INTERRUPT == intrpt_ok_state); /* DEFERRED_SIGNAL_HANDLING_CHECK_TRIMMED ensures this */ ``` Fix --- * Commit 57006c9 fixed `GTM-9400` for real (see commit message there for detail). * All that is needed is for the failing assert to account for this new possibility. And that is done by adding a `||` condition in this commit.

…ofband_clear.c) Background ---------- * After GT.M V7.0-002 changes were merged, the `r130/ydb560` subtest started failing with the following symptom. ``` %YDB-F-ASSERT, Assert failed in sr_port/outofband_clear.c line 43 for expression (TRUE == status) ``` * A simple way to reproduce this issue is to run the following and in a parallel terminal send a `kill -4` to the `mumps` process (that is stuck in the `hang` command). ```sh $ cat test.m set x=1 hang 100 $ mumps -run test ``` * Before V7.0-002 merge, one would see just 1 core file (due to the `kill -4`). But after the merge, one would see 3 core files. And the 2nd core file had the following stack trace. ``` (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140112165532736) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140112165532736, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7ffd024690b0) at sr_unix/rts_error.c:198 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #8 outofband_clear () at sr_port/outofband_clear.c:43 #9 outofband_action (lnfetch_or_start=0) at sr_port/outofband_action.c:58 #10 async_action (lnfetch_or_start=false) at sr_port/deferred_events.c:394 #11 lvzwr_var (lv=0x60f0000005f0, n=0) at sr_port/lvzwr_var.c:184 #12 lvzwr_fini (out=0x7ffd02471dc0, t=1) at sr_port/lvzwr_fini.c:84 #13 op_lvpatwrite (count=0, arg1=140724641668224) at sr_port/op_lvpatwrite.c:85 #14 zshow_zwrite (output=0x7ffd02471dc0) at sr_port/zshow_zwrite.c:40 #15 op_zshow (func=0x7ffd0247a0e0, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #16 jobexam_dump (dump_filename_arg=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fatal_file_name_buff=0x7ffd0247ae20 "/extra4/testarea1/nars/V998/tst_V998_R201_dbg_28_240320_111309/r130_0/ydb560/YDB_FATAL_ERROR.ZSHOW_DMP_89246_1.txt", fmt=0x0, dev_in_use=0x7ffd0247a240) at sr_port/jobexam_process.c:238 #17 jobexam_process (dump_file_name=0x7ffd0247bff0, dump_file_spec=0x7ffd0247c030, fmt=0x0) at sr_port/jobexam_process.c:147 #18 create_fatal_error_zshow_dmp (signal=4) at sr_port/create_fatal_error_zshow_dmp.c:66 #19 signal_exit_handler (exit_handler_name=0x7f6e64c43140 "deferred_exit_handler", sig=4, info=0x7f6e6519f938 <stapi_signal_handler_oscontext+3320>, context=0x7f6e6519f9b8 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:59 #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95 #22 set_events_from_signals (prev_intrpt_state=INTRPT_OK_TO_INTERRUPT) at sr_port/deferred_events_queue.c:48 #23 async_action (lnfetch_or_start=true) at sr_port/deferred_events.c:380 #24 l1 () at sr_x86_64/op_startintrrpt.s:40 (gdb) f 8 #8 outofband_clear () at sr_port/outofband_clear.c:43 43 assert(TRUE == status); (gdb) list 41 { 42 status = xfer_reset_if_setter(outofband); 43 assert(TRUE == status); 44 } 45 } (gdb) p outofband $1 = 11 (gdb) p (enum outofbands)outofband $2 = deferred_signal ``` Issue ----- * The issue was that `xfer_reset_if_setter()` had been reworked in GT.M V7.0-002. And that caused the handling of the `deferred_signal` type of outofband (which is a YottaDB-only value, unknown to the GT.M code base) not be handled correctly. * The reason why `xfer_reset_if_setter()` returned FALSE in line 42 above is that the `event_state` for `deferred_signal` event_type at line 249 below was `pending`. Not `active` and so the call to line 250 got skipped. That would have done the real reset that was needed. **sr_port/deferred_events.c** ```c 212 boolean_t xfer_reset_if_setter(int4 event_type) . 249 if (res = (active == TAREF1(save_xfer_root, event_type).event_state)) /* WARNING: assignment */ 250 res = (real_xfer_reset(event_type)); ``` Fix --- * The fix was to set the event_state for `deferred_signal` outofband to `active` in `deferred_signal_set()` just like it is done for `jobinterrupt` outofband in `jobinterrupt_set()`. * After this change though, an assert in line 370 below (in the `async_action()` function) failed. **sr_port/deferred_events.c** ```c 350 void async_action(bool lnfetch_or_start) . 358 if (jobinterrupt == outofband) 359 { . 367 TAREF1(save_xfer_root, jobinterrupt).event_state = pending; /* jobinterrupt gets a pass from the assert below */ 368 } else if (!lnfetch_or_start) 369 { /* something other than a new line caugth this, so */ 370 assert(pending >= TAREF1(save_xfer_root, outofband).event_state); 371 TAREF1(save_xfer_root, outofband).event_state = pending; /* make it pending in case it was not there yet */ 372 } ``` I noticed that `jobinterrupt` gets special handling in line 367. So decided to have special handling for `deferred_signal` as well. But the special handling is different here in that we do not modify the `event_state` (like is done for `jobinterrupt` in line 367 above) for the `deferred_signal` case. Just that we skip lines 370-371. * With the changes in the above 2 bullets, the simple test case shown above started working fine in that it only generated 1 core file (not 3 core files).

…er call in ss_initiate) Background ---------- * When run with a Debug build of YottaDB, a `mupip integ` failed as follows. ```c %YDB-F-ASSERT, Assert failed in sr_unix/gt_timers.c line 504 for expression ((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state) || (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state)) ``` Issue ----- * The issue is that GT.M V7.0-005 added a `START_JNL_FILE_CLOSE_TIMER_IF_NEEDED` call to the `SET_SNAPSHOTS_IN_PROG` macro as can be seen in the below diff. ```diff $ git show -U0 tags/V7.0-005 sr_port/gdsfhead.h @@ -4187 +4187 @@ MBSTART { -#define SET_SNAPSHOTS_IN_PROG(X) ((X)->snapshot_in_prog = TRUE) +#define SET_SNAPSHOTS_IN_PROG(X) MBSTART { (X)->snapshot_in_prog = TRUE; START_JNL_FILE_CLOSE_TIMER_IF_NEEDED; } MBEND ``` * This caused the assert failure in the following code (which is present only in YottaDB, not in GT.M). **sr_unix/gt_timers.c** ```c 501 } else if (jnl_file_close_timer_fptr == handler) 502 { /* Account for known instances of the above function being called from within a deferred zone. */ 503 assert((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state) 504 || (INTRPT_IN_GDS_RUNDOWN == intrpt_ok_state)); 505 safe_to_add = TRUE; ``` * Below are details from the gdb analysis of the core file ```c (gdb) where #0 __pthread_kill_implementation (no_tid=0, signo=3, threadid=140357926831936) at ./nptl/pthread_kill.c:44 #1 __pthread_kill_internal (signo=3, threadid=140357926831936) at ./nptl/pthread_kill.c:78 #2 __GI___pthread_kill (threadid=140357926831936, signo=3) at ./nptl/pthread_kill.c:89 #3 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #4 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #5 ch_cond_core () at sr_unix/ch_cond_core.c:80 #6 rts_error_va (csa=0x0, argcnt=7, var=0x7fff19e828f0) at sr_unix/rts_error.c:199 #7 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:100 #8 start_timer (tid=140357912091280, time_to_expir=60000000000, handler=0x7fa79f7dda90 <jnl_file_close_timer>, hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:503 #9 ss_initiate (reg=0x55b37821b170, util_ss_ptr=0x55b37821a9c0, ss_ctx=0x55b378219110, preserve_snapshot=0, calling_utility=0x7fa7a03c8506 "MUPIP INTEG") at sr_unix/ss_initiate.c:666 #10 mu_int_reg (reg=0x55b37821b170, return_value=0x7fff19e8560c, return_after_open=0) at sr_port/mu_int_reg.c:192 #11 mupip_integ () at sr_port/mupip_integ.c:438 #12 mupip_main (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630) at sr_unix/mupip_main.c:130 #13 dlopen_libyottadb (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630, main_func=0x55b37723f004 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #14 main (argc=4, argv=0x7fff19e89608, envp=0x7fff19e89630) at sr_unix/mupip.c:21 (gdb) f 8 #8 start_timer (tid=140357912091280, time_to_expir=60000000000, handler=0x7fa79f7dda90 <jnl_file_close_timer>, hdata_len=0, hdata=0x0) at sr_unix/gt_timers.c:503 503 assert((INTRPT_OK_TO_INTERRUPT == intrpt_ok_state) || (INTRPT_IN_DB_CSH_GETN == intrpt_ok_state) (gdb) p intrpt_ok_state $1 = INTRPT_IN_SS_INITIATE ``` Fix --- * Now that we know this is expected, the above value is also added as an accepted value in the assert.

Background ---------- * In internal testing, we noticed a rare failure in the `v51000/mu_bkup_stop` subtest where a `mupip backup` process that was sent a `SIGTERM` (by the test) ended up creating a core file due to ASAN assert failing on a double free. * Below are relevant details from the core file. ```c Core was generated by `mupip backup -online -dbg * ./49181_online1'. Program terminated with signal SIGSEGV, Segmentation fault. (gdb) where #0 ydb_os_signal_handler (sig=11, info=0x7fd09968c3f0, context=0x7fd09968c2c0) at sr_unix/ydb_os_signal_handler.c:57 #1 <signal handler called> #2 ydb_os_signal_handler (sig=6, info=0x7fd09968caf0, context=0x7fd09968c9c0) at sr_unix/ydb_os_signal_handler.c:57 #3 <signal handler called> #4 __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44 #5 __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78 #6 __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26 #7 __GI_abort () at ./stdlib/abort.c:79 #8 __sanitizer::Abort () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_posix_libcdep.cpp:143 #9 __sanitizer::Die () at ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:58 #10 __asan::ScopedInErrorReport::~ScopedInErrorReport (this=0x7ffda6de6ebe, __in_chrg=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:190 #11 __asan::ReportDoubleFree (addr=140533757257728, free_stack=<optimized out>) at ../../../../src/libsanitizer/asan/asan_report.cpp:224 #12 __asan::Allocator::ReportInvalidFree (this=<optimized out>, stack=0x7ffda6de79f0, chunk_state=<optimized out>, ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_allocator.cpp:757 #13 __interceptor_free (ptr=0x7fd090ae2800) at ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:53 #14 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1485 #15 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854 #16 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501 #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 #18 mupip_backup_call_on_signal () at sr_port/mupip_backup.c:208 #19 signal_exit_handler (exit_handler_name=0x7fd097f1dda0 "deferred_exit_handler", sig=15, info=0x7fd098480fd8 <stapi_signal_handler_oscontext+3320>, context=0x7fd098481058 <stapi_signal_handler_oscontext+3448>, is_deferred_exit=1) at sr_unix/signal_exit_handler.c:67 #20 deferred_exit_handler () at sr_unix/deferred_exit_handler.c:120 #21 deferred_signal_handler () at sr_port/deferred_signal_handler.c:95 #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486 #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854 #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501 #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 #26 mupip_backup () at sr_port/mupip_backup.c:1585 #27 mupip_main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip_main.c:130 #28 dlopen_libyottadb (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50, main_func=0x55af49fd9020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #29 main (argc=6, argv=0x7ffda6deef18, envp=0x7ffda6deef50) at sr_unix/mupip.c:21 (gdb) f 25 #25 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 103 free(ptr->backup_hdr); (gdb) f 17 #17 mubclnup (curr_ptr=0x0, stage=need_to_del_tempfile) at sr_port/mubclnup.c:103 103 free(ptr->backup_hdr); (gdb) down #24 gtm_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1501 1501 gtm_free_main(addr, TAIL_CALL_LEVEL); (gdb) down #23 gtm_free_main (addr=0x7fd090ae2800, stack_level=1) at sr_port/gtm_malloc_src.h:854 854 system_free(addr); (gdb) down #22 system_free (addr=0x7fd090ae2800) at sr_port/gtm_malloc_src.h:1486 1486 ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state); (gdb) list 1481 { 1482 intrpt_state_t prev_intrpt_state; 1483 1484 DEFER_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state); 1485 free(addr); 1486 ENABLE_INTERRUPTS(INTRPT_IN_FUNC_WITH_MALLOC, prev_intrpt_state); 1487 return; 1488 } ``` Issue ----- * We did a `free(ptr->backup_hdr)` at line 103. And that in turn ended up using the system `free()` function because the test framework had randomly set the `gtmdbglvl` env var to a value of `0x80000000`. * So at line 1485 above, the system free finished but at line 1486 we noticed the SIGTERM that was deferred and so decided to handle it. But the `ptr->backup_hdr` variable was still set to a non-NULL value so as part of the deferred exit handler, we tried to free this again resulting in the double free. Fix --- * The fix is to note `ptr->backup_hdr` in a local variable and clear the former and then attempting the `free()` on the local variable. This way if we decide to do deferred exit handling after the `free()` occurred, we will notice a NULL value of `ptr->backup_hdr` and so avoid the double free. Notes ----- * This is considered a too rare a race condition to be encountered in practice and so it is expected to not be noticed by a user. Therefore no YDB issue is created for this.

Background ---------- * This is an issue in c087690 that was found by YDBTest@a8eaadc4 (YDBTest!2075). * When YottaDB was built with ASAN, running the `r202/shebang-ydb1084` subtest gave the following error. ```c ================================================================= ==29909==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x6020000001b6 at pc 0x559522eb8f9d bp 0x7ffd2cb5aa60 sp 0x7ffd2cb5a1e8 READ of size 7 at 0x6020000001b6 thread T0 #0 0x559522eb8f9c in printf_common(void*, char const*, __va_list_tag*) asan_interceptors.cpp.o #1 0x7f9d29cc83d3 in gtm_snprintf sr_unix/gtm_stdio.c:102:2 #2 0x7f9d2ac7629a in ydb_shebang sr_port/ydb_shebang.c:186:3 #3 0x7f9d29e41973 in jobchild_init sr_unix/jobchild_init.c:138:16 #4 0x7f9d29cc5db5 in gtm_startup sr_unix/gtm_startup.c:287:3 #5 0x7f9d29e04dd8 in init_gtm sr_unix/init_gtm.c:203:2 #6 0x7f9d29b62e82 in gtm_main sr_unix/gtm_main.c:194:2 #7 0x559522f55eee in dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11 #8 0x559522f54f1d in main sr_unix/gtm.c:20:9 #9 0x7f9d2efa1d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 #10 0x7f9d2efa1e3f in __libc_start_main csu/../csu/libc-start.c:392:3 #11 0x559522e97344 in _start (/usr/library/V998_R201/dbg/yottadb+0x1e344) (BuildId: a420a9112acf24ed4fac1e0bcc38f1947ddde8e8) 0x6020000001b6 is located 0 bytes to the right of 6-byte region [0x6020000001b0,0x6020000001b6) allocated by thread T0 here: #0 0x559522f1a18e in __interceptor_malloc (/usr/library/V998_R201/dbg/yottadb+0xa118e) (BuildId: a420a9112acf24ed4fac1e0bcc38f1947ddde8e8) #1 0x7f9d2a42cd8f in system_malloc sr_port/gtm_malloc_src.h:1470:9 #2 0x7f9d2a4274f9 in gtm_malloc_main sr_port/gtm_malloc_src.h:673:10 #3 0x7f9d2a436de6 in gtm_malloc sr_port/gtm_malloc_src.h:1496:9 #4 0x7f9d2b4d28f2 in zro_load sr_unix/zro_load.c:388:42 #5 0x7f9d2ac7d169 in zro_init sr_port/zro_init.c:95:2 #6 0x7f9d2b4d301f in zro_search sr_unix/zro_search.c:79:3 #7 0x7f9d2ac75b53 in ydb_shebang sr_port/ydb_shebang.c:152:2 #8 0x7f9d29e41973 in jobchild_init sr_unix/jobchild_init.c:138:16 #9 0x7f9d29cc5db5 in gtm_startup sr_unix/gtm_startup.c:287:3 #10 0x7f9d29e04dd8 in init_gtm sr_unix/init_gtm.c:203:2 #11 0x7f9d29b62e82 in gtm_main sr_unix/gtm_main.c:194:2 #12 0x559522f55eee in dlopen_libyottadb sr_unix/dlopen_libyottadb.c:151:11 #13 0x559522f54f1d in main sr_unix/gtm.c:20:9 #14 0x7f9d2efa1d8f in __libc_start_call_main csu/../sysdeps/nptl/libc_start_call_main.h:58:16 ``` Issue ----- * `TREF(dollar_zroutines)` is an mstr. That is, it has a `.addr` and `.len` field. It is not a null-terminated string. But it was used as one in the following line where the last `%s` in the SNPRINTF command corresponds to that. **sr_port/ydb_shebang.c** ``` 186 SNPRINTF(new_zro, space_needed, "%s(%s) %s", buf, rtn_path, (TREF(dollar_zroutines)).addr); ``` * This caused `snprintf` to treat it as a null terminated string and access the null byte as part of its processing. And that caused ASAN to fail with the `heap-buffer-overflow` error. Fix --- * The fix is simple and is to not treat it as a null terminated string in the `snprintf` command. And that is easily achieved by using a `%.*s` instead of a `%s` (it now requires passing the length in addition to the char pointer).

YottaDB added enhancement help wanted labels Jul 14, 2017

nars1 mentioned this issue Sep 25, 2017

[YDB#30] [NARS] [estess] Fix SIG-11 from source server if started wit… #36

Closed

nars1 mentioned this issue Sep 25, 2017

[YDB#30] [NARS] [estess] Fix SIG-11 from source server if started wit… #37

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The method of printing SVNs in zshow_svn is sub-optimal #9

The method of printing SVNs in zshow_svn is sub-optimal #9

estess commented Jul 14, 2017 •

edited

Loading

The method of printing SVNs in zshow_svn is sub-optimal #9

The method of printing SVNs in zshow_svn is sub-optimal #9

Comments

estess commented Jul 14, 2017 • edited Loading

Final Release Note

Description

Draft Release Note

estess commented Jul 14, 2017 •

edited

Loading