-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YDB#50] [NARS1] [estess] Change all GT.M release name usages (e.g. now_running field in db shared memory) to YottaDB release names #51
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
…ow_running field in db shared memory) to YottaDB release names Release Note ------------- A VERMISMATCH error now reports the YottaDB release # and not the GT.M release #. Test ----- * E_ALL run many times to ensure no regressions. README ------- Almost all usages of gtm_release_name were examined and changed to ydb_release_name. The only place left was where we report $ZVERSION which does need to use gtm_release_name. This meant that database shared memory created by a GT.M version cannot be attached to by a YottaDB version even if both the versions have the same shared memory layout (because the shared memory would have the GT.M version # stored whereas the YottaDB version expects the YottaDB release # to be stored). Similarly, the jnlpool, receive pool and the object file now have the YottaDB release # stored.
estess
approved these changes
Oct 5, 2017
nars1
added a commit
that referenced
this pull request
Jan 19, 2022
… failure in a Simple Threaded API application Background ---------- * The ideminter_rolrec/interrupted_rollback_or_recover subtest failed on a RHEL8 x86_64 in-house system as well as an AARCH64 system with the following symptom. ```diff 107a108,239 > ideminter_rolrec_0/interrupted_rollback_or_recover/savedjoboutput_0_20220107_233728/impjob_imptp0.mje4 > %YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with core ``` * The C-stack of the core file had the following (frames with repeating traces not shown below) ```c #0 pthread_kill () from /usr/lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 rts_error_va (csa=0x0, argcnt=7, var=0x7ffc809d7070) at sr_unix/rts_error.c:144 #3 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #4 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 . . #38 rts_error_va (csa=0x0, argcnt=7, var=0x7ffc809dadb0) at sr_unix/rts_error.c:163 #39 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #40 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 #41 deferred_signal_handler () at sr_port/deferred_signal_handler.c:64 #42 eintr_handling_check () at sr_port/eintr_handling_check.c:29 #43 gtm_fprintf (stream=0x7f7f6c14b600 <_IO_2_1_stderr_>, format=0x7f7f6d7e1005 "%s") at sr_unix/gtm_stdio.c:75 #44 util_out_print_vaparm (message=0x0, flush=1, var=0x7ffc809dbb20, faocnt=2147483647) at sr_unix/util_output.c:873 #45 util_out_print (message=0x0, flush=1) at sr_unix/util_output.c:911 #46 util_cond_flush () at sr_unix/util_output.c:952 #47 ch_cond_core () at sr_unix/ch_cond_core.c:76 #48 rts_error_va (csa=0x0, argcnt=7, var=0x7ffc809dbd20) at sr_unix/rts_error.c:198 #49 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #50 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 #51 deferred_signal_handler () at sr_port/deferred_signal_handler.c:64 #52 grab_lock (reg=0xe8e960, is_blocking_wait=1, onln_rlbk_action=2) at sr_unix/grab_lock.c:111 #53 tp_tend () at sr_port/tp_tend.c:1261 #54 op_tcommit () at sr_port/op_tcommit.c:494 #55 stkok3 () at sr_x86_64/opp_tcommit.s:30 (gdb) f 50 #50 0x00007f7f6d51efb2 in ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 95 assert((SIGINT == stapi_signal_handler_oscontext[sig_hndlr_ctrlc_handler].sig_num) || USING_ALTERNATE_SIGHANDLING); (gdb) list 93 if (STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_ctrlc_handler)) 94 { 95 --> assert((SIGINT == stapi_signal_handler_oscontext[sig_hndlr_ctrlc_handler].sig_num) || USING_ALTERNATE_SIGHANDLING); 96 ydb_stm_invoke_deferred_signal_handler_type = sig_hndlr_continue_handler; 97 ctrlc_handler(DUMMY_SIG_NUM, NULL, NULL); 98 ydb_stm_invoke_deferred_signal_handler_type = sig_hndlr_none; 99 } (gdb) p/x stapi_signal_handler_deferred $15 = 0x404 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_none].sig_num $3 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_continue_handler].sig_num $4 = 18 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_ctrlc_handler].sig_num $5 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_dbcertify_signal_handler].sig_num $6 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_generic_signal_handler].sig_num $7 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_jobexam_signal_handler].sig_num $8 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_jobinterrupt_event].sig_num $9 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_job_term_handler].sig_num $10 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_op_fnzpeek_signal_handler].sig_num $11 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_suspsigs_handler].sig_num $12 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_timer_handler].sig_num $13 = 14 (gdb) p simpleThreadAPI_active $1 = 1 ``` * As can be seen from the gdb analysis of frame 50, the global variable `stapi_signal_handler_deferred` has 2 bits set. The ones corresponding to `sig_hndlr_ctrlc_handler` and `sig_hndlr_timer_handler`. But the `stapi_signal_handler_oscontext[]` global variable indicates that the 2 signals that were deferred corresponded to `sig_hndlr_continue_handler` and `sig_hndlr_timer_handler`. This disconnect between `sig_hndlr_ctrlc_handler` and `sig_hndlr_continue_handler` across the 2 global variables contributed to the assert failure. Issue ----- * After a code review, I have a theory on the cause of the issue. And that is the below macro got invoked by multiple threads at the same time for 2 SIGCONT signals (multiple threads possible since this is a Simple Thread API application as seen by the non-zero value of `simpleThreadAPI_active` in gdb). **sr_unix/sig_init.h** ```c 80 #define STAPI_SET_SIGNAL_HANDLER_DEFERRED(SIGHNDLRTYPE, SIG_NUM, INFO, CONTEXT) \ 81 { \ 82 SAVE_OS_SIGNAL_HANDLER_SIGNUM(SIGHNDLRTYPE, SIG_NUM); \ 83 SAVE_OS_SIGNAL_HANDLER_INFO(SIGHNDLRTYPE, INFO); \ 84 SAVE_OS_SIGNAL_HANDLER_CONTEXT(SIGHNDLRTYPE, CONTEXT); \ 85 SHM_WRITE_MEMORY_BARRIER; \ 86 stapi_signal_handler_oscontext[SIGHNDLRTYPE].sig_forwarded = TRUE; \ 87 SHM_WRITE_MEMORY_BARRIER; \ 88 /* Need interlocked add below since multiple threads can execute this code \ 89 * at the same time (e.g. if SIGALRM and SIGTERM happen at same time). \ 90 * Just like SET_DEFERRED_TIMERS_CHECK_NEEDED macro in "have_crit.h", \ 91 * we need to check the value before doing the interlocked add. \ 92 */ \ 93 if (!STAPI_IS_SIGNAL_HANDLER_DEFERRED(SIGHNDLRTYPE)) \ 94 INTERLOCK_ADD(&stapi_signal_handler_deferred, (1 << SIGHNDLRTYPE)); \ 95 SHM_WRITE_MEMORY_BARRIER; \ 96 SET_DEFERRED_STAPI_CHECK_NEEDED; \ 97 } ``` * Threads 1 and 2 both went to line 93 and found `STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_continue_handler)` FALSE. And then both went to proceed with the `INTERLOCK_ADD` call resulting in TWO increments for the same SIGCONT signal. This caused the value of the global variable `stapi_signal_handler_deferred` to have the bit following `sig_hndlr_continue_handler` set i.e. the `sig_hndlr_ctrlc_handler` bit (in the enum sequence). Thus we ended up with `stapi_signal_handler_deferred` having the least significant 4 in its `0x404` value even though this process never had a Ctrl-C sent by the test (but SIGCONT could be sent by processes in the test at any time and not just one but more than one is also possible and is likely what happened in the test). * While `INTERLOCK_ADD` is an interlocked operation, the decision to do it is based on checking the value of the global variable and that happens outside of the interlock and so the two together do not constitute an atomic operation like they should be. Fix --- * As part of e175d54, the global variables `deferred_signal_handling_needed` and `stapi_signal_handler_deferred` were updated to use `INTERLOCK_ADD` to prevent out-of-design values with multiple threads updating the same global variable at the same time. * As part of a later commit 98e8638, `deferred_signal_handling_needed` was updated to use `COMPSWAP_LOCK` instead of `INTERLOCK_ADD`. * But `stapi_signal_handler_deferred` continued to use `INTERLOCK_ADD`. * The fix is simple and is to make `stapi_signal_handler_deferred` use `COMPSWAP_LOCK` instead of `INTERLOCK_ADD`. * Towards this, the global variable now has a type of `global_latch_t`. * A new `BIT_SET_INTERLOCKED` macro has been introduced in `sr_unix/sig_init.h`. This is based largely on the pre-existing `SET_DEFERRED_CONDITION` macro except that the new macro allows for the latch variable to be passed as a parameter. The following macros have been changed to use the new `BIT_SET_INTERLOCKED` macro. - STAPI_SET_SIGNAL_HANDLER_DEFERRED macro in sr_unix/sig_init.h - STAPI_CLEAR_SIGNAL_HANDLER_DEFERRED macro in sr_unix/sig_init.h - SET_DEFERRED_CONDITION macro in sr_port/have_crit.h - CLEAR_DEFERRED_CONDITION macro in sr_port/have_crit.h * A new `BIT_GET_INTERLOCKED` macro gets the latch value. This is based on the pre-existing `GET_DEFERRED_CONDITION` macro in sr_port/have_crit.h. All places that need to check the value of the `stapi_signal_handler_deferred` global variable now use this new macro. the pre-existing `GET_DEFERRED_CONDITION` macro has also been changed to use this new macro. * Unnecessary `GBLREF`s of `stapi_signal_handler_deferred` in a few files have been removed. * Note that these changes are done based on a theory of what could have caused the failure. It is not yet known for sure if that is indeed the cause as the failure happens very rarely (it is not reproducible on demand even after running dozens of tests on lots of systems). Only time will tell if this fix is correct. Misc changes ------------ * Noticed a longstanding typo in `sr_unix/ydb_stm_invoke_deferred_signal_handler.c` and fixed it. Not yet sure though if there is a user visible issue due to this.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Release Note
A VERMISMATCH error now reports the YottaDB release # and not the GT.M release #.
Test
README
Almost all usages of gtm_release_name were examined and changed to ydb_release_name.
The only place left was where we report $ZVERSION which does need to use gtm_release_name.
This meant that database shared memory created by a GT.M version cannot be attached to by a
YottaDB version even if both the versions have the same shared memory layout (because the
shared memory would have the GT.M version # stored whereas the YottaDB version expects the
YottaDB release # to be stored). Similarly, the jnlpool, receive pool and the object file
now have the YottaDB release # stored.