-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OFRZAUTOREL message includes full path of database file #47
Comments
nars1
added a commit
to nars1/YottaDB
that referenced
this issue
Oct 4, 2017
…ll path instead of ambiguous region name YottaDB#47 Release Note ------------- The OFRZAUTOREL message displays the full path of the database file name instead of a potentially-ambiguous region name. Test ---- * E_ALL run to ensure no regressions. * v63001/gtm8362 subtest enhanced to search for OFRZAUTOREL message coming only from the test output directory of interest. Previously, if two concurrent gtm8362 tests were running, it was possible for one test to be confused by the OFRZAUTOREL message of another test (since the message only indicated the region name). Now the OFRZAUTOREL message contains the test output directory path so it can clearly identify messages that correspond to its database file of interest.
nars1
added a commit
that referenced
this issue
Oct 4, 2017
…ll path instead of ambiguous region name #47 Release Note ------------- The OFRZAUTOREL message displays the full path of the database file name instead of a potentially-ambiguous region name. Test ---- * E_ALL run to ensure no regressions. * v63001/gtm8362 subtest enhanced to search for OFRZAUTOREL message coming only from the test output directory of interest. Previously, if two concurrent gtm8362 tests were running, it was possible for one test to be confused by the OFRZAUTOREL message of another test (since the message only indicated the region name). Now the OFRZAUTOREL message contains the test output directory path so it can clearly identify messages that correspond to its database file of interest.
ksbhaskar
changed the title
OFRZAUTOREL message should indicate database file full path instead of ambiguous region name
OFRZAUTOREL message includes full path of database file
Oct 25, 2017
nars1
added a commit
that referenced
this issue
Aug 7, 2020
…that can fail in rare cases * The below assert failed once in 1500 r126/ydb464 subtest runs across a dozen boxes. ```c assert((1 != forced_exit) || GET_DEFERRED_EXIT_CHECK_NEEDED); #27 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 #29 wcs_stale (tid=8350064, hd_len=8, region=0x872fe8) at sr_port/t_end_sysops.c:1386 #30 timer_handler (why=14, info=0x7fadda7caea8 <stapi_signal_handler_oscontext+10728>, context=0x7fadda7caf28 <stapi_signal_handler_oscontext+10856>) at sr_unix/gt_timers.c:773 #31 ydb_os_signal_handler (sig=14, info=0x7ffdee359eb0, context=0x7ffdee359d80) at sr_unix/ydb_os_signal_handler.c:66 #32 <signal handler called> #33 generic_signal_handler (sig=2, info=0x7fadda7c9588 <stapi_signal_handler_oscontext+4296>, context=0x7fadda7c9608 <stapi_signal_handler_oscontext+4424>) at sr_unix/generic_signal_handler.c:244 #34 ydb_os_signal_handler (sig=2, info=0x7ffdee35a870, context=0x7ffdee35a740) at sr_unix/ydb_os_signal_handler.c:88 #35 <signal handler called> #36 clock_nanosleep () from /usr/lib64/libc.so.6 #37 m_usleep (useconds=1000) at sr_unix/sleep.c:28 #38 wcs_sleep (sleepfactor=1) at sr_port/wcs_sleep.c:28 #39 wcs_phase2_commit_wait (csa=0x7f4840, cr=0x7fadd7b88a50) at sr_port/wcs_phase2_commit_wait.c:268 #40 t_qread (blk=6, cycle=0x7ffdee35c568, cr_out=0x7ffdee35c558) at sr_port/t_qread.c:668 #41 gvcst_search (pKey=0x7f6040, pHist=0x0) at sr_port/gvcst_search.c:438 #42 gvcst_put2 (val=0x7fadda7d3940 <increment_delta_mval>, parms=0x7ffdee35f8b0) at sr_port/gvcst_put.c:980 #43 gvcst_put (val=0x7fadda7d3940 <increment_delta_mval>) at sr_port/gvcst_put.c:299 #44 gvcst_incr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/gvcst_incr.c:60 #45 op_gvincr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/op_gvincr.c:60 #46 ydb_incr_s (varname=0x7ffdee362b10, subs_used=0, subsarray=0x7ffdee3629d0, increment=0x7ffdee362aa0, ret_value=0x7ffdee362a90) at sr_unix/ydb_incr_s.c:156 #47 runProc (settings=0x7ffdee363490, curDepth=0) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:447 #48 runProc_driver (settings=0x7ffdee363490) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:145 #49 main () at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:93 (gdb) f 28 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 830 ENABLE_INTERRUPTS(INTRPT_IN_WCS_WTSTART, prev_intrpt_state); (gdb) p forced_exit $5 = 1 (gdb) p deferred_signal_handling_needed $6 = 0 ``` * This is because the signal handler for SIGALRM (sig=14) happened after line 183 but before line 193. ```c sr_port/have_crit.h ------------------- 183 forced_exit = 1; \ 184 if (in_os_signal_handler) \ 185 { /* If we are inside an OS signal handler and therefore had to defer exit \ 186 * handling, set "outofband" also to TRUE as this is checked by lots of \ 187 * potentially long-running commands in the runtime (e.g. HANG etc.) and we \ 188 * want all of those to automatically trigger process exit handling. \ 189 */ \ 190 outofband = deferred_signal; \ 191 } \ 192 /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */ \ 193 SET_DEFERRED_EXIT_CHECK_NEEDED; \ ``` * We cannot easily protect against this situation and so it is better to remove the assert.
nars1
added a commit
that referenced
this issue
Jan 19, 2022
… failure in a Simple Threaded API application Background ---------- * The ideminter_rolrec/interrupted_rollback_or_recover subtest failed on a RHEL8 x86_64 in-house system as well as an AARCH64 system with the following symptom. ```diff 107a108,239 > ideminter_rolrec_0/interrupted_rollback_or_recover/savedjoboutput_0_20220107_233728/impjob_imptp0.mje4 > %YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with core ``` * The C-stack of the core file had the following (frames with repeating traces not shown below) ```c #0 pthread_kill () from /usr/lib64/libpthread.so.0 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 rts_error_va (csa=0x0, argcnt=7, var=0x7ffc809d7070) at sr_unix/rts_error.c:144 #3 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #4 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 . . #38 rts_error_va (csa=0x0, argcnt=7, var=0x7ffc809dadb0) at sr_unix/rts_error.c:163 #39 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #40 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 #41 deferred_signal_handler () at sr_port/deferred_signal_handler.c:64 #42 eintr_handling_check () at sr_port/eintr_handling_check.c:29 #43 gtm_fprintf (stream=0x7f7f6c14b600 <_IO_2_1_stderr_>, format=0x7f7f6d7e1005 "%s") at sr_unix/gtm_stdio.c:75 #44 util_out_print_vaparm (message=0x0, flush=1, var=0x7ffc809dbb20, faocnt=2147483647) at sr_unix/util_output.c:873 #45 util_out_print (message=0x0, flush=1) at sr_unix/util_output.c:911 #46 util_cond_flush () at sr_unix/util_output.c:952 #47 ch_cond_core () at sr_unix/ch_cond_core.c:76 #48 rts_error_va (csa=0x0, argcnt=7, var=0x7ffc809dbd20) at sr_unix/rts_error.c:198 #49 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #50 ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 #51 deferred_signal_handler () at sr_port/deferred_signal_handler.c:64 #52 grab_lock (reg=0xe8e960, is_blocking_wait=1, onln_rlbk_action=2) at sr_unix/grab_lock.c:111 #53 tp_tend () at sr_port/tp_tend.c:1261 #54 op_tcommit () at sr_port/op_tcommit.c:494 #55 stkok3 () at sr_x86_64/opp_tcommit.s:30 (gdb) f 50 #50 0x00007f7f6d51efb2 in ydb_stm_invoke_deferred_signal_handler () at sr_unix/ydb_stm_invoke_deferred_signal_handler.c:95 95 assert((SIGINT == stapi_signal_handler_oscontext[sig_hndlr_ctrlc_handler].sig_num) || USING_ALTERNATE_SIGHANDLING); (gdb) list 93 if (STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_ctrlc_handler)) 94 { 95 --> assert((SIGINT == stapi_signal_handler_oscontext[sig_hndlr_ctrlc_handler].sig_num) || USING_ALTERNATE_SIGHANDLING); 96 ydb_stm_invoke_deferred_signal_handler_type = sig_hndlr_continue_handler; 97 ctrlc_handler(DUMMY_SIG_NUM, NULL, NULL); 98 ydb_stm_invoke_deferred_signal_handler_type = sig_hndlr_none; 99 } (gdb) p/x stapi_signal_handler_deferred $15 = 0x404 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_none].sig_num $3 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_continue_handler].sig_num $4 = 18 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_ctrlc_handler].sig_num $5 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_dbcertify_signal_handler].sig_num $6 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_generic_signal_handler].sig_num $7 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_jobexam_signal_handler].sig_num $8 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_jobinterrupt_event].sig_num $9 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_job_term_handler].sig_num $10 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_op_fnzpeek_signal_handler].sig_num $11 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_suspsigs_handler].sig_num $12 = 0 (gdb) p stapi_signal_handler_oscontext[sig_hndlr_timer_handler].sig_num $13 = 14 (gdb) p simpleThreadAPI_active $1 = 1 ``` * As can be seen from the gdb analysis of frame 50, the global variable `stapi_signal_handler_deferred` has 2 bits set. The ones corresponding to `sig_hndlr_ctrlc_handler` and `sig_hndlr_timer_handler`. But the `stapi_signal_handler_oscontext[]` global variable indicates that the 2 signals that were deferred corresponded to `sig_hndlr_continue_handler` and `sig_hndlr_timer_handler`. This disconnect between `sig_hndlr_ctrlc_handler` and `sig_hndlr_continue_handler` across the 2 global variables contributed to the assert failure. Issue ----- * After a code review, I have a theory on the cause of the issue. And that is the below macro got invoked by multiple threads at the same time for 2 SIGCONT signals (multiple threads possible since this is a Simple Thread API application as seen by the non-zero value of `simpleThreadAPI_active` in gdb). **sr_unix/sig_init.h** ```c 80 #define STAPI_SET_SIGNAL_HANDLER_DEFERRED(SIGHNDLRTYPE, SIG_NUM, INFO, CONTEXT) \ 81 { \ 82 SAVE_OS_SIGNAL_HANDLER_SIGNUM(SIGHNDLRTYPE, SIG_NUM); \ 83 SAVE_OS_SIGNAL_HANDLER_INFO(SIGHNDLRTYPE, INFO); \ 84 SAVE_OS_SIGNAL_HANDLER_CONTEXT(SIGHNDLRTYPE, CONTEXT); \ 85 SHM_WRITE_MEMORY_BARRIER; \ 86 stapi_signal_handler_oscontext[SIGHNDLRTYPE].sig_forwarded = TRUE; \ 87 SHM_WRITE_MEMORY_BARRIER; \ 88 /* Need interlocked add below since multiple threads can execute this code \ 89 * at the same time (e.g. if SIGALRM and SIGTERM happen at same time). \ 90 * Just like SET_DEFERRED_TIMERS_CHECK_NEEDED macro in "have_crit.h", \ 91 * we need to check the value before doing the interlocked add. \ 92 */ \ 93 if (!STAPI_IS_SIGNAL_HANDLER_DEFERRED(SIGHNDLRTYPE)) \ 94 INTERLOCK_ADD(&stapi_signal_handler_deferred, (1 << SIGHNDLRTYPE)); \ 95 SHM_WRITE_MEMORY_BARRIER; \ 96 SET_DEFERRED_STAPI_CHECK_NEEDED; \ 97 } ``` * Threads 1 and 2 both went to line 93 and found `STAPI_IS_SIGNAL_HANDLER_DEFERRED(sig_hndlr_continue_handler)` FALSE. And then both went to proceed with the `INTERLOCK_ADD` call resulting in TWO increments for the same SIGCONT signal. This caused the value of the global variable `stapi_signal_handler_deferred` to have the bit following `sig_hndlr_continue_handler` set i.e. the `sig_hndlr_ctrlc_handler` bit (in the enum sequence). Thus we ended up with `stapi_signal_handler_deferred` having the least significant 4 in its `0x404` value even though this process never had a Ctrl-C sent by the test (but SIGCONT could be sent by processes in the test at any time and not just one but more than one is also possible and is likely what happened in the test). * While `INTERLOCK_ADD` is an interlocked operation, the decision to do it is based on checking the value of the global variable and that happens outside of the interlock and so the two together do not constitute an atomic operation like they should be. Fix --- * As part of e175d54, the global variables `deferred_signal_handling_needed` and `stapi_signal_handler_deferred` were updated to use `INTERLOCK_ADD` to prevent out-of-design values with multiple threads updating the same global variable at the same time. * As part of a later commit 98e8638, `deferred_signal_handling_needed` was updated to use `COMPSWAP_LOCK` instead of `INTERLOCK_ADD`. * But `stapi_signal_handler_deferred` continued to use `INTERLOCK_ADD`. * The fix is simple and is to make `stapi_signal_handler_deferred` use `COMPSWAP_LOCK` instead of `INTERLOCK_ADD`. * Towards this, the global variable now has a type of `global_latch_t`. * A new `BIT_SET_INTERLOCKED` macro has been introduced in `sr_unix/sig_init.h`. This is based largely on the pre-existing `SET_DEFERRED_CONDITION` macro except that the new macro allows for the latch variable to be passed as a parameter. The following macros have been changed to use the new `BIT_SET_INTERLOCKED` macro. - STAPI_SET_SIGNAL_HANDLER_DEFERRED macro in sr_unix/sig_init.h - STAPI_CLEAR_SIGNAL_HANDLER_DEFERRED macro in sr_unix/sig_init.h - SET_DEFERRED_CONDITION macro in sr_port/have_crit.h - CLEAR_DEFERRED_CONDITION macro in sr_port/have_crit.h * A new `BIT_GET_INTERLOCKED` macro gets the latch value. This is based on the pre-existing `GET_DEFERRED_CONDITION` macro in sr_port/have_crit.h. All places that need to check the value of the `stapi_signal_handler_deferred` global variable now use this new macro. the pre-existing `GET_DEFERRED_CONDITION` macro has also been changed to use this new macro. * Unnecessary `GBLREF`s of `stapi_signal_handler_deferred` in a few files have been removed. * Note that these changes are done based on a theory of what could have caused the failure. It is not yet known for sure if that is indeed the cause as the failure happens very rarely (it is not reproducible on demand even after running dozens of tests on lots of systems). Only time will tell if this fix is correct. Misc changes ------------ * Noticed a longstanding typo in `sr_unix/ydb_stm_invoke_deferred_signal_handler.c` and fixed it. Not yet sure though if there is a user visible issue due to this.
nars1
added a commit
that referenced
this issue
Mar 29, 2024
* This commit addresses merge conflicts involving deleted files. * The list of deleted files was identified using the following commands as part of the prior commit (i.e. after the `git cherry-pick` command was run but before the `git commit` command was run). ```sh $ git status | grep 'deleted' deleted: sr_i386/ttt.c deleted by us: README deleted by them: sr_i386/GTMDefinedTypesInitDebug.m deleted by them: sr_i386/GTMDefinedTypesInitRelease.m deleted by them: sr_port/cmerrors_ctl.c deleted by them: sr_port/cmierrors_ctl.c deleted by them: sr_port/gdeerrors_ctl.c deleted by them: sr_port/merrors_ansi.h deleted by them: sr_port/merrors_ctl.c deleted by them: sr_port/underr_strict.c deleted by us: sr_unix/gtm_tls_impl.c deleted by us: sr_unix/gtmcrypt_dbk_ref.c deleted by us: sr_unix/gtmcrypt_util.c deleted by us: sr_unix/gtmcrypt_util.h deleted by us: sr_x86_64/merrors_ansi.h deleted by us: sr_x86_64/merrors_ctl.c ``` * All files that show up as `deleted` are deleted by both GT.M and YottaDB and so we don't need to do anything for them in this commit. The list of such files is - sr_i386/ttt.c * All files that show up as `deleted by us:` are deleted in this commit since those were deleted even before in the YDB git repository. That is a straightforward change. But each of these files needed to be reviewed to see if the GT.M changes to each file needs to be incorporated somewhere else in the YDB git repository. They are described below. - README : This file only has cosmetic changes in every GT.M release. No need to incorporate this into the README.md file (corresponding file in the YDB repository). - The following are all generated files that have been moved to `sr_port` in the YottaDB side whereas they stay in `sr_x86_64` in the GT.M side. They were automatically regenerated on the YDB side in this commit. - sr_x86_64/merrors_ansi.h - sr_x86_64/merrors_ctl.c - sr_unix/gtm_tls_impl.c : This file had been moved to YDBEncrypt/gtm_tls_impl.c (a different git repository) so I had to use the following commands to incorporate the changes. The following command (run in the YDB git repo) creates a patch file containing the GT.M V7.0-002 changes. ```sh cd YDB git show tags/V7.0-003 -- sr_unix/gtm_tls_impl.c > /tmp/patch ``` The following command (run in the YDBEncrypt git repo) applies the patch to the YDBEncrypt repo file. ```sh cd YDBEncrypt patch -i /tmp/patch gtm_tls_impl.c ``` The patch command produced the following output - Hunk #10 FAILED at 656. The GT.M changes in this Hunk were - Change a `GC_APPEND_OPENSSL_ERROR` call to `SET_AND_APPEND_OPENSSL_ERROR`. This was incorporated into the YDBEncrypt side. - Remove 2 lines that were needed until TLSv1.3 ciphers were handled. This was a `#ifdef GTM_ONLY` section in the YottaDB code and that was removed now that GT.M also supports TLSv1.3 like YottaDB did from a while ago. - Hunk #11 FAILED at 669. The GT.M changes in this Hunk were - To replace `UPDATE_ERROR_STRING(...)` calls with `gtm_tls_set_error(NULL, ...)`. This was incorporated into the YDBEncrypt side. - Hunk #47 FAILED at 1531. The GT.M changes in this Hunk were - To add a `socket->errstr = NULL;`. This was incorporated into the YDBEncrypt side. - Hunk #48 FAILED at 1584. The GT.M changes in this Hunk were - In the `gtm_tls_renegotiate()` function to add `TLS1_3_VERSION` support. This was already added in the YDBEncrypt side a while ago. I compared the code flow in both cases and they seemed identical except for `0x10101000L` vs `0x1010100fL` usages in the GT.M vs YDBEncrypt side for the `OPENSSL_VERSION_NUMBER` macro. I did not think this was a big issue so kept the YDBEncrypt side as is and discarded the GT.M side of this change. And the following files were then removed from YDBEncrypt - gtm_tls_impl.c.rej - gtm_tls_impl.c.orig - sr_unix/gtmcrypt_dbk_ref.c : This file had been moved to YDBEncrypt/gtmcrypt_dbk_ref.c (a different git repository) so I used similar commands as `gtm_tls_impl.c` in the previous bullet to do the patching. In this case, there were no `Hunk FAILED` messages so the automatic patching suceeded. And the following files were then removed from YDBEncrypt - gtmcrypt_dbk_ref.c.orig - sr_unix/gtmcrypt_util.c : This file had been moved to YDBEncrypt/gtmcrypt_util.c (a different git repository) so I used similar commands as `gtm_tls_impl.c` in the previous bullet to do the patching. In this case, there were no `Hunk FAILED` messages so the automatic patching suceeded. And the following files were then removed from YDBEncrypt - gtmcrypt_util.c.orig - sr_unix/gtmcrypt_util.h : This file had been moved to YDBEncrypt/gtmcrypt_util.h (a different git repository) so I used similar commands as `gtm_tls_impl.c` in the previous bullet to do the patching. In this case, there were no `Hunk FAILED` messages so the automatic patching suceeded. And the following files were then removed from YDBEncrypt - gtmcrypt_util.h.orig * All files that show up as `deleted by them:` are deleted in this commit since those were deleted in the GT.M repository. In addition, they were examined to see if any GT.M changes need to be retrofitted into some other module in the YottaDB side. - The following are files that are no longer maintained in the YottaDB side and so they are removed as part of this commit. - sr_i386/GTMDefinedTypesInitDebug.m - sr_i386/GTMDefinedTypesInitRelease.m - The following are all generated files. Out of the below 5, only the last 2 had material changes so just those two were automatically regenerated on the YDB side in this commit. - sr_port/cmerrors_ctl.c - sr_port/cmierrors_ctl.c - sr_port/gdeerrors_ctl.c - sr_port/merrors_ansi.h - sr_port/merrors_ctl.c - sr_port/underr_strict.c : The `underr_strict()` function was nixed and instead a `underr_overwrite()` function was introduced in `sr_port/underr_overwrite.c`. So all existing usages of `underr_strict()` were replaced with `underr_overwrite()`. This includes the following files - sr_x86_64/mval_def.si : In this file, there was a merge conflict. The GT.M change was to just change the function that is invoked from the old one to the new one. So that change was incorporated into the YottaDB side and the GT.M side of the conflict was removed. - sr_aarch64/mval_def.si : The new function name was used instead of the old one. - sr_armv7l/mval_def.si : The new function name was used instead of the old one.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Final Release Note
The OFRZAUTOREL message includes the full path of the database file. Previously it included the region name, which can be ambiguous. (YDB#47)
Description
The OFRZAUTOREL message is sent to the syslog whenever an online freeze is autoreleased. This message currently only displays the region name. In a system where there are multiple instances of YottaDB running, each with potentially the same set of region names, having just the region name is not useful in identifying exactly which database file had its online freeze released. It would be desirable for the OFRZAUTOREL message to log the full path of the database file name instead of the region name.
Draft Release Note
The OFRZAUTOREL message displays the full path of the database file name instead of a potentially-ambiguous region name.
The text was updated successfully, but these errors were encountered: