Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fix rare timing issue in MUPIP INTEG -ONLINE which can cause close() …
…on a garbage file descriptor Background ---------- * This is a very rare test failure that was seen only once and on a slow ARM in-house box in internal testing. * The `stress/concurr` subtest failed with the following diff. ```diff --- concurr/concurr.diff --- 69a70,181 > host:REMOTE_SIDE:stress_1/concurr/stress_oli.out > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE) > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE) > %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown ``` * The assert failure created a core file with the following stack trace ```c #6 gtm_close (fd=1626061471) at sr_unix/gtm_fd_trace.c:185 #7 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192 #8 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:501 #9 gv_rundown () at sr_port/gv_rundown.c:122 #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144 #11 __run_exit_handlers (status=150374524, listp=0xffff9c805680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108 #12 __GI_exit (status=<optimized out>) at exit.c:139 #13 gtm_image_exit (status=150374524) at sr_unix/gtm_image_exit.c:27 #14 util_base_ch (arg=150374524) at sr_port/util_base_ch.c:124 #15 mu_int_ch (arg=150374524) at sr_unix/mu_int_ch.c:35 #16 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192 #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #18 gtm_close (fd=-559038737) at sr_unix/gtm_fd_trace.c:185 #19 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192 #20 jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:74 #21 timer_handler (why=0, info=0xffff9c65df68 <stapi_signal_handler_oscontext+47048>, context=0xffff9c65dff0 <stapi_signal_handler_oscontext+47184>, is_os_signal_handler=0) at sr_unix/gt_timers.c:889 #22 check_for_deferred_timers () at sr_unix/gt_timers.c:1267 #23 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78 #24 gtm_free (addr=0xaaaaffca1980) at sr_port/gtm_malloc_src.h:1038 #25 ss_release (ss_ctx=0xaaaaffc78910) at sr_unix/ss_release.c:226 #26 mupip_integ () at sr_port/mupip_integ.c:801 #27 mupip_main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip_main.c:122 #28 dlopen_libyottadb (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980, main_func=0xaaaae7dd6648 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #29 main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip.c:22 ``` Issue ----- * Frame 18 in the stack trace above indicates a `gtm_close()` call happening with an fd of `-559038737`. * Frame 6 in the stack trace above indicates a `gtm_close()` call happening with an fd of `fd=1626061471`. * The real issue is in Frame 26 in the stack trace above where we call `ss_release()`. The relevant code is pasted below. **sr_port/mupip_integ.c** ```c 799 assert(SNAPSHOTS_IN_PROG(csa)); 800 assert(NULL != csa->ss_ctx); 801 ss_release(&csa->ss_ctx); 802 CLEAR_SNAPSHOTS_IN_PROG(csa); ``` * Line 801 does the `ss_release()` call and Line 802 clears the flag in `csa` that records that a snapshot is in progress. * But `ss_release()` first calls `ss_context_destroy()` and then calls `free()` so it is possible that a timer interrupt gets handled in a deferred fashion right after the `free()` but before the `CLEAR_SNAPSHOTS_IN_PROG` macro gets executed. This means we would invoke `ss_destroy_context()` on the `csa->ss_ctx` structure again inside the timer. And that would be looking at an already freed context structure. Which can then explain why garbage values of `fd` got used in the `gtm_close()` calls. Fix --- * The fix is in `sr_port/mupip_integ.c` to clear all context in global variables that indicate a snapshot is in progress BEFORE calling `ss_release()`. * Additionally, the following files were changed since the warning text from `clang-tidy` changed a bit. While at it, I also verified that this warning is a false alarm. - ci/tidy_warnings_debug.ref - ci/tidy_warnings_release.ref
- Loading branch information