Skip to content

Commit

Permalink
Fix rare timing issue in MUPIP INTEG -ONLINE which can cause close() …
Browse files Browse the repository at this point in the history
…on a garbage file descriptor

Background
----------
* This is a very rare test failure that was seen only once and on a slow ARM in-house box in
  internal testing.

* The `stress/concurr` subtest failed with the following diff.

  ```diff
  --- concurr/concurr.diff ---
  69a70,181
  > host:REMOTE_SIDE:stress_1/concurr/stress_oli.out
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE)
  > %YDB-F-ASSERT, Assert failed in sr_unix/gtm_fd_trace.c line 185 for expression (FALSE)
  > %YDB-E-NOTALLDBRNDWN, Not all regions were successfully rundown
  ```

* The assert failure created a core file with the following stack trace

  ```c
   #6 gtm_close (fd=1626061471) at sr_unix/gtm_fd_trace.c:185
   #7 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192
   #8 gds_rundown (cleanup_udi=1) at sr_unix/gds_rundown.c:501
   #9 gv_rundown () at sr_port/gv_rundown.c:122
  #10 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144
  #11 __run_exit_handlers (status=150374524, listp=0xffff9c805680 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
  #12 __GI_exit (status=<optimized out>) at exit.c:139
  #13 gtm_image_exit (status=150374524) at sr_unix/gtm_image_exit.c:27
  #14 util_base_ch (arg=150374524) at sr_port/util_base_ch.c:124
  #15 mu_int_ch (arg=150374524) at sr_unix/mu_int_ch.c:35
  #16 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192
  #17 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99
  #18 gtm_close (fd=-559038737) at sr_unix/gtm_fd_trace.c:185
  #19 ss_destroy_context (lcl_ss_ctx=0xaaaaffca1980) at sr_unix/ss_context_mgr.c:192
  #20 jnl_file_close_timer () at sr_unix/jnl_file_close_timer.c:74
  #21 timer_handler (why=0, info=0xffff9c65df68 <stapi_signal_handler_oscontext+47048>, context=0xffff9c65dff0 <stapi_signal_handler_oscontext+47184>, is_os_signal_handler=0) at sr_unix/gt_timers.c:889
  #22 check_for_deferred_timers () at sr_unix/gt_timers.c:1267
  #23 deferred_signal_handler () at sr_port/deferred_signal_handler.c:78
  #24 gtm_free (addr=0xaaaaffca1980) at sr_port/gtm_malloc_src.h:1038
  #25 ss_release (ss_ctx=0xaaaaffc78910) at sr_unix/ss_release.c:226
  #26 mupip_integ () at sr_port/mupip_integ.c:801
  #27 mupip_main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip_main.c:122
  #28 dlopen_libyottadb (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980, main_func=0xaaaae7dd6648 "mupip_main") at sr_unix/dlopen_libyottadb.c:151
  #29 main (argc=6, argv=0xffffcd1e7948, envp=0xffffcd1e7980) at sr_unix/mupip.c:22
  ```

Issue
-----
* Frame 18 in the stack trace above indicates a `gtm_close()` call happening with an fd of `-559038737`.

* Frame 6 in the stack trace above indicates a `gtm_close()` call happening with an fd of `fd=1626061471`.

* The real issue is in Frame 26 in the stack trace above where we call `ss_release()`. The relevant code
  is pasted below.

  **sr_port/mupip_integ.c**
  ```c
       799     assert(SNAPSHOTS_IN_PROG(csa));
       800     assert(NULL != csa->ss_ctx);
       801     ss_release(&csa->ss_ctx);
       802     CLEAR_SNAPSHOTS_IN_PROG(csa);
  ```

* Line 801 does the `ss_release()` call and Line 802 clears the flag in `csa` that records that a snapshot
  is in progress.

* But `ss_release()` first calls `ss_context_destroy()` and then calls `free()` so it is possible that a
  timer interrupt gets handled in a deferred fashion right after the `free()` but before the
  `CLEAR_SNAPSHOTS_IN_PROG` macro gets executed. This means we would invoke `ss_destroy_context()` on the
  `csa->ss_ctx` structure again inside the timer. And that would be looking at an already freed context
  structure. Which can then explain why garbage values of `fd` got used in the `gtm_close()` calls.

Fix
---
* The fix is in `sr_port/mupip_integ.c` to clear all context in global variables that indicate a snapshot
  is in progress BEFORE calling `ss_release()`.

* Additionally, the following files were changed since the warning text from `clang-tidy` changed a bit.
  While at it, I also verified that this warning is a false alarm.
  - ci/tidy_warnings_debug.ref
  - ci/tidy_warnings_release.ref
  • Loading branch information
nars1 committed Jan 31, 2022
1 parent 652c6b1 commit 232dd1e
Show file tree
Hide file tree
Showing 3 changed files with 15 additions and 5 deletions.
2 changes: 1 addition & 1 deletion ci/tidy_warnings_debug.ref
Original file line number Diff line number Diff line change
Expand Up @@ -279,7 +279,7 @@ mupip_backup.c:warning: Value stored to 'save_errno' is never read [clang-analyz
mupip_backup.c:warning: Value stored to 'tn_specified' is never read [clang-analyzer-deadcode.DeadStores]
mupip_backup.c:warning: Value stored to 'tn_specified' is never read [clang-analyzer-deadcode.DeadStores]
mupip_extend.c:warning: Value stored to 'blocks' is never read [clang-analyzer-deadcode.DeadStores]
mupip_integ.c:warning: Access to field 'snapshot_in_prog' results in a dereference of a null pointer (loaded from variable 'csa') [clang-analyzer-core.NullDereference]
mupip_integ.c:warning: Access to field 'ss_ctx' results in a dereference of a null pointer (loaded from variable 'csa') [clang-analyzer-core.NullDereference]
mupip_integ.c:warning: Assigned value is garbage or undefined [clang-analyzer-core.uninitialized.Assign]
mupip_integ.c:warning: Dereference of undefined pointer value [clang-analyzer-core.NullDereference]
mupip_integ.c:warning: Dereference of undefined pointer value [clang-analyzer-core.NullDereference]
Expand Down
2 changes: 1 addition & 1 deletion ci/tidy_warnings_release.ref
Original file line number Diff line number Diff line change
Expand Up @@ -413,7 +413,7 @@ mupip_backup.c:warning: Value stored to 'tn_specified' is never read [clang-anal
mupip_backup.c:warning: Value stored to 'tn_specified' is never read [clang-analyzer-deadcode.DeadStores]
mupip_extend.c:warning: Although the value stored to 'status' is used in the enclosing expression, the value is never actually read from 'status' [clang-analyzer-deadcode.DeadStores]
mupip_extend.c:warning: Value stored to 'blocks' is never read [clang-analyzer-deadcode.DeadStores]
mupip_integ.c:warning: Access to field 'snapshot_in_prog' results in a dereference of a null pointer (loaded from variable 'csa') [clang-analyzer-core.NullDereference]
mupip_integ.c:warning: Access to field 'ss_ctx' results in a dereference of a null pointer (loaded from variable 'csa') [clang-analyzer-core.NullDereference]
mupip_integ.c:warning: Assigned value is garbage or undefined [clang-analyzer-core.uninitialized.Assign]
mupip_integ.c:warning: Dereference of undefined pointer value [clang-analyzer-core.NullDereference]
mupip_integ.c:warning: Dereference of undefined pointer value [clang-analyzer-core.NullDereference]
Expand Down
16 changes: 13 additions & 3 deletions sr_port/mupip_integ.c
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
* Copyright (c) 2001-2019 Fidelity National Information *
* Services, Inc. and/or its subsidiaries. All rights reserved. *
* *
* Copyright (c) 2018-2021 YottaDB LLC and/or its subsidiaries. *
* Copyright (c) 2018-2022 YottaDB LLC and/or its subsidiaries. *
* All rights reserved. *
* *
* This source code contains the intellectual property *
Expand Down Expand Up @@ -796,10 +796,20 @@ void mupip_integ(void)
region_freeze(gv_cur_region, FALSE, FALSE, FALSE, FALSE, FALSE);
} else
{
snapshot_context_ptr_t ss_ctx;

/* ss_release() does calls to ss_destroy_context() and free() and can be interrupted by timers
* which can potentially invoke ss_destroy_context() on an already freed context structure
* resulting in a [heap-use-after-free] error. So clear everything in global variables that
* indicates a snapshot is in progress BEFORE the ss_release() call and then do the ss_release()
* call using a local copy of the context.
*/
assert(SNAPSHOTS_IN_PROG(csa));
assert(NULL != csa->ss_ctx);
ss_release(&csa->ss_ctx);
ss_ctx = csa->ss_ctx;
assert(NULL != ss_ctx);
csa->ss_ctx = NULL;
CLEAR_SNAPSHOTS_IN_PROG(csa);
ss_release(&ss_ctx);
}
rptr = rptr->fPtr;
if (NULL == rptr)
Expand Down

0 comments on commit 232dd1e

Please sign in to comment.