Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Handle edge case in jnlpool mutex_salvage with kill -9 of online roll…
…back process This is an issue identified based on a rare ideminter_rolrec/interrupted_rollback_or_recover subtest failure. The test ran 4 rollbacks and killed all but the last one using kill -9 before they finished. The last rollback failed an assert. %YDB-F-ASSERT, Assert failed in sr_unix/mutex.c line 1045 for expression (((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno) Below is the C-stack (gdb) where #0 kill () at ../sysdeps/unix/syscall-template.S:84 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:69 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148 #3 ch_cond_core () at sr_unix/ch_cond_core.c:64 #4 rts_error_va (csa=0x0, argcnt=7, var=0x7fff2eebdeb0) at sr_unix/rts_error.c:159 #5 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:92 #6 mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045 #7 gtm_mutex_lock (reg=0x106f960, mutex_spin_parms=0x7f7690440080, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE) at sr_unix/mutex.c:703 #8 grab_lock (reg=0x106f960, is_blocking_wait=1, onln_rlbk_action=1) at sr_unix/grab_lock.c:83 #9 mur_open_files () at sr_port/mur_open_files.c:492 #10 mupip_recover () at sr_port/mupip_recover.c:195 #11 mupip_main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip_main.c:124 #12 dlopen_libyottadb (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688, main_func=0x401424 "mupip_main") at sr_unix/dlopen_libyottadb.c:148 #13 main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip.c:19 (gdb) f 6 #6 0x00007f76931b75bd in mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045 1045 assert(((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno); (gdb) p lastJplCmt->jnl_seqno $1 = 323033 (gdb) p jpl->jnl_seqno $2 = 293120 We were expecting the two seqnos to be 1 apart but they are way apart. This is because had killed a prior rollback (the first rollback) just before it had finished the rollback. Below is its log. > cat ROLLBACK2_1.logx . . %YDB-I-RLBKJNSEQ, Journal seqno of the instance after rollback is 293120 [0x0000000000047900] . %YDB-I-FILERENAME, File ideminter_rolrec_0_7/interrupted_rollback_or_recover/g.mjl_2018284231355 is renamed to ideminter_rolrec_0_7/interrupted_rollback_or_recover/rolled_bak_g.mjl_2018284231355 Killed The fact that it printed the RLBKJNSEQ and FILERENAME messages implies it was in mur_close_files() when it was killed. There is code in mur_close_files() where we reset various fields in "jpl" to reset the state of the journal pool based on the post-rollback instance seqno. This code also needs to clear a few 2-phase-jnl-commit related fields so "mutex_salvage" when it comes in later (in this test, it came in as part of the later rollback) after the kill -9 does not fail the above assert. This failure can be easily reproduced by running an online rollback with the -resync qualifier to take back the state of the instance to a prior seqno, setting a break point in "rel_lock()" and quitting from the debugger once that break point is hit (this simulates kill -9 of online rollback). Reissuing the same online rollback out of the debugger should show the assert failure. Reissuing the online rollback with a production build did not show any issues so the suspicion is that this is a dbg-only issue hence no tracking is done as a separate issue at gitlab.
- Loading branch information