Skip to content

Commit

Permalink
Handle edge case in jnlpool mutex_salvage with kill -9 of online roll…
Browse files Browse the repository at this point in the history
…back process

This is an issue identified based on a rare ideminter_rolrec/interrupted_rollback_or_recover subtest
failure. The test ran 4 rollbacks and killed all but the last one using kill -9 before they finished.
The last rollback failed an assert.

%YDB-F-ASSERT, Assert failed in sr_unix/mutex.c line 1045 for expression (((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno)

Below is the C-stack

(gdb) where
 #0  kill () at ../sysdeps/unix/syscall-template.S:84
 #1  gtm_dump_core () at sr_unix/gtm_dump_core.c:69
 #2  gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:148
 #3  ch_cond_core () at sr_unix/ch_cond_core.c:64
 #4  rts_error_va (csa=0x0, argcnt=7, var=0x7fff2eebdeb0) at sr_unix/rts_error.c:159
 #5  rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:92
 #6  mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045
 #7  gtm_mutex_lock (reg=0x106f960, mutex_spin_parms=0x7f7690440080, crash_count=0, mutex_lock_type=MUTEX_LOCK_WRITE) at sr_unix/mutex.c:703
 #8  grab_lock (reg=0x106f960, is_blocking_wait=1, onln_rlbk_action=1) at sr_unix/grab_lock.c:83
 #9  mur_open_files () at sr_port/mur_open_files.c:492
 #10 mupip_recover () at sr_port/mupip_recover.c:195
 #11 mupip_main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip_main.c:124
 #12 dlopen_libyottadb (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688, main_func=0x401424 "mupip_main") at sr_unix/dlopen_libyottadb.c:148
 #13 main (argc=9, argv=0x7fff2eec9638, envp=0x7fff2eec9688) at sr_unix/mupip.c:19

(gdb) f 6
 #6  0x00007f76931b75bd in mutex_salvage (reg=0x106f960) at sr_unix/mutex.c:1045
1045            assert(((lastJplCmt->jnl_seqno + 1) == jpl->jnl_seqno) || !lastJplCmt->jnl_seqno);

(gdb) p lastJplCmt->jnl_seqno
$1 = 323033

(gdb) p jpl->jnl_seqno
$2 = 293120

We were expecting the two seqnos to be 1 apart but they are way apart.

This is because had killed a prior rollback (the first rollback) just before it had finished the rollback.
Below is its log.

> cat ROLLBACK2_1.logx
.
.
%YDB-I-RLBKJNSEQ, Journal seqno of the instance after rollback is 293120 [0x0000000000047900]
.
%YDB-I-FILERENAME, File ideminter_rolrec_0_7/interrupted_rollback_or_recover/g.mjl_2018284231355 is renamed to ideminter_rolrec_0_7/interrupted_rollback_or_recover/rolled_bak_g.mjl_2018284231355
Killed

The fact that it printed the RLBKJNSEQ and FILERENAME messages implies it was in mur_close_files()
when it was killed.

There is code in mur_close_files() where we reset various fields in "jpl" to reset the state of
the journal pool based on the post-rollback instance seqno. This code also needs to clear a few
2-phase-jnl-commit related fields so "mutex_salvage" when it comes in later (in this test, it came in
as part of the later rollback) after the kill -9 does not fail the above assert.

This failure can be easily reproduced by running an online rollback with the -resync qualifier to take back
the state of the instance to a prior seqno, setting a break point in "rel_lock()" and quitting from the
debugger once that break point is hit (this simulates kill -9 of online rollback). Reissuing the same online
rollback out of the debugger should show the assert failure. Reissuing the online rollback with a production
build did not show any issues so the suspicion is that this is a dbg-only issue hence no tracking is done
as a separate issue at gitlab.
  • Loading branch information
nars1 committed Oct 12, 2018
1 parent fe9b776 commit 45344b0
Showing 1 changed file with 28 additions and 15 deletions.
43 changes: 28 additions & 15 deletions sr_port/mur_close_files.c
Original file line number Diff line number Diff line change
Expand Up @@ -151,6 +151,9 @@ boolean_t mur_close_files(void)
jnl_private_control *jpc;
jnl_buffer_ptr_t jb;
uint4 jnl_status;
jnlpool_ctl_ptr_t jpl;
jpl_phase2_in_prog_t *lastJplCmt;
int index2;
# ifdef DEBUG
int semval;
# endif
Expand Down Expand Up @@ -677,7 +680,8 @@ boolean_t mur_close_files(void)
* (a) ONLINE ROLLBACK
* (b) Regular ROLLBACK with Anticipatory Freeze scheme
*/
if (jnlpool->jnlpool_ctl && !(was_crit = csa->now_crit)) /* note: assignment */
jpl = jnlpool->jnlpool_ctl;
if ((NULL != jpl) && !(was_crit = csa->now_crit)) /* note: assignment */
{
assert(!jgbl.onlnrlbk);
assert(NULL != jnlpool);
Expand All @@ -687,32 +691,41 @@ boolean_t mur_close_files(void)
grab_lock(jnlpool->jnlpool_dummy_reg, TRUE, ASSERT_NO_ONLINE_ROLLBACK);
}
last_histinfo_seqno = repl_inst_histinfo_truncate(murgbl.consist_jnl_seqno);
if ((NULL != jnlpool->jnlpool_ctl) && !was_crit)
if ((NULL != jpl) && !was_crit)
rel_lock(jnlpool->jnlpool_dummy_reg);
/* The above also updates "repl_inst_filehdr->jnl_seqno". If regular rollback, it also
* updates "repl_inst_filehdr->crash" to FALSE. For online rollback, we have to update
* the crash field ONLY if there is NO journal pool and that is done below.
*/
if ((NULL != jnlpool->jnlpool_ctl) && jgbl.onlnrlbk)
if ((NULL != jpl) && jgbl.onlnrlbk)
{ /* journal pool still exists and some backward and forward processing happened. More
* importantly, the database was taken to a prior logical state. Refresh the journal
* pool fields to reflect the new state.
*/
assert(csa->now_crit && csa->hold_onto_crit);
jnlpool->jnlpool_ctl->last_histinfo_seqno = last_histinfo_seqno;
jnlpool->jnlpool_ctl->jnl_seqno = murgbl.consist_jnl_seqno;
jnlpool->jnlpool_ctl->start_jnl_seqno = murgbl.consist_jnl_seqno;
jnlpool->jnlpool_ctl->rsrv_write_addr = jnlpool->jnlpool_ctl->write_addr = 0;
jnlpool->jnlpool_ctl->rsrv_write_addr = 0;
assert(jnlpool->jnlpool_ctl->phase2_commit_index1
== jnlpool->jnlpool_ctl->phase2_commit_index2);
jnlpool->jnlpool_ctl->lastwrite_len = 0;
jnlpool->jnlpool_ctl->max_zqgblmod_seqno = max_zqgblmod_seqno;
jpl->last_histinfo_seqno = last_histinfo_seqno;
jpl->jnl_seqno = murgbl.consist_jnl_seqno;
jpl->start_jnl_seqno = murgbl.consist_jnl_seqno;
jpl->rsrv_write_addr = jpl->write_addr = 0;
jpl->lastwrite_len = 0;
jpl->max_zqgblmod_seqno = max_zqgblmod_seqno;
assert(jpl->phase2_commit_index1 == jpl->phase2_commit_index2);
/* The below few zero-initializations of lastJplCmt are needed by "mutex_salvage"
* (see "CMT02 < killed <= CMT03" comment there) to ensure that if we get killed
* before the "rel_lock" later below, "mutex_salvage" will do a no-op because of
* the 0 values in lastJplCmt.
*/
index2 = jpl->phase2_commit_index2;
DECR_PHASE2_COMMIT_INDEX(index2, JPL_PHASE2_COMMIT_ARRAY_SIZE);
lastJplCmt = &jpl->phase2_commit_array[index2];
lastJplCmt->jnl_seqno = 0;
lastJplCmt->start_write_addr = 0;
lastJplCmt->tot_jrec_len = 0;
/* Keep strm_seqno in journal pool in sync with the one in instance file header */
assert(SIZEOF(jnlpool->jnlpool_ctl->strm_seqno) == SIZEOF(inst_hdr->strm_seqno));
memcpy(jnlpool->jnlpool_ctl->strm_seqno, inst_hdr->strm_seqno,
assert(SIZEOF(jpl->strm_seqno) == SIZEOF(inst_hdr->strm_seqno));
memcpy(jpl->strm_seqno, inst_hdr->strm_seqno,
MAX_SUPPL_STRMS * SIZEOF(seq_num));
if (!jnlpool->jnlpool_ctl->upd_disabled)
if (!jpl->upd_disabled)
{ /* Simulate a fresh instance startup by writing a new history record with
* the rollback'ed sequence number. This is required as otherwise the source
* server startup will NOT realize that receiver server needs to rollback or
Expand Down

0 comments on commit 45344b0

Please sign in to comment.