Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[#722] MUPIP REPLIC -SOURCE -TRIGUPDATE replicates updates inside tri…
…ggers too Background ---------- * Below is pasted from https://gitlab.com/YottaDB/DB/YDB/-/issues/722#description _Database updates made by triggers are [not propagated by the replication stream](https://docs.yottadb.com/ProgrammersGuide/triggers.html#multisite-database-replication) because the design point was that database updates from triggers are derivable from the primary update. However, this means that times (and other data such as process ids) are not replicated and hence not easily recreated. The proposed enhancement adds an option when starting a Source Server to include trigger updates in the replication stream._ Core changes ------------ * `sr_unix/mupip_cmd.c` has a new `TRIGUPDATE` option in the `gtmsource_qual[]` array. * `-trigupdate` is allowed only if `-secondary` is also specified. This limits the possible commands that can specify `-trigupdate` to `mupip replic -source -start` or `mupip replic -source -activate` when they also specify `-secondary=...`. Towards this, `sr_unix/mupip_cmd_disallow.c` is modified to disallow `TRIGUPDATE` if `SECONDARY` is not also specified. * An active source server startup (`mupip replic -source -start -secondary=...`) or a source server activation (`mupip replic -source -activate -secondary=...`) now supports an optional `-trigupdate` option which when specified implies that database updates made inside trigger invocations are also included in the replication stream. * The fact that a `-trigupdate` option was specified is noted down in the boolean valued variable `gtmsource_options.trigupdate` (set to FALSE if `-trigupdate` was not specified and TRUE if it was). For a `mupip replic -source -start` command, the source server that eventually starts is forked off the process that specifies the source server startup command and so `gtmsource_options.trigupdate` is inherited implicitly and is therefore usable. But for a `mupip replic -source -activate` command, the process that specifies this command is different from the concurrently running source server and so `gtmsource_options.trigupdate` is only usable in the activate process and not transferred to the running source server. It is therefore necessary to copy over this user specified option from the activate command to the corresponding source server specific structure in the journal pool (`gtmsource_local` structure which is in shared memory). This copy happens in `sr_unix/jnlpool_init.c` (for a `mupip replic -source -start`) and in `sr_unix/gtmsource_mode_change.c` (for a `mupip replic -source -activate` command). And because of the above, the replication filter can safely use `jnlpool->gtmsource_local->trigupdate` when it needs to know whether `-trigupdate` was specified if the caller is a source server. * The replication filter functions `jnl_v44TOv44()` and `jnl_v44TOv24()` were modified in `sr_port/repl_filter.c` to check if `-trigupdate` was specified (checked using `gtmsource_local->trigupdate`). And if so, they replicate updates that happen inside triggers (those which have `JS_NOT_REPLICATED_MASK` bit set in the `nodeflags` member). The `nodeflags` member is modified to clear the `JS_NOT_REPLICATED_MASK` bit in such records. Additionally they do not replicate any LGTRIG (trigger definitions) or ZTRIG (ZTRIGGER command) or $ZTWORMHOLE jnl records. Both these filter functions had code that previously issued an error if no conversion occurred as it was an out-of-design scenario. That code has now been replaced with code that generates a NULL record since in this case the entire transaction consists of journal records that are not replicated. This NULL record logic was copied over from the filter function `jnl_v44TOv22()` which was otherwise unchanged because that function is only used in case the receiver side is pre-V6.2-000 in which case it does not support LGTRIG records (issues a `%YDB-E-REPLNOHASHTREC` error). `jrec_null.bitmask.filler` is also initialized in `INITIALIZE_V44_NULL_RECORD` now that this macro is used for `jnl_v44TOv44()` filter conversion too (not just for `jnl_v44TOv22()` conversion like previously). Misc changes ------------ * Since the `gtmsource_local_struct` structure has a new `trigupdate` member, `repl_inst_dump_gtmsourcelocal()` in `sr_unix/repl_inst_dump.c` was modified to dump this new field. * Initialization of default values for `CONNECT_PARMS` was duplicated in `sr_unix/gtmsource_get_opt.c` in the `if` and `else` blocks. That is now moved to before the `if` thereby avoiding the duplication. * A stale comment in `sr_port/jnl.h` was fixed (`align_str` was no longer in use like `ztworm_str` or `lgtrig_str` was). * Regenerated GTMDefinedTypesInit*.m for sr_x86_64,sr_aarch64,sr_armv6l due to `gtmsource_local_struct` and `gtmsource_options_t` structure layout/size changes. * Fixed a pre-existing issue with `gv_target` maintenance in `sr_port/gvcst_jrt_null.c`. Some background first. After enhancing the YDBTest test framework to test with `-trigupdate`, I noticed a rare test failure with the following symptom. This issue showed up only now due to more NULL records possible with the use of `-trigupdate`. ```sh $ cat ##REMOTE_PATH##/online_rollback_1_10/trestartrootverify/instance2/RCVR_22_01_55_4.log.updproc . . Thu Feb 16 22:02:29 2023 : ----> TPRETRY for sequence number 11252 [0x2bf4] Thu Feb 16 22:02:29 2023 : Jnl seq no : 11252 [0x2bf4];Rectype : 17 - TCOM %YDB-F-ASSERT, Assert failed in sr_port/gvcst_root_search.c line 210 for expression (cs_addrs == gv_target->gd_csa) %YDB-I-IPCNOTDEL, Thu Feb 16 22:02:29 2023 : Update process did not delete IPC resources for region AREG ``` And below was the debugger analysis. ```c (gdb) where #8 gvcst_redo_root_search () at sr_port/gvcst_root_search.c:210 #9 t_retry (failure=cdb_sc_gvtrootmod2) at sr_port/t_retry.c:558 #10 t_end (hist1=0x0, hist2=0x0, ctn=18446744071629176832) at sr_port/t_end.c:1896 #11 gvcst_jrt_null (salvaged=0) at sr_port/gvcst_jrt_null.c:72 #12 updproc_actions (gld_db_files=0x62d000001ac0) at sr_port/updproc.c:1046 #13 updproc () at sr_port/updproc.c:502 #14 mupip_main (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638) at sr_unix/mupip_main.c:122 #15 dlopen_libyottadb (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638, main_func=0x563cba3c1020 "mupip_main") at sr_unix/dlopen_libyottadb.c:151 #16 main (argc=3, argv=0x7ffc86786618, envp=0x7ffc86786638) at sr_unix/mupip.c:22 (gdb) f 8 #8 gvcst_redo_root_search () at sr_port/gvcst_root_search.c:210 210 assert(cs_addrs == gv_target->gd_csa); (gdb) p cs_addrs $1 = (sgmnt_addrs *) 0x62d000026040 (gdb) p gv_target->gd_csa $2 = (sgmnt_addrs *) 0x62d00001e840 (gdb) p cs_addrs->sgm_info_ptr->gv_cur_region->rname $4 = "DREG", '\000' <repeats 27 times> (gdb) p $2->sgm_info_ptr->gv_cur_region->rname $5 = "DEFAULT", '\000' <repeats 24 times> (gdb) p gv_target->gvname $6 = {var_name = {char_len = 0, len = 31, addr = 0x62d000057f10 "jrandomvariableinimptpfillprogr"}, hash_code = 1897224776, marked = 0} ``` ```m YDB>write $view("region","^jrandomvariableinimptpfillprogr") DEFAULT ``` So the assert failure is because `gv_target` pointed to a global variable name that mapped to the `DEFAULT` region whereas the current region were writing the NULL record was for `DREG`. The cause of the assert was the call to `gvcst_redo_root_search()` in line 558. **sr_port/t_retry.c** ```c 545 case cdb_sc_gvtrootmod2: 546 if (!redo_root_search_done) 547 RESET_ALL_GVT_CLUES; 548 /* It is possible for a read-only transaction to release crit after detecting gvtrootmod2, during 549 * which time yet another root block could have moved. In that case, the MISMATCH_ROOT_CYCLES check 550 * would have already done the redo_root_search. 551 */ 552 assert(!redo_root_search_done || !update_trans); 553 if (WANT_REDO_ROOT_SEARCH) 554 { /* Note: An online rollback can occur DURING gvcst_redo_root_search, which can remove gbls 555 * from db, leading to gv_target->root being 0, even though failure code is not 556 * cdb_sc_onln_rlbk2 557 */ --> 558 gvcst_redo_root_search(); 559 } ``` The `cdb_sc_gvtrootmod2` failure code is because of a concurrent online rollback on the receiver side. But in order to handle it, we invoke the `gvcst_redo_root_search()` if the `WANT_REDO_ROOT_SEARCH` macro returned TRUE. The macro is defined as follows. **sr_port/t_retry.c** ```c 67 /* In mu_reorg if we are in gvcst_bmp_mark_free, we actually have a valid gv_target. Find its root before the next iteration 68 * in mu_reorg. 69 */ 70 #define WANT_REDO_ROOT_SEARCH \ --> 71 ( (NULL != gv_target) \ 72 && (DIR_ROOT != gv_target->root) \ 73 && !redo_root_search_done \ 74 && !TREF(in_gvcst_redo_root_search) \ 75 && !mu_reorg_upgrd_dwngrd_in_prog \ 76 && !mu_reorg_encrypt_in_prog \ 77 && (!TREF(in_gvcst_bmp_mark_free) || mu_reorg_process) \ 78 ) ``` Line 71 is the issue. `gv_target` should have been `NULL` in the `gvcst_jrt_null()` case since we are not dealing with any global name (the `NULL` journal record is to denote an empty transaction). Given the above analysis, the fix is to set `gv_target` to `NULL` in `sr_port/gvcst_jrt_null.c`. Additionally, `gv_currkey->base[0]` had to also be cleared to keep it in sync with gv_target in DEBUG code (just like is being already done in the `GVTR_SWITCH_REG_AND_HASHT_BIND_NAME` macro in `sr_unix/gv_trigger.h`). * Fixed `sr_unix/mupip_cmd_disallow.c` to disallow `ZEROBACKLOG` option if `SHUTDOWN` is not also specified. I noticed this issue while adding disallow code for `TRIGUPDATE`. The `ZEROBACKLOG` option was introduced in 60e7e2d (GT.M V6.3-000) which was many years ago but this option was allowed to be specified in various source server commands that had nothing to do with this option (for example, the command `mupip replic -source -deactivate -zerobacklog` was allowed but it did not make any sense). This misfeature is fixed in the current commit by generating a `%YDB-E-CLIERR` error for such meaningless commands. * Made `cstart` and `jstart` variables DEBUG_ONLY in `sr_port/repl_filter.c`. This removed all the `clang-tidy` warnings related to these variables in the following reference files. - ci/tidy_warnings_release_x86_64.ref - ci/tidy_warnings_release_aarch64.ref In addition, a `Value stored to 'prefix' is never read [clang-analyzer-deadcode.DeadStores]` warning also no longer shows up after all changes to `sr_port/repl_filter.c` in this commit. Not exactly sure where the change happened but not spending time on it since the warning has now disappeared. This meant removing one line of warning from the following reference files. - ci/tidy_warnings_release_x86_64.ref - ci/tidy_warnings_release_aarch64.ref - ci/tidy_warnings_debug_aarch64.ref - ci/tidy_warnings_debug_x86_64.ref * Enhanced `ci/create_tidy_warnings.sh` to capture detail in case of compilation errors. As a background, I had a `clang-tidy-amd64` pipeline job fail because my changes to `sr_port/repl_filter.c` had a compilation error. In this case, the `clang-tidy-14` call in `ci/create_tidy_warnings.sh` exited with a non-zero status but since we had redirected the stderr to `/dev/null` (i.e. `2>/dev/null`) we had no error text to look at to see why it exited abnormally. I had to run the same command locally to determine the cause of the error. This is fixed by redirecting stderr to `$output_dir/tidy_warnings.err`. And if the `clang-tidy` exited with a non-zero status, we examine this file for any lines containing `error` (case insensitive). We print those lines in the standard output. That way it is more likely to give helpful information as to which C file had compilation troubles. In my case, I saw the following 2 lines of extra output indicating `sr_port/repl_filter.c` had compilation errors. ``` 4617 warnings and 20 errors generated. Error while processing /builds/nars1/YDB/sr_port/repl_filter.c. ```
- Loading branch information