-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce footprint of engine #33
Comments
FYI. I am currently working on reducing the size of the utility programs. That should most likely eliminate the need to make any configure related changes since the new utilities would be very small in size (a few Kbs instead of a few Mbs). It might even change the title and/or the Release Note wording. |
nars1
added a commit
to nars1/YottaDB
that referenced
this issue
Nov 21, 2017
… from 34Mb to 12Mb Utilities in the YottaDB distribution (e.g. mupip, dse, lke etc.) took up a lot of space in the installation directory but most of their size was due to statically linking code that was already existing in libgtmshr.so. So the utilities were reworked to do a dlopen() of libgtmshr.so (just like mumps already does) thereby greatly reducing their sizes (from a few megabytes to a few kilobytes). This caused an increase in libgtmshr.so (since it now also includes utility-specific code) but was negligible compared to the reduction in the sizes of the utilities. With the changes, the size of the YottaDB binary tarball (a .zip file) reduced from 13.0Mb to 3.5Mb. And the size of the directory after installing YottaDB reduced from 33.5Mb to 12.0Mb. The utilities affected are dbcertify, dse, gtcm_gnp_server, gtcm_play, gtcm_server, gtcm_shmclean, lke and mupip. Most of them were approximately 3.5Mb in size and are now 13Kb in size. In addition, ftok was also trimmed down (it was already in the Kb so size savings was not as great). The geteuid utility is relied upon by the configure script so it was not touched for now. High-level change for DSE --------------------------- a) A new module dlopen_libgtmshr.c is created to do a dlopen() of libgtmshr.so and do a dlsym() of an input symbol. This code is mostly moved from what used to be gtm.c. gtm.c now calls dlopen_libgtmshr() with a parameter "gtm_main" indicating the function to be invoked is gtm_main(). b) dse.c was renamed as dse_main.c and a copy of gtm.c was made into dse.c with the parameter "dse_main" being passed to dlopen_libgtmshr() to indicate dse_main() needs to be invoked. The GBLDEF of cmd_ary was moved from dse.c to gbldefs.c. dse_main.c passes the dse-specific dse_cmd_ary global to common_startup_init() which sets the "cmd_ary" global with this value. c) libdse.list included the list of modules which are DSE specific. These previously went into a libdse.a library as part of the build. This list file is now removed. So all these dse-specific modules get compiled into libmumps.a instead, the same library that houses mumps-specific modules. d) While linking libgtmshr.so, dse_main and dse_cmd are also specified in the command line so all the functions they call (basically all dse-specific functions) get pulled into libgtmshr.so. e) dse_main was added to gtmshr_symbols.exp, which contains the list of functions which are exported (i.e. visible externally) from libgtmshr.so. This logic is mostly the same for the other utilities too. Miscellaneous other changes ----------------------------- * For the GT.CM OMI server there were two utilities gtcm_server and gtcm_play both of which separately defined the same set of global variables. These had to be moved to a common location (gbldefs.c) to avoid multiple definition of these symbols now that these two functions are in libgtmshr.so. gvcmz_neterr.c, gvcmz_error_stub.c and gvcmz_neterr_stub.c had a GBLDEF of neterr_pending which was now removed as this GBLDEF is in gbldefs.c. gtcm_init.c (a function invoked by gtcm_server and gtcm_play) * mdef.h now includes the prototypes of the new _main functions (e.g. dse_main, lke_main etc.). Nixed gtm_main.h and moved the prototype of gtm_main() from there into mdef.h. * get_src_line.c : Noticed a LITDEF that was unnecessary so removed it. * New MIXIMAGE error message issued when a C function tries to invoke more than one base image function included in libgtmshr.so (e.g. gtm_main, dse_main, mupip_main etc.). * While removing libdse.list, libmupip.list, liblke.list and libdbcertify.list was straightforward, the following list files needed a little more work on top of being removed. ./sr_unix_cm/libgtcm.list ./sr_unix_gnp/libcmisockettcp.list ./sr_unix_gnp/libgnpclient.list ./sr_unix_gnp/libgnpserver.list ./sr_unix/libstub.list Previously, utilities used to link to specific libraries (derived from the above .list files). For example, lke and libgtmshr used to link to libgnpclient.a whereas utilities like mupip did not. mupip instead linked to libstub.a which had stub versions of all those functions defined in libgtnpclient.a that did nothing but assert that we never reach them. The only exception was gvcmy_open(). This issued an UNIMPLOP error if invoked from mupip (through libstub.a) whereas it did real work if invoked from mumps or lke (through libgnpclient.a). But now that both mupip and mumps open libgtmshr.so and that has the non-stub version of gvcmy_open(), this function needed to be changed to check the current image and if so, allow only lke and mumps and issue an UNIMPLOP error in case of any other image invocation. Similarly gvcmy_rundown() was a no-op if invoked from mupip etc. while it did real work in case of mumps/lke (if they had opened regions on the server side). This function was anyways doing a check of "ntd_root" (which is guaranteed to be NULL for mupip etc.) and doing a return in that case so no changes were needed for this function. While at this, care was taken to delete all modules included by libstub.list and go through all functions defined by modules in the other list files above (libgtcm.list, libcmisockettcp.list etc.) and add an assert at each function entry that it is indeed invoked by the image that we expect to and not by any other image (e.g. ASSERT_IS_LIBGNPCLIENT assert). * After the above changes, an assert failed in mupip that rc_mval2subsc (instead of mval2subsc) was being invoked from mupip. Both rc_mval2subsc and mval2subsc were now being folded into libgtmshr.so and were defining a function of the same name mval2subsc() and the linker had incorrectly picked the rc version for mupip. Thanks to the newly introduced checks, this was caught and the fix was to remove rc_mval2subsc.c altogether since it was only needed for the SUN platform port of GT.M many decades back. * The file msg.m was changed as part of a recent commit to generate ydbmerrors.h. While this file resides in sr_x86_64 or sr_armv7l, the file that is generated ends up in the "gen" subdirectory of the cmake build. The "gen" subdirectory is already included as part of the include list in each C compile (ahead of any other sr_* directory) so it is possible in some cases (timing issue with a multi-threaded cmake build) that a C compile that requires ydbmerrors.h finds it in the "gen" directory (ahead of the sr_x86_64 or sr_armv7l directory) while the file is still being built and so ends up with compilation errors. To fix this, the existing "gen" directory was split into two directories "genused" and "gennotused". Both directories contain generated files. But "genused" directory contains the files that are also used by the cmake build whereas the "gennotused" contains generated files that are not currently used in the cmake build. With this split-up, gtm_threadgbl_deftypes.h and gtm_threadgbl_deftypes_asm.si end up in the "genused" directory while everything else (including ydbmerrors.h) ends up in "gennotused". And only "genused" is added to the include search path in the cmake build that way files generated in "gennotused" do not cause timing issues in the build anymore. There is still a todo that we need to verify the gennotused directory contents against the repository versions of those files to make sure they are the same and if not flag an error. But that is a separate issue which will be taken care of later.
nars1
added a commit
that referenced
this issue
Nov 21, 2017
…4Mb to 12Mb Utilities in the YottaDB distribution (e.g. mupip, dse, lke etc.) took up a lot of space in the installation directory but most of their size was due to statically linking code that was already existing in libgtmshr.so. So the utilities were reworked to do a dlopen() of libgtmshr.so (just like mumps already does) thereby greatly reducing their sizes (from a few megabytes to a few kilobytes). This caused an increase in libgtmshr.so (since it now also includes utility-specific code) but was negligible compared to the reduction in the sizes of the utilities. With the changes, the size of the YottaDB binary tarball (a .zip file) reduced from 13.0Mb to 3.5Mb. And the size of the directory after installing YottaDB reduced from 33.5Mb to 12.0Mb. The utilities affected are dbcertify, dse, gtcm_gnp_server, gtcm_play, gtcm_server, gtcm_shmclean, lke and mupip. Most of them were approximately 3.5Mb in size and are now 13Kb in size. In addition, ftok was also trimmed down (it was already in the Kb so size savings was not as great). The geteuid utility is relied upon by the configure script so it was not touched for now. High-level change for DSE --------------------------- a) A new module dlopen_libgtmshr.c is created to do a dlopen() of libgtmshr.so and do a dlsym() of an input symbol. This code is mostly moved from what used to be gtm.c. gtm.c now calls dlopen_libgtmshr() with a parameter "gtm_main" indicating the function to be invoked is gtm_main(). b) dse.c was renamed as dse_main.c and a copy of gtm.c was made into dse.c with the parameter "dse_main" being passed to dlopen_libgtmshr() to indicate dse_main() needs to be invoked. The GBLDEF of cmd_ary was moved from dse.c to gbldefs.c. dse_main.c passes the dse-specific dse_cmd_ary global to common_startup_init() which sets the "cmd_ary" global with this value. c) libdse.list included the list of modules which are DSE specific. These previously went into a libdse.a library as part of the build. This list file is now removed. So all these dse-specific modules get compiled into libmumps.a instead, the same library that houses mumps-specific modules. d) While linking libgtmshr.so, dse_main and dse_cmd are also specified in the command line so all the functions they call (basically all dse-specific functions) get pulled into libgtmshr.so. e) dse_main was added to gtmshr_symbols.exp, which contains the list of functions which are exported (i.e. visible externally) from libgtmshr.so. This logic is mostly the same for the other utilities too. Miscellaneous other changes ----------------------------- * For the GT.CM OMI server there were two utilities gtcm_server and gtcm_play both of which separately defined the same set of global variables. These had to be moved to a common location (gbldefs.c) to avoid multiple definition of these symbols now that these two functions are in libgtmshr.so. gvcmz_neterr.c, gvcmz_error_stub.c and gvcmz_neterr_stub.c had a GBLDEF of neterr_pending which was now removed as this GBLDEF is in gbldefs.c. gtcm_init.c (a function invoked by gtcm_server and gtcm_play) * mdef.h now includes the prototypes of the new _main functions (e.g. dse_main, lke_main etc.). Nixed gtm_main.h and moved the prototype of gtm_main() from there into mdef.h. * get_src_line.c : Noticed a LITDEF that was unnecessary so removed it. * New MIXIMAGE error message issued when a C function tries to invoke more than one base image function included in libgtmshr.so (e.g. gtm_main, dse_main, mupip_main etc.). * While removing libdse.list, libmupip.list, liblke.list and libdbcertify.list was straightforward, the following list files needed a little more work on top of being removed. ./sr_unix_cm/libgtcm.list ./sr_unix_gnp/libcmisockettcp.list ./sr_unix_gnp/libgnpclient.list ./sr_unix_gnp/libgnpserver.list ./sr_unix/libstub.list Previously, utilities used to link to specific libraries (derived from the above .list files). For example, lke and libgtmshr used to link to libgnpclient.a whereas utilities like mupip did not. mupip instead linked to libstub.a which had stub versions of all those functions defined in libgtnpclient.a that did nothing but assert that we never reach them. The only exception was gvcmy_open(). This issued an UNIMPLOP error if invoked from mupip (through libstub.a) whereas it did real work if invoked from mumps or lke (through libgnpclient.a). But now that both mupip and mumps open libgtmshr.so and that has the non-stub version of gvcmy_open(), this function needed to be changed to check the current image and if so, allow only lke and mumps and issue an UNIMPLOP error in case of any other image invocation. Similarly gvcmy_rundown() was a no-op if invoked from mupip etc. while it did real work in case of mumps/lke (if they had opened regions on the server side). This function was anyways doing a check of "ntd_root" (which is guaranteed to be NULL for mupip etc.) and doing a return in that case so no changes were needed for this function. While at this, care was taken to delete all modules included by libstub.list and go through all functions defined by modules in the other list files above (libgtcm.list, libcmisockettcp.list etc.) and add an assert at each function entry that it is indeed invoked by the image that we expect to and not by any other image (e.g. ASSERT_IS_LIBGNPCLIENT assert). * After the above changes, an assert failed in mupip that rc_mval2subsc (instead of mval2subsc) was being invoked from mupip. Both rc_mval2subsc and mval2subsc were now being folded into libgtmshr.so and were defining a function of the same name mval2subsc() and the linker had incorrectly picked the rc version for mupip. Thanks to the newly introduced checks, this was caught and the fix was to remove rc_mval2subsc.c altogether since it was only needed for the SUN platform port of GT.M many decades back. * The file msg.m was changed as part of a recent commit to generate ydbmerrors.h. While this file resides in sr_x86_64 or sr_armv7l, the file that is generated ends up in the "gen" subdirectory of the cmake build. The "gen" subdirectory is already included as part of the include list in each C compile (ahead of any other sr_* directory) so it is possible in some cases (timing issue with a multi-threaded cmake build) that a C compile that requires ydbmerrors.h finds it in the "gen" directory (ahead of the sr_x86_64 or sr_armv7l directory) while the file is still being built and so ends up with compilation errors. To fix this, the existing "gen" directory was split into two directories "genused" and "gennotused". Both directories contain generated files. But "genused" directory contains the files that are also used by the cmake build whereas the "gennotused" contains generated files that are not currently used in the cmake build. With this split-up, gtm_threadgbl_deftypes.h and gtm_threadgbl_deftypes_asm.si end up in the "genused" directory while everything else (including ydbmerrors.h) ends up in "gennotused". And only "genused" is added to the include search path in the cmake build that way files generated in "gennotused" do not cause timing issues in the build anymore. There is still a todo that we need to verify the gennotused directory contents against the repository versions of those files to make sure they are the same and if not flag an error. But that is a separate issue which will be taken care of later.
nars1
added a commit
to nars1/YottaDB
that referenced
this issue
Dec 6, 2017
…issued when needed While trying to check whether the MIXIMAGE error (introduced as part of a recent YottaDB#33 commit) is issued correctly, a couple of code issues were identified. 1) gtm_threadgbl_init() is invoked by the GTM_THREADGBL_INIT macro and that is invoked only once per image (i.e. once in mumps, once in mupip etc.). And so it had a check that gtm_threadgbl be NULL and issued an error otherwise. But it is possible in case the process has already loaded an image and is attempting to load another image that gtm_threadgbl is non-NULL. This is a case where a MIXIMAGE error is about to be issued by common_startup_init() once we return from this function. So this check is now changed so we return if non-NULL. 2) gtcm_server_main() set "ctxt" global variable to NULL at function entry. This caused an error issued in "common_startup_init" to SIG-11 because this global variable is used by the rts_error scheme as part of maintaining the condition-handler stack. It is not clear why the NULL set is needed so that is now removed. This way, if a MIXIMAGE error is going to be issued, "ctxt" would already be non-NULL (as part of the err_init() done for the first image that was loaded). Also, a redundant "ctxt = ctxt" statement a little later in this function is now removed.
nars1
added a commit
to nars1/YottaDB
that referenced
this issue
Dec 6, 2017
…issued when appropriate While trying to check whether the MIXIMAGE error (introduced as part of a recent YottaDB#33 commit) is issued correctly, a couple of code issues were identified. 1) gtm_threadgbl_init() is invoked by the GTM_THREADGBL_INIT macro and that is invoked only once per image (i.e. once in mumps, once in mupip etc.). And so it had a check that gtm_threadgbl be NULL and issued an error otherwise. But it is possible in case the process has already loaded an image and is attempting to load another image that gtm_threadgbl is non-NULL. This is a case where a MIXIMAGE error is about to be issued by common_startup_init() once we return from this function. So this check is now changed so we return if non-NULL. 2) gtcm_server_main() set "ctxt" global variable to NULL at function entry. This caused an error issued in "common_startup_init" to SIG-11 because this global variable is used by the rts_error scheme as part of maintaining the condition-handler stack. It is not clear why the NULL set is needed so that is now removed. This way, if a MIXIMAGE error is going to be issued, "ctxt" would already be non-NULL (as part of the err_init() done for the first image that was loaded). Also, a redundant "ctxt = ctxt" statement a little later in this function is now removed.
nars1
added a commit
that referenced
this issue
Dec 6, 2017
…when appropriate While trying to check whether the MIXIMAGE error (introduced as part of a recent #33 commit) is issued correctly, a couple of code issues were identified. 1) gtm_threadgbl_init() is invoked by the GTM_THREADGBL_INIT macro and that is invoked only once per image (i.e. once in mumps, once in mupip etc.). And so it had a check that gtm_threadgbl be NULL and issued an error otherwise. But it is possible in case the process has already loaded an image and is attempting to load another image that gtm_threadgbl is non-NULL. This is a case where a MIXIMAGE error is about to be issued by common_startup_init() once we return from this function. So this check is now changed so we return if non-NULL. 2) gtcm_server_main() set "ctxt" global variable to NULL at function entry. This caused an error issued in "common_startup_init" to SIG-11 because this global variable is used by the rts_error scheme as part of maintaining the condition-handler stack. It is not clear why the NULL set is needed so that is now removed. This way, if a MIXIMAGE error is going to be issued, "ctxt" would already be non-NULL (as part of the err_init() done for the first image that was loaded). Also, a redundant "ctxt = ctxt" statement a little later in this function is now removed.
nars1
added a commit
to nars1/YottaDBtest
that referenced
this issue
Dec 11, 2017
… error is first issued instead of GTMDISTUNVERIF so fix reference file to reflect new behavior
nars1
added a commit
to YottaDB/YDBTest
that referenced
this issue
Dec 11, 2017
… error is first issued instead of GTMDISTUNVERIF so fix reference file to reflect new behavior
ksbhaskar
changed the title
Compact installation minimizing footprint of engine
Reduce footprint of engine
Mar 19, 2018
chathaway-codes
pushed a commit
that referenced
this issue
Nov 18, 2018
…secondary errors if primary error is out-of-memory If already exiting, do not open any object/source directories (which could include relinkctl files) as part of $ZROUTINES initialization. This avoids potentially nasty codepaths particulary if the reason we are exiting is an out-of-memory. We do not expect any user to run such extreme out-of-memory codepaths/tests so it is not considered necessary to create a user-visible issue for this. For example, below are two C-stacks that showed up in core dumps while running the simpleapi/fatalerror2 subtest. In both cases, if we avoid the zro_init() call we can avoid such cores. Core1 ------ Notice the local variables passed in #0 have "Cannot access memory" errors. Most likely there was no space allocating the C-stack in this core. (gdb) where #0 ydb_trans_log_name (envindx=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c5c>, trans=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c50>, buffer=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c48>, buffer_len=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c58>, ignore_errors=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c44>, is_ydb_env_match=<error reading variable: Cannot access memory at address 0x7ffe1e3c6c38>) at sr_port/ydb_trans_log_name.c:41 #1 util_out_send_oper (addr=0x7ffe1e3c7800 "%YDB-E-RELINKCTLERR, Error with relink control structure for $ZROUTINES directory ., %YDB-E-SYSCALL, Error received from system call mmap() -- called from module "..., len=287) at sr_unix/util_output.c:731 #2 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffe1e3c8050, faocnt=2147483647) at sr_unix/util_output.c:871 #3 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #4 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:261 #5 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #6 rts_error_va (csa=0x0, argcnt=12, var=0x7ffe1e3c82b0) at sr_unix/rts_error.c:159 #7 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #8 relinkctl_map (linkctl=0x7ffe1e3c8890) at sr_unix/relinkctl.c:679 #9 relinkctl_open (linkctl=0x7ffe1e3c8890, object_dir_missing=0) at sr_unix/relinkctl.c:333 #10 relinkctl_attach (obj_container_name=0x7ffe1e3cbb50, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #11 zro_load (str=0x5611ed710ce8) at sr_unix/zro_load.c:159 #12 zro_init () at sr_port/zro_init.c:51 #13 zshow_svn (output=0x7ffe1e40f0b0, one_sv=0) at sr_port/zshow_svn.c:694 #14 op_zshow (func=0x7ffe1e4171b0, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #15 jobexam_dump (dump_filename_arg=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0, fatal_file_name_buff=0x7ffe1e417c40 "simpleapi_0_2/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_65362_1.txt") at sr_port/jobexam_process.c:232 #16 jobexam_process (dump_file_name=0x7ffe1e418c90, dump_file_spec=0x7ffe1e418cb0) at sr_port/jobexam_process.c:152 #17 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #18 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #19 rts_error_va (csa=0x0, argcnt=5, var=0x7ffe1e41a6a0) at sr_unix/rts_error.c:159 #20 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #21 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #22 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #23 lvtreenode_newblock (sym=0x5611ed733b40, numElems=2097152) at sr_port/lv_newblock.c:82 #24 lvtreenode_getslot (sym=0x5611ed733b40) at sr_port/lv_getslot.c:145 #25 lvAvlTreeNodeInsert (lvt=0x5611ed736050, key=0x7ffe1e41aab0, parent=0x5611f87cb608) at sr_port/lv_tree.c:1698 #26 op_putindx (argcnt=1, start=0x5611ed73b0a0) at sr_port/op_putindx.c:192 #27 callg (fnptr=0x7fb75d4f4fff <op_putindx>, paramlist=0x7ffe1e41ae60) at sr_unix/callg.c:60 #28 ydb_set_s (varname=0x7ffe1e41b5e0, subs_used=1, subsarray=0x7ffe1e41b5f0, value=0x7ffe1e41ade0) at sr_unix/ydb_set_s.c:108 #29 gvnset () at fatalerror.c:56 #30 ydb_tp_s (tpfn=0x5611ed225260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #31 main () at fatalerror.c:32 Core2 ----- In this case there is a SIG-11 deep inside syslog(). Most likely due to an out-of-memory situation. Program terminated with signal SIGSEGV, Segmentation fault. #0 vfprintf () from /usr/lib64/libc.so.6 #1 fprintf () from /usr/lib64/libc.so.6 #2 __vsyslog_chk () from /usr/lib64/libc.so.6 #3 syslog () from /usr/lib64/libc.so.6 #4 util_out_send_oper (addr=0x7ffdadd5ec10 "%YDB-E-JOBEXAMFAIL, YottaDB process 50787 executing $ZJOBEXAM function failed with the preceding error message -- generated from 0x", '0' <repeats 16 times>, ".", len=149) at sr_unix/util_output.c:761 #5 util_out_print_vaparm (message=0x0, flush=4, var=0x7ffdadd5f460, faocnt=2147483647) at sr_unix/util_output.c:871 #6 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:904 #7 send_msg_va (csa=0x0, arg_count=0, var=0x7ffdadd5fa00) at sr_unix/send_msg.c:149 #8 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:79 #9 jobexam_dump_ch (arg=150383514) at sr_port/jobexam_process.c:264 #10 gtm_maxstr_ch (arg=150383514) at sr_port/gtm_maxstr.c:36 #11 rts_error_va (csa=0x0, argcnt=12, var=0x7ffdadd5fc60) at sr_unix/rts_error.c:159 #12 rts_error_csa (csa=0x0, argcnt=12) at sr_unix/rts_error.c:92 #13 relinkctl_map (linkctl=0x7ffdadd60240) at sr_unix/relinkctl.c:679 #14 relinkctl_open (linkctl=0x7ffdadd60240, object_dir_missing=0) at sr_unix/relinkctl.c:333 #15 relinkctl_attach (obj_container_name=0x7ffdadd63500, objpath=0x0, objpath_alloc_len=0) at sr_unix/relinkctl.c:188 #16 zro_load (str=0x55df19dd3ce8) at sr_unix/zro_load.c:159 #17 zro_init () at sr_port/zro_init.c:51 #18 zshow_svn (output=0x7ffdadda6a60, one_sv=0) at sr_port/zshow_svn.c:694 #19 op_zshow (func=0x7ffdaddaeb60, type=1, lvn=0x0) at sr_port/op_zshow.c:166 #20 jobexam_dump (dump_filename_arg=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660, fatal_file_name_buff=0x7ffdaddaf5f0 "simpleapi_0_40/fatalerror2/YDB_FATAL_ERROR.ZSHOW_DMP_50787_1.txt") at sr_port/jobexam_process.c:232 #21 jobexam_process (dump_file_name=0x7ffdaddb0640, dump_file_spec=0x7ffdaddb0660) at sr_port/jobexam_process.c:152 #22 create_fatal_error_zshow_dmp (signal=150373340) at sr_port/create_fatal_error_zshow_dmp.c:66 #23 ydb_simpleapi_ch (arg=150373340) at sr_unix/ydb_simpleapi_ch.c:224 #24 rts_error_va (csa=0x0, argcnt=5, var=0x7ffdaddb2050) at sr_unix/rts_error.c:159 #25 rts_error_csa (csa=0x0, argcnt=5) at sr_unix/rts_error.c:92 #26 raise_gtmmemory_error () at sr_port/gtm_malloc_src.h:1114 #27 gtm_malloc (size=184549392) at sr_port/gtm_malloc_src.h:748 #28 lvtreenode_newblock (sym=0x55df19df6b40, numElems=2097152) at sr_port/lv_newblock.c:82 #29 lvtreenode_getslot (sym=0x55df19df6b40) at sr_port/lv_getslot.c:145 #30 lvAvlTreeNodeInsert (lvt=0x55df19df9050, key=0x7ffdaddb2460, parent=0x55df24e8e5c8) at sr_port/lv_tree.c:1698 #31 op_putindx (argcnt=1, start=0x55df19dfe0a0) at sr_port/op_putindx.c:192 #32 callg (fnptr=0x7feae36c6fff <op_putindx>, paramlist=0x7ffdaddb2810) at sr_unix/callg.c:60 #33 ydb_set_s (varname=0x7ffdaddb2f90, subs_used=1, subsarray=0x7ffdaddb2fa0, value=0x7ffdaddb2790) at sr_unix/ydb_set_s.c:108 #34 gvnset () at fatalerror.c:56 #35 ydb_tp_s (tpfn=0x55df18a5c260 <gvnset>, tpfnparm=0x0, transid=0x0, namecount=0, varnames=0x0) at sr_unix/ydb_tp_s.c:193 #36 main () at fatalerror.c:32
chathaway-codes
pushed a commit
that referenced
this issue
Nov 27, 2018
…ops while a V4 block is dirty and SimpleThreadAPI is active The simplethreadapi/lockst subtest failed as follows when the test framework randomly chose V4 format blocks. %YDB-F-MAXRTSERRDEPTH Error loop detected - aborting image with core And produced a core with the following C-stack #0 __pthread_kill (threadid=<optimized out>, signo=3) at ../sysdeps/unix/sysv/linux/pthread_kill.c:57 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:72 #2 send_msg_va (csa=0x0, arg_count=9, var=0x7f25a0f7ff40) at sr_unix/send_msg.c:123 #3 send_msg (arg_count=9) at sr_unix/send_msg.c:74 #4 gtm_assert2 (condlen=17, condtext=0x7f25a62dc0d6 "!timer_in_handler", file_name_len=45, file_name=0x7f25a62dc0a8 "sr_unix/send_msg.c", line_no=125) at sr_port/gtm_assert2.c:34 #5 send_msg_va (csa=0x0, arg_count=9, var=0x7f25a0f80570) at sr_unix/send_msg.c:125 #6 send_msg (arg_count=9) at sr_unix/send_msg.c:74 #7 gtm_assert2 (condlen=17, condtext=0x7f25a62dc0d6 "!timer_in_handler", file_name_len=45, file_name=0x7f25a62dc0a8 "sr_unix/send_msg.c", line_no=125) at sr_port/gtm_assert2.c:34 . . #32 send_msg_va (csa=0x0, arg_count=9, var=0x7f25a0f83d20) at sr_unix/send_msg.c:125 #33 send_msg (arg_count=9) at sr_unix/send_msg.c:74 #34 gtm_assert2 (condlen=17, condtext=0x7f25a6305658 "!timer_in_handler", file_name_len=51, file_name=0x7f25a6305320 "sr_port/gtm_malloc_src.h", line_no=685) at sr_port/gtm_assert2.c:34 #35 gtm_malloc (size=4096) at sr_port/gtm_malloc_src.h:685 #36 wcs_wtstart (region=0x5558a88e1170, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:538 #37 wcs_stale (tid=93839273365872, hd_len=8, region=0x5558a8979628) at sr_port/t_end_sysops.c:1387 #38 timer_handler (why=14) at sr_unix/gt_timers.c:815 #39 <signal handler called> #40 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x5558a88e2ea8) at ../sysdeps/unix/sysv/linux/futex-internal.h:88 #41 __pthread_cond_wait_common (abstime=0x0, mutex=0x5558a88e2e40, cond=0x5558a88e2e80) at pthread_cond_wait.c:502 #42 __pthread_cond_wait (cond=0x5558a88e2e80, mutex=0x5558a88e2e40) at pthread_cond_wait.c:655 #43 ydb_stm_thread (parm=0x0) at sr_unix/ydb_stm_thread.c:92 #44 start_thread (arg=0x7f25a0f85700) at pthread_create.c:463 #45 clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95 The primary issue is failure in the assert "assert(!timer_in_handler)" inside the PTHREAD_MUTEX_LOCK_IF_NEEDED macro at frame # 35 (gtm_malloc). This is because we are about to make a pthread_mutex_lock() call which should not be done while inside a signal handler (according to the man pages). To fix this issue, we skip flushing that particular cache-record in wcs_wtstart() thereby avoiding the need to call gtm_malloc() (and in turn the pthread_mutex_lock() call) while inside the timer handler.
nars1
added a commit
that referenced
this issue
Aug 7, 2020
…that can fail in rare cases * The below assert failed once in 1500 r126/ydb464 subtest runs across a dozen boxes. ```c assert((1 != forced_exit) || GET_DEFERRED_EXIT_CHECK_NEEDED); #27 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 #29 wcs_stale (tid=8350064, hd_len=8, region=0x872fe8) at sr_port/t_end_sysops.c:1386 #30 timer_handler (why=14, info=0x7fadda7caea8 <stapi_signal_handler_oscontext+10728>, context=0x7fadda7caf28 <stapi_signal_handler_oscontext+10856>) at sr_unix/gt_timers.c:773 #31 ydb_os_signal_handler (sig=14, info=0x7ffdee359eb0, context=0x7ffdee359d80) at sr_unix/ydb_os_signal_handler.c:66 #32 <signal handler called> #33 generic_signal_handler (sig=2, info=0x7fadda7c9588 <stapi_signal_handler_oscontext+4296>, context=0x7fadda7c9608 <stapi_signal_handler_oscontext+4424>) at sr_unix/generic_signal_handler.c:244 #34 ydb_os_signal_handler (sig=2, info=0x7ffdee35a870, context=0x7ffdee35a740) at sr_unix/ydb_os_signal_handler.c:88 #35 <signal handler called> #36 clock_nanosleep () from /usr/lib64/libc.so.6 #37 m_usleep (useconds=1000) at sr_unix/sleep.c:28 #38 wcs_sleep (sleepfactor=1) at sr_port/wcs_sleep.c:28 #39 wcs_phase2_commit_wait (csa=0x7f4840, cr=0x7fadd7b88a50) at sr_port/wcs_phase2_commit_wait.c:268 #40 t_qread (blk=6, cycle=0x7ffdee35c568, cr_out=0x7ffdee35c558) at sr_port/t_qread.c:668 #41 gvcst_search (pKey=0x7f6040, pHist=0x0) at sr_port/gvcst_search.c:438 #42 gvcst_put2 (val=0x7fadda7d3940 <increment_delta_mval>, parms=0x7ffdee35f8b0) at sr_port/gvcst_put.c:980 #43 gvcst_put (val=0x7fadda7d3940 <increment_delta_mval>) at sr_port/gvcst_put.c:299 #44 gvcst_incr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/gvcst_incr.c:60 #45 op_gvincr (increment=0x7ffdee362030, result=0x7ffdee362050) at sr_port/op_gvincr.c:60 #46 ydb_incr_s (varname=0x7ffdee362b10, subs_used=0, subsarray=0x7ffdee3629d0, increment=0x7ffdee362aa0, ret_value=0x7ffdee362a90) at sr_unix/ydb_incr_s.c:156 #47 runProc (settings=0x7ffdee363490, curDepth=0) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:447 #48 runProc_driver (settings=0x7ffdee363490) at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:145 #49 main () at /usr/library/gtm_test/T998/simpleapi/inref/randomWalk.c:93 (gdb) f 28 #28 wcs_wtstart (region=0x7f6970, writes=0, cr_list_ptr=0x0, cr2flush=0x0) at sr_unix/wcs_wtstart.c:830 830 ENABLE_INTERRUPTS(INTRPT_IN_WCS_WTSTART, prev_intrpt_state); (gdb) p forced_exit $5 = 1 (gdb) p deferred_signal_handling_needed $6 = 0 ``` * This is because the signal handler for SIGALRM (sig=14) happened after line 183 but before line 193. ```c sr_port/have_crit.h ------------------- 183 forced_exit = 1; \ 184 if (in_os_signal_handler) \ 185 { /* If we are inside an OS signal handler and therefore had to defer exit \ 186 * handling, set "outofband" also to TRUE as this is checked by lots of \ 187 * potentially long-running commands in the runtime (e.g. HANG etc.) and we \ 188 * want all of those to automatically trigger process exit handling. \ 189 */ \ 190 outofband = deferred_signal; \ 191 } \ 192 /* Whenever "forced_exit" gets set to 1, set the corresponding deferred event too */ \ 193 SET_DEFERRED_EXIT_CHECK_NEEDED; \ ``` * We cannot easily protect against this situation and so it is better to remove the assert.
nars1
added a commit
that referenced
this issue
Aug 7, 2020
…signal_handling_needed; Fixes rare test failures Background ---------- * The `locks/locks_main` subtest failed on an ARMV7L Pi 3 system with the following assert in in-house testing (Note: this failure happened only once and is not easily reproducible in hundreds of test runs). ```diff > %YDB-F-ASSERT, Assert failed in sr_unix/util_output.c line 761 for expression (!GET_DEFERRED_EXIT_CHECK_NEEDED || (1 == forced_exit)) ``` * The C-stack from the core file looked as follows. ```c #25 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:134 #26 ch_cond_core () at sr_unix/ch_cond_core.c:80 #27 rts_error_va (csa=0x0, argcnt=7, var=...) at sr_unix/rts_error.c:192 #28 rts_error_csa (csa=0x0, argcnt=7) at sr_unix/rts_error.c:99 #29 util_out_send_oper (addr=0x7ea37734 "%YDB-I-SUSPENDING, Process Received Signal 19. Suspending processing on user request or attempt to do terminal I/O while running in the background -- generated from 0x75F61ACC.", len=177) at sr_unix/util_output.c:761 #30 util_out_print_vaparm (message=0x0, flush=4, var=..., faocnt=2147483647) at sr_unix/util_output.c:880 #31 util_out_print (message=0x0, flush=4) at sr_unix/util_output.c:913 #32 send_msg_va (csa=0x0, arg_count=0, var=...) at sr_unix/send_msg.c:193 #33 send_msg_csa (csa=0x0, arg_count=3) at sr_unix/send_msg.c:86 #34 suspend (sig=19) at sr_unix/suspend.c:39 #35 deferred_signal_handler () at sr_port/deferred_signal_handler.c:86 #36 gtm_hiber_start_wait_any (hiber=1) at sr_unix/gt_timers.c:347 #37 op_lock2_common (timeout=9223372036854775800, laflag=64 '@') at sr_port/op_lock2.c:358 #38 op_lock2 (timeout=0xe8bca8, laflag=64 '@') at sr_port/op_lock2.c:158 #39 op_incrlock (timeout=0xe8bca8) at sr_port/op_incrlock.c:37 (gdb) f 29 #29 0x75fa3960 in util_out_send_oper (addr=0x7ea37734 "%YDB-I-SUSPENDING, Process Received Signal 19. Suspending processing on user request or attempt to do terminal I/O while running in the background -- generated from 0x75F61ACC.", len=177) at sr_unix/util_output.c:761 761 ENABLE_INTERRUPTS(INTRPT_IN_LOG_FUNCTION, prev_intrpt_state); (gdb) p deferred_signal_handling_needed $4 = 4294967292 (gdb) p/x deferred_signal_handling_needed $5 = 0xfffffffc ``` * As can be seen from the gdb output above, the global variable `deferred_signal_handling_needed` had a negative value (`0xfffffffc` is equal to `-4`). This global variable is supposed to always be 0 or a positive value indicating a bitmask of deferred events that need to be processed. A negative value implies an out-of-design state resulting in incorrect interpretations of the deferred events that need to be processed. The assert that failed indicates that `forced_exit` is not 1 (implying that a SIGTERM or similar fatal signal was not deferred) but the global variable `deferred_signal_handling_needed` indicates that such a fatal signal had been deferred implying a contradiction. * After a lot of review of the code, I have a theory as to what might have happened. Let us say the global variable `deferred_signal_handling_needed` had the initial value of 1 indicating there is a deferred timer event. The `CLEAR_DEFERRED_TIMERS_CHECK_NEEDED` macro first does a `GET_DEFERRED_TIMERS_CHECK_NEEDED` call and if that returns TRUE (indicating the bit denoting the deferred timer signal is set), it then does an `INTERLOCK_ADD` call to decrement the global variable by the value `DEFERRED_SIGNAL_HANDLING_TIMERS` (which is 1) effectively clearing the bit. But it is possible a timer (or any other signal) interrupt occurs right after the `GET_DEFERRED_TIMERS_CHECK_NEEDED` macro call but before the `INTERLOCK_ADD` call. And inside the signal handler this global variable could be updated to a different value. For example, `timer_handler()` (invoked if a SIGALRM signal is received) does a `CLEAR_DEFERRED_TIMERS_CHECK_NEEDED` macro call on its own which would clear the bit in the global variable and set it to the value 0. Once the `timer_handler()` call returns, the `INTERLOCK_ADD()` will now be doing a decrement of the global variable by the `DEFERRED_SIGNAL_HANDLING_TIMERS` value thinking it is clearing the bit without realizing the bit got cleared by a signal handler in between. This decrement will now take the global variable value to an out of design state of `-1`. Once there, the global variable can take on other out of design values like `-2` and `-4` etc. like what was seen in the core. Fix --- * The primary issue is that the `INTERLOCK_ADD()` macro is not aware of state changes to the global variable that occur after we examine the value of the global variable but before we do the addition/subtraction. * This is fixed by reworking that to instead use `COMPSWAP_LOCK` macro with a for loop that is retried in case the global variable value changes concurrently and retries the loop. * Took this opportunity to add some asserts and rework the macros in `have_crit.h` so they are simplified and code duplication is avoided. Also entailed changing the type of `deferred_signal_handling_needed` to `global_latch_t`. * Added a `GLOBAL_LATCH_VALUE` macro in `sr_port/mdef.h` to get the latch value.
nars1
added a commit
that referenced
this issue
Mar 15, 2021
…ady started exiting * As part of a prior commit (SHA 723688c) various functions that started a timer (`wcs_clean_dbsync()`, `wcs_stale()` etc.) were fixed to not start one if we have already started exit processing. * One such timer function that should also have been fixed but was left out is `gtmsource_heartbeat_timer()`. We had an in-house test failure which failed an assert in `start_timer()` because `gtmsource_heartbeat_timer()` was being started while we had already started exit processing. Below is the C-stack of the failure for the record. ```c #0 __pthread_kill () at ../sysdeps/unix/sysv/linux/pthread_kill.c:56 #1 gtm_dump_core () at sr_unix/gtm_dump_core.c:74 #2 gtm_fork_n_core () at sr_unix/gtm_fork_n_core.c:163 #3 ch_cond_core () at sr_unix/ch_cond_core.c:80 #4 rts_error_va () at sr_unix/rts_error.c:192 #5 rts_error_csa () at sr_unix/rts_error.c:99 #6 start_timer () at sr_unix/gt_timers.c:433 #7 gtmsource_heartbeat_timer () at sr_unix/gtmsource_heartbeat.c:74 #8 timer_handler () at sr_unix/gt_timers.c:889 #9 ydb_os_signal_handler () at sr_unix/ydb_os_signal_handler.c:63 #10 <signal handler called> #11 __GI___libc_write () at ../sysdeps/unix/sysv/linux/write.c:26 #12 _IO_new_file_write () at fileops.c:1181 #13 new_do_write () at libioP.h:948 #14 _IO_new_file_xsputn () at fileops.c:1255 #15 _IO_new_file_xsputn () at fileops.c:1197 #16 __GI__IO_fwrite () at libioP.h:948 #17 gtm_fwrite () at sr_port/eintr_wrappers.h:334 #18 gtm_fprintf () at tdio.c:82 #19 util_out_print_vaparm () at sr_nix/util_output.c:876 #20 util_out_print () at sr_unix/util_output.c:914 #21 gtm_putmsg_csa () at sr_unix/gtm_putmsg.c:73 #22 gds_rundown () at sr_unix/gds_rundown.c:1060 #23 gv_rundown () at sr_port/gv_rundown.c:122 #24 mupip_exit_handler () at sr_unix/mupip_exit_handler.c:144 #25 __run_exit_handlers () at exit.c:108 #26 __GI_exit () at exit.c:139 #27 gtm_image_exit () at sr_unix/gtm_image_exit.c:27 #28 util_base_ch () at sr_port/util_base_ch.c:124 #29 gtmsource_ch () at sr_port/gtmsource_ch.c:96 #30 gtmsource_readfiles () at aDB/V999_R131/sr_unix/gtmsource_readfiles.c:2023 #31 gtmsource_get_jnlrecs () attaDB/V999_R131/sr_unix/gtmsource_process_ops.c:980 #32 gtmsource_process () at sr_unix/gtmsource_process.c:1546 #33 gtmsource () at sr_unix/gtmsource.c:525 #34 mupip_main () at sr_unix/mupip_main. #35 dlopen_libyottadb () at /Distri9_R131/sr_unix/dlopen_libyottadb.c:151 #36 main () at sr_unix/mupip.c:22 ``` * This failure is now fixed by checking `exit_handler_active` and if it is `TRUE` we skip starting this timer.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Final Release Note
The YottaDB install directory size is 14-15Mb (down from 34Mb in prior versions). (YDB#33)
Description
For production deployments in which no development will occur, and where the footprint of the engine matters (e.g., embedded databases in Internet of Things appliances), not all typically installed components are needed. The footprint of a YottaDB as installed should be examined, and options added to the installer (the
configure
script or its replacement) to omit components that are not essential to the core operation of an application. The existing installer has some options already, for example whether to install support for UTF-8, and whether object code files of utility programs should be deleted once a shared library is built. But an application may not even need some or all utility programs in a production deployment (perhaps not even GDE or LKE), and options should be added to omit these.The above was the original description. Since then, another change happened which noticeably reduced the size of the utility programs (and hence the size of the YottaDB installation). That reduced the size of the installation from 34Mb to 12Mb. And because of that, the above description is considered moot.
Draft Release Note
The YottaDB install directory size is now 12Mb (noticeably down from 34Mb in prior versions).
Messages & Recovery Manual
A MIXIMAGE error is issued when a C function tries to invoke more than one base image function included in libgtmshr.so (e.g. gtm_main, dse_main, mupip_main etc.). Only one base image function can be invoked and only once for the lifetime of the process. This prevents a MUMPS process for example from morphing into some other utility (say MUPIP or DSE).
The text was updated successfully, but these errors were encountered: