
Memory corruption during CPU intensive work #93

Open
yazun opened this issue Jan 18, 2021 · 3 comments



yazun commented Jan 18, 2021

After roughly 10 hours of fairly intensive, mostly in-memory data crunching (50-60% CPU load, near-zero I/O load), we see a crash with a core dump as below:

Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: dr3_ops_cs36 surveys 192.168.168.154(34674) REMOTE SUBPLAN (coord4:1'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000008aff06 in CopyDataRowTupleToSlot (slot=slot@entry=0x1bcdda0, combiner=<optimized out>) at execRemote.c:1843
1843    execRemote.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libselinux-2.5-15.el7.x86_64 libxml2-2.9.1-6.el7.4.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  0x00000000008aff06 in CopyDataRowTupleToSlot (slot=slot@entry=0x1bcdda0, combiner=<optimized out>) at execRemote.c:1843
#1  0x00000000008b3e72 in FetchTuple (combiner=combiner@entry=0x1bcd7e0) at execRemote.c:2144
#2  0x00000000008bd728 in ExecRemoteSubplan (pstate=0x1bcd7e0) at execRemote.c:10744
#3  0x0000000000904acc in ExecProcNode (node=0x1bcd7e0) at ../../../src/include/executor/executor.h:273
#4  fetch_input_tuple (aggstate=aggstate@entry=0x1bcd018) at nodeAgg.c:725
#5  0x000000000091354d in agg_retrieve_direct (aggstate=<optimized out>) at nodeAgg.c:3312
#6  ExecAgg (pstate=<optimized out>) at nodeAgg.c:3022
#7  0x0000000000906672 in ExecProcNode (node=0x1bcd018) at ../../../src/include/executor/executor.h:273
#8  ExecMaterial (pstate=0x1bccca8) at nodeMaterial.c:134
#9  0x000000000091cd7c in ExecProcNode (node=0x1bccca8) at ../../../src/include/executor/executor.h:273
#10 ExecNestLoop (pstate=0x1bbb020) at nodeNestloop.c:170
#11 0x00000000009480df in ExecProcNode (node=0x1bbb020) at ../../../src/include/executor/executor.h:273
#12 ExecutePlan (execute_once=<optimized out>, dest=0x1898f18, direction=<optimized out>, numberTuples=0, sendTuples=<optimized out>, operation=CMD_SELECT, use_parallel_mode=<optimized out>, planstate=0x1bbb020, estate=0x1bb9c08) at execMain.c:1955
#13 standard_ExecutorRun (queryDesc=0x19e6c18, direction=<optimized out>, count=0, execute_once=<optimized out>) at execMain.c:465
#14 0x00000000006d034e in AdvanceProducingPortal (portal=portal@entry=0x19e3398, can_wait=can_wait@entry=0 '\000') at pquery.c:2592
#15 0x00000000006d2f27 in PortalRun (portal=0x19e3398, count=<optimized out>, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x19968e8, altdest=0x19968e8, completionTag=0x7ffe096d9730 "") at pquery.c:1344
#16 0x0000000000705d53 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x19964d8 "p_7_4a39_3_137b0456") at postgres.c:2958
#17 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:5507
#18 0x000000000079c4ed in BackendRun (port=0x18898b0) at postmaster.c:4979
#19 BackendStartup (port=0x18898b0) at postmaster.c:4651
#20 ServerLoop () at postmaster.c:1956
#21 0x000000000079d366 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1564
#22 0x0000000000497c53 in main (argc=5, argv=0x1855680) at main.c:228
(gdb)

It has already happened twice, so this seems to be a highly probable scenario, and it happens with no RAM pressure.

The crash seems to come from a corrupted pointer (note currentRow = 0xf3 in the combiner dump below). The offending line:

datarow = (RemoteDataRow) palloc(sizeof(RemoteDataRowData) + combiner->currentRow->msglen);
(gdb) p combiner->currentRow->msglen
value has been optimized out
(gdb) up
#1  0x00000000008b3e72 in FetchTuple (combiner=combiner@entry=0x1bcd7e0) at execRemote.c:2144
2144    in execRemote.c
(gdb) p *combiner
$4 = {ss = {ps = {type = T_RemoteSubplanState, plan = 0x188d7d8, state = 0x1bb9c08, ExecProcNode = 0x8bd660 <ExecRemoteSubplan>, ExecProcNodeReal = 0x8bd660 <ExecRemoteSubplan>, instrument = 0x0, worker_instrument = 0x0, qual = 0x0, lefttree = 0x0, righttree = 0x0, initPlan = 0x0, subPlan = 0x0, chgParam = 0x0, ps_ResultTupleSlot = 0x1bcdda0, ps_ExprContext = 0x1c92218, ps_ProjInfo = 0x0, skip_data_mask_check = 0 '\000', audit_fga_qual = 0x0}, ss_currentRelation = 0x0,
    ss_currentScanDesc = 0x0, ss_ScanTupleSlot = 0x0, ss_currentMaskDesc = 0x0, inited = 0 '\000'}, node_count = 0, connections = 0x1bce528, conn_count = 1, current_conn = 0, current_conn_rows_consumed = 1, combine_type = COMBINE_TYPE_NONE, command_complete_count = 11, request_type = REQUEST_TYPE_QUERY, tuple_desc = 0x0, description_count = 0, copy_in_count = 0, copy_out_count = 0, copy_file = 0x0, processed = 0, errorCode = "\000\000\000\000", errorMessage = 0x0,
  errorDetail = 0x0, errorHint = 0x0, returning_node = 0, currentRow = 0xf3, rowBuffer = 0x7f7f988e05f8, tapenodes = 0x0, tapemarks = 0x7f7f988e07c8, prerowBuffers = 0x0, dataRowBuffer = 0x0, dataRowMemSize = 0x7f7f988e0898, nDataRows = 0x0, tmpslot = 0x0, errorNode = 0x0, backend_pid = 0, is_abort = 0 '\000', merge_sort = 0 '\000', extended_query = 1 '\001', probing_primary = 0 '\000', tuplesortstate = 0x0, remoteCopyType = REMOTE_COPY_NONE, tuplestorestate = 0x0,
  cursor = 0x7f7f98bc0fe8 "p_7_4a39_2_137b044d", update_cursor = 0x0, cursor_count = 12, cursor_connections = 0x7f7f988e01d8, recv_node_count = 12, recv_tuples = 0, recv_total_time = -1, DML_processed = 0, conns = 0x0, ccount = 0, recv_datarows = 0}
(gdb) p combiner->currentRow->msglen
Cannot access memory at address 0xf7
(gdb) p *combiner->currentRow
Cannot access memory at address 0xf3
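
To make the failure mode concrete: currentRow holds the garbage value 0xf3, and the follow-up read faults at 0xf7, which suggests msglen sits 4 bytes into the row struct in this build; the palloc size expression therefore dereferences a wild pointer before any allocation happens. Below is a standalone sketch of this failure class (simplified stand-ins for the combiner and RemoteDataRowData, not actual source from this tree):

/*
 * Standalone sketch, not TBase code: a clobbered pointer field is
 * dereferenced to size an allocation, reproducing the fault pattern
 * from the core dump above.
 */
#include <stdio.h>
#include <stdlib.h>

typedef struct RowData          /* simplified stand-in for RemoteDataRowData */
{
    int   msgnode;              /* hypothetical leading field */
    int   msglen;               /* at offset 4, matching 0xf3 + 4 = 0xf7 */
    char  msg[];
} RowData;

typedef struct Combiner         /* simplified stand-in for the combiner */
{
    RowData *currentRow;
} Combiner;

int main(void)
{
    Combiner combiner = { 0 };

    combiner.currentRow = malloc(sizeof(RowData) + 16);
    combiner.currentRow->msglen = 16;

    /* Simulate the corruption seen in the core: the pointer slot is
     * overwritten with a small garbage value (currentRow = 0xf3). */
    combiner.currentRow = (RowData *) 0xf3;

    /* Analogue of the crashing line: the size expression reads msglen
     * through the wild pointer and faults by design, at 0xf3 + 4. */
    printf("alloc size = %zu\n",
           sizeof(RowData) + combiner.currentRow->msglen);
    return 0;
}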

Any idea if this could be fixed?

The queries are similar and involve index lookups, a q3c index, and a lateral join with aggregates inside the lateral subquery.


yazun commented Jan 18, 2021

There are roughly 250 processes running, so around 700-800 active processes per datanode.


yazun commented Jan 18, 2021

I should also mention that we use a nonstandard block size (16KB) and have both sender_thread_batch_size and sender_thread_buffer_size set to 64.
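
For reference, roughly where these settings live (the page size is the stock PostgreSQL configure switch; the two sender_thread settings are assumed to be plain postgresql.conf GUCs, with the values quoted above):

# build time: 16KB pages instead of the default 8KB
./configure --with-blocksize=16 ...

# postgresql.conf
sender_thread_batch_size = 64
sender_thread_buffer_size = 64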


yazun commented Jan 21, 2021

We had another crash under the same load, but in a different place:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: dr3_ops_cs36 surveys 192.168.168.154(20672) REMOTE SUBPLAN (coord10:'.
Program terminated with signal 11, Segmentation fault.
#0  pfree (pointer=<optimized out>, pointer=<optimized out>, pointer=<optimized out>) at mcxt.c:1027
1027    mcxt.c: No such file or directory.
Missing separate debuginfos, use: debuginfo-install R-core-3.6.0-1.el7.x86_64 bzip2-libs-1.0.6-13.el7.x86_64 cyrus-sasl-lib-2.1.26-23.el7.x86_64 glibc-2.17-307.el7.1.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-46.el7.x86_64 libcom_err-1.42.9-17.el7.x86_64 libgcc-4.8.5-39.el7.x86_64 libgfortran-4.8.5-39.el7.x86_64 libgomp-4.8.5-39.el7.x86_64 libicu-50.2-4.el7_7.x86_64 libquadmath-4.8.5-39.el7.x86_64 libselinux-2.5-15.el7.x86_64 libstdc++-4.8.5-39.el7.x86_64 libxml2-2.9.1-6.el7.4.x86_64 ncurses-libs-5.9-14.20130511.el7_4.x86_64 nspr-4.21.0-1.el7.x86_64 nss-3.44.0-7.el7_7.x86_64 nss-softokn-freebl-3.44.0-8.el7_7.x86_64 nss-util-3.44.0-4.el7_7.x86_64 openblas-Rblas-0.3.3-2.el7.x86_64 openldap-2.4.44-21.el7_6.x86_64 openssl-libs-1.0.2k-19.el7.x86_64 pcre-8.32-17.el7.x86_64 pcre2-10.23-2.el7.x86_64 readline-6.2-11.el7.x86_64 tre-0.8.0-18.20140228gitc2f5d13.el7.x86_64 xz-libs-5.2.2-1.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0  pfree (pointer=<optimized out>, pointer=<optimized out>, pointer=<optimized out>) at mcxt.c:1027
#1  0x00000000009038d9 in heap_freetuple (htup=<optimized out>) at heaptuple.c:1827
#2  ExecClearTuple (slot=0x2450830) at execTuples.c:499
#3  0x0000000000931ced in ExecEndCteScan (node=0x2450320) at nodeCtescan.c:291
#4  ExecEndNode (node=<optimized out>) at execProcnode.c:741
#5  0x0000000000931dcf in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#6  ExecEndNestLoop (node=0x2356620) at nodeNestloop.c:396
#7  ExecEndNode (node=<optimized out>) at execProcnode.c:764
#8  0x0000000000931dcf in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#9  ExecEndNestLoop (node=0x23577d8) at nodeNestloop.c:396
#10 ExecEndNode (node=<optimized out>) at execProcnode.c:764
#11 0x0000000000939960 in ExecEndNode (node=<optimized out>) at execProcnode.c:632
#12 ExecEndPlan (estate=0x2355958, planstate=<optimized out>) at execMain.c:1823
#13 standard_ExecutorEnd (queryDesc=0x2117688) at execMain.c:597
#14 0x00000000009978fc in PortalCleanup (portal=0x2113e08) at portalcmds.c:398
#15 0x00000000005143be in MarkPortalFailed (portal=<optimized out>, portal=<optimized out>, portal=<optimized out>) at portalmem.c:542
#16 0x00000000006d3698 in PortalRun (portal=0x2113e08, count=0, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x22170e8, altdest=0x22170e8, completionTag=0x7fff24850a80 "") at pquery.c:1510
#17 0x0000000000705d53 in exec_execute_message (max_rows=9223372036854775807, portal_name=0x2216cd8 "p_2_9333_4_d80da54") at postgres.c:2958
#18 PostgresMain (argc=<optimized out>, argv=<optimized out>, dbname=<optimized out>, username=<optimized out>) at postgres.c:5507
#19 0x000000000079c4ed in BackendRun (port=0x2118700) at postmaster.c:4979
#20 BackendStartup (port=0x2118700) at postmaster.c:4651
#21 ServerLoop () at postmaster.c:1956
#22 0x000000000079d366 in PostmasterMain (argc=5, argv=<optimized out>) at postmaster.c:1564
#23 0x0000000000497c53 in main (argc=5, argv=0x20d36b0) at main.c:228
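
One way to catch corruption like this closer to its source would be an assert-enabled build (./configure --enable-cassert): it turns on CLOBBER_FREED_MEMORY and MEMORY_CONTEXT_CHECKING, so freed memory is poisoned with 0x7f bytes and context integrity can be verified with MemoryContextCheck(). A minimal sketch of a check one could drop near the crash site (the placement is illustrative, not a patch):

#ifdef MEMORY_CONTEXT_CHECKING
    /*
     * Sketch: verify the current context (and its children) before the
     * tuple slot is cleared, so a scribbled chunk header is reported
     * here rather than as a segfault inside pfree()/heap_freetuple().
     */
    MemoryContextCheck(CurrentMemoryContext);
#endif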

yazun changed the title from "Memory corruption after some CPU intensive work" to "Memory corruption during CPU intensive work" on Jan 21, 2021