OLAP leading to SIGSEGV based crashes #149

Open
Dan-RAI opened this issue Feb 29, 2024 · 0 comments

Dan-RAI commented Feb 29, 2024

With set prefer_olap = 'on' we observe process crashes when running TPC-H benchmark queries (for instance Q2) in parallel with more than 10 clients on a single coordinator, already at scale factor 10. The time until a crash occurs drops sharply as the number of clients grows: with more than 200 clients, crashes appear after a few seconds. (If useful, we can provide scripts to reproduce this issue.)

Memory appears to get corrupted. In every crash, the first non-null entry of the memory freelist points to an inaccessible address (here 0x10):

freelist = {0x0, 
    0x10, 0x0, 0x0, 0x0, 0x7fca55abbfd0, 0x0, 0x0, 0x0, 0x23fae98, 0x0}

This results in a SIGSEGV during memory allocation.
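
To see why a 16-byte request would land on the corrupted entry, the sketch below mirrors the size-to-freelist mapping of stock PostgreSQL's aset.c. It assumes this code base keeps the upstream constants (ALLOC_MINBITS = 3 and 11 freelists, which matches the 11 entries in the dump above). The names alloc_free_index and head_looks_corrupt and the 4096-byte threshold are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match upstream PostgreSQL aset.c; this code base may differ. */
#define ALLOC_MINBITS          3    /* smallest chunk is 1 << 3 = 8 bytes */
#define ALLOCSET_NUM_FREELISTS 11   /* the dump above shows 11 entries */

/*
 * Map a request size to its freelist slot: slot k serves chunks of up to
 * (1 << (k + ALLOC_MINBITS)) bytes. Only meaningful for sizes up to the
 * largest freelist chunk (8192 bytes with these constants).
 */
static int alloc_free_index(size_t size)
{
    int idx = 0;

    if (size > ((size_t) 1 << ALLOC_MINBITS))
    {
        size_t tsize = (size - 1) >> ALLOC_MINBITS;

        /* Count the bits needed to represent tsize. */
        while (tsize != 0)
        {
            idx++;
            tsize >>= 1;
        }
    }
    assert(idx < ALLOCSET_NUM_FREELISTS);
    return idx;
}

/*
 * Hypothetical plausibility check: a non-null freelist head inside the
 * first page (assumed 4096 bytes, normally unmapped) cannot be a real chunk.
 */
static int head_looks_corrupt(uintptr_t head)
{
    return head != 0 && head < 4096;
}
```

Under these assumptions, alloc_free_index(16) evaluates to 1, so the size=16 request in frame #0 of the stack trace would take its chunk from freelist[1], the entry holding 0x10. Dereferencing that bogus chunk pointer would be consistent with the fault in AllocSetAlloc.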

Stack trace:

#0  AllocSetAlloc (context=0x238ef18, size=16) at aset.c:707
#1  0x0000000000990f78 in palloc (size=size@entry=16) at mcxt.c:935
#2  0x0000000000724bb4 in new_list (type=type@entry=T_IntList) at list.c:68
#3  0x0000000000724d45 in lappend_int (list=list@entry=0x0, datum=4) at list.c:151
#4  0x0000000000677d56 in ExecInitQual (qual=<optimized out>, parent=parent@entry=0x24d0378) at execExpr.c:206
#5  0x000000000069d432 in ExecInitIndexScan (node=node@entry=0x24151a0, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeIndexscan.c:931
#6  0x0000000000684f76 in ExecInitNode (node=0x24151a0, estate=estate@entry=0x23f65a0, eflags=1) at execProcnode.c:225
#7  0x00000000006a6418 in ExecInitNestLoop (node=node@entry=0x2414620, estate=estate@entry=0x23f65a0, eflags=<optimized out>, eflags@entry=1)
    at nodeNestloop.c:338
#8  0x00000000006850aa in ExecInitNode (node=0x2414620, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:298
#9  0x00000000006a63f6 in ExecInitNestLoop (node=node@entry=0x2414190, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeNestloop.c:333
#10 0x00000000006850aa in ExecInitNode (node=0x2414190, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:298
#11 0x00000000006a63f6 in ExecInitNestLoop (node=node@entry=0x24132d8, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeNestloop.c:333
#12 0x00000000006850aa in ExecInitNode (node=0x24132d8, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:298
#13 0x000000000069116b in ExecInitAgg (node=node@entry=0x24131c0, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at nodeAgg.c:3911
#14 0x000000000068512e in ExecInitNode (node=0x24131c0, estate=estate@entry=0x23f65a0, eflags=eflags@entry=1) at execProcnode.c:331
#15 0x00000000006df01a in ExecShutdownRemoteSubplan (node=node@entry=0x23f71d0) at execRemote.c:11373
#16 0x0000000000684e11 in ExecShutdownNode (node=0x23f71d0) at execProcnode.c:873
#17 0x00000000007247cf in planstate_tree_walker (planstate=planstate@entry=0x23f6bc8, walker=walker@entry=0x684d9d <ExecShutdownNode>, 
    context=context@entry=0x0) at nodeFuncs.c:3784
#18 0x0000000000684dc5 in ExecShutdownNode (node=0x23f6bc8) at execProcnode.c:856
#19 0x00000000007205b6 in planstate_walk_subplans (plans=<optimized out>, walker=walker@entry=0x684d9d <ExecShutdownNode>, context=context@entry=0x0)
    at nodeFuncs.c:3864
#20 0x0000000000724837 in planstate_tree_walker (planstate=planstate@entry=0x245f370, walker=walker@entry=0x684d9d <ExecShutdownNode>, 
    context=context@entry=0x0) at nodeFuncs.c:3844
#21 0x0000000000684dc5 in ExecShutdownNode (node=0x245f370) at execProcnode.c:856
#22 0x00000000007247cf in planstate_tree_walker (planstate=planstate@entry=0x245eed8, walker=walker@entry=0x684d9d <ExecShutdownNode>, 
    context=context@entry=0x0) at nodeFuncs.c:3784
#23 0x0000000000684dc5 in ExecShutdownNode (node=node@entry=0x245eed8) at execProcnode.c:856
#24 0x000000000067ee42 in ExecutePlan (estate=estate@entry=0x23f65a0, planstate=0x245eed8, use_parallel_mode=<optimized out>, 
    operation=operation@entry=CMD_SELECT, sendTuples=sendTuples@entry=1 '\001', numberTuples=numberTuples@entry=0, direction=ForwardScanDirection, 
    dest=0x22c3948, execute_once=1 '\001') at execMain.c:2063
#25 0x000000000067f0a9 in standard_ExecutorRun (queryDesc=0x2313a50, direction=ForwardScanDirection, count=0, execute_once=<optimized out>) at execMain.c:466
#26 0x000000000067f163 in ExecutorRun (queryDesc=queryDesc@entry=0x2313a50, direction=direction@entry=ForwardScanDirection, count=count@entry=0, 
    execute_once=<optimized out>) at execMain.c:409
#27 0x0000000000861ed9 in PortalRunSelect (portal=portal@entry=0x2260510, forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807, 
    dest=dest@entry=0x22c3948) at pquery.c:1722
#28 0x000000000086438a in PortalRun (portal=portal@entry=0x2260510, count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1 '\001', 
    run_once=<optimized out>, dest=dest@entry=0x22c3948, altdest=altdest@entry=0x22c3948, completionTag=0x7ffe6a2b6b50 "") at pquery.c:1362
#29 0x000000000085fb15 in exec_execute_message (portal_name=portal_name@entry=0x22c3530 "p_1_1dfd6c_2_79f38aea", max_rows=9223372036854775807, 
    max_rows@entry=0) at postgres.c:3065
#30 0x0000000000860c65 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x20853d0, dbname=<optimized out>, username=<optimized out>) at postgres.c:5645
#31 0x00000000007d3a48 in BackendRun (port=port@entry=0x20fb6b0) at postmaster.c:5034
#32 0x00000000007d5b3f in BackendStartup (port=port@entry=0x20fb6b0) at postmaster.c:4706
#33 0x00000000007d5d41 in ServerLoop () at postmaster.c:1963
#34 0x00000000007d7058 in PostmasterMain (argc=argc@entry=5, argv=argv@entry=0x20835a0) at postmaster.c:1571
#35 0x000000000072052f in main (argc=5, argv=0x20835a0) at main.c:233

The database itself logs error messages like these:

ERROR:  Failed to receive more data from data node 16394
WARNING:  combiner is not prepared for instrumentation
WARNING:  pgxc_abort_connections dn node:dn6 invalid socket 4294967295!
ERROR:  node:dn2, backend_pid:4190542, nodename:dn1,backend_pid:3367739,message:Failed to receive more data from data node 16394
ERROR:  Failed to receive more data from data node 16394
WARNING:  pgxc_abort_connections dn node:dn6 invalid socket 4294967295!