carbon-c-relay hanging with no indication of an issue #250

Closed

tehlers320 opened this issue Feb 13, 2017 · 39 comments

@tehlers320

tehlers320 commented Feb 13, 2017

I have a relay-to-relay setup for aggregation. A few times now since we moved to 2.6 (from 2.3), the process has hung with no errors or warnings to indicate what's happening.

The only error I have is from the first relay.

[2017-02-13 16:40:18] (ERR) failed to write() to 127.0.0.1:2009: Resource temporarily unavailable
[2017-02-13 16:40:18] (ERR) server 127.0.0.1:2009: OK
[2017-02-13 16:40:20] (ERR) failed to write() to 127.0.0.1:2009: Connection reset by peer
[2017-02-13 16:40:26] (ERR) server 127.0.0.1:2009: OK

We moved to 2.6 to use the validator; that is the only config change between the setup that was stable on 2.3 and the 2.6 setup.

cluster graphite
    carbon_ch
        10.1.1.2:2005=grph001
        10.1.1.3:2005=grph002
        10.1.1.4:2005=grph003
        10.1.1.5:2005=grph004
   ;

match *
    validate ^[0-9.e+-]+\ [0-9.e+-]+$ else drop
    ;

match ^(stats\.gauges|stats_counts)\.([A-Z]+)..*
    send to blackhole
    stop
    ;

match ^(stats\.gauges|stats_counts)\.[\w&\d&\-]+[A-Z]..*
    send to blackhole
    stop
    ;

match ^stats\.gauges\.[\w&\-]*\.\w{1,8}\..*
    send to blackhole
    stop
    ;

match ^(stats\.gauges|stats_counts)\.([^.]+)\.([^.]+)\..*(ip-.+)
    send to blackhole
    stop
    ;
match ^stats_counts\.gauges..*
    send to blackhole
    stop
    ;

####### aggregates #######

aggregate
    ^stats\.gauges\.([^.]+)\.([^.]+)\.(.*)
    every 30 seconds
    expire after 65 seconds
    timestamp at end of bucket
    compute avg write to
    prod.us-west-2.stats.gauges.\1.aggregator.\3
    compute max write to
    prod.us-west-2.stats.gauges.\1.aggregator.\3.upper
    compute min write to
    prod.us-west-2.stats.gauges.\1.aggregator.\3.lower
    ;

aggregate
    ^stats_counts\.([^.]+)\.([^.]+)\.(.*)
    every 30 seconds
    expire after 65 seconds
    timestamp at end of bucket
    compute sum write to
    prod.us-west-2.stats_counts.\1.aggregator.\3
    ;


######## matches ########
match *
    validate ^[0-9.e+-]+\ [0-9.e+-]+$ else drop
    send to graphite
    stop
    ;
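
As a side note on the validate rule: the expression is a regular expression applied to the value/timestamp part of each metric line (the data after the metric name). A minimal standalone sketch to check which payloads it accepts, assuming POSIX extended regex semantics; this is not carbon-c-relay's actual code path:

/* Minimal sketch (not carbon-c-relay's code): check which "value timestamp"
 * payloads the validate expression above accepts, assuming POSIX extended
 * regular expression semantics. */
#include <regex.h>
#include <stdio.h>

int main(void) {
    const char *expr = "^[0-9.e+-]+ [0-9.e+-]+$"; /* expression from the config */
    const char *samples[] = {
        "3.14 1488557174",             /* ordinary value: accepted */
        "2.964393875e-314 1488557174", /* tiny Dropwizard rate: accepted */
        "nan 1488557174",              /* letters other than 'e': dropped */
    };
    regex_t re;
    if (regcomp(&re, expr, REG_EXTENDED | REG_NOSUB) != 0) {
        fprintf(stderr, "regcomp failed\n");
        return 1;
    }
    for (int i = 0; i < 3; i++)
        printf("%-30s -> %s\n", samples[i],
               regexec(&re, samples[i], 0, NULL, 0) == 0 ? "pass" : "drop");
    regfree(&re);
    return 0;
}

Both ordinary values and the subnormal Dropwizard rates mentioned further down pass this expression, so the validator itself should not be rejecting them.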

Startup settings:

root      4100     1  5 Feb09 ?        04:58:21 /usr/bin/carbon-c-relay -B 10 -H ip-10-1-1-5-monitoring-proxy-2005 -f /etc/carbon-c-relay-2005.conf -w 4 -q 25600000 -b 5000 -p 2005 -D -l /var/log/carbon-relay/carbon-c-relay-2005.log -P /var/run/carbon-c-relay-2005.pid
root     22820     1 99 16:40 ?        00:13:55 /usr/bin/carbon-c-relay -B 10 -H ip-10-1-1-5-monitoring-proxy-2009 -f /etc/carbon-c-relay-2009.conf -w 4 -q 25600000 -b 5000 -p 2009 -D -l /var/log/carbon-relay/carbon-c-relay-2009.log -P /var/run/carbon-c-relay-2009.pid

Our systems are pretty busy, so that could be a contributing factor. We run these all on m4.xlarge instances (4 cores, 16 GB of RAM).

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                        
 4100 root      20   0 5899m 5.5g 1868 S  4.7 35.0 298:00.85 carbon-c-relay                                                                                  
22820 root      20   0 1288m 947m 1828 S 105.1  5.9  11:19.70 carbon-c-relay 

I just wanted to open a bug in case it is a real issue and not something we have done on our side. We will remove the validator on a couple of nodes for testing, but there is a long lag time before a node hangs.

@grobian grobian added the bug label Feb 13, 2017
@tehlers320
Author

It occurred again on a node where we left validation on.

This time I grabbed an strace of the pid and all of its threads for 60 seconds with:

timeout 60 strace -f -p 4163 -p 4166 -p 4167 -p 4168 -p 4169 -p 4170 -p 4171 -p 4172 -p 4173 -p 4174 -p 4175 -p 4176 -p 4177

carbon-c-relay-strace.txt

@grobian
Owner

grobian commented Feb 18, 2017

Are the two instances there just to cover the two ports?

@grobian
Owner

grobian commented Feb 18, 2017

And do you know which instance is no longer producing results? The "idle" one or the busy one?

@grobian
Owner

grobian commented Feb 18, 2017

This could be related to issue #242.

@tehlers320
Author

Relay 2005 sits in front of it and just sends on to graphite; relay 2009, shown here, is the one hanging. I separated my aggregation:

2005 - just relay and validate
2009 - aggregation

So far no 2005 relay has hung, but it's very sporadic; there have been no hangs since the last post. Traffic to 2009 is TCP only, as it should only come from the relay.

@grobian
Owner

grobian commented Feb 21, 2017

I see. So it must be an aggregator issue.

@tehlers320
Author

I am not sure; so far there has not been another hang. I have 4 of them in the zone where this has happened. I put 2 on 2.3 and left 2 on 2.6 for further testing.

@grobian
Owner

grobian commented Feb 22, 2017

Ok, that's interesting. Did you build the versions yourself? It would be nice to get a stack trace or something from the relay while it is "hung"; that might give a clue as to what it is blocked on.

@tehlers320
Author

Yes, I built from the tagged branch https://github.com/grobian/carbon-c-relay/tree/v2.6

Did you mean a core dump? I have an strace above at https://github.com/grobian/carbon-c-relay/files/774433/carbon-c-relay-strace.txt

I'll gather both if/when it happens. I'm also wondering if there is a way to speed up getting to a hang, maybe by bursting traffic to it; I'm not sure.

@tehlers320 tehlers320 changed the title carbon-c-relay 2.6 hanging with no indication of an issue carbon-c-relay hanging with no indication of an issue Feb 28, 2017
@tehlers320
Author

tehlers320 commented Feb 28, 2017

Well, the hang has now occurred on 2.3, so this is not a 2.6 issue.

Thread 13 (Thread 0x7f36cd4f2700 (LWP 16290)):
#0  0x00007f36ff436dad in poll () from /lib64/libc.so.6
#1  0x0000000000406918 in ?? ()
#2  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 12 (Thread 0x7f36cccf1700 (LWP 16291)):
#0  0x00007f36ff71703e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0000000000412665 in ?? ()
#2  0x000000000040e27b in ?? ()
#3  0x000000000040ea6a in ?? ()
#4  0x0000000000406f15 in ?? ()
#5  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 11 (Thread 0x7f36cc4f0700 (LWP 16292)):
#0  0x00007f36ff71703e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0000000000412665 in ?? ()
#2  0x000000000040e27b in ?? ()
#3  0x000000000040ea6a in ?? ()
#4  0x0000000000406f15 in ?? ()
#5  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 10 (Thread 0x7f36cbcef700 (LWP 16293)):
#0  0x00007f36ff71703e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0000000000412665 in ?? ()
#2  0x000000000040e27b in ?? ()
#3  0x000000000040ea6a in ?? ()
#4  0x0000000000406f15 in ?? ()
#5  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 9 (Thread 0x7f36cb4ee700 (LWP 16294)):
#0  0x00007f36ff71703e in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x0000000000412665 in ?? ()
#2  0x000000000040e27b in ?? ()
#3  0x000000000040ea6a in ?? ()
#4  0x0000000000406f15 in ?? ()
#5  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 8 (Thread 0x7f36caced700 (LWP 16295)):
#0  0x00007f36ff71a43d in write () from /lib64/libpthread.so.0
#1  0x000000000040f681 in ?? ()
#2  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 7 (Thread 0x7f36ca4ec700 (LWP 16296)):
#0  0x00007f36ff71a43d in write () from /lib64/libpthread.so.0
#1  0x0000000000411fa8 in ?? ()
#2  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#3  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 6 (Thread 0x7f36c9ceb700 (LWP 16297)):
#0  0x00007f36ff40865d in nanosleep () from /lib64/libc.so.6
#1  0x00007f36ff4084f4 in sleep () from /lib64/libc.so.6
#2  0x00000000004106db in ?? ()
#3  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 5 (Thread 0x7f36c94ea700 (LWP 16298)):
#0  0x00007f36ff40865d in nanosleep () from /lib64/libc.so.6
#1  0x00007f36ff439274 in usleep () from /lib64/libc.so.6
#2  0x000000000040f824 in ?? ()
#3  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 4 (Thread 0x7f36c8ce9700 (LWP 16299)):
#0  0x00007f36ff40865d in nanosleep () from /lib64/libc.so.6
#1  0x00007f36ff439274 in usleep () from /lib64/libc.so.6
#2  0x000000000040f824 in ?? ()
#3  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x7f36c84e8700 (LWP 16300)):
#0  0x00007f36ff40865d in nanosleep () from /lib64/libc.so.6
#1  0x00007f36ff439274 in usleep () from /lib64/libc.so.6
#2  0x000000000040f824 in ?? ()
#3  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f36c7ce7700 (LWP 16301)):
#0  0x00007f36ff40865d in nanosleep () from /lib64/libc.so.6
#1  0x00007f36ff439274 in usleep () from /lib64/libc.so.6
#2  0x000000000040f824 in ?? ()
#3  0x00007f36ff713dc5 in start_thread () from /lib64/libpthread.so.0
#4  0x00007f36ff4416ed in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f36ffe3c740 (LWP 16269)):
#0  0x00007f36ff40865d in nanosleep () from /lib64/libc.so.6
#1  0x00007f36ff4084f4 in sleep () from /lib64/libc.so.6
#2  0x000000000040368d in ?? ()
#3  0x00007f36ff36bb35 in __libc_start_main () from /lib64/libc.so.6
#4  0x0000000000402249 in ?? ()
#5  0x00007ffea8061478 in ?? ()
#6  0x000000000000001c in ?? ()
#7  0x0000000000000014 in ?? ()
#8  0x00007ffea8062e90 in ?? ()
#9  0x00007ffea8062ea8 in ?? ()
#10 0x00007ffea8062eab in ?? ()
#11 0x00007ffea8062eae in ?? ()
#12 0x00007ffea8062eb1 in ?? ()
#13 0x00007ffea8062ed6 in ?? ()
#14 0x00007ffea8062ed9 in ?? ()
#15 0x00007ffea8062ef7 in ?? ()
#16 0x00007ffea8062efa in ?? ()
#17 0x00007ffea8062efc in ?? ()
#18 0x00007ffea8062eff in ?? ()
#19 0x00007ffea8062f08 in ?? ()
#20 0x00007ffea8062f0b in ?? ()
#21 0x00007ffea8062f10 in ?? ()
#22 0x00007ffea8062f13 in ?? ()
#23 0x00007ffea8062f18 in ?? ()
#24 0x00007ffea8062f1b in ?? ()
#25 0x00007ffea8062f1e in ?? ()
#26 0x00007ffea8062f4c in ?? ()
#27 0x00007ffea8062f4f in ?? ()
#28 0x0000000000000000 in ?? ()

I have a core dump this time, but it's 405 MB.

@tehlers320
Author

tehlers320 commented Feb 28, 2017

Just a shot in the dark, but some of Dropwizard's "OneMinuteRate" metrics turn into values like this after days of inactivity:

2.964393875e-314

I wonder if trying to perform an aggregate on such an extreme float is a problem.
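
For reference, 2.964393875e-314 is an extremely small, subnormal double rather than a large one. A quick standalone check (not carbon-c-relay's parsing path) showing that strtod() handles it and that averaging such values stays finite:

/* Quick standalone check (not carbon-c-relay code): the Dropwizard value
 * above is a subnormal double; parsing and averaging it behave normally. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    double v = strtod("2.964393875e-314", NULL);
    printf("value: %g, subnormal: %s\n", v,
           fpclassify(v) == FP_SUBNORMAL ? "yes" : "no");

    double sum = 0.0;                  /* simulate one 30-second avg bucket */
    for (int i = 0; i < 30; i++)
        sum += v;
    printf("avg: %g, finite: %s\n", sum / 30.0,
           isfinite(sum / 30.0) ? "yes" : "no");
    return 0;
}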

@tehlers320
Author

tehlers320 commented Mar 1, 2017

I found something interesting, but it may be unrelated. In the main graphite tree, some pids seem to have dumped random chunks of their aggregated metrics. I may not have noticed this before, since we mostly use Grafana and there is a cleanup mechanism in place for non-writing metrics.

_aggregator_stub_0x10ee570__production.us-west-2.stats.gauges...dropwizardstuff

Exploring the dumped tree, there are averages/min/max/sum, so it is not consistently one or the other.

@grobian
Owner

grobian commented Mar 3, 2017

that's not supposed to happen :(

Did you by chance reload the relay? Any idea if that metric could be generated by 2.6?

@tehlers320
Author

tehlers320 commented Mar 3, 2017

No, I wasn't careful enough to note the crash time along with which instance it was. I have moved 100% back to 2.6 now, though, after seeing 2.3 crash.

Would this core dump (from 2.3) be useful? I just don't know where to send it at 405 MB!

EDIT:
One of the 2.6 relays was hung this morning; a stub did not get sent to the graphite tree. Also, yes, we have Chef automation that reloads the relays from time to time when updates go in.

@grobian
Owner

grobian commented Mar 3, 2017

I don't think the core dump is of much use; the only useful aspect of it is the backtrace. The bulk of the dump is the send buffers and/or aggregation state, which are probably sensitive data, so I'd first like to try without it.

@tehlers320
Author

Sorry, I've not used gdb before.

I removed some of the names from the metric string in the putmetric frame. But what's curious is that the whole stack stops; the internal metrics stop as well.

thread apply all bt

Thread 13 (Thread 0x7f25af4a6740 (LWP 22195)):
#0  0x00007f25aea7265d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f25aea724f4 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x00000000004036c5 in main (argc=<optimized out>, argv=<optimized out>) at relay.c:834

Thread 12 (Thread 0x7f2577b52700 (LWP 22208)):
#0  0x00007f25aed8443d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000410817 in server_queuereader (d=0x1582fd0) at server.c:453
#2  0x00007f25aed7ddc5 in start_thread (arg=0x7f2577b52700) at pthread_create.c:308
#3  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 11 (Thread 0x7f256fb52700 (LWP 22207)):
#0  0x00007f25aed8443d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000410817 in server_queuereader (d=0x157d670) at server.c:453
#2  0x00007f25aed7ddc5 in start_thread (arg=0x7f256fb52700) at pthread_create.c:308
#3  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 10 (Thread 0x7f2578353700 (LWP 22206)):
#0  0x00007f25aed8443d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000410817 in server_queuereader (d=0x1577d30) at server.c:453
#2  0x00007f25aed7ddc5 in start_thread (arg=0x7f2578353700) at pthread_create.c:308
#3  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 9 (Thread 0x7f2578b54700 (LWP 22205)):
#0  0x00007f25aed8443d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000410817 in server_queuereader (d=0x1572380) at server.c:453
#2  0x00007f25aed7ddc5 in start_thread (arg=0x7f2578b54700) at pthread_create.c:308
#3  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 8 (Thread 0x7f2579355700 (LWP 22204)):
#0  0x00007f25aea7265d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f25aea724f4 in __sleep (seconds=0) at ../sysdeps/unix/sysv/linux/sleep.c:137
#2  0x0000000000411a1b in collector_runner (s=0x15b3840) at collector.c:140
#3  0x00007f25aed7ddc5 in start_thread (arg=0x7f2579355700) at pthread_create.c:308
#4  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 7 (Thread 0x7f2579b56700 (LWP 22203)):
#0  0x00007f25aed8443d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00000000004132e8 in aggregator_expire (sub=0x1594cc0) at aggregator.c:491
---Type <return> to continue, or q <return> to quit---
#2  0x00007f25aed7ddc5 in start_thread (arg=0x7f2579b56700) at pthread_create.c:308
#3  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 6 (Thread 0x7f257a357700 (LWP 22202)):
#0  0x00007f25aea7265d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f25aeaa3274 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x0000000000410a8c in server_queuereader (d=0x15b3840) at server.c:121
#3  0x00007f25aed7ddc5 in start_thread (arg=0x7f257a357700) at pthread_create.c:308
#4  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 5 (Thread 0x7f257ab58700 (LWP 22201)):
#0  0x00007f25aea7265d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f25aeaa3274 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x0000000000406ef9 in dispatch_runner (arg=0x15b3580) at dispatcher.c:644
#3  0x00007f25aed7ddc5 in start_thread (arg=0x7f257ab58700) at pthread_create.c:308
#4  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 4 (Thread 0x7f257b359700 (LWP 22200)):
#0  0x00007f25aea7265d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f25aeaa3274 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x0000000000406ef9 in dispatch_runner (arg=0x15b32c0) at dispatcher.c:644
#3  0x00007f25aed7ddc5 in start_thread (arg=0x7f257b359700) at pthread_create.c:308
#4  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 3 (Thread 0x7f257bb5a700 (LWP 22199)):
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
#1  0x00000000004139a5 in aggregator_putmetric (s=0x1594cc0,
    metric=0x7f257cb6bd0d "stats.gauges.something1.something2-38fc7067bfc5.gauge.response.something3.something4.1500531.unreadmessages.count 0 1488557174\n",
    firstspace=0x7f257cb6bd8a " 0 1488557174\n", nmatch=<optimized out>, pmatch=0x7f257bb59a70) at aggregator.c:237
#2  0x000000000040f120 in router_route_intern (blackholed=0x7f257bb59ccf "", ret=0x7f257cb6dd10, curlen=0x7f257bb59cc0, retsize=64, srcaddr=0x7f257cb69ced "127.0.0.1", metric=<optimized out>,
    firstspace=0x7f257cb6bd8a " 0 1488557174\n", r=0x7f25af466413) at router.c:3111
#3  0x000000000040fb0a in router_route (rtr=<optimized out>, ret=<optimized out>, retcnt=0x7f257cb6e110, retsize=<optimized out>, srcaddr=<optimized out>, metric=<optimized out>,
    firstspace=0x7f257cb6bd8a " 0 1488557174\n") at router.c:3218
#4  0x0000000000406f95 in dispatch_connection (start=..., self=0x15b3000, conn=0x7f257cb69ce8) at dispatcher.c:448
#5  dispatch_runner (arg=0x15b3000) at dispatcher.c:633
#6  0x00007f25aed7ddc5 in start_thread (arg=0x7f257bb5a700) at pthread_create.c:308
#7  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 2 (Thread 0x7f257c35b700 (LWP 22198)):
#0  0x00007f25aea7265d in nanosleep () at ../sysdeps/unix/syscall-template.S:81
---Type <return> to continue, or q <return> to quit---
#1  0x00007f25aeaa3274 in usleep (useconds=<optimized out>) at ../sysdeps/unix/sysv/linux/usleep.c:32
#2  0x0000000000406ef9 in dispatch_runner (arg=0x1599010) at dispatcher.c:644
#3  0x00007f25aed7ddc5 in start_thread (arg=0x7f257c35b700) at pthread_create.c:308
#4  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

Thread 1 (Thread 0x7f257cb5c700 (LWP 22197)):
#0  0x00007f25aeaa0dad in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x0000000000406958 in dispatch_runner (arg=0x15b49c0) at dispatcher.c:583
#2  0x00007f25aed7ddc5 in start_thread (arg=0x7f257cb5c700) at pthread_create.c:308
#3  0x00007f25aeaab6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

@grobian
Owner

grobian commented Mar 3, 2017

One worker is active and in the process of acquiring a lock. The rest are doing nothing, waiting for data to arrive. The lock is likely blocking because the expiry thread is stuck in write() (while holding the lock).
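
To illustrate the pattern (a simplified standalone sketch, not carbon-c-relay's actual code): if the thread that flushes aggregates takes a rwlock and then blocks in write() because the downstream stops reading, every dispatcher that needs the same lock for aggregator_putmetric stalls behind it, matching the backtrace above:

/* Simplified illustration of the suspected stall (not the relay's actual
 * code): one thread takes a rwlock and then blocks in write() because the
 * peer stops reading; every other thread that needs the same lock hangs on
 * pthread_rwlock_wrlock(), as in the backtrace above. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAYLOAD (1 << 20)  /* 1 MiB: far more than a pipe buffer holds */

static pthread_rwlock_t agg_lock = PTHREAD_RWLOCK_INITIALIZER;
static int dest[2];        /* pipe standing in for a socket with a full send buffer */

static void *expiry_thread(void *arg) {
    (void)arg;
    char *chunk = malloc(PAYLOAD);
    memset(chunk, 'x', PAYLOAD);
    pthread_rwlock_wrlock(&agg_lock);  /* lock taken ...                    */
    write(dest[1], chunk, PAYLOAD);    /* ... and held while write() blocks */
    pthread_rwlock_unlock(&agg_lock);  /* only reached once the peer drains */
    free(chunk);
    return NULL;
}

static void *dispatcher_thread(void *arg) {
    (void)arg;
    pthread_rwlock_wrlock(&agg_lock);  /* every metric matching an aggregate
                                          queues up here behind the expiry thread */
    pthread_rwlock_unlock(&agg_lock);
    puts("dispatcher finally got the lock");
    return NULL;
}

int main(void) {
    if (pipe(dest) != 0)
        return 1;
    pthread_t exp, disp;
    pthread_create(&exp, NULL, expiry_thread, NULL);
    sleep(1);                          /* let the expiry thread grab the lock and stall */
    pthread_create(&disp, NULL, dispatcher_thread, NULL);
    sleep(1);
    puts("dispatcher is blocked on the rwlock; draining the 'socket' frees it");
    char sink[4096];
    ssize_t n, total = 0;
    while (total < PAYLOAD && (n = read(dest[0], sink, sizeof(sink))) > 0)
        total += n;                    /* simulate the downstream recovering */
    pthread_join(disp, NULL);
    pthread_join(exp, NULL);
    return 0;
}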

@tehlers320
Author

Well, this may be solved. I had a host hang this morning and came to realize that it may be hanging on getting a connection. The receiving servers are set to -B 10. I noticed the default is -B 32 now, so I changed them, and the "stuck/hung" pid unhung. I'll close this out in a couple of days if no hangs occur.
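
For context, -B is the listen backlog (the value handed to listen(2)), which bounds how many not-yet-accepted connections the kernel will queue for the receiving relay. A minimal illustrative listener, with hypothetical values and not taken from the relay's source:

/* Illustration of what the -B option controls on the listening side (not
 * carbon-c-relay's source): the backlog passed to listen(2) bounds how many
 * pending, not-yet-accepted connections the kernel queues for the socket. */
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int backlog = 32;                  /* the newer default; -B 10 was in use here */
    int s = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(s, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(2009);       /* the aggregator port from this setup */
    addr.sin_addr.s_addr = htonl(INADDR_ANY);

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(s, backlog) < 0) {
        perror("bind/listen");
        return 1;
    }
    /* with a small backlog, a burst of reconnecting upstream relays can be
       left waiting (or refused) before accept() ever sees them */
    printf("listening on :2009 with backlog %d\n", backlog);
    close(s);
    return 0;
}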

@grobian
Owner

grobian commented Mar 6, 2017

That's interesting info! I'll see if I can do anything here to make it more clear what's going on.

@tehlers320
Author

It was a fluke, it seems; I have 2 hung, and restarting the relays they send to did not "unlock" them.

grobian added a commit that referenced this issue Mar 22, 2017
This is an attempt to do something about issue #250, but may be not
desirable, because bursts generated by the aggregator will likely result
in drops if the queue pressure is high.  On the other hand, we try to
spread the aggregation load, and server queues should be adjusted to
handle this in the normal case.
Idea is that blocking the expiry thread might cause an avalanche of
blocks resulting in a spiraling behaviour towards a full stop.
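
A rough sketch of the idea described in that commit message (not the actual patch): have the expiry thread enqueue aggregated results without blocking, and count a drop when the server queue is full, rather than letting it stall while holding its lock:

/* Rough sketch of the idea in the commit message above (not the actual
 * patch): when the expiry thread flushes an aggregation bucket, enqueue the
 * result without blocking and count a drop if the server queue is full,
 * instead of letting the expiry thread stall while it holds its lock. */
#include <pthread.h>
#include <stdio.h>

#define QUEUE_SIZE 4  /* deliberately tiny to show drops under pressure */

struct queue {
    pthread_mutex_t lock;
    char slots[QUEUE_SIZE][128];
    int len;
    unsigned long dropped;
};

/* returns 1 if enqueued, 0 if dropped; never blocks on downstream I/O */
static int enqueue_nonblocking(struct queue *q, const char *metric) {
    int ok = 0;
    pthread_mutex_lock(&q->lock);
    if (q->len < QUEUE_SIZE) {
        snprintf(q->slots[q->len++], sizeof(q->slots[0]), "%s", metric);
        ok = 1;
    } else {
        q->dropped++;  /* shows up as dropped metrics instead of a hung relay */
    }
    pthread_mutex_unlock(&q->lock);
    return ok;
}

int main(void) {
    struct queue q = { .lock = PTHREAD_MUTEX_INITIALIZER, .len = 0, .dropped = 0 };
    /* simulate an aggregator expiry burst larger than the queue */
    for (int i = 0; i < 10; i++) {
        char metric[128];
        snprintf(metric, sizeof(metric),
                 "prod.us-west-2.stats.gauges.app.aggregator.m%d 1 1490693850", i);
        enqueue_nonblocking(&q, metric);
    }
    printf("queued %d, dropped %lu\n", q.len, q.dropped);
    return 0;
}
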
@grobian
Owner

grobian commented Mar 22, 2017

I'm not sure if current master is usable for you, but I would be interested to know if you could try the above commit (applying it to v2.6 should be fine) to see whether it makes any difference.

@tehlers320
Author

I am rolling it out to our test env. I see some regexes were changed:

/etc/carbon-c-relay-2009.conf:56:4: invalid expression '^stats\.gauges\.[\w': Unmatched [ or [^
    ^stats\.gauges\.[\w&\-]*\.\w{1,8}\..*
    ^^^^^^^^^^^^^^^^^^^

@grobian
Owner

grobian commented Mar 22, 2017

Hmmm, does quoting the entire expression work? I think it stops at the &, I'll fix that.

@grobian
Owner

grobian commented Mar 22, 2017

I've pushed the fix for & and ,

@tehlers320
Author

I actually just removed it; our regexes are becoming overly complicated and weird. I've been hunting down bad behaviors and getting them fixed in preparation for metrics 2.0.

@tehlers320
Author

tehlers320 commented Mar 23, 2017

I added a new node to the mix (which I wish I hadn't), because the 2.6 vs 3.x memory pattern is completely different, so we could see a difference. A nice flat line?

screen shot 2017-03-23 at 9 17 12 am

@grobian
Owner

grobian commented Mar 23, 2017

Hmmm, you want me to backport that particular commit to 2.6 and branch it? Shouldn't be hard, the commit will apply cleanly.

@tehlers320
Author

Nah, thanks though! We are debating rolling out what we have in test, which is 3.x-alpha, I guess we can call it.

@grobian
Owner

grobian commented Mar 23, 2017

hmmm, that's kind of scary, it's bleeding edge :)

@tehlers320
Author

It's been in test for almost 24 hours and monit is around to restart it if it crashes, meh.

@tehlers320
Author

I figured out what the fluke was. The upstream 2009 relay (added after this bug was filed), which all the metrics are funneled to for averaging, also hangs. This has the effect of hanging the upper/lower tier with the config posted above. All of these now have the "2.6.1" patch applied as of 5 days ago. I took another core dump, but the only thing that has changed is that the lines shifted down by 1 from the patch.

The new funnel system, set up to prevent the overwriting of whisper datapoints that we found was happening:
2005 (just relays) -> graphite 2005 (just relays) -> go-carbon
2005 (just relays) -> local aggregator 2009 (upper/lower) -> graphite 2009 (avg) -> go-carbon

The upstream configs are the same as the old ones above, except they do not do validate and they have 16 workers (automatically assigned by Chef from node[:cpu][:total]).

cluster graphite
    any_of
        127.0.0.1:2003
    ;

####### aggregates #######

aggregate
    ^prod\.([^.]+)\.stats\.gauges\.([^.]+)\.([^.]+)\.(.*)
    every 65 seconds
    expire after 70 seconds
    timestamp at end of bucket
    compute avg write to
    prod.\1.stats.gauges.\2.\3.\4
    send to graphite
    stop
    ;

aggregate
    ^prod\.([^.]+)\.stats_counts\.([^.]+)\.([^.]+)\.(.*)
    every 65 seconds
    expire after 70 seconds
    timestamp at end of bucket
    compute sum write to
    prod.\1.stats_counts.\2.\3.\4
    send to graphite
    stop
    ;

aggregate
    ^prod\.([^.]+)\.stats\.counters\.([^.]+)\.([^.]+)\.(.*)
    every 65 seconds
    expire after 70 seconds
    timestamp at end of bucket
    compute sum write to
    prod.\1.stats.counters.\2.\3.\4
    send to graphite
    stop
    ;


######## matches ########
match *
    send to graphite
    stop
    ;

One of the threads:

Thread 3 (Thread 0x7f9970bd1700 (LWP 43092)):
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
#1  0x00000000004139c5 in aggregator_putmetric (s=0xaab5a0,
    metric=0x7f9971bfc6bd "prod.us-west-2.stats.gauges.someapp.aggregator.org.mongodb.driver.ConnectionPool.Size.58da0c04cac5160dcea7e267.somefeature.somedb.27017.upper 3.000000 14906"..., firstspace=0x7f9971bfc776 " 3.000000 1490693850\n",
    nmatch=<optimized out>, pmatch=0x7f9970bd0a70) at aggregator.c:237
#2  0x000000000040f140 in router_route_intern (blackholed=0x7f9970bd0ccf "", ret=0x7f9971bfe6c0, curlen=0x7f9970bd0cc0, retsize=64, srcaddr=0x7f9971bfa69d "10.0.1.1", metric=<optimized out>, firstspace=0x7f9971bfc776 " 3.000000 1490693850\n", r=0x7f9a01138d11) at router.c:3111
#3  0x000000000040fb2a in router_route (rtr=<optimized out>, ret=<optimized out>, retcnt=0x7f9971bfeac0, retsize=<optimized out>, srcaddr=<optimized out>, metric=<optimized out>, firstspace=0x7f9971bfc776 " 3.000000 1490693850\n") at router.c:3218
#4  0x0000000000406fb5 in dispatch_connection (start=..., self=0xaaa740, conn=0x7f9971bfa698) at dispatcher.c:449
#5  dispatch_runner (arg=0xaaa740) at dispatcher.c:634
#6  0x00007f9a00a51dc5 in start_thread (arg=0x7f9970bd1700) at pthread_create.c:308
#7  0x00007f9a0077f6ed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

@tehlers320
Author

tehlers320 commented Apr 6, 2017

I was stable until today, when I pushed an automation update to have Chef change the worker count to (core count / 2); since we have other pids running here, I thought that made sense. This set my workers to 2 in some zones, and many of those stalled, even though they are very light on CPU to begin with (20% user average). Again, only aggregator nodes hung. That makes me wonder if it has always been an issue with the worker thread counts.

Should the formula for worker threads be based on destinations instead of core count?

Or maybe I'm grasping at straws here. But lowering the worker count to 2 reproduced the issue very quickly.

@grobian
Owner

grobian commented Apr 7, 2017

It certainly seems to be. It is a bit surprising, but gives a hint where the block is coming from. Thanks!

@devinkramer

I experienced the same hanging issue after updating to 2.6 from a very old 0.44 on a node that primarily handles aggregations sent from other carbon-c-relay instances and then sends them along to store. I downgraded it to 2.3 in the hope that the issue was not present there, but after going back through this entire thread it sounds like tehlers320 had the same issue with 2.3? Happy to provide any details I can to help hunt down the cause, but I was also wondering what the earliest version is that introduced the "-B backlog" option, which is the primary reason I was updating.

@tehlers320
Author

@devinkramer I haven't had issues since I upped my thread count to double my core count. At the very least it's mitigating this issue.

@devinkramer

I'll give that a try, thanks!

@devinkramer

Unfortunately it still got wedged for me with the thread count set to 8 (-w 8) on a 4 core VM running version 2.3.

@grobian
Owner

grobian commented May 19, 2017

I think Mark found the cause and made a patch for it: #274

It is on my todo list to incorporate his changes for the next release.

@grobian
Owner

grobian commented Aug 8, 2017

The aggregator changes are in the master branch.

@grobian grobian closed this as completed Aug 8, 2017