carbon-c-relay hanging with no indication of an issue #250
Occurred again on a node where we left validation on. This time I grabbed an strace of the pid and all the threads for 60 seconds.
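The exact invocation didn't survive the formatting here; a typical way to capture a 60-second, all-thread trace would be something like the following (the pid lookup and output path are illustrative):

```
# -f follows every thread of the attached pid, -tt timestamps each call;
# timeout(1) detaches strace cleanly after 60 seconds
timeout 60 strace -f -tt -p "$(pidof carbon-c-relay)" -o carbon-c-relay-strace.txt
```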
Are the two instances there just to cover 2 ports?
And do you know which instance is not producing results any more? The "idle" one or the busy one?
This could be related to issue #242.
Relay 2005 is ahead of it and just sends on to graphite; relay 2009, shown here, is the one hanging. I separated my aggregation: 2005 does just relay and validate, 2009 handles the aggregation. So far no 2005 relay has hung. But it's very sporadic; no hangs since the last posting. Traffic to 2009 is TCP only, since it should all come from the relay.
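For readers following along, a minimal sketch of that split might look like the following. The hostnames, regexes, and aggregation rule are placeholders, not the real configs, which weren't posted in full:

```
# relay tier (port 2005): validate and forward only
cluster aggregators
    forward
        relay-2009.example.com:2009
    ;
match *
    validate ^[0-9.eE+-]+\ [0-9]+$ else drop
    send to aggregators
    ;
```

```
# aggregator tier (port 2009): aggregate, then send on to graphite
cluster graphite
    forward
        graphite-store.example.com:2003
    ;
aggregate
        ^production\..*\.stats\.gauges\..*
    every 60 seconds
    expire after 75 seconds
    compute average write to
        aggregated.stats.gauges.overall
    ;
match *
    send to graphite
    ;
```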
I see. So it must be an aggregator issue.
I am not sure; so far there has not been another hang. I have 4 of them in the zone where this has happened. I put 2 on 2.3 and left 2 on 2.6 for further testing.
OK, that's interesting. Did you build the versions yourself? It would be nice to get a stack trace or something from the relay when it is "hung"; that might give a clue as to what it is blocked on.
Yes, I built off the tagged branch https://github.com/grobian/carbon-c-relay/tree/v2.6. Did you mean a core dump? I have an strace above at https://github.com/grobian/carbon-c-relay/files/774433/carbon-c-relay-strace.txt. I'll gather both, if/when it happens. I'm also wondering if there is a way to speed up getting to a hang. Maybe bursting traffic at it; I'm not sure.
Well, the hang has occurred on 2.3, so this is not a 2.6 issue.
I have a core dump this time; it's 405 MB, though.
Just a shot in the dark, but some of Dropwizard's "OneMinuteRate" metrics turn into values like 2.964393875e-314 after days of inactivity. I wonder if trying to perform an aggregate on such an extreme (subnormal) float is a problem.
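For what it's worth, that value is indeed subnormal, i.e. smaller than the smallest normal double, which is easy to confirm from the command line:

```
# sys.float_info.min is the smallest positive *normal* double (~2.225e-308);
# any nonzero value below it is subnormal (denormal)
python3 -c 'import sys; print(2.964393875e-314 < sys.float_info.min)'
```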
Found something interesting, but it may be unrelated. In the main graphite tree, some pids seem to have dumped random chunks of their aggregated metrics. I may not have noticed this before, since we mostly use grafana and there is a cleanup mechanism in place for non-writing metrics. _aggregator_stub_0x10ee570__production.us-west-2.stats.gauges...dropwizardstuff Exploring the dumped tree, there are averages/min/max/sum, so it is not consistently one or the other.
That's not supposed to happen :( Did you by chance reload the relay? Any idea if that metric could be generated by 2.6?
No, I wasn't careful enough to mark the crash time with which instance. I moved 100% back to 2.6 now, though, after seeing 2.3 crash. Would this core dump (from 2.3) be useful? I just don't know where to send it at 405 MB! EDIT:
I don't think the core dump is of much use; the only useful aspect of it is the backtrace. The bulk of the dump is in the send buffers and/or aggregation states, which are probably sensitive data, so I'd first like to try without.
Sorry, I've not used gdb before. I removed some of the names on the string in the metricput. But what's curious is that the whole stack stops. The internal metrics stop as well.
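For reference, a per-thread backtrace can be pulled from a core dump non-interactively, which shares only the call stacks and not the (possibly sensitive) buffer contents; binary and core paths below are placeholders:

```
# print a backtrace for every thread in the core, then exit
gdb -batch -ex 'thread apply all bt' /usr/bin/carbon-c-relay core.12345 > backtrace.txt
```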
One worker is active and in the process of acquiring a lock. The rest are doing nothing, waiting for data to arrive. The lock is likely blocking because the expiry thread is stuck in write() while holding it.
Well, this may be solved. I had a host hang this morning and came to realize that it may be hanging on getting a connection. The receiving servers are set at -B 10. I noticed the default is -B 32 now, so I moved them to that, and the "stuck/hung" pid unhung. I'll close this out in a couple of days if no hangs occur.
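As context for others hitting this: -B sets the listen(2) backlog on the receiving side, and for listening TCP sockets the kernel reports the configured backlog in the Send-Q column, so the change can be checked from the shell. Config path and port are placeholders:

```
# start the receiving relay with the larger accept backlog
carbon-c-relay -f /etc/carbon-c-relay.conf -p 2003 -B 32

# verify: for sockets in LISTEN state, Send-Q shows the backlog
ss -lnt | grep ':2003'
```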
That's interesting info! I'll see if I can do anything here to make it more clear what's going on.
It was a fluke, it seems; I have 2 hung, and a restart of the relays they send to did not "unlock" them.
This is an attempt to do something about issue #250, but it may not be desirable, because bursts generated by the aggregator will likely result in drops if the queue pressure is high. On the other hand, we try to spread the aggregation load, and server queues should be adjusted to handle this in the normal case. The idea is that blocking the expiry thread might cause an avalanche of blocks, resulting in behaviour spiralling towards a full stop.
I'm not sure if current master is usable for you, but I would be interested to hear if you could try the above commit (applying it to v2.6 should be fine) to see if it makes any difference.
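A minimal way to do that, assuming the commit applies cleanly to the tag as suggested later in this thread (the sha is a placeholder, and the build step assumes the plain Makefile of the 2.x tree):

```
# build v2.6 plus just the aggregator commit
git clone https://github.com/grobian/carbon-c-relay.git
cd carbon-c-relay
git checkout -b v2.6-patched v2.6
git cherry-pick <commit-sha>
make
```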
I am rolling it out to our test env. I see some regexes were changed.
Hmmm, does quoting the entire expression work? I think it stops at the &; I'll fix that.
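To illustrate the workaround being discussed (the pattern and destination are hypothetical, and whether quoting alone suffices depends on the fix mentioned next):

```
# quoting the whole expression so characters like '&' reach the regex intact
match "^stats\.gauges\.app&env\."
    send to graphite
    ;
```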
I've pushed the fix for & and , |
I actually just removed it; our regex was becoming overly complicated and weird. I've been hunting down bad behaviors and getting them fixed in preparation for metrics2.0.
Hmmm, do you want me to backport that particular commit to 2.6 and branch it? Shouldn't be hard; the commit will apply cleanly.
Nah, thanks though! We are debating rolling out what we have in test, which is 3.x-alpha, I guess we can call it.
Hmmm, that's kind of scary; it's bleeding edge :)
It's been in test for almost 24 hours and monit is around to restart it if it crashes, meh.
I figured out what the fluke was. The upstream 2009 relay (added after this bug was filed), which is used to funnel all the metrics through via an average aggregation, also hangs. This has the effect of hanging the upper/lower tier with the config posted above. All of these have had the "2.6.1" patch applied as of 5 days ago. I took another core dump, but the only thing that has changed is that the lines shifted down 1 from the patch. The new funnel system was added to prevent overwriting of whisper datapoints, which we found was happening. Upstream configs are the same as the old ones above, except they do not validate and they have 16 workers (automatically assigned by chef from node[:cpu][:total]).
One of the threads:
I was stable until today, when I pushed an automation update to have chef change the worker count to (core count / 2); since we have other pids running here, I thought this made sense. This set my workers to 2 in some zones, and many of these nodes stalled, even though they are very light on CPU to begin with (20% user average). Again, only aggregator nodes hung. It makes me wonder if it has always been an issue with the worker thread counts. Should the formula for worker threads be based on destinations instead of core count? Or maybe I'm grasping at straws here. But lowering the worker count to 2 reproduced the issue very quickly.
It certainly seems to be. It is a bit surprising, but it gives a hint as to where the block is coming from. Thanks!
I experienced the same hanging issue after updating to 2.6 from a very old 0.44 on a node that primarily handles aggregations sent from other carbon-c-relay instances and then sends them along to store. I downgraded it to 2.3 in hopes that the issue was not present there, but after going back through this entire thread it sounds like tehlers320 had the same issue with 2.3? Happy to provide any details I can to help hunt down the cause, but I was also wondering what the earliest version is that introduces the "-B backlog" option, which is the primary reason I was updating.
@devinkramer I haven't had issues since I upped my thread count to double my core count. At the very least it's mitigating this issue.
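If it helps anyone else, the mitigation amounts to something like this at startup (the config path is a placeholder; nproc reports the core count):

```
# run twice as many worker threads as cores, which avoided the stalls here
carbon-c-relay -f /etc/carbon-c-relay.conf -w $(( $(nproc) * 2 ))
```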
I'll give that a try, thanks!
Unfortunately, it still got wedged for me with the thread count set to 8 (-w 8) on a 4-core VM running version 2.3.
I think Mark found the cause and made a patch for it: #274. It is on my todo list to incorporate his changes for the next release.
The aggregator changes are in the master branch.
I have a relay-to-relay setup for aggregation. A few times now since 2.6 was introduced (from 2.3), the process has hung with no errors/warnings to indicate what's happening.
The only error I have is from the first relay.
We moved to 2.6 to use the validator; that is the only config change between where we were stable on 2.3 and 2.6.
Startup settings:
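The exact flags didn't survive the formatting of this report; based on details mentioned later in the thread (validation enabled, TCP listeners, worker and backlog counts set by chef), a representative invocation might look like this, with config path, port, and counts as placeholders:

```
# illustrative only; not the actual startup line from this deployment
carbon-c-relay -f /etc/carbon-c-relay.conf -p 2005 -w 8 -B 10
```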
Our systems are pretty busy, so this could be a contributing factor. We run these all on m4.xlarge instances (4 cores, 16 GB of RAM).
I just wanted to file a bug in case it is a real issue and not something we have done on our side. We will remove the validator on a couple of nodes for testing, but it's a long lag time before a node will hang.