carbon-c-relay hanging with no indication of an issue #250
Occurred again on a node where we left validation on. This time I grabbed an strace of the pid and all the threads for 60 seconds.
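The exact invocation didn't survive the formatting here; a typical way to capture a 60-second, all-thread trace would be something like the following (the pid lookup and output path are illustrative):

```
# -f follows every thread of the attached pid, -tt timestamps each call;
# timeout(1) detaches strace cleanly after 60 seconds
timeout 60 strace -f -tt -p "$(pidof carbon-c-relay)" -o carbon-c-relay-strace.txt
```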
Are the two instances there just to cover 2 ports?
And do you know which instance is not producing results any more? The "idle" one or the busy one?
This could be related to issue #242.
Relay 2005 is ahead of it and just sends on to graphite; relay 2009, shown here, is the one hanging. I separated my aggregation: 2005 does just relay and validate, 2009 handles the aggregation. So far no 2005 relay has hung. But it's very sporadic; no hangs since the last posting. Traffic to 2009 is TCP only, since it should all come from the relay.
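For readers following along, a minimal sketch of that split might look like the following. The hostnames, regexes, and aggregation rule are placeholders, not the real configs, which weren't posted in full:

```
# relay tier (port 2005): validate and forward only
cluster aggregators
    forward
        relay-2009.example.com:2009
    ;
match *
    validate ^[0-9.eE+-]+\ [0-9]+$ else drop
    send to aggregators
    ;
```

```
# aggregator tier (port 2009): aggregate, then send on to graphite
cluster graphite
    forward
        graphite-store.example.com:2003
    ;
aggregate
        ^production\..*\.stats\.gauges\..*
    every 60 seconds
    expire after 75 seconds
    compute average write to
        aggregated.stats.gauges.overall
    ;
match *
    send to graphite
    ;
```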
I see. So it must be an aggregator issue.
I am not sure; so far there has not been another hang. I have 4 of them in the zone where this has happened. I put 2 on 2.3 and left 2 on 2.6 for further testing.
OK, that's interesting. Did you build the versions yourself? It would be nice to get a stack trace or something from the relay when it is "hung"; that might give a clue as to what it is blocked on.
Yes, I built off the tagged branch https://github.com/grobian/carbon-c-relay/tree/v2.6. Did you mean a core dump? I have an strace above at https://github.com/grobian/carbon-c-relay/files/774433/carbon-c-relay-strace.txt. I'll gather both, if/when it happens. I'm also wondering if there is a way to speed up getting to a hang. Maybe bursting traffic at it; I'm not sure.
Well, the hang has occurred on 2.3, so this is not a 2.6 issue.
I have a core dump this time; it's 405 MB, though.
Just a shot in the dark, but some of Dropwizard's "OneMinuteRate" metrics turn into values like 2.964393875e-314 after days of inactivity. I wonder if trying to perform an aggregate on such an extreme (subnormal) float is a problem.
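For what it's worth, that value is indeed subnormal, i.e. smaller than the smallest normal double, which is easy to confirm from the command line:

```
# sys.float_info.min is the smallest positive *normal* double (~2.225e-308);
# any nonzero value below it is subnormal (denormal)
python3 -c 'import sys; print(2.964393875e-314 < sys.float_info.min)'
```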
Found something interesting, but it may be unrelated. In the main graphite tree, some pids seem to have dumped random chunks of their aggregated metrics. I may not have noticed this before, since we mostly use grafana and there is a cleanup mechanism in place for non-writing metrics. _aggregator_stub_0x10ee570__production.us-west-2.stats.gauges...dropwizardstuff Exploring the dumped tree, there are averages/min/max/sum, so it is not consistently one or the other.
That's not supposed to happen :( Did you by chance reload the relay? Any idea if that metric could be generated by 2.6?
No, I wasn't careful enough to mark the crash time with which instance. I moved 100% back to 2.6 now, though, after seeing 2.3 crash. Would this core dump (from 2.3) be useful? I just don't know where to send it at 405 MB! EDIT:
I don't think the core dump is of much use; the only useful aspect of it is the backtrace. The bulk of the dump is in the send buffers and/or aggregation states, which are probably sensitive data, so I'd first like to try without.
Sorry, I've not used gdb before. I removed some of the names on the string in the metricput. But what's curious is that the whole stack stops. The internal metrics stop as well.
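For reference, a per-thread backtrace can be pulled from a core dump non-interactively, which shares only the call stacks and not the (possibly sensitive) buffer contents; binary and core paths below are placeholders:

```
# print a backtrace for every thread in the core, then exit
gdb -batch -ex 'thread apply all bt' /usr/bin/carbon-c-relay core.12345 > backtrace.txt
```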
One worker is active and in the process of acquiring a lock. The rest are doing nothing, waiting for data to arrive. The lock is likely blocking because the expiry thread is stuck in write() while holding it.
Well, this may be solved. I had a host hang this morning and came to realize that it may be hanging on getting a connection. The receiving servers are set at -B 10. I noticed the default is -B 32 now, so I moved them to that, and the "stuck/hung" pid unhung. I'll close this out in a couple of days if no hangs occur.
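As context for others hitting this: -B sets the listen(2) backlog on the receiving side, and for listening TCP sockets the kernel reports the configured backlog in the Send-Q column, so the change can be checked from the shell. Config path and port are placeholders:

```
# start the receiving relay with the larger accept backlog
carbon-c-relay -f /etc/carbon-c-relay.conf -p 2003 -B 32

# verify: for sockets in LISTEN state, Send-Q shows the backlog
ss -lnt | grep ':2003'
```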
That's interesting info! I'll see if I can do anything here to make it more clear what's going on.
It was a fluke, it seems; I have 2 hung, and a restart of the relays they send to did not "unlock" them.
This is an attempt to do something about issue #250, but it may not be desirable, because bursts generated by the aggregator will likely result in drops if the queue pressure is high. On the other hand, we try to spread the aggregation load, and server queues should be adjusted to handle this in the normal case. The idea is that blocking the expiry thread might cause an avalanche of blocks, resulting in behaviour spiralling towards a full stop.
I'm not sure if current master is usable for you, but I would be interested to hear if you could try the above commit (applying it to v2.6 should be fine) to see if it makes any difference.
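A minimal way to do that, assuming the commit applies cleanly to the tag as suggested later in this thread (the sha is a placeholder, and the build step assumes the plain Makefile of the 2.x tree):

```
# build v2.6 plus just the aggregator commit
git clone https://github.com/grobian/carbon-c-relay.git
cd carbon-c-relay
git checkout -b v2.6-patched v2.6
git cherry-pick <commit-sha>
make
```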
I am rolling it out to our test env. I see some regexes were changed.
Hmmm, does quoting the entire expression work? I think it stops at the &; I'll fix that.
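To illustrate the workaround being discussed (the pattern and destination are hypothetical, and whether quoting alone suffices depends on the fix mentioned next):

```
# quoting the whole expression so characters like '&' reach the regex intact
match "^stats\.gauges\.app&env\."
    send to graphite
    ;
```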
I've pushed the fix for & and , |
I actually just removed it; our regex was becoming overly complicated and weird. I've been hunting down bad behaviors and getting them fixed in preparation for metrics2.0.
Hmmm, do you want me to backport that particular commit to 2.6 and branch it? Shouldn't be hard; the commit will apply cleanly.
Nah, thanks though! We are debating rolling out what we have in test, which is 3.x-alpha, I guess we can call it.
Hmmm, that's kind of scary; it's bleeding edge :)
It's been in test for almost 24 hours and monit is around to restart it if it crashes, meh.
I figured out what the fluke was. The upstream 2009 relay (added after this bug was filed), which is used to funnel all the metrics through via an average aggregation, also hangs. This has the effect of hanging the upper/lower tier with the config posted above. All of these have had the "2.6.1" patch applied as of 5 days ago. I took another core dump, but the only thing that has changed is that the lines shifted down 1 from the patch. The new funnel system was added to prevent overwriting of whisper datapoints, which we found was happening. Upstream configs are the same as the old ones above, except they do not validate and they have 16 workers (automatically assigned by chef from node[:cpu][:total]).
One of the threads:
I was stable until today, when I pushed an automation update to have chef change the worker count to (core count / 2); since we have other pids running here, I thought this made sense. This set my workers to 2 in some zones, and many of these nodes stalled, even though they are very light on CPU to begin with (20% user average). Again, only aggregator nodes hung. It makes me wonder if it has always been an issue with the worker thread counts. Should the formula for worker threads be based on destinations instead of core count? Or maybe I'm grasping at straws here. But lowering the worker count to 2 reproduced the issue very quickly.
It certainly seems to be. It is a bit surprising, but it gives a hint as to where the block is coming from. Thanks!
I experienced the same hanging issue after updating to 2.6 from a very old 0.44 on a node that primarily handles aggregations sent from other carbon-c-relay instances and then sends them along to store. I downgraded it to 2.3 in hopes that the issue was not present there, but after going back through this entire thread it sounds like tehlers320 had the same issue with 2.3? Happy to provide any details I can to help hunt down the cause, but I was also wondering what the earliest version is that introduces the "-B backlog" option, which is the primary reason I was updating.
@devinkramer I haven't had issues since I upped my thread count to double my core count. At the very least it's mitigating this issue.
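If it helps anyone else, the mitigation amounts to something like this at startup (the config path is a placeholder; nproc reports the core count):

```
# run twice as many worker threads as cores, which avoided the stalls here
carbon-c-relay -f /etc/carbon-c-relay.conf -w $(( $(nproc) * 2 ))
```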
I'll give that a try, thanks!
Unfortunately, it still got wedged for me with the thread count set to 8 (-w 8) on a 4-core VM running version 2.3.
I think Mark found the cause and made a patch for it: #274. It is on my todo list to incorporate his changes for the next release.
The aggregator changes are in the master branch.
I have a relay-to-relay setup for aggregation. A few times now since 2.6 was introduced (from 2.3), the process has hung with no errors/warnings to indicate what's happening.
The only error I have is from the first relay.
We moved to 2.6 to use the validator; that is the only config change between where we were stable on 2.3 and 2.6.
Startup settings:
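The exact flags didn't survive the formatting of this report; based on details mentioned later in the thread (validation enabled, TCP listeners, worker and backlog counts set by chef), a representative invocation might look like this, with config path, port, and counts as placeholders:

```
# illustrative only; not the actual startup line from this deployment
carbon-c-relay -f /etc/carbon-c-relay.conf -p 2005 -w 8 -B 10
```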
Our systems are pretty busy, so this could be a contributing factor. We run these all on m4.xlarge instances (4 cores, 16 GB of RAM).
I just wanted to file a bug in case it is a real issue and not something we have done on our side. We will remove the validator on a couple of nodes for testing, but it's a long lag time before a node will hang.