2.6+ Round-Robin DNS Entry forward any_of causes hung connections #262

Closed
g2bg opened this issue Apr 22, 2017 · 18 comments


g2bg commented Apr 22, 2017

On 2.5 the relay would pick one IP from the DNS entry to forward to and stick with it.
In 2.6 and 3.0 it connects to one of the IPs from the DNS entry, sends metrics, and then leaves the connection open.
This leads to ever-increasing connection counts until the relay runs out of file descriptors.

So this is a two-pronged request:

  1. Fix the connection close issue.
  2. In any_of, load all IPs from a multi-IP DNS resolution.

[2017-04-21 17:26:56] (MSG) starting carbon-c-relay v3.0 (98424a-dirty), pid=39656
configuration:
relay hostname = server_name
listen port = 2003
listen interface = 127.0.0.1
workers = 4
send batch size = 2500
server queue size = 25000
server max stalls = 4
listen backlog = 32
server connection IO timeout = 600ms
debug = true
configuration = /etc/carbon-c-relay.conf

parsed configuration follows:
statistics
submit every 60 seconds
prefix with carbon.relays.server_name
;

cluster local_carbon
any_of
multiipdnsentry:2003
;

match *
send to local_carbon
stop
;

[2017-04-21 17:26:56] (MSG) listening on tcp4 127.0.0.1 port 2003
[2017-04-21 17:26:56] (MSG) listening on udp4 127.0.0.1 port 2003
[2017-04-21 17:26:56] (MSG) listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[2017-04-21 17:26:56] (MSG) starting 4 workers
[2017-04-21 17:26:56] (MSG) starting statistics collector
[2017-04-21 17:26:56] (MSG) starting servers

/usr/bin/carbon-c-relay -P /var/run/carbon-c-relay/carbon-c-relay.pid -D -p 2003 -i 127.0.0.1 -w 4 -b 2500 -q 25000 -l /var/log/carbon-c-relay/carbon-c-relay.log -s -f /etc/carbon-c-relay.conf


grobian commented Apr 22, 2017

For 1, I'm searching; so far I have no clue why you see that behaviour.
For 2, I'm not sure I understand your request. Could it be that you want the useall behaviour? It expands all IP addresses up front instead of re-resolving on each connection attempt.


grobian commented Apr 22, 2017

As for 1, I think it's an inverted-logic bug.

grobian added a commit that referenced this issue Apr 22, 2017
…lved

This may be causing what is observed in issue #262.  Due to inverted
logic, connections bound to be re-resolved were actually re-resolved,
but the info was never used, and addrinfo was leaked.  On resolve
failure this could even result in errors or other undefined behaviour.

grobian commented Apr 22, 2017

Due to a bug, connections would never switch to another IP; memory was leaked, though.
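
To illustrate the kind of inverted check and addrinfo leak the commit above describes, here is a minimal hypothetical sketch; the struct and names are assumptions for illustration, not carbon-c-relay's actual code:

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical per-server state. */
struct server {
    char *hostname;
    char *port;
    struct addrinfo *saddr;  /* address currently in use */
    char reresolve;          /* set when the hostname must be re-resolved */
};

static int server_resolve(struct server *s)
{
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;

    if (!s->reresolve) {  /* BUG: inverted; should be s->reresolve */
        if (getaddrinfo(s->hostname, s->port, &hints, &res) != 0)
            return -1;
        /* missing: freeaddrinfo(s->saddr); s->saddr = res;
         * the fresh result is never used and leaks, so the server
         * keeps connecting to the old address forever */
    }
    return 0;
}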

grobian added the bug label Apr 22, 2017

g2bg commented Apr 22, 2017

Validated that the inverted-logic fix now rotates IPs.
Unfortunately I'm still seeing the increasing-connection issue. I can easily replicate it for debugging; within 5 minutes it has built up 100 or so established connections to the other relays.
This is on both CentOS 6 and 7.

As for 2, my thought was that if a round-robin DNS entry is used in an any_of cluster, we could pull all the IPs from the DNS entry and add them as individual members. That way traffic would be split between them while still tolerating a node being down. This would allow centralized management of the endpoints instead of having all the IPs in the remote relays.


grobian commented Apr 22, 2017

Ok, so there's something going on there.

What you describe in 2 seems exactly like the useall feature to me. Try this:

cluster local_carbon
any_of useall
multiipdnsentry:2003
;

From what you tell me, in this mode the relay should not build up a large pile of connections. If that is indeed the case, it should narrow the search somewhat, but I'm curious...


g2bg commented Apr 22, 2017

Agreed, it looks like useall is what I was looking for there. Thank you so much.

As for the other issue, that only made it quite a bit worse: after 1 minute I had 240 established connections.
I have a build set up where I can add some debugging code to try to find it; I'm just not quite sure where to put it at the moment.


grobian commented Apr 23, 2017

The idea is that a connection is made and reused when there are metrics to write within a certain timeout (something like 10s off the top of my head). It should absolutely NOT open a new connection each time it tries to write.
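
As a rough sketch of that intended reuse pattern (hypothetical struct and helper; the timeout constant here is an assumption, and the next comment measures the actual disconnect at 3 seconds):

#include <time.h>
#include <unistd.h>

/* Hypothetical connection state; not the relay's actual struct. */
struct conn {
    int fd;            /* -1 when not connected */
    time_t last_write; /* time of the last successful write */
};

#define IDLE_TIMEOUT 10  /* assumed idle timeout in seconds */

/* Keep the connection open for reuse while metrics keep flowing;
 * close it only once it has been idle longer than the timeout.
 * Returns 1 if the connection is still usable. */
static int conn_check_idle(struct conn *c)
{
    if (c->fd >= 0 && time(NULL) - c->last_write > IDLE_TIMEOUT) {
        close(c->fd);
        c->fd = -1;
    }
    return c->fd >= 0;
}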


grobian commented Apr 23, 2017

I just did a simple test to verify the disconnect behaviour, and it seems to trigger (it's 3 seconds). Can you tell me a bit about how many addresses your multiipdnsentry resolves to, and how much data is flowing towards the relay? If you use the stats, how many connections are made to the relay (nonNegativeDerivative(carbon.relays.host.connections)), and what are the other relays? Are they also c-relays, or different software?


g2bg commented Apr 23, 2017

Sure, this is a per-host relay-to-relay setup.
The client/host is set up as above and has collectd and other application metrics pumping in on a 10s interval. I believe the last count was around 2k metrics or so.
The four destinations run carbon-c-relay (I've tried this with 2.5, 2.6, and 3.0, no difference) and talk to the multitude of carbon-cache instances, currently running around 1M metrics/s on 3.0.
Watching the behavior of 2.5, it opens and closes connections on an interval.
In 2.6/3.0 it opens the connections on an interval as well, but never closes the old ones.
This only happens with the RR DNS, though; if I put all 4 destination IPs in directly, there's no issue.

Interesting side note: if I set the DNS entry with useall, it expands to the IPs as members in the log when it outputs the config, yet that setup has the connection issue. If I copy/paste that expanded config and use it directly, there are no problems.

Is there a way to adjust the disconnect timeout?


grobian commented Apr 23, 2017

I think I found the problem.

grobian added a commit that referenced this issue Apr 23, 2017
…solve

Part of what's reported in issue #262, a server that's the result of
use_all expansion should be treated as if it were given as IP address.
grobian added a commit that referenced this issue Apr 23, 2017


In case of connecting to a destination with multiple resolutions
(addrinfo) break out of the connect as soon as one succeed, don't try to
connect to all, since it builds up a huge pile of connections which
we'll never free.
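
What that commit describes is the standard getaddrinfo connect loop: try each resolved address in turn, but stop at the first socket that connects instead of opening one per address. A minimal sketch, not the actual relay code:

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to host:port, trying each resolved address in turn but
 * keeping only the first successful connection.  Returns the
 * connected socket, or -1 on failure. */
static int connect_first(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;  /* success: stop here, don't open more sockets */
        close(fd);
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;
}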

grobian commented Apr 23, 2017

If you could try latest master, that would be awesome. If it solves the problem for you, I'll release v3.1 shortly to fix this screwup.


g2bg commented Apr 23, 2017

CentOS 6 fails to make from master:
bison -d conffile.y
conffile.y:36.20-30: syntax error, unexpected {...}
make: *** [conffile.tab.c] Error 1

CentOS 7 completes.
With any_of useall and RR DNS, it still expands in the logged config output.
It goes in order, only connecting to one IP.
If the first fails, it goes to the second.
Connections do not grow with this config.

example:
10.1.1.3:2003
10.1.1.2:2003
10.1.1.1:2003
10.1.1.4:2003
it will always choose 10.1.1.3 unless it's unavailable. Is this expected behavior for any_of useall?
If I specify all 4 IPs in the config, it will connect to all 4.


grobian commented Apr 23, 2017

You can touch conffile.tab.* and conffile.yy.*, since git doesn't store mtimes :( (touching the pregenerated files makes them newer than conffile.y, so make won't try to regenerate them with the old bison).

I haven't found a way to work around this yet.

I'll look into why useall doesn't connect to the others.


grobian commented Apr 23, 2017

hah, use_all never updates the configuration, so the router thinks there's only one entry.


grobian commented Apr 23, 2017

hmmm, test mode shows all entries would get used ...


grobian commented Apr 25, 2017

I've not been able to reproduce the behaviour where it picks the first node; that is actually the behaviour of a failover cluster. Not that I don't trust your observations, but are you sure you're using any_of useall in this case, and that you see no distribution of metrics over all of the expanded hosts?


grobian commented Apr 27, 2017

I think I found a reason/cause for the behaviour you see.

grobian added a commit that referenced this issue Apr 27, 2017
This is likely the problem observed in issue #262 where the first
address is taken all the time, because the full stack of addresses were
assigned to every server.
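
In other words, when use_all expanded the DNS entry into several servers, every expanded server received the whole address list instead of its own single entry, so each of them connected to the first address. A hypothetical sketch of the correct per-server assignment (names assumed, not the actual code):

#include <netdb.h>

/* Hypothetical expanded-server record. */
struct server {
    struct addrinfo *saddr;  /* the one address this server should use */
};

/* Expand a multi-address resolution into individual servers.  Each
 * server gets its own node from the list; assigning `res` to every
 * server (instead of `ai`) would reproduce the reported bug, where
 * everyone connects to the first address. */
static int expand_servers(struct addrinfo *res,
                          struct server *servers, int max)
{
    struct addrinfo *ai;
    int n = 0;

    for (ai = res; ai != NULL && n < max; ai = ai->ai_next)
        servers[n++].saddr = ai;  /* per-server address, not res */

    return n;  /* number of expanded servers */
}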

grobian commented Apr 29, 2017

I think I've fixed this; if not, please reopen.

grobian closed this as completed Apr 29, 2017