2.6+ Round-Robin DNS Entry forward any_of causes hung connections #262

Closed
g2bg opened this issue Apr 22, 2017 · 18 comments


g2bg commented Apr 22, 2017

On 2.5 the relay would pick one IP from the DNS entry to forward to and stick with it.
In 2.6 and 3.0 it connects to one of the IPs from the DNS entry, sends metrics, and then leaves the connection open.
This leads to ever-increasing connection counts until the relay runs out of file descriptors.

So this is a two-pronged request:

  1. Fix the connection close issue.
  2. In any_of, load all IPs from a multi-IP DNS resolution.

[2017-04-21 17:26:56] (MSG) starting carbon-c-relay v3.0 (98424a-dirty), pid=39656
configuration:
relay hostname = server_name
listen port = 2003
listen interface = 127.0.0.1
workers = 4
send batch size = 2500
server queue size = 25000
server max stalls = 4
listen backlog = 32
server connection IO timeout = 600ms
debug = true
configuration = /etc/carbon-c-relay.conf

parsed configuration follows:
statistics
submit every 60 seconds
prefix with carbon.relays.server_name
;

cluster local_carbon
any_of
multiipdnsentry:2003
;

match *
send to local_carbon
stop
;

[2017-04-21 17:26:56] (MSG) listening on tcp4 127.0.0.1 port 2003
[2017-04-21 17:26:56] (MSG) listening on udp4 127.0.0.1 port 2003
[2017-04-21 17:26:56] (MSG) listening on UNIX socket /tmp/.s.carbon-c-relay.2003
[2017-04-21 17:26:56] (MSG) starting 4 workers
[2017-04-21 17:26:56] (MSG) starting statistics collector
[2017-04-21 17:26:56] (MSG) starting servers

/usr/bin/carbon-c-relay -P /var/run/carbon-c-relay/carbon-c-relay.pid -D -p 2003 -i 127.0.0.1 -w 4 -b 2500 -q 25000 -l /var/log/carbon-c-relay/carbon-c-relay.log -s -f /etc/carbon-c-relay.conf


grobian commented Apr 22, 2017

For 1, I'm searching; so far I have no clue why you see that behaviour.
For 2, I'm not sure I understand your request. Could it be that you want the useall behaviour? It expands all IP addresses up front instead of re-resolving on each connection attempt.


grobian commented Apr 22, 2017

As for 1, I think it's an inverted-logic bug.

grobian added a commit that referenced this issue Apr 22, 2017
…lved

This may be causing what is observed in issue #262.  Due to inverted
logic, connections bound to be re-resolved were actually re-resolved,
but the info was never used, and addrinfo was leaked.  On resolve
failure this could even result in errors or other undefined behaviour.

grobian commented Apr 22, 2017

Due to a bug, connections would never switch to another IP; memory was leaked, though.
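
To illustrate the kind of inverted check and addrinfo leak the commit above describes, here is a minimal hypothetical sketch; the struct and names are assumptions for illustration, not carbon-c-relay's actual code:

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical per-server state. */
struct server {
    char *hostname;
    char *port;
    struct addrinfo *saddr;  /* address currently in use */
    char reresolve;          /* set when the hostname must be re-resolved */
};

static int server_resolve(struct server *s)
{
    struct addrinfo hints, *res;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;

    if (!s->reresolve) {  /* BUG: inverted; should be s->reresolve */
        if (getaddrinfo(s->hostname, s->port, &hints, &res) != 0)
            return -1;
        /* missing: freeaddrinfo(s->saddr); s->saddr = res;
         * the fresh result is never used and leaks, so the server
         * keeps connecting to the old address forever */
    }
    return 0;
}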

grobian added the bug label Apr 22, 2017

g2bg commented Apr 22, 2017

Validated that the inverted-logic fix now rotates IPs.
Unfortunately I'm still seeing the increasing-connection issue. I can easily replicate it for debugging; within 5 minutes it has built up 100 or so established connections to the other relays.
This is on both CentOS 6 and 7.

As for 2, my thought was that if a round-robin DNS entry is used in an any_of cluster, we could pull all the IPs from the DNS entry and add them as individual members. That way traffic would be split between them while still tolerating a node being down. This would allow centralized management of the endpoints instead of having all the IPs in the remote relays.


grobian commented Apr 22, 2017

Ok, so there's something going on there.

What you describe in 2 seems exactly like the useall feature to me. Try this:

cluster local_carbon
any_of useall
multiipdnsentry:2003
;

From what you tell me, in this mode the relay should not build up a large pile of connections. If that is indeed the case, it should narrow the search somewhat, but I'm curious...


g2bg commented Apr 22, 2017

Agreed, it looks like useall is what I was looking for there. Thank you so much.

As for the other issue, that only made it quite a bit worse: after 1 minute I had 240 established connections.
I have a build set up where I can add some debugging code to try to find it; I'm just not quite sure where to put it at the moment.


grobian commented Apr 23, 2017

The idea is that a connection is made and reused when there are metrics to write within a certain timeout (something like 10s off the top of my head). It should absolutely NOT open a new connection each time it tries to write.
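
As a rough sketch of that intended reuse pattern (hypothetical struct and helper; the timeout constant here is an assumption, and the next comment measures the actual disconnect at 3 seconds):

#include <time.h>
#include <unistd.h>

/* Hypothetical connection state; not the relay's actual struct. */
struct conn {
    int fd;            /* -1 when not connected */
    time_t last_write; /* time of the last successful write */
};

#define IDLE_TIMEOUT 10  /* assumed idle timeout in seconds */

/* Keep the connection open for reuse while metrics keep flowing;
 * close it only once it has been idle longer than the timeout.
 * Returns 1 if the connection is still usable. */
static int conn_check_idle(struct conn *c)
{
    if (c->fd >= 0 && time(NULL) - c->last_write > IDLE_TIMEOUT) {
        close(c->fd);
        c->fd = -1;
    }
    return c->fd >= 0;
}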


grobian commented Apr 23, 2017

I just did a simple test to verify the disconnect behaviour, and it seems to trigger (it's 3 seconds). Can you tell me a bit about how many addresses your multiipdnsentry resolves to, and how much data is flowing towards the relay? If you use the stats, how many connections are made to the relay (nonNegativeDerivative(carbon.relays.host.connections)), and what are the other relays? Are they also c-relays, or different software?


g2bg commented Apr 23, 2017

Sure, this is a per-host relay-to-relay setup.
The client/host is set up as above and has collectd and other application metrics pumping in on a 10s interval. I believe the last count was around 2k metrics or so.
The four destinations run carbon-c-relay (I've tried this with 2.5, 2.6, and 3.0, no difference) and talk to the multitude of carbon-cache instances, currently running around 1M metrics/s on 3.0.
Watching the behavior of 2.5, it opens and closes connections on an interval.
In 2.6/3.0 it opens the connections on an interval as well, but never closes the old ones.
This only happens with the RR DNS, though; if I put all 4 destination IPs in directly, there's no issue.

Interesting side note: if I set the DNS entry with useall, it expands to the IPs as members in the log when it outputs the config, yet that setup has the connection issue. If I copy/paste that expanded config and use it directly, there are no problems.

Is there a way to adjust the disconnect timeout?


grobian commented Apr 23, 2017

I think I found the problem.

grobian added a commit that referenced this issue Apr 23, 2017
…solve

Part of what's reported in issue #262, a server that's the result of
use_all expansion should be treated as if it were given as IP address.
grobian added a commit that referenced this issue Apr 23, 2017


In case of connecting to a destination with multiple resolutions
(addrinfo) break out of the connect as soon as one succeed, don't try to
connect to all, since it builds up a huge pile of connections which
we'll never free.
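
What that commit describes is the standard getaddrinfo connect loop: try each resolved address in turn, but stop at the first socket that connects instead of opening one per address. A minimal sketch, not the actual relay code:

#include <netdb.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to host:port, trying each resolved address in turn but
 * keeping only the first successful connection.  Returns the
 * connected socket, or -1 on failure. */
static int connect_first(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;

    memset(&hints, 0, sizeof(hints));
    hints.ai_socktype = SOCK_STREAM;
    if (getaddrinfo(host, port, &hints, &res) != 0)
        return -1;

    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;  /* success: stop here, don't open more sockets */
        close(fd);
        fd = -1;
    }

    freeaddrinfo(res);
    return fd;
}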

grobian commented Apr 23, 2017

If you could try latest master, that would be awesome. If it solves the problem for you, I'll release v3.1 shortly to fix this screwup.


g2bg commented Apr 23, 2017

CentOS 6 fails to make from master:
bison -d conffile.y
conffile.y:36.20-30: syntax error, unexpected {...}
make: *** [conffile.tab.c] Error 1

CentOS 7 completes.
With any_of useall and RR DNS, it still expands in the logged config output.
It goes in order, only connecting to one IP.
If the first fails, it goes to the second.
Connections do not grow with this config.

example:
10.1.1.3:2003
10.1.1.2:2003
10.1.1.1:2003
10.1.1.4:2003
it will always choose 10.1.1.3 unless it's unavailable. Is this expected behavior for any_of useall?
If I specify all 4 IPs in the config, it will connect to all 4.


grobian commented Apr 23, 2017

You can touch conffile.tab.* and conffile.yy.*, since git doesn't store mtimes :( (touching the pregenerated files makes them newer than conffile.y, so make won't try to regenerate them with the old bison).

I haven't found a way to work around this yet.

I'll look into why useall doesn't connect to the others.


grobian commented Apr 23, 2017

hah, use_all never updates the configuration, so the router thinks there's only one entry.


grobian commented Apr 23, 2017

hmmm, test mode shows all entries would get used ...


grobian commented Apr 25, 2017

I've not been able to reproduce the behaviour where it picks the first node; that is actually the behaviour of a failover cluster. Not that I don't trust your observations, but are you sure you're using any_of useall in this case, and that you see no distribution of metrics over all of the expanded hosts?


grobian commented Apr 27, 2017

I think I found a reason/cause for the behaviour you see.

grobian added a commit that referenced this issue Apr 27, 2017
This is likely the problem observed in issue #262 where the first
address is taken all the time, because the full stack of addresses were
assigned to every server.
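
In other words, when use_all expanded the DNS entry into several servers, every expanded server received the whole address list instead of its own single entry, so each of them connected to the first address. A hypothetical sketch of the correct per-server assignment (names assumed, not the actual code):

#include <netdb.h>

/* Hypothetical expanded-server record. */
struct server {
    struct addrinfo *saddr;  /* the one address this server should use */
};

/* Expand a multi-address resolution into individual servers.  Each
 * server gets its own node from the list; assigning `res` to every
 * server (instead of `ai`) would reproduce the reported bug, where
 * everyone connects to the first address. */
static int expand_servers(struct addrinfo *res,
                          struct server *servers, int max)
{
    struct addrinfo *ai;
    int n = 0;

    for (ai = res; ai != NULL && n < max; ai = ai->ai_next)
        servers[n++].saddr = ai;  /* per-server address, not res */

    return n;  /* number of expanded servers */
}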

grobian commented Apr 29, 2017

I think I've fixed this; if not, please reopen.

grobian closed this as completed Apr 29, 2017