Remote cluster DNS cached indefinitely #28858

Closed
dylanPowers opened this issue Mar 1, 2018 · 9 comments · Fixed by #32764
Labels: >bug, :Distributed/Network (Http and internode communication implementations)

Comments

dylanPowers commented Mar 1, 2018

I'm finding that cross cluster search doesn't re-resolve seed hostnames and instead caches the resolved addresses indefinitely. Correct me if my analysis is wrong, but it appears that seed hostnames are resolved only at node startup and when the seed list is updated. I'm seeing this at https://github.com/elastic/elasticsearch/blob/v6.3.2/server/src/main/java/org/elasticsearch/transport/RemoteClusterService.java#L318 and https://github.com/elastic/elasticsearch/blob/v6.3.2/server/src/main/java/org/elasticsearch/transport/RemoteClusterService.java#L333.

This makes cross cluster search impossible to use in any environment where IPs change, such as a containerized environment.

The best workaround I can think of is to have each node routinely poll its own /_remote/info endpoint and, whenever a remote shows an empty seeds list because it has been disconnected, dynamically remove the seed from the cluster settings and re-add it.
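
A minimal sketch of that workaround in Java might look like the following. It covers only the decision and the two request bodies for `PUT /_cluster/settings`; the polling loop and the HTTP calls are left out. The class and method names (`SeedRefresher`, `needsReseed`, `clearSeedsBody`, `setSeedsBody`) are hypothetical, `search.remote.<alias>.seeds` is assumed to be the setting key on the 5.6.x/6.x versions discussed here, and using `persistent` rather than `transient` settings is also an assumption.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: decides when to re-seed and builds the settings bodies.
final class SeedRefresher {

    // The workaround triggers when /_remote/info reports no seeds for the alias.
    static boolean needsReseed(List<String> reportedSeeds) {
        return reportedSeeds.isEmpty();
    }

    // Assumed setting key for the versions discussed in this thread.
    static String settingKey(String alias) {
        return "search.remote." + alias + ".seeds";
    }

    // Body that removes the seed setting (a null value resets a cluster setting).
    static String clearSeedsBody(String alias) {
        return "{\"persistent\":{\"" + settingKey(alias) + "\":null}}";
    }

    // Body that re-adds the seeds, which makes the node resolve the hostnames again.
    // Assumes alias and seeds are plain strings that need no JSON escaping.
    static String setSeedsBody(String alias, List<String> seeds) {
        String joined = seeds.stream()
                .map(seed -> "\"" + seed + "\"")
                .collect(Collectors.joining(","));
        return "{\"persistent\":{\"" + settingKey(alias) + "\":[" + joined + "]}}";
    }

    public static void main(String[] args) {
        // "logs" and "es-logs.example.internal" are placeholder values.
        List<String> configured = Arrays.asList("es-logs.example.internal:9300");
        List<String> reported = Arrays.asList(); // pretend /_remote/info returned no seeds
        if (needsReseed(reported)) {
            System.out.println(clearSeedsBody("logs"));
            System.out.println(setSeedsBody("logs", configured));
        }
    }
}
```

The two bodies would be sent one after the other, so that the second update is seen as a change to the seed list and triggers a fresh lookup.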

@javanna added the :Search/Search, discuss, and :Distributed/Network labels and removed the :Search/Search label on Mar 1, 2018

CpuID commented Mar 4, 2018

Experienced the same thing today. We run cross cluster search (CCS) nodes in Amazon ECS, pointing at ECS-hosted coordinator nodes for AZ-specific clusters (split by AZ because cross-AZ data transfer costs previously hit ~4 digits). Behind each AZ-specific set of coordinator nodes are stateful EC2-based ES data nodes.

An AZ-specific coordinator node was replaced, after which the CCS nodes could no longer reach that cluster and queries failed. We ended up having to replace the CCS nodes so that they would rediscover the IP addresses on startup.

This would be awesome to have resolved.


CpuID commented Mar 4, 2018

@dylanPowers can you confirm which version you are running right now? We are on 5.6.x and were considering upgrading to 6.1.x sooner rather than later, until I saw this reported :)

@dylanPowers (Author)

Currently running 5.6.x, with the intent to upgrade to 6.x for the optional remote cluster feature. I didn't see anything in the source or change log that indicates a fix. I ended up creating a separate service that polls every node's /_remote/info endpoint and applies the workaround I described above (removing and re-adding the seed when the seeds list comes back empty), and it works pretty well.

@jasontedor added the :Distributed/Network label and removed the discuss and :Distributed/Network labels on Mar 9, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra

@thomasriley

Hey,

We are also suffering from this issue in our Elasticsearch environment (ES 6.2.2) when using Cross Cluster Search. It seems that once it connects to the seed at a specific IP, it does not look up the DNS name again.

Elasticsearch should observe the TTL of the DNS record, so that if the Elasticsearch instances behind the seed's DNS name change, Elasticsearch moves the connection to whichever instance the new DNS lookup returns. This is especially important in a failure scenario: when the instance it is connected to is no longer available, it should look up the DNS name again, no matter how much of the TTL remains.
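
As a rough illustration of "look the name up again on failure", the sketch below resolves the hostname before every connection attempt instead of reusing a previously resolved address. It is not Elasticsearch code: `ReResolvingConnector`, `connectWithFreshLookup`, and the `tryConnect` callback are hypothetical names. Note that `InetAddress.getByName` still answers from the JVM's own DNS cache, so a lookup is only as fresh as the `networkaddress.cache.ttl` security property allows.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.Optional;
import java.util.function.Predicate;

// Hypothetical helper: never holds on to an address across attempts.
final class ReResolvingConnector {

    // tryConnect stands in for the real connection logic and returns true on success.
    static Optional<InetSocketAddress> connectWithFreshLookup(
            String host, int port, int maxAttempts, Predicate<InetSocketAddress> tryConnect) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                // Resolve again on every attempt so a replaced instance can be picked up.
                InetSocketAddress address = new InetSocketAddress(InetAddress.getByName(host), port);
                if (tryConnect.test(address)) {
                    return Optional.of(address);
                }
            } catch (UnknownHostException e) {
                // The name did not resolve this time; count it as a failed attempt.
            }
        }
        return Optional.empty();
    }
}
```

A caller would pass the seed's hostname and port plus a callback that opens the transport connection; a real implementation would presumably also wait between attempts.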

The impact here is that if you replace the node that is being used as the seed, Kibana users are faced with a not-so-nice error:

Error: Request to Elasticsearch failed: {"error":{"root_cause":[{"type":"connect_transport_exception","reason":"[][ip:9300] connect_exception"}],"type":"transport_exception","reason":"unable to communicate with remote cluster [logs]","caused_by":{"type":"connect_transport_exception","reason":"[][ip:9300] connect_exception","caused_by":{"type":"annotated_no_route_to_host_exception","reason":"No route to host: hostname-to-out-cluster/ip:9300","caused_by":{"type":"no_route_to_host_exception","reason":"No route to host"}}}},"status":500}

Cheers,
Tom

@colings86 added the >bug label on Apr 24, 2018
chernecov (Contributor) commented May 31, 2018

Hi. Any updates on this topic?
No pressure, but when can we expect this problem to be addressed?
Thanks.

A manual Elasticsearch restart helps, but setting up a cron job for that is not the answer :)


CpuID commented Jun 27, 2018

This would be great to have resolved... I don't feel confident enough in my Java skills to implement it though. Anyone willing to contribute a fix?

@original-brownbear (Member)

Fix incoming in #32764


CpuID commented Aug 10, 2018

@original-brownbear woohoo thx :)

original-brownbear added a commit that referenced this issue Aug 18, 2018
* Lazy resolve DNS (i.e. `String` to `DiscoveryNode`) to not run into indefinitely caching lookup issues (provided the JVM dns cache is configured correctly as explained in https://www.elastic.co/guide/en/elasticsearch/reference/6.3/networkaddress-cache-ttl.html)
   * Changed `InetAddress` type to `String` for that higher up the stack
   * Passed down `Supplier<DiscoveryNode>` instead of outright `DiscoveryNode` from `RemoteClusterAware#buildRemoteClustersSeeds` on to lazy resolve DNS when the `DiscoveryNode` is actually used (could've also passed down the value of `clusterName = REMOTE_CLUSTERS_SEEDS.getNamespace(concreteSetting)` together with the `List<String>` of hosts, but this route seemed to introduce less duplication and resulted in a significantly smaller changeset).
* Closes #28858
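
The idea of passing a `Supplier` down the stack instead of an already resolved value can be sketched roughly as follows. This is not the code from #32764: it uses `InetSocketAddress` as a stand-in for `DiscoveryNode`, and `LazySeeds`, `lazy`, and `lazySeed` are hypothetical names. The `host:port` string is parsed up front, while the DNS lookup is deferred until the supplier is called.

```java
import java.net.InetSocketAddress;
import java.util.List;
import java.util.function.Supplier;
import java.util.stream.Collectors;

// Hypothetical illustration of lazy seed resolution.
final class LazySeeds {

    // Turns configured "host:port" strings into suppliers; no DNS lookup happens here.
    static List<Supplier<InetSocketAddress>> lazy(List<String> seeds) {
        return seeds.stream().map(LazySeeds::lazySeed).collect(Collectors.toList());
    }

    static Supplier<InetSocketAddress> lazySeed(String seed) {
        int colon = seed.lastIndexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("expected host:port but got [" + seed + "]");
        }
        String host = seed.substring(0, colon);
        int port = Integer.parseInt(seed.substring(colon + 1));
        // The InetSocketAddress(String, int) constructor attempts to resolve the
        // hostname when it runs, so every get() performs a new lookup
        // (subject to the JVM DNS cache).
        return () -> new InetSocketAddress(host, port);
    }
}
```

With an eager version, the address would be fixed when the settings are parsed; here each reconnect that calls `get()` can observe a changed DNS record.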
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Aug 19, 2018
original-brownbear added a commit that referenced this issue Aug 19, 2018
jasontedor pushed a commit that referenced this issue Aug 21, 2018