DotNetty Remote Transport Issues with .NET Core 2.1 #3506

Closed
DaKaLKa opened this issue Jun 13, 2018 · 72 comments

Comments

@DaKaLKa

DaKaLKa commented Jun 13, 2018

I have a test application with 1 seed node and two service nodes.
After migrating to .NET Core 2.1, I have some "Can't join seed-node" issues using Akka.NET 1.3.6 / 1.3.8. The software is built and executed in Docker with the official Microsoft .NET Core 2.1 images (microsoft/dotnet:2.1-sdk and microsoft/dotnet:2.1-runtime). Sometimes the nodes connect without problems; sometimes I get the "can't join seed-nodes" problem without any changes.
The problem appears on my local laptop and on our Docker server (and some other test systems).

The same application caused no problems when run on Windows through Visual Studio.
Migrating back to .NET Core 2.0 solved the problem for now (project settings and the microsoft/dotnet:2.0 Docker images).

@Danthar
Member

Danthar commented Jun 13, 2018

We don't officially support .NET Core 2.1 yet. Heck, we aren't even on netstandard 2.0 yet (although work is underway). But thanks for confirming that there are indeed issues :)

@Aaronontheweb Aaronontheweb added the linux :penguin: label Jun 13, 2018
@Aaronontheweb
Member

Technically, everything we do should be working on .NET Core 2.1 since .NET Standard 1.6 can be consumed upstream. We'll investigate. Might be a DotNetty issue since v0.4.8 started using some of the unsafe APIs.

@xlegalles

It may be also valid for Windows as our implementation does not work anymore since we moved to .NET Core 2.1. We still need to investigate but the problem is exactly the same: remote actors cannot join anymore. We tested both in Docker and out of container.

@xlegalles

Forget my comment: it was not related to AKKA. So it means that we are able to use AKKA.Net with 2.1 under Windows.

@annymsMthd
Contributor

We are currently running a 1.3.8 cluster with .net core 2.1 in k8s with only a few minor issues.

@Lutando

Lutando commented Jul 2, 2018

@annymsMthd what were those minor issues?

@annymsMthd
Contributor

Sometimes we have a node that starts kicking out other nodes after we deploy even though it's not leader. It just starts marking nodes as down and even though they restart it just targets them again and marks them. When we reset the node that is doing the marking everything works. We only see this when we deploy though so it isn't a show stopper.

@Lutando

Lutando commented Jul 3, 2018

We have a similar behavior. Might be worth investigating.

@dgioulakis

dgioulakis commented Jul 19, 2018

Experiencing the same issue with .NET Core 2.1 on Linux. The problem is random; restarting often works.

@pmorelli92

Could it be solved by upgrading from DotNetty 0.4.8 to 0.5.0?

@mmisztal1980
Contributor

@pmorelli92 just tried to launch my cluster under docker for Windows (linux), with DotNetty.Handlers 0.5.0 . Nodes still fail to connect.

@Aaronontheweb
Member

Based on some of the quality issues we've had with DotNetty recently, I've been exploring a replacement for it built on Akka.Streams.

But that is a ways off still. In the short run, I'll look more closely at the .NET Core 2.1 issues you all have been reporting. Strong chance it's something fucked up with the runtime itself and DotNetty has to patch around it in order to get it to work. Wouldn't be the first time this has happened. Part of the price of being on the bleeding edge.

RE: the cluster issues. We'll look into that too - is that caused by nodes having connectivity problems or does this appear to be a 100% programmatic issue?

@Aaronontheweb Aaronontheweb self-assigned this Sep 10, 2018
@mmisztal1980
Contributor

mmisztal1980 commented Sep 10, 2018

@Aaronontheweb I've examined my hocon multiple times (hocon below is the final configuration) and to be honest I don't see anything wrong with it. (keep in mind that it's been a while).

Hocon obtained from: core-cluster-svc

  akka : {
    stdout-loglevel : DEBUG
    loglevel : INFO
    log-config-on-start : on
    loggers : ["Akka.Logger.Serilog.SerilogLogger, Akka.Logger.Serilog"]
    actor : {
      provider : cluster
      debug : {
        receive : on
        autoreceive : on
        lifecycle : on
        event-stream : on
        unhandled : on
      }
    }
    remote : {
      dot-netty : {
        tcp : {
          transport-class : "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
          transport-protocol : tcp
          hostname : 0.0.0.0
          public-hostname : core-cluster-svc
          port : 9000
        }
      }
    }
    cluster : {
      log-info : on
      seed-nodes : ["akka.tcp://System@core-cluster-svc-1:9000","akka.tcp://System@core-cluster-svc-2:9000","akka.tcp://System@core-cluster-svc:9000"]
      roles : [cluster-seed]
    }
  }

version: '3.4'

services:
  core-cluster-svc:
    image: core-cluster-svc
    build:
      context: .
      dockerfile: src/Core.Cluster.Service/Dockerfile
  core-cluster-svc-1:
    image: core-cluster-svc
    depends_on: 
      - core-cluster-svc
  core-cluster-svc-2:
    image: core-cluster-svc
    depends_on: 
      - core-cluster-svc

Please note that Core.Cluster.Service's purpose is similar to that of Lighthouse's, however I've made it run on top of .net core 2.1 & had it utilize constructs from Microsoft.Extensions.Hosting & Microsoft.Extensions.Configuration

I've performed connectivity check between:

  • core-cluster-svc & core-cluster-svc-1
  • core-cluster-svc & core-cluster-svc-2

The checks have been completed via telnet:

I've attached to the running container via: docker exec -it <core-cluster-svc_containerId> /bin/bash

And then executed:

apt-get update
apt-get install telnet
telnet core-cluster-svc-1 9000
telnet core-cluster-svc-2 9000

Both connections were successful as I'd expect them to be, since docker-compose creates a docker-network with dns support.
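For anyone who wants to repeat this check without installing telnet in the container, here is a minimal sketch of the same TCP reachability test in Python. The function name `can_connect` is made up for illustration. It only shows that the port accepts connections; it says nothing about whether the Akka.Remote handshake will succeed afterwards.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it mirrors `telnet host port` and nothing more.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False
```

Calling `can_connect("core-cluster-svc-1", 9000)` from inside a container on the compose network would correspond to the first telnet command above.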

@alexsandro-xpt

@annymsMthd and @Lutando Do you have a code demo to test this? I will run it under Kubernetes.

@mmisztal1980
Contributor

mmisztal1980 commented Sep 12, 2018

@Aaronontheweb

I've set akka.remote.dot-netty.tcp.log-transport = true and set the overall logging level to DEBUG.

Here's the output I've captured:
log.txt

The below solution is being run under Docker for Windows (linux containers).

version: '2.4'

services:
  dmon-telemetry-db:
    image: docker.elastic.co/elasticsearch/elasticsearch:6.3.0
    healthcheck:
      test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 30s
      retries: 3
  dmon-telemetry-db-dashboard:
    depends_on: 
      dmon-telemetry-db:
          condition: service_healthy
    image: docker.elastic.co/kibana/kibana:6.3.0
  core-cluster-svc: # This is an equivalent of Lighthouse running under .net core 2.1. Instance 1/3
    depends_on: 
      dmon-telemetry-db:
          condition: service_healthy
    image: core-cluster-svc
    build:
      context: .
      dockerfile: src/Core.Cluster.Service/Dockerfile
  core-cluster-svc-1: # Instance 2/3
    image: core-cluster-svc
    depends_on: 
      - core-cluster-svc
  core-cluster-svc-2: # Instance 3/3
    image: core-cluster-svc
    depends_on: 
      - core-cluster-svc

Hope that helps a bit.

@Aaronontheweb
Member

@mmisztal1980

core-cluster-svc-2_1           | Serilog: 2018-09-12T20:23:28.0208726Z Exception caught while converting property value: DotNetty.Common.Utilities.IllegalReferenceCountException: Illegal reference count of 0 for this object
core-cluster-svc-2_1           |    at DotNetty.Buffers.ThrowHelper.ThrowIllegalReferenceCountException(Int32 count)
core-cluster-svc-2_1           |    at DotNetty.Buffers.UnpooledHeapByteBuffer.get_Capacity()
core-cluster-svc-2_1           |    at DotNetty.Buffers.AbstractByteBuffer.ToString()
core-cluster-svc-2_1           |    at Serilog.Capturing.PropertyValueConverter.CreatePropertyValue(Object value, Destructuring destructuring, Int32 depth)
core-cluster-svc-2_1           |    at Serilog.Capturing.PropertyValueConverter.CreatePropertyValue(Object value, Destructuring destructuring)

Sure looks like DotNetty catching fire under the covers. Problem with their reference counting system...

@Aaronontheweb
Member

So I'll need to investigate why this happens on .NET Core 2.1 but not elsewhere (AFAIK) - I'll begin an investigation and I'm going to upgrade this issue to "confirmed bug."

@Aaronontheweb Aaronontheweb added this to the 1.3.10 milestone Sep 12, 2018
@Aaronontheweb Aaronontheweb changed the title Test compatibility with .NET Core 2.1 under Linux DotNetty Remote Transport Issues with .NET Core 2.1 Sep 12, 2018
@Aaronontheweb Aaronontheweb removed the linux :penguin: label Sep 12, 2018
@Aaronontheweb
Member

This is also neither a Windows- nor a Linux-specific issue. It appears to affect both runtimes.

@ianhorton

I saw this issue in K8s. I put together a repo with some .NET/Akka versions that will not form a cluster happily: https://gitlab.com/ian.horton.vibe/akka-cluster-issue/

@Aaronontheweb
Member

Aaronontheweb commented Sep 22, 2018 via email

@Aaronontheweb
Member

I'm convinced that part of the solution to this problem lies in solving #3573 - we've had issues in the past with .NET Core on Linux demonstrating different behavior than it does on Windows, and outside of a few built-in specs we have inside the Akka.Remote.Tests and Akka.Cluster.Tests libraries, we don't do much in the way of automated End2End testing on Linux. This stems largely from a decision we made last year to punt on trying to get the MNTR to run on POSIX systems.

It's time to revisit that and fix it - only way we can catch these issues on Linux. Assuming that we're going to get identical behavior on Windows and Linux from the same runtime and dependencies is clearly not good enough.

@Aaronontheweb
Member

Created a real sample to try to reproduce this using DotNetty only: https://github.com/Aaronontheweb/DotNettyRepro

I can get it to work on .NET Core 2.1 and .NET Core 2.0, so I'm going to need to take a look at our remoting code and see if there's something we need to change in the architecture of the DotNettyTransport itself.

@Aaronontheweb
Member

Tested one theory just now, which is that DNS enumeration changed between .NET Core 2.0 and 2.1 - anecdotally this doesn't appear to be the case: Aaronontheweb/DotNettyRepro#2

@Aaronontheweb
Member

Here we go

[WARNING][11/22/2018 20:48:14][Thread 0008][remoting] Tried to associate with unreachable remote address [akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2550]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2550] Caused by: [System.AggregateException: One or more errors occurred. (No such device or address) ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: No such device or address
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
   at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.ResolveNameAsync(DnsEndPoint address, AddressFamily addressFamily)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.DnsToIPEndpoint(DnsEndPoint dns)
   at Akka.Remote.Transport.DotNetty.TcpTransport.MapEndpointAsync(EndPoint socketAddress)
   at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__11_54(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
---> (Inner Exception #0) System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000005, 6): No such device or address
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
   at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.ResolveNameAsync(DnsEndPoint address, AddressFamily addressFamily)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.DnsToIPEndpoint(DnsEndPoint dns)
   at Akka.Remote.Transport.DotNetty.TcpTransport.MapEndpointAsync(EndPoint socketAddress)
   at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress)<---

@Aaronontheweb
Member

And lo and behold, it looks like .NET Core itself is experiencing random failures with their DNS system on Linux... As recently as two weeks ago https://github.com/dotnet/coreclr/issues/20924

And

https://github.com/dotnet/corefx/issues/28051

And

https://github.com/dotnet/corefx/issues/15640

@Aaronontheweb
Member

And there were some major changes to the Dns class introduced into .NET Core 2.1: dotnet/corefx#26850

@Aaronontheweb
Member

The changes submitted in that PR should affect Windows only - looking through it now to see if there's anything in the API surface area that could have affected Linux, since we can see via @ianhorton's sample that keeping everything else constant (Akka version, DotNetty version, etc) but switching from .NET Core 2.0 to 2.1 is enough to go from 100% ok to persistent failure.

@Aaronontheweb
Member

This PR, on the other hand, directly affected the Linux DNS resolution code in .NET Core 2.1: dotnet/corefx#26926

@Aaronontheweb
Member

https://github.com/dotnet/corefx/issues/32611 - another culprit....

@Aaronontheweb
Member

I've given this a try on the latest .NET Core 2.2 preview as well - has the same issue as .NET Core 2.1.

@Aaronontheweb
Member

Aaronontheweb commented Nov 22, 2018

So there were some configuration bugs in @ianhorton's sample that caused the solution to reproduce failures continuously for .NET Core 2.1 - namely the K8s YAML files had the wrong port numbers and ActorSystem names for everything except .NET Core 2.0 / Akka.NET v1.3.8. I fixed those and I can now get a cluster to form just fine on .NET Core 2.1.

However, I am seeing issues similar to what @annymsMthd described where nodes become randomly unreachable. Also seeing the DNS issues I tagged earlier in this thread come up as well, but the cluster was able to recover from those upon reconnect - seems like the new DNS code introduced in .NET Core 2.1 is flaky on Linux, as is indicated by the other issues.

I'm going to look into why some of the nodes were being marked as unreachable in .NET Core 2.1, that looks like an issue still, and I'm able to reproduce that on Akka.NET v1.3.8.

TL;DR; If you are unable to get a cluster to form at all on .NET Core 2.1, it's likely a configuration issue on your end.

Please thoroughly check to make sure you're using the right port numbers, ActorSystem names, and everything else before opening an issue.
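As a rough illustration of the kind of check meant here, the sketch below parses `akka.tcp://System@host:port` seed addresses and flags the mistakes seen in this thread: a mismatched ActorSystem name, an unexpected port, and a seed listed twice. It is written in Python purely for illustration; the function names are hypothetical and nothing here is part of Akka.NET. The port check assumes every node in the deployment listens on the same port, which is true for the samples in this thread but not in general.

```python
from urllib.parse import urlsplit


def parse_akka_address(address: str):
    """Split 'akka.tcp://System@host:port' into (system, host, port)."""
    parts = urlsplit(address)
    if parts.scheme != "akka.tcp" or parts.username is None or parts.port is None:
        raise ValueError(f"not a complete akka.tcp address: {address!r}")
    return parts.username, parts.hostname, parts.port


def find_seed_problems(local_system: str, seed_nodes, expected_port=None):
    """Return a list of human-readable problems found in the seed-node list."""
    problems = []
    seen = set()
    for seed in seed_nodes:
        system, host, port = parse_akka_address(seed)
        if system != local_system:
            problems.append(f"{seed}: ActorSystem name {system!r} != {local_system!r}")
        # Assumption: all nodes share one port; pass None to skip this check.
        if expected_port is not None and port != expected_port:
            problems.append(f"{seed}: port {port} != expected {expected_port}")
        if (system, host, port) in seen:
            problems.append(f"{seed}: listed more than once")
        seen.add((system, host, port))
    return problems
```

An empty list means only that these particular mistakes were not found. It does not prove that the configuration is otherwise correct.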

@Aaronontheweb
Member

Also seeing the DNS issues I tagged earlier in this thread come up as well, but the cluster was able to recover from those upon reconnect

Just to clarify this - means that the cluster's automatic reconnection code handled this on its own. No human intervention required.
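To make the "recover on reconnect" idea concrete, here is a generic sketch of retrying a name lookup when the resolver fails transiently (as in the "Resource temporarily unavailable" errors reported on Linux). This is not how Akka.Remote or DotNetty implement reconnection. It is only an illustration in Python, and `resolve_with_retry` and its parameters are made-up names.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=3, delay=0.5, resolver=socket.getaddrinfo):
    """Resolve host:port, retrying when the lookup fails.

    Returns whatever the resolver returns, or re-raises the last error
    once all attempts are used up. The resolver is injectable for testing.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port, type=socket.SOCK_STREAM)
        except OSError as error:  # socket.gaierror is a subclass of OSError
            last_error = error
            if attempt < attempts:
                time.sleep(delay)  # simple fixed pause before the next try
    raise last_error
```

A fixed delay keeps the sketch short. A real reconnection loop would more likely back off between attempts and give up only after a configurable deadline.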

@mmisztal1980
Contributor

I'll follow up with my .net core 2.1 sample tomorrow and verify if there is a config issue.
Thanks @Aaronontheweb

@mmisztal1980
Contributor

I've re-examined my config, found 2 issues. Having fixed them, the cluster was formed.

core-cluster-svc-3_1           | [INFO][11/23/2018 00:43:27][Thread 0005][[akka://System/system/log1-SerilogLogger#1188292203]] SerilogLogger started
core-cluster-svc-2_1           | [INFO][11/23/2018 00:43:27][Thread 0005][[akka://System/system/log1-SerilogLogger#1787629061]] SerilogLogger started
dmon-telemetry-db_1            | [2018-11-23T00:43:29,039][WARN ][o.e.d.a.a.i.t.p.PutIndexTemplateRequest] [U5rCLOu] Deprecated field [template] used, replaced by [index_patterns]
dmon-telemetry-db_1            | [2018-11-23T00:43:29,206][WARN ][o.e.d.a.a.i.t.p.PutIndexTemplateRequest] [U5rCLOu] Deprecated field [template] used, replaced by [index_patterns]
core-cluster-svc-2_1           | [00:43:29 INF] ClusterService Started
core-cluster-svc-3_1           | [00:43:29 INF] ClusterService Started
core-cluster-svc-3_1           | Application started. Press Ctrl+C to shut down.
core-cluster-svc-3_1           | Hosting environment: Production
core-cluster-svc-3_1           | Content root path: /app/
core-cluster-svc-2_1           | Application started. Press Ctrl+C to shut down.
core-cluster-svc-2_1           | Hosting environment: Production
core-cluster-svc-2_1           | Content root path: /app/
dmon-telemetry-db_1            | [2018-11-23T00:43:29,898][WARN ][o.e.d.a.a.i.t.p.PutIndexTemplateRequest] [U5rCLOu] Deprecated field [template] used, replaced by [index_patterns]
core-cluster-svc-1_1           | [00:43:30 INF] ClusterService Started
core-cluster-svc-1_1           | Application started. Press Ctrl+C to shut down.
core-cluster-svc-1_1           | Hosting environment: Production
core-cluster-svc-1_1           | Content root path: /app/
core-cluster-svc-1_1           | [00:43:30 INF] This node is UP
core-cluster-svc-1_1           | [00:43:34 INF] Node [akka.tcp://System@core-cluster-svc-2:9000] is JOINING, roles [cluster-seed]
core-cluster-svc-1_1           | [00:43:34 INF] Node [akka.tcp://System@core-cluster-svc-3:9000] is JOINING, roles [cluster-seed]
core-cluster-svc-3_1           | [00:43:34 INF] Welcome from [akka.tcp://System@core-cluster-svc-1:9000]
core-cluster-svc-2_1           | [00:43:34 INF] Welcome from [akka.tcp://System@core-cluster-svc-1:9000]
core-cluster-svc-2_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-1:9000, Uid=1923374807 status = Up, role=[cluster-seed], upNumber=1)
core-cluster-svc-1_1           | [00:43:34 INF] Leader is moving node [akka.tcp://System@core-cluster-svc-2:9000] to [Up]
core-cluster-svc-1_1           | [00:43:34 INF] Leader is moving node [akka.tcp://System@core-cluster-svc-3:9000] to [Up]
core-cluster-svc-1_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-2:9000, Uid=1135850219 status = Up, role=[cluster-seed], upNumber=2)
core-cluster-svc-1_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-3:9000, Uid=1798594014 status = Up, role=[cluster-seed], upNumber=3)
core-cluster-svc-3_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-1:9000, Uid=1923374807 status = Up, role=[cluster-seed], upNumber=1)
core-cluster-svc-2_1           | [00:43:34 INF] This node is UP
core-cluster-svc-2_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-2:9000, Uid=1135850219 status = Up, role=[cluster-seed], upNumber=2)
core-cluster-svc-2_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-3:9000, Uid=1798594014 status = Up, role=[cluster-seed], upNumber=3)
core-cluster-svc-2_1           | [00:43:34 INF] Message MemberUp from akka://System/system/cluster/core/publisher to akka://System/system/cluster/$a was not delivered. 1 dead letters encountered.
core-cluster-svc-3_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-2:9000, Uid=1135850219 status = Up, role=[cluster-seed], upNumber=2)
core-cluster-svc-3_1           | [00:43:34 INF] Member is Up: Member(address = akka.tcp://System@core-cluster-svc-3:9000, Uid=1798594014 status = Up, role=[cluster-seed], upNumber=3)
core-cluster-svc-3_1           | [00:43:34 INF] This node is UP
dmon-telemetry-db_1            | [2018-11-23T00:43:36,206][INFO ][o.e.c.m.MetaDataMappingService] [U5rCLOu] [application-log-2018.11/9-BO_oF0QwKzxkYAKBJnrA] update_mapping [logevent]

@AndreSteenbergen
Contributor

One question though... Is this with dotnetty updated to 0.6? Or are newer versions automatically using 0.6?

@ianhorton

Thanks @Aaronontheweb, apologies for my poorly configured example.

@Aaronontheweb
Member

@AndreSteenbergen ah, my reproduction fix was with v0.4.8. We're still going to upgrade to v0.6.0.

@Aaronontheweb
Member

@ianhorton I really appreciated your sample - gave me an opportunity to really explore the issue in-depth. Glad it was a configuration issue and not something much more seriously wrong with the .NET networking stack.

@Aaronontheweb
Member

Ok, so a bit more work with @ianhorton's example over the weekend and I have a theory as to what is going on here:

Seed node:

[WARNING][11/24/2018 18:49:24][Thread 0008][remoting] Tried to associate with unreachable remote address [akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552] Caused by: [System.AggregateException: One or more errors occurred. (Resource temporarily unavailable) ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: Resource temporarily unavailable
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
   at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.ResolveNameAsync(DnsEndPoint address, AddressFamily addressFamily)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.DnsToIPEndpoint(DnsEndPoint dns)
   at Akka.Remote.Transport.DotNetty.TcpTransport.MapEndpointAsync(EndPoint socketAddress)
   at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress)
   --- End of inner exception stack trace ---
   at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
   at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__11_54(Task`1 result)
   at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
--- End of stack trace from previous location where exception was thrown ---
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(Task& currentTaskSlot)
---> (Inner Exception #0) System.Net.Internals.SocketExceptionFactory+ExtendedSocketException (00000001, 11): Resource temporarily unavailable
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostEntry(IAsyncResult asyncResult)
   at System.Net.Dns.<>c.<GetHostEntryAsync>b__27_1(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.ResolveNameAsync(DnsEndPoint address, AddressFamily addressFamily)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.DnsToIPEndpoint(DnsEndPoint dns)
   at Akka.Remote.Transport.DotNetty.TcpTransport.MapEndpointAsync(EndPoint socketAddress)
   at Akka.Remote.Transport.DotNetty.TcpTransport.AssociateInternal(Address remoteAddress)
   at Akka.Remote.Transport.DotNetty.DotNettyTransport.Associate(Address remoteAddress)<---

Failed node:

Initializing
RunSingle
[Docker-Bootstrap] IP=cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138
[Docker-Bootstrap] PORT=
[Docker-Bootstrap] SEEDS=akka.tcp://cluster-system@cluster-seed-netcore21-akka138-0.cluster-seed-netcore21-akka138:2552,akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552,akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552
[INFO][11/24/2018 18:49:16][Thread 0001][remoting] Starting remoting
[INFO][11/24/2018 18:49:17][Thread 0001][remoting] Remoting started; listening on addresses : [akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552]
[INFO][11/24/2018 18:49:17][Thread 0001][remoting] Remoting now listens on addresses: [akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552]
[INFO][11/24/2018 18:49:17][Thread 0001][Cluster] Cluster Node [akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552] - Starting up...
[INFO][11/24/2018 18:49:17][Thread 0001][Cluster] Cluster Node [akka.tcp://cluster-system@cluster-seed-netcore21-akka138-1.cluster-seed-netcore21-akka138:2552] - Started up successfully
Members: 0
Unreachable: 0
Leader: no leader
Members: 0
Unreachable: 0
Leader: no leader
Members: 0
Unreachable: 0
Leader: no leader
[WARNING][11/24/2018 18:49:27][Thread 0004][[akka://cluster-system/system/cluster/core/daemon/joinSeedNodeProcess-1#1215423967]] Couldn't join seed nodes after [2] attempts, will try again. seed-nodes=[akka.tcp://cluster-system@cluster-seed-netcore21-akka138-0.cluster-seed-netcore21-akka138:2552]
Members: 0
Unreachable: 0
Leader: no leader
[WARNING][11/24/2018 18:49:32][Thread 0003][[akka://cluster-system/system/cluster/core/daemon/joinSeedNodeProcess-1#1215423967]] Couldn't join seed nodes after [3] attempts, will try again. seed-nodes=[akka.tcp://cluster-system@cluster-seed-netcore21-akka138-0.cluster-seed-netcore21-akka138:2552]
Members: 0
Unreachable: 0
Leader: no leader
[WARNING][11/24/2018 18:49:37][Thread 0010][[akka://cluster-system/system/cluster/core/daemon/joinSeedNodeProcess-1#1215423967]] Couldn't join seed nodes after [4] attempts, will try again. seed-nodes=[akka.tcp://cluster-system@cluster-seed-netcore21-akka138-0.cluster-seed-netcore21-akka138:2552]
Members: 0
Unreachable: 0
Leader: no leader
[WARNING][11/24/2018 18:49:42][Thread 0018][[akka://cluster-system/system/cluster/core/daemon/joinSeedNodeProcess-1#1215423967]] Couldn't join seed nodes after [5] attempts, will try again. seed-nodes=[akka.tcp://cluster-system@cluster-seed-netcore21-akka138-0.cluster-seed-netcore21-akka138:2552]
Members: 0
Unreachable: 0
Leader: no leader

These DNS failures introduced on POSIX systems in .NET Core 2.1 are easy to reproduce, hence why those issues I linked to earlier are open on the CoreFX repo. However, what I suspect is happening is that at least one of the nodes in @ianhorton's reproduction sample fails to resolve DNS on host binding, and I think our DotNettyTransport is swallowing that exception and never actually starting correctly. So the reason why we've had some users running on .NET Core 2.1 just fine, I suspect, is because they're using public-hostname for their DNS and 0.0.0.0 for hostname, which means the transport never actually has to perform a DNS resolution on bind.

I'm going to be traveling most of today but this is the thing I want to test ASAP.

@Aaronontheweb
Member

Aaronontheweb commented Nov 26, 2018

Well, scratch that theory - @ianhorton's sample uses https://github.com/petabridge/akkadotnet-bootstrap and thus doesn't actually need to perform a DNS resolution on startup. So in that case, I'm not sure why things aren't reconnecting - although there definitely are some DNS issues occurring.

@AndreSteenbergen
Contributor

@Aaronontheweb have you upgraded dotnetty? I am on a linux vps, when running net core 2.0 and nuget akka 1.3.10, I get a few unexplainable errors every now and then, ranging from Akka PDU issues to not deserializing contracts. The problems have disappeared since I upgraded to dotnetty 0.6. I don't know if this will help in any way, just thought it might be worth a mention ...

@Aaronontheweb
Member

@AndreSteenbergen it looks like the DotNetty upgrade to v0.6.0 has resolved most of the issues. DNS problems still show up every now and then but otherwise we're good. I'll get my PR resolved ASAP so we can roll this into a v1.3.11 update.

@senzacionale

@mmisztal1980 can you please post the fixed config file and explain what was wrong? I am having the same problem.

Thank you

@Aaronontheweb
Member

@senzacionale upgrading to DotNetty v0.6.0 should solve the problem. I'm working on getting this out in Akka.NET v1.3.11 but thanks to some really annoying byte-rot in the .NET Core ecosystem, I'm having to do a bit of yak-shaving to get our test suite to play nicely with it. Watch this repository for updates.

@mmisztal1980
Contributor

@senzacionale my issues were with the seed nodes & public hostname(s). unfortunately I'm not using hocon to configure my app so I can't help you here.

@coinpennant

@mmisztal1980 what are you using? Maybe I also need to change :)

@AndreSteenbergen
Contributor

@Aaronontheweb I can confirm an upgrade to dotnetty 0.6 and akka.net 1.3.10, works well on linux (Ubuntu 16.04 LTS) in combination with net core 2.1. This combo with netcore 2.0 gave me some Akka PDU issues and protobuf issues. I have been running about 24 hours now, without any issues regarding missed frames, connectivity errors or whatsoever.

@Aaronontheweb
Member

Resolved via #3633
