-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DotNetty Remote Transport Issues with .NET Core 2.1 #3506
Comments
We don't officially support .NET Core 2.1 yet. Heck, we aren't even on netstandard 2.0 yet (although work is underway). But thanks for confirming that there are indeed issues :) |
Technically, everything we do should be working on .NET Core 2.1 since .NET Standard 1.6 can be consumed upstream. We'll investigate. Might be a DotNetty issue since v0.4.8 started using some of the unsafe APIs. |
It may be also valid for Windows as our implementation does not work anymore since we moved to .NET Core 2.1. We still need to investigate but the problem is exactly the same: remote actors cannot join anymore. We tested both in Docker and out of container. |
Forget my comment: it was not related to AKKA. So it means that we are able to use AKKA.Net with 2.1 under Windows. |
We are currently running a 1.3.8 cluster with .net core 2.1 in k8s with only a few minor issues. |
@annymsMthd what were those minor issues? |
Sometimes we have a node that starts kicking out other nodes after we deploy even though it's not leader. It just starts marking nodes as down and even though they restart it just targets them again and marks them. When we reset the node that is doing the marking everything works. We only see this when we deploy though so it isn't a show stopper. |
We have a similar behavior. Might be worth investigating. |
Experiencing the same issue. Net core 2.1 on Linux. The problem is random ; restarting often works |
Could it be that it get solved if instead of using DotNetty 0.4.8 we upgrade to 0.5.0? |
@pmorelli92 just tried to launch my cluster under docker for Windows (linux), with |
Based on some of the quality issues we've had with DotNetty recently, I've been exploring a replacement for it built on Akka.Streams. But that is a ways off still. In the short run, I'll look more closely at the .NET Core 2.1 issues you all have been reporting. Strong chance it's something fucked up with the runtime itself and DotNetty has to patch around it in order to get it to work. Wouldn't be the first time this has happened. Part of the price of being on the bleeding edge. RE: the cluster issues. We'll look into that too - is that caused by nodes having connectivity problems or does this appear to be a 100% programmatic issue? |
@Aaronontheweb I've examined my hocon multiple times (hocon below is the final configuration) and to be honest I don't see anything wrong with it. (keep in mind that it's been a while). Hocon obtained from: core-cluster-svc akka : {
stdout-loglevel : DEBUG
loglevel : INFO
log-config-on-start : on
loggers : ["Akka.Logger.Serilog.SerilogLogger, Akka.Logger.Serilog"]
actor : {
provider : cluster
debug : {
receive : on
autoreceive : on
lifecycle : on
event-stream : on
unhandled : on
}
}
remote : {
dot-netty : {
tcp : {
transport-class : "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
transport-protocol : tcp
hostname : 0.0.0.0
public-hostname : core-cluster-svc
port : 9000
}
}
}
cluster : {
log-info : on
seed-nodes : ["akka.tcp://System@core-cluster-svc-1:9000","akka.tcp://System@core-cluster-svc-2:9000","akka.tcp://System@core-cluster-svc:9000"]
roles : [cluster-seed]
}
} version: '3.4'
services:
core-cluster-svc:
image: core-cluster-svc
build:
context: .
dockerfile: src/Core.Cluster.Service/Dockerfile
core-cluster-svc-1:
image: core-cluster-svc
depends_on:
- core-cluster-svc
core-cluster-svc-2:
image: core-cluster-svc
depends_on:
- core-cluster-svc Please note that Core.Cluster.Service's purpose is similar to that of Lighthouse's, however I've made it run on top of .net core 2.1 & had it utilize constructs from I've performed connectivity check between:
The checks have been completed via telnet: I've attached to the running container via: And then executed: apt-get update
apt-get install telnet
telnet core-cluster-svc-1 9000
telnet core-cluster-svc-2 9000 Both connections were successful as I'd expect them to be, since docker-compose creates a docker-network with dns support. |
@annymsMthd and @Lutando Are you have a code demo to test it?. I will run it under Kubernetes. |
I've set Here's the output I've captured: The below solution is being run under Docker for Windows (linux containers). version: '2.4'
services:
dmon-telemetry-db:
image: docker.elastic.co/elasticsearch/elasticsearch:6.3.0
healthcheck:
test: ["CMD-SHELL", "curl --silent --fail localhost:9200/_cluster/health || exit 1"]
interval: 30s
timeout: 30s
retries: 3
dmon-telemetry-db-dashboard:
depends_on:
dmon-telemetry-db:
condition: service_healthy
image: docker.elastic.co/kibana/kibana:6.3.0
core-cluster-svc: # This is an equivalent of Lighthouse running under .net core 2.1. Instance 1/3
depends_on:
dmon-telemetry-db:
condition: service_healthy
image: core-cluster-svc
build:
context: .
dockerfile: src/Core.Cluster.Service/Dockerfile
core-cluster-svc-1: # Instance 2/3
image: core-cluster-svc
depends_on:
- core-cluster-svc
core-cluster-svc-2: # Instance 3/3
image: core-cluster-svc
depends_on:
- core-cluster-svc Hope that helps a bit. |
Sure looks like DotNetty catching fire under the covers. Problem with their reference counting system... |
So I'll need to investigate why this happens on .NET Core 2.1 but not elsewhere (AFAIK) - I'll begin an investigation and I'm going to upgrade this issue to "confirmed bug." |
This is also neither a Windows or Linux-specific issue. Appears to be both runtimes. |
I saw this issue in K8s, I put together a repo with some net/akka versions that have will not form a cluster happily. https://gitlab.com/ian.horton.vibe/akka-cluster-issue/ |
Thanks for sending this.. I'm trying to imagine what change we could have introduced in v1.3.9 that could have caused this, but only on Linux
…Sent from my iPhone
On Sep 22, 2018, at 12:35 PM, Ian Horton ***@***.***> wrote:
I saw this issue in K8s, I put together a repo with some net/akka versions that have will not form a cluster happily. https://gitlab.com/ian.horton.vibe/akka-cluster-issue/
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
I'm convinced that part of the solution to this problem relies in solving #3573 - we've had issues in the past with .NET Core on Linux demonstrating different behavior than it does on Windows and outside of a few built-in specs we have inside the Akka.Remote.Tests and Akka.Cluster.Tests libraries, we don't do much in the way of automated End2End testing on Linux. This stems largely from a decision we made last year to punt on trying to get the MNTR to run on POSIX systems. It's time to revisit that and fix it - only way we can catch these issues on Linux. Assuming that we're going to get identical behavior on Windows and Linux from the same runtime and dependencies is clearly not good enough. |
Created a real sample to try to reproduce this using DotNetty only: https://github.com/Aaronontheweb/DotNettyRepro I can get it to work on .NET Core 2.1 and .NET Core 2.0, so I'm going to need to take a look at our remoting code and see if there's something we need to change in the architecture of the |
Tested one theory just now, which is that DNS enumeration changed between .NET Core 2.0 and 2.1 - anecdotally this doesn't appear to be the case: Aaronontheweb/DotNettyRepro#2 |
Here we go
|
And lo and behold, it looks like .NET Core itself is experiencing random failures with their DNS system on Linux... As recently as two weeks ago https://github.com/dotnet/coreclr/issues/20924 And https://github.com/dotnet/corefx/issues/28051 And |
And there were some major changes to the |
The changes submitted in that PR should affect Windows only - looking through it now to see if there's anything in the API surface area that could have affected Linux, since we can see via @ianhorton's sample that keeping everything else constant (Akka version, DotNetty version, etc) but switching from .NET Core 2.0 to 2.1 is enough to go from 100% ok to persistent failure. |
This PR, on the other hand, directly affected the Linux DNS resolution code in .NET Core 2.1: dotnet/corefx#26926 |
https://github.com/dotnet/corefx/issues/32611 - another culprit.... |
I've given this a try on the latest .NET Core 2.2 preview as well - has the same issue as .NET Core 2.1. |
So there were some configuration bugs in @ianhorton's sample that caused the solution to reproduce failures continuously for .NET Core 2.1 - namely the K8s YAML files had the wrong port numbers and However, I am seeing issues similar to what @annymsMthd described where nodes become randomly unreachable. Also seeing the DNS issues I tagged earlier in this thread come up as well, but the cluster was able to recover from those upon reconnect - seems like the new DNS code introduced in .NET Core 2.1 is flaky on Linux, as is indicated by the other issues. I'm going to look into why some of the nodes were being marked as unreachable in .NET Core 2.1, that looks like an issue still, and I'm able to reproduce that on Akka.NET v1.3.8. TL;DR; If you are unable to get a cluster to form at all on .NET Core 2.1, it's likely a configuration issue on your end.Please thoroughly check to make sure you're using the right port numbers, ActorSystem names, and everything else before opening an issue. |
Just to clarify this - means that the cluster's automatic reconnection code handled this on its own. No human intervention required. |
I'll follow up with my .net core 2.1 sample tomorrow and verify if there is a config issue. |
I've re-examined my config, found 2 issues. Having fixed them, the cluster was formed.
|
One question though... Is this with dotnetty updated to 0.6? Or are newer versions automatically using 0.6? |
Thanks @Aaronontheweb, apologies for my poorly configured example. |
@AndreSteenbergen ah, my reproduction fix was with v0.4.8. We're still going to upgrade to v0.6.0. |
@ianhorton I really appreciated your sample - gave me an opportunity to really explore the issue in-depth. Glad it was a configuration issue and not something much more seriously wrong with the .NET networking stack. |
Ok, so a bit more work with @ianhorton's example over the weekend and I have a theory as to what is going on here: Seed node:
Failed node:
These DNS failures introduced on POSIX systems in .NET Core 2.1 are easy to reproduce, hence why those issues I linked to earlier are open on the CoreFX repo. However, what I suspect is happening is that at least one of the nodes in @ianhorton's reproduction sample fails to resolve DNS on host binding, and I think our I'm going to be traveling most of today but this is the thing I want to test ASAP. |
Well, scratch that theory - @ianhorton's sample uses https://github.com/petabridge/akkadotnet-bootstrap and thus doesn't actually need to perform a DNS resolution on startup. So in that case, I'm not sure why things aren't reconnecting - although there definitely are some DNS issues occurring. |
@Aaronontheweb have you upgraded dotnetty? I am on a linux vps, when running net core 2.0 and nuget akka 1.3.10, I get a few unexplainable errors every now and then, ranging from Akka PDU issues to not deserializing contracts. The problems have disappeared since I upgraded to dotnetty 0.6. I don't know if this will help in any way, just thought it might be worth a mention ... |
@AndreSteenbergen it looks like the DotNetty upgrade to v0.6.0 has resolved most of the issues. DNS problems still show up every now and then but otherwise we're good. I'll get my PR resolved ASAP so we can roll this into a v1.3.11 update. |
@mmisztal1980 can you please post fixed config file and what was wrong. I am having same provblem. Thank you |
@senzacionale upgrading to DotNetty v0.6.0 should solve the problem. I'm working on getting this out in Akka.NET v1.3.11 but thanks to some really annoying byte-rot in the .NET Core ecosystem, I'm having to do a bit of yak-shaving to get our test suite to play nicely with it. Watch this repository for updates. |
@senzacionale my issues were with the seed nodes & public hostname(s). unfortunately I'm not using hocon to configure my app so I can't help you here. |
@mmisztal1980 what are you using. Maybe I also need to change :) |
@Aaronontheweb I can confirm an upgrade to dotnetty 0.6 and akka.net 1.3.10, works well on linux (Ubuntu 16.04 LTS) in combination with net core 2.1. This combo with netcore 2.0 gave me some Akka PDU issues and protobuf issues. I have been running about 24 hours now, without any issues regarding missed frames, connectivity errors or whatsoever. |
Resolved via #3633 |
I have a test application with 1 seed node and two service nodes.
After migrating to .NET Core 2.1, I have some "Can't join seed-node" issues using Akka.net 1.3.6 / 1.3.8. The software is build and executed in docker with official Microsoft .NET Core 2.1 Images (microsoft/dotnet:2.1-sdk and microsoft/dotnet:2.1-runtime). Sometimes the nodes get connected without problems, sometimes I have the "can't join seed-nodes"-problems without any changes.
The problem appears on my local laptom and on our docker-server (and some other test-Systems).
The same application caused no problems running it on windows through Visual Studio.
Migration back to ".NET Core 2.0" solved the problem for now. (Project-Settings and other microsoft/dotnet:2.0-docker-Images).
The text was updated successfully, but these errors were encountered: