Akka.Cluster.Sharding: duplicate shards / entities #6973
Possible causes / issues I want to investigate while we gather data from users:
All of these scenarios are things that would be caused by DData's eventual consistency, in one form or another. Going to start poking around and see if I can spot anything that might lead to problems here. However, with the one piece of customer data I have in front of me, it's possible that this could be caused when
When using akka.net/src/contrib/cluster/Akka.Cluster.Sharding/DDataShardCoordinator.cs, lines 116 to 124 at commit f2867df:
Can rule that out as a source of problems.
Might be a false alarm - the first user had multiple clusters all writing to the same persistence store for v1.5 shard coordinator data.
Here's our Akka setup config, as requested on Discord.
This is the ActorSystem:

```csharp
akkaConfig.WithShardRegion<SchedulingManagerActor>(
    nameof(SchedulingManagerActor),
    _ => Props.Create(() => new SchedulingManagerActor(
        removeScheduleActor, scheduleMessageActor, projectionActor,
        builder.GetRequiredService<IScheduleMessageCommandFactory>())),
    new MessageExtractor(),
    new ShardOptions { StateStoreMode = StateStoreMode.DData, Role = "subscriber" });
```

Ok, so no
So far, from the two users who have reported this issue to me (the third user had self-inflicted problems), it looks like this issue occurs when state-store-mode = ddata and remember-entities is off. Modeling the state machine now so I can get a better idea of where this can possibly occur.
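For context, the combination described above maps to the following HOCON keys (names from the standard Akka.Cluster.Sharding configuration; shown only to pin down the terms, not as a recommended setting):

```hocon
akka.cluster.sharding {
  state-store-mode = ddata   # coordinator state kept in Distributed Data
  remember-entities = off    # entities are not re-created after a rebalance/restart
}
```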
Spent a few hours going through this, looked at a few areas where a duplicate shard might be possible, but was able to rule them out. In order to solve this, I think I'm going to need a dump with a large number of
I am going to change the log level to DEBUG and check if it causes any performance issues. I hope to capture helpful logs.
We have added more logging but no error yet. |
I don't have any examples from our prod environment, but this is from our test environment. Same setup.
I'm back from vacation and I'll be picking up work on this again.
Welcome back :)
Yes! I have some data that indicates this is a problem caused by a shard rebalancing / hand-off timing out. Going to write a reproduction for that as soon as I can. Doing some onsite work with a customer this week, but this is high on my to-do list.
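If a timed-out hand-off is the trigger, the relevant knob is the sharding hand-off timeout. A sketch of the HOCON involved (key name from the standard Akka.Cluster.Sharding configuration; 60s is its usual default, but verify against your version):

```hocon
akka.cluster.sharding {
  # How long a rebalance waits for a shard to hand off before giving up.
  # If this times out, the coordinator's view of shard locations and the
  # actual shard locations can diverge.
  handoff-timeout = 60s
}
```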
@JoeWorkyWork Can you tell us what condition the cluster was in when these problems occurred? Were you updating the cluster, were there different versions of Akka running in the cluster at the time, did any of the cluster nodes leave/rejoin at the time, what version(s) of Akka were running in the cluster, etc.?
Sorry, I missed your reply @Arkatufus. I'll get back with some more info next week, hopefully.
@Arkatufus Sadly we didn't save any raw logs from that incident, so we can't give you much on the cluster behavior. Complete list of Akka versions at the time of the incident:
Rough timeline:
Later we upgraded to
Did not mean to close this issue - still under investigation. |
I think I've found the smoking gun here, from some of the logs provided on #7285.
Host 1:
Host 2, which starts up and joins the cluster later, immediately allocates a duplicate shard:
The logs are incomplete and don't indicate a hand-off or anything, but they make me think that there's a problem with how the
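The race described above can be sketched with a toy model: two coordinator replicas whose shared state converges only eventually, so a late-joining host decides against a stale view. This is plain Python with invented names, not Akka.NET code.

```python
# Toy model of two eventually-consistent shard-coordinator replicas
# (invented names; an illustration of the race, NOT Akka.NET internals).

class Replica:
    def __init__(self):
        self.allocations = {}  # shard id -> owning region
        self.pending = []      # local writes not yet replicated to peers

    def allocate(self, shard, region):
        # Each coordinator decides using only its local view of the state.
        if shard not in self.allocations:
            self.allocations[shard] = region
            self.pending.append((shard, region))
            return True
        return False

def replicate(src, dst):
    # Deliver src's buffered writes to dst; first writer wins in this toy.
    for shard, region in src.pending:
        dst.allocations.setdefault(shard, region)
    src.pending.clear()

host1, host2 = Replica(), Replica()
host1.allocate("shard-9", "region-A")  # host 1 allocates first
# Host 2 joins before replication catches up; its stale view shows no owner:
host2.allocate("shard-9", "region-B")  # duplicate allocation
replicate(host1, host2)
# Each host made a locally valid decision, yet shard-9 now has two owners
# until the conflict is detected and resolved.
```

Each decision is locally consistent, which is why nothing in either host's logs looks wrong in isolation.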
Still not closed yet - GitHub pulled the trigger a tad early.
we would have seen A LOT of smoke and fire if this didn't work correctly, but since we're in the midst of testing for all sorts of member transition-related issues for #6973 we thought it would be best to add a sanity check.
Eliminates the source of akkadotnet#6793, which was caused by using an incorrect ordering methodology when determining which `ClusterSingletonManager` to hand over to during member state transitions. close akkadotnet#6973 close akkadotnet#7196
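The hand-over fix above hinges on every node deterministically agreeing on the "oldest" cluster member. A toy sketch of that ordering (plain Python with invented field names, not Akka.NET's actual comparer):

```python
# Sketch of "oldest member" ordering, as used to decide which node should
# host a cluster singleton. Field names are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class Member:
    address: str
    up_number: int  # join order: lower = joined earlier ("older")

def oldest(members):
    # Compare by join order, tie-break by address, so every node
    # deterministically computes the same singleton host.
    return min(members, key=lambda m: (m.up_number, m.address))

nodes = [Member("node-a", 3), Member("node-m", 2), Member("node-z", 1)]
print(oldest(nodes).address)  # node-z joined first, so it hosts the singleton
# An incorrect ordering (e.g. sorting by address alone) would pick node-a
# here, handing the singleton over to the wrong node during transitions.
```

The point of the fix is that an ordering which can disagree with join order makes different nodes pick different hand-over targets.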
Gonna give my test lab reproductions an entire night to try to reproduce this, but the long and the short of it is:
Will share a post-mortem on here or perhaps on the YouTube channel, but this fix is going into v1.5.27.
Version Information
Version of Akka.NET? v1.5 - all versions including v1.5.13
Which Akka.NET Modules? Akka.Cluster.Sharding
Describe the bug
We have very rough and loose data on this right now, but it's been reported by multiple users, including on petabridge/Akka.Persistence.Azure#350 - it looks like there could be something wrong with Akka.Cluster.Sharding in v1.5 that allows a Shard to be allocated more than once. This is our thread to investigate. If you have run into this issue, please provide a full dump of your config here.