Cluster sharding delivers message to the wrong entity #1463

jchapuis · 2024-09-06T18:43:48Z

Sorry to bring some bad news. I have been investigating failing tests in endless4s/endless-transaction#48, a PR that upgrades Pekko from 1.0.3 to 1.1.0, and I think I found a serious issue.

The failing test suite is stress-testing event-sourced entities using the persistence test toolkit, and I have identified that a command sometimes gets delivered to the wrong entity.

I have bisected the problem to this optimization that was introduced after the 1.1.0-M1 release. That new code makes use of a var cache and it doesn't seem thread-safe. Could it be that we introduced races?


pjfanning · 2024-09-06T19:30:00Z

@jchapuis @Roiocam I looked at the PR that seems to be the issue and I think that if we try to change the var cache to some sort of AtomicObject or lazy val, the benefits of the change will vanish. I suggest that we revert it.

@jchapuis would you have any idea how we could minimise a reproducible test that could be used for regression purposes?

jchapuis · 2024-09-06T19:41:30Z

@pjfanning I haven't yet had the time to look into this. I decided to report the problem as soon as I had a certain degree of confidence that there was a change of behavior; with the release still being young, I got worried. I would suggest some tests that send commands to multiple sharded entities in quick succession and verify that each command gets delivered to the proper destination.

pjfanning · 2024-09-07T09:23:05Z

@raboof it would be nice to be able to start on the 1.0.1 RC in the next few days.

What do you think of this course of action?

* revert chore: avoid the double evaluation of entityId in ClusterSharding #1304
* hopefully @jchapuis will have time to test endless4s with a Pekko snapshot with this change
* at some point early next week, we get an RC together
* at some point, we try to fill in the test gap in Pekko so that we don't get a future bug in this area - including tests that try to force multithreaded evaluation of the entityId
* we avoid further changes to cluster sharding until we are happy that the test coverage has improved
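
As a rough illustration of the "force multithreaded evaluation of the entityId" idea, the sketch below hammers one shared extractor from several threads and counts extractions that come back with an id other than the caller's own. It is an assumption-laden stand-in, not a Pekko test: `Envelope`, `SharedExtractorStress` and `countMismatches` are hypothetical names, the extractor is a plain `PartialFunction` rather than a real `ShardRegion.ExtractEntityId` wired into a cluster, and a real regression test would go through cluster sharding itself.

```scala
import java.util.concurrent.{ CountDownLatch, Executors, TimeUnit }
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical harness: names and shapes are illustrative, not Pekko API.
object SharedExtractorStress {
  final case class Envelope(entityId: String, payload: Int)

  // Stateless stand-in for an entity-id extractor.
  val extractEntityId: PartialFunction[Any, (String, Any)] = {
    case Envelope(id, payload) => (id, payload)
  }

  /** Counts extractions whose id differs from the id the calling thread sent. */
  def countMismatches(
      extractor: PartialFunction[Any, (String, Any)],
      threads: Int,
      perThread: Int): Int = {
    val pool = Executors.newFixedThreadPool(threads)
    val start = new CountDownLatch(1)
    val mismatches = new AtomicInteger(0)

    (0 until threads).foreach { t =>
      pool.execute(new Runnable {
        override def run(): Unit = {
          start.await() // release all workers at once to maximise contention
          val expectedId = s"entity-$t"
          var i = 0
          while (i < perThread) {
            val msg = Envelope(expectedId, i)
            // Same check-then-apply call pattern that a cached extractor would see.
            if (extractor.isDefinedAt(msg)) {
              val (id, _) = extractor(msg)
              if (id != expectedId) mismatches.incrementAndGet()
            } else mismatches.incrementAndGet()
            i += 1
          }
        }
      })
    }

    start.countDown()
    pool.shutdown()
    require(pool.awaitTermination(60, TimeUnit.SECONDS), "workers did not finish in time")
    mismatches.get()
  }
}
```

With a stateless extractor such as the one above, `SharedExtractorStress.countMismatches(SharedExtractorStress.extractEntityId, 8, 10000)` returns 0. An extractor that keeps unsynchronized state between `isDefinedAt` and `apply` may report a non-zero count, although a race of this kind is not guaranteed to show up on every run.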

jchapuis · 2024-09-07T11:59:13Z

@pjfanning sure, happy to run the endless4s tests as soon as the revert is merged

Roiocam · 2024-09-07T20:31:54Z

> @jchapuis @Roiocam I looked at the PR that seems to be the issue and I think that if we try to change the var cache to some sort of AtomicObject or lazy val, the benefits of the change will vanish. I suggest that we revert it.

After investigating, I think this happens because the extractEntityId instance is shared by both the ShardRegion actor and multiple Shard actors.
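
To make the sharing problem concrete, here is a minimal sketch of how a partial function that caches its last extraction in a `var` can hand one caller another caller's result. It is hypothetical and is not the code from the reverted PR: `Envelope`, `CachingExtractor` and `replay` are invented for illustration, and the unlucky interleaving is replayed on a single thread so that the outcome is deterministic.

```scala
// Hypothetical sketch of the hazard; NOT the code from the reverted PR.
object SharedCacheHazard {
  final case class Envelope(entityId: String, payload: String)

  // Remembers what isDefinedAt computed so that apply can skip the work.
  final class CachingExtractor extends PartialFunction[Any, (String, Any)] {
    private var cache: Option[(String, Any)] = None // unsynchronized, shared by every caller

    override def isDefinedAt(msg: Any): Boolean = msg match {
      case Envelope(id, payload) =>
        cache = Some((id, payload))
        true
      case _ =>
        cache = None
        false
    }

    // Trusts the cache instead of looking at msg again: this is the flaw.
    override def apply(msg: Any): (String, Any) =
      cache.getOrElse(throw new MatchError(msg))
  }

  /** Replays one interleaving of two callers and returns the id chosen for caller 1's message. */
  def replay(): String = {
    val shared = new CachingExtractor
    val forA = Envelope("entity-A", "command for A")
    val forB = Envelope("entity-B", "command for B")

    shared.isDefinedAt(forA) // caller 1 checks its message
    shared.isDefinedAt(forB) // caller 2 checks its message and overwrites the cache
    val (id, _) = shared.apply(forA) // caller 1 resumes and reads caller 2's entry
    id
  }

  def main(args: Array[String]): Unit =
    println(s"message for entity-A was routed to ${replay()}")
}
```

Running `main` prints `message for entity-A was routed to entity-B`. With real threads the same interleaving only happens occasionally, which would be consistent with the intermittent misdelivery reported at the top of this issue.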

pjfanning · 2024-09-07T20:46:27Z

@Roiocam is it ok to revert for a quick 1.1.1 release (#1464) and maybe trying a new optimisation change later?

Roiocam · 2024-09-07T20:51:31Z

> @Roiocam is it ok to revert for a quick 1.0.1 release (#1464) and maybe trying a new optimisation change later?

Of course. Should it be 1.1.1?

In hindsight, maintaining state within a function is a bad idea.
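
For comparison, one stateless way to avoid matching a pattern twice is to let the result travel as a return value instead of parking it in a field. The sketch below relies on the standard library behaviour that `PartialFunction.lift` goes through `applyOrElse`, which for a partial function literal matches the pattern once. `Envelope`, `CountedEnvelope` and the helper names are hypothetical, and whether this shape fits Pekko's internal call sites is not something this thread establishes.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch; names are illustrative and not Pekko API.
object SingleEvaluation {
  final case class Envelope(entityId: String, payload: String)

  // Counts how often the (pretend-expensive) extraction logic runs.
  val evaluations = new AtomicInteger(0)

  // Custom extractor object so that each pattern match is observable.
  object CountedEnvelope {
    def unapply(msg: Any): Option[(String, Any)] = {
      evaluations.incrementAndGet()
      msg match {
        case Envelope(id, payload) => Some((id, payload))
        case _                     => None
      }
    }
  }

  val extractEntityId: PartialFunction[Any, (String, Any)] = {
    case CountedEnvelope(id, payload) => (id, payload)
  }

  // Check-then-apply: the pattern is matched once per call, so twice in total.
  def checkThenApply(msg: Any): Option[(String, Any)] =
    if (extractEntityId.isDefinedAt(msg)) Some(extractEntityId(msg)) else None

  // lift delegates to applyOrElse: one match, and no state is kept in between.
  def liftOnce(msg: Any): Option[(String, Any)] =
    extractEntityId.lift(msg)
}
```

Both helpers return the same `Option`, but `checkThenApply` bumps `evaluations` by two per message while `liftOnce` bumps it by one. Because nothing is stored between calls, the function can be shared freely across actors.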

pjfanning · 2024-09-07T20:56:00Z

> > @Roiocam is it ok to revert for a quick 1.0.1 release (#1464) and maybe trying a new optimisation change later?
>
> Of course. Should it be 1.1.1?
>
> In hindsight, maintaining state within a function is a bad idea.

You're right - 1.1.1 is the next release. Thanks for approving #1464. I will merge it.

Roiocam · 2024-09-07T21:02:21Z

Thanks @jchapuis and @pjfanning

jchapuis · 2024-09-07T22:06:46Z

@pjfanning @Roiocam I can confirm my tests are now passing with the revert

pjfanning · 2024-09-09T21:01:38Z

#1467 merged

Related with apache#1463

* add unit test protect ExtractEntityId can be shared safely Related with #1463
* chore: avoid the double evaluation of entityId in ClusterSharding (#1304)
* chore: avoid the double evaluation of entityId in ClusterSharding
* new cacheable partial function
* optimized for review
* fix the right type
* Revert "chore: avoid the double evaluation of entityId in ClusterSharding (#1…" (#1464) This reverts commit b0e9886.
* grammar fix
* sort imports

Co-authored-by: PJ Fanning

pjfanning added this to the 1.1.1 milestone Sep 6, 2024

pjfanning added the bug Something isn't working label Sep 6, 2024

pjfanning mentioned this issue Sep 7, 2024

Revert "chore: avoid the double evaluation of entityId in ClusterSharding" #1464

Merged

pjfanning mentioned this issue Sep 8, 2024

add some warnings to release notes #1467

Merged

pjfanning closed this as completed Sep 9, 2024

Roiocam added a commit to Roiocam/pekko that referenced this issue Sep 11, 2024

add unit test protect ExtractEntityId can be shared safely

335dc31

Related with apache#1463

Roiocam mentioned this issue Sep 11, 2024

add unit test protect ExtractEntityId can be shared safely #1475

Merged

Roiocam added a commit to Roiocam/pekko that referenced this issue Sep 11, 2024

add unit test protect ExtractEntityId can be shared safely

0bf32b9

Related with apache#1463

Roiocam added a commit to Roiocam/pekko that referenced this issue Sep 11, 2024

add unit test protect ExtractEntityId can be shared safely

cf43276

Related with apache#1463
