Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cluster sharding delivers message to the wrong entity #1463

Closed
jchapuis opened this issue Sep 6, 2024 · 11 comments
Closed

Cluster sharding delivers message to the wrong entity #1463

jchapuis opened this issue Sep 6, 2024 · 11 comments
Labels
bug Something isn't working
Milestone

Comments

@jchapuis
Copy link
Contributor

jchapuis commented Sep 6, 2024

Sorry to bring some bad news, I have been investigating failing tests in endless4s/endless-transaction#48, a PR that upgrades Pekko from 1.0.3 to 1.1.0 and I think I found a serious issue.

The failing test suite is stress-testing event-sourced entities using the persistence test toolkit, and I have identified that a command sometimes gets delivered to the wrong entity.

I have bisected the problem to this optimization that was introduced after the 1.1.0-M1 release. That new code makes use of a var cache and it doesn't seem thread-safe. Could it be that we introduced races?

@pjfanning pjfanning added this to the 1.1.1 milestone Sep 6, 2024
@pjfanning pjfanning added the bug Something isn't working label Sep 6, 2024
@pjfanning
Copy link
Contributor

@jchapuis @Roiocam I looked at the PR that seems to be the issue and I think if we try to change the var cache to some sort of AtomicObject or lazy val that the benefits of the change will vanish. I suggest that we revert it.

@jchapuis would you have any idea how we could minimise a reproducible test that could be used for regression purposes?

@jchapuis
Copy link
Contributor Author

jchapuis commented Sep 6, 2024

@pjfanning I haven't yet had the time to look into this. I decided to report the problem as soon as I had a certain degree of confidence that there was a change of behavior, due to the release still being young, I got worried. I would say some tests sending commands to multiple sharded entities in quick succession, verifying that each command gets delivered to the proper destination.

@pjfanning
Copy link
Contributor

@raboof it would be nice to be able to start on the 1.0.1 RC in the next few days.

What do you think of this course of action?

  1. revert chore: avoid the double evaluation of entityId in ClusterSharding #1304
  2. hopefully @jchapuis will have time to test endless4s with a Pekko snapshot with this change
  3. at some point early next week, we get an RC together
  4. at some point, we try to fill in the test gap in Pekko so that we don't get a future bug in this area - including tests that try to force multithreaded evaluation of the entityId
  5. we avoid further changes to cluster sharding until we are happy that the test coverage has improved

@jchapuis
Copy link
Contributor Author

jchapuis commented Sep 7, 2024

@pjfanning sure happy to run endless4s tests as soon as the revert is merged

@Roiocam
Copy link
Member

Roiocam commented Sep 7, 2024

@jchapuis @Roiocam I looked at the PR that seems to be the issue and I think if we try to change the var cache to some sort of AtomicObject or lazy val that the benefits of the change will vanish. I suggest that we revert it.

after investigating, i think because the extractEntityId instance was shared by both ShardRegion Actor and multiple Shard Actors.

截屏2024-09-08 04 30 40 截屏2024-09-08 04 31 17

@pjfanning
Copy link
Contributor

pjfanning commented Sep 7, 2024

@Roiocam is it ok to revert for a quick 1.1.1 release (#1464) and maybe trying a new optimisation change later?

@Roiocam
Copy link
Member

Roiocam commented Sep 7, 2024

@Roiocam is it ok to revert for a quick 1.0.1 release (#1464) and maybe trying a new optimisation change later?

Of course, should be 1.1.1?

A hindsight remark that maintaining state within a function is bad idea.

@pjfanning
Copy link
Contributor

pjfanning commented Sep 7, 2024

@Roiocam is it ok to revert for a quick 1.0.1 release (#1464) and maybe trying a new optimisation change later?

Of course, should be 1.1.1?

A hindsight remark that maintaining state within a function is bad idea.

You're right - 1.1.1 is the next release. Thanks for approving #1464. I will merge it.

@Roiocam
Copy link
Member

Roiocam commented Sep 7, 2024

Thanks @jchapuis and @pjfanning

@jchapuis
Copy link
Contributor Author

jchapuis commented Sep 7, 2024

@pjfanning @Roiocam I can confirm my tests are now passing with the revert

@pjfanning
Copy link
Contributor

#1467 merged

Roiocam added a commit to Roiocam/pekko that referenced this issue Sep 11, 2024
Roiocam added a commit to Roiocam/pekko that referenced this issue Sep 11, 2024
Roiocam added a commit to Roiocam/pekko that referenced this issue Sep 11, 2024
pjfanning added a commit that referenced this issue Sep 12, 2024
* add unit test protect ExtractEntityId can be shared safely

Related with #1463

* chore: avoid the double evaluation of entityId in ClusterSharding (#1304)

* chore: avoid the double evaluation of entityId in ClusterSharding

* new cacheable partial function

* optimized for review

* fix the right type

* Revert "chore: avoid the double evaluation of entityId in ClusterSharding (#1…" (#1464)

This reverts commit b0e9886.

* grammar fix

* sort imports

---------

Co-authored-by: PJ Fanning <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants