RequestReplyHelper#call fails with IllegalMonitorStateException #118
At first glance it looks to me like an actor's mailbox is being shared with some other fiber/actor trying to receive messages from it (actors' mailboxes are usually single-consumer, as they are meant to be private to the actor). Is something like this somehow happening in your code? |
Actually, I don't share any actor's mailbox anywhere. All communication between actors is implemented through primitive message sending or the RequestReplyHelper.call pattern. But what I found looks strange to me. The method ActorRef#sendSync is calling Fiber.exec, according to the following stacktrace.
I guess that at the moment sendSync is trying to unpark some receiver actor, that exact actor is blocked in its mailbox's receive method. Am I correct? The first one (a RequestReplyCall nested in another RequestReplyCall):
And the second:
|
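For context on the pattern being discussed: the request-reply exchange can be sketched without Quasar. The following plain-JDK sketch (the `Request` class and the doubling server are made up for illustration, not Quasar's actual implementation) shows what RequestReplyHelper#call does conceptually: the request carries a private reply channel, standing in for the temporary "from" actor that the helper sets up, and the caller blocks on that channel for the answer.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RequestReplyDemo {
    // A request carries its own single-slot reply queue, playing the role
    // of the temporary reply channel RequestReplyHelper sets up per call.
    static final class Request {
        final int payload;
        final BlockingQueue<Integer> reply = new ArrayBlockingQueue<>(1);
        Request(int payload) { this.payload = payload; }
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<Request> mailbox = new LinkedBlockingQueue<>();

        // The "server actor": the single consumer of its own mailbox.
        Thread server = new Thread(() -> {
            try {
                while (true) {
                    Request req = mailbox.take();      // receive()
                    req.reply.put(req.payload * 2);    // reply(...)
                }
            } catch (InterruptedException e) { /* shut down */ }
        });
        server.setDaemon(true);
        server.start();

        // The caller side of a synchronous call:
        Request req = new Request(21);
        mailbox.put(req);
        int answer = req.reply.take();                 // block for the reply
        System.out.println(answer);                    // prints 42
        server.interrupt();
    }
}
```

The key invariant this illustrates is the one the maintainer raised above: the server's mailbox has exactly one consumer, while replies flow over per-request channels.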
Hi.
Especially when you say sequentially? |
Hi, |
Yes, you are correct. But how are you reproducing it and how do you know there's no unregister? |
I've simply added 2 breakpoints on the register and unregister methods that print stack traces.
Unfortunately, I don't understand how to reproduce it; it happens quite rarely. I need some time to write a minimal showcase for reproducing it. |
Do you provide any configurations to the actor's mailbox (size, policy)? |
No, I don't. Every actor is constructed by co.paralleluniverse.actors.behaviors.BehaviorActor#BehaviorActor(java.lang.String) |
I've put Anyway, can you try with |
Yeah, for sure I can. Thanks. |
Good. In that case just note that all bug-fixes go into the |
Unfortunately, it doesn't help. Everything remains the same, so we can conclude there is no exception thrown between the calls to register and unregister. Don't you find it strange that the exception message
tells us that exactly the same strand is trying to call OwnedSynchronizer.register once again? |
Yes. Unfortunately my debugging capabilities are diminished due to not being able to reproduce the problem, so any further information you can provide -- like a flight recorder dump (if you can reproduce with it on) -- would be very helpful. |
This is so frustrating. With your commit, the test always passes when I run Gradle in debug mode, and without debug mode it sometimes fails, like before... I've just run it 100 times in debug mode and it passed every time. Maybe this can be explained by the fact that debug mode disables some JIT-specific optimizations, so the chance of hitting the race condition is close to zero due to the performance drop. OK, my next step will be to provide a flight recorder dump here OR close this bug. I just need some time. Thanks |
No need to close this bug. It's a real race condition, and eventually we'll catch it. It isn't the first one nor the last. :) |
It seems I found out why unregister doesn't get called. I have an uninstrumented method in the call stack:
And here is what we get in that case: when the fiber parks, it throws SuspendExecution, which is then caught somewhere and not rethrown as a RuntimeSuspendExecution. So the parked fiber can't be unparked correctly, and it loses its stack, including the lines
That means the method co.paralleluniverse.actors.Mailbox#unlock is never called. If my theory is correct, then I need to instrument this method
But I've tried to add
to my META-INF/suspendables file. Unfortunately, neither of those helps. |
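For readers unfamiliar with the mechanism: Quasar looks for a META-INF/suspendables resource listing fully qualified method names, one per line. A hypothetical entry (the class and method names below are placeholders, not taken from this issue) would look like:

```
com.example.transport.BlockingSender.sendAndWait
```

Alternatively, a method under your own control can be declared suspendable by annotating it with co.paralleluniverse.fibers.Suspendable (or by declaring `throws SuspendExecution`), so the instrumentation agent picks it up without a suspendables file.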
Yeah, it seems I was right. If I spawn all my actors with spawnThread (meaning Fiber.park is never called), then this problem goes away and never appears. |
There should never be a reason to instrument |
I get your point! I fixed that problem, instrumentation now works fine (I got rid of CompletableFuture.AsyncRun in my stack trace), and this error never happens anymore. So, summing up my experience, I have to say that Quasar users should pay more attention to uninstrumented-method warnings; otherwise they can get unpredictable application behaviour if a SuspendExecution exception is silently swallowed somewhere. In any case, I would like to ask one more question, if you don't mind. I don't know whether it's related to this issue or not. When a fiber is unparked, is there any guarantee that the worker thread will be the same one as before parking? It seems to me that a new worker thread would be obtained from the fiber pool without any such guarantee... To deal with that problem, Quasar provides some concurrent abstractions like co.paralleluniverse.strands.concurrent.ReentrantLock. But what should I do with third-party library classes which use the JDK's native java.util.concurrent.locks.ReentrantLock? At this point I rarely get this exception:
Thanks |
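The thread-affinity question above can be demonstrated without Quasar: java.util.concurrent.locks.ReentrantLock requires that unlock() run on the thread that acquired the lock, which is exactly what breaks when a strand resumes on a different worker thread after parking. A minimal plain-JDK sketch:

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockMigrationDemo {
    public static void main(String[] args) throws InterruptedException {
        ReentrantLock lock = new ReentrantLock();
        lock.lock(); // "before park": the main thread owns the lock

        // "after unpark": a different thread tries to release it,
        // modeling a fiber resuming on another worker thread.
        Thread other = new Thread(() -> {
            try {
                lock.unlock();
                System.out.println("unlocked");
            } catch (IllegalMonitorStateException e) {
                System.out.println("IllegalMonitorStateException");
            }
        });
        other.start();
        other.join();

        lock.unlock(); // the owning thread can still release it
    }
}
```

This is why Quasar ships strand-aware variants such as co.paralleluniverse.strands.concurrent.ReentrantLock for code that runs in fibers.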
That entirely depends on the
Why is that a problem? Thread-locals travel with the fiber.
Ah, that's code that explicitly accesses the identity of the underlying thread. If you must use this in a fiber (and why does the fiber block between lock acquisition and release?), you can use a scheduler that pins fibers to threads. It shouldn't be hard to do. The simplest one is a scheduler that schedules all fibers onto a single thread, which is created with: new FiberExecutorScheduler(Executors.newSingleThreadExecutor()) Of course, you are free to use different schedulers for different fibers and mix them freely. However, in this case I would consider (again, without knowing your codebase, so it may or may not be a good suggestion) using plain threads to call the IO library, and having them send messages to fibers over channels/actor mailboxes. |
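To illustrate the pinning property this suggestion relies on, here is a plain-JDK sketch (Quasar itself omitted) showing that Executors.newSingleThreadExecutor() runs every task on one fixed thread, which is what lets a scheduler built over it keep a fiber's underlying thread identity stable across park/unpark:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PinnedSchedulerDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for the executor that would back FiberExecutorScheduler:
        // a single-thread executor reuses one fixed thread for all tasks.
        ExecutorService exec = Executors.newSingleThreadExecutor();

        // Two separate submissions model a fiber running before and after a park.
        Future<String> first  = exec.submit(() -> Thread.currentThread().getName());
        Future<String> second = exec.submit(() -> Thread.currentThread().getName());

        System.out.println("same thread: " + first.get().equals(second.get()));
        exec.shutdown();
    }
}
```

With such pinning, thread-affine constructs like java.util.concurrent.locks.ReentrantLock keep working, at the cost of losing the ability to spread fibers over a pool.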
Thanks for the fast reply, that makes sense. I think your last suggestion is the most suitable for us. As a temporary solution, I wrapped the call to org.apache.avro.ipc.Requestor.getRemote in co.paralleluniverse.fibers.FiberAsync#runBlocking(java.util.concurrent.ExecutorService, co.paralleluniverse.common.util.CheckedCallable<V,E>) with a single-thread pool as the first argument, and now java.util.concurrent.locks.ReentrantLock works correctly in the worker thread. Thanks again for your really useful responses. I think this issue can be closed, because the original problem was happening due to incomplete instrumentation of the application code. |
Glad to be of service. |
Hey.
First, I apologize for not being able to share all the code reproducing the problem, which happens quite rarely (it seems like there is a race condition somewhere).
But I can describe the environment in a few words: we have a complex actor-interaction algorithm in our actor system, and some of the actors communicate via RequestReplyHelper#call. Sometimes this exception is raised. The problems are
So it's hard for me to reason about the source of this exception. Could you help me find out whether we are doing something wrong or whether this is a bug in quasar-actors? It would be great if you've already encountered this problem and know what mistakes we are making. Thanks.