[BUG] Some code changes for reproducing queue hangs up #124

Closed
wants to merge 2 commits into from

@bozaro bozaro commented Oct 6, 2015

I have a problem: sometimes the server hangs in the method SingleConsumerLinkedArrayQueue.prePeek(), spinning in an infinite loop on this line:

                if (tail == n)
                    return false;

                while (n.next == null); // wait for next <== HERE
                Node next = n.next; // can't be null because we're called by the consumer
                clearNext(n);
                clearPrev(next);

I tried to isolate this problem and found the following way to reproduce it:

  • Open the project in IDEA;
  • Add a breakpoint at SingleConsumerLinkedArrayQueue.java:70
                if (v == 0) {
                    // PUT BREAKPOINT HERE
                    v = 1;
                }
  • Run the Gradle task quasar:quasar-core jdk8Test from IDEA (it runs only SingleConsumerLinkedArrayObjectQueueTest.uglyTest());
  • At the breakpoint, simply press Resume (F9) to let the program continue.

After that, the "consumer" thread hangs in an infinite loop at SingleConsumerLinkedArrayQueue.java:76

I tried to get rid of breakpoints in this test, but could not :(
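One way to detect a hang like this without a debugger is to run the consumer in its own thread and wait for it with a timeout. The following is only a sketch of that idea: the class `HangWatchdogSketch` and the method `finishesWithin` are hypothetical names and are not part of Quasar or of the test in this pull request.

```java
// Hypothetical helper, not part of Quasar: reports whether a task
// finishes within a time limit instead of waiting on it forever.
final class HangWatchdogSketch {
    static boolean finishesWithin(Runnable task, long timeoutMillis) {
        Thread worker = new Thread(task, "consumer");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        try {
            worker.join(timeoutMillis); // returns early if the worker ends
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive(); // still alive after the timeout: treat as a hang
    }

    public static void main(String[] args) {
        // A task that returns at once is reported as finished.
        System.out.println(finishesWithin(() -> { }, 1000));
    }
}
```

A test built this way can fail with a message when `finishesWithin` returns false, rather than blocking the whole build.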

I also do not understand how this queue works.

I tested this problem on Oracle JDK 1.8.0_45, Oracle JDK 1.8.0_60 and OpenJDK 1.8.0_45
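For readers who share the question of how such a queue works: a loop like `while (n.next == null);` is typical of linked queues in which a producer publishes the new tail before linking it to its predecessor. Below is a minimal sketch of that pattern, assuming a dummy-head design. The class `SpinLinkQueueSketch` and all of its members are hypothetical and are not Quasar's code.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch, NOT Quasar's implementation: many producers,
// one consumer, tail swapped before the predecessor is linked.
final class SpinLinkQueueSketch<E> {
    private static final class Node<E> {
        final E value;
        volatile Node<E> next; // written by a producer, read by the consumer
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<E>> tail;
    private Node<E> head; // dummy node; touched only by the consumer

    SpinLinkQueueSketch() {
        Node<E> dummy = new Node<>(null);
        head = dummy;
        tail = new AtomicReference<>(dummy);
    }

    // Producer side: may be called from many threads.
    void enq(E value) {
        Node<E> node = new Node<>(value);
        Node<E> prev = tail.getAndSet(node); // step 1: publish the new tail
        // Between the two steps the consumer sees tail != prev while prev.next is null.
        prev.next = node;                    // step 2: link the predecessor
    }

    // Consumer side: must be called from a single thread.
    E poll() {
        Node<E> h = head;
        if (tail.get() == h)
            return null; // empty
        Node<E> next;
        while ((next = h.next) == null)
            Thread.yield(); // wait for the producer to finish step 2
        head = next;
        return next.value;
    }

    // Runs several producers against one consumer and returns the sum consumed.
    static long runDemo(int producers, int perProducer) {
        SpinLinkQueueSketch<Integer> q = new SpinLinkQueueSketch<>();
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            threads[p] = new Thread(() -> {
                for (int i = 0; i < perProducer; i++) q.enq(i);
            });
            threads[p].start();
        }
        long sum = 0;
        for (int remaining = producers * perProducer; remaining > 0; ) {
            Integer v = q.poll();
            if (v == null) { Thread.yield(); continue; }
            sum += v;
            remaining--;
        }
        for (Thread t : threads) {
            try { t.join(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(runDemo(4, 10_000));
    }
}
```

In this sketch the spin in `poll()` is meant to be brief, lasting only until the producer completes step 2. It would spin forever only if the awaited link never arrived.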

@bozaro
Author

bozaro commented Oct 7, 2015

This algorithm allows me to reproduce the problem with a probability of about 100%.

I think only the author of SingleConsumerLinkedArrayQueue, @pron, can help with this problem.

@pron
Contributor

pron commented Oct 7, 2015

I will take a look, but in the meantime you can use SingleConsumerLinkedObjectQueue instead (i.e. Linked instead of LinkedArray)

@pron pron added the bug label Oct 7, 2015
@bozaro
Author

bozaro commented Oct 7, 2015

Thanks. I will try replacing the queue in Channels.newChannel and test our code.

@bozaro
Author

bozaro commented Oct 7, 2015

I replaced SingleConsumerLinkedArrayQueue with SingleConsumerLinkedObjectQueue, and now it looks like it hangs in SingleConsumerLinkedQueue.succ():

    Node<E> succ(final Node<E> node) {
        if (node == null)
            return pk();
        if (tail == node)
            return null; // an enq following this will test the lock again

        Node succ;
        while ((succ = node.next) == null); // wait for next <== HANGS HERE
        return succ;
    }

Unfortunately, I do not yet know how to reproduce this problem in a small test.

@pron
Contributor

pron commented Oct 7, 2015

Oh boy. I'll try to submit a fix some time next week.

@pron
Contributor

pron commented Oct 7, 2015

Can you post a full stack trace (or at least as many frames up the stack as you can)?
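A stack trace of a spinning thread can be taken with the JDK's `jstack` tool, or from inside the JVM with the standard `Thread.getAllStackTraces()` call. The following is a rough sketch of the second option. The class `StackDumpSketch` and its `dump` method are hypothetical names, and the output format only loosely imitates a thread dump.

```java
import java.util.Map;

// Hypothetical helper, not part of Quasar: formats the stacks of all
// live threads whose name contains the given text.
final class StackDumpSketch {
    static String dump(String nameFilter) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (!t.getName().contains(nameFilter))
                continue;
            sb.append('"').append(t.getName()).append("\" ").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue())
                sb.append("      at ").append(frame).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Dumps the calling thread; in a real run the filter would name the stuck worker.
        System.out.print(dump(Thread.currentThread().getName()));
    }
}
```

Taking two or three dumps a few seconds apart helps distinguish a thread that is stuck in one loop from one that is merely busy.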

@pron
Contributor

pron commented Oct 7, 2015

Actually, the two failures suggest a common cause, which might make this easy to pinpoint (and if so the fix will be much quicker). They suggest the bug may be in the channel rather than the queue. A stack trace would help greatly.

@bozaro
Author

bozaro commented Oct 7, 2015

Stacktrace with SingleConsumerLinkedQueue:

"ForkJoinPool-default-fiber-pool-worker-9@3954" daemon prio=5 tid=0x46 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.succ(SingleConsumerLinkedQueue.java:132)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue$LinkedQueueIterator.hasNext(SingleConsumerLinkedQueue.java:233)
      at co.paralleluniverse.actors.SelectiveReceiveHelper.receive(SelectiveReceiveHelper.java:105)
      at co.paralleluniverse.actors.behaviors.RequestReplyHelper.call(RequestReplyHelper.java:174)
      at co.paralleluniverse.actors.behaviors.RequestReplyHelper.call(RequestReplyHelper.java:112)
      at service.actor.RemoteTransceiver.lambda$doRequest$0(RemoteTransceiver.java:80)
      at service.actor.RemoteTransceiver$$Lambda$83.161522478.get(Unknown Source:-1)
      at util.Suspendables.getSuppressedInterruptable(Suspendables.java:58)
      at service.actor.SyncTransceiver.transceive(SyncTransceiver.java:33)
      at org.apache.avro.ipc.Requestor.request(Requestor.java:147)
      at org.apache.avro.ipc.Requestor.request(Requestor.java:101)
      at org.apache.avro.ipc.specific.SpecificRequestor.invoke(SpecificRequestor.java:118)
      at service.actor.ActorRegistry$1.invoke(ActorRegistry.java:80)
      at com.sun.proxy.$Proxy17.getSessionList(Unknown Source:-1)
      at lobby.actors.client.ListsSessions.handleRequest(ListsSessions.java:29)
      at lobby.actors.client.ListsSessions.handleRequest(ListsSessions.java:20)
      at actors.service.http.CommandExecutor.handleMessage(CommandExecutor.java:77)
      at co.paralleluniverse.actors.behaviors.BehaviorActor.behavior(BehaviorActor.java:237)
      at co.paralleluniverse.actors.behaviors.BehaviorActor.doRun(BehaviorActor.java:293)
      at co.paralleluniverse.actors.behaviors.BehaviorActor.doRun(BehaviorActor.java:36)
      at co.paralleluniverse.actors.Actor.run0(Actor.java:691)
      at co.paralleluniverse.actors.ActorRunner.run(ActorRunner.java:51)
      at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1024)
      at co.paralleluniverse.fibers.Fiber.run1(Fiber.java:1019)
      at co.paralleluniverse.fibers.Fiber.exec(Fiber.java:730)
      at co.paralleluniverse.fibers.FiberForkJoinScheduler$FiberForkJoinTask.exec1(FiberForkJoinScheduler.java:265)
      at co.paralleluniverse.concurrent.forkjoin.ParkableForkJoinTask.doExec(ParkableForkJoinTask.java:116)
      at co.paralleluniverse.concurrent.forkjoin.ParkableForkJoinTask.exec(ParkableForkJoinTask.java:73)
      at co.paralleluniverse.fibers.FiberForkJoinScheduler$FiberForkJoinTask.doExec(FiberForkJoinScheduler.java:272)
      at co.paralleluniverse.fibers.Fiber.immediateExecHelper(Fiber.java:1205)
      at co.paralleluniverse.fibers.Fiber.exec(Fiber.java:1173)
      at co.paralleluniverse.fibers.Fiber.yieldAndUnpark1(Fiber.java:691)
      at co.paralleluniverse.fibers.Fiber.yieldAndUnpark(Fiber.java:619)
      at co.paralleluniverse.strands.Strand.yieldAndUnpark(Strand.java:531)
      at co.paralleluniverse.strands.OwnedSynchronizer.signalAndWait(OwnedSynchronizer.java:66)
      at co.paralleluniverse.strands.channels.QueueChannel.signalAndWait(QueueChannel.java:109)
      at co.paralleluniverse.strands.channels.QueueChannel.send0(QueueChannel.java:245)
      at co.paralleluniverse.strands.channels.QueueChannel.sendSync(QueueChannel.java:204)
      at co.paralleluniverse.actors.Mailbox.sendSync(Mailbox.java:73)
      at co.paralleluniverse.actors.Actor.sendSync(Actor.java:429)
      at co.paralleluniverse.actors.ActorRef.sendSync(ActorRef.java:97)
      at co.paralleluniverse.actors.behaviors.RequestReplyHelper.call(RequestReplyHelper.java:173)
      at co.paralleluniverse.actors.behaviors.RequestReplyHelper.call(RequestReplyHelper.java:112)
      at actors.service.http.CommandExecutor$InitSuspendableFunction.apply(CommandExecutor.java:105)
      at actors.service.http.CommandExecutor$InitSuspendableFunction.apply(CommandExecutor.java:86)
      at actors.service.http.CommandServlet$Action.process(CommandServlet.java:90)
      at actors.service.http.CommandServlet.doGet(CommandServlet.java:121)
      at co.paralleluniverse.fibers.servlet.FiberHttpServlet.service(FiberHttpServlet.java:165)
      at co.paralleluniverse.fibers.servlet.FiberHttpServlet$1.run(FiberHttpServlet.java:120)
      at co.paralleluniverse.strands.SuspendableUtils$VoidSuspendableCallable.run(SuspendableUtils.java:44)
      at co.paralleluniverse.strands.SuspendableUtils$VoidSuspendableCallable.run(SuspendableUtils.java:32)
      at co.paralleluniverse.fibers.Fiber.run(Fiber.java:1024)
      at co.paralleluniverse.fibers.Fiber.run1(Fiber.java:1019)
      at co.paralleluniverse.fibers.Fiber.exec(Fiber.java:730)
      at co.paralleluniverse.fibers.FiberForkJoinScheduler$FiberForkJoinTask.exec1(FiberForkJoinScheduler.java:265)
      at co.paralleluniverse.concurrent.forkjoin.ParkableForkJoinTask.doExec(ParkableForkJoinTask.java:116)
      at co.paralleluniverse.concurrent.forkjoin.ParkableForkJoinTask.exec(ParkableForkJoinTask.java:73)
      at jsr166e.ForkJoinTask.doExec(ForkJoinTask.java:261)
      at jsr166e.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:988)
      at jsr166e.ForkJoinPool.runWorker(ForkJoinPool.java:1628)
      at jsr166e.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

@bozaro
Author

bozaro commented Oct 7, 2015

I am using Quasar version 0.7.3.

@bozaro
Author

bozaro commented Oct 7, 2015

I created a similar class, SingleConsumerLinkedObjectQueueTest, to reproduce the problem in SingleConsumerLinkedObjectQueue.

The problem can be reproduced as follows:

  • Open the project in IDEA;
  • Add a breakpoint at SingleConsumerLinkedQueue.java:95
                if (v == 0) {
                    // PUT BREAKPOINT HERE
                    v = 1;
                }
  • Run the Gradle task quasar:quasar-core jdk8Test from IDEA;
  • At the breakpoint, simply press Resume (F9) to let the program continue.

pron added a commit that referenced this pull request Oct 12, 2015
@pron
Contributor

pron commented Oct 12, 2015

Can you try the 0.7.0 branch now (with the latest commit: dcf669b), still with SingleConsumerLinkedQueue? (it's a single-line change; might be easier to just apply it by hand)

@bozaro
Author

bozaro commented Oct 12, 2015

Hello.

I tried this change with the "breakpoint" test and it had no effect: the program still hangs in an infinite loop.

Now I have found a very strange thing: when I stop at the breakpoint, IDEA shows two different stack traces for the same thread:

Stack trace obtained with IDEA's "Get thread dump" button (the button with the camera icon):

"consumer@1729" daemon prio=5 tid=0x11 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.pk(SingleConsumerLinkedQueue.java:118)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.isEmpty(SingleConsumerLinkedQueue.java:167)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedObjectQueue.isEmpty(SingleConsumerLinkedObjectQueue.java:22)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.deq(SingleConsumerLinkedQueue.java:95)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.poll(SingleConsumerLinkedQueue.java:51)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedObjectQueue.poll(SingleConsumerLinkedObjectQueue.java:22)
      at co.paralleluniverse.strands.queues.SingleConsumerLinkedObjectQueueTest$1.run(SingleConsumerLinkedObjectQueueTest.java:29)
      at java.lang.Thread.run(Thread.java:745)

Stack trace on the "Debugger" tab:

at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.deq(SingleConsumerLinkedObjectQueue.java:95)
at co.paralleluniverse.strands.queues.SingleConsumerLinkedQueue.poll(SingleConsumerLinkedQueue.java:51)
at co.paralleluniverse.strands.queues.SingleConsumerLinkedObjectQueue.poll(SingleConsumerLinkedObjectQueue.java:22)
at co.paralleluniverse.strands.queues.SingleConsumerLinkedObjectQueueTest$1.run(SingleConsumerLinkedObjectQueueTest.java:29)
at java.lang.Thread.run(Thread.java:745)

I am currently trying to find an explanation for this phenomenon.

@bozaro
Author

bozaro commented Oct 12, 2015

I changed the queue to SingleConsumerLinkedQueue, applied dcf669b and rebuilt Quasar.

After that I ran our project, and it appears to work.

So far the code has been running successfully for 15 minutes. Previously, it would hang within about 30 seconds.

@pron
Contributor

pron commented Oct 12, 2015

Good! Now, I'll try to find the bug in the linked-array queue.

@bozaro
Author

bozaro commented Oct 12, 2015

It looks like the "breakpoint" test itself is incorrect: there is some JIT black magic involved.

The fix from dcf669b really helps us. Thanks a lot.
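The thread does not say what the "JIT black magic" was. One well-known effect of this kind, offered here only as a possible illustration, is that the JIT compiler may hoist the read of a non-volatile field out of a busy-wait loop, so the loop never sees a later write. The sketch below shows the usual remedy of declaring the field `volatile`. The class `VisibleFlagSketch` and its members are hypothetical and unrelated to Quasar's code.

```java
// Hypothetical illustration, not taken from Quasar or from the test in
// this pull request: a busy-wait loop on a field written by another thread.
final class VisibleFlagSketch {
    // volatile forces the loop to re-read the field on every iteration;
    // without it the JIT is allowed to read the field once and spin forever.
    private volatile boolean ready;

    // Starts a spinner, sets the flag, and reports whether the spinner stopped.
    static boolean spinnerStops(long timeoutMillis) {
        VisibleFlagSketch s = new VisibleFlagSketch();
        Thread spinner = new Thread(() -> {
            while (!s.ready) {
                // busy-wait until the other thread sets the flag
            }
        }, "consumer");
        spinner.setDaemon(true);
        spinner.start();
        s.ready = true; // volatile write, guaranteed to become visible to the spinner
        try {
            spinner.join(timeoutMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !spinner.isAlive();
    }

    public static void main(String[] args) {
        System.out.println(spinnerStops(2000));
    }
}
```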

pron added a commit that referenced this pull request Oct 13, 2015
@pron
Contributor

pron commented Oct 13, 2015

Alright, so I've made another single-line commit -- 5179c26 -- that will hopefully fix the original issue with the linked-array queue. It's on the 0.7.0 branch (or you can manually apply it). Please try and let me know.

@bozaro
Author

bozaro commented Oct 13, 2015

I changed the queue back to SingleConsumerLinkedArrayQueue, applied 5179c26 and rebuilt Quasar.

After that I ran our project, and it appears to work.

@pron
Contributor

pron commented Oct 13, 2015

Excellent! As it turns out, those were two different bugs in the two queues, but the reason they went unnoticed until now is that they only manifest if you do a lot of selective receives.
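For context, a selective receive scans the mailbox for the first message that matches a condition and removes it from wherever it sits, leaving earlier messages in place. That is why it exercises the queue's iteration path (the stack trace above shows SelectiveReceiveHelper.receive calling the queue iterator's hasNext) rather than only plain head removal. The following single-threaded sketch models the idea with a plain `java.util.ArrayDeque`. The class `SelectiveReceiveSketch` is hypothetical and does not use Quasar's API.

```java
import java.util.ArrayDeque;
import java.util.Iterator;
import java.util.function.Predicate;

// Hypothetical single-threaded model of a selective receive; Quasar's
// own mailbox and SelectiveReceiveHelper are not used here.
final class SelectiveReceiveSketch {
    // Removes and returns the first message accepted by the predicate,
    // or returns null if no queued message matches.
    static <M> M receive(ArrayDeque<M> mailbox, Predicate<? super M> wanted) {
        for (Iterator<M> it = mailbox.iterator(); it.hasNext(); ) {
            M message = it.next();
            if (wanted.test(message)) {
                it.remove(); // may remove from the middle of the queue
                return message;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ArrayDeque<String> mailbox = new ArrayDeque<>();
        mailbox.add("a");
        mailbox.add("reply:1");
        mailbox.add("b");
        System.out.println(receive(mailbox, m -> m.startsWith("reply:")));
        System.out.println(mailbox); // the skipped messages stay queued in order
    }
}
```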

@pron pron closed this Oct 13, 2015
@bozaro
Author

bozaro commented Oct 13, 2015

Thanks a lot 🍻
