[2.8.1] FGC and throw NPE #14268
Comments
@eolivelli @codelipenghui @michaeljmarshall PTAL, thanks!
Similar to this issue: #9234
This looks like a thread safety issue. There's #12606, which improves the situation slightly. The solution in #11387 would help address such issues in a better way, although that still has some gaps and possible issues.
@lhotari I looked at one of the slow tasks and found that it only reads from and writes to ZooKeeper, and the ZooKeeper timeout is only 30 seconds, so I can't understand why the execution time reaches the hour level:
What do you mean by "reaches the hour level"? 13450016 microseconds is 13.45 seconds.
With FGC, do you mean "Full GC"? The GC log tells the reason why Full GCs get triggered. GC logging was added by #7498. Have you checked the GC log entries?
Aha, I was wrong, it was 13.45 seconds.
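For anyone reading similar warnings, here is a minimal sketch of pulling the duration out of a "took too long ... micros" line and converting it to seconds. The class name `SlowTaskLog` and the helper `durationSeconds` are hypothetical and only for illustration; they are not part of Pulsar or BookKeeper, and the regex assumes the message format shown in the warnings in this issue.

```java
import java.util.OptionalDouble;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for reading OrderedExecutor slow-task warnings.
class SlowTaskLog {
    // Assumes the message format seen in this issue:
    // "... took too long 12721684 micros to execute."
    private static final Pattern DURATION =
            Pattern.compile("took too long (\\d+) micros");

    /** Returns the reported duration in seconds, or empty if the line has none. */
    static OptionalDouble durationSeconds(String logLine) {
        Matcher m = DURATION.matcher(logLine);
        if (!m.find()) {
            return OptionalDouble.empty();
        }
        // 1 second = 1,000,000 microseconds
        return OptionalDouble.of(Long.parseLong(m.group(1)) / 1_000_000.0);
    }

    public static void main(String[] args) {
        String line = "Runnable X took too long 12721684 micros to execute.";
        durationSeconds(line).ifPresent(s -> System.out.println(s + " seconds"));
    }
}
```

Seven or eight digits of microseconds are therefore in the range of seconds to tens of seconds; an hour would be 3,600,000,000 micros.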
The GC log of the node in question was not saved. I found some Full GC logs from other production nodes that also hit Full GC, but those nodes are not in Full GC continuously: The JVM monitoring of the problem broker is as follows: It seems that there is a memory leak. The old generation keeps growing, but the read and write traffic of the node has not increased. @lhotari
I uploaded the dump file to Baidu Cloud Disk, which can be downloaded here:
Closed by #14515
Describe the bug
The symptoms of our production problem are as follows:
1. NPE exception:
03:22:50.442 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]
at
2. Frequent Full GC. I dumped the heap and found a very large number of pendingAddOp objects (more than seven million), even though the write traffic has dropped to the bottom:
The one that occupies the most heap memory is pendingAcks:
3. Looking at the log, some thread pools take too long to execute tasks, even reaching the hour level:
03:23:47.838 [bookkeeper-ml-scheduler-OrderedScheduler-54-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433@2fc9a98e:class org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433 took too long 12721684 micros to execute.
03:24:53.508 [bookkeeper-ml-scheduler-OrderedScheduler-32-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable@3f387130:class org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable took too long 16594876 micros to execute.
03:26:02.079 [bookkeeper-ml-scheduler-OrderedScheduler-35-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$92/1773008684@4ddf5185:class org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$92/1773008684 took too long 17111879 micros to execute.
4. ZooKeeper session timeout:
I noticed there is this PR that may be related to this issue: #12993