[2.8.1] FGC and throw NPE #14268

lordcheng10 · 2022-02-14T09:46:52Z

Describe the bug
The phenomenon of our online problem is as follows:
1. NPE exception:
03:22:50.442 [BookKeeperClientWorker-OrderedExecutor-4-0] ERROR org.apache.bookkeeper.common.util.SafeRunnable - Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]
at
2. Frequent FGC, I dumped the memory and found that there are a lot of pendingAddOp objects (more than seven million), but the write traffic has dropped to the bottom:

The one that occupies the most heap memory is pendingAcks：

3.Looking at the log, some thread pools take too long to execute tasks, even reaching the hour level:
03:23:47.838 [bookkeeper-ml-scheduler-OrderedScheduler-54-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433@2fc9a98e:class org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433 took too long 12721684 micros to execute.
03:24:53.508 [bookkeeper-ml-scheduler-OrderedScheduler-32-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable@3f387130:class org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable took too long 16594876 micros to execute.
03:26:02.079 [bookkeeper-ml-scheduler-OrderedScheduler-35-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$92/1773008684@4ddf5185:class org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$$Lambda$92/1773008684 took too long 17111879 micros to execute.

4. zookkepeer session timeout:

I noticed there is this PR that may be related to this issue: #12993

lordcheng10 · 2022-02-14T09:51:36Z

@eolivelli @codelipenghui @michaeljmarshall PTAL,thanks!

lordcheng10 · 2022-02-14T15:06:10Z

Similar to this issue： #9234
03:23:47.838 [bookkeeper-ml-scheduler-OrderedScheduler-54-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433@2fc9a98e:class org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$$Lambda$1000/1597732433 took too long 12721684 micros to execute.

lhotari · 2022-02-15T05:17:03Z

Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]

This looks like a thread safety issue. There's #12606 which improves the situation slightly. The solution in #11387 would help address such issues in better way although that still has some gaps and possible issues.

lordcheng10 · 2022-02-15T05:40:56Z

Unexpected throwable caught
java.lang.NullPointerException: null
at org.apache.bookkeeper.mledger.impl.OpAddEntry.addComplete(OpAddEntry.java:153) ~[org.apache.pulsar-managed-ledger-2.8.1.2.jar:2.8.1.2]

This looks like a thread safety issue. There's #12606 which improves the situation slightly. The solution in #11387 would help address such issues in better way although that still has some gaps and possible issues.

@lhotari
The most confusing thing is why the FGC is triggered, and the thread pool execution time reaches the hour level：
04:35:46.571 [bookkeeper-ml-scheduler-OrderedScheduler-18-0] WARN org.apache.bookkeeper.common.util.OrderedExecutor - Runnable org.apache.bookkeeper.mledger.util.SafeRun$1@89aa735:class org.apache.bookkeeper.mledger.util.SafeRun$1 took too long 13450016 micros to execute.

I even took an execution task to see and found that it is only reading and writing zk, and the timeout time of zk is only 30 seconds, so I can't understand why it reaches the hour level:

lhotari · 2022-02-15T06:31:32Z

I even took an execution task to see and found that it is only reading and writing zk, and the timeout time of zk is only 30 seconds, so I can't understand why it reaches the hour level:

What do you mean with "reaches the hour level"? 13450016 microseconds is 13.45 seconds .

lhotari · 2022-02-15T06:36:23Z

The most confusing thing is why the FGC is triggered

With FGC, do you mean "Full GC"? The GC log tells the reason why Full GCs get triggered. GC logging was added by #7498 . Have you checked the GC log entries?

lhotari · 2022-02-15T06:38:51Z

I noticed there is this PR that may be related to this issue: #12993

Yes, #12993 could help addressing the NPE. Have you tested the behavior with Pulsar 2.8.2 which includes #12993?

lordcheng10 · 2022-02-15T06:55:27Z

What do you mean with "reaches the hour level"? 13450016 microseconds is 13.45 seconds .

Aha,I was wrong, it was 13.45 seconds .

lordcheng10 · 2022-02-15T07:14:05Z

I noticed there is this PR that may be related to this issue: #12993

Yes, #12993 could help addressing the NPE. Have you tested the behavior with Pulsar 2.8.2 which includes #12993?

I didn't test 2.8.2, only found NPE in 2.8.1

lordcheng10 · 2022-02-15T07:41:47Z

The most confusing thing is why the FGC is triggered

With FGC, do you mean "Full GC"? The GC log tells the reason why Full GCs get triggered. GC logging was added by #7498 . Have you checked the GC log entries?

The gc log of the node in question is not saved. I found some FULL GC logs from other FULL GC logs on the line, but these nodes are not continuously FULL GC:
2022-02-15T12:45:47.604+0800: 423715.809: [Full GC (Allocation Failure) 23G->10G(24G), 11.4667426 secs]
[Eden: 0.0B(12.0G)->0.0B(12.0G) Survivors: 0.0B->0.0B Heap: 23.9G(24.0G)->10.6G(24.0G)], [Metaspace: 65542K->65542K(1114112K)]
[Times: user=20.56 sys=0.00, real=11.47 secs]

The JVM monitoring of the problem broker is as follows:

It seems that there is a memory leak. The memory size of the old age continues to increase, but the read and write traffic of the node has not increased. @lhotari

lordcheng10 · 2022-02-15T07:45:07Z

I uploaded the dump file to Baidu Cloud Disk, which can be downloaded here:
https://pan.baidu.com/s/1qNMQFS5QtpkwcKpOlK5HiA
Extraction code: cvgr @lhotari

codelipenghui · 2022-03-15T10:23:23Z

close by #14515

lordcheng10 added the type/bug The PR fixed a bug or issue reported a bug label Feb 14, 2022

sijie mentioned this issue Feb 14, 2022

ISSUE-14268: [2.8.1] FGC and throw NPE streamnative/pulsar-archived#3738

Closed

lordcheng10 mentioned this issue Feb 22, 2022

[WIP] fix memory leak: Use ConcurrentLongLongPairHas that supports to shrink #14407

Closed

3 tasks

lordcheng10 mentioned this issue Mar 1, 2022

Optimize memory usage: support to shrink for pendingAcks map #14515

Merged

3 tasks

codelipenghui closed this as completed Mar 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[2.8.1] FGC and throw NPE #14268

[2.8.1] FGC and throw NPE #14268

lordcheng10 commented Feb 14, 2022 •

edited

Loading

lordcheng10 commented Feb 14, 2022

lordcheng10 commented Feb 14, 2022 •

edited

Loading

lhotari commented Feb 15, 2022

lordcheng10 commented Feb 15, 2022 •

edited

Loading

lhotari commented Feb 15, 2022 •

edited

Loading

lhotari commented Feb 15, 2022

lhotari commented Feb 15, 2022 •

edited

Loading

lordcheng10 commented Feb 15, 2022

lordcheng10 commented Feb 15, 2022

lordcheng10 commented Feb 15, 2022 •

edited

Loading

lordcheng10 commented Feb 15, 2022 •

edited

Loading

codelipenghui commented Mar 15, 2022

[2.8.1] FGC and throw NPE #14268

[2.8.1] FGC and throw NPE #14268

Comments

lordcheng10 commented Feb 14, 2022 • edited Loading

lordcheng10 commented Feb 14, 2022

lordcheng10 commented Feb 14, 2022 • edited Loading

lhotari commented Feb 15, 2022

lordcheng10 commented Feb 15, 2022 • edited Loading

lhotari commented Feb 15, 2022 • edited Loading

lhotari commented Feb 15, 2022

lhotari commented Feb 15, 2022 • edited Loading

lordcheng10 commented Feb 15, 2022

lordcheng10 commented Feb 15, 2022

lordcheng10 commented Feb 15, 2022 • edited Loading

lordcheng10 commented Feb 15, 2022 • edited Loading

codelipenghui commented Mar 15, 2022

lordcheng10 commented Feb 14, 2022 •

edited

Loading

lordcheng10 commented Feb 14, 2022 •

edited

Loading

lordcheng10 commented Feb 15, 2022 •

edited

Loading

lhotari commented Feb 15, 2022 •

edited

Loading

lhotari commented Feb 15, 2022 •

edited

Loading

lordcheng10 commented Feb 15, 2022 •

edited

Loading

lordcheng10 commented Feb 15, 2022 •

edited

Loading