[ManagedLedger] Pin executor and scheduled executor threads for ManagedLedgerImpl #11387

lhotari · 2021-07-20T07:59:21Z

Motivation

OpReadEntry is not multi-thread safe. OpReadEntry.entries is an ArrayList without any synchronization.
However it is accessed from multiple threads.

Here's an example of such code:

OpReadEntry.entries mutated:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

Line 81 in 5ad4059

entries.addAll(filteredEntries);

Mutation triggered from multiple threads:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

Lines 137 to 148 in 5ad4059

    
    if (entries.size() < count && cursor.hasMoreEntries() &&
            ((PositionImpl) cursor.getReadPosition()).compareTo(maxPosition) < 0) {
        // We still have more entries to read from the next ledger, schedule a new async operation
        if (nextReadPosition.getLedgerId() != readPosition.getLedgerId()) {
            cursor.ledger.startReadOperationOnLedger(nextReadPosition, OpReadEntry.this);
        }
        // Schedule next read in a different thread
        cursor.ledger.getExecutor().execute(safeRun(() -> {
            readPosition = cursor.ledger.startReadOperationOnLedger(nextReadPosition, OpReadEntry.this);
            cursor.ledger.asyncReadEntries(OpReadEntry.this);
        }));

The operations seem to happen sequentially when OpReadEntry.entries is mutated or accessed, but this is not enough for ensuring thread safety in Java.

The goal of this change is to make Managed Ledger operations run in the pinned executor thread, where the thread is picked by the hash of the managed ledger name. Most of the code is already written this way, but there are some exceptions. This PR ensures that the pinned executor is used in all cases where work is scheduled to run using the ledger's executor.

Modifications

- Don't expose OrderedExecutor from ManagedLedgerImpl.getExecutor
  - instead, return an executor that is pinned to a single thread with .chooseThread(getName())
  - most usages of ManagedLedgerImpl.getExecutor were already calling .chooseThread(ml.getName()), however
    some locations were omitting it. It's better to always pin the ManagedLedgerImpl.getExecutor
    to a single thread.
- Don't expose OrderedScheduler from ManagedLedgerImpl.getScheduledExecutor
  - instead, return a scheduled executor service that is pinned to a single thread with .chooseThread(getName())
- Pin executor and scheduled executor usage inside the ManagedLedgerImpl class
  - this improves thread safety of the Managed Ledger code base since more operations will happen in a single thread
  - some classes such as OpReadEntry are not multi-thread safe. OpReadEntry.entries is an ArrayList without any synchronization.
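
To illustrate what "pinning" by name means, here is a minimal sketch. It assumes a hypothetical PinnedExecutorSketch class backed by plain single-threaded JDK executors; it is not BookKeeper's OrderedExecutor, only an approximation of the idea that the same key always resolves to the same thread.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an ordered executor; NOT BookKeeper's OrderedExecutor.
public class PinnedExecutorSketch {
    private final ExecutorService[] threads;

    public PinnedExecutorSketch(int numThreads) {
        threads = new ExecutorService[numThreads];
        for (int i = 0; i < numThreads; i++) {
            threads[i] = Executors.newSingleThreadExecutor();
        }
    }

    // The same key always maps to the same single-threaded executor.
    public ExecutorService chooseThread(String key) {
        return threads[Math.floorMod(key.hashCode(), threads.length)];
    }

    public void shutdown() {
        for (ExecutorService t : threads) {
            t.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        PinnedExecutorSketch pool = new PinnedExecutorSketch(4);
        // Resolve the pinned executor once (for example in a constructor) and reuse it.
        ExecutorService pinned = pool.chooseThread("my-managed-ledger");
        String first = pinned.submit(() -> Thread.currentThread().getName()).get();
        String second = pool.chooseThread("my-managed-ledger")
                .submit(() -> Thread.currentThread().getName()).get();
        System.out.println(first.equals(second)); // both tasks ran on the same thread
        pool.shutdown();
    }
}
```

Resolving the executor once and keeping a reference to it also avoids hashing the name on every submission.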

Known gaps

ManagedLedgerImpl uses two separate executors: the scheduled executor and a "normal" executor. This leads to multi-thread access. It would be better to combine the execution of both scheduled and "normal" execution to a single thread in some upcoming PRs.

BewareMyPower · 2021-07-20T10:00:26Z

Great find!

But I have a question about it. entries is exposed by the following public method:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

Lines 163 to 165 in 5ad4059

    
    public int getNumberOfEntriesToRead() {
        return count - entries.size();
    }

And the method's call stack could be:

ManagedCursor#asyncReadEntries
  ManagedCursorImpl#asyncReadEntries
    ManagedLedgerImpl#asyncReadEntries
      ManagedLedgerImpl#internalReadFromLedger

Could ManagedCursor#asyncReadEntries be called in a different thread?

lhotari · 2021-07-20T10:31:19Z

> Could ManagedCursor#asyncReadEntries be called in a different thread?

@BewareMyPower This is one example in the current code base where a different thread might be used:

pulsar/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

Lines 137 to 148 in 5ad4059

    
    if (entries.size() < count && cursor.hasMoreEntries() &&
            ((PositionImpl) cursor.getReadPosition()).compareTo(maxPosition) < 0) {
        // We still have more entries to read from the next ledger, schedule a new async operation
        if (nextReadPosition.getLedgerId() != readPosition.getLedgerId()) {
            cursor.ledger.startReadOperationOnLedger(nextReadPosition, OpReadEntry.this);
        }
        // Schedule next read in a different thread
        cursor.ledger.getExecutor().execute(safeRun(() -> {
            readPosition = cursor.ledger.startReadOperationOnLedger(nextReadPosition, OpReadEntry.this);
            cursor.ledger.asyncReadEntries(OpReadEntry.this);
        }));

BewareMyPower · 2021-07-20T11:03:47Z

@lhotari I know. What I want to ask is this: when ManagedCursor#asyncReadEntries is called, OpReadEntry#getNumberOfEntriesToRead will eventually be called, even after your change. Is it called in another thread? If so, there could still be a thread-safety problem.

lhotari · 2021-07-20T11:25:43Z

> @lhotari I know. What I want to ask is when ManagedCursor#asyncReadEntries is called, the OpReadEntry#getNumberOfEntriesToRead will be called eventually even after your change. Is it called in another thread? If yes, there could still be a thread safety problem.

The concurrency design doesn't currently ensure single-thread access, so multi-thread access is possible at the moment. For example, the scheduler uses a different executor and therefore also a different thread. However, I believe that this PR improves the existing solution and can help reduce problems caused by thread-safety issues.
I think it's possible to incrementally refactor the concurrency solution towards a solution where a single thread handles all accesses for a specific key (managed ledger name in this case). We can have a broader discussion about this on the mailing list or in some upcoming community meeting.

eolivelli · 2021-07-20T14:15:47Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

@@ -141,8 +141,8 @@ void checkReadCompletion() {
                cursor.ledger.startReadOperationOnLedger(nextReadPosition, OpReadEntry.this);
            }

-            // Schedule next read in a different thread


This is a behaviour change.
How can we verify that we are not breaking something or reducing overall performance?

> How can we verify that we are not breaking something or reducing overall performance?

testing, testing, testing. we need more Fallout tests. :)

devinbost · 2021-07-20T21:31:30Z

Related to #6054

eolivelli

@rdhabalia @merlimat @codelipenghui can you please give your opinion on this patch ?

I believe it is a good fix, but we need more eyes

merlimat · 2021-08-20T17:35:12Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

@@ -252,7 +253,8 @@
    protected volatile State state = null;

    private final OrderedScheduler scheduledExecutor;
-    private final OrderedExecutor executor;
+    private final ScheduledExecutorService pinnedScheduledExecutor;
+    private final Executor pinnedExecutor;


If we are using 2 threads, one with the regular executor (which is more efficient) and the other for the pinnedScheduledExecutor, wouldn't that mean that we still have more than 1 thread accessing some of the objects?

Would it make sense to use the generic scheduledExecutor (just for deferring purposes) and then jump back into the same pinnedExecutor?

That's true.

Perhaps a better solution would be to have the capability for scheduling tasks on the pinned scheduler. I don't know why this isn't available in the underlying BookKeeper libraries that are used. The benefit would be that there is no additional thread switch when the scheduled task triggers.

> Would it make sense to use the generic scheduledExecutor (just for deferring purposes) and then jump back into the same pinnedExecutor?

@merlimat Do you mean scheduledExecutor.schedule(pinnedExecutor.execute() ...) ?
Seems to be a feasible way right now :)

> Perhaps a more optimal solution would be to have the capability for scheduling tasks on the pinned scheduler

@lhotari The scheduled executor is less efficient than the normal executor because it has to maintain the delayed tasks. For that reason, it's preferable not to use it directly in the critical data path, but only when we want to defer actions or for background tasks.
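
The "defer on the generic scheduler, then jump back to the pinned executor" idea discussed in this thread could look roughly like the sketch below. The class name, the helper scheduleOnPinned, and the thread names are assumptions made for illustration; only standard java.util.concurrent calls are used, not the BookKeeper OrderedScheduler API.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class DeferThenHopSketch {

    // The shared scheduler only handles the delay; the actual work is handed
    // over to the pinned executor so that it runs on the pinned thread.
    static ScheduledFuture<?> scheduleOnPinned(ScheduledExecutorService sharedScheduler,
                                               Executor pinnedExecutor,
                                               Runnable task, long delay, TimeUnit unit) {
        return sharedScheduler.schedule(() -> pinnedExecutor.execute(task), delay, unit);
    }

    // Returns the name of the thread that ran the deferred task.
    static String runDemo() {
        ScheduledExecutorService shared =
                Executors.newSingleThreadScheduledExecutor(r -> new Thread(r, "shared-scheduler"));
        ExecutorService pinned =
                Executors.newSingleThreadExecutor(r -> new Thread(r, "pinned"));
        CompletableFuture<String> ranOn = new CompletableFuture<>();
        scheduleOnPinned(shared, pinned,
                () -> ranOn.complete(Thread.currentThread().getName()),
                10, TimeUnit.MILLISECONDS);
        try {
            return ranOn.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            shared.shutdown();
            pinned.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // the task body runs on the "pinned" thread
    }
}
```

The cost of this approach is one extra thread hop when the timer fires; the benefit is that the scheduled executor stays out of the critical data path.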

merlimat · 2021-08-20T17:35:59Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

@@ -298,7 +300,8 @@ public ManagedLedgerImpl(ManagedLedgerFactoryImpl factory, BookKeeper bookKeeper
        this.ledgerMetadata = LedgerMetadataUtils.buildBaseManagedLedgerMetadata(name);
        this.digestType = BookKeeper.DigestType.fromApiDigestType(config.getDigestType());
        this.scheduledExecutor = scheduledExecutor;
-        this.executor = bookKeeper.getMainWorkerPool();
+        this.pinnedScheduledExecutor = scheduledExecutor.chooseThread(name);
+        this.pinnedExecutor = bookKeeper.getMainWorkerPool().chooseThread(name);


I don't know why we never did this, but this saves a lot of string hashings too :)

merlimat · 2021-08-20T17:39:01Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

@@ -2159,7 +2162,7 @@ void notifyCursors() {
                break;
            }

-            executor.execute(safeRun(waitingCursor::notifyEntriesAvailable));
+            pinnedExecutor.execute(safeRun(waitingCursor::notifyEntriesAvailable));


Is this required to be on the same executor?

We're notifying multiple cursors that entries are available; this should be able to progress in parallel.

There are 2 places that call the notifyCursors() method. One is OpAddEntry.safeRun(), which already runs on the pinnedExecutor, so it doesn't need to jump again.

The other one is when the ledger is closed; it looks like only that place needs to change.

merlimat · 2021-08-20T17:39:38Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

@@ -2170,7 +2173,7 @@ void notifyWaitingEntryCallBacks() {
                break;
            }

-            executor.execute(safeRun(cb::entriesAvailable));
+            pinnedExecutor.execute(safeRun(cb::entriesAvailable));


Same for this one, it should be safe to spread across multiple threads.

merlimat · 2021-08-20T17:40:53Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java

    }

    private void scheduleDeferredTrimming(boolean isTruncate, CompletableFuture<?> promise) {
-        scheduledExecutor.schedule(safeRun(() -> trimConsumedLedgersInBackground(isTruncate, promise)), 100, TimeUnit.MILLISECONDS);
+        pinnedScheduledExecutor


Since trimConsumedLedgersInBackground() is already jumping on the pinnedExecutor, we shouldn't need to use a specific thread for the scheduled executor.

merlimat · 2021-08-20T17:43:50Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

@@ -93,7 +93,7 @@ public void readEntriesFailed(ManagedLedgerException exception, Object ctx) {

        if (!entries.isEmpty()) {
            // There were already some entries that were read before, we can return them
-            cursor.ledger.getExecutor().execute(safeRun(() -> {
+            cursor.ledger.getPinnedExecutor().execute(safeRun(() -> {


I think we should be careful not to serialize every cursor onto the managed ledger's pinned thread, as it could become a bottleneck when there are many cursors on a topic.

Yes that's true.

The reason to use the pinned executor is to adhere to Java Memory Model rules of correct synchronization. There's a generic problem in OpReadEntry since it's sharing an array that is mutated by multiple threads. JLS 17.4 explains that "Incorrectly Synchronized Programs May Exhibit Surprising Behavior".

I would assume that "entries" would have to be copied to a new list before sharing if we want to use multiple threads. Is that right?

I think the entries are read one by one. If we get a read failure here, it means we will not get a chance to add more elements to the list, right (all the previous read operations are done)?
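
The "copy entries to a new list before sharing" option raised in this thread can be sketched as follows. SnapshotBeforeHandOff and its methods are hypothetical and are not part of OpReadEntry. The sketch relies on the documented java.util.concurrent guarantee that actions taken before submitting a task to an executor happen-before the task starts executing.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SnapshotBeforeHandOff {
    // Mutable list that is only ever touched by the owning thread.
    private final List<String> entries = new ArrayList<>();

    void add(String entry) {
        entries.add(entry);
    }

    // Copy first, then hand the read-only copy to the other executor.
    // The owner can keep mutating `entries` without affecting the snapshot.
    Future<Integer> deliver(ExecutorService callbackExecutor) {
        List<String> snapshot = Collections.unmodifiableList(new ArrayList<>(entries));
        entries.clear();
        return callbackExecutor.submit(() -> snapshot.size());
    }

    // Adds three entries and returns how many the callback thread observed.
    static int runDemo() {
        ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();
        try {
            SnapshotBeforeHandOff op = new SnapshotBeforeHandOff();
            op.add("e1");
            op.add("e2");
            op.add("e3");
            return op.deliver(callbackExecutor).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            callbackExecutor.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // prints 3
    }
}
```

The copy costs an allocation per hand-off, which is the trade-off against serializing every callback onto the pinned thread.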

merlimat · 2021-08-20T17:52:40Z

managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpReadEntry.java

-            // Schedule next read in a different thread
-            cursor.ledger.getExecutor().execute(safeRun(() -> {
+            // Schedule next read
+            cursor.ledger.getPinnedExecutor().execute(safeRun(() -> {


Other than the consideration that different cursors shouldn't be pinned on a single thread, the reason for jumping to a different thread here is to avoid a stack overflow.

When the read is being served from the ML cache, it's coming back from same thread. There are some conditions in which we ask for next read.

E.g., if you ask to read 100 entries and we only got 20 entries from the current ledger, we'll schedule a read for the remaining 80 on the next ledger. In some cases there could be abnormal distributions, like 1 entry per ledger, and all the reads and callbacks would be chained within the same stack.

Therefore, the "jump to a random thread" was introduced to break that chain.

Wouldn't the usage of the pinned executor achieve the same result? It prevents the stack from going deeper and deeper.
Why would it have to jump to a random thread to break the chain?

The only reason that comes to my mind is the case where a completable future gets triggered as part of the call flow and is being waited on in the same thread where its result should be executed. That would never complete and would deadlock. Would that be the reason to use a different executor here?

It looks like the stack should be checkReadCompletion -> entryCache.asyncReadEntry0 -> checkReadCompletion -> entryCache.asyncReadEntry0 -> checkReadCompletion -> entryCache.asyncReadEntry0 and so on, if we have entries in the cache.

> Wouldn't the usage of the pinned executor achieve the same result? It prevents the stack from going deeper and deeper.
> Why would it have to jump to a random thread to break the chain?

@lhotari Uhm, I think that some executors short-circuit the queue if they detect that you're trying to add a task from the same executor thread. That is the case for the Netty IO thread, though I just checked that it shouldn't happen with the ThreadPoolExecutor on which the OrderedExecutor is based.
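
The stack-depth argument in this thread can be illustrated with a small sketch. BreakChainSketch and readNext are hypothetical names, and the "ledgers" are just a counter. The sketch assumes a queue-based executor such as the JDK's single-threaded ThreadPoolExecutor, where execute() enqueues the task even when called from the executor's own thread, so each step starts on a fresh stack.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class BreakChainSketch {

    // Simulates one entry per ledger: every completed read triggers the next read.
    // Re-submitting to the executor lets the current stack unwind before the
    // next step runs, instead of nesting one call inside the previous one.
    static void readNext(Executor executor, int remainingLedgers, int readSoFar,
                         CompletableFuture<Integer> done) {
        if (remainingLedgers == 0) {
            done.complete(readSoFar);
            return;
        }
        executor.execute(() -> readNext(executor, remainingLedgers - 1, readSoFar + 1, done));
    }

    // Chains `ledgers` reads on a single pinned thread and returns the entry count.
    static int runDemo(int ledgers) {
        ExecutorService pinned = Executors.newSingleThreadExecutor();
        CompletableFuture<Integer> done = new CompletableFuture<>();
        readNext(pinned, ledgers, 0, done);
        try {
            return done.get(30, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pinned.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runDemo(200_000)); // prints 200000
    }
}
```

With an executor that runs tasks inline on the calling thread (for example `Runnable::run` used as an Executor), the same chain would nest one frame per ledger and could overflow the stack, which is the short-circuiting concern mentioned above.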

tisonkun · 2022-12-09T11:30:48Z

Closed as stale and conflict. Please rebase and resubmit the patch if it's still relevant.

lhotari requested review from codelipenghui, eolivelli, rdhabalia and merlimat July 20, 2021 07:59

lhotari changed the title ~~[ManagedLedger] Don't expose OrderedExecutor from ManagedLedgerImpl.getExecutor~~ [ManagedLedger] Pin executor thread for usages of ManagedLedgerImpl.getExecutor Jul 20, 2021

lhotari added the area/broker label Jul 20, 2021

lhotari added this to the 2.9.0 milestone Jul 20, 2021

lhotari self-assigned this Jul 20, 2021

lhotari added the doc-not-needed Your PR changes do not impact docs label Jul 20, 2021

lhotari changed the title ~~[ManagedLedger] Pin executor thread for usages of ManagedLedgerImpl.getExecutor~~ [ManagedLedger] Pin executor and scheduled executor threads for ManagedLedgerImpl Jul 20, 2021

lhotari requested review from BewareMyPower and hangc0276 July 20, 2021 11:41

BewareMyPower previously approved these changes Jul 20, 2021

eolivelli reviewed Jul 20, 2021

lhotari closed this Jul 22, 2021

lhotari reopened this Jul 22, 2021

sijie self-requested a review July 22, 2021 15:38

lhotari mentioned this pull request Jul 31, 2021

NPE in managed ledger on read failed #11521

Closed

lhotari mentioned this pull request Aug 10, 2021

[Issue-11282] Fix NPE in OpReadEntry #11292

Closed

lhotari force-pushed the lh-fix-managedledger-thread-safety branch from f8e2aae to 43a099f Compare August 11, 2021 06:13

eolivelli previously approved these changes Aug 16, 2021

lhotari requested a review from ivankelly August 17, 2021 14:49

lhotari added 3 commits August 17, 2021 17:50

Use pinned executor and pinned scheduled executor for ManagedLedger

e121cf5

Rename methods to clarify the behavior

c869ec4

lhotari force-pushed the lh-fix-managedledger-thread-safety branch from 43a099f to c869ec4 Compare August 17, 2021 14:53

merlimat reviewed Aug 20, 2021

lhotari mentioned this pull request Aug 24, 2021

Fix the topic in fenced state and can not recover. #11737

Merged

lhotari marked this pull request as draft September 20, 2021 12:32

eolivelli modified the milestones: 2.9.0, 2.10.0 Oct 6, 2021

lhotari mentioned this pull request Nov 4, 2021

[ML] Avoid passing OpAddEntry across a thread boundary in asyncAddEntry #12606

Merged

codelipenghui modified the milestones: 2.10.0, 2.11.0 Jan 18, 2022

lhotari mentioned this pull request Feb 15, 2022

[2.8.1] FGC and throw NPE #14268

Closed

congbobo184 dismissed stale reviews from eolivelli and BewareMyPower via c869ec4 February 16, 2022 02:58

lhotari mentioned this pull request Mar 4, 2022

[Proto] java.lang.IllegalStateException: Some required fields are missing #14436

Closed

codelipenghui modified the milestones: 2.11.0, 2.12.0 Jul 26, 2022

tisonkun closed this Dec 9, 2022
