
[Broker] Fix race condition in invalidating ledger cache entries #10480

Conversation

@lhotari (Member) commented May 4, 2021

Fixes #10433

Motivation

See #10433. There's a rare race condition in invalidating ledger cache entries stored in RangeCache.

Modifications

  • Add a separate invalidate method for invalidating EntryImpl ledger cache entries. This prevents race conditions in invalidation.

@lhotari lhotari marked this pull request as draft May 4, 2021 12:36
@eolivelli (Contributor) left a comment:

@lhotari I left a few ideas.
I am not saying you are on the wrong track, but I am not sure we are really fixing the problem.

@@ -90,7 +90,7 @@ public Value get(Key key) {
         try {
             value.retain();
             return value;
-        } catch (Throwable t) {
+        } catch (IllegalReferenceCountException e) {
Contributor:

Probably we should log something here; this case must not happen.

Member Author (@lhotari):

Yes, logging would be useful to get more information. I'm just wondering if it should be done only at debug level, since it's not a real problem: it's part of the expected behavior that this can sometimes happen.
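For illustration only, a minimal sketch of what a debug-level log in that catch block could look like (the class and field names below are made up, not the actual RangeCache code):

import io.netty.util.IllegalReferenceCountException;
import io.netty.util.ReferenceCounted;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CacheGetSketch<Key extends Comparable<Key>, Value extends ReferenceCounted> {
    private static final Logger log = LoggerFactory.getLogger(CacheGetSketch.class);
    private final ConcurrentNavigableMap<Key, Value> entries = new ConcurrentSkipListMap<>();

    public Value get(Key key) {
        Value value = entries.get(key);
        if (value == null) {
            return null;
        }
        try {
            value.retain();
            return value;
        } catch (IllegalReferenceCountException e) {
            // Expected when the entry is concurrently evicted: the retain lost the race,
            // so treat it as a cache miss and only log at debug level.
            if (log.isDebugEnabled()) {
                log.debug("Entry at key {} was already released, treating as a cache miss", key, e);
            }
            return null;
        }
    }
}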

@@ -113,7 +113,7 @@ public Value get(Key key) {
         try {
             value.retain();
             values.add(value);
-        } catch (Throwable t) {
+        } catch (IllegalReferenceCountException e) {
Contributor:

Probably we should log something here; this case must not happen.

Contributor:

It's expected to happen and it's fine when it happens. It just indicates the entry is being evicted when we're trying to access it. If the retain succeeds, the operation was successful, otherwise the entry is already gone.

-            removedSize += weighter.getSize(value);
-            value.release();
+            long entrySize = weighter.getSize(value);
+            if (value.invalidate()) {
Contributor:

I am afraid we are only hiding the problem.
There must be some coordination around these entries, some clear protocol about who is the owner of the entry.

When we get to this point we must be sure that the refcount on the value is valid, otherwise it is always a hazard.

I believe the right protocol is that, before calling weighter.getSize(value), we should try to acquire the entry, and in case of failure we can ignore the entry.

Value value = entry.getValue();
if (value.tryAcquire()) {
    ++removedEntries;
    removedSize += weighter.getSize(value);
    value.release(); // this refers to the value.tryAcquire()
}

When we remove the entry we must hold some kind of write lock over it, which prevents double releases.
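As a rough sketch of that protocol (tryAcquire() is not an existing EntryImpl/RangeCache method, so it is emulated here with retain() and the exception as the failure signal):

import io.netty.util.IllegalReferenceCountException;
import io.netty.util.ReferenceCounted;
import java.util.function.ToLongFunction;

// Sketch only: the entry is "acquired" before it is touched, and skipped on failure.
final class EvictionAcquireSketch {
    static boolean tryAcquire(ReferenceCounted value) {
        try {
            value.retain();
            return true;
        } catch (IllegalReferenceCountException e) {
            return false; // already deallocated, ignore this entry
        }
    }

    // Touch the entry (e.g. to weigh it) only while holding our own ref-count.
    static long sizeIfAlive(ReferenceCounted value, ToLongFunction<ReferenceCounted> weighter) {
        if (!tryAcquire(value)) {
            return 0;
        }
        try {
            return weighter.applyAsLong(value);
        } finally {
            value.release(); // releases the tryAcquire above, not the cache-owned ref-count
        }
    }
}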

@merlimat (Contributor) left a comment:

@lhotari Great finding. I agree with the assessment that multiple invalidations happening at the same time are causing this.

I think the correct fix here would be to ensure that there's only one eviction happening at a given point in time, so that we avoid touching an entry whose ref-count is potentially 0.

For that, we'd need to make sure that, in RangeCache, removeRange(), evictLeastAccessedEntries() and evictLEntriesBeforeTimestamp() are either called with a mutex (like clear() already does) or at least from the same single thread.
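A rough sketch of that idea, with illustrative names only (not the actual RangeCache code):

import java.util.concurrent.locks.ReentrantLock;

// Serialize all eviction/removal paths behind one lock, so at most one thread
// can be releasing cache-owned ref-counts at any given time.
final class SerializedEvictionSketch {
    private final ReentrantLock evictionLock = new ReentrantLock();

    long removeRange(long first, long last) {
        evictionLock.lock();
        try {
            return doRemoveRange(first, last);
        } finally {
            evictionLock.unlock();
        }
    }

    long evictLeastAccessedEntries(long sizeToFree) {
        evictionLock.lock();
        try {
            return doEvictLeastAccessed(sizeToFree);
        } finally {
            evictionLock.unlock();
        }
    }

    // The actual removal/eviction logic would live in these private methods.
    private long doRemoveRange(long first, long last) { return 0; }
    private long doEvictLeastAccessed(long sizeToFree) { return 0; }
}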


release();
return true;
}
return false;
Contributor:

I don't think this is the correct approach. If the issue is that we're using an already released buffer, we should fix that instead.

This will avoid decrementing the ref-count more than once, but it will not prevent the 2nd thread from accessing an entry whose ref-count was already at 0.

Member Author (@lhotari):

This is a safety measure to prevent races in invalidating the entries. I agree that the issues in releasing must be fixed. Adding a separate method for invalidation would help detect when the problem is caused by invalidating the entry twice. Some logging could be added to detect cases where a race in invalidation causes a "double release".

Currently, it seems that the problems we are seeing could occur only when there's a race in invalidation. At a quick glance, there don't seem to be other code paths where the entry is released but not retained as part of the same "flow".

Would this justify adding some extra protection against races in invalidation?

Member Author (@lhotari):

This will avoid decrementing the ref-count more than once, but it will not prevent the 2nd thread from accessing an entry whose ref-count was already at 0.

Yes, that's a good point. I'm thinking of a solution where invalidation would be a completely separate operation, triggered when the reference count is 1 or gets back to 1. Another protection would be a change in logic so that a release operation which would drop the reference count from 1 to 0 is rejected completely. That would prevent bugs caused by release being called too many times. Such cases could be logged and fixed if those bugs exist.
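To make that idea a bit more concrete, here is a hedged sketch of such a policy (a made-up class, not EntryImpl's actual implementation; the deferred "gets back to 1" case is left out for brevity):

import java.util.concurrent.atomic.AtomicInteger;

// Sketch of a ref-count policy where a plain release() can never perform the final
// deallocation: only an explicit invalidate() may take the count from 1 to 0.
final class GuardedRefCountSketch {
    private final AtomicInteger refCnt = new AtomicInteger(1);

    boolean retain() {
        while (true) {
            int current = refCnt.get();
            if (current <= 0) {
                return false; // already invalidated
            }
            if (refCnt.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    void release() {
        while (true) {
            int current = refCnt.get();
            if (current <= 1) {
                // Reject (and possibly log) a release that would drop the count to 0:
                // a buggy extra release can no longer hand the entry back to the recycler.
                return;
            }
            if (refCnt.compareAndSet(current, current - 1)) {
                return;
            }
        }
    }

    boolean invalidate() {
        // The final 1 -> 0 transition happens exactly once, and only here.
        if (refCnt.compareAndSet(1, 0)) {
            deallocate();
            return true;
        }
        return false;
    }

    private void deallocate() {
        // return buffers / recycle the instance
    }
}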

removedSize += weighter.getSize(value);
value.release();
++removedEntries;
long entrySize = weighter.getSize(value);
Contributor:

e.g. in the case of a concurrent invalidate, the value is already invalid here

@dlg99 (Contributor) commented May 4, 2021

@lhotari I agree with @merlimat on " If the issue is that we're using an already released buffer, we should fix that instead."
It could be caused by, e.g., an implementation of the ReadEntriesCallback/ReadEntryCallback/whatever other callback that is expected to properly release the entry. Or, as you mentioned in the issue, it could be netty/netty#10986

Another issue: do we have a repro?
Have we confirmed that upgrading Netty to a version with the fix for netty/netty#10986 doesn't help?
And that this fix does help?

@merlimat (Contributor) commented May 4, 2021

Actually, I'm not 100% sure that having invalidations called by multiple threads could lead to the issue. In all cases the entries are removed from the ConcurrentSkipList before getting released, and the guarantee there is that the removal is atomic.

@lhotari (Member, Author) commented May 5, 2021

Actually, I'm not 100% sure that having invalidations called by multiple threads could lead to the issue. In all cases the entries are removed from the ConcurrentSkipList before getting released, and the guarantee there is that the removal is atomic.

This is true. Therefore, the rationale for adding a separate method for invalidation would be to treat the changes as a safety measure and as a way to detect the source of the problem, as explained in my previous comment.

@lhotari (Member, Author) commented May 5, 2021

@lhotari I agree with @merlimat on "If the issue is that we're using an already released buffer, we should fix that instead."

Yes I agree on this, I'll try to dig deeper. :)

It could be caused by, e.g., an implementation of the ReadEntriesCallback/ReadEntryCallback/whatever other callback that is expected to properly release the entry.

Good point. I'll track those release calls.

Or, as you mentioned in the issue, it could be netty/netty#10986

Pulsar might not be impacted in cases where AbstractCASReferenceCounted base class is used. Here's AbstractCASReferenceCounted release logic:

private boolean release0(int decrement) {
    for (;;) {
        int refCnt = this.refCnt;
        if (refCnt < decrement) {
            throw new IllegalReferenceCountException(refCnt, -decrement);
        }
        if (refCntUpdater.compareAndSet(this, refCnt, refCnt - decrement)) {
            if (refCnt == decrement) {
                deallocate();
                return true;
            }
            return false;
        }
    }
}

Other issue is: do we have a repro?

No. I assume it's a rare issue since there aren't many reports about it. It might be possible to achieve a repro at some kind of unit/integration test level using JCStress, but it's hard to estimate the effort required to achieve a repro.
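For what it's worth, a rough sketch of what a JCStress probe for a double-release race could look like (this is not an existing test, only an illustration built on a plain Netty ref-counted object):

import io.netty.util.AbstractReferenceCounted;
import io.netty.util.IllegalReferenceCountException;
import io.netty.util.ReferenceCounted;
import org.openjdk.jcstress.annotations.Actor;
import org.openjdk.jcstress.annotations.Expect;
import org.openjdk.jcstress.annotations.JCStressTest;
import org.openjdk.jcstress.annotations.Outcome;
import org.openjdk.jcstress.annotations.State;
import org.openjdk.jcstress.infra.results.II_Result;

// Two threads race to release the same entry whose ref-count is 1.
// Exactly one of them should perform the final release.
@JCStressTest
@Outcome(id = "1, 0", expect = Expect.ACCEPTABLE, desc = "actor1 did the final release")
@Outcome(id = "0, 1", expect = Expect.ACCEPTABLE, desc = "actor2 did the final release")
@Outcome(id = "1, 1", expect = Expect.FORBIDDEN, desc = "double deallocation")
@State
public class DoubleReleaseStress {
    static final class Entry extends AbstractReferenceCounted {
        @Override
        protected void deallocate() {
            // the real EntryImpl would return buffers to the pool / recycler here
        }

        @Override
        public ReferenceCounted touch(Object hint) {
            return this;
        }
    }

    private final Entry entry = new Entry(); // ref-count starts at 1

    private int releaseQuietly() {
        try {
            return entry.release() ? 1 : 0;
        } catch (IllegalReferenceCountException e) {
            return 0; // lost the race, which is the acceptable outcome
        }
    }

    @Actor
    public void actor1(II_Result r) {
        r.r1 = releaseQuietly();
    }

    @Actor
    public void actor2(II_Result r) {
        r.r2 = releaseQuietly();
    }
}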

Have we confirmed that upgrading Netty to a version with the fix for netty/netty#10986 doesn't help?

No. The bug report is about concurrent calls to .recycle(). As mentioned earlier, that is prevented by AbstractCASReferenceCounted in Pulsar in most cases.

@lhotari (Member, Author) commented May 6, 2021

I've been trying to spot locations where the entry could get released multiple times but still continue to be used.

This looks risky:

// invalidate all entries related to ledger from the cache (it might happen if entry gets corrupt
// (entry.data is already deallocate due to any race-condition) so, invalidate cache and next time read from
// the bookie)
invalidateAllEntries(lh.getId());
callback.readEntryFailed(createManagedLedgerException(t), ctx);

Together with

public synchronized void readEntryFailed(ManagedLedgerException mle, Object ctx) {
    log.warn("[{}][{}] Error while replaying entries", ledger.getName(), name, mle);
    if (exception.compareAndSet(null, mle)) {
        // release the entries just once, any further read success will release the entry straight away
        entries.forEach(Entry::release);
    }

@merlimat (Contributor) commented May 6, 2021

As mentioned earlier, that is prevented by AbstractCASReferenceCounted in Pulsar in most cases.

At this point, we can actually get rid of AbstractCASReferenceCounted. It was added in #2995 as a temporary measure to work around a change in behavior in Netty. The Netty issue was then fixed in 4.1.32 and we don't need the special treatment anymore.

The change

try {
    asyncReadEntry0(lh, position, callback, ctx);
} catch (Throwable t) {
    log.warn("failed to read entries for {}-{}", lh.getId(), position, t);
    // invalidate all entries related to ledger from the cache (it might happen if entry gets corrupt
    // (entry.data is already deallocate due to any race-condition) so, invalidate cache and next time read from
    // the bookie)
    invalidateAllEntries(lh.getId());
    callback.readEntryFailed(createManagedLedgerException(t), ctx);
}
was added as a way to handle the same bug, but I agree that it's really dangerous in that we don't really know what went wrong, as in where the corruption happened...

As for the 2nd part (the Cursor.readEntryFailed), that seems ok to me. The entries there are the partial entries that were already read before. They might be coming from the cache or directly from bookies. In any case, the ref-count was increased when they came out of the cache, so we're required to release (and it should be safe to do so).

@lhotari (Member, Author) commented May 7, 2021

At this point, we can actually get rid of AbstractCASReferenceCounted. It was added in #2995 as a temporary measure to work around a change in behavior in Netty. The Netty issue was then fixed in 4.1.32 and we don't need the special treatment anymore.

I'm thinking of replacing it with something that would give extra protection against bugs. Let's see what it evolves into. I'll push changes to this PR once there's something presentable.

was added as a way to handle the same bug, but I agree that it's really dangerous in that we don't really know what went wrong, as in where the corruption happened...

As for the 2nd part (the Cursor.readEntryFailed), that seems ok to me. The entries there are the partial entries that were already

About these 2 lines of code together:

     invalidateAllEntries(lh.getId()); 
     callback.readEntryFailed(createManagedLedgerException(t), ctx); 

The problem here seems to be that invalidateAllEntries will call .release() and then the callback.readEntryFailed will also call .release() for the same set of entries, and this leads to the "double release", which can cause the entry to be returned to the recycler while there's still one more outstanding user of the entry. Once it's returned to the recycler, it can be picked up for some other use at the same time. After that, the original outstanding user of the entry calls .release(), and that could lead to the NPE that was reported. Makes sense?
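To illustrate the hazard in isolation with a plain Netty buffer (not Pulsar's EntryImpl):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;

// Stand-alone illustration of the "double release" hazard described above.
public class DoubleReleaseDemo {
    public static void main(String[] args) {
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.buffer(16);
        buf.writeLong(42L);

        buf.retain();  // an outstanding reader still holds a reference (refCnt == 2)

        buf.release(); // e.g. the cache invalidation path
        buf.release(); // e.g. the failure callback releasing the same entry again;
                       // refCnt is now 0 and the buffer goes back to the pool even though
                       // the reader above still believes it holds a valid reference

        // Any further use by that reader is now a use-after-free style bug: it may read
        // recycled memory, throw IllegalReferenceCountException, or surface later as a
        // seemingly unrelated NPE, depending on timing.
    }
}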

@merlimat (Contributor) commented May 7, 2021

The problem here seems to be that invalidateAllEntries will call .release() and then the callback.readEntryFailed will also call .release() for the same set of entries and this leads to the "double release"

I don't think there's a double release because:

  1. invalidateAllEntries is invalidating, thus removing and releasing the entries kept in cache
  2. callback.readEntryFailed is releasing entries that were already out of the cache. We're just releasing the additional ref-count that was added when the entry came out of the cache (since we're not going to use the entry).

The main traits of the entry cache are:

  1. The cache has ownership of the entry and keeps 1 ref-count when the entry is cached.
  2. To grab an entry from the cache, we try to retain() (which might throw if the entry is being deallocated)
  3. To invalidate, entries have to first get removed from the map and then release 1 ref-count
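Putting those three traits together, a minimal sketch (illustrative names only, not the actual RangeCache implementation) could look like this:

import io.netty.util.IllegalReferenceCountException;
import io.netty.util.ReferenceCounted;
import java.util.concurrent.ConcurrentSkipListMap;

public class CacheOwnershipSketch<K extends Comparable<K>, V extends ReferenceCounted> {
    private final ConcurrentSkipListMap<K, V> entries = new ConcurrentSkipListMap<>();

    // 1. The caller hands over one ref-count; the cache owns it while the entry is cached.
    public boolean put(K key, V value) {
        // if this fails, the caller keeps its ref-count and remains responsible for releasing it
        return entries.putIfAbsent(key, value) == null;
    }

    // 2. Readers grab an entry by retaining it; a failure means it is being deallocated.
    public V get(K key) {
        V value = entries.get(key);
        if (value == null) {
            return null;
        }
        try {
            value.retain();
            return value;
        } catch (IllegalReferenceCountException e) {
            return null; // the entry was evicted concurrently, treat as a miss
        }
    }

    // 3. Invalidation first removes the entry from the map, then drops the cache's ref-count,
    //    so only the thread that wins the removal performs that release.
    public boolean invalidate(K key) {
        V value = entries.remove(key);
        if (value != null) {
            value.release();
            return true;
        }
        return false;
    }
}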

@lhotari lhotari closed this Jun 21, 2021
@lhotari (Member, Author) commented May 28, 2024

A better fix is #22789, which doesn't contain the problems that were in this PR a few years ago.

Linked issue: NPE in broker: EntryImpl.getLength() (#10433)