Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DataShard: VERIFY failed: requirement !pr.second->HasFlag(TTxFlags::BlockingImmediateOps) failed #3142

Closed
snaury opened this issue Mar 25, 2024 · 3 comments · Fixed by #3358
Assignees
Labels
bug Something isn't working

Comments

@snaury
Copy link
Member

snaury commented Mar 25, 2024

I've got several crashes under Jepsen that look like this:

VERIFY failed (2024-03-25T12:02:01.510019+0300):
  ydb/core/tx/datashard/datashard_pipeline.cpp:2322
  operator()(): requirement !pr.second->HasFlag(TTxFlags::BlockingImmediateOps) failed
0. /home/snaury/src/ydb/util/system/yassert.cpp:83: NPrivate::InternalPanicImpl(int, char const*, char const*, int, int, int, TBasicStringBuf<char, std::__y1::char_traits<char> >, char const*, unsigned long) @ 0x15A42A0C
1. /home/snaury/src/ydb/util/system/yassert.cpp:55: NPrivate::Panic(NPrivate::TStaticBuf const&, int, char const*, char const*, char const*, ...) @ 0x15A389E6
2. /home/snaury/src/ydb/ydb/core/tx/datashard/datashard_pipeline.cpp:2322: bool NKikimr::NDataShard::TPipeline::MarkPlannedLogicallyIncompleteUpTo(NKikimr::TRowVersion const&, NKikimr::NTabletFlatExecutor::TTransactionContext&)::$_0::operator()<std::__y1::pair<NKikimr::NDataShard::TStepOrder const, TIntrusivePtr<NKikimr::NDataShard::TOperation, TDefaultIntrusivePtrOps<NKikimr::NDataShard::TOperation> > > >(std::__y1::pair<NKikimr::NDataShard::TStepOrder const, TIntrusivePtr<NKikimr::NDataShard::TOperation, TDefaultIntrusivePtrOps<NKikimr::NDataShard::TOperation> > > const&) const @ 0x213CF222
3. /home/snaury/src/ydb/ydb/core/tx/datashard/datashard_pipeline.cpp:2334: NKikimr::NDataShard::TPipeline::MarkPlannedLogicallyIncompleteUpTo(NKikimr::TRowVersion const&, NKikimr::NTabletFlatExecutor::TTransactionContext&) @ 0x213CF222
4. /home/snaury/src/ydb/ydb/core/tx/datashard/store_and_send_out_rs_unit.cpp:79: NKikimr::NDataShard::TStoreAndSendOutRSUnit::Execute(TIntrusivePtr<NKikimr::NDataShard::TOperation, TDefaultIntrusivePtrOps<NKikimr::NDataShard::TOperation> >, NKikimr::NTabletFlatExecutor::TTransactionContext&, NActors::TActorContext const&) @ 0x2168F460
5. /home/snaury/src/ydb/ydb/core/tx/datashard/datashard_pipeline.cpp:1814: NKikimr::NDataShard::TPipeline::RunExecutionPlan(TIntrusivePtr<NKikimr::NDataShard::TOperation, TDefaultIntrusivePtrOps<NKikimr::NDataShard::TOperation> >, TVector<NKikimr::NDataShard::EExecutionUnitKind, std::__y1::allocator<NKikimr::NDataShard::EExecutionUnitKind> >&, NKikimr::NTabletFlatExecutor::TTransactionContext&, NActors::TActorContext const&) @ 0x213C9443
6. /home/snaury/src/ydb/ydb/core/tx/datashard/datashard__progress_tx.cpp:73: NKikimr::NDataShard::TDataShard::TTxProgressTransaction::Execute(NKikimr::NTabletFlatExecutor::TTransactionContext&, NActors::TActorContext const&) @ 0x217DCC45

And while MarkedPlannedLogically* functions maintain (and check) their iterator invariants, it looks like when AddActiveOp tries to adhoc restore flags (because we want to stop persisting them in the future), they are not restored correctly: UnprotectedReadEdge should not mark its own version as complete, but rather as incomplete.

But here's the kicker: this should affect at most one transaction that is exactly equal to this UnprotectedReadEdge, and it shouldn't matter what it's marked as, as long as CompleteEdge and IncompleteEdge move as expected (and when they do we mark transactions already in active ops). This may all indicate there's some instability in how we choose versions, need to investigate.

@snaury snaury self-assigned this Mar 25, 2024
@snaury snaury added the bug Something isn't working label Mar 25, 2024
@snaury
Copy link
Member Author

snaury commented Mar 25, 2024

There are many more stack traces like this:

NPrivate::Panic(NPrivate::TStaticBuf const&, int, char const*, char const*, char const*, ...) () /home/snaury/src/ydb/util/system/yassert.cpp +55
bool NKikimr::NDataShard::TPipeline::MarkPlannedLogicallyCompleteUpTo(NKikimr::TRowVersion const&, NKikimr::NTabletFlatExecutor::TTransactionContext&)::$_0::operator()<std::__y1::pair<NKikimr::NDataShard::TStepOrder const, TIntrusivePtr<NKikimr::NDataShard::TOperation, TDefaultIntrusivePtrOps<NKikimr::NDataShard::TOperation> > > >(std::__y1::pair<NKikimr::NDataShard::TStepOrder const, TIntrusivePtr<NKikimr::NDataShard::TOperation, TDefaultIntrusivePtrOps<NKikimr::NDataShard::TOperation> > > const&) const () /home/snaury/src/ydb/ydb/core/tx/datashard/datashard_pipeline.cpp +2287
NKikimr::NDataShard::TPipeline::MarkPlannedLogicallyCompleteUpTo(NKikimr::TRowVersion const&, NKikimr::NTabletFlatExecutor::TTransactionContext&) () /home/snaury/src/ydb/ydb/core/tx/datashard/datashard_pipeline.cpp +2304
NKikimr::NDataShard::TDataShard::PromoteImmediatePostExecuteEdges(NKikimr::TRowVersion const&, NKikimr::NDataShard::TDataShard::EPromotePostExecuteEdges, NKikimr::NTabletFlatExecutor::TTransactionContext&) () /home/snaury/src/ydb/ydb/core/tx/datashard/datashard.cpp +2153
NKikimr::NDataShard::TDataShard::TReadOperation::Execute(NKikimr::NTabletFlatExecutor::TTransactionContext&, NActors::TActorContext const&) () /home/snaury/src/ydb/ydb/core/tx/datashard/datashard__read_iterator.cpp +1414

@snaury
Copy link
Member Author

snaury commented Mar 25, 2024

One idea on how this might happen:

  • Shard restarts and takes up to 8 active transactions, some more are blocked and neither in complete nor incomplete set, i.e. there's a tail of transactions we haven't touched yet
  • CheckMediatorStateRestored finishes and calls PromoteUnprotectedReadEdge based on the most recent mediator time, this will encompass all transactions currently in the plan queue, because we use the most recent mediator time, and it's no lower than the last planned transaction
  • The first active transaction in the queue completes, so we have a place for one more, it unblocks and gets added to active ops, here AddActiveOp will look at current UnprotectedReadEdge and infer missing flags
  • However, since this new transaction is not the new tail, iterators will keep pointing where they were, and will have a transaction past the current complete/incomplete set marked as blocking immediate ops
  • Eventually some other code will try to update complete/incomplete set, and invariant check will notice that current flags don't match transaction not being in the set

Thus it's clear that CheckMediatorStateRestored cannot update UnprotectedReadEdge without also updating complete/incomplete set accordingly. And I guess to make it possible to persist new flags and edges, it will have to perform final preparations in a transaction. :-/

@snaury
Copy link
Member Author

snaury commented Mar 25, 2024

As far as I can see this was only triggering when I was testing disabled volatile transactions. Makes sense, since volatile transactions are not persistent and don't migrate on restarts for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant