Initial support for parallel Soroban phase XDR. #4364

dmkozh · 2024-06-26T16:38:46Z

Description

I've tried to minimize the scope of the changes; specifically this doesn't contain any actual logic for the parallel execution (such as data dependency validation and building parallel stages). However, there is still some refactoring that needed to happen in order to support new, more complex tx sets.

I've also removed some legacy surge pricing logic and the corresponding tests - we should no longer need the surge pricing logic that covers more than one source account per tx set.

Checklist

Reviewed the contributing document
Rebased on top of master (no merge commits)
Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
Compiles
Ran all tests
If change impacts performance, include supporting evidence per the performance document

sisuresh

Just an initial review

src/herder/TxSetFrame.cpp

src/herder/TxSetFrame.h

src/main/main.cpp

marta-lokhova

Thanks, parallel application of Soroban txs is exciting! I did an initial preliminary review. My main question is around scope and motivation:

It's a bit hard to imagine how the refactored tx set is going to be used, because ledger close logic isn't wired up yet. As a result, I can't tell if the access/usage patterns are right in the context of multi-threading. Is there a reason to introduce apply-order changes before parallel application?
(related) can you clarify the scope of this PR? if it's just refactoring tx set to be more general, then it'd be easier to make TxSetFrame more generalized internally, but continue exposing old API to make it easy for consumers. Changes to application order should probably be an isolated protocol change. This should help ship this work incrementally.
There are some changes that I think should be factored out into a separate PR (surge pricing, test cleanup). There is no reason to block those changes on tx set refactor.

src/herder/TxSetUtils.cpp

marta-lokhova · 2024-07-08T18:31:01Z

src/herder/TxSetUtils.cpp

@@ -154,11 +179,10 @@ TxSetUtils::buildAccountTxQueues(TxSetTransactions const& txs)
    return queues;
 }

-TxSetTransactions
-TxSetUtils::getInvalidTxList(TxSetTransactions const& txs, Application& app,


Can we add a comment here that one transaction per account is expected (or potentially pass AccountID->TxFrame map)?

That's not the expectation here (yet), but probably that should be changed. I'll work on this in a separate PR.

After the cleanup we only support nomination of 1 tx/account, so I don't think the comment is necessary in this particular place.

src/herder/TxSetFrame.cpp

marta-lokhova · 2024-07-09T17:08:04Z

src/herder/TxSetFrame.cpp

    auto const& txSetV1 = txSet.v1TxSet();
-    // There was no protocol with 1 phase, so checking for 2 phases only
-    if (txSetV1.phases.size() != static_cast<size_t>(TxSetPhase::PHASE_COUNT))
+    // There was no protocol with 1 phase, so checking for 2 perPhaseTxs only


phases was a bit more clear than perPhaseTxs

That's just find-and-replace gone too far, sorry.

FWIW concerning the variable naming, perPhaseTxs is semantically different from phases. Specifically we have transactions that may be included into a phase and we also have the 'phase' itself (TxSetPhaseFrame) and phase might have additional internal structure, additional data etc.

marta-lokhova · 2024-07-09T17:20:00Z

src/herder/TxSetFrame.h

+
+        Iterator(std::variant<TxFrameList, TxStageFrameList> const& txs,
+                 size_t stageIndex, size_t txIndex);
+        std::variant<TxFrameList, TxStageFrameList> const& mTxs;


Do we need a variant here? Legacy main-thread application is just a special case of multi-thread parallel application (i.e. it's a single stage).

These are not quite the same conceptually, but I think you're right and we could just generalize the implementation (it helps that the phase frame is immutable and we shouldn't need to make the runtime assertions about the phase structure). I'll give it a shot after taking out the surge pricing changes.

Updated to not use variant.

marta-lokhova · 2024-07-09T17:34:08Z

src/herder/TxSetFrame.cpp

+}
+
+bool
+validateNonParallelPhaseXDRStructure(TransactionPhase const& phase)


nit on naming: we're introducing "parallel" concepts when there's no parallelization yet, which is kind of confusing for the reader. Can we make terminology implementation-agnostic? Really the concept we want here is conflict-free.

I'm not sure how to name that TBH. I guess I could just say 'v0' phase, would that be more readable?

v0 is fine here, but my comment was more about the naming throughout the code. Might be nice to replace all mentions of parallel with conflict-free, and also add a comment that there isn't any parallelization yet to make it clear for the readers.

Well, it's not quite conflict-free. I'm fine with using something like v0 or maybe 'sequential' for the existing phases, but I'm not sure if there is a better term than 'parallel' for the new phase, because it is designed to be applicable in parallel (unlike the existing phase). I'm not sure we need to obfuscate that fact just because there is no followup work merged yet.

I've used 'sequential' for v0 phase. Still not sure if there is a better way to name the parallel phase.

marta-lokhova · 2024-07-09T18:04:54Z

src/test/TxTests.cpp

@@ -519,15 +519,15 @@ closeLedgerOn(Application& app, uint32 ledgerSeq, TimePoint closeTime,
        }
        else
        {
-            TxSetTransactions classic;
-            TxSetTransactions soroban;
+            TxFrameList classic;


Regarding XDR changes:

I feel uneasy aboutledgerMaxParallelThreads: it's baking an implementation detail into the protocol (there's already enough complexity in config settings). I'm not certain how we plan to enforce this: a validator can bypass this check. Also, there are other ways to fall off the network even with this requirement satisfied (e.g. using CPUs that aren't powerful enough, etc). I think node quality shouldn't be enforced via protocol like this, but rather via qsets/social pressure.

Where is ParallelTxExecutionStage defined?

I feel uneasy aboutledgerMaxParallelThreads: it's baking an implementation detail into the protocol

It is not an implementation detail, but a part of the protocol. We have to define the lowest number of threads that we should be sure that every tier 1 validator can afford running in parallel, otherwise there is really not much point to all the parallelization work.

We indeed can't (and don't need to) enforce the number of threads to use for applying the tx set (less can definitely be used and more threads could be used depending on tx set composition, e.g. to speedup catchup). This is just the number of 'logical' threads in the tx set. Transactions in different 'logical' threads may not have data dependencies on each other and no logical thread can exceed the ledger-wide resource limit.

FWIW the actual config plumbing will be in a separate PR; this should be an XDR-only change for now.

Where is ParallelTxExecutionStage defined?

All the txset-related changes are defined in next branch of stellar-ledger.h: https://github.com/stellar/stellar-xdr/blob/619d33e14013712415e5218ca7f5fadb5524c948/Stellar-ledger.x#L210

Sorry, I'm still pretty confused about this parameter.

It is not an implementation detail, but a part of the protocol. We have to define the lowest number of threads that we should be sure that every tier 1 validator can afford running in parallel, otherwise there is really not much point to all the parallelization work.

What I mean is that we're adding something that is an implementation detail to the protocol. There are ways Tier1 can lose sync due to wrong hardware today. This is why we have recommended SKUs and validators collectively decide what hardware makes sense for them.

We indeed can't (and don't need to) enforce the number of threads to use for applying the tx set (less can definitely be used and more threads could be used depending on tx set composition, e.g. to speedup catchup). This is just the number of 'logical' threads in the tx set. Transactions in different 'logical' threads may not have data dependencies on each other and no logical thread can exceed the ledger-wide resource limit,

So it this config about hardware or tx set properties? You mention above that the config is needed to make sure nodes can support parallelization and not fall out of sync, but also the config isn't really enforced and it's only supposed to indicate the number of conflict-free lanes in a tx set (and the tx set itself can be applied with more or less threads anyway)

What I mean is that we're adding something that is an implementation detail to the protocol.

This is as much of an implementation detail as a transaction count limit or per-ledger instruction limit for Soroban. Of course, tier 1 validators can lose sync if they're under the necessary spec, but that doesn't mean that we can apply arbitrarily large tx sets.

This is why we have recommended SKUs and validators collectively decide what hardware makes sense for them.

Right, that's why this is a config setting - network votes on the minimum necessary parallelization support (similarly to how it can vote on the operation count/instruction limits).

So it this config about hardware or tx set properties?

The config defines the number of logical threads in tx set. We could name it in a more obfuscated fashion, but I don't see much value in that, because in the end the value still maps to the minimal number of physical threads a validator must support.

but also the config isn't really enforced

It will be enforced - this PR is already quite large and config plumbing is not a part of it, as well as building tx sets for nomination.

the tx set itself can be applied with more or less threads anyway)

Sure, but the default implementation that we expect validators to use is to spawn as many threads as we have in the tx set and these threads should be able to run on the physical cores. Again, in theory a validator can sleep for 100 milliseconds after applying every transaction, but we do expect every reasonable validator to try and apply transactions as fast as possible. Threads are similar - slow hardware will still be able to apply the tx sets, but it will be lagging behind the network and there is really no way around that.

This is as much of an implementation detail as a transaction count limit or per-ledger instruction limit for Soroban. Of course, tier 1 validators can lose sync if they're under the necessary spec, but that doesn't mean that we can apply arbitrarily large tx sets.

Sorry, I don't agree with this at all. ledger-wide limits are about bounding the amount of work needed to produce a single ledger, and preventing things like inserting an infinite loop into your contract. Number of threads is a performance tuning knob. It's not any different from any other performance knobs we have (do overlay in the background, do evictions faster, etc), and can cause nodes to be faster or slower. So the question is why should it be in the protocol?

It will be enforced - this PR is already quite large and config plumbing is not a part of it, as well as building tx sets for nomination.

How do we plan to enforce this? I don't quite understand how validators are going to prove that they meet the hardware requirements. I'm also not sure what happens if there's a protocol vote and some validators don't satisfy the new requirement (but that validator is still in everyone's quorum)

ledger-wide limits are about bounding the amount of work needed to produce a single ledger

Exactly, the thread limit has the same purpose of bounding the amount of work per-ledger on a reasonable validator. Currently 'reasonable' means 'reasonably modern CPU', with parallelization we obviously have to require a certain number of CPU cores as well (otherwise there is really no point in doing that).

Number of threads is a performance tuning knob.

It is not a performance tuning knob, it's a ledger capacity tuning knob that works in the same way as ledger-wide instruction limit.

So the question is why should it be in the protocol?

Because we need to be able to bound the ledger capacity at the protocol level. The point of parallelization is not to improve the performance by an unknown factor, it is about increasing the ledger instruction capacity without (significantly) changing the apply (and thus ledger close) time.

How do we plan to enforce this?

This configuration setting regulates the tx set validation - a valid tx set has to have no more than that many logical threads of independent transactions.

I don't quite understand how validators are going to prove that they meet the hardware requirements.

They don't need to prove that per-se, but we'll need to figure out the lowest common value that everyone thinks is reasonable. That value will need to be accounted for when setting the minimal hardware requirements for a validator. Again, there is no point in any parallelization work if we expect majority of the network to run on a single CPU. But we don't think that's the case, do we?

I think the confusion is around why this parameter is about bounding work. Tuning the number of threads means tuning how work is executed, but not changing the amount of it. So depending on this number, you can execute a block faster or slower. For tx set construction, I suppose the limit is about bounding total parallel work threads? In practice, this doesn't really mean much, does it? We can't enforce it on validators. And there are all the weird scenarios I was asking about earlier where the network changes the parameter, and now some Tier1 nodes aren't configured "correctly" (another scenario is a tier1 node being rebuilt on a different instance type transparently, for example).

Again, there is no point in any parallelization work if we expect majority of the network to run on a single CPU

The way I think about this is that of course the network will eventually get to a place where most nodes are running the right hardware to take full advantage of parallelization (otherwise, they get removed from quorum or lose sync). It just feels wrong (and not really enforceable) to have these things in the protocol.

Also, these types of things should be discussed and debated at avenues like the protocol meeting, right?

I think the confusion is around why this parameter is about bounding work.

Yeah, the parameter is not bounding work directly, but it's bounding the amount of modelled instructions per ledger given a machine with at least that many physical cores. From the protocol standpoint it's definition is 'tx set should have at most that many data-independent logical threads', i.e. a value of N guarantees that a machine with N cores can always execute a tx set in such a way that every core executes at most ledgerMaxInstructions modelled instructions. It does not (and can not) guarantee the real apply time, but the expectation is that given a machine with sufficient number of cores the apply time will always be within a low-variance factor of the 'modelled' instruction count. Machines with lower number of cores may sometimes have a factor that is 2+ times larger than the modelled time, which likely will result in 2+ times longer apply time than in machines with sufficient number of cores.

In order for parallelization to work properly, majority of nodes will need to have at least a few (2-3) physical cores more than this setting specifies. If the network votes for a value that is too high, the ledger close time will go up and likely will result in re-vote for a lower value, which is why it would make sense to reach consensus on the minimum requirement for a number of validator cores offline before voting for this.

Also, these types of things should be discussed and debated at avenues like the protocol meeting, right?

Indeed, but this is an experiment/prototype. If it is successful, we'll present the CAP as well as the initial results. If not, we'll keep iterating on this. We'll still have plenty of time for protocol discussions as this is not planned for protocol 22.

marta-lokhova · 2024-07-09T21:05:10Z

src/herder/test/HerderTests.cpp

@@ -647,10 +647,6 @@ testTxSetWithFeeBumps(uint32 protocolVersion)

 TEST_CASE("txset", "[herder][txset]")
 {
-    SECTION("protocol 13")


just checking: this is fine to remove because the test is only testing block construction (so none of this logic is used in the replay)?

Yeah, we shouldn't need to create tx sets for protocol 13 anymore, right?

we can't create new protocol 13 tx sets, but we need to be careful as to not accidentally remove replay test coverage (replay also uses TxSetFrame and the whole ledgerClose flow). That being said, I agree all tests covering surge pricing and block construction can be removed because replay just uses ready tx sets directly from the archives.

Extra tests should have been removed in the cleanup PR, I only removed one helper here that is no longer used.

src/ledger/LedgerManagerImpl.cpp

marta-lokhova · 2024-07-09T21:39:45Z

src/herder/TxSetFrame.h

+    // to determine a stable index of every transaction in the phase, even if
+    // the phase is parallel and can have certain transaction applied in
+    // arbitrary order.
+    class Iterator


This feels a bit premature: ledger close isn't yet hooked up to do anything in parallel, so it's hard to tell how this is going to be used and whether any synchronization needs to be taken into account. I'm curious about the motivation and scope of these changes, given that they seem to be preceding actual parallel application changes.

This changeset in general has to precede the actual application changes, because in order to apply a tx set one needs to have an appropriate schedule. Moreover, the purpose of iterator is to actually hide the internal structure of the phase for the scenarios where we don't care about the parallel execution at all (e.g. when we just want to check if every tx in the phase is valid etc.).

marta-lokhova · 2024-07-09T22:08:25Z

(I should mention that this is my first time tapping into soroban parallelization work, so please bear with me if I'm missing context)

dmkozh · 2024-07-10T00:57:03Z

It's a bit hard to imagine how the refactored tx set is going to be used, because ledger close logic isn't wired up yet. As a result, I can't tell if the access/usage patterns are right in the context of multi-threading. Is there a reason to introduce apply-order changes before parallel application?

This PR actually has to go first, because we can't properly wire up the parallel application logic if we don't have a proper schedule to use. FWIW there is no multi-threading here at all; the only meaningful interface change is that we'll expose the 'parallel' application schedule from the respective phase (instead of a transaction sequence). I probably mentioned this elsewhere, but to be clear we have to build a tx set in a proper fashion in order to be able to apply it in parallel (specifically we need to run the transactions with data dependencies in the same thread or different stages).

(related) can you clarify the scope of this PR? if it's just refactoring tx set to be more general, then it'd be easier to make TxSetFrame more generalized internally, but continue exposing old API to make it easy for consumers. Changes to application order should probably be an isolated protocol change. This should help ship this work incrementally.

The idea here is to provide support for a new phase type in the TxSetFrame; I've tried to limit the API changes as much as possible and basically the main change is that instead of exposing all the transactions at once we expose the phases separately (that just required an additional loop in a few places).

I realize that some stuff could be moved to separate PRs; I'll at least move all the surge-pricing related stuff. That said, I'm not sure the interface changes would make much sense without at least the stub for supporting the new XDR (otherwise it's hard to tell what's the motivation of the changes and whether they look reasonable). I could try separating that as well, but I'm not sure it will help reviewing the design decisions.

dmkozh

I've rebased and updated the PR. There is no rush to merge it into the minor release, so I suppose it'll need to wait more.

dmkozh · 2024-08-09T14:51:14Z

src/herder/TxSetUtils.cpp

@@ -154,11 +179,10 @@ TxSetUtils::buildAccountTxQueues(TxSetTransactions const& txs)
    return queues;
 }

-TxSetTransactions
-TxSetUtils::getInvalidTxList(TxSetTransactions const& txs, Application& app,


After the cleanup we only support nomination of 1 tx/account, so I don't think the comment is necessary in this particular place.

src/herder/TxSetUtils.cpp

dmkozh · 2024-08-09T15:20:11Z

src/herder/TxSetFrame.h

+
+        Iterator(std::variant<TxFrameList, TxStageFrameList> const& txs,
+                 size_t stageIndex, size_t txIndex);
+        std::variant<TxFrameList, TxStageFrameList> const& mTxs;


Updated to not use variant.

dmkozh · 2024-08-09T15:20:46Z

src/herder/TxSetFrame.cpp

+}
+
+bool
+validateNonParallelPhaseXDRStructure(TransactionPhase const& phase)


I've used 'sequential' for v0 phase. Still not sure if there is a better way to name the parallel phase.

dmkozh · 2024-08-09T15:21:34Z

src/herder/test/HerderTests.cpp

@@ -647,10 +647,6 @@ testTxSetWithFeeBumps(uint32 protocolVersion)

 TEST_CASE("txset", "[herder][txset]")
 {
-    SECTION("protocol 13")


Extra tests should have been removed in the cleanup PR, I only removed one helper here that is no longer used.

I've tried to minimize the scope of the changes; specifically this doesn't contain any actual logic for the parallel execution (such as data dependency validation and building parallel stages). However, there is still some refactoring that needed to happen in order to support new, more complex tx sets.

dmkozh force-pushed the parallel_txset2 branch from 073814b to 1ca680d Compare June 26, 2024 16:41

sisuresh reviewed Jun 28, 2024

View reviewed changes

marta-lokhova self-requested a review June 28, 2024 23:23

marta-lokhova requested changes Jul 9, 2024

View reviewed changes

dmkozh force-pushed the parallel_txset2 branch 4 times, most recently from f38272f to fcf19f2 Compare August 9, 2024 15:17

dmkozh commented Aug 9, 2024

View reviewed changes

dmkozh force-pushed the parallel_txset2 branch 2 times, most recently from 9b4579b to ac3b7db Compare September 13, 2024 17:26

dmkozh force-pushed the parallel_txset2 branch from ac3b7db to 93c9541 Compare September 18, 2024 17:41

Initial support for parallel Soroban phase XDR. #4364

Are you sure you want to change the base?

Initial support for parallel Soroban phase XDR. #4364

Conversation

dmkozh commented Jun 26, 2024

Description

Checklist

sisuresh left a comment

Choose a reason for hiding this comment

marta-lokhova left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marta-lokhova commented Jul 9, 2024

dmkozh commented Jul 10, 2024

dmkozh left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment