
Idea: Core Groups #607

Open · Tracked by #26 · eskimor opened this issue Jun 29, 2023 · 25 comments
Labels
I6-meta A specific issue for grouping tasks or bugs of a specific category.

Comments

@eskimor (Member) commented Jun 29, 2023

Backers are no longer responsible for cores, but for core groups. A core group
consists of e.g. 4 cores. The backers will then try to back 4 cores at once and
bundle them up into a single bundle they vote on. There will be a single
backing vote per backer per bundle. If a PoV turns out to be invalid, the backer
will simply not put it into the bundle.

Cores will then only matter for scheduling and for backers in the sense that
they are bundling PoVs up. Likely backers will communicate about individual PoVs
with each other to build up those bundles efficiently.

From then on we have statements per bundle, bitfields where bits represent those
bundles, and approvals/assignments that operate on bundles. Basically, with a
bundle size of 2 we should be as good with 6s block times as we were with 12s;
with a bundle size of 4 we are even twice as good and should be able to support
twice as many parachains as we can support now. This optimization benefits
the whole stack:

  1. Fewer backing votes/statements.
  2. Less pressure on availability - we have bigger candidates where chunking is
    more meaningful.
  3. Smaller bitfields.
  4. Fewer assignments and fewer approvals.

The nice thing is, we can scale bundles with machines becoming more powerful - we
are no longer bottlenecking on network and constant overhead.
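The scaling claim above can be sanity-checked with some rough arithmetic (a sketch with an illustrative core count; `items_per_second` and its model are made up for illustration, not protocol code):

```rust
// Sketch: per-second count of constant-cost items (statements, bitfield
// bits, assignments) as a function of block time and bundle size.
fn items_per_second(n_cores: u32, bundle_size: u32, block_time_s: u32) -> f64 {
    let bundles_per_block = (n_cores / bundle_size) as f64;
    bundles_per_block / block_time_s as f64
}

fn main() {
    let n_cores = 100;
    let baseline = items_per_second(n_cores, 1, 12); // 12s blocks, no bundling
    let bundled_2 = items_per_second(n_cores, 2, 6); // 6s blocks, bundles of 2
    let bundled_4 = items_per_second(n_cores, 4, 6); // 6s blocks, bundles of 4
    // Bundle size 2 at 6s matches the 12s baseline ...
    assert!((baseline - bundled_2).abs() < 1e-9);
    // ... and bundle size 4 halves it, leaving room for twice the cores.
    assert!((bundled_4 - baseline / 2.0).abs() < 1e-9);
}
```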

Roadmap:

  1. Launch async backing with 12s.
  2. Implement above improvements.
  3. Start offering 6s block times and twice the number of available cores (instead
    of half), with bundles of size 4!

Further Scaling Extensions

We can have big and little cores. A big core actually represents a physical
core, meaning PoVs of two big cores will be executed in parallel. But we can
also have smaller cores with, for example, half the execution time. Groups would
then have cores of different sizes: e.g. one core group could have 4 big/full
cores, another 8 little cores (each offering half the execution time). Parachains
can order big/little/tiny cores as they see fit.

Path for a parachain: a parachain with little blockspace demand can take a small
core; if demand grows it can scale up to a big core; and if that still is not
sufficient it can use boost, taking two cores for parallel execution of two PoVs
at once.

@eskimor eskimor changed the title Idea: Core Groups Idea: Core Groups (CPUs) Jun 29, 2023
@eskimor eskimor changed the title Idea: Core Groups (CPUs) Idea: Core Groups Jun 29, 2023
@sandreim (Contributor)

I like this idea and also think that more batching is the way to go, as it minimises the overhead. I will expand on this some more soon.

@sandreim (Contributor)

Core Time

It makes sense to introduce a CoreTime primitive which represents the unit for measuring how much block space is available to a parachain at a given relay parent. On-demand parachains will need to specify the neededCoreTime when acquiring block space. The CoreTime is enforced in the backing phase - PVF execution is killed once it passes the allowed time. I propose an enum for the fractions of a core - assuming 100ms slices we get something like 5 for half a core, 10 for a full core, or more than 10 when boosting. We should also assume there is some extra overhead/underutilisation for a Full core if we fraction it and then execute multiple PoVs in parallel rather than a single one.

The parachains are assigned to cores. Parachains with fractions can also be assigned to the same core, then grouped in a CoreGroup that can hold up to MaxCoreTime of entries.
MaxCoreTime is a configuration parameter that defines the amount of execution bandwidth validators need to provision per relay chain block.
MaxCoreTime should be BACKING_TIMEOUT multiplied by the number of PVF workers allowed in parallel. A good start would indeed be 4 PVF workers, which means 8s for synchronous backing and 16s for async backing (the async timeout is 4s for 2s of parachain block execution).

This is for sure useful for more exotic scheduling as described in Further Scaling Extensions above, but we can start with the current state, which is 1 full core only. Later we can add support for fractions and boost for parallel block production.
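The MaxCoreTime arithmetic above can be sketched as follows (the timeout and worker count are the example values from this comment, not fixed protocol constants):

```rust
use std::time::Duration;

// Example values from the comment above: a 4s async-backing timeout
// covering 2s of parachain block execution, and 4 parallel PVF workers.
const BACKING_TIMEOUT: Duration = Duration::from_secs(4);
const PVF_WORKERS: u32 = 4;

// MaxCoreTime = BACKING_TIMEOUT * number of PVF workers allowed in parallel.
fn max_core_time(timeout: Duration, workers: u32) -> Duration {
    timeout * workers
}

fn main() {
    // 4s timeout * 4 workers = 16s of provisioned execution per relay block.
    assert_eq!(max_core_time(BACKING_TIMEOUT, PVF_WORKERS), Duration::from_secs(16));
    // With a 2s synchronous-backing timeout the same formula gives 8s.
    assert_eq!(max_core_time(Duration::from_secs(2), PVF_WORKERS), Duration::from_secs(8));
}
```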

Backpressure

When dealing with large core groups it is possible that a validator has more PVF work to do than it can complete in a given time frame. Execution priority is: disputes first, then approval voting, then backing last - enforcing faster finality while also slowing down parachain block production by backing fewer candidates in a group.
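The priority ordering described above could be modeled as a simple ordered enum (a sketch; `PvfPriority` is a hypothetical name, not an existing type):

```rust
// Sketch of the execution priority described above: disputes preempt
// approval work, which preempts backing. Deriving Ord compares variants
// in declaration order, so sorting a work queue by priority is trivial.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum PvfPriority {
    Dispute,  // highest priority
    Approval,
    Backing,  // shed backing work first under load
}

fn main() {
    let mut queue = vec![PvfPriority::Backing, PvfPriority::Dispute, PvfPriority::Approval];
    queue.sort(); // highest-priority work (Dispute) comes first
    assert_eq!(queue[0], PvfPriority::Dispute);
    assert_eq!(queue[2], PvfPriority::Backing);
}
```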

@burdges commented Jun 30, 2023

We should do this, of course. We should remain aware that workload variance per relay chain block increases somewhat by doing this, but maybe not too much, and total variance even less so.

We'll still merge paritytech/polkadot#6782 or some flavor of course. We'd ideally still sign approvals for multiple core groups ala #732 too, but it becomes less valuable now.

@sandreim (Contributor) commented Jul 3, 2023

Yeah, I also don't think #701 makes that much sense if we move to CoreGroup, but it still might if we cap the groups to a smaller CoreTime. Ideally we want to do more with less validators (maxing out on the current spec vs adding more validators).

@rphmeier (Contributor) commented Jul 3, 2023

We need to figure out a lightweight algorithm for validators assigned to different cores to choose the subset(s) of possible core-group candidates that they want to back, accounting for the fact that each parachain can produce as few as 0 candidates or an unbounded number of valid candidates.

@rphmeier (Contributor) commented Jul 15, 2023

I thought a bunch more about core groups, and I don't think it's quite the right solution, nor does it really align with the scheduling infrastructure proposed in polkadot-fellows/RFCs#3. That's still a pending RFC so I won't take it as a given, but I haven't seen any other solid proposals for dynamic scheduling.

To be clear, this SHOULD be an RFC, and I'm treating this as an initial discussion where nothing is set in stone.

Reframing Goals

The original issue stated these goals:

  1. Fewer backing votes/statements.
  2. Less pressure on availability - we have bigger candidates where chunking is
    more meaningful.
  3. Smaller bitfields.
  4. Fewer assignments and fewer approvals.

I would reframe it this way:

  1. Reduce constant costs associated with candidates
    a. Availability and Approvals, in particular
  2. Use this amortization of costs to allow many light state transitions to be posted to the same core in the same relay-chain block.
    a. This provides a better product to end-users (chains running on Polkadot) as they can post smaller batches of work more frequently, for lower latency.
  3. Backing should be as close to asynchronous as possible

Why not core groups?

The main issue with core groups is that it hides a large amount of complexity in the validators of a backing group choosing which parachain candidates to back together. Any solution there would probably require running a consensus protocol among backers, and leaves out the possibility for candidates to be backed at different points in time, i.e. for backing to be a continuous and asynchronous process that feeds into the relay chain.

It also doesn't overlap nicely with the probabilistic scheduling approach in RFC-3, where backers don't know exactly which chains are scheduled on which cores at which times, but just roughly what could be scheduled. We are able to reuse backing statements from previous rounds to include stuff that didn't quite make it in.

We should apply this same approach to core groups: what we want to do is bundle many small state transitions onto a single core. We don't know exactly which chains they're going to be from, and we do want backing statements about them to be reusable in later relay-chain blocks.

Proposed Alternative

I'm still in favor of doing some kind of bundling. Goals (1) and (2) are pretty obviously useful at this point. So here are all the things that stay the same as in core groups:

  • A bunch of candidates get bundled together. They become available together, taking up 1 bit in the bitfield.
  • Approval checkers assigned to the core recover the data for the whole bundle and validate all these candidates together.

Here's what's different:

  1. Backers aren't responsible for doing the bundling. The relay-chain block author is. All candidates must be backed by a sufficient number of group members, but there's no requirement that they all must agree on a subset.
  2. We do the availability bundling on the network layer, because (1) doesn't let us get an overarching erasure-root.
    a. Each backer will have only backed a subset of the candidates, but should answer network requests with chunks for all the candidates they've backed as part of the batch that ended up on-chain. They should attempt to fetch the full PoV from other backers in their group for the ones they didn't back themselves.
    b. Validators fetching their chunks might need to make multiple requests in the case that the subsets of backers for each block pending availability don't overlap. They request the same chunk index for each candidate.
    c. Signing a bit in the bitfield means you have all your chunks from all the candidates in the bundle
  3. Approval checkers / dispute participants recovering from chunks need to do N erasure-decodings/recodings while validating candidates. In this respect, each candidate still carries a constant cost, but each erasure-coding will have 1/N the size of the aggregate. (according to @burdges the setup phases can be done simultaneously so it's still an optimization)
  4. The state transitions themselves aren't bundled up into a single blob. The approval checker still validates all the candidates sequentially, but it can error out quickly if any of them are invalid.

Integrating into Scheduling

To be honest, the only thing we need for cores is to specify their limits:

```rust
// Per-relay-chain-block core specs
struct CoreResourceLimits {
    // Total across all candidates backed in the same relay-chain block
    total_execution_time: Duration,
    // Total across all candidates backed in the same relay-chain block
    total_data_size: u32,

    // The minimum amount of execution time a single state transition backed on this core can use.
    // State transitions on this core are always rounded up to the nearest multiple of this.
    execution_time_step: Duration,
    // The minimum amount of data a single state transition backed on this core can use.
    // State transitions on this core are always rounded up to the nearest multiple of this.
    data_size_step: u32,

    // maybe some overhead budgeting constants, i.e. each additional candidate has an expected overhead
}
```

If we update the CandidateCommitments to have two fields:

```rust
struct CandidateCommitments {
   // ...
   execution_time: ...,
   pov_size: ...,
}
```

then we get this really nice property that the number of candidates, as well as the distribution of resources across candidates going into a core, can vary from block to block. This is not just core groups, but dynamic market-based core groups, where we can regulate the number of candidates parachains produce with the Region market.
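A minimal sketch of how such limits could gate a per-block workload (field names follow the structs above; `Candidate` and `resource_pred` are hypothetical stand-ins, with times handled in milliseconds):

```rust
use std::time::Duration;

struct CoreResourceLimits {
    total_execution_time: Duration,
    total_data_size: u32,
    execution_time_step: Duration,
    data_size_step: u32,
}

// Hypothetical stand-in for the candidate receipt/commitments pair.
struct Candidate {
    execution_time: Duration,
    pov_size: u32,
}

// Round `value` up to the nearest multiple of `step`.
fn round_up(value: u64, step: u64) -> u64 {
    (value + step - 1) / step * step
}

// Admit a per-block workload iff the step-rounded totals fit the core limits.
fn resource_pred(limits: &CoreResourceLimits, candidates: &[Candidate]) -> bool {
    let time_ms: u64 = candidates
        .iter()
        .map(|c| round_up(c.execution_time.as_millis() as u64,
                          limits.execution_time_step.as_millis() as u64))
        .sum();
    let data: u64 = candidates
        .iter()
        .map(|c| round_up(c.pov_size as u64, limits.data_size_step as u64))
        .sum();
    time_ms <= limits.total_execution_time.as_millis() as u64
        && data <= limits.total_data_size as u64
}

fn main() {
    let limits = CoreResourceLimits {
        total_execution_time: Duration::from_secs(2),
        total_data_size: 5 * 1024 * 1024,
        execution_time_step: Duration::from_millis(500),
        data_size_step: 1024 * 1024,
    };
    // Two candidates of 600ms each round up to 1000ms each: 2000ms fits exactly.
    let ok = vec![
        Candidate { execution_time: Duration::from_millis(600), pov_size: 1024 * 1024 },
        Candidate { execution_time: Duration::from_millis(600), pov_size: 1024 * 1024 },
    ];
    assert!(resource_pred(&limits, &ok));
    // A third such candidate would exceed the rounded 2s budget.
    let mut too_many = ok;
    too_many.push(Candidate { execution_time: Duration::from_millis(600), pov_size: 1024 * 1024 });
    assert!(!resource_pred(&limits, &too_many));
}
```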

I'm thinking of modifying RFC-3 so that regions just specify the general resource consumption rate of a parachain and then chains get to decide how many blocks they make, how big they are, etc without being attached to any particular size and rate.

@burdges commented Jul 15, 2023

  • A bunch of candidates get bundled together. They become available together, taking up 1 bit in the bitfield.
  • Approval checkers assigned to the core recover the data for the whole bundle and validate all these candidates together.

Yeah, this sounds like the essence anyways. I like..

Interestingly, we might even solve the "weights are too aggressive" problem here, but without touching the weights, although this becomes fraught elsewhere. cc @Sophia-Gold

We do the availability bundling on the network layer, because (1) doesn't let us get an overarching erasure-root.
a. Each backer will have only backed a subset of the candidates, but should answer network requests with chunks for all the candidates they've backed as a batch
b. Signing a bit in the bitfield means you have all your chunks from all the candidates in the bundle

Yeah this seems fine. It's possible to censor blocks you never backed, but so what.

Approval checkers / dispute participants recovering from chunks need to do N erasure-decodings/recodings while validating candidates. In this respect, each candidate still carries a constant cost, but each erasure-coding will have 1/N the size of the aggregate.

Importantly, we fetch exactly the same availability chunk indices for each parablock in the availability core, so we only do the setup phase for one index set. It's thus no different from doing one big decoding, except maybe a couple cache misses.


As an aside, we do not technically need availability and approvals cores to be the same either, but not doing so adds complexity, performs slightly worse if you must rerun the setup phase, and afaik yields no benefits. It'd look like this:

  1. Backing
  2. Backing votes appear on chain
  3. Availability
  4. Inclusion, now computes on-chain assignment of availability cores into approvals cores
  5. Reconstruction etc

Yet reconstruction now fetches chunks with three optimization preference levels:

  1. Systemic availability chunk indices
  2. Availability chunk indices shared with other members of the same approval core, so setup phase can be reused.
  3. Availability chunk indices, requires rerunning setup phase.

In essence, your proposal does the grouping in 2, but you could do the grouping in 4, if you accept having bigger bitfields and this nasty mess of reconstruction optimization preference levels. I doubt this improves anything.

@rphmeier (Contributor) commented Jul 15, 2023

It's possible to censor blocks you never backed, but so what.

It's possible to delay availability on blocks you never backed, but if you never backed them, they've still been backed by enough other validators in the same group. It seems just as censorable as the current system - you know who backed it when it goes on-chain and can DoS them. It's faster as a validator to get your chunks in a single request, but as an example:

  • 3 parachain blocks backed on the core. 2 are backed by subset A, 1 is backed by subset B
  • Ideally, subsets A and B would communicate to get the full data, but they might withhold this from each other
  • Validators looking to get their chunks would attempt to get all their chunks for all 3 parachain blocks from a single validator in A or B, but they might need to make 2 requests.
  • This is only relevant for availability distribution - in availability recovery, we are guaranteed to get 3 combined chunks per request by the changed meaning of bits in the availability bitfields

Importantly, we fetch exactly the same availability chunk indices for each parablock in the availability core, so we only do the setup phase for one index set. It's thus no different from doing one big decoding, except maybe a couple cache misses.

OK, this is good info. Yes - glad this amortizes nicely, though we would need some new APIs to do that in practice and it's not a blocker for an initial implementation.

your proposal does the grouping in 2, but you could do the grouping in 4, if you accepting having bigger bitfields, and this nasty reconstruction optimization preference levels mess. I doubt this improves anything.

I also considered doing the grouping in 4 but it seems like it's always best to do the grouping as early as possible and then move it later when that presents challenges.

As an aside, we do not technically need availability and approvals cores to be the same either

Yes, and in RFC-3 I've proposed introducing some "hypercores" for making scheduling probabilistic (i.e. relay chain block authors can temporarily "widen" cores and bitfields up to some per-block limit to allow making up for missed opportunities to post stuff on-chain)


As another aside, it's worth noting that RFC-5 just specifies the proportion of resources each parachain should have on the core. That interface would work for approaches where we have one block per core per relay-chain block, but could also adapt to core-sharing under the hood. e.g. if a parachain wants to, they could just build 2x as many blocks, half the size, and some of them will go in earlier, some of them will go in later. Obviously this comes with some additional cumulative PoV usage, so it's not something you can do for free, but still may be useful.

RFC-3 also now doesn't phrase things in terms of "blocks" but rather core resource consumption rates, so it should be totally forward-compatible with these changes

@burdges commented Jul 15, 2023

Assume backing groups of 4 or 5 validators with a backing threshold of 2. If X is backed by 1 & 2 and Y is backed by 3 & 4, but not 1 & 2, then X < X+Y cannot become available if 3 & 4 withhold chunks in Y.

Imho, this is not really an immediate problem, and these backing groups sound too large anyways, but it'll maybe bother @AlistairStewart or sr labs, or complicate billing. I suppose grouping in inclusion fixes this, but it's likely not worth those headaches.

We could have logic that mitigates this under your grouping-in-backing model too. It's easiest if backing is 2 out of 3, but there are other options:

Example 1. If you make the relay chain block after availability fails, then you could retry with the chunks you received, but not with the chunks you did not receive. This is bad in Aura/BABE, since you're not anonymous 3/4 of the time, but you're anonymous in Sassafras, so there it'll work reasonably well.

Example 2. If availability is really slow for some availability core, then we dynamically add more fine-grained votes to the bitfield, before abandoning the whole availability run.

As a bike shed, I'd think names like "hypercores" or even "core groups" confuse people, so I'd suggest something more inspired by the protocol.

  • execution core - whatever really, but maybe the unit a parachain provides. para-job or similar works too here.
  • availability core - one bit in the availability bitfield, but possibly several execution core / para-job
  • approval core - one assignment and voting target in the approval core, assuming the availability core is smaller (grouping in inclusion) or larger (example 2 above)

Also..

If cores contain multiple blocks then we'll likely validate them in parallel. We discussed flavors of SPREE being truly a separate block, but this sounded excessive to me, and is not how Basti envisions SPREE. We might anyway have truly separate-ish blocks from the same parachain being validated in parallel here, so one could imagine multiple blocks from the same parachain having merged PoV witnesses.

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/altering-polkadots-fork-choice-to-reduce-da-load/3389/13

@burdges commented Jul 24, 2023

Any idea how hard @rphmeier's flavor of core groups is going to be?

@sandreim (Contributor)

Any idea how hard @rphmeier's flavor of core groups is going to be?

Essentially this requires changes across the entire stack, so it will not be easy at all. I'm going to start working on a PoC after I have a very good understanding of the details.

@burdges commented Jul 24, 2023

Ahh, I suppose the initial idea changed little after backing, plus whatever cumulus does, but the improved version by Rob demands deep changes in availability too.

@sandreim (Contributor)

I've drafted a PoC plan based on the discussion here. Nothing here is set in stone, so feel free to comment. I expect the PoC to be very well documented and incrementally built. @eskimor @rphmeier @burdges If this sounds good, I will start to create individual tickets for the work.

Goal

Minimizing the constant overheads in approval checking and availability recovery. Compatibility with RFC-1, RFC-3, RFC-5.

High level changes

  1. Backing
    1. backers can be assigned to more than 1 core, bounded by a configuration value. This allows us to still have a backing group of 5 when n_validators / n_cores is less than 5.
    2. multiple candidates are backed
  2. Relay chain block is built
    1. Runtime computes a bounded set of ExecutionGroups based on the backed candidates in the inherent data
    2. A configuration variable N will specify the maximum size of a group
  3. Relay chain block is imported
    1. availability distribution starts up to N chunk fetches per ExecutionGroup from shuffled per-candidate backers.
  4. Bitfields are created and signed
    1. a bit of 1 now means availability of all candidates in an ExecutionGroup
    2. bitfields get shorter
  5. When candidate is included
    1. Approval checkers get assigned to ExecutionGroupIndex instead of CoreIndex.
    2. When our own assignment triggers, we get backers for all candidates in the ExecutionGroup
    3. They recover all PoVs in a group and execute them in parallel (depending on the load, we might bump the PVF worker count from 2 to 4)
      1. Can stack with v2 assignments and approval coalescing
      2. Hypercores: do we require serial execution of candidates if for example we want to back N, N+1 parablocks per relay block?
  6. Disputes
    1. An opposite vote on a candidate from an ExecutionGroup will trigger the check of that individual candidate, not the whole group.

New configuration parameters

```rust
max_candidates_per_execution_core: u32,
max_cores_per_backer: u32, // maximum number of backing groups the validator can be part of
```

Target Numbers for testing

6s parablocks, 6s relay chain block times
1000 paravalidators (n_validators)
300 cores (n_cores)
30 needed_approvals
4 max_candidates_per_execution_core

core_groups: 75 (n_cores / max_candidates_per_execution_core)
total assignments + approvals: ~4500 (2 * needed_approvals * core_groups)
validations per validator per relay chain block: 2.5 * max_candidates_per_execution_core = 10
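These target numbers can be cross-checked mechanically (illustrative test parameters, not production values; the ~2.5 figure rounds 30 * 75 / 1000 = 2.25 up with some margin):

```rust
fn main() {
    let n_validators = 1000u32;
    let n_cores = 300u32;
    let needed_approvals = 30u32;
    let max_candidates_per_execution_core = 4u32;

    // core_groups = n_cores / max_candidates_per_execution_core
    let core_groups = n_cores / max_candidates_per_execution_core;
    assert_eq!(core_groups, 75);

    // One assignment plus one approval per needed checker per group.
    assert_eq!(2 * needed_approvals * core_groups, 4500);

    // Group assignments per validator: 2.25, rounded up with margin to 2.5.
    let groups_per_validator = (needed_approvals * core_groups) as f64 / n_validators as f64;
    assert!(groups_per_validator <= 2.5);

    // Each group assignment means validating all candidates in the group.
    assert_eq!(2.5 * max_candidates_per_execution_core as f64, 10.0);
}
```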

From a bird's-eye view, it becomes a necessity to bump CPUs from 8 to 10 or more to accommodate this scale and be able to absorb any surge in block space utilization (longer PVF runs).

The testing will use a mix of PVF execution times and PoV sizes using Glutton, which should amount to a target of 50% utilization of block space and time. We can easily adjust the configuration of the Gluttons to increase/decrease that target.

Runtime

At least one new runtime API will be needed to expose the ExecutionGroup mappings, plus create_inherent and enter changes for creating the ExecutionGroups on each relay chain block.

Backing

CandidateCommitments could include additional fields like execution_time and pov_size which can be used as criteria when the runtime creates the ExecutionGroup mappings.

Availability

As we fetch the same availability chunk indices for each candidate in the ExecutionGroup, we only do the setup phase for one chunk index set. CPU usage for doing N such reconstructions should be roughly the same as doing a single reconstruction.

We’d still prefer to fetch the PoV from the backing group up to some size limit (currently 128KB), but if that fails we’ll do a normal chunk recovery. There are additional optimizations left on the table which can be implemented later, such as:

  1. Systemic availability chunk indices
  2. Availability chunk indices shared with other members of the same ExecutionGroup (setup phase can be reused)

Approval checking

For the PoC it would be best if we could reuse CandidateIndex and CoreIndex from assignments certificates and make them point to ExecutionGroup indices.

Execution

The target numbers assume the worst-case scenario where all cores in a group use 100% of their maximum execution time, which means we need to bump PVF workers to 5, i.e. 50% of total node CPU bandwidth. The other 50% must be enough for the other two top CPU consumers: networking and availability recovery.

The erasure recovery and network optimizations could possibly give back 1 CPU that can be used in validation.

@burdges commented Jul 27, 2023

Important: it's relay chain block producers who allocate candidates to cores, but it's still backers who are on the hook for the execution time. We thus need backers to place the resource consumption, in time and memory, into their backing statements. In principle, the backers' statements about memory usage determine whether nodes are running parachain blocks in parallel.

@rphmeier (Contributor)

We need thus backers to place the resource consumption, in time and memory, into their backing statements.

I imagined this could go in the candidate receipt, where collators actually determine it and backers just check if the resource requirements hold. PoV size should be committed to in the candidate receipt anyway, for a bunch of reasons, but committing to execution time accurately is probably a bit harder so it'd tend to be an overestimate.

@rphmeier (Contributor) commented Jul 31, 2023

backers can be assigned to more than 1 core, bounded by a configuration value. This allows to still have a backing group of 5 when n_validators / n_cores is less than 5.

I was imagining we'd still keep it as a single "core", but just have multiple things go onto it. If we stick with 1 core per candidate, it'll lead to the number of cores per relay-chain block fluctuating based on what gets posted. That seems odd. My mental model is based around regions, so I'm just imagining that backers work on all the regions that are assigned to the core, however many of them there may be. For the PoC we have to do something different, so maybe that's what this is. But we could achieve the same thing by just having the availability_cores runtime API give a list of scheduled para-ids, not a single one.

  1. Runtime computes a bounded set of ExecutionGroup based on the backed candidates in the inherent data
  2. A configuration variable N will specify the maximum size of a group

Wouldn't we rather have the resource constraints specified on a per-core rather than a per-group basis, as suggested above? I don't see why we need to compute anything, or what the ExecutionGroup struct would even look like. We just need to make sure that the Vec<Candidate> posted passes some predicate fn resource_pred(TotalCoreLimits, &[Candidate]) -> bool. That predicate would take into account things like the number of candidates, minimum resource usage per candidate, maximum across all, etc.

Or is an ExecutionGroup essentially an "approval core"?

availability distribution starts up to N chunk fetches per ExecutionGroup from shuffled per-candidate backers.
a bit of 1 now means availability of all candidates in an ExecutionGroup

Yeah, this sounds right

Approval checkers get assigned to ExecutionGroupIndex instead of CoreIndex.

We could still call these cores. Not sure we need a new term, just that we need to change the semantic meaning of a core. Everything else looks right here.

An opposite vote on a candidate from an ExecutionGroup will trigger the check of that individual candidate, not the whole group.

Yep

max_candidates_per_execution_core: u32,
max_cores_per_backer: u32, // maximum number backing groups the validator can be part of

We should be thinking in regions, not cores. e.g. we should support including multiple candidates from the same process in a single bundle. max_candidates is one variable but ideally we're accommodating many candidates of different sizes, so we need something more detailed to describe resource limitations.

I'm not sure why a backer would need to be assigned to many cores at once. At least in light of RFC-3, we will already have many processes multiplexed on the same core, so a single backing group can work on all those regions at once anyway.

Maybe fine for a PoC, but I expect this'd have to change a lot for production.

The target numbers assume worse case scenario of all cores in a group use 100% of their maximum execution time means we need to bump PVF workers to 5 which is 50% of total node CPU bandwidth.

Shouldn't it be the case that the sum of all candidates on a single core use e.g. 2 seconds of execution time or whatever the limit per core is? I suppose we could push higher when taking parallelism into account. However, at that point we are starting to break compatibility with RFC-1/3 as we assume that 1 parachain is even capable of occupying 100% of a core. Of course, with block authoring/import parallelism on the Cumulus side we can probably achieve this.

However I'd still push for the approach where a single "core" can just handle larger amounts of throughput per-block, but we expect that many chains are simultaneously multiplexed across them. e.g. every core can handle 8s of execution per 6s, but any individual state transition put on that core is at most 2s of execution. This'd let us lean into the amortization without introducing any new concepts, except for the resource allocation schedule of the core.

@sandreim (Contributor) commented Jul 31, 2023

My assumption here was that we do not want to change backing or implement parts of RFC-3. But yeah, the PoC should still in principle be as close as possible to production. Something like this should be workable:

  • Regions get assigned to cores (not in PoC)
  • Backers work on all regions and multiple candidates from different regions get backed (not in PoC)
  • Block author computes ExecutionGroups based on execution times and PoV sizes in CandidateCommitments and includes them in the relay chain block.
  • Alternatively the Runtime can compute ExecutionGroups (slower)
  • Runtime sanitizes the ExecutionGroups and panics if the grouping fails any of the criteria (duplicates, resource limits breached, not all candidates included, inefficient grouping, etc.)
  • everything else stays the same

Shouldn't it be the case that the sum of all candidates on a single core use e.g. 2 seconds of execution time or whatever the limit per core is? I suppose we could push higher when taking parallelism into account. However, at that point we are starting to break compatibility with RFC-1/3 as we assume that 1 parachain is even capable of occupying 100% of a core.
Of course, with block authoring/import parallelism on the Cumulus side we can probably achieve this.

Why should a parachain using 100% of the core break RFC-1/3? That would basically be what we have now, and it means it will be the only thing in an ExecutionGroup.

However I'd still push for the approach where a single "core" can just handle larger amounts of throughput per-block, but we expect that many chains are simultaneously multiplexed across them. e.g. every core can handle 8s of execution per 6s, but any individual state transition put on that core is at most 2s of execution. This'd let us lean into the amortization without introducing any new concepts, except for the resource allocation schedule of the core.

Yes, that is doable, but I would like to keep it simpler. We can just consider 1 PVF worker per core per relay chain block, so we're limited to 6s of execution. But these ExecutionGroups also stack with the v2 assignments and coalescing of approvals, so we can get some more parallelism and optimization going at that level - in practice we do end up getting more than 6s of execution per relay chain block. We just have to be careful not to overcommit on the PVF workers. But even if that happens occasionally on some nodes, it is easy to prioritize executions at the PVF worker level in this order: disputes, approvals, backing - if a node has too much PVF work to do, it will simply back fewer candidates.

@rphmeier (Contributor) commented Jul 31, 2023

Can you explain the struct ExecutionGroup? I still don't understand it.

I am just thinking that a Core can have a per-block workload which is comprised of state transitions from many processes. This writeup seems to assume that a core only has one process on it per block, so we need this ExecutionGroup struct to put them together.

type ExecutionGroup = Vec<CommittedCandidateReceipt>, where they are all grouped together on the same core?

Why should a parachain using 100% of the core break RFC-1,3 ?

It doesn't, but it sounds like you're talking about cores (or groups?) which can process more than 2s of execution time. Once we pass a certain threshold it's pretty hard for a parachain to keep up, which I think is OK.

@sandreim (Contributor) commented Jul 31, 2023

Yes, it's Vec<CommittedCandidateReceipt>, and yes, it's just a bunch of candidates whose execution sums up to at most 2s.
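Forming such groups from backed candidates could be sketched as a first-fit pass over execution times (illustrative only; `group_candidates` is not the proposed runtime logic, which would also have to weigh PoV sizes and the other grouping criteria):

```rust
use std::time::Duration;

// Sketch: group candidates (by index) into ExecutionGroups whose summed
// execution time stays within a budget, e.g. 2s per group.
fn group_candidates(exec_times: &[Duration], budget: Duration) -> Vec<Vec<usize>> {
    // Each entry tracks (execution time used, candidate indices).
    let mut groups: Vec<(Duration, Vec<usize>)> = Vec::new();
    for (idx, &t) in exec_times.iter().enumerate() {
        // First-fit: put the candidate into the first group with room left.
        match groups.iter().position(|(used, _)| *used + t <= budget) {
            Some(i) => {
                groups[i].0 += t;
                groups[i].1.push(idx);
            }
            None => groups.push((t, vec![idx])),
        }
    }
    groups.into_iter().map(|(_, members)| members).collect()
}

fn main() {
    let budget = Duration::from_secs(2);
    let times = [
        Duration::from_millis(2000), // a full-core candidate gets a group alone
        Duration::from_millis(500),
        Duration::from_millis(500),
        Duration::from_millis(1000),
    ];
    let groups = group_candidates(&times, budget);
    assert_eq!(groups.len(), 2);
    assert_eq!(groups[0], vec![0]);
    assert_eq!(groups[1], vec![1, 2, 3]);
}
```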

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/1

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/3

@Polkadot-Forum

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/dynamic-backing-groups-preparing-cores-for-agile-scheduling/3629/8

@Sophia-Gold Sophia-Gold transferred this issue from paritytech/polkadot Aug 24, 2023
@the-right-joyce the-right-joyce added the I6-meta A specific issue for grouping tasks or bugs of a specific category. label Oct 3, 2023
@eskimor (Member, Author) commented Jan 30, 2024

The idea evolved to push the bundling in the other direction: to builders/collators. Candidate Descriptor changes are already being worked on in the course of #1829.

@burdges commented Jan 30, 2024

Although less dynamic, that's less relay chain complexity, probably the right choice.
