Poor PV performance (missed votes) with v1.12.0 and v1.13.0 #4800
First of all, the drop in performance is small: you are still getting at least A grades, just not A+ as often. With such small margins I would double-check a few times that it is not simply restarts that improve things rather than the version change. As for potential reasons, I don't think it is a problem with those releases, since ~90% of validators are already running them and getting A+ grades. I think it might be a combination of the network or your node being a bit slow, together with something that changed between those versions. Two pull requests touched that area (statement-distribution) between v1.11 and v1.12:
Thanks for helping out with this. I will address the possible issues with our servers and network below.
For versions 1.11.0 and previous, our nodes ran for weeks at a time without restart and achieved very few missed votes. During the "v1.12.0" section of our graph above (snip below), we restarted no less than four times as we moved the node back and forth between our E-2136 and E-2386G servers. Restarts showed no improvement.
To keep network latency to a minimum we use 16-core MikroTik routers capable of handling 100x the traffic we throw at them. Our routers' CPU usage is almost idle.
We thought our E-2136 server might be a bit slow (despite it meeting benchmarks and running v1.11.0 fine), so we switched the DOT node to the E-2386G server, which achieves a score of over 200% on all tests except memory. As discussed above, the faster server was no help.
Can someone check whether parachain block times also take a hit? Maybe some parachains are forking?
Could be it; currently there is a hard limit of 1 concurrent request per peer: #3444 (comment). I would expect that some validators are not upgraded and send more than one request at a time to peers upgraded to v1.12+.
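For context on what such a limit means in practice, here is a toy Rust sketch (not the actual statement-distribution code, and all names are made up) of a per-peer cap of one in-flight request: a new request to a peer is only launched once the previous one has completed.

```rust
use std::collections::HashMap;

// Hypothetical constant mirroring the "1 concurrent request per peer" limit.
const MAX_PARALLEL_REQUESTS_PER_PEER: usize = 1;

#[derive(Default)]
struct RequestTracker {
    // peer id -> number of outstanding requests
    in_flight: HashMap<String, usize>,
}

impl RequestTracker {
    /// Returns true if a new request to `peer` may be launched now.
    fn try_start(&mut self, peer: &str) -> bool {
        let count = self.in_flight.entry(peer.to_string()).or_insert(0);
        if *count < MAX_PARALLEL_REQUESTS_PER_PEER {
            *count += 1;
            true
        } else {
            false
        }
    }

    /// Called when a response (or timeout) arrives, freeing a slot for this peer.
    fn on_response(&mut self, peer: &str) {
        if let Some(count) = self.in_flight.get_mut(peer) {
            *count = count.saturating_sub(1);
        }
    }
}

fn main() {
    let mut tracker = RequestTracker::default();
    assert!(tracker.try_start("peer-a"));
    assert!(!tracker.try_start("peer-a")); // second request to the same peer must wait
    tracker.on_response("peer-a");
    assert!(tracker.try_start("peer-a"));
}
```

A non-upgraded validator that sends several requests at once to an upgraded peer would see everything past the first one queued, which matches the suspicion above.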
Digging a bit more through the data, I concur with your finding; it seems like around session
Doesn't look like it from the metrics, and that makes sense because https://apps.turboflakes.io/?chain=polkadot#/insights gives you the grade based on the relative points you gathered compared with the peers in your group. So in this case, if other peers got slightly more points, it means they backed the candidate and that ended up on chain. Just from the data, it looks like you sometimes lose the race of including your Valid/Backed statements on chain. As of now the most obvious candidate for causing this is #4800 (comment), but this needs more investigation to understand the full conditions.
@aperture-sandi any chance you could provide us with the Grafana values for the following metrics when running v1.12 or v1.13?
Quick update on next steps here:
Looking at the logs from one of our Kusama validators I see this: https://grafana.teleport.parity.io/goto/7QCJDgwIR?orgId=1
And this correlates exactly with the number of missed statements by the validator: https://apps.turboflakes.io/?chain=kusama#/validator/FSkbnjrAgD26m44uJCyvaBvJhGgPbCAifQJYb8yBYz8PZNA?mode=history. It misses 2 statements at session 40001 and 1 statement at 40002. The changes were introduced with #4035, so most likely that's what is causing the behaviour on your validators; I'll dig deeper into why this condition occurs. The good news is that this happens very rarely, and even when it happens the candidate gets backed anyway, so the parachain does not miss any slot; it's just that the validator will not get the backing points for that single parachain block.
OK, I was a bit confused by the logs being in reverse chronological order. This log can appear if the parachain is producing forks. But the resulting number of on-chain backing votes should not be different from prior to #4035. Prior to #4035, different parts of the backing group (let's say half and half) could back different forks. The block author would in the end only pick one of the two (so only half of the votes would end up on chain), but the block author would have both candidates to pick from. After #4035, each half of the validator group would back different forks, but the block author will only ever know one candidate. In the end, the result should be the same (only half of the votes end up on chain). Based on this reasoning, this should not affect backing rewards. However, I think what may be happening is that prior to #4035, each validator in the group would have backed both candidates. That is, even if a validator voted for a candidate, this wouldn't stop it from voting for a competing candidate.
Hi all, I'm out of the office until Tuesday, so I won't be able to get to this until Tuesday or Wednesday, if still needed.
I discussed this further with @alindima offline and we think you are right.
This is almost correct. You would have more than half of the votes, as the validators would have accepted to second the fork, which is what changed with this PR.
A simple fix would be to ramp up the backing threshold to 3 out of 5 on Polkadot. Better, because it always works: deterministic fork selection. Instead of dropping the new import, we compare by CandidateHash and drop the smaller/greater one. If we do this, then please don't forget to add it to the implementer's guide.
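To illustrate the suggestion, here is a minimal Rust sketch (hypothetical types, not the actual polkadot-sdk code) of a deterministic "lower candidate hash wins" rule, so that every validator in the group independently converges on the same fork:

```rust
use std::cmp::Ordering;

// Stand-in for the real CandidateHash type in polkadot-primitives.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct CandidateHash([u8; 32]);

/// Returns the candidate that should win the fork race under a
/// "lower hash wins" rule. Distinct candidates never compare equal.
fn pick_fork(a: CandidateHash, b: CandidateHash) -> CandidateHash {
    match a.cmp(&b) {
        Ordering::Less | Ordering::Equal => a,
        Ordering::Greater => b,
    }
}

fn main() {
    let already_imported = CandidateHash([0xab; 32]);
    let new_import = CandidateHash([0x07; 32]);
    // The already-imported candidate is dropped when the new one has a lower hash.
    assert_eq!(pick_fork(already_imported, new_import), new_import);
}
```

Because the rule depends only on the candidate hashes, all honest validators reach the same choice without extra coordination, and it cannot be easily gamed by the collator.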
Double checked:
How would that fix this situation? If you are on the losing side of the fork, your vote would still not be counted and rewarded.
Yeah, Aura should not fork except from significant gossip delays, wrong clocks, or other consensus-ish issues. Are there equivocations on the parachain? Aura equivocations should look like two blocks with the same slot, which incidentally have the same block producer authority.
At minimum, we should glance at those backing group size charts first, because 2-of-5 chooses 2 < 5/2 for liveness. I'd think 2-of-4 retains liveness, and maybe it reduces this too, but likely the real fix lies elsewhere.
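For reference, a rough sketch of the kind of equivocation check described above (simplified, hypothetical types rather than the real Aura primitives): two distinct blocks claiming the same slot from the same authority.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Slot(u64);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct AuthorityId(u32);

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct BlockHash([u8; 32]);

struct Header {
    hash: BlockHash,
    slot: Slot,
    author: AuthorityId,
}

/// Two headers are an Aura equivocation if they are distinct blocks
/// produced by the same authority for the same slot.
fn is_aura_equivocation(a: &Header, b: &Header) -> bool {
    a.hash != b.hash && a.slot == b.slot && a.author == b.author
}

fn main() {
    let a = Header { hash: BlockHash([1; 32]), slot: Slot(100), author: AuthorityId(7) };
    let b = Header { hash: BlockHash([2; 32]), slot: Slot(100), author: AuthorityId(7) };
    assert!(is_aura_equivocation(&a, &b));
}
```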
These are the two blocks I'm talking about: [Abandoned, but our validator backs it first] https://polkadot.js.org/apps/?rpc=wss%3A%2F%2Frpc-bridge-hub-kusama.luckyfriday.io#/explorer/query/0x0ee1366da44b3513b8e987edb00727c224fa12f0c842871e38285d4bf4c60c58 Looking at the logs, they don't seem to be produced by the same validator:
Found the root cause of the forks: https://matrix.to/#/!gTDxPnCbETlSLiqWfc:matrix.org/$fO-SigvBAVe1uRY3Kp99yGeUDiHPvfeaL_fCR_tKyoc?via=web3.foundation&via=matrix.org&via=parity.io. It seems that people who run system parachains are running two different instances of a collator with the same authority keys, so when it is the turn of their authority they end up proposing two blocks, which end up racing on the relay chain to get the necessary backing votes. So, if a validator is backing the losing fork it will not get the rewards, for the reasons explained here by @alindima: #4800 (comment). My suggestion for the path forward here would be:
We should start slashing them. Initially I didn't add slashing for equivocations, because there was no way to determine the relay chain block associated with a parachain block. However, now we can do this and we should enable it. Generally the idea is that there is one block per slot from each author.
Sounds complicated to me, as a validator could start signing any kind of blob and pretend they voted for something.
They are literally running two instances with the same keys because they think it is good practice. I will tell them to stop this practice for system parachains, since it does more harm than help.
+1, creating forks will impact parachain block times, so there is a good incentive for parachains to slash their collators if they do it. On the other hand, theoretically you could even have proof-of-work parachains or some custom block production consensus that creates forks. In such a scenario, if your validator is assigned to such a parachain it will get fewer rewards than before, so the problem is not solved in the general case. With the current behaviour of never backing a fork we actually save some CPU that could be used for more useful work. Of course, we trade off some liveness when a fork gets backed with only 2 backers that have bandwidth issues or go offline. IMO the solution seems to be changing prospective-parachains to accept forks. This is limited to just ensuring we have as many backers as possible; the forks would never be taken into consideration to form a chain, so we keep the current behaviour and make validators happy again.
We are backing forks? The point being that you may only back a fork that doesn't get on chain and thus you don't get any rewards, as explained by @alexggh above. Generally I would also say that this is expected; over time the validator will get its normalized rewards. Also, don't we already give more era points for approval voting than for backing? So backing a fork should have even less impact.
@alexggh You've different authority IDs there? It's harmless for Polkadot if the fork gets backed, just some lost rewards for the backer. We do not want collators equivocating of course, but we should first ask them to stop. If we then need non-slashing technical measures, we could break out my CRDT trick for authority key registration, aka only one registered on-chain at a time. We'd then only have real dishonest equivocations from collators, for which parachains would likely eject them. Also, we're not even paying the approval checkers yet, so our backing rewards look really exaggerated right now. Just naively, validator rewards should break down roughly like 70% approval checks, 8% availability, 7% backing, and 15% BABE & GRANDPA. We're missing like 78% of that allocation, so all current rewards shall shrink by a factor of 5 once we account for all this critical work, and then people should care less if they back the wrong thing occasionally.
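A quick back-of-the-envelope check of that split (the percentages are the rough targets stated above, not an implemented reward scheme):

```rust
fn main() {
    // Assumed target shares of validator rewards, per the comment above.
    let approval = 0.70;
    let availability = 0.08;
    let backing = 0.07;
    let babe_grandpa = 0.15;
    assert!((approval + availability + backing + babe_grandpa - 1.0).abs() < 1e-9);

    // Approval checking and availability are not rewarded yet: ~78% of the target split.
    let not_yet_rewarded = approval + availability;
    // The work rewarded today would therefore keep only ~22% of the pie,
    // i.e. current rewards shrink roughly five-fold once everything is paid.
    let remaining_share = 1.0 - not_yet_rewarded;
    println!("shrink factor ≈ {:.1}", 1.0 / remaining_share); // ≈ 4.5, "roughly a factor of 5"
}
```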
Great work @alexggh! Indeed, the fork choice already happens before a candidate is fully backed (actually not good), so 3 out of 5 does not help here. Picking by comparing candidate hashes should do the trick though.
Yeah, that was a red herring; those weren't the authority IDs, they are some keys that are auto-generated at startup for each instance. That's why I thought these are different collators, but they aren't.
Yes, I think that is what would happen if the forks were more or less random, but in this case it looks like the same set of validators find themselves more often on the losing side. I would assume that's because the fork is deterministic and the same validators are losing because of their latency to the collator.
We don't have approval voting rewards implemented yet, but yeah, once we reduce the backing proportion in rewards this should actually be a non-issue; even currently the difference in points is just around 1%.
Asked the system parachain collators to stop running two instances with the same authority keys here: https://matrix.to/#/!gTDxPnCbETlSLiqWfc:matrix.org/$nP852NAScRgGukKtdNHoDEISH_XaqnfgXX5MoSPZy1Q?via=web3.foundation&via=matrix.org&via=parity.io. I'll keep an eye on the dashboards to confirm whether there is any impact.
I've discussed this issue more with @alindima offline... I've discussed this further with @eskimor and @sandreim and we came up with the following solution that we think would work. In #4035, I removed support for parachain forks in prospective parachains, but added support for unconnected candidates (candidates for which we don't yet know the parent), so that chained candidates can be validated in parallel by different backing groups, which is needed for elastic scaling. The proposal is to have prospective-parachains only build the chain with candidates that have been backed. This way, a single validator that seconds an invalid candidate would not prevent a legitimate fork from being backed. The unconnected candidate storage could hold forks, but the actual chain prospective-parachains builds would not allow for forks (which was the source of great complexity before #4035). TL;DR: prospective-parachains will again allow for forks between candidates that are not backed, but will not track them once the candidates are backed. Validators will use a deterministic fork selection rule during backing. This is definitely not a trivial fix that can be done in a couple of days though.
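A very rough sketch of that proposal, using simplified hypothetical types (not the real prospective-parachains data structures): the chain itself only grows with backed candidates, while the potential/unconnected storage may temporarily hold forks and drops the losers once one fork is backed.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct CandidateHash(u64);

#[derive(Clone, Debug)]
struct Candidate {
    hash: CandidateHash,
    parent: CandidateHash,
    backed: bool,
}

#[derive(Default)]
struct FragmentChainSketch {
    /// The actual chain: only backed candidates, no forks.
    chain: Vec<Candidate>,
    /// Seconded-but-not-yet-backed candidates; forks are allowed here.
    potential: HashMap<CandidateHash, Candidate>,
}

impl FragmentChainSketch {
    fn add_seconded(&mut self, c: Candidate) {
        self.potential.insert(c.hash, c);
    }

    /// When a candidate reaches the backing quorum, append it to the chain if it
    /// extends the current head, and drop competing forks of the same parent.
    fn on_backed(&mut self, hash: CandidateHash) {
        if let Some(mut c) = self.potential.remove(&hash) {
            c.backed = true;
            let extends_head = self
                .chain
                .last()
                .map_or(true, |head| c.parent == head.hash);
            if extends_head {
                let backed_parent = c.parent;
                self.chain.push(c);
                // Once one fork is backed, siblings of the backed candidate are no longer accepted.
                self.potential.retain(|_, p| p.parent != backed_parent);
            }
        }
    }
}

fn main() {
    let mut fc = FragmentChainSketch::default();
    let parent = CandidateHash(0);
    fc.add_seconded(Candidate { hash: CandidateHash(1), parent, backed: false });
    fc.add_seconded(Candidate { hash: CandidateHash(2), parent, backed: false }); // a fork
    fc.on_backed(CandidateHash(1));
    assert_eq!(fc.chain.len(), 1);
    assert!(fc.potential.is_empty()); // the competing fork was dropped
}
```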
Can you explain what you mean by this?
I mean that if there is a fork (two different candidates are seconded, having the same parent), the other validators in the backing group will all choose to validate the same candidate, so that they all have the highest chance of getting their backing reward. The block author will also pick the same candidate of the two to back on-chain. This rule could be that we favour the candidate with the lower candidate hash. This is something random that can't be easily manipulated.
Actually, now that I've started prototyping this, I'm not sure the comparison is needed during backing, as long as we allow both forks to be backed. It's up to the block author to pick one, and both of them could have the same number of votes. It'd only be more efficient to give up validation of a candidate that we could reasonably know will not be picked by the block author, but that adds a whole lot of complexity for little benefit, the more I think about it.
Yeah, this was actually the case before the changes you made in #4035. The general idea of choosing the next block to back based on the lower hash sounds reasonable to me. There could be N (number of validators in the backing group) forks, and letting everyone decide in the same way to ensure that one of them gets the required number of votes sounds like a good idea. But then we would maybe need to stop fetching a PoV when we see some smaller hash, or whatever. So, it adds complexity again. Maybe something for when we are bored :P
IIUC this means we need to go back to tracking trees rather than chains + unconnected candidates. The idea with the comparison is to use it so that only the chosen fork makes it to the block author (unless the block author is himself in that group). All other forks remain local to the backing group (cluster topology).
This was always the idea. We've no control over the order in which backers & block authors receive the candidates & statements, and thus cannot reward every backed candidate, because establishing that control costs too much CPU time & bandwidth. That's fine.
Resolves #4800

# Problem

In #4035, we removed support for parachain forks and cycles and added support for backing unconnected candidates (candidates for which we don't yet know the full path to the latest included block), which is useful for elastic scaling (parachains using multiple cores).

Removing support for backing forks turned out to be a bad idea, as there are legitimate cases for a parachain to fork (if it uses another consensus mechanism, for example, like BABE or PoW). This leads to validators getting lower backing rewards (depending on whether they back the winning fork or not) and higher pressure on only half of the backing group (during availability-distribution, for example). Since we don't yet have approval voting rewards, backing rewards are a pretty big deal (which may change in the future).

# Description

A backing group is now allowed to back forks. Once a candidate becomes backed (has the minimum backing votes), we don't accept new forks unless they adhere to the new fork selection rule (have a lower candidate hash). This helps keep the implementation simpler, since forks are only taken into account for candidates which are not backed yet (only seconded). Having this fork selection rule also helps reduce the work backing validators need to do, since they have a shared way of picking the winning fork. Once they see a candidate backed, they can all decide to back a fork and not accept new ones. But they still accept new ones during the seconding phase (until the backing quorum is reached). Therefore, a block author which is not part of the backing group will likely not even see the forks (only the winning one).

Just as before, a parachain producing forks will still not be able to leverage elastic scaling but will still work with a single core. Also, cycles are still not accepted.

## Some implementation details

`CandidateStorage` is no longer a subsystem-wide construct. It previously held candidates from all relay chain forks and complicated the code. Each fragment chain now holds its own candidate chain and its potential candidates. This should not increase storage consumption, since the heavy candidate data is already wrapped in an `Arc` and shared. It does, however, allow for great simplifications and increases readability.

`FragmentChain`s now only build a chain with backed candidates and the fork selection rule. As said before, `FragmentChain`s are now also responsible for maintaining their own potential candidate storage.

Since we no longer have the subsystem-wide `CandidateStorage`, when getting a new leaf update we use the storage of our latest ancestor, which may contain seconded/backed candidates that are still in scope.

When a candidate is backed, the fragment chains which hold it are recreated (due to the fork selection rule, it could trigger a "reorg" of the fragment chain).

I generally tried to simplify the subsystem and not introduce unnecessary optimisations that would otherwise complicate the code and not gain us much (fragment chains wouldn't realistically ever hold many candidates).

TODO:
- [x] update metrics
- [x] update docs and comments
- [x] fix and add unit tests
- [x] tested with fork-producing parachain
- [x] tested with cycle-producing parachain
- [x] versi test
- [x] prdoc
Description of bug
EDITED to add...
Further research seems to indicate an overall increase in missed votes for a significant percentage of validators.
See comment below: #4800 (comment)
Poor PV performance (missed votes) with v1.12.0 and v1.13.0
We have been seeing poor PV performance in the form of missed votes for v1.12.0 and v1.13.0
For our DOT node, on versions v1.11.0 and earlier, our missed vote ratio (MVR) was ~0.3%.
For our DOT node, on versions v1.12.0 and newer, our missed vote ratio (MVR) is ~1.2%.
To confirm v1.12.0+ was the issue we reverted to v1.11.0 for 8+ sessions and saw our MVR return to previous low values.
We have tested this on Xeon(R) E-2136 CPU @ 3.30GHz and Xeon(R) E-2386G CPU @ 3.50GHz bare metal servers running NVME drives.
*NVMe drives are connected to PCIe via an add-on card for direct PCIe access
*Hyperthreading is turned off
(See attached benchmarks)
We compile with RUSTFLAGS="-C target-cpu=native" and --profile production:
RUSTFLAGS="-C target-cpu=native" ~/.cargo/bin/cargo install --force --locked --path . --profile production
Rust Version:
Ubuntu 22.04.3 LTS
See attached image of MVR via TurboFlakes ONE-T report.
Live Report here: https://apps.turboflakes.io/?chain=polkadot#/validator/5F99nAGjKA9cU8JTpuTDWDhqpMxwxmwc8M1ojkKXZAR7Mn2c?mode=history
How can we help to narrow down the cause of this performance issue?
Steps to reproduce
To confirm v1.12.0+ was the issue, we switched back to v1.11.0 for 8+ sessions and saw our MVR return to previous low values.
We have tested this on Xeon(R) E-2136 CPU @ 3.30GHz and Xeon(R) E-2386G CPU @ 3.50GHz bare metal servers running NVME drives.