approval-distribution: process assignments and votes in parallel #732
Comments
I am still in favor of trying real hard to improve performance/reduce work somehow. Using multiple cores just for signature checking in approval voting ... maxing out multiple cores, even when all parachains are 100% idle? That does sound like something we should be able to improve. |
Yes, I created #729 and if I understand correctly it does exactly what you mean and helps a lot. I think we should implement improvements at the protocol/cryptography level as well as the architectural level. In perfect network conditions the changes suggested in #729 might be enough to scale to 500+ validators, but if we go higher, or upper tranches trigger, or disputes happen, then we clearly need a solution that can handle bursts of assignments and votes efficiently and scale beyond 1 CPU. |
FWIW, we just have these constant overheads for every parachain block. The flip side is that they don't get worse when parachains get heavier. |
Spent some time introspecting and profiling the approval voting to understand why we see the high
Reducing the assignments as in paritytech/polkadot#6782 will definitely help with the number of messages we have to process.
While these operations happen relatively rarely, they do happen at peak time, so they might be just the straw that breaks the camel's back.
|
Great work! On signature checking, do we know how much of it is approvals, vs. assignments? |
We need better benchmarks in schnorrkel, but verifying an approval should take roughly 40 microseconds, so like ed25519, and verifying an assignment should take less than 3*25 + 40 = 115 microseconds, so in 1 millisecond on one core you'd verify maybe 25 approvals, or 8 assignments, or 6 of each. I've some tricks to speed up the VRF parts..
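The per-millisecond figures above follow from integer division of the time budget by the estimated cost. A minimal sketch of that arithmetic, assuming the rough costs quoted in this comment (the constant and function names are hypothetical, chosen only for illustration):

```rust
/// Rough per-operation costs in microseconds, taken from the estimates
/// quoted above. These are guesses, not benchmark results.
const APPROVAL_VERIFY_US: u32 = 40;
const ASSIGNMENT_VERIFY_US: u32 = 3 * 25 + 40; // upper bound: 115

/// How many whole operations of a given cost fit into a time budget
/// on a single core.
fn ops_per_budget(budget_us: u32, cost_us: u32) -> u32 {
    budget_us / cost_us
}

fn main() {
    let budget_us = 1_000; // 1 millisecond
    let approvals = ops_per_budget(budget_us, APPROVAL_VERIFY_US);
    let assignments = ops_per_budget(budget_us, ASSIGNMENT_VERIFY_US);
    // One approval plus one assignment, verified back to back.
    let pairs = ops_per_budget(budget_us, APPROVAL_VERIFY_US + ASSIGNMENT_VERIFY_US);
    println!("approvals/ms: {approvals}, assignments/ms: {assignments}, pairs/ms: {pairs}");
}
```

Changing the two constants is enough to redo the estimate once real schnorrkel benchmarks are available.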
An elligator hash-to-curve should be fast, but we could cache the hash-to-curve if we uniformized them by excluding the validator public key from the hashed-to-curve. As the extreme we could compute a base point table for those uniformized hash-to-curves, which saves 10 microseconds if not batching. I'd think sr25519 and ed25519 cost the same, so we'd only speed up approvals verification through either batching or else by adopting a considerably larger but non-batchable signature, like Rabin-Williams. We no longer think network throughput bottlenecks us here? We already have something placing multiple assignments into the same message? Are the assignments by the same signer or possibly different signers?
Why 6? An sr25519 signing operation should take only 15 microseconds, just like ed25519, so maybe the keystore has problems here? We need roughly 25+25+15 = 65 microseconds for VRF signing, btw.
Is this querying approvals data or something about the chain?
mmap fixes this one. approvals-voting db needs to be serialized in memory for this, but that's not really problematic. |
Nice work @alexggh
Yes! But this improvement becomes less useful if we just scale the validators up. It is more useful when we actually need to do more work per validator - more parachains.
I also observed this some time ago. My theory back then:
|
Time of flight: it defines how long a message that has arrived in approval-distribution waits until it gets picked up for processing.
They seem to be evenly spread, e.g 1 second snapshot:
I'll dig a bit deeper here, but in both my instrumentation and the tracing we have here, assignment imports seem to take 200-300 microseconds and approval imports between 100 and 200 microseconds. Also, I noticed rare cases where checking of a
It seems we are doing it per candidate, so I think we have more at the same time.
What problems are you thinking about?
Will re-check it.
Yes, it is asking the runtime for block information a few times.
Will check whether this is used on the images I did the testing with; if not, I'll check what impact it has.
Are you referring to paritytech/polkadot#6782? If yes, the answer is no, I wasn't running that PR when I did the measurements.
The messages are already in the |
If I understand, we do not currently know what performance issues the networking itself brings, so we should not yet assume we can make life harder for the networking either.
Ask @koute maybe?
Relay VRF preoutputs probably, maybe something else. We copy & keep the relay VRF output outside the chain once we start approvals work, no?
I've no idea how the keystore works internally, but unrelated.. I noticed one reason signing might be slow or unpredictable: We do a syscall every time we sign anything in https://github.com/w3f/schnorrkel/blob/master/src/lib.rs#L228 which gives linux an excuse to bump the thread. I'd previously used ThreadRng but removed it to simplify maintenance in w3f/schnorrkel@c333370. I'll revert that commit, fix rand, and make some other changes.
So 3x and 4x longer than my guesstimates for the bare crypto, respectively. We do need better benchmarks on schnorrkel of course, like the rand issue for example.
No. We discussed batching together other approvals messages too, but maybe no issue exists yet. If we're already signing 6 around the same time then maybe this saves lots. |
It's not. This still requires passing a special flag to the compiler when compiling polkadot so that the SIMD instructions actually get used. But I have another PR to The expected speedup thanks to this is somewhere between ~30%-50%, depending on the hardware. |
It doesn't exist yet, I think it is this one: #701
Yes, we do that.
Thank you, I'll try to see if I can obtain some numbers with that reverted.
@koute Could you point me to the PR, so I can pull it locally and get some numbers? |
It's merged to That said, actually integrating it is a little bit more complex because the new version of If you're interested here are some performance numbers from Scalar vs AVX2
Scalar vs AVX512
AVX2 vs AVX512
|
I think the lower level crypto optimizations should be somewhat orthogonal. It's clear they'll happen over the next couple months regardless, so.. If you want to focus on our real protocol optimization that's fine too. Ask @eskimor what he thinks. We do need to know when the crypto optimizations merge of course, so that we do not mistakenly conclude some protocol optimization did stuff it did not do. We can otoh quickly cut a schnorrkel version without the syscall if you want to see if that's causing threads to be bumped or whatever. |
Yes, I agree they are orthogonal. I just went this route because I wanted to understand where we spent most of the time. With that in mind, I think in the order of impact we should do the following:
|
In principle it sounds good, but do we know how much time the signature check takes vs the internal book keeping that is done by approval voting ? At a scale of 1k paravalidators and 200 parachains, under normal operation with 30 needed approvals and 6 vrf modulo samples, the question we are asking is basically - |
Yes this was always my question. :) I suspect this syscall in schnorrkel is a major cost since it lets linux bump the thread.
Not really, but we could use those other threads for checking more parachain blocks or whatever. At a high level, we're not exactly in the shoes of a web2 company who, once they parallelize, can simply throw more hardware at the problem. We have the hardware we told validators they could buy, and upgrading that requires community work. |
TBH, not sure that the syscall would result in the CPU being given to another thread, but we could actually test the 2 scenarios and see what the result is. Totally agree with you, the thing is that with the current architecture we cannot dynamically scale to process stuff on more than 1 CPU to handle bursty workloads. Assuming 1k para validators and 200 paras, I think we need to worry about the specs since that means at most 6-7 candidates to check (1 to back, 6 to approve) and 2x that with async backing. We could add more validators there so the amount of checks per validator goes down, but that means we need to be able to handle more gossip traffic still. |
At least in the approval-voting subsystem, out of 1 second of useful work it spends around 700 ms doing just the crypto operations for either assignments or approvals. I'll get back when I have some number splits for |
Some measurements regarding where the time is spent in
One thing I noticed while performing these measurements is that we've got cases where we get multiple assignments & approvals from peers in the same message, here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L568 and here https://github.com/paritytech/polkadot/blob/master/node/network/approval-distribution/src/lib.rs#L607, but because we process those assignments one by one we are going to fan out those messages into multiple messages sent to the same peer. On average we have 10-15 approvals in those messages, but I see some cases where we even get I think this inefficiency is low-hanging fruit, and fixing it should help us with at least reducing the network traffic/load. |
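The fan-out described above can be avoided by grouping outgoing items per peer before sending. A minimal sketch, assuming simplified stand-in types (`PeerId`, `Approval` and `batch_by_peer` are hypothetical names, not the actual approval-distribution API):

```rust
use std::collections::HashMap;

// Hypothetical stand-ins for the real network and approval types.
type PeerId = u32;

#[derive(Debug, Clone, PartialEq)]
struct Approval {
    validator: u32,
    candidate: u32,
}

/// Group outgoing approvals by destination peer, so that each peer
/// receives one message carrying a batch instead of one message per item.
fn batch_by_peer(outgoing: Vec<(PeerId, Approval)>) -> HashMap<PeerId, Vec<Approval>> {
    let mut batches: HashMap<PeerId, Vec<Approval>> = HashMap::new();
    for (peer, approval) in outgoing {
        batches.entry(peer).or_default().push(approval);
    }
    batches
}

fn main() {
    let outgoing = vec![
        (1, Approval { validator: 7, candidate: 100 }),
        (2, Approval { validator: 7, candidate: 100 }),
        (1, Approval { validator: 8, candidate: 100 }),
    ];
    // Three individual sends collapse into two batched messages.
    for (peer, batch) in batch_by_peer(outgoing) {
        println!("peer {peer}: {} approvals in one message: {batch:?}", batch.len());
    }
}
```

The items inside each batch keep their original relative order, which matters if the receiver expects assignments before the approvals that depend on them.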
Thanks for the numbers @alexggh, it appears that book keeping is not eating a lot, but we might be able to improve there as well.
Sounds like a good idea, it should clearly reduce the notification counts. |
…er approval-voting
In the current implementation, every time we process an assignment or an approval that needs checking by approval-voting, we wait until approval-voting answers the message. Given that approval-voting executes signature checks that take significant time (between 400 us and 1 ms) per message, that is where most of the time in approval-distribution goes; see https://github.com/paritytech/polkadot/issues/6608#issuecomment-1590942235 for the numbers.
So, modify approval-distribution so that it picks another message from the queue while approval-voting is busy doing its work. This will have a few benefits:
1. Better pipelining of the messages: approval-voting will always have work to do and won't have to wait for approval-distribution to send it a message. Additionally, some of the work of approval-distribution will be executed in parallel with work in approval-voting instead of serially.
2. By allowing approval-distribution to process messages from its queue while approval-voting confirms that a message is valid, we give approval-distribution the ability to build a better view of which messages other peers already know, so it won't gossip a message to those peers once we confirm that message as being correct.
3. It opens the door for other optimizations in the approval-voting subsystem, which would still be the bottleneck.
Note! I still expect the amount of work the combination of these two subsystems can do to be bounded by the number of signature checks it has to do, so we would have to stack this with other optimizations we have in the queue.
- https://github.com/paritytech/polkadot/issues/6608
- https://github.com/paritytech/polkadot/issues/6831
[] Evaluate impact in versi
[] Clean up code and make CI happy to make the PR mergeable.
Signed-off-by: Alexandru Gheorghe <[email protected]>
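Benefit 2 above can be illustrated with a small bookkeeping structure: while a message is still being checked, record every peer that sends it, then exclude those peers when gossiping after the check passes. This is only a sketch under assumed names (`PeerKnowledge`, `note_received` and `gossip_targets` are hypothetical, not the subsystem's real types):

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical identifiers standing in for the real peer and message types.
type PeerId = u32;
type MessageId = u64;

/// Tracks which peers are known to have a message that is still
/// waiting for its validity check.
#[derive(Default)]
struct PeerKnowledge {
    known_by: HashMap<MessageId, HashSet<PeerId>>,
}

impl PeerKnowledge {
    /// Called for every copy of a message received while the check is pending.
    fn note_received(&mut self, msg: MessageId, from: PeerId) {
        self.known_by.entry(msg).or_default().insert(from);
    }

    /// Called once the message is confirmed valid: returns the peers that
    /// still need it and drops the bookkeeping for that message.
    fn gossip_targets(&mut self, msg: MessageId, peers: &[PeerId]) -> Vec<PeerId> {
        let known = self.known_by.remove(&msg).unwrap_or_default();
        peers.iter().copied().filter(|p| !known.contains(p)).collect()
    }
}

fn main() {
    let mut state = PeerKnowledge::default();
    // Peers 1 and 3 send us message 42 while its signature is being checked.
    state.note_received(42, 1);
    state.note_received(42, 3);
    // After the check passes, only peers 2 and 4 still need the message.
    println!("{:?}", state.gossip_targets(42, &[1, 2, 3, 4]));
}
```

The longer the check takes, the more duplicates arrive in the meantime, so the saving grows with exactly the latency that the pipelining hides.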
Putting this on-hold in favor of: #701 |
Will be addressed with #1617 |
Follow up from some earlier scaling tests with paritytech/polkadot#6247. The initial attempts made it apparent that approval-distribution and approval-voting are bottlenecks, increasing approval checking finality lag in our tests with 250 paravalidators and 50 parachains at the time.
That argument still holds true, but only at a much larger scale than what we were experimenting with at the time. The real issue was paritytech/polkadot#6400, which was solved.
Bumping up to 350+ para validators requires that we make approval-voting and approval-distribution do more work in the same amount of time. The first step is to implement the parallelization changes from paritytech/polkadot#6247, which would result in maxing out CPU usage in approval-voting, which right now sits at 80% due to the serial processing and waiting for completion of vote imports issued by approval-distribution. This will enable us to further improve how fast the subsystem churns through work by making approval-voting parallelise signature checks.
Important note: The implementation will make use of internal queueing of messages. We have to be careful to preserve the backpressure, so we need these internal queues to be bounded, for example by a fixed number of imports that can be pending at a given time.
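A bounded internal queue that preserves backpressure can be sketched with the standard library's `sync_channel`, whose `send` blocks once the queue holds the configured number of items. The bound `MAX_PENDING_IMPORTS`, the `check_signature` stand-in and `run_pipeline` are hypothetical names for illustration; the real subsystems use their own async channels and message types:

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical bound on imports that may be pending at any given time.
const MAX_PENDING_IMPORTS: usize = 4;

/// Stand-in for a signature check: here even numbers count as valid.
fn check_signature(vote: u64) -> bool {
    vote % 2 == 0
}

/// Feeds votes to a checking worker through a bounded queue and returns
/// how many votes passed the check.
fn run_pipeline(votes: Vec<u64>) -> usize {
    let (tx, rx) = sync_channel::<u64>(MAX_PENDING_IMPORTS);

    // Consumer: plays the role of approval-voting, checking votes as they arrive.
    let worker = thread::spawn(move || rx.iter().filter(|v| check_signature(*v)).count());

    // Producer: plays the role of approval-distribution. `send` blocks when
    // MAX_PENDING_IMPORTS votes are already queued, which is the backpressure.
    for vote in votes {
        tx.send(vote).expect("worker is alive while the sender exists");
    }
    drop(tx); // Close the queue so the worker's iterator terminates.

    worker.join().expect("worker thread panicked")
}

fn main() {
    let valid = run_pipeline((0..10).collect());
    println!("valid votes: {valid}");
}
```

With an unbounded channel the producer would never block, so a slow checker would let the queue grow without limit, and the upstream sender would get no signal to slow down.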