
Statement Distribution Per Peer Rate Limit #3444

Merged: 31 commits into master from mkz-statement-distribution-rate-limit on May 1, 2024

Conversation

@Overkillus (Contributor) commented on Feb 22, 2024:

  • Drop requests from a PeerID that is already being served by us (a minimal sketch of this responder-side drop follows below).
  • Don't send requests to a PeerID if we are already requesting something from them at that moment (prioritise other requests or wait).
  • Tests
  • ~~Add a small rep update for unsolicited requests (same-peer request)~~ not included in the original PR due to potential issues with nodes updating slowly.
  • Add a metric to track the number of requests dropped due to peer rate limiting.
  • Add a metric to track how many times a node reaches the max parallel requests limit in v2+.

Helps with, but does not yet close: https://github.com/paritytech-secops/srlabs_findings/issues/303
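
To make the mechanism concrete, here is a minimal, self-contained sketch of the responder-side drop plus the dropped-requests metric. This is not the actual subsystem code: `PeerId`, `IncomingRequest`, `maybe_serve`, and the plain counter are simplified stand-ins for the real types and the Prometheus metric.

```rust
use std::collections::HashSet;

// Simplified stand-ins for the real subsystem types.
type PeerId = u64;
struct IncomingRequest {
    peer: PeerId,
}

/// Serve the request only if this peer has nothing in flight with us;
/// otherwise drop it and bump the (stand-in) rate-limit metric.
fn maybe_serve(
    request: &IncomingRequest,
    active_peers: &mut HashSet<PeerId>,
    dropped_requests: &mut u64,
) -> bool {
    if active_peers.contains(&request.peer) {
        *dropped_requests += 1;
        return false
    }
    active_peers.insert(request.peer);
    true
}

fn main() {
    let mut active = HashSet::new();
    let mut dropped = 0u64;
    assert!(maybe_serve(&IncomingRequest { peer: 7 }, &mut active, &mut dropped));
    // A second request from the same peer, while the first is still being served, is dropped.
    assert!(!maybe_serve(&IncomingRequest { peer: 7 }, &mut active, &mut dropped));
    assert_eq!(dropped, 1);
}
```

In the real code the check sits in the async respond_task, as the diff excerpt in the first review thread below suggests.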

Review thread on the following diff excerpt:

```rust
    return
// If peer currently being served drop request
if active_peers.contains(&request.peer) {
    continue
```
Contributor:

Shouldn't this be handled in the context of a candidate/relay parent? Otherwise we would drop legit requests.

@Overkillus (Contributor, Author) commented on Feb 26, 2024:

This is definitely a fair point. Rob mentioned it in https://github.com/paritytech-secops/srlabs_findings/issues/303#issuecomment-1588185410 as well.

It's a double-edged sword. If we make it 1 per context then the request/response protocol is more efficient, but with async backing enabled (which to my understanding broadens the scope of possible contexts) the mpsc channel (size 25-ish) can still be easily overwhelmed. The problem of dropped requests gets somewhat better after the recent commit [Rate limit sending from the requester side](https://github.com/paritytech/polkadot-sdk/pull/3444/commits/58d828db45b7c023e8a8266d3180be4361723519), which also introduces the rate limit on the requester side so effort is not wasted.

If we are already requesting from a peer and want to request something else as well, we simply wait for the first one to finish.

What could be argued is making the limit a small constant (1-3) instead of a hard 1 per PeerID. I would like to do some testing and gather more reviews; changing it to a configurable constant instead of a hard limit of 1 is a trivial change. (I don't think it's necessary. Better to have a single completely sent candidate than two half-sent candidates.)

Truth be told, this whole solution is far from ideal. It's a fix, but the final solution is to manage DoS at a significantly lower level than parachain consensus. It will do for now, but DoS needs to be looked at in more detail in the near future.
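
As a rough illustration of the requester-side half described above, a sketch only (the queue shape, the `next_request` helper, and the types are assumptions for the example, not the actual implementation): requests to a peer that already has one in flight simply wait their turn while other requests are prioritised.

```rust
use std::collections::{HashSet, VecDeque};

type PeerId = u64;
struct OutgoingRequest {
    peer: PeerId,
}

/// Pick the next request whose peer has nothing in flight from us;
/// requests to busy peers stay queued until the earlier one finishes.
fn next_request(
    queue: &mut VecDeque<OutgoingRequest>,
    in_flight: &HashSet<PeerId>,
) -> Option<OutgoingRequest> {
    let idx = queue.iter().position(|r| !in_flight.contains(&r.peer))?;
    queue.remove(idx)
}

fn main() {
    let mut queue: VecDeque<_> =
        vec![OutgoingRequest { peer: 1 }, OutgoingRequest { peer: 2 }].into();
    let in_flight: HashSet<PeerId> = [1u64].into_iter().collect();
    // Peer 1 already has a request in flight, so the request to peer 2 goes out first.
    assert_eq!(next_request(&mut queue, &in_flight).map(|r| r.peer), Some(2));
}
```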

Contributor:

> This is definitely a fair point. Rob mentioned it paritytech-secops/srlabs_findings#303 (comment) as well.
>
> It's a double-edged sword. If we make it 1 per context then the request/response protocol is more efficient, but with async backing enabled (which to my understanding broadens the scope of possible contexts) the mpsc channel (size 25-ish) can still be easily overwhelmed. The problem of dropped requests gets somewhat better after the recent commit [Rate limit sending from the requester side](https://github.com/paritytech/polkadot-sdk/pull/3444/commits/58d828db45b7c023e8a8266d3180be4361723519), which also introduces the rate limit on the requester side so effort is not wasted.
>
> If we are already requesting from a peer and want to request something else as well, we simply wait for the first one to finish.

I would rather put the constraint on the responder side than on the requester. The reasoning is to avoid any bugs in other clients' implementations due to the subtle requirement to wait for the first request to finish.

> What could be argued is making the limit a small constant (1-3) instead of a hard 1 per PeerID. I would like to do some testing and gather more reviews; changing it to a configurable constant instead of a hard limit of 1 is a trivial change. (I don't think it's necessary. Better to have a single completely sent candidate than two half-sent candidates.)

I totally agree with putting DoS protection as low as possible in the stack, but this seems sensible for now. Maybe instead of relying on a hardcoded constant we could derive the value from the async backing parameters.

@Overkillus (Contributor, Author):

> I would rather put the constraint on the responder side than on the requester. The reasoning is to avoid any bugs in other clients' implementations due to the subtle requirement to wait for the first request to finish.

I agree that the constraint on the responder side is more important, but having it in both places seems even better.

> The reasoning is to avoid any bugs in other clients' implementations due to the subtle requirement to wait for the first request to finish.

Even without this PR the risk seems exactly the same. We were already checking the number of parallel requests being handled on the requester side, so even if there were bugs in other clients they could surface there with or without this PR.

The only really bad scenario would be if for some reason we kept a PeerID marked as still active when it really isn't, effectively blacklisting that peer. If that happened, it would mean the future connected to that PeerID is still in pending_responses, which would eventually brick the requester anyway since we cannot go over the MAX_PARALLEL_ATTESTED_CANDIDATE_REQUESTS limit. Both situations are just as fatal, but the new system at least has extra protection against DoS attempts, so it's a security gain.

By adding the constraint on the requester side we limit the wasted effort and can potentially safely add reputation updates for unsolicited requests, which protects us from DoS even further. (Some honest requests might slip through, so the rep update needs to be small.)

> Maybe instead of relying on a hardcoded constant we could derive the value from the async backing parameters.

We potentially could, but I also don't see why this value would need to change much. Even if we change some async backing params we should be fine. It seems to be more sensitive to the channel size (we want to make it hard for a malicious node to dominate that queue) and to MAX_PARALLEL_ATTESTED_CANDIDATE_REQUESTS.

@burdges commented on Mar 21, 2024:

You need not choose a uniform distribution over recent relay parents either. Instead, allow the full value on the first couple, then 1/2 and 1/4 for a few, and then 1 for a while. You need to figure out whether the selected distribution breaks collators, but it might work.
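
A toy version of that idea might look like the following. It is purely illustrative: the function name, the age buckets, and the floor of 1 are assumptions made for this sketch, not anything from the PR.

```rust
/// Per-relay-parent request budget that decays with the relay parent's age:
/// the full value for the newest couple, then 1/2 and 1/4 for a few, then 1.
fn budget_for_relay_parent(age_in_blocks: usize, full_budget: usize) -> usize {
    match age_in_blocks {
        0 | 1 => full_budget,
        2 | 3 => (full_budget / 2).max(1),
        4 | 5 => (full_budget / 4).max(1),
        _ => 1,
    }
}

fn main() {
    assert_eq!(budget_for_relay_parent(0, 8), 8);
    assert_eq!(budget_for_relay_parent(3, 8), 4);
    assert_eq!(budget_for_relay_parent(10, 8), 1);
}
```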

@alexggh (Contributor) left a comment:

Added some comments. What is not clear to me is why can_request from answer_request is not considered fast enough; that already has the benefit that we punish peers that are misbehaving.

Just my 2 cents: if what we want is to prevent peers from exhausting the budget for the number of connections, I would think the proper fix needs to go deeper into the network layer, rather than here, since a lot of time is spent before the message reaches this point.

@Overkillus (Contributor, Author) commented:

re: @alexggh

> Added some comments. What is not clear to me is why can_request from answer_request is not considered fast enough; that already has the benefit that we punish peers that are misbehaving.

Technically a fair point, but I think there's a reason. It's not about speed; yes, respond_task is a bit quicker than answer_request, so the gain there seems minimal. It would also logically make sense for this logic to live in something called can_request.

Even before this change we tracked parallel requests up to the limit (MAX_PARALLEL_ATTESTED_CANDIDATE_REQUESTS) in respond_task, and my guess is that this is simply because respond_task encapsulates the async component of the request filtering procedure, while can_request focuses on the simpler sync filtering. To properly filter on the most current PeerID this logic needed to be async as well, hence it also sits in the async respond_task.

can_request primarily does the filtering based on the knowledge we have about what was sent/received previously, while the async respond_task is very time-sensitive, as we are operating on what is currently being sent over the wire.

> Just my 2 cents: if what we want is to prevent peers from exhausting the budget for the number of connections, I would think the proper fix needs to go deeper into the network layer, rather than here, since a lot of time is spent before the message reaches this point.

I 100% agree with you here. I said this myself, and Robert and I have already spoken about it; that approach will be pursued. That's why this solution is not ideal, but it at least makes the attack vector much harder to pull off with minimal changes to the current code-base. Not ideal, but worth the small effort before we tackle the wider, more general issue at the lower level.
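
Schematically, the split described above looks roughly like this. It is an annotated skeleton with placeholder bodies and signatures, not the actual module.

```rust
/// Cheap, synchronous filtering based on what we already know was
/// sent/received previously (the `can_request`-style checks).
fn can_request(/* peer knowledge, per-candidate bookkeeping */) -> bool {
    true // placeholder
}

/// The async side owns the in-flight state, so both the
/// MAX_PARALLEL_ATTESTED_CANDIDATE_REQUESTS cap and the new per-peer
/// limit are enforced here, where responses are still being sent out.
async fn respond_task(/* request receiver, active_peers, pending_responses */) {
    // placeholder body
}
```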

@Overkillus Overkillus self-assigned this Apr 2, 2024
@Overkillus Overkillus marked this pull request as ready for review April 2, 2024 08:38
@Overkillus Overkillus added T0-node This PR/Issue is related to the topic “node”. I1-security The node fails to follow expected, security-sensitive, behaviour. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Apr 2, 2024
@Overkillus Overkillus requested a review from a team as a code owner April 2, 2024 09:44
Review thread on the following diff excerpt:

```rust
msg: NetworkBridgeTxMessage::SendRequests(mut requests, if_disconnected),
} => {
    // AttestedCandidateV2 requests arrive 1 by 1
    if requests.len() == 1 {
```
Contributor:

This might be true for now, but can easily be broken in the future. Why not loop through all incoming requests?

@Overkillus (Contributor, Author):

Handled multiple requests 👍

Comment on lines 79 to 86:

```rust
for _ in 0..self.spam_factor - 1 {
    let (new_outgoing_request, _) = OutgoingRequest::new(
        peer_to_duplicate.clone(),
        payload_to_duplicate.clone(),
    );
    let new_request = Requests::AttestedCandidateV2(new_outgoing_request);
    requests.push(new_request);
}
```
Contributor:

It should work without needing peer_to_duplicate and payload_to_duplicate clones:

Suggested change (the original loop, followed by the proposed replacement):

```rust
for _ in 0..self.spam_factor - 1 {
    let (new_outgoing_request, _) = OutgoingRequest::new(
        peer_to_duplicate.clone(),
        payload_to_duplicate.clone(),
    );
    let new_request = Requests::AttestedCandidateV2(new_outgoing_request);
    requests.push(new_request);
}
```

```rust
requests.extend((0..self.spam_factor - 1).map(|_| {
    let (new_outgoing_request, _) = OutgoingRequest::new(
        peer_to_duplicate.clone(),
        payload_to_duplicate.clone(),
    );
    Requests::AttestedCandidateV2(new_outgoing_request)
}));
```

@Overkillus (Contributor, Author):

Borrowed the extend syntax but still need the clones to satisfy the borrow checker.

Three resolved (outdated) review threads on polkadot/node/network/statement-distribution/src/v2/mod.rs.
@sandreim (Contributor) left a comment:

Nice work!


Review thread on the following diff excerpt:

```rust
}
_ => {
    new_requests.push(request)
```
Contributor:

We can dedup this line by moving it above the match.

@Overkillus (Contributor, Author):

This one is somewhat of a conscious trade-off, but if you see a graceful way of doing it feel free to suggest it.

The problem is that Request does not implement Copy/Clone. I could possibly track the index where I push it, but that would make it even less readable imo.

One more resolved (outdated) review thread on polkadot/node/network/statement-distribution/src/v2/mod.rs.
@alexggh (Contributor) left a comment:

Good job!

@Overkillus Overkillus enabled auto-merge May 1, 2024 16:56
@Overkillus Overkillus added this pull request to the merge queue May 1, 2024
Merged via the queue into master with commit 6d392c7 May 1, 2024
140 checks passed
@Overkillus Overkillus deleted the mkz-statement-distribution-rate-limit branch May 1, 2024 17:41
dcolley added a commit to metaspan/polkadot-sdk that referenced this pull request May 6, 2024
TarekkMA pushed a commit to moonbeam-foundation/polkadot-sdk that referenced this pull request Aug 2, 2024
Labels
I1-security The node fails to follow expected, security-sensitive, behaviour. T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network.
Projects
Status: Backlog
5 participants