[Deliverable] DOS protection for req-res protocols and metrics #66
Comments
I can see that all issues tagged on this are "not critical for launch" except for
Would be good to review this epic and see if we should postpone it. Or even include it in Gen 0? Or at least, focus on docs for operators with waku-org/nwaku#1946 |
Issues de-scoped from Gen 0 milestone:
cc @jm-clius waku-org/nwaku#1938 |
Weekly Update
|
Scope signed-off during EU-NA pm 2024-02-19. |
This took longer than expected. Any specific reasons @Ivansete-status ? From a client PoC, meaning that a service may reject a request due to reaching the rate limit, is this handled? |
@fryorcraken, @Ivansete-status: Yes, it was dependent on me; no specific reason in terms of the feature, only other tasks caused a bit of distraction. Several phases were done, and redesigns happened in the meantime.
|
initial work from
we intend to address it as part of @NagyZoltanPeter is there a task for upgrading |
Regarding what exactly? Do you mean the new protocol definition? This one: waku-org/nwaku#2722 |
Talking with @NagyZoltanPeter , can we please clarify the rollout strategy? I understand that filter and light push rate limits are already deployed on status.prod because they are set by default. Store is not set up and is actually the more difficult one, as light clients can use Status Desktop nodes for light push and filter services. For store, there are a few things to take into consideration:
|
Sorry, I might not have been clear. For lightpush and store the default is no rate limit. If it turns out that we need different rate-limit settings for different protocols, we will need a separate configuration or derive a final value out of it. |
Back to the rollout strategy, we also want to monitor before/after and extract a good value to set from the current data. |
It's probably due to the storenode selection heuristic used by the storenode cycle:
Since we only have 6 storenodes in the fleet, the fastest 25% (rounded up) that it tends to choose will always be the 2 geographically closest storenodes. Since most core contributors are located in Europe, this will show up in the data as a preference for Amsterdam storenodes. In my particular case, status-go will tend to prefer those in US Central |
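To illustrate the heuristic, here is a minimal Go sketch (not status-go's actual implementation; node names and RTTs are made up) of selecting the fastest 25% of storenodes, rounded up, which with a 6-node fleet always resolves to the 2 lowest-latency nodes:

```go
// Hypothetical sketch of the "fastest 25% (rounded up)" selection described above.
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

type storenode struct {
	addr string
	rtt  time.Duration // measured latency to this node (illustrative)
}

func fastestQuartile(nodes []storenode) []storenode {
	sorted := append([]storenode(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].rtt < sorted[j].rtt })
	// ceil(len/4): the "fastest 25% (rounded up)" of the fleet.
	k := int(math.Ceil(float64(len(sorted)) / 4.0))
	return sorted[:k]
}

func main() {
	// Made-up fleet of 6 storenodes, as seen from a European client.
	fleet := []storenode{
		{"store-01.ams", 25 * time.Millisecond},
		{"store-02.ams", 30 * time.Millisecond},
		{"store-01.usc", 120 * time.Millisecond},
		{"store-02.usc", 125 * time.Millisecond},
		{"store-01.hkg", 210 * time.Millisecond},
		{"store-02.hkg", 215 * time.Millisecond},
	}
	// ceil(6/4) = 2, so the two Amsterdam nodes always win for this client.
	fmt.Println(fastestQuartile(fleet))
}
```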
Looking at this deliverable and the matching work for go-waku (as service node), the dogfooding is difficult as it is directly related to resource availability and consumption. The scenarios that can go wrong are:
From this, it seems that the practical approach is to have proper monitoring so we can measure each point above and adjust as needed. What we should monitor is:
A. Fleet: metrics directly correlated to the rate we are setting, e.g. the ability to see from a graph what ratio of the rate limit we are at (e.g. 90% of the rate limit), both globally and locally. E.g. we need to know if a lot of users (IPs?) are reaching 100% of their quota, potentially indicating the quota is too low.
B. Clients: reason for connection/request failures, so we know whether failures are due to DoS rate limiting and for which service (filter, lp, store). Tracked in #182.
C. Clients: bandwidth, memory and CPU consumption of the Waku node. We already have plans to track bandwidth; CPU and memory will need review.
D. Clients: rate limit of the local service node (filter, lp, desktop only). It may make sense to report only specific events, e.g. "reached 80% of the global rate limit on a service" or "rejected a client for reaching the per-IP rate limit". This work will need to be reviewed and planned. We may also decide to set up alerts on thresholds. |
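As a rough illustration of the kind of instrumentation points A, B and D call for, here is a hedged Go sketch using the Prometheus client; the metric and label names are hypothetical, not actual nwaku/go-waku metrics:

```go
// Illustrative instrumentation sketch: counters for served vs. rate-limited
// requests per req-res protocol, from which a "% of rate limit used" panel
// and threshold alerts (e.g. at 80%) could be derived in Grafana.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Hypothetical metric names, purely illustrative.
	servedRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "waku_service_requests_total",
		Help: "Requests served, per req-res protocol.",
	}, []string{"protocol"}) // "filter", "lightpush", "store"

	rejectedRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "waku_service_requests_rejected_total",
		Help: "Requests rejected due to rate limiting, per protocol and scope.",
	}, []string{"protocol", "scope"}) // scope: "global" or "per_peer"
)

func init() {
	prometheus.MustRegister(servedRequests, rejectedRequests)
}

// RecordRejection would be called wherever a protocol handler drops a request.
func RecordRejection(protocol, scope string) {
	rejectedRequests.WithLabelValues(protocol, scope).Inc()
}
```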
Completely agree with your thoughts above @fryorcraken . But what you are suggesting seems more like tuning rate limits on the fleet and service nodes rather than dogfooding. We need to do this anyway in order to understand what the optimal rate limits to set are. Then again, we can't predict user count and the mobile/desktop user ratio in the network, which means this may be an ongoing exercise for some time as Status users grow. If we want to dogfood features (such as rate limits) specifically in fleet and desktop service nodes, we can do it in an isolated environment with status-cli and nwaku instances, but it would require some work to be done. |
@NagyZoltanPeter , I assume you mean. Also, I am sure this is the case, but just to confirm: there is no rate limit on how many messages are delivered to a filter client, right? |
Hm, the rate limit is applied to all subscribe/unsubscribe/ping requests. Of course it is a guessed value and we can change it. |
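For context, a per-peer token-bucket limiter along these lines could look like the following Go sketch (using golang.org/x/time/rate; the numbers are placeholder guesses, not nwaku's actual values):

```go
// Minimal per-peer token-bucket sketch: every subscribe/unsubscribe/ping
// request draws from the same per-peer bucket.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

type perPeerLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	r        rate.Limit // sustained requests/second allowed per peer
	burst    int        // short bursts tolerated per peer
}

func newPerPeerLimiter(r rate.Limit, burst int) *perPeerLimiter {
	return &perPeerLimiter{limiters: map[string]*rate.Limiter{}, r: r, burst: burst}
}

// Allow reports whether the given peer may issue one more request now.
func (p *perPeerLimiter) Allow(peerID string) bool {
	p.mu.Lock()
	l, ok := p.limiters[peerID]
	if !ok {
		l = rate.NewLimiter(p.r, p.burst)
		p.limiters[peerID] = l
	}
	p.mu.Unlock()
	return l.Allow()
}

func main() {
	lim := newPerPeerLimiter(30, 30) // hypothetical: 30 req/s, burst 30, per peer
	for i := 0; i < 35; i++ {
		if !lim.Allow("peerA") {
			fmt.Println("request", i, "rejected: per-peer rate limit reached")
		}
	}
}
```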
Well, I thought I saw a corner case where unsubscribe gets rejected by the rate limit... but that's OK since the rate limit is quite high per peer. You can ignore that comment. |
It's also part of nwaku so we will still do some dogfooding on status.staging. Not saying we should skip it. |
New dashboard: waku-org/nwaku#3025 But now you can browse these metrics live: https://grafana.infra.status.im/d/qrp_ZCTGz/nim-waku-v2?orgId=1 |
While dogfooding Store rate limits in the Status fleets, an issue was noticed that made us revert the change, as it would otherwise affect message reliability for Status users. Issue found: waku-org/go-waku#1219 — Status clients are not able to switch to an alternate store node for periodic message checks when the store node returns a rate-limit failure. |
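A simplified sketch of the missing failover behaviour (types and the rate-limit sentinel error are hypothetical, not go-waku's actual API): when the selected store node answers with a rate-limit failure, the periodic message check should move on to the next candidate instead of giving up.

```go
// Illustrative failover loop over candidate store nodes.
package store

import (
	"context"
	"errors"
	"fmt"
)

var errRateLimited = errors.New("store node rejected query: rate limit exceeded")

// queryFn stands in for a store query against one specific node.
type queryFn func(ctx context.Context, node string) error

// queryWithFailover tries each store node in turn, switching to the next
// candidate on a rate-limit failure and stopping on the first success.
func queryWithFailover(ctx context.Context, nodes []string, query queryFn) error {
	var lastErr error
	for _, n := range nodes {
		err := query(ctx, n)
		if err == nil {
			return nil
		}
		if errors.Is(err, errRateLimited) {
			lastErr = err
			continue // switch to the alternate store node
		}
		return err // non-rate-limit errors are surfaced immediately
	}
	return fmt.Errorf("all store nodes exhausted: %w", lastErr)
}
```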
@chair28980 it means we won't be able to close this until 2.31.0 is released. Do we have a date for it? |
@chair28980 @fryorcraken @Ivansete-status Last part - peer-exchange DOS protection and full CLI config - will be released with nwaku v0.33.0. |
fyi @chair28980 , @fryorcraken
Additionally, we've added a separate feature to measure the possible impact between req-resp protocols. |
Need dashboard link so we can easily refer to it. |
Sorry for the delayed response. The original release date for 2.31.0 was scheduled for Sep 26th (now past); it appears to be delayed. I'll follow up with the projected release date. |
I think it's done, we just need a link to the dashboard/panels we are referring to. |
Project: https://github.com/orgs/waku-org/projects/11/views/1
Summary
Current minimum scope to implement:
Descoped from this milestone:
As the autosharded public network grows and traffic increases per shard, we want to provide some bandwidth management mechanisms for relay nodes to dynamically choose the number of shards they support based on bandwidth availability. For example, when the network launches, it's reasonable for relay nodes to support all shards and gradually unsubscribe from shards as bandwidth usage increases. The minimum number of shards to support would be 1, so the network design and DoS mechanisms (see Track 3) would have to provide predictable limits on max bandwidth per shard. We could also envision a bandwidth protection mechanism that drops messages over a threshold, but this would affect the node's scoring, so it should be carefully planned.
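As a back-of-the-envelope illustration (all numbers and names are assumptions, not part of the design), the shard count a relay node keeps could be derived from its bandwidth budget and the predictable per-shard cap:

```go
// Hypothetical sketch: pick how many shards to stay subscribed to, never
// dropping below one, given a bandwidth budget and a per-shard bandwidth cap.
package main

import "fmt"

// shardsToSupport returns how many shards a relay node can afford, with
// budget and perShardMax in the same unit (e.g. Mbps).
func shardsToSupport(budget, perShardMax float64, totalShards int) int {
	if perShardMax <= 0 {
		return totalShards // no per-shard cap known: stay on all shards
	}
	n := int(budget / perShardMax)
	if n < 1 {
		n = 1 // minimum: a relay node always supports at least one shard
	}
	if n > totalShards {
		n = totalShards
	}
	return n
}

func main() {
	// e.g. a 50 Mbps budget with a 10 Mbps per-shard cap on an 8-shard network
	fmt.Println(shardsToSupport(50, 10, 8)) // -> 5
}
```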
Epics
Output