
[Deliverable] DOS protection for req-res protocols and metrics #66

Open · 2 of 3 tasks
chair28980 opened this issue Sep 1, 2023 · 26 comments

Labels: Deliverable (Tracks a Deliverable)

@chair28980 (Contributor) commented Sep 1, 2023

Project: https://github.com/orgs/waku-org/projects/11/views/1

Summary

The current minimum scope is to implement:

  • Bandwidth measurement and metrics for per-shard traffic (see the metrics sketch after this list).
  • DoS protection for service nodes by applying request rate limits to the non-relay protocols.
    • This also imposes a limited bandwidth cap on those protocols.
  • A failsafe mechanism for third-party apps / client-side support for handling request rejection.
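
Purely as illustration, a minimal sketch of what per-shard bandwidth accounting could look like, written in Go with the Prometheus client (the metric name, labels and helper are hypothetical, not the actual nwaku/go-waku metrics):

```go
// Hypothetical per-shard bandwidth accounting sketch (nwaku itself is Nim).
package metrics

import "github.com/prometheus/client_golang/prometheus"

// bytesPerShard counts relay traffic in bytes, labelled by shard and direction.
var bytesPerShard = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "waku_relay_bandwidth_bytes_total", // illustrative metric name
		Help: "Relay traffic per shard, in bytes.",
	},
	[]string{"shard", "direction"}, // direction: "in" or "out"
)

func init() {
	prometheus.MustRegister(bytesPerShard)
}

// RecordMessage would be called from the relay message handler.
func RecordMessage(shard, direction string, sizeBytes int) {
	bytesPerShard.WithLabelValues(shard, direction).Add(float64(sizeBytes))
}
```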

Descoped from this milestone:
As the autosharded public network grows and traffic increases per shard, we want to provide bandwidth management mechanisms for relay nodes to dynamically choose the number of shards they support based on bandwidth availability. For example, when the network launches it is reasonable for relay nodes to support all shards, and to gradually unsubscribe from shards as bandwidth usage increases. The minimum number of shards to support would be 1, so the network design and DoS mechanisms (see Track 3) would have to provide predictable limits on the maximum bandwidth per shard. We could also envision a bandwidth protection mechanism that drops messages over a threshold, but this would affect the node's scoring and so should be carefully planned.

Epics

Output

@chair28980 chair28980 added the Epic Tracks a sub-team Epic. label Sep 1, 2023
@fryorcraken fryorcraken added this to Waku Sep 1, 2023
@fryorcraken fryorcraken added this to the Waku Network Gen 0 milestone Sep 5, 2023
@fryorcraken fryorcraken added the E:1.3: Node bandwidth management mechanism See https://github.com/waku-org/pm/issues/66 for details label Sep 8, 2023
@fryorcraken (Contributor)

I can see that all issues tagged on this epic are "not critical for launch", except for feat: Validation mechanism to limit "free traffic" on the network, for which I understand there is uncertainty about:

  • practicality
  • feasibility

It would be good to review this epic and see if we should postpone it, or even include it in Gen 0. Or at least focus on docs for operators with waku-org/nwaku#1946.

cc @jm-clius @alrevuelta @vpavlin

@chair28980 (Contributor, Author)

Issues de-scoped from Gen 0 milestone:

I propose that we descope the effort to provide a "free tier" of bandwidth in the network for now (part of Epic 1.3: Node bandwidth mechanism). This would have allowed up to ~1 Mbps of messages without RLN proofs (i.e. publishers would not require RLN memberships), theoretically making it easier for early adopters to trial the tech. However, based on discussions we've had in the meantime and the fundamental unreliability of such a mechanism, I propose we descope/deprioritise work related to this and continue designing the network around mandatory RLN memberships. Let me know if you have strong objections or ideas.

cc @jm-clius

waku-org/nwaku#1938
waku-org/js-waku#1503
waku-org/go-waku#677

@Ivansete-status

Weekly Update

@chair28980 chair28980 changed the title [Epic] 1.3: Node bandwidth management mechanism [Milestone] Node Bandwidth Management Mechanism Feb 2, 2024
@chair28980 chair28980 added Deliverable Tracks a Deliverable and removed Epic Tracks a sub-team Epic. E:1.3: Node bandwidth management mechanism See https://github.com/waku-org/pm/issues/66 for details labels Feb 2, 2024
@chair28980 chair28980 modified the milestones: Waku Network Gen 0, Node Bandwidth Management Mechanism Feb 2, 2024
@chair28980 (Contributor, Author)

Scope signed off during the EU-NA PM meeting of 2024-02-19.

@chair28980 chair28980 changed the title [Milestone] Node Bandwidth Management Mechanism [Milestone] DOS protection for req-res protocols and metrics Feb 19, 2024
@chair28980 chair28980 changed the title [Milestone] DOS protection for req-res protocols and metrics [Deliverable] DOS protection for req-res protocols and metrics May 27, 2024
@fryorcraken (Contributor)

This took longer than expected. Any specific reasons, @Ivansete-status?
What is the status?
Has dogfooding started, or will it be done with 0.31.0?

From a client PoV, meaning that a service may reject a request due to reaching a rate limit: is this handled?
@richard-ramos @chaitanyaprem for go-waku
@weboko for js-waku

@NagyZoltanPeter commented Jul 22, 2024

This took longer than expected. Any specific reasons, @Ivansete-status? What is the status? Has dogfooding started, or will it be done with 0.31.0?

From a client PoV, meaning that a service may reject a request due to reaching a rate limit: is this handled? @richard-ramos @chaitanyaprem for go-waku @weboko for js-waku

@fryorcraken, @Ivansete-status: Yes, it depended on me; no specific reason in terms of the feature, only other tasks caused a bit of distraction. Several phases and redesigns were done in the meantime.
Yes, the full feature is part of the 0.31.0 release.

  • For filter, an estimated viable request rate is set by default.
  • For lightpush and store, it must be configured via the CLI.
    • For lightpush, RLN is more restrictive in any case (if applied), but this protection can prevent flooding the protocol with bogus requests.
    • For store, we need to figure out a good, balanced number based on experimental data: what clients need to operate well versus what load a node can handle.

@weboko commented Jul 23, 2024

Initial work on the js-waku side was done by handling more error codes and upgrading the API.

a service may reject a request due to reaching a rate limit: is this handled

We intend to address it as part of req-res reliability with waku-org/js-waku#2054.

@NagyZoltanPeter, is there a task for upgrading lightPush on the nwaku side?

@NagyZoltanPeter commented Jul 23, 2024

@NagyZoltanPeter, is there a task for upgrading lightPush on the nwaku side?

Regarding what exactly? Do you mean the new protocol definition? This one: waku-org/nwaku#2722

@chair28980 chair28980 moved this to In Progress in Waku Aug 19, 2024
@fryorcraken (Contributor)

Talking with @NagyZoltanPeter: can we please clarify the roll-out strategy?

I understand that filter and lightpush rate limits are already deployed on status.prod because they are set by default.

Store is not set up and is actually the more difficult one, as light clients can use Status Desktop nodes for lightpush and filter services.

For store, there are a few things to take into consideration:

  • I believe we noticed that one store node is used more than the others. Not sure we understand why. cc @richard-ramos
  • There is less traffic on status.staging, so enabling a store rate limit there may not allow us to learn much about impact.
  • There are several store nodes in status.prod, so we could enable the rate limit on one node, see the impact, and then enable it on the other nodes.
  • Waku Store performance is not yet resolved. If there is a specific resource that is getting starved, it would be interesting to use this rate limit to help offload the starved resource and improve the overall experience for users.

@NagyZoltanPeter

I understand that filter and lightpush rate limits are already deployed on status.prod because they are set by default.

Sorry, I might not have been clear.
Exactly: the nwaku filter protocol has a hard-coded rate limit applied (without any configuration).
It is 30 requests per minute for each subscriber peer.

For lightpush and store, the default is no rate limit.
We can currently apply one CLI config (which applies to both): --request-rate-limit and --request-rate-period.

If it turns out that we need different rate limit settings for different protocols, we will need a separate configuration or derive a final value from it.
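
For illustration, a minimal sketch of the kind of per-peer request rate limiting described above, written in Go (nwaku itself is Nim, so this is not the actual implementation; the type and function names are hypothetical). It mirrors the idea behind --request-rate-limit (max requests) and --request-rate-period (window length):

```go
// Hypothetical per-peer, fixed-window request rate limiter sketch.
package ratelimit

import (
	"sync"
	"time"
)

type window struct {
	start time.Time
	count int
}

type PerPeerLimiter struct {
	mu     sync.Mutex
	limit  int           // e.g. 30 requests...
	period time.Duration // ...per minute
	peers  map[string]*window
}

func New(limit int, period time.Duration) *PerPeerLimiter {
	return &PerPeerLimiter{limit: limit, period: period, peers: make(map[string]*window)}
}

// Allow reports whether peerID may make another request in the current window.
func (l *PerPeerLimiter) Allow(peerID string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	now := time.Now()
	w, ok := l.peers[peerID]
	if !ok || now.Sub(w.start) >= l.period {
		// Start a fresh window for this peer.
		l.peers[peerID] = &window{start: now, count: 1}
		return true
	}
	if w.count >= l.limit {
		return false // the service would answer with a rate-limit rejection
	}
	w.count++
	return true
}
```

A service handler would call Allow(peerID) before serving a filter/lightpush/store request and, when it returns false, reply with the protocol's "too many requests"-style rejection instead of processing the request.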

@fryorcraken (Contributor)

Back to the roll-out strategy: we also want to monitor before/after and extract a good value to set from the current data.

@richard-ramos (Member)

I believe we noticed that one store node is used more than the others. Not sure we understand why. cc @richard-ramos

It's probably due to the storenode selection heuristic used by the storenode cycle (sketched after this comment):

  1. We ping all storenodes.
  2. Order them by reply time, lowest to highest.
  3. Choose a storenode at random from the fastest 25% (the first quartile of all storenode replies, ordered by RTT ascending).

Since we only have 6 storenodes in the fleet, the fastest 25% (rounded up) will always be the 2 geographically closest storenodes. Since most core contributors are located in Europe, this shows up in the data as a preference for the Amsterdam storenodes.

In my particular case, status-go tends to prefer those in US Central.
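
Roughly, the heuristic looks like this (an illustrative Go sketch, not the actual status-go code; names and types are hypothetical):

```go
// Illustrative sketch of the storenode selection heuristic described above.
package storecycle

import (
	"math"
	"math/rand"
	"sort"
	"time"
)

type Storenode struct {
	ID  string
	RTT time.Duration // measured by pinging the node
}

// SelectStorenode orders nodes by RTT and picks one at random from the
// fastest quartile (rounded up).
func SelectStorenode(nodes []Storenode) Storenode {
	sorted := make([]Storenode, len(nodes))
	copy(sorted, nodes)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].RTT < sorted[j].RTT })

	// First quartile, rounded up: with 6 nodes this is 2, so the two
	// geographically closest nodes are always the candidates.
	quartile := int(math.Ceil(float64(len(sorted)) / 4.0))
	return sorted[rand.Intn(quartile)]
}
```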

@fryorcraken (Contributor) commented Sep 2, 2024

Looking at this deliverable and the matching work for go-waku (as a service node): the dogfooding is difficult as it is directly related to resource availability and consumption.

The scenarios that can go wrong are:

  1. Limits are set too high in the fleet: even when the resources of a fleet node are exhausted, clients stay connected and get a degraded (i.e. slow) service.
  2. Limits are set too low in the fleet:
    a. For store: this is a problem because the only store nodes are the ones in the fleet, meaning that clients may not have access to a store node even when we have the resources. The UX impact is that users can't download messages.
    b. For other services: clients fall back to using desktop nodes for lightpush/filter, which is fine as long as the mobile/desktop ratio is fair and there is no problem with desktop lightpush/filter limits (see 3 and 4).
  3. Limits are set too low on desktop nodes: not enough slots for mobile users, so mobile users may have difficulty receiving/sending messages.
  4. Limits are set too high on desktop nodes: too much bandwidth/CPU consumed on desktop nodes, impacting users mainly by slowing the app.
  5. (Not a direct consequence of this, but to keep in mind): assuming limits are properly configured on fleet store nodes, if the limits are regularly exhausted then more resources should be deployed (per-shard granularity is necessary here).

From this, it seems that the practical approach is to have proper monitoring so we can measure each point above and adjust as needed.

What we should monitor is:

A. Fleet: metrics directly correlated with the rate limits we set, e.g. the ability to see from a graph what ratio of the rate limit we are at (e.g. 90% of the rate limit), both globally and locally (per user/IP). For instance, we need to know whether a lot of users (per IP?) are reaching 100% of their quota, potentially indicating that the quota is too low (see the metrics sketch after this comment).

B. Clients: the reason for a connection/request failure, so we know whether failures are due to DoS rate limits and for which service (filter, lightpush, store). Tracked in #182.

C. Clients: bandwidth, memory and CPU consumption of the Waku node. We already have plans to track bandwidth; CPU and memory will need review.

D. Clients: rate limits of the local service node (filter, lightpush; desktop only). It may make sense to report only specific events, e.g. "reached 80% of the global rate limit on a service" or "rejected a client for reaching the per-IP rate limit". Will need review to plan this work.

We may also decide to set up alerts on thresholds.
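
A minimal sketch (in Go, using the Prometheus client; the metric names are illustrative, not the actual nwaku/go-waku metrics) of the kind of rate-limit utilization metrics point A describes:

```go
// Hypothetical rate-limit utilization metrics, per point A above.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Requests served and rejected per protocol, so a dashboard can show the
	// rejected/served ratio per service (filter, lightpush, store).
	servedRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "waku_service_requests_total", Help: "Requests served per protocol."},
		[]string{"protocol"},
	)
	rejectedRequests = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "waku_service_requests_rejected_total", Help: "Requests rejected by the rate limiter."},
		[]string{"protocol"},
	)
	// Fraction of the configured rate limit used in the current window
	// (0.9 means the node is at 90% of its limit).
	limitUtilization = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{Name: "waku_service_rate_limit_utilization", Help: "Current window usage / configured limit."},
		[]string{"protocol"},
	)
)

func init() {
	prometheus.MustRegister(servedRequests, rejectedRequests, limitUtilization)
}

// ObserveRequest records one request outcome and the current utilization.
func ObserveRequest(protocol string, accepted bool, usedInWindow, limit int) {
	if accepted {
		servedRequests.WithLabelValues(protocol).Inc()
	} else {
		rejectedRequests.WithLabelValues(protocol).Inc()
	}
	limitUtilization.WithLabelValues(protocol).Set(float64(usedInWindow) / float64(limit))
}
```

Alerts on thresholds (e.g. sustained utilization above 0.8, or a rising rejected/served ratio) could then be defined on top of these series.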

@chaitanyaprem commented Sep 2, 2024

Completely agree with your thoughts above, @fryorcraken.

But what you are suggesting seems more like tuning rate limits on the fleet and service nodes rather than dogfooding. We need to do this anyway in order to understand the optimal rate limits to set. Then again, we can't predict the user count and the mobile/desktop user ratio in the network, which means this may be an ongoing exercise for some time as the Status user base grows.
To help us understand whether rate limits are being hit and a node's resources are being consumed, I agree it would be good to use telemetry/some sort of metrics to keep track of service usage (i.e. whether rate limits are being hit, whether the node is overwhelmed due to the configured rate limits, etc.).
In the long run, this metrics framework (not saying telemetry here, since not all users may enable it) would actually help users take decisions, such as noticing that their own desktop node is running out of resources so they can tune their rate limits, or automating this in code to reduce rate limits based on resource consumption.

If we want to dogfood features (such as rate limits) specifically on fleet and desktop service nodes, we can do it in an isolated environment with status-cli and nwaku instances, but it would require some work.

@chaitanyaprem

Exactly: the nwaku filter protocol has a hard-coded rate limit applied (without any configuration).
It is 30 requests per minute for each subscriber peer.

@NagyZoltanPeter, I assume you mean 30 req/min in terms of ping, subscribe, etc. I hope no rate limit is applied to unsubscribe, as there is no retry for unsubscribe and it may end up in a situation where a client thinks the unsubscribe is done, but the service node keeps sending messages because of pings received for other filters from the same client.

Also, I am sure this is the case, but just to confirm: there is no rate limit on how many messages are delivered to a filter client, right?

@NagyZoltanPeter

Hm, the rate limit is applied to all subscribe/unsubscribe/ping requests.
Pushing messages is not limited, of course.
What do you mean by "the limit must not be applied to unsubscribe"?
Do you think an app needs to issue more than 30 requests in a row to manage its subscriptions, including unsubscribe-all?
Note that the limit is applied per peer: 30 req/min/peer.

Of course it is a guess value and we can change it.
But still, with a maximum of 1000 subscribers that is 30,000 requests per minute, which is quite high I think.

@chaitanyaprem

What do you mean by "the limit must not be applied to unsubscribe"?
Do you think an app needs to issue more than 30 requests in a row to manage its subscriptions, including unsubscribe-all?
Note that the limit is applied per peer: 30 req/min/peer.

Well, I thought I saw a corner case where unsubscribe gets rejected by the rate limit... but that's OK since the rate limit is quite high per peer. You can ignore that comment.

@fryorcraken (Contributor)

But what you are suggesting seems more like tuning rate limits on the fleet and service nodes rather than dogfooding

What I am saying is that dogfooding is going to be challenging and limited, and that a monitoring/data-gathering approach is likely to be more useful.

It's also part of nwaku, so we will still do some dogfooding on status.staging. Not saying we should skip it.

@fryorcraken (Contributor)

New dashboard: waku-org/nwaku#3025

You can now browse these metrics live: https://grafana.infra.status.im/d/qrp_ZCTGz/nim-waku-v2?orgId=1

@chaitanyaprem

While dogfooding store rate limits in the Status fleets, an issue was noticed that made us revert the change, as it would otherwise affect message reliability for Status users.

Issue found: waku-org/go-waku#1219: Status clients are not able to switch to an alternate store node for periodic message checks when a store node returns a rate-limit failure (a client-side sketch follows after this comment).
Fix tracked: target release 2.31.0
Mitigation done: rate limit disabled
Dogfooding will resume when: 2.31.0 is released
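
For illustration, a minimal sketch (in Go; the sentinel error, types and function names are hypothetical, not the actual go-waku API) of the client-side behaviour the fix aims for: on a rate-limit rejection, retry the periodic message check against the next store node instead of giving up.

```go
// Hypothetical sketch of store-node failover on rate-limit rejection.
package storeclient

import (
	"context"
	"errors"
	"fmt"
)

// ErrRateLimited stands in for whatever error a store node returns when the
// requester has hit its rate limit.
var ErrRateLimited = errors.New("store node rate limit reached")

type Message struct{ Payload []byte }

// QueryFunc performs a store query against a single store node.
type QueryFunc func(ctx context.Context, storenode string) ([]Message, error)

// QueryWithFailover tries each store node in turn, moving on to the next one
// whenever the current node answers with a rate-limit rejection.
func QueryWithFailover(ctx context.Context, nodes []string, query QueryFunc) ([]Message, error) {
	var lastErr error
	for _, node := range nodes {
		msgs, err := query(ctx, node)
		if err == nil {
			return msgs, nil
		}
		if errors.Is(err, ErrRateLimited) {
			lastErr = err
			continue // try the next store node instead of failing the check
		}
		return nil, err // other errors are returned as-is
	}
	return nil, fmt.Errorf("all store nodes rate-limited: %w", lastErr)
}
```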

@fryorcraken (Contributor)

@chair28980 it means we won't be able to close this until 2.31.0 is released. Do we have a date for it?

@NagyZoltanPeter

@chair28980 @fryorcraken @Ivansete-status
From the nwaku point of view, this deliverable is ready.
Rate-limiting DOS protection has been added to the store (v2-v3), lightpush, filter and peer-exchange protocols.
It is fully configurable from the CLI, individually for each protocol.
Protocol metrics support tracking of served and rejected request rates, and in/out traffic rates.
Bandwidth usage per shard over relay (in-net/gross, out) metrics and a dashboard are also included.
The dashboard is deployed.

The last part (peer-exchange DOS protection and full CLI config) will be released with nwaku v0.33.0.

@Ivansete-status

FYI @chair28980, @fryorcraken
@NagyZoltanPeter completed the following points:

  • allowing measurement of the normal bandwidth and request rate per req-resp protocol.
  • setting limits for req-resp protocols.

Additionally, we've added a separate feature to measure the possible impact between req-resp protocols:
waku-org/nwaku#3060

@fryorcraken (Contributor)

We need a dashboard link so we can easily refer to it.

@chair28980 (Contributor, Author)

@chair28980 it means we won't be able to close this until 2.31.0 is released. Do we have a date for it?

Sorry for the delayed response. The original release date for 2.31.0 was scheduled for Sep 26th (now past); it appears to be delayed. I'll follow up with the projected release date.

@fryorcraken (Contributor)

I think it's done; we just need a link to the dashboard/panels we are referring to.
