[Deliverable] DOS protection for req-res protocols and metrics #66
Comments
I can see that all issues tagged on this are "not critical for launch" except for
Would be good to review this epic and see if we should postpone it. Or even include it in Gen 0? Or at least, focus on docs for operators with waku-org/nwaku#1946 |
Issues de-scoped from Gen 0 milestone:
cc @jm-clius waku-org/nwaku#1938 |
Weekly Update
|
Scope signed-off during EU-NA pm 2024-02-19. |
This took longer than expected. Any specific reasons @Ivansete-status ? From a client PoC, meaning that a service may reject a request due to reaching the rate limit, is this handled? |
@fryorcraken, @Ivansete-status: Yes, it was dependent on me; no specific reason in terms of the feature, only other tasks caused a bit of distraction. Several phases were done, and redesigns happened in the meantime.
|
initial work from
we intend to address it as part of @NagyZoltanPeter is there a task for upgrading |
Regarding what exactly? Do you mean the new protocol definition? This one: waku-org/nwaku#2722 |
Talking with @NagyZoltanPeter , can we please clarify the rollout strategy? I understand that filter and light push rate limits are already deployed on status.prod because they are set by default. Store is not set up and is actually the more difficult one, as light clients can use Status Desktop nodes for light push and filter services. For store, there are a few things to take into consideration:
|
Sorry, I might not have been clear. For lightpush and store the default is no rate limit. If it turns out that we need different rate-limit settings for different protocols, we will need a separate configuration or derive a final value out of it. |
Back to the rollout strategy, we also want to monitor before/after and extract a good value to set from the current data. |
It's probably due to the storenode selection heuristic used by the storenode cycle:
Since we only have 6 storenodes in the fleet, the fastest 25% (rounded up) that it tends to choose will always be the 2 geographically closest storenodes. Since most core contributors are located in Europe, this will show up in the data as a preference for Amsterdam storenodes. In my particular case, status-go will tend to prefer those in US Central |
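To illustrate the heuristic, here is a minimal Go sketch (not status-go's actual implementation; node names and RTTs are made up) of selecting the fastest 25% of storenodes, rounded up, which with a 6-node fleet always resolves to the 2 lowest-latency nodes:

```go
// Hypothetical sketch of the "fastest 25% (rounded up)" selection described above.
package main

import (
	"fmt"
	"math"
	"sort"
	"time"
)

type storenode struct {
	addr string
	rtt  time.Duration // measured latency to this node (illustrative)
}

func fastestQuartile(nodes []storenode) []storenode {
	sorted := append([]storenode(nil), nodes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].rtt < sorted[j].rtt })
	// ceil(len/4): the "fastest 25% (rounded up)" of the fleet.
	k := int(math.Ceil(float64(len(sorted)) / 4.0))
	return sorted[:k]
}

func main() {
	// Made-up fleet of 6 storenodes, as seen from a European client.
	fleet := []storenode{
		{"store-01.ams", 25 * time.Millisecond},
		{"store-02.ams", 30 * time.Millisecond},
		{"store-01.usc", 120 * time.Millisecond},
		{"store-02.usc", 125 * time.Millisecond},
		{"store-01.hkg", 210 * time.Millisecond},
		{"store-02.hkg", 215 * time.Millisecond},
	}
	// ceil(6/4) = 2, so the two Amsterdam nodes always win for this client.
	fmt.Println(fastestQuartile(fleet))
}
```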
Looking at this deliverable and the matching work for go-waku (as service node), the dogfooding is difficult as it is directly related to resource availability and consumption. The scenarios that can go wrong are:
From this, it seems that the practical approach is to have proper monitoring so we can measure each point above and adjust as needed. What we should monitor is:
A. Fleet: metrics directly correlated to the rate we are setting, e.g. the ability to see from a graph what ratio of the rate limit we are at (e.g. 90% of the rate limit), both globally and locally. E.g. we need to know if a lot of users (IPs?) are reaching 100% of their quota, potentially indicating the quota is too low.
B. Clients: reason for connection/request failures, so we know whether failures are due to DoS rate limiting and for which service (filter, lp, store). Tracked in #182.
C. Clients: bandwidth, memory and CPU consumption of the Waku node. We already have plans to track bandwidth; CPU and memory will need review.
D. Clients: rate limit of the local service node (filter, lp, desktop only). It may make sense to report only specific events, e.g. "reached 80% of the global rate limit on a service" or "rejected a client for reaching the per-IP rate limit". This work will need to be reviewed and planned. We may also decide to set up alerts on thresholds. |
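As a rough illustration of the kind of instrumentation points A, B and D call for, here is a hedged Go sketch using the Prometheus client; the metric and label names are hypothetical, not actual nwaku/go-waku metrics:

```go
// Illustrative instrumentation sketch: counters for served vs. rate-limited
// requests per req-res protocol, from which a "% of rate limit used" panel
// and threshold alerts (e.g. at 80%) could be derived in Grafana.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Hypothetical metric names, purely illustrative.
	servedRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "waku_service_requests_total",
		Help: "Requests served, per req-res protocol.",
	}, []string{"protocol"}) // "filter", "lightpush", "store"

	rejectedRequests = prometheus.NewCounterVec(prometheus.CounterOpts{
		Name: "waku_service_requests_rejected_total",
		Help: "Requests rejected due to rate limiting, per protocol and scope.",
	}, []string{"protocol", "scope"}) // scope: "global" or "per_peer"
)

func init() {
	prometheus.MustRegister(servedRequests, rejectedRequests)
}

// RecordRejection would be called wherever a protocol handler drops a request.
func RecordRejection(protocol, scope string) {
	rejectedRequests.WithLabelValues(protocol, scope).Inc()
}
```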
Completely agree with your thoughts above @fryorcraken . But what you are suggesting seems more like tuning rate limits on the fleet and service nodes rather than dogfooding. We need to do this anyway in order to understand what the optimal rate limits to set are. Then again, we can't predict user count and the mobile/desktop user ratio in the network, which means this may be an ongoing exercise for some time as Status users grow. If we want to dogfood features (such as rate limits) specifically in fleet and desktop service nodes, we can do it in an isolated environment with status-cli and nwaku instances, but it would require some work to be done. |
@NagyZoltanPeter , I assume you mean. Also, I am sure this is the case, but just to confirm: there is no rate limit on how many messages are delivered to a filter client, right? |
Hm, the rate limit is applied to all subscribe/unsubscribe/ping requests. Of course it is a guessed value and we can change it. |
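For context, a per-peer token-bucket limiter along these lines could look like the following Go sketch (using golang.org/x/time/rate; the numbers are placeholder guesses, not nwaku's actual values):

```go
// Minimal per-peer token-bucket sketch: every subscribe/unsubscribe/ping
// request draws from the same per-peer bucket.
package main

import (
	"fmt"
	"sync"

	"golang.org/x/time/rate"
)

type perPeerLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	r        rate.Limit // sustained requests/second allowed per peer
	burst    int        // short bursts tolerated per peer
}

func newPerPeerLimiter(r rate.Limit, burst int) *perPeerLimiter {
	return &perPeerLimiter{limiters: map[string]*rate.Limiter{}, r: r, burst: burst}
}

// Allow reports whether the given peer may issue one more request now.
func (p *perPeerLimiter) Allow(peerID string) bool {
	p.mu.Lock()
	l, ok := p.limiters[peerID]
	if !ok {
		l = rate.NewLimiter(p.r, p.burst)
		p.limiters[peerID] = l
	}
	p.mu.Unlock()
	return l.Allow()
}

func main() {
	lim := newPerPeerLimiter(30, 30) // hypothetical: 30 req/s, burst 30, per peer
	for i := 0; i < 35; i++ {
		if !lim.Allow("peerA") {
			fmt.Println("request", i, "rejected: per-peer rate limit reached")
		}
	}
}
```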
Well, I thought I saw a corner case where unsubscribe gets rejected by the rate limit... but that's OK since the rate limit is quite high per peer. You can ignore that comment. |
It's also part of nwaku so we will still do some dogfooding on status.staging. Not saying we should skip it. |
New dashboard: waku-org/nwaku#3025 But now you can browse these metrics live: https://grafana.infra.status.im/d/qrp_ZCTGz/nim-waku-v2?orgId=1 |
While dogfooding Store rate limits in the Status fleets, an issue was noticed that made us revert the change, as it would otherwise affect message reliability for Status users. Issue found: waku-org/go-waku#1219 — Status clients are not able to switch to an alternate store node for periodic message checks when the store node returns a rate-limit failure. |
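A simplified sketch of the missing failover behaviour (types and the rate-limit sentinel error are hypothetical, not go-waku's actual API): when the selected store node answers with a rate-limit failure, the periodic message check should move on to the next candidate instead of giving up.

```go
// Illustrative failover loop over candidate store nodes.
package store

import (
	"context"
	"errors"
	"fmt"
)

var errRateLimited = errors.New("store node rejected query: rate limit exceeded")

// queryFn stands in for a store query against one specific node.
type queryFn func(ctx context.Context, node string) error

// queryWithFailover tries each store node in turn, switching to the next
// candidate on a rate-limit failure and stopping on the first success.
func queryWithFailover(ctx context.Context, nodes []string, query queryFn) error {
	var lastErr error
	for _, n := range nodes {
		err := query(ctx, n)
		if err == nil {
			return nil
		}
		if errors.Is(err, errRateLimited) {
			lastErr = err
			continue // switch to the alternate store node
		}
		return err // non-rate-limit errors are surfaced immediately
	}
	return fmt.Errorf("all store nodes exhausted: %w", lastErr)
}
```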
@chair28980 it means we won't be able to close this until 2.31.0 is released. Do we have a date for it? |
@chair28980 @fryorcraken @Ivansete-status Last part - peer-exchange DOS protection and full CLI config - will be released with nwaku v0.33.0. |
fyi @chair28980 , @fryorcraken
Additionally, we've added a separate feature to measure the possible impact between req-resp protocols. |
Need dashboard link so we can easily refer to it. |
Sorry for the delayed response. The original release date for 2.31.0 was scheduled for Sep 26th (now past); it appears to be delayed. I'll follow up with the projected release date. |
I think it's done, we just need a link to the dashboard/panels we are referring to. |
Project: https://github.com/orgs/waku-org/projects/11/views/1
Summary
Current minimum scope to implement:
Descoped from this milestone:
As the autosharded public network grows and traffic increases per shard, we want to provide some bandwidth management mechanisms for relay nodes to dynamically choose the number of shards they support based on bandwidth availability. For example, when the network launches, it's reasonable for relay nodes to support all shards and gradually unsubscribe from shards as bandwidth usage increases. The minimum number of shards to support would be 1, so the network design and DoS mechanisms (see Track 3) would have to provide predictable limits on max bandwidth per shard. We could also envision a bandwidth protection mechanism that drops messages over a threshold, but this would affect the node's scoring, so it should be carefully planned.
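As a back-of-the-envelope illustration (all numbers and names are assumptions, not part of the design), the shard count a relay node keeps could be derived from its bandwidth budget and the predictable per-shard cap:

```go
// Hypothetical sketch: pick how many shards to stay subscribed to, never
// dropping below one, given a bandwidth budget and a per-shard bandwidth cap.
package main

import "fmt"

// shardsToSupport returns how many shards a relay node can afford, with
// budget and perShardMax in the same unit (e.g. Mbps).
func shardsToSupport(budget, perShardMax float64, totalShards int) int {
	if perShardMax <= 0 {
		return totalShards // no per-shard cap known: stay on all shards
	}
	n := int(budget / perShardMax)
	if n < 1 {
		n = 1 // minimum: a relay node always supports at least one shard
	}
	if n > totalShards {
		n = totalShards
	}
	return n
}

func main() {
	// e.g. a 50 Mbps budget with a 10 Mbps per-shard cap on an 8-shard network
	fmt.Println(shardsToSupport(50, 10, 8)) // -> 5
}
```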
Epics
Output