-
Notifications
You must be signed in to change notification settings - Fork 53
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Added simple, configurable rate limit for lightpush and store-query #2390
feat: Added simple, configurable rate limit for lightpush and store-query #2390
Conversation
This PR may contain changes to configuration options of one of the apps. If you are introducing a breaking change (i.e. the set of options in latest release would no longer be applicable) make sure the original option is preserved with a deprecation note for 2 following releases before it is actually removed. Please also make sure the label |
You can find the image built from this PR at
Built from 18992b2 |
Notice: cc: @chair28980 Even I will be able to implement the feature, it may cause some delays as we need to fix nim-chronos either by ourselves or by Jacek or Eugen or someone else. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks amazing! 🔥 🔥 🔥 Thanks so much!
Added some small questions :))
b96ab38
to
67339fd
Compare
waku-org/js-waku#1952 covers one issue caused by changing protocol interface in js-waku tests. |
…uery Adjust lightpush rest response to rate limit, added tests ann some fixes Add rest store query test for rate limit checks and proper error response Update apps/wakunode2/external_config.nim Co-authored-by: gabrielmer <[email protected]> Fixing typos and PR comments Fix tests Move chronos/tokenbucket to nwaku codebasee with limited and fixed feature set Add meterics to lightpush rate limits Fixes for code and test + test reliability with timing
76cf038
to
7dd66d0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks so much! Added some small questions about points that I still don't understand completely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amaaaazing! Thanks so much!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Masterpiece PR! Thanks for it! 💯
Just added few comments that I hope you find useful.
Something that we might need to consider in the future is to expose byte
rate metrics.
For now, we expose all the libp2p
in/out transmission metrics from:
https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L112
and
https://github.com/vacp2p/nim-libp2p/blob/09b3e11956459ad4c364b353b7f1067f42267997/libp2p/stream/chronosstream.nim#L133
It will be very interesting to know the percentage (and absolute values) of the bandwidth (bytes/sec) consumed by each protocol. Something to consider in upcoming PRs :P
Cheers
await sleepAsync(1000.millis) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Beautiful test!
nitpick: I wonder if we could make the test a bit faster, i.e. 500ms
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it was in my mind. But found that the whole test is very sensitive to timing. The base of this problem that there is a quite large gap between the test case starts a timer and make assumption on it and when the exact timing will occur on the protocol an token bucket side.
This is normally not a problem as user of the protocol will not make assumptions normally on timing the request. This is mostly a testing problem.
But agree to avoid such sleeps as much as possible, but this rate limit tests are time sensitive ones.
requestRateLimit* {. | ||
desc: | ||
"Number of requests to serve by each service in the specified period. Set it to 0 for unlimited", | ||
defaultValue: 0, | ||
name: "request-rate-limit" | ||
.}: int | ||
|
||
## Currently default to switch of rate limit until become official | ||
requestRatePeriod* {. | ||
desc: "Period of request rate limitation in seconds. Set it to 0 for unlimited", | ||
defaultValue: 0, | ||
name: "request-rate-period" | ||
.}: int64 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you already commented and I might be missing something but these params seem not to be used :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems washed out during rebase in relation with refactoring the node initialization.
Yes, we have a separate issue tracking that: #1945 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments:
- We are missing some DoS attack vectors. For example this allows an attacker to DoS the node, since it will naively read from the connection and return. We should also count read bytes, not just requests made. And not only on valid requests but also when RPC decode is not successful. This can be used for inspiration. I would suggest something similar, where if a given peer has sent more than x bytes/unit of time,
readLp
is not called + we disconnect from that peer without reading from the connection. In short: This code allows an attacker to flood with random bytes the peer without any limits.
let buf = await conn.readLp(MaxRpcSize.int)
let decodeRes = HistoryRPC.decode(buf)
if decodeRes.isErr():
error "failed to decode rpc", peerId = $conn.peerId
waku_store_errors.inc(labelValues = [decodeRpcFailure])
# TODO: Return (BAD_REQUEST, cause: "decode rpc failed")
return
-
Unless I'm missing something, the rate limit is applied globally (eg to all peers), which means that a given peer can trigger this rate limit and won't allow other nodes to get any service from the node. imho the rate limit should be per peerId (or per IP address) so that we allow x amount of usage (bytes or requests) per peer instead of globally. Without this, we are still exposed to DoS attacks.
-
Can we split this PR into different ones? I see 3 clear different PRs here: i) introduce tocken bucket, ii) rate limit lightpush, iii) rate limit store.
-
Do we really need rate limits at the rest API level? afaik we can assume that the REST API is exposed to trusted parties so not sure if we need to rate limit there. imho we should focus on the libp2p protocol, which is the service exposed globally.
|
||
import chronos | ||
|
||
## This is an extract from chronos/ratelimit.nim due to the found bug in the original implementation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind pointing to the bug and elaborating more? Especially why it can't be fixed upstream.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In libp2p chronos' TockenBucket implementation is used but a bit differently than our expectation. It has a kind of time statistics replenish method.
It heavily depends on the usage and can ruin our usage. Unfortunately I could not - yet - fix it that may securely not affect libp2p usage. That fix needs a bit more wide testing for that I did not want to block this feature in nwaku.
I have an idea how to fix that and already agreed the aproach with Jacek... he encouraged us to take actions and co-work with other projects more easy.
Thanks for these comments, all really valuable!
Of course there are some open questions along this road as rate and bandwidth limits may contradict each other, at least in priority level. @alrevuelta : WDYT? |
Findings are now address in #2589 for future handle them |
Agreed to track those findings in following PRs + track them in #2589
Description
As a first step for rate and bandwidth caping of non relay protocol, hereby a very simple request rate / period limitation is added to light-push and store protocols.
Changes
issue
#2032