[packetbeat] Expire source port mappings. #41581
This pull request does not have a backport label.
To fixup this pull request, you need to add the backport labels for the needed
Pinging @elastic/sec-linux-platform (Team:Security-Linux Platform)
The fix naturally results in more /proc parsing: at least once every 10 seconds for every connection, whereas before it happened only for a new connection that didn't have a mapping already. The other conservative approach would be an O(n) expiration of every mapping every X seconds. While this would reduce the amount of /proc parsing in the positive case, it wouldn't change anything for the negative one (if a lookup misses the cache, it tries to rebuild it anyway).
Some nitpicks, but LGTM
@@ -253,9 +276,8 @@ func (proc *ProcessesWatcher) expireProcessCache() {
}
}

// proc.mu must be locked
Should add this comment above line 250, also.
packetbeat/procs/procs.go
Outdated
// the whole map.
//
// We take a conservative approach by discarding the entry if
// it's old enough. When we the first time here, our caller
- // it's old enough. When we the first time here, our caller
+ // it's old enough. The first time here, our caller
port->pid mappings were only overwritten, never expired, the overwriting mechanism has a bunch of issues:

- It only overwrites if it manages to find the new pid, so it misses short lived processes.
- It only refreshes the mapping of said port, if a packet arriving on _another_ port misses the lookup (otherwise the original port is found and returned). Meaning, once all ports are used at least once, the cache is filled and never mutated again.

The observable effect is that the user will see wrong process correlations _to_ older/long lived processes, imagine the following:

- Long lived process makes _short_ lived TCP connection from src_port S.
- Years later, a _short_ lived process makes a TCP connection to somewhere else, but from the same src_port S. It hits the cache, since it had a mapping for S, so packetbeat incorrectly correlates the new short-lived process connection, with the old long lived process.

Related to a very long SDH, where a more in depth explanation of the bug can be found here, with a program to reproduce it.

- elastic/sdh-beats#4604 (comment)
- elastic/sdh-beats#4604 (comment)

The solution is to discard mappings that are "old enough", with a hardcoded window of 10 seconds, so as long as the port is not re-used in this window, we are fine.

This also makes sure the cache never becomes "immutable", since mappings will invariably get old, forcing a refresh.

It's a very conservative approach as I don't want to introduce other bugs by redesigning it, work is on the way to change how the cache works in linux anyway.

While here, I've noticed the locking was also wrong, we were doing the lookup unlocked, and also having to relock in case we have to update the mapping, so change this to grab the lock once and only once, interleaving is baad.
a0d2ab8 to 4ed9319
(cherry picked from commit 587dc60)
Co-authored-by: Christiano Haesbaert <[email protected]>
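Proposed commit message
port->pid mappings were only overwritten, never expired. The overwriting mechanism has some issues:
- It only overwrites if it manages to find the new pid, so it misses short-lived processes.
- It only refreshes the mapping of a given port if a packet arriving on _another_ port misses the lookup (otherwise the original port is found and returned). This means that once all ports have been used at least once, the cache is filled and never mutated again.
The observable effect is that the user will see wrong process correlations to older/long-lived processes. Imagine the following:
- A long-lived process makes a _short_-lived TCP connection from src_port S.
- Years later, a _short_-lived process makes a TCP connection to somewhere else, but from the same src_port S. It hits the cache, since it had a mapping for S, so packetbeat incorrectly correlates the new short-lived process connection with the old long-lived process.
Related to a very long SDH, where a more in-depth explanation of the bug can be found, with a program to reproduce it:
- https://github.com/elastic/sdh-beats/issues/4604#issuecomment-2459969325
- https://github.com/elastic/sdh-beats/issues/4604#issuecomment-2460829030
The solution is to discard mappings that are "old enough", with a hardcoded window of 10 seconds. As long as the port is not re-used within this window, we are fine.
This also makes sure the cache never becomes "immutable", since mappings will invariably get old, forcing a refresh.
It's a very conservative approach, as I don't want to introduce other bugs by redesigning it; work is on the way to change how the cache works on Linux anyway.
While here, I noticed the locking was also wrong: we were doing the lookup unlocked and then had to relock whenever the mapping needed updating. Change this to grab the lock once and only once; interleaving is bad.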
Checklist
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added tests that prove my fix is effective or that my feature works

`CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.
Test
The following program can be used to demonstrate the bug:
Build and run with:
It will make one connection per source port to the specified address (192.168.1.50:12345) and send some bytes. To make it easier, 192.168.1.50 should be on a different machine than packetbeat. There you can run `tcpbench`, by yours truly, or any other service that will accept TCP connections and eat some bytes:
After running `all_your_mappings_are_belong_to_us`, if you make a TCP connection to any other port, packetbeat will incorrectly assume it belongs to `all_your_mappings_are_belong_to_us`; see screenshots.
Tested on 8.14.3 and main.
Screenshots
The thing circled in red is a `wget` to google.com, yet it thinks it's from `all_your_mappings_are_belong_to_us`.
After the fix, the mappings correctly show `wget`.