Add dos and monitoring docs #160

Merged
merged 11 commits into from
Aug 15, 2022
Changes from 4 commits
Binary file added content/reference/assets/fail2bango-libp2p.mov
Binary file not shown.
301 changes: 301 additions & 0 deletions content/reference/dos-mitigation.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,301 @@
---
title: "DOS Mitigation"
weight: 3
---

DOS mitigation is an essential part of any P2P application. We need to design
our protocols to be resilient to malicious peers. We need to monitor our
application for signs of suspicious activity or an attack. And we need to be
able to respond to an attack.

Here we'll cover how we can use libp2p to achieve the above goals.

# Table of contents

- [What we mean by a DOS attack](#what-we-mean-by-a-dos-attack)
- [Incorporating DOS mitigation from the start](#incorporating-dos-mitigation-from-the-start)
  - [Limit the number of concurrent streams your protocol needs](#limit-the-number-of-concurrent-streams-your-protocol-needs)
  - [Limit the number of connections your application needs](#limit-the-number-of-connections-your-application-needs)
  - [Reduce blast radius](#reduce-blast-radius)
  - [Fail2ban](#fail2ban)
  - [Leverage the resource manager to limit resource usage (go-libp2p only)](#leverage-the-resource-manager-to-limit-resource-usage-go-libp2p-only)
  - [Rate limiting incoming connections (go-libp2p only)](#rate-limiting-incoming-connections-go-libp2p-only)
- [Monitoring your application](#monitoring-your-application)
- [Responding to an attack](#responding-to-an-attack)
  - [Who’s misbehaving?](#whos-misbehaving)
  - [How to block a misbehaving peer](#how-to-block-a-misbehaving-peer)
  - [How to automate blocking with fail2ban](#how-to-automate-blocking-with-fail2ban)
- [Summary](#summary)

# What we mean by a DOS attack

A DOS attack is any attack that can cause your application to crash, stall, or
otherwise fail to respond normally. An attack is considered viable if it takes
fewer resources to execute than the damage it does. In other words, if the
payoff is higher than the investment it is a viable attack and should be
mitigated. Here are a few examples:

1. One server opening many connections to a remote server and forcing that
server to spend 10x the compute time to handle each request relative to the
attacking server. This attack is viable because a single server amplifies its
effect 10x. This attack will continue to scale if the attacker adds more
servers.

2. 100 servers asking a single server to do some work, but if this single server
goes down it will indirectly cause the loss of an asset. If the asset is more
valuable than the compute time of 100 servers, this attack is viable.

3. Many servers connecting to a single server such that that server can no
longer accept new connections from an honest peer. This server is now
isolated from the honest peers in the network. This is commonly called an
eclipse attack and is viable if it's either cheap to eclipse this node, or if
eclipsing this node has a high payoff.

Generally the effect on our application can range from crashing to stalling to
failing to handle new peers to degraded performance. Ideally we want
our application to at worst suffer a slight performance penalty, but otherwise
stay up and healthy.

In the next section we'll cover some design strategies you should incorporate
into your protocol to make sure your application stays up and healthy.

# Incorporating DOS mitigation from the start

The general strategy is to use as few resources as possible, and to
make sure that there's no untrusted amplification mechanism (e.g. an untrusted
node can force you to do 10x the work it does). A protocol-level reputation
system can help (take a look at [GossipSub](https://github.com/libp2p/specs/tree/master/pubsub/gossipsub) for inspiration), as can
logging misbehaving nodes and acting on those logs separately (see fail2ban
below).

Here are some more specific recommendations:

## Limit the number of concurrent streams your protocol needs

Review comment (Member): Optional: I think a similar section on the number of connections would be helpful.

Reply (Contributor): Agreed.

Each stream has some resource cost associated with it. Depending on the
transport and multiplexer, this can be bigger or smaller. Try to avoid having
too many concurrent streams open per peer for your protocol. Instead try to
limit the maximum number of concurrent streams to something reasonable (surely
you don't need >512 streams open at once for a peer?). Multiple concurrent
streams can be useful for logic or to avoid [Head-of-line
blocking](https://en.wikipedia.org/wiki/Head-of-line_blocking), but having too
many streams will offset these benefits.

Using a stream for a short period of time and then closing it is fine. It's
really the number of _concurrent_ streams that you need to be careful of.
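
One way to enforce such a cap inside a protocol handler is a small per-peer counter. The following is a minimal sketch, not go-libp2p API: `streamLimiter`, `acquire`, and `release` are hypothetical names, and peer IDs are plain strings so the example needs only the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// streamLimiter caps the number of concurrently open streams per peer.
// Peer IDs are plain strings here to keep the sketch free of libp2p imports.
type streamLimiter struct {
	mu      sync.Mutex
	max     int
	current map[string]int
}

func newStreamLimiter(max int) *streamLimiter {
	return &streamLimiter{max: max, current: make(map[string]int)}
}

// acquire reports whether the peer may open one more stream.
func (l *streamLimiter) acquire(peer string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.current[peer] >= l.max {
		return false
	}
	l.current[peer]++
	return true
}

// release must be called once per successful acquire, when the stream closes.
func (l *streamLimiter) release(peer string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.current[peer] > 0 {
		l.current[peer]--
	}
	if l.current[peer] == 0 {
		delete(l.current, peer) // do not keep state for idle peers
	}
}

func main() {
	limiter := newStreamLimiter(2)
	fmt.Println(limiter.acquire("peerA")) // true
	fmt.Println(limiter.acquire("peerA")) // true
	fmt.Println(limiter.acquire("peerA")) // false: peerA is at its limit
	limiter.release("peerA")
	fmt.Println(limiter.acquire("peerA")) // true again after a stream closed
}
```

In a real handler you would presumably call `acquire` when a stream is opened, reset the stream if it returns false, and defer `release` until the stream is done. Short-lived streams then never hit the cap, which matches the point that only concurrent streams matter.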

Review comment (Contributor): Maybe use Identify as an example, where it doesn't make sense for the protocol to support more than one stream per connection?

Reply (Contributor): You know more than me here. Would it be useful to use Identify as an example?


## Limit the number of connections your application needs

Review comment (Contributor): Given connections happen before streams in an application's lifecycle, maybe move this above the section above?


Like streams, each connection has a resource cost associated with it. A
connection usually represents a peer and a set of protocols, each with its
own resource usage. So limiting connections can have a leveraged effect on your
resource usage.

In go-libp2p the number of active connections is managed by the
[`connmgr`](https://pkg.go.dev/github.com/libp2p/[email protected]/p2p/net/connmgr#BasicConnMgr.Protect).
`ConnManager` will trim connections when you hit the high watermark number of
connections. You can protect certain connections with the `.Protect` method.

Review comment (Contributor): From reading through this doc I don't think it's clear for a user on when to use the connmgr or the resource manager for go-libp2p.
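
To make the watermark idea concrete, here is a simplified model of trimming with protected connections. It is an illustration only and does not reproduce go-libp2p's `connmgr` implementation: the `conn` type, the `score` field, the `trim` function, and the explicit low watermark parameter are all assumptions made for this sketch.

```go
package main

import (
	"fmt"
	"sort"
)

// conn is a simplified stand-in for a tracked connection.
type conn struct {
	peer      string
	score     int  // higher means more valuable; hypothetical scoring
	protected bool // protected connections are never trimmed
}

// trim returns the connections to keep. Nothing is closed until the count
// exceeds high; then the lowest-scored unprotected connections are dropped
// until at most low remain (or only protected ones are left).
func trim(conns []conn, low, high int) []conn {
	if low > high {
		low = high
	}
	if len(conns) <= high {
		return conns
	}
	sorted := append([]conn(nil), conns...)
	// Protected first, then descending score, so trim candidates sit at the end.
	sort.SliceStable(sorted, func(i, j int) bool {
		if sorted[i].protected != sorted[j].protected {
			return sorted[i].protected
		}
		return sorted[i].score > sorted[j].score
	})
	keep := low
	for keep < len(sorted) && sorted[keep].protected {
		keep++ // never cut into the protected prefix
	}
	return sorted[:keep]
}

func main() {
	conns := []conn{
		{peer: "A", score: 5},
		{peer: "B", score: 1, protected: true},
		{peer: "C", score: 3},
		{peer: "D", score: 2},
	}
	for _, c := range trim(conns, 2, 3) {
		fmt.Println(c.peer) // prints B then A
	}
}
```

Peer B survives despite its low score because it is protected, which is the effect the `.Protect` method is meant to give you for connections your application cannot afford to lose.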

In rust-libp2p handlers should implement
[`connection_keep_alive`](https://docs.rs/libp2p/0.46.1/libp2p/swarm/trait.ConnectionHandler.html#tymethod.connection_keep_alive)
to define when a connection can be closed.

## Reduce blast radius

If you can split up your libp2p application into multiple separate processes you
can increase the resiliency of your overall system. For example your node may
have to help achieve consensus and respond to user queries. By splitting this up
into two processes you now rely on the OS’s guarantee that the user query
process won’t take down the consensus process.

## Fail2ban

If you can log when a peer is misbehaving or is malicious, you can then hook up
those logs to fail2ban and have fail2ban manage your firewall to automatically
block misbehaving nodes. go-libp2p includes some built-in support for this
use case. More details below.


## Leverage the resource manager to limit resource usage (go-libp2p only)

go-libp2p includes a powerful [resource
manager](https://github.com/libp2p/go-libp2p-resource-manager) that keeps track
of resources used for each protocol, peer, connection, and more. You can use it
within your protocol implementation to make sure you don't allocate more than
some predetermined amount of memory per connection. It's basically a resource
accounting abstraction that you can make use of in your own application.

Review comment (Contributor): Will this link survive after the soon-coming repo consolidation?

Reply (Contributor Author, MarcoPolo, Aug 12, 2022): We have a comment to update this, but worst case will force a user to click through another link.
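
The resource-accounting idea can be sketched as a reserve/release pair guarded by a limit. This is a standalone illustration, not the resource manager's API: `memScope`, `reserve`, `release`, and the 1 MiB budget are hypothetical, and the sketch is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
)

var errLimitExceeded = errors.New("memory limit exceeded")

// memScope tracks memory reserved on behalf of one connection.
type memScope struct {
	limit    int
	reserved int
}

// reserve accounts for size bytes, failing instead of exceeding the limit.
func (s *memScope) reserve(size int) error {
	if size < 0 || s.reserved+size > s.limit {
		return errLimitExceeded
	}
	s.reserved += size
	return nil
}

// release returns previously reserved bytes to the scope.
func (s *memScope) release(size int) {
	if size > s.reserved {
		size = s.reserved
	}
	s.reserved -= size
}

func main() {
	scope := &memScope{limit: 1 << 20} // 1 MiB per connection, hypothetical budget
	if err := scope.reserve(512 << 10); err != nil {
		fmt.Println("unexpected:", err)
	}
	// A second large read would exceed the budget, so reject it before allocating.
	if err := scope.reserve(768 << 10); err != nil {
		fmt.Println("rejected:", err)
	}
	scope.release(512 << 10)
}
```

The important design choice is that the reservation happens before the allocation, so a peer that asks for too much is refused without your process ever holding the memory.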

## Rate limiting incoming connections (go-libp2p only)

Depending on your use case, it can help to limit the rate of inbound
connections. You can use go-libp2p's
[ConnectionGater](https://pkg.go.dev/github.com/libp2p/go-libp2p-core/connmgr#ConnectionGater)
and `InterceptAccept` for this. For a concrete example, take a look at how Prysm
implements their
[ConnectionGater](https://github.com/prysmaticlabs/prysm/blob/63a8690140c00ba6e3e4054cac3f38a5107b7fb2/beacon-chain/p2p/connection_gater.go#L43).

Review comment (Contributor, on lines +201 to +202): Do we give guidance on when to use this mechanism vs. the resource manager? I see this is a good hook for custom logic, but it seems like what Prysm is doing could be covered by go-libp2p resource manager right?

Reply (Contributor Author):

> Do we give guidance on when to use this mechanism vs. the resource manager?

No, but I could add something here.

> I see this is a good hook for custom logic, but it seems like what Prysm is doing could be covered by go-libp2p resource manager right?

Not really. If you're trying to avoid an adversary that can connect to you and give you a ton of work to do all at once the rcmgr doesn't protect at all. This attack can easily be mitigated by rate limiting though.

Not all applications will want this rate limiting, or they may want to rate limit certain things (e.g. something in the protocol rather than in the connections). For example, if I'm Google I wouldn't want to rate limit any new connection to me. I would rather rate limit work per connection.

Should the rcmgr do this? I don't think so. It's not directly related to limiting the resources being used, and if it can be handled by a smaller component that already exists the better.

I hope that makes sense, but happy to expand more as well.

Review comment (Contributor): Suggested change: replace `implements their (Connection Gater)[https://github.com/prysmaticlabs/prysm/blob/63a8690140c00ba6e3e4054cac3f38a5107b7fb2/beacon-chain/p2p/connection_gater.go#L43].` with `implements their (ConnectionGater)[https://github.com/prysmaticlabs/prysm/blob/63a8690140c00ba6e3e4054cac3f38a5107b7fb2/beacon-chain/p2p/connection_gater.go#L43].` Fixing the rendering issue here: https://bafybeid4zqncc4v5epc4urfvl5ajgnmqeksksk4xrgabdwaxxswpsigh6y.on.fleek.co/reference/dos-mitigation/#leverage-the-resource-manager-to-limit-resource-usage-go-libp2p-only

Reply (Contributor Author, MarcoPolo, Aug 15, 2022): It's a misuse of `(foo)[bar]` vs `[foo](bar)`
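
For the inbound rate limiting discussed in this section, the decision logic that an `InterceptAccept` implementation could delegate to might look like the following fixed-window limiter keyed by IP. This is a sketch under stated assumptions: `ipLimiter`, `window`, and `allow` are hypothetical names, it is not Prysm's implementation, and timestamps are passed in as Unix seconds so the behavior is deterministic.

```go
package main

import (
	"fmt"
	"sync"
)

// window tracks how many connections one IP opened since start.
type window struct {
	start int64 // Unix seconds when the current window began
	count int
}

// ipLimiter accepts at most burst new connections per IP per period.
type ipLimiter struct {
	mu     sync.Mutex
	burst  int
	period int64 // window length in seconds
	byIP   map[string]*window
}

func newIPLimiter(burst int, period int64) *ipLimiter {
	return &ipLimiter{burst: burst, period: period, byIP: make(map[string]*window)}
}

// allow reports whether a connection from ip arriving at Unix time now
// should be accepted. In production now would be time.Now().Unix().
func (l *ipLimiter) allow(ip string, now int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	w, ok := l.byIP[ip]
	if !ok || now-w.start >= l.period {
		l.byIP[ip] = &window{start: now, count: 1}
		return true
	}
	if w.count >= l.burst {
		return false
	}
	w.count++
	return true
}

func main() {
	limiter := newIPLimiter(2, 60)
	fmt.Println(limiter.allow("1.2.3.4", 0))  // true
	fmt.Println(limiter.allow("1.2.3.4", 1))  // true
	fmt.Println(limiter.allow("1.2.3.4", 2))  // false: burst of 2 used up
	fmt.Println(limiter.allow("1.2.3.4", 60)) // true: a new window started
}
```

Note that the map in this sketch grows with every new IP it sees. A production limiter should evict stale entries, otherwise the limiter itself becomes a memory exhaustion vector.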


# Monitoring your application

Once we've designed our protocols to be resilient to DOS attacks and deployed
them, we then need to monitor our application to both verify our mitigation works
and to be alerted if a new attack vector is exploited.


Monitoring is implementation specific, so consult the links below to see how
your implementation does it.


For rust-libp2p look at the [libp2p-metrics crate](https://github.com/libp2p/rust-libp2p/tree/master/misc/metrics).

For go-libp2p resource usage, take a look at the OpenCensus metrics exposed by the resource
manager
[here](https://pkg.go.dev/github.com/libp2p/[email protected]/obs).
go-libp2p plans to add more metrics across the stack in the future;
this work is being tracked in issue
[go-libp2p#1356](https://github.com/libp2p/go-libp2p/issues/1356).

# Responding to an attack

When you see that your node is being attacked (e.g. crashing, stalling, high CPU
usage), the next step is to respond to the attack.

## Who’s misbehaving?

To answer the question of which peer is misbehaving and harming you, go-libp2p
exposes [canonical log
lines](https://github.com/libp2p/go-libp2p-core/blob/master/canonicallog/canonicallog.go#L18)
that identify misbehaving peers. A canonical log line is simply a log line
with a special format. For example, here’s a peer status log line that tells us a
peer established a connection with us, and that this log line was randomly
sampled (1 out of 100).

```
Jul 27 12:14:14 ipfsNode ipfs[46133]: 2022-07-27T12:14:14.674Z INFO canonical-log swarm/swarm_listen.go:128 CANONICAL_PEER_STATUS: peer=12D3KooWSbNLGMYeUuMSXDiHwbhXHzTJaWZzH95MZzeAob9BeB51 addr=/ip4/147.75.74.239/udp/4001/quic sample_rate=100 connection_status="established" dir="inbound"
```

To see these kinds of logs make sure you’ve enabled the `"canonical-log=info"`
log level. You can do this in code like
[so](https://github.com/libp2p/go-libp2p-core/blob/master/canonicallog/canonicallog_test.go#L14),
or by setting the environment variable `GOLOG_LOG_LEVEL="canonical-log=info"`.

In rust-libp2p you can do something similar yourself by logging a sample of
connection events from [SwarmEvent](https://docs.rs/libp2p/0.46.1/libp2p/swarm/enum.SwarmEvent.html).

## How to block a misbehaving peer

Once you’ve identified the misbehaving peer, you can block them with `iptables`
or `ufw`. Here we’ll outline how to block the peer with `ufw`. You can get the
IP address of the peer from the
[multiaddr](https://github.com/multiformats/multiaddr) in the log.

```bash
sudo ufw deny from 1.2.3.4
```
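
If you want to pull the peer ID and IP address out of canonical peer status lines programmatically rather than by eye, a regular expression over the log format shown earlier is enough. This is a minimal sketch: `peerStatusRe` and `parsePeerStatus` are names invented here, and the pattern assumes the `peer=... addr=/ip4/...` layout of the example log line.

```go
package main

import (
	"fmt"
	"regexp"
)

// peerStatusRe captures the peer ID and the IP portion of the multiaddr.
var peerStatusRe = regexp.MustCompile(`CANONICAL_PEER_STATUS: peer=(\S+) addr=/ip[46]/([^/\s]+)`)

// parsePeerStatus returns the peer ID and IP address found in a canonical
// peer status log line, or ok=false if the line does not match.
func parsePeerStatus(line string) (peer, ip string, ok bool) {
	m := peerStatusRe.FindStringSubmatch(line)
	if m == nil {
		return "", "", false
	}
	return m[1], m[2], true
}

func main() {
	line := `Jul 27 12:14:14 ipfsNode ipfs[46133]: 2022-07-27T12:14:14.674Z INFO canonical-log swarm/swarm_listen.go:128 CANONICAL_PEER_STATUS: peer=12D3KooWSbNLGMYeUuMSXDiHwbhXHzTJaWZzH95MZzeAob9BeB51 addr=/ip4/147.75.74.239/udp/4001/quic sample_rate=100 connection_status="established" dir="inbound"`
	peer, ip, ok := parsePeerStatus(line)
	fmt.Println(ok, peer, ip)
}
```

The extracted IP is what you would pass to `ufw deny from`.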

## How to automate blocking with fail2ban

You can hook up [fail2ban](https://www.fail2ban.org) to
automatically block connections from these misbehaving peers if they emit this
log line multiple times in some period of time. For example, a simple fail2ban
filter for go-libp2p would look like this:

```
[Definition]
failregex = ^.*[\t\s]CANONICAL_PEER_STATUS: .* addr=\/ip[46]\/<HOST>[^\s]*
```
`/etc/fail2ban/filter.d/go-libp2p-peer-status.conf`

Reply (Contributor Author): Answered. Copied here:

> `<HOST>`: I assume this is fail2ban syntax for saying that the host to block is this string that comes after “/ip[46]/” and before “[^\s]”. Is that right? (I was a little surprised not to see more conventional capture group conventions.)

Exactly. fail2ban expands this to a regex that captures the host `(?:::f{4,6}:)?(?P<host>\S+)`. See https://www.fail2ban.org/wiki/index.php/MANUAL_0_8#Filters

This matches any canonical peer status log. If a peer shows up often in these
sampled logs, something abnormal is happening, e.g. the peer may be churning
connections.

A conservative fail2ban rule for go-libp2p using the above filter would look
like this:

```
[go-libp2p-weird-behavior-iptables]
# Block an IP address if it fails a handshake or reconnects more than
# 50 times a second over the course of 3 minutes. Since
# we sample at 1% this means we block if we see more
# than 90 sampled log lines over 3 minutes. (50 events/s * 1% = 1 log every
# 2 seconds; over 60 * 3 = 180 seconds that is 90 logs.)
enabled = true
filter = go-libp2p-peer-status
action = iptables-allports[name=go-libp2p-fail2ban]
backend = systemd[journalflags=1]
# This uses systemd for logging.
# This assumes you have a systemd service named ipfs-daemon.
journalmatch = _SYSTEMD_UNIT=ipfs-daemon.service
findtime = 180 # 3 minutes
bantime = 600 # 10 minutes
maxretry = 90
```
`/etc/fail2ban/jail.d/go-libp2p-weird-behavior-iptables.conf`

Review comment (Contributor): I had some comments about this section and format/layout changes in https://www.notion.so/pl-strflt/Guide-for-how-to-respond-to-resource-exhaustion-attacks-b10f55cc9a3d4917ae80c9b914e05e8c.

Reply (Contributor Author): Answered

> @marco Munizaga : it’s not clear to me how the fail2ban rule below is tied to the filter above. I don’t see an id or name reference.

By the filename. The filter is in a file go-libp2p-peer-status.conf and this rule references filter=go-libp2p-peer-status.

I'll leave a comment on the relevant line as well

Note that the above configuration is relying on systemd to get the logs for
ipfs. This will be different depending on your go-libp2p process.
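
If you change the event rate you want to tolerate, the sampling rate, or `findtime`, the `maxretry` threshold has to be recomputed. The arithmetic from the rule's comment can be written as a small helper. `expectedSampledLogs` is a hypothetical name, and the function assumes uniform 1-in-N sampling as indicated by `sample_rate` in the log line.

```go
package main

import "fmt"

// expectedSampledLogs estimates how many sampled log lines a peer produces in
// findTimeSec seconds when it triggers eventsPerSec events per second and
// 1 in sampleRate events is logged.
func expectedSampledLogs(eventsPerSec, sampleRate, findTimeSec int) int {
	return eventsPerSec * findTimeSec / sampleRate
}

func main() {
	// 50 events/s, sampled 1 in 100, observed over 180 seconds.
	fmt.Println(expectedSampledLogs(50, 100, 180)) // 90
}
```

Because sampling is random, the real count will fluctuate around this value, so treat the result as a starting point for `maxretry` rather than an exact boundary.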

For completeness here’s my systemd service definition for a [Kubo instance](https://github.com/ipfs/kubo):

```
$ cat /etc/systemd/system/ipfs-daemon.service
[Unit]
After=network.target
Description=ipfs-daemon

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/r4jm7wfirgdr84zmsnq5qy7hvv14c7l7-glibc-locales-2.34-210/lib/locale/locale-archive"
Environment="PATH=/nix/store/7jr7pr4c6yb85xpzay5xafs5zlcadkhz-coreutils-9.0/bin:/nix/store/140f6s4nwiawrr3xyxarmcv2mk62m62y-findutils-4.9.0/bin:/nix/store/qd9jxc0q00cr7fp30y6jbbww20gj33lg-gnugrep-3.7/bin:/nix/store/lgvd2fh4cndlv8mnyy49jp1nplpml3xp-gnused-4.8/bin:/nix/store/0f3ncs289m2x1vmv2b3grd6l9x1yp2m3-systemd-250.4/bin:/nix/store/7jr7pr4c6yb85xpzay5xafs5zlcadkhz-coreutils-9.0/sbin:/nix/store/140f6s4nwiawrr3xyxarmcv2mk62m62y-findutils-4.9.0/sbin:/nix/store/qd9jxc0q00cr7fp30y6jbbww20gj33lg-gnugrep-3.7/sbin:/nix/store/lgvd2fh4cndlv8mnyy49jp1nplpml3xp-gnused-4.8/sbin:/nix/store/0f3ncs289m2x1vmv2b3grd6l9x1yp2m3-systemd-250.4/sbin"
Environment="TZDIR=/nix/store/n83qx7m848kg51lcjchwbkmlgdaxfckf-tzdata-2022a/share/zoneinfo"

Environment=GOLOG_LOG_LEVEL="canonical-log=info" LIBP2P_RCMGR=1
ExecStart=/nix/store/mmvd2akskpaszlradl8qv4v703v1cy11-kubo-0.0.1/bin/ipfs daemon
Restart=always
RestartSec=1min
User=ipfs
```

### Example screen recording of fail2ban in action

[fail2ban+go-libp2p.mov](./assets/fail2bango-libp2p.mov)

### Setting Up fail2ban

For a general guide to setting up fail2ban, consult this useful tutorial:
[How to protect ssh with fail2ban on Ubuntu 20.04](https://www.digitalocean.com/community/tutorials/how-to-protect-ssh-with-fail2ban-on-ubuntu-20-04).
We’ll focus on the specifics around fail2ban and go-libp2p here.

Once you have fail2ban installed, simply copy the above files into their
respective places: the filter definition into
`/etc/fail2ban/filter.d/go-libp2p-peer-status.conf` and the rule into
`/etc/fail2ban/jail.d/go-libp2p-weird-behavior-iptables.conf`. Remember you may
need to tweak the rule to read from the correct log location or change the
systemd service name. Also remember you need to enable the canonical log level
(see the above section for how to enable this log level). Finally, restart
fail2ban to reload the configuration with `systemctl restart fail2ban`.

Verify our jail is active by running `fail2ban-client status
go-libp2p-weird-behavior-iptables`. If you see something like:

```
Status for the jail: go-libp2p-weird-behavior-iptables
|- Filter
|  |- Currently failed: 0
|  |- Total failed: 0
|  `- Journal matches: _SYSTEMD_UNIT=ipfs-daemon.service
`- Actions
   |- Currently banned: 0
   |- Total banned: 0
   `- Banned IP list:
```

Then you’re good to go! You’ve successfully set up a go-libp2p jail.

# Summary

Mitigating DOS attacks is hard because an attacker needs only one flaw, while a
protocol developer needs to cover _all_ their bases. Libp2p provides some tools
to design better protocols, but developers should still monitor their
applications to protect against novel attacks. Finally developers should
leverage existing tools like `fail2ban` to automate blocking misbehaving nodes
by logging when peers behave maliciously.
7 changes: 7 additions & 0 deletions content/reference/monitoring.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
---
title: "Monitoring and Observability"
weight: 4
---

Reference the [Monitoring your application](todo) section in [DOS
Mitigation](todo).