---
title: (routing-release-0.277.0) TCP Router Port Conflict
expires_at: 2028-08-17
tags:
- routing-release
- 0.277.0
---

# (routing-release-0.277.0) TCP Router Port Conflict

## 📑 Context

Each TCP route requires one port on the TCP Router VM. Ports for TCP routes are managed via router groups. Each router group has a list of `reservable_ports`. The Cloud Foundry documentation for "Enabling and Configuring TCP Routing" has the following warning and suggestion for valid port ranges:

> Do not enter `reservable_ports` that conflict with other TCP router instances or ephemeral port ranges. Cloud Foundry recommends using port ranges within 1024-2047 and 18000-32767 on default installations.

These port suggestions do not overlap with any ports used by system components. However, there is nothing (until now) preventing users from expanding this range into ports that do overlap with ports used by system components.

This port conflict can result in two different buggy outcomes.

## 🔥 Affected Versions

- All versions of routing-release before 0.277.0

## ✔️ Operator Checklist

1. Read this doc.
2. Compare the listening ports on your TCP Router VM to the list below. See how here.
3. Update your manifest so that `routing_api.reserved_system_component_ports` matches the ports you found in step 2. See bosh property details here.
4. Upgrade to a version of routing-release with these fixes.
5. Look at the TCP Router logs to see if any existing router groups are invalid. See logs to look for here.
6. Fix invalid router groups. See the routing-api documentation here.
7. Re-run the check to make sure all router groups are valid. See how here.

## 🐛 Bug Variation 1 - TCP Router claims the port first

### Symptoms

1. Some bosh job on the TCP Router VM fails to start. This will likely cause a deployment to fail.
2. There are logs for the failing job saying that it was unable to bind to its port:

```
2020/10/13 22:12:20 Metrics server closing: listen tcp :14726: bind: address already in use
2020/10/13 22:12:20 stopping metrics-agent
```

3. Run `netstat -tlpn | grep PORT` and see that haproxy is running on the port that the bosh job tried to bind to.

### Explanation

If a TCP route gets the port before the bosh job does, then the job will fail to bind to its port.

## 🐞 Bug Variation 2 - Internal component claims the port first

### Symptoms

1. You created a TCP route, but it doesn't work.
2. Check the TCP Router logs and see that it failed to bind to the port for the TCP route:

```json
{"timestamp":"2020-10-01T21:23:17.526206817Z","level":"info","source":"tcp-router","message":"tcp-router.writing-config","data":{"num-bytes":826}}
{"timestamp":"2020-10-01T21:23:17.526332658Z","level":"info","source":"tcp-router","message":"tcp-router.running-script","data":{}}
{"timestamp":"2020-10-01T21:23:19.581306843Z","level":"info","source":"tcp-router","message":"tcp-router.running-script","data":{"output":"[ALERT] 274/212317 (43) : Starting proxy listen_cfg_2822: cannot bind socket [0.0.0.0:2822]\n"}}
{"timestamp":"2020-10-01T21:23:19.581361142Z","level":"error","source":"tcp-router","message":"tcp-router.failed-to-run-script","data":{"error":"exit status 1"}}
```

3. Run `netstat -tlpn | grep PORT` and see that some other process is running on the port that the TCP route is trying to use.

### Explanation

The TCP Router will fail to load the new config containing the new TCP route, because something else is already bound to the conflicting port. This prevents ALL new TCP routes from working for as long as the conflicting port is in the config. It will not cause the bosh job for the TCP Router to fail. This bug is dangerous because it is easy to miss and can affect many users.

## 🧰 Fix

### Overview

The fix for this issue focuses on preventing the creation of router groups that conflict with system component ports. We have done this via:

- a runtime check when creating and updating router groups
- a deploy-time check for existing router groups

These fixes are available in routing-release v0.277.0+. If you cannot upgrade at this time, you can fix your router groups manually. See here for instructions.

### New Bosh Properties

| Bosh Property | Description | Default |
| --- | --- | --- |
| `routing_api.reserved_system_component_ports` | Array of ports that are reserved for system components. Users will not be able to create router groups with ports that overlap with this value. See Appendix A in this document for which system components use these ports. If you run anything else on your TCP Router VM, you must add its port to this list, or you risk still running into this bug. | See Appendix A |
| `tcp_router.fail_on_router_port_conflicts` | Fail the TCP Router if `routing_api.reserved_system_component_ports` conflicts with ports in existing router groups. We suggest giving your users a chance to update their router groups before setting this to `true`. | `false` |
| `routing_api.fail_on_router_port_conflicts` | By default this is set to the same value as `tcp_router.fail_on_router_port_conflicts`. If `true`, then API calls to create or update router groups will fail if the `reservable_ports` conflict with `routing_api.reserved_system_component_ports`. | `false` |
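A minimal manifest sketch showing how these properties fit together (the instance group and job names, and the port list, are illustrative assumptions; derive the real port list from your own TCP Router VM as described in the FAQ below):

```yaml
# Illustrative manifest fragment -- names and ports are assumptions, not defaults.
instance_groups:
- name: tcp-router
  jobs:
  - name: tcp_router
    properties:
      tcp_router:
        fail_on_router_port_conflicts: false   # set to true once router groups are fixed
- name: api
  jobs:
  - name: routing-api
    properties:
      routing_api:
        reserved_system_component_ports: [2822, 2825, 14726, 14727]
        fail_on_router_port_conflicts: false
```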

### Runtime Check Details

If `routing_api.fail_on_router_port_conflicts` is `true`, then when a user tries to create or update a router group to include a port in `routing_api.reserved_system_component_ports`, they will get a 400 status code and the following error:

```json
{"name":"ProcessRequestError","message":"Cannot process request: Invalid ports. Reservable ports must not include the following reserved system component ports: [2822 2825 3458 3459 3460 3461 8853 9100 14726 14727 14821 14822 14823 14824 14829 15821 17002 35095 39873 40177 42393 46567 53035 53080]."}
```
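For reference, the kind of request that triggers this error is a router group update against the routing-api. A hedged sketch (the guid, token, and reservable ports are placeholders; the hostname and port 3000 assume the routing-api's conventional defaults, so check your own deployment):

```shell
# Placeholders: $GUID is your router group guid, $TOKEN a UAA token with a
# scope allowed to write router groups (check your UAA configuration).
GUID="some-router-group-guid"
TOKEN="some-uaa-token"
curl --max-time 5 -s -X PUT \
  "http://routing-api.service.cf.internal:3000/routing/v1/router_groups/$GUID" \
  -H "Authorization: bearer $TOKEN" \
  -d '{"reservable_ports": "2822-2830"}' || true
```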

### Deploy-time Check Details

When the TCP Router starts, it checks all existing router groups against the `routing_api.reserved_system_component_ports` property. To re-run this check, you can `monit restart` the TCP Router.

You will see the following in the TCP Router logs:

#### If there are invalid router groups and `tcp_router.fail_on_router_port_conflicts` is `false`

1. You will see `tcp-router.router-group-port-checker-error: WARNING! In the future this will cause a deploy failure.`
2. You will also see a list of which router groups contain the conflicting ports:

```json
{
  "timestamp": "2021-05-03T20:59:43.127270911Z",
  "level": "error",
  "source": "tcp-router",
  "message": "tcp-router.router-group-port-checker-error: WARNING! In the future this will cause a deploy failure.",
  "data": {
    "error": "The reserved ports for router group 'group-1' contains the following reserved system component port(s): '14726, 14727, 14821, 14822, 14823, 14824, 14829, 15821, 17002'. Please update your router group accordingly.\nThe reserved ports for router group 'group-2' contains the following reserved system component port(s): '40177'. Please update your router group accordingly."
  }
}
```

#### If there are invalid router groups and `tcp_router.fail_on_router_port_conflicts` is `true`

1. You will see `tcp-router.router-group-port-checker-error: Exiting now.`
2. You will also see a list of which router groups contain the conflicting ports.
3. monit will then report the TCP Router as failing:

```json
{
  "timestamp": "2021-05-03T21:04:02.507129979Z",
  "level": "error",
  "source": "tcp-router",
  "message": "tcp-router.router-group-port-checker-error: Exiting now.",
  "data": {
    "error": "The reserved ports for router group 'group-1' contains the following reserved system component port(s): '14726, 14727, 14821, 14822, 14823, 14824, 14829, 15821, 17002'. Please update your router group accordingly.\nThe reserved ports for router group 'group-2' contains the following reserved system component port(s): '40177'. Please update your router group accordingly."
  }
}
```

#### If the seeded router groups in `routing_api.router_groups` are invalid and `routing_api.fail_on_router_port_conflicts` is `true`

1. The routing-api job will cause the deployment to fail.
2. You will see the following log in `routing-api.stdout.log`:

```json
{
  "timestamp": "2021-05-03T21:04:02.507129979Z",
  "source": "routing-api",
  "message": "routing-api.failed-load-config",
  "log_level": 2,
  "data": {
    "error": "Invalid ports. Reservable ports must not include the following reserved system component ports: [2822 2825 3457 3458 3459 3460 3461 8853 9100 14726 14727 14821 14822 14823 14824 14829 14830 14920 14922 15821 17002 53035 53080]."
  }
}
```

#### If there are no invalid router groups

1. You will see `tcp-router.router-group-port-checker-success: No conflicting router group ports.`

```json
{
  "timestamp": "2021-05-03T21:08:32.733453194Z",
  "level": "info",
  "source": "tcp-router",
  "message": "tcp-router.router-group-port-checker-success: No conflicting router group ports.",
  "data": {}
}
```
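To find these checker messages quickly on the VM, you can grep the job's stdout log. The path below is the conventional bosh job log location, which is an assumption; adjust it if your deployment logs elsewhere:

```shell
# Assumed log path -- adjust if your deployment logs elsewhere.
LOG=/var/vcap/sys/log/tcp_router/tcp_router.stdout.log
# Print both warning and success lines from the port checker, if any.
grep -E 'router-group-port-checker-(error|success)' "$LOG" 2>/dev/null || true
```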

## 🗨️ FAQ

### ❓ Do I really need to check the ports running on my TCP Router VM?

Yes. You might have custom jobs running in your deployment. If you don't include all in-use ports, you risk running into this bug, which will break TCP routes.

### ❓ How can I see what ports are in use on my TCP Router VM?

1. SSH onto your TCP Router VM and become root.
2. Run `netstat -tlpn | grep -v haproxy`. Ignore haproxy, since those listeners are TCP routes and we are looking for system components.
3. To sort them all nicely, try: `netstat -tlpn | grep -v haproxy | cut -d" " -f16 | cut -d":" -f2 | grep -v For | sort -n`
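The `cut` pipeline above depends on exact column spacing, which varies between netstat versions. A slightly more robust alternative sketch, assuming Linux netstat output where the fourth column is the Local Address:

```shell
# Skip the two netstat header lines, take the port number from the end of the
# Local Address column (works for both 0.0.0.0:PORT and :::PORT), then print
# each listening port once in numeric order.
netstat -tlpn 2>/dev/null | grep -v haproxy \
  | awk 'NR>2 {n=split($4,a,":"); print a[n]}' | sort -nu
```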

### ❓ I see something running on port 22! Why isn't that included in `routing_api.reserved_system_component_ports`?

Router groups have never been allowed to use ports 0-1023, so you don't need to exclude them explicitly.

### ❓ Why aren't my ports for udp-forwarder and system-metrics-scraper included in `routing_api.reserved_system_component_ports`?

Currently these jobs choose any open ephemeral port when they start. This is problematic for this bug and will be fixed soon. You can track the issue for udp-forwarder here and for system-metrics-scraper here.

### ❓ I fixed my router groups. How can I re-run the check?

You can re-run the check with `monit restart` on the TCP Router, or you can wait for the next deploy, which will restart the TCP Router.

### ❓ In the logs it says that there is a conflicting port, but everything is running just fine. What's up with that?

Either (1) you don't have a system component running on that port and everything is fine, or (2) you have a ticking time bomb and will likely run into this bug soon.

To see if there is a system component using that port, run `netstat -tlpn | grep PORT` on the TCP Router VM. If there is no system component running there, then you are fine and you can remove the port from `routing_api.reserved_system_component_ports`. If there is a system component running there, then you should update your router group to exclude that port ASAP.

### ❓ I can't upgrade yet. Is there another way I could check for invalid router groups?

Yes! You don't need our fancy automation; you can do it yourself. First, grab all of the in-use ports from the TCP Router VM (see instructions here). Then grab all of your router groups (see docs here). Finally, check all of the router groups to make sure they don't include any of the system component ports.
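The manual check above can be sketched in plain shell. Everything here is illustrative: fill `SYSTEM_PORTS` with the ports you found via netstat and `RANGES` with the `reservable_ports` ranges from your router groups:

```shell
# Illustrative inputs -- replace with your real system ports and router group ranges.
SYSTEM_PORTS="2822 2825 14726"
RANGES="1024-1123 14700-14800"

conflicts=""
for range in $RANGES; do
  lo=${range%-*}; hi=${range#*-}   # handles "lo-hi" ranges; a single port yields lo=hi
  for p in $SYSTEM_PORTS; do
    if [ "$p" -ge "$lo" ] && [ "$p" -le "$hi" ]; then
      conflicts="$conflicts $p"
      echo "conflict: system component port $p falls inside reservable range $range"
    fi
  done
done
if [ -z "$conflicts" ]; then
  echo "no conflicting router group ports"
fi
```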

You will also need to check the router groups seeded in the `routing_api.router_groups` property. Even though this property is only used to seed router groups on the very first deploy, it cannot contain invalid router groups. Either delete these seeded router groups from the manifest (this will have no effect on already-created router groups) or fix them to contain valid ports only.

### ❓ Why can't you detect what is running on the VM and see what ports are used? Why is there a deploy-time configured list?

We wanted both a runtime and a deploy-time check for misconfigured router groups, so that we can validate all existing router groups as well as router groups that are updated or created in the future. It is hard to determine at deploy time what will be running on a VM, so we determined that a configured list was the simplest solution.

### ❓ Will I ever have to update this list?

Maybe, but not often. If a new system component starts running on the TCP Router VM, the release notes will include instructions for updating this list. Of course, if you have your own custom deployment setup, we can't warn you when this happens.

### ❓ I got a `router-group-port-checker-error` in the TCP Router logs. What does that mean?

This error means that the port checker was unable to verify whether your router groups contain ports that overlap with `routing_api.reserved_system_component_ports`. This can happen for a few reasons:

- The tcp_router client may not be authorized via UAA to view router groups. See this PR for an example of how to fix this.
- There could be a problem connecting to UAA. Debug your network connection and then re-run the check.
- There could be a problem connecting to the routing-api. Debug your network connection and then re-run the check.

## 📝 Appendix A: Default System Component Ports

This is a list of all of the system components in a default cf-deployment that might be running on the TCP Router VM, along with their ports. These are the default values for the `routing_api.reserved_system_component_ports` property.

Some of these ports are configurable and may not match what is running on your deployment. You are responsible for checking this list against what is actually running on your deployment.

Note: Router groups have never been allowed to use ports 0-1023, so you don't need to exclude them explicitly.

| Port | System Component or Job Name | Bosh Property Name | Bosh Link? | Note |
| --- | --- | --- | --- | --- |
| 2822 | monit | n/a | n/a | Not configurable. See code here. |
| 2825 | bosh agent | n/a | n/a | Not configurable. See code here. |
| 3457 | loggr-udp-forwarder-agent | `listening_port` | no | See bosh property here. |
| 3458 | loggr-forwarder-agent | `grpc_port` | no | See bosh property here. |
| 3459 | loggregator_agent | `grpc_port` | yes | See bosh property here. This is overwritten in the default cf-deployment here. |
| 3460 | loggr-syslog-agent | `port` | no | This is overwritten in the default cf-deployment here. |
| 3461 | metrics-agent | `port` | no | See bosh property here. |
| 8853 | bosh-dns-health | `health.server.port` | no | See bosh property here. |
| 9100 | otel-collector | `ingress.grpc.port` | no | Used by otel-collector as the main ingress port to receive OTLP over gRPC. This port was reclaimed from system-metrics-agent, which now uses 53035 everywhere. See bosh property here. |
| 14726 | metrics-agent | `metrics_exporter_port` | no | Prometheus endpoint. See bosh property here. |
| 14727 | metrics-agent | `metrics.port` | no | Agent's own metrics and debug. See bosh property here. |
| 14821 | prom-scraper | `metrics.port` | no | See bosh property here. |
| 14822 | loggr-syslog-agent | `metrics.port` | no | See bosh property here. |
| 14823 | loggr-forwarder-agent | `metrics.port` | no | See bosh property here. |
| 14824 | loggregator_agent | `metrics.port` | no | See bosh property here. |
| 14829 | loggr-udp-forwarder | `metrics.port` | no | See bosh property here. |
| 14830 | otel-collector | TBD | n/a | This port is used for the collector's metrics. It was previously used by loggr-udp-forwarder but was disabled there. See this issue for more historical information. |
| 14920* | system-metrics-scraper | `metrics_port` | no | *This job does not run on the TCP Router or Gorouter! However, you should not use this port for an agent that will be deployed alongside that job. See bosh property here. |
| 14921* | system-metrics-scraper | n/a | n/a | *This port was considered for a debug port, but it turns out it is in use by leadership-election, which does not run on the TCP Router. It is not reserved on the TCP Router. See this issue for more information. |
| 14922 | system-metrics-agent | `debug_port` | no | See bosh property here. |
| 15821 | metrics-discovery-registrar | `metrics.port` | no | See bosh property here. |
| 17002 | cf-tcp-router | `tcp_router.debug_address` | yes | See bosh property here. |
| 53035 | system-metrics-agent | `metrics_port` | no | This is the new default. See the bosh property here. This used to be configured by an ops file in cf-deployment. |
| 53080 | bosh-dns | `api.port` | no | See bosh property here. |