hostgw backend fails to replace old route table entries #801

Closed
julia-stripe opened this issue Aug 29, 2017 · 1 comment · Fixed by #803

julia-stripe commented Aug 29, 2017

In our cluster, flannel is failing to replace existing routes in the route table with new routes. Here's a log of the failure:

{"log":"I0829 17:00:21.967998       1 network.go:83] Subnet added: 10.32.10.0/24 via 10.68.29.72\n","stream":"stderr","time":"2017-08-29T17:00:21.968055987Z"}
{"log":"W0829 17:00:21.968144       1 network.go:106] Replacing existing route to 10.32.10.0/24 via 10.68.26.131 with 10.32.10.0/24 via 10.68.29.72.\n","stream":"stderr","time":"2017-08-29T17:00:21.968211104Z"}
{"log":"E0829 17:00:21.968207       1 network.go:108] Error deleting route to 10.32.10.0/24: no such process\n","stream":"stderr","time":"2017-08-29T17:00:21.96826321Z"}
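
To spot these failures across a larger log, the entries can be parsed mechanically. This is a minimal sketch, assuming the lines are in Docker's JSON log format (one JSON object per line with `log` and `time` fields) and that the message wording matches the excerpt above. The helper name `failed_route_deletes` is hypothetical.

```python
import json
import re

# Matches the "Error deleting route" message as worded in the excerpt above.
DELETE_ERROR = re.compile(r"Error deleting route to (?P<dst>\S+): (?P<reason>.+)")


def failed_route_deletes(lines):
    """Yield (timestamp, destination, reason) for each failed route deletion."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        match = DELETE_ERROR.search(entry["log"])
        if match:
            yield entry["time"], match.group("dst"), match.group("reason").strip()


# Two of the log lines from this report, used as sample input.
sample = [
    '{"log":"I0829 17:00:21.967998       1 network.go:83] Subnet added: 10.32.10.0/24 via 10.68.29.72\\n","stream":"stderr","time":"2017-08-29T17:00:21.968055987Z"}',
    '{"log":"E0829 17:00:21.968207       1 network.go:108] Error deleting route to 10.32.10.0/24: no such process\\n","stream":"stderr","time":"2017-08-29T17:00:21.96826321Z"}',
]
failures = list(failed_route_deletes(sample))
print(failures)
```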

For some reason, when flannel sends a message to the netlink socket asking the kernel to delete the old route, the kernel returns a "no such process" error. I'm able to delete routes with sudo ip route delete, and when I delete the routes manually, flannel is able to create the new routes correctly.
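
For reference, "no such process" is the standard message for the errno value ESRCH, which is presumably what the kernel put in its netlink error reply. A quick check, written in Python purely for illustration (flannel itself is Go):

```python
import errno
import os

# ESRCH is errno 3 on Linux; print its number and standard message.
print(errno.ESRCH, os.strerror(errno.ESRCH))
```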

As a result, the route table ends up badly misconfigured and no packets can be sent through the cni0 bridge (all network connections from a container fail with "no route to host").

Expected Behavior

The route table should get updated when the subnet -> IP address mapping changes.

Current Behavior

New routes are added, but any update that requires a deletion fails.
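
The expected behavior can be modeled with a toy route table. This is only an illustrative sketch, not flannel's code: the table is a plain dict and `apply_subnet_event` is a hypothetical helper. The addresses are taken from the log above.

```python
def apply_subnet_event(routes, subnet, gateway):
    """Point `subnet` at `gateway`, replacing any stale entry for that subnet.

    `routes` maps a destination subnet (str) to its gateway IP (str).
    Returns the gateway that was replaced, or None if the route was new
    or already pointed at `gateway`.
    """
    old = routes.get(subnet)
    if old == gateway:
        return None
    routes[subnet] = gateway
    return old


# Stale entry left behind by a terminated node.
routes = {"10.32.10.0/24": "10.68.26.131"}
replaced = apply_subnet_event(routes, "10.32.10.0/24", "10.68.29.72")
print(replaced, routes)
```

In the failure described here, the equivalent of the replacement step never completes, so the stale gateway stays in the table.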

Steps to Reproduce (for bugs)

Terminate nodes in our Kubernetes cluster and bring up new nodes (with new IP addresses).

Context

Your Environment

  • Flannel version: 0.7.1 (none of the hostgw code has changed since then, though)
  • Backend used: hostgw
  • Kubernetes version (if used): 1.7.3
  • Operating System and version: Ubuntu 16.04
  • Link to your project (optional):

julia-stripe commented

Found some more information!

I straced flannel, and these are the messages it's sending to the netlink socket:

https://gist.github.com/julia-stripe/c2a4aafbccf3533d738be1e665a79eb8

I parsed them all (using pyroute2: http://docs.pyroute2.org/debug.html) and got this result for the failed message:


{'attrs': [('RTA_DST', '10.32.5.0'),
           ('RTA_GATEWAY', '10.68.28.131'),
           ('RTA_OIF', 0)],
 'dst_len': 24,
 'family': 2,
 'flags': 0,
 'header': {'flags': 5,
            'length': 52,
            'pid': 0,
            'sequence_number': 4,
            'type': 25},
 'proto': 0,
 'scope': 0,
 'src_len': 0,
 'table': 254,
 'tos': 0,
 'type': 0}

So Flannel sets RTA_OIF (the index of the outgoing network interface) to 0 when it should be 2 on our machines. This value comes from the linkIndex struct member, which appears to be unset. So it seems like linkIndex being 0, instead of the correct interface index, is the culprit.
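
For readers unfamiliar with the numeric fields in the parsed message, here is a small sketch that maps them to their names from the Linux netlink and rtnetlink headers. It shows the failed message is a route delete (type 25 is RTM_DELROUTE) against the main table, with RTA_OIF set to 0. The `describe` helper is hypothetical and only covers the values needed here.

```python
import socket

# Constants from the Linux netlink/rtnetlink headers (subset).
RTM_TYPES = {24: "RTM_NEWROUTE", 25: "RTM_DELROUTE", 26: "RTM_GETROUTE"}
NLM_FLAGS = {0x1: "NLM_F_REQUEST", 0x2: "NLM_F_MULTI", 0x4: "NLM_F_ACK", 0x8: "NLM_F_ECHO"}
RT_TABLES = {253: "RT_TABLE_DEFAULT", 254: "RT_TABLE_MAIN", 255: "RT_TABLE_LOCAL"}


def describe(msg):
    """Return a readable summary of a pyroute2-style parsed route message."""
    header = msg["header"]
    flags = [name for bit, name in sorted(NLM_FLAGS.items()) if header["flags"] & bit]
    attrs = dict(msg["attrs"])
    return {
        "operation": RTM_TYPES.get(header["type"], header["type"]),
        "flags": flags,
        "family": "AF_INET" if msg["family"] == socket.AF_INET else msg["family"],
        "table": RT_TABLES.get(msg["table"], msg["table"]),
        "route": "{}/{} via {}".format(attrs["RTA_DST"], msg["dst_len"], attrs["RTA_GATEWAY"]),
        "oif": attrs["RTA_OIF"],
    }


# The failed message exactly as parsed above.
failed = {
    "attrs": [("RTA_DST", "10.32.5.0"), ("RTA_GATEWAY", "10.68.28.131"), ("RTA_OIF", 0)],
    "dst_len": 24,
    "family": 2,
    "flags": 0,
    "header": {"flags": 5, "length": 52, "pid": 0, "sequence_number": 4, "type": 25},
    "proto": 0,
    "scope": 0,
    "src_len": 0,
    "table": 254,
    "tos": 0,
    "type": 0,
}
summary = describe(failed)
print(summary)
```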
