
Should AWS-SNAT-CHAIN-0 be vpcCIDRs or should it be subnetCIDRs? With.... #550

Closed
rshutt opened this issue Jul 31, 2019 · 4 comments

rshutt commented Jul 31, 2019

Howdy,

Looking at this whole thing... using CGNAT space and custom subnets... Shouldn't AWS-SNAT-CHAIN-0 be more specific? Basically, it's not SNATing anything that is destined for the VPC... This means that Pods talking to anything else in the entire VPC will NOT be SNATed and will have their communications and return packets not traverse the Kubelet at all. It also means that the IP addresses behind the Kubelet (i.e. the CGNAT space) have to be routable. This seems to me to be a violation of the very idea of the Pod network being a virtual network that does not exist in reality.

I've read the code and I do not see anything that looks like I could easily fix this other than perhaps disabling SNAT and then manually adding my own rules that do the SNAT at the kubelet for any traffic that is originating from the POD network unless the destination is also the POD network? Or am I clinging to non-EKS models too much?

It just feels wrong to have Pod <-> out-of-cluster communications NOT traverse the Kubelet, instead routing directly through the VPC's virtual routers.

We wind up with this rule being first and there seems to be no way to usurp it. X.X.0.0/16 is the CIDR of the VPC. Also notable is the misspelled "CHAN" vs. "CHAIN". Reading the source, this should not be a thing... Perhaps I'm using an older version?

-A POSTROUTING -m comment --comment "AWS SNAT CHAN" -j AWS-SNAT-CHAIN-0
-A AWS-SNAT-CHAIN-0 ! -d X.X.0.0/16 -m comment --comment "AWS SNAT CHAN" -j AWS-SNAT-CHAIN-1
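
For reference, the manual workaround described above could look roughly like the sketch below. This is an illustration only, assuming the plugin's own SNAT has already been disabled; the Pod CIDR is a placeholder.

  # Sketch: with the plugin's SNAT disabled, masquerade anything leaving the
  # Pod network unless it is destined for another Pod.
  # POD_CIDR is a placeholder for the CGNAT range in use.
  POD_CIDR="100.64.0.0/10"

  iptables -t nat -A POSTROUTING -s "${POD_CIDR}" ! -d "${POD_CIDR}" \
    -m comment --comment "manual pod SNAT" -j MASQUERADE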

More details:

// NetworkAPIs defines the host level and the eni level network related operations
type NetworkAPIs interface {
        // SetupNodeNetwork performs node level network configuration
        SetupHostNetwork(vpcCIDR *net.IPNet, vpcCIDRs []*string, primaryMAC string, primaryAddr *net.IP) error
}

and, from the SNAT setup:

        type snatCIDR struct {
                cidr        string
                isExclusion bool
        }
        var allCIDRs []snatCIDR
        for _, cidr := range vpcCIDRs {
                allCIDRs = append(allCIDRs, snatCIDR{cidr: *cidr, isExclusion: false})
        }
        for _, cidr := range n.excludeSNATCIDRs {
                allCIDRs = append(allCIDRs, snatCIDR{cidr: cidr, isExclusion: true})
        }
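
To make the chain layout above concrete, here is a rough shell illustration (not the plugin's actual code) of how a list of CIDRs ends up as one "! -d" hop per AWS-SNAT-CHAIN-N, which is why the very first chain already exempts the whole VPC CIDR from SNAT. The CIDR list is a placeholder.

  # Illustration only: each CIDR becomes a "! -d" jump to the next chain, so
  # traffic destined for any listed CIDR drops out early and is never SNATed.
  CIDRS=("10.0.0.0/16" "100.64.0.0/10")
  iptables -t nat -N AWS-SNAT-CHAIN-0 2>/dev/null || true
  i=0
  for cidr in "${CIDRS[@]}"; do
    next=$((i + 1))
    iptables -t nat -N "AWS-SNAT-CHAIN-${next}" 2>/dev/null || true
    iptables -t nat -A "AWS-SNAT-CHAIN-${i}" ! -d "${cidr}" \
      -m comment --comment "AWS SNAT CHAIN" -j "AWS-SNAT-CHAIN-${next}"
    i=${next}
  done
  # The last chain is where the actual SNAT/MASQUERADE rule would live.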

rshutt commented Jul 31, 2019

After speaking directly with support, I was informed that there's no real way to make the AWS VPC CNI environment "feel" like an on-prem or vanilla Kubernetes solution wherein the only access to the Pod IP network is via a Kubelet... hopefully through the Services/Endpoints framework.

Sure, it could be simulated with security groups on the ENIConfigs, but anything in the same VPC is going to think it can route straight to a Pod, and unless we can usurp the AWS-SNAT-CHAIN-0 rule, which prevents NATing anything destined for a vpcCIDR, such a security group would be problematic.

I guess if this were a feature request, I'd ask that there be a flag to force all pod outbound traffic to SNAT behind the kubelet IP unless it was headed for another Pod IP on the same cluster.

Thoughts?


mogren commented Aug 2, 2019

Hi @ShuttR,
You are right that it used to be misspelled. It was changed in PR #520, which added the option to skip SNATing within the VPC. Is this what you were looking for?
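
If the option referenced here is the exclude-CIDRs knob visible in the snippet above (n.excludeSNATCIDRs), it is set through an environment variable on the aws-node DaemonSet. The variable name and CIDR below are best-effort assumptions; check the plugin docs for the exact spelling in your version.

  # Assumed variable name (verify against your plugin version); the CIDR is a placeholder.
  kubectl set env daemonset/aws-node -n kube-system \
    AWS_VPC_K8S_CNI_EXCLUDE_SNAT_CIDRS="10.1.0.0/16"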


rshutt commented Aug 2, 2019

@mogren Actually I kinda wanted the opposite... forced SNATing unless the Pod was talking to another Pod on the same cluster. Basically, like it would be with an on-premise solution not leveraging the AMZN VPC, but without building a VXLAN fabric or scope-limited BGP peering for the Pod network. It does not make sense to me that some random EC2 instance within my VPC can speak directly to a Pod when that Pod initiates the connection. It seems like I can use security groups on the ENIConfig-allocated subnets to simulate non-reachability save through the Kube Proxy/IPTables/IPVS, but the outbound is literally short-circuited by the IPTables rule at the head of the SNAT chains, since that rule will not match packets with a destination inside the VPC CIDR and therefore will not SNAT as Pods send traffic through the docker bridge to the real network.


rshutt commented Aug 2, 2019

That said, we're actually stepping away from the CUSTOM side of this CNI plugin, because they simply decided to allow us a larger set of private subnets, which obviates the need to use the CGNAT 100.64/10 space in the first place... But not before I wrote a big old thing bifurcating the eksctl CF instantiation of the control plane and the nodes, inserting the DaemonSet envs, and then doing a big old loop in a... bash script... bitshifting (( >> 24 )) etc. on the actual 32-bit addresses to find the "next available" CIDR block of size 22, spin up subnets, tag them correctly (owned and named), associate them with the VPC, and so on (plus the delete_cluster functionality to reverse it). I think they liked giving me more 1918 space better than giving some script rights to totally mess up the VPC in the event of a bug :P

And let's be honest, the Pod initiating outbound to other services in the same VPC without SNATing at the Kubelet is probably loads more efficient. It just violates the principle that the Pod IPs are literally not real... They are now real within the VPC routing domain.

:)

R.I.P.

  for i in $(seq 0 3); do
    dottedquad[${i}]=$(( ${ipdecwork} / (( 2 ** (( 8 * (( 3 - ${i} )) )) )) )) || error "Unable to convert the $(( ${i} + 1 ))th octet: ${ipdecwork}"

    ipdecwork=$(( ${ipdecwork} - (( ${dottedquad[${i}]} * (( 2 ** (( 8 * (( 3 - ${i} )) )) )) )) )) || \
       error "Unable to carry remainder to next octet"
  done

  echo "${dottedquad[0]}.${dottedquad[1]}.${dottedquad[2]}.${dottedquad[3]}"

and

  OLDIFS=${IFS}
  IFS=.
  set -- ${ipaddr}
  IFS=${OLDIFS}

  local counter=3

  for dottedquad in "$@"; do
    ipdec=$(( ${ipdec} + (( ${dottedquad} * \
             (( 256 ** (( counter-- )) )) )) )) || \
             error "Unable to convert ${dottedquad} to decimal"
  done
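
For context, a rough sketch (hypothetical, not the actual script) of how helpers like the two quoted above could be combined to walk a parent range in /22-sized steps until a free block is found; ip_to_dec, dec_to_ip, and subnet_in_use are stand-ins for the script's own functions, and the starting range is a placeholder.

  # Hypothetical driver around helpers like the two quoted above.
  base_dec=$(ip_to_dec "10.0.0.0")      # start of the parent range (placeholder)
  step=$(( 2 ** (32 - 22) ))            # 1024 addresses per /22
  for offset in $(seq 0 "${step}" $(( step * 15 ))); do
    candidate="$(dec_to_ip $(( base_dec + offset )))/22"
    subnet_in_use "${candidate}" || { echo "${candidate}"; break; }
  done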
