A proposal for CNI migration from Calico to Antrea #5578
@luolanzone Could you please take a look at this proposal? Thanks.

@ceclinux you can take a look as well.

@hjiajing Thanks for the efforts. A few questions and comments:
========== WARNING ==========
THIS IS AN EXPERIMENTAL FEATURE.
YOUR SERVICE MAY NOT WORK AS EXPECTED DURING THE MIGRATION.
IF YOU WANT TO MIGRATE YOUR SERVICE, PLEASE CHECK THE REQUIREMENTS TABLE.
+-------------------------------+----------------------+
| Requirement                   | Calico Configuration |
+-------------------------------+----------------------+
| calico version                | v3.26.1              |
+-------------------------------+----------------------+
| cluster ipam plugin           | calico-ipam          |
+-------------------------------+----------------------+
| ipip mode                     | always               |
+-------------------------------+----------------------+
| natoutgoing                   | true                 |
+-------------------------------+----------------------+
| alloweduses                   | workload, tunnel     |
+-------------------------------+----------------------+
| ipamconfig.autoallocateblocks | true                 |
+-------------------------------+----------------------+
Migrate Calico to Antrea

This document is a guide to migrating a Calico cluster to an Antrea cluster. During the migration, Services may not be available.

Terms
Steps
Example

Build antrea-migrator

$ GOOS=linux GOARCH=amd64 go build -o antrea-migrator ./main.go

Check the requirements

Not all Calico NetworkPolicies are supported by Antrea. The antrea-migrator can check whether each NetworkPolicy is supported by Antrea.

$ ./antrea-migrator print-requirements
========== WARNING ==========
THIS IS AN EXPERIMENTAL FEATURE.
YOUR SERVICE MAY NOT WORK AS EXPECTED DURING THE MIGRATION.
IF YOU WANT TO MIGRATE YOUR SERVICE, PLEASE MAKE SURE THE CALICO APISERVER IS INSTALLED.
$ ./antrea-migrator check
I1023 05:35:50.581556 1291848 check.go:47] Checking GlobalNetworkPolicy: deny-blue
I1023 05:35:50.581634 1291848 check.go:47] Checking GlobalNetworkPolicy: deny-green
I1023 05:35:50.581650 1291848 check.go:47] Checking GlobalNetworkPolicy: deny-nginx-ds
I1023 05:35:50.592310 1291848 check.go:71] Calico NetworkPolicy check passed, all Network Polices and Global Network Policies are supported by Antrea

Deploy Antrea

Deploy Antrea to the cluster, so that there are now two CNIs in the cluster. In this step, all legacy Pods still use the Calico CNI, while new Pods will use Antrea. The Calico Pods can still communicate with each other, but they cannot connect to the Antrea Pods, so Services in the cluster might be unavailable during this step. Sometimes a Pod IP might also conflict with the Antrea PodCIDR.

Convert NetworkPolicy

This step converts Calico NetworkPolicies to Antrea NetworkPolicies: each Calico GlobalNetworkPolicy is converted to an Antrea ClusterNetworkPolicy, and each Calico NetworkPolicy is converted to an Antrea NetworkPolicy.
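For illustration, a minimal Calico GlobalNetworkPolicy and the Antrea ClusterNetworkPolicy it would map to might look like the following. This is a sketch only: the policy itself is hypothetical, and the exact output of the converter may differ (e.g., the warnings about `spec.ingress[0].to` in the transcript below suggest some fields do not translate one-to-one).

```yaml
# Hypothetical input: a Calico GlobalNetworkPolicy denying ingress to "blue" Pods
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: deny-blue
spec:
  selector: color == 'blue'
  types:
  - Ingress
  ingress:
  - action: Deny
---
# Sketch of a corresponding Antrea ClusterNetworkPolicy
apiVersion: crd.antrea.io/v1beta1
kind: ClusterNetworkPolicy
metadata:
  name: deny-blue
spec:
  tier: application
  priority: 10
  appliedTo:
  - podSelector:
      matchLabels:
        color: blue
  ingress:
  - action: Drop
```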
$ ./antrea-migrator convert-networkpolicy
I1023 06:13:18.813386 1721989 convert-networkpolicy.go:72] Converting Namespaced NetworkPolicy
I1023 06:13:18.814819 1721989 convert-networkpolicy.go:77] Converting Global NetworkPolicy
I1023 06:13:18.920224 1721989 convert-networkpolicy.go:97] "Creating Antrea Cluster NetworkPolicy" ClusterNetworkPolicy="deny-blue"
W1023 06:13:18.962208 1721989 warnings.go:70] unknown field "spec.ingress[0].to"
I1023 06:13:18.962989 1721989 convert-networkpolicy.go:97] "Creating Antrea Cluster NetworkPolicy" ClusterNetworkPolicy="deny-green"
W1023 06:13:18.977663 1721989 warnings.go:70] unknown field "spec.ingress[0].to"
I1023 06:13:18.977908 1721989 convert-networkpolicy.go:97] "Creating Antrea Cluster NetworkPolicy" ClusterNetworkPolicy="deny-nginx-ds"
W1023 06:13:18.992997 1721989 warnings.go:70] unknown field "spec.ingress[0].to"
# Check the Network Policy
$ kubectl get globalnetworkpolicies.crd.projectcalico.org
NAME AGE
default.deny-blue 3m29s
default.deny-green 3m26s
default.deny-nginx-ds 3m22s
$ kubectl get clusternetworkpolicies.crd.antrea.io
NAME TIER PRIORITY DESIRED NODES CURRENT NODES AGE
deny-blue application 10 1 1 41s
deny-green application 10 0 0 41s
deny-nginx-ds application 10 1 1 41s

Kill sandbox

This step kills all sandbox (pause) containers, so that the workload Pods are restarted in place on the Antrea CNI. During this step, one Job named antrea-migrator-kill-sandbox-$node is created per Node. While this is running, the workload Pods' statuses could be affected temporarily.

cat <<EOF | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: antrea-migrator-kill-sandbox-$node
  namespace: kube-system
spec:
  template:
    spec:
      nodeName: $node
      hostPID: true
      containers:
      - name: antrea-migrator-kill-sandbox
        image: busybox:1.28
        command:
        - "/bin/sh"
        - "-c"
        - "pkill -9 /pause"
        securityContext:
          privileged: true
      restartPolicy: Never
EOF
========== Creating Sandbox killer job on Node: kind-control-plane ==========
job.batch/antrea-migrator-kill-sandbox-kind-control-plane unchanged
========== Creating Sandbox killer job on Node: kind-worker ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker unchanged
========== Creating Sandbox killer job on Node: kind-worker10 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker10 unchanged
========== Creating Sandbox killer job on Node: kind-worker6 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker6 created
========== Creating Sandbox killer job on Node: kind-worker8 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker8 created
========== Creating Sandbox killer job on Node: kind-worker9 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker9 created
========== Waiting for Sandbox killer job on Node: kind-control-plane ==========
job.batch/antrea-migrator-kill-sandbox-kind-control-plane condition met
========== Waiting for Sandbox killer job on Node: kind-worker ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker condition met
========== Waiting for Sandbox killer job on Node: kind-worker10 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker10 condition met
========== Waiting for Sandbox killer job on Node: kind-worker8 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker8 condition met
========== Waiting for Sandbox killer job on Node: kind-worker9 ==========
job.batch/antrea-migrator-kill-sandbox-kind-worker9 condition met

Remove Calico CNI

This step removes the Calico CNI by deleting its manifest; all Calico network interfaces and the corresponding static IP routes will be removed.

kubectl delete -f https://docs.projectcalico.org/manifests/calico.yaml

Remove legacy iptables rules of Calico

Although the Calico CNI is removed, its legacy iptables rules still exist on each Node. This step removes them.

antrea_agents=$(kubectl get pods -n kube-system -l app=antrea -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
for antrea_agent in $antrea_agents; do
echo "========== Removing Calico iptables rules on Node: $antrea_agent =========="
kubectl exec $antrea_agent -n kube-system -- /bin/sh -c "iptables-save | grep -v cali | iptables-restore"
done
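The cleanup above relies on `iptables-save | grep -v cali | iptables-restore` dropping every rule and chain whose name contains `cali`. A runnable sketch with fabricated sample rules (these lines are illustrative, not taken from a real node) shows what the filter keeps:

```shell
# Fabricated iptables-save excerpt: two Calico lines, one Antrea line.
# Piping through `grep -v cali` drops everything that mentions "cali",
# leaving only the Antrea rule.
cat <<'EOF' | grep -v cali
-A FORWARD -j cali-FORWARD
-A FORWARD -j ANTREA-FORWARD
:cali-FORWARD - [0:0]
EOF
```

On a real node the surviving ruleset would then be fed back into `iptables-restore`, atomically replacing the tables.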
After the last discussion, I followed @tnqn 's suggestion to use a simpler way to migrate from Calico to Antrea. In this method, all Pods in the cluster are restarted in place and switched to the Antrea CNI. It does not take too much time, does not need help from the Multus CNI or rolling updates of Pods, and the downtime is acceptable. Could you please take a look? Thanks. @luolanzone @ceclinux @tnqn @edwardbadboy
How long does it take to recover using the new way?

It takes about 30 seconds for the Pods to restart in my testbed (20-Node cluster).
I did a rough test by accessing a Service continuously. The failed requests showed up at 2023/10/23 07:14:23 and disappeared at 2023/10/23 07:15:23.
@hjiajing after NP conversion, will all old Calico NPs still be kept in the cluster without any effect? If so, maybe warn the user about this, or provide an option to let the user back them up, delete them after backup, etc.
I have thought of using a DaemonSet, which is more convenient and does not need a Job for every Node. But I found that it's not easy to determine whether the "killing sandbox" job is completed: if we use a DaemonSet, the DS controller will restart completed Pods, and if we use a start-up script, the Pods will keep running.
Add a YAML file antrea-migrator.yml to help migrate clusters with other CNIs to Antrea. It will restart all Pods in-place. A new DaemonSet "antrea-migrator" is responsible for cleaning up configuration from the previous CNI (e.g., iptables rules), as well as "restarting" all Pods on all Nodes, to ensure that they are connected to the Antrea network. To restart Pods with minimum downtime and without causing them to be rescheduled, it uses the `crictl stopp` command to restart the Pods in-place. Fixes #5578 Signed-off-by: hjiajing <[email protected]>
Migrate CNI from Calico to Antrea
This document describes how to migrate CNI from Calico to Antrea.
Terms
Pre-test
Deploy Antrea directly
The CNI configuration files are located in /etc/cni/net.d/. If you have installed Calico, you can find the configuration files of the Calico CNI there; likewise for Antrea.

In Kubernetes, the kubelet picks the CNI configuration file that sorts first alphabetically. So if we deploy both Calico and Antrea, the kubelet will use the Antrea CNI configuration file (/etc/cni/net.d/10-antrea.conflist). Any Pod created after the Antrea installation will use the Antrea CNI; any Pod created before it will use the Calico CNI. In this case, the Antrea Pods can communicate with the Calico Pods, but the Calico Pods cannot communicate with the Antrea Pods (a few Calico Pods can, but most cannot).
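The alphabetical ordering can be sketched with plain shell (the directory and file names here are illustrative; the kubelet actually scans /etc/cni/net.d/):

```shell
# Emulate the kubelet choosing the lexicographically first CNI conf file.
# "10-antrea.conflist" sorts before "10-calico.conflist", so Antrea wins.
mkdir -p /tmp/cni-demo
touch /tmp/cni-demo/10-calico.conflist /tmp/cni-demo/10-antrea.conflist
ls /tmp/cni-demo | sort | head -n 1   # prints 10-antrea.conflist
```

This is also why the Multus configuration file below is named 00-multus.conf: the 00- prefix makes it sort before both CNIs.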
For example:
Multus-CNI
Multus CNI is a CNI plugin for Kubernetes that enables attaching multiple network interfaces to pods. It means that we
can use Multus CNI to attach both Calico and Antrea CNI to a Pod. In this way, we can migrate CNI from Calico to Antrea.
After deploying Calico and Antrea, we can use Multus CNI to attach both Calico and Antrea CNI to a Pod, which is a
Multus Pod.
The Multus Pods could communicate with both Calico Pods and Antrea Pods. After that, we can migrate CNI from Calico to
Antrea.
Install Multus-CNI
Using Multus-CNI to be a bridge between Calico and Antrea
After that, we can see that the nginx Pod is a Multus Pod with two network interfaces: one managed by the Calico CNI (eth0), and the other managed by the Antrea CNI (net1). The Multus Pods can still communicate with the Calico Pods.

If we remove the Multus CNI configuration file 00-multus.conf, the kubelet will use the Antrea CNI configuration file 10-antrea.conflist, so all Pods created after removing 00-multus.conf will use the Antrea CNI. They are Antrea Pods.

PodCIDR
Can every Multus Pod communicate with both Calico Pods and Antrea Pods? The answer is no. For example: we notice that the Multus Pod client cannot communicate with the Calico Pods on Nodes test-worker3, test-worker5, and test-worker11.

The root cause is that the Calico IP blocks on these Nodes are 10.10.4.128/26, 10.10.8.128/26, and 10.10.6.64/26, which overlap with the PodCIDRs of the Antrea CNI (the Node PodCIDRs). This results in conflicting entries in the routing table:
root@test-worker2:/# ip r
default via 172.18.0.1 dev eth0
10.10.0.0/24 via 10.10.0.1 dev antrea-gw0 onlink
10.10.1.0/24 via 10.10.1.1 dev antrea-gw0 onlink
10.10.2.0/24 via 10.10.2.1 dev antrea-gw0 onlink
10.10.3.0/24 via 10.10.3.1 dev antrea-gw0 onlink
10.10.4.0/24 via 10.10.4.1 dev antrea-gw0 onlink
10.10.4.128/26 via 172.18.0.19 dev tunl0 proto bird onlink
10.10.5.0/24 via 10.10.5.1 dev antrea-gw0 onlink
10.10.6.0/24 via 10.10.6.1 dev antrea-gw0 onlink
10.10.6.64/26 via 172.18.0.6 dev tunl0 proto bird onlink
10.10.7.0/24 via 10.10.7.1 dev antrea-gw0 onlink
10.10.8.0/24 via 10.10.8.1 dev antrea-gw0 onlink
10.10.8.128/26 via 172.18.0.15 dev tunl0 proto bird onlink
10.10.9.0/24 dev antrea-gw0 proto kernel scope link src 10.10.9.1
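The collision can be seen with a quick prefix check in plain shell. This is a toy sketch, assuming the /26 Calico block's addresses fall entirely inside a Node's /24 PodCIDR (as in the routes above); it is not how antrea-migrator detects overlap:

```shell
# Illustrative check: does an address with this origin fall under the
# given /24 prefix? E.g. the Calico block 10.10.4.128/26 lies inside
# the Antrea Node PodCIDR 10.10.4.0/24, so both CNIs install competing
# routes for the same addresses.
overlaps() {
  case "$1" in
    "$2".*) echo "overlap" ;;
    *)      echo "disjoint" ;;
  esac
}
overlaps 10.10.4.128 10.10.4   # Calico block origin vs. Node /24 prefix
overlaps 10.10.9.2   10.10.4   # an address outside that /24
```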
Steps to Migrate CNI from Calico to Antrea
Simple Scenario
If the PodCIDRs of the Calico IPAM blocks do not overlap with the PodCIDRs of Antrea (the Node PodCIDRs), we can migrate CNI from Calico to Antrea with the following steps.

For example:
Check the connectivity from the client to the Service: the Service nginx-origin works well. The traffic from the client to the Service goes from a Calico Pod to a Calico Pod.
❯ kubectl exec client -- /client -c 10 -n 1000 -url http://nginx-orign
2023/10/14 12:13:14 100 Requests completed
2023/10/14 12:13:16 200 Requests completed
2023/10/14 12:13:18 300 Requests completed
2023/10/14 12:13:20 400 Requests completed
2023/10/14 12:13:22 500 Requests completed
2023/10/14 12:13:24 600 Requests completed
2023/10/14 12:13:26 700 Requests completed
2023/10/14 12:13:28 800 Requests completed
2023/10/14 12:13:30 900 Requests completed
2023/10/14 12:13:32 1000 Requests completed
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Receiving Stop Signal, stopping...
2023/10/14 12:13:32 Total Requests: 1002
2023/10/14 12:13:32 Success: 0
2023/10/14 12:13:32 Failure: 1002
2023/10/14 12:13:32 Success Rate: 0.000000%
2023/10/14 12:13:32 Total time: 24.651842152s
After deploying the Antrea CNI and Multus, check the connectivity from the client to the Service again: the Service still works. The traffic from the client to the Service still goes from a Calico Pod to a Calico Pod, because all Pods are legacy Calico Pods.
❯ kubectl exec client -- /client -c 10 -n 1000 -url http://nginx-orign
2023/10/14 12:15:01 100 Requests completed
2023/10/14 12:15:03 200 Requests completed
2023/10/14 12:15:05 300 Requests completed
2023/10/14 12:15:07 400 Requests completed
2023/10/14 12:15:09 500 Requests completed
2023/10/14 12:15:11 600 Requests completed
2023/10/14 12:15:13 700 Requests completed
2023/10/14 12:15:15 800 Requests completed
2023/10/14 12:15:17 900 Requests completed
2023/10/14 12:15:19 1000 Requests completed
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Receiving Stop Signal, stopping...
2023/10/14 12:15:19 Total Requests: 1003
2023/10/14 12:15:19 Success: 0
2023/10/14 12:15:19 Failure: 1003
2023/10/14 12:15:19 Success Rate: 0.000000%
2023/10/14 12:15:19 Total time: 20.259637559s
Create a new client multus-client with the Multus CNI, then check the connectivity from multus-client to the Calico nginx Service. The traffic from multus-client to the Service goes from a Multus Pod to a Calico Pod.
❯ kubectl exec multus-client -- /client -c 10 -n 1000 -url http://nginx-origin
2023/10/14 12:23:32 100 Requests completed
2023/10/14 12:23:34 200 Requests completed
2023/10/14 12:23:36 300 Requests completed
2023/10/14 12:23:38 400 Requests completed
2023/10/14 12:23:40 500 Requests completed
2023/10/14 12:23:42 600 Requests completed
2023/10/14 12:23:44 700 Requests completed
2023/10/14 12:23:46 800 Requests completed
2023/10/14 12:23:48 900 Requests completed
2023/10/14 12:23:50 1000 Requests completed
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Receiving Stop Signal, stopping...
2023/10/14 12:23:50 Total Requests: 1005
2023/10/14 12:23:50 Success: 1005
2023/10/14 12:23:50 Failure: 0
2023/10/14 12:23:50 Success Rate: 100.000000%
2023/10/14 12:23:50 Total time: 20.414199716s
Restart the nginx DaemonSet; after the restart, remove the Multus CNI. After this, the Antrea CNI has higher priority than Calico, so all new Pods will be Antrea Pods.

Check the connectivity from the Antrea client to the Multus nginx: it still works well.
❯ kubectl exec antrea-client -- /client -c 10 -n 1000 -url http://nginx-origin
2023/10/14 12:27:32 100 Requests completed
2023/10/14 12:27:35 200 Requests completed
2023/10/14 12:27:37 300 Requests completed
2023/10/14 12:27:39 400 Requests completed
2023/10/14 12:27:41 500 Requests completed
2023/10/14 12:27:43 600 Requests completed
2023/10/14 12:27:45 700 Requests completed
2023/10/14 12:27:47 800 Requests completed
2023/10/14 12:27:51 900 Requests completed
2023/10/14 12:27:53 1000 Requests completed
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Receiving Stop Signal, stopping...
2023/10/14 12:27:53 Total Requests: 1008
2023/10/14 12:27:53 Success: 1008
2023/10/14 12:27:53 Failure: 0
2023/10/14 12:27:53 Success Rate: 100.000000%
2023/10/14 12:27:53 Total time: 24.287786997s
Restart nginx again; then all Pods will use only the Antrea CNI, with a single NIC. Check the connectivity from the Antrea client to nginx: it still works well. The traffic from the Antrea client to nginx goes from an Antrea Pod to an Antrea Pod.
❯ k exec antrea-client -- /client -c 10 -n 1000 -url http://nginx-origin
2023/10/14 12:35:44 100 Requests completed
2023/10/14 12:35:46 200 Requests completed
2023/10/14 12:35:48 300 Requests completed
2023/10/14 12:35:50 400 Requests completed
2023/10/14 12:35:52 500 Requests completed
2023/10/14 12:35:54 600 Requests completed
2023/10/14 12:35:56 700 Requests completed
2023/10/14 12:35:58 800 Requests completed
2023/10/14 12:36:00 900 Requests completed
2023/10/14 12:36:02 1000 Requests completed
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Receiving Stop Signal, stopping...
2023/10/14 12:36:02 Total Requests: 1003
2023/10/14 12:36:02 Success: 1003
2023/10/14 12:36:02 Failure: 0
2023/10/14 12:36:02 Success Rate: 100.000000%
2023/10/14 12:36:02 Total time: 20.248457447s
A Small Part of PodCIDR Overlapped
Most of PodCIDR Overlapped
Solution 1
Edit the Calico IPPool to avoid overlapping PodCIDRs. The default Calico IP block size is /26, which is smaller than the Node PodCIDR, so maybe we can migrate the IPPool to another range that does not overlap with the Node PodCIDRs, and then perform the migration above. See: calico-ippool-migration
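A sketch of what such a replacement pool could look like. The name and CIDR here are hypothetical and must be chosen for the specific cluster; the fields follow Calico's IPPool CRD and match the requirements table above (IPIP always, NAT outgoing):

```yaml
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: migration-pool          # hypothetical name
spec:
  cidr: 192.168.128.0/18       # hypothetical range, chosen not to overlap the Node PodCIDRs
  blockSize: 26
  ipipMode: Always
  natOutgoing: true
```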
Solution 2
The hard way: migrate CNI from Calico to Antrea on a small subset of Nodes first, then on the rest of the Nodes. The connection between the two groups of Nodes may be broken during the migration.
Files to deliver

- antrea-migrate.sh: a script to migrate CNI from Calico to Antrea.
- antrea-migrator: a tool to convert Calico NetworkPolicies to Antrea NetworkPolicies, check for overlapping CIDRs, and so on.
- antrea-migrator source code (maybe we could add it to antctl).

Reference
multus-cni: https://github.com/k8snetworkplumbingwg/multus-cni
some scripts: https://github.com/hjiajing/studious-doodle (WIP)