Calico/VPP NSM integration #325

Open · 6 tasks

edwarnicke opened this issue Jul 15, 2021 · 22 comments
@edwarnicke (Member) commented Jul 15, 2021

Calico allows for a choice of dataplanes. VPP is one of them.

Normally, cmd-forwarder-vpp starts its own instance of VPP in its own Pod.

In response to a request for integration between NSM and Calico/VPP, the process for integration was described.

This issue is about actually trying (and shaking the bugs out of) such integration.

This breaks down into a number of steps:

  • Calico/VPP NSM integration testing on kind
  • Calico/VPP NSM integration testing on GKE
  • Calico/VPP NSM integration testing on AKS
  • Calico/VPP NSM integration testing on AWS
  • Calico/VPP NSM integration testing on Packet
  • Calico/VPP NSM integration testing on Packet with Calico/VPP binding to the interface with vfio.
@edwarnicke (Member Author)
This fix is needed for Calico/VPP to work in Kind:
projectcalico/vpp-dataplane#204

Even though that PR has not been merged, the docker images have been pushed and these steps should work:
https://github.com/projectcalico/vpp-dataplane/pull/204/files#diff-9004d08acd588e7b7e93a8ff6fbe357d4eba3adc003d48ab4b7bed0186af1a11R1

@AloysAugustin
Hi @edwarnicke, just a note that testing the integration in GKE will be complex at this stage, because we haven't found a way to override the default CNI in GKE, so Calico/VPP doesn't work there for now.
Also, why do you distinguish in the last two steps between Calico/VPP owning the main interface or not? Calico/VPP makes relatively strong assumptions that it owns the main interface (= the interface that has the k8s Node address). Giving Calico/VPP another interface will likely result in a non-functional cluster. As a side note, we are starting to look at giving more than one interface to VPP in a Calico/VPP deployment, but that isn't supported yet.

@edwarnicke (Member Author)
Also, why do you distinguish in the last two steps between Calico/VPP owning the main interface or not? Calico/VPP makes relatively strong assumptions that it owns the main interface (= the interface that has the k8s Node address). Giving Calico/VPP another interface will likely result in a non-functional cluster. As a side note, we are starting to look at giving more than one interface to VPP in a Calico/VPP deployment, but that isn't supported yet.

@AloysAugustin You are correct, I should have phrased the last one differently. I was thinking in terms of 'attaches to the interface with vfio' vs. 'attaches to the interface with AF_XDP'... the idea being to attach with the highest-performance option.

@AloysAugustin
Ah, sounds good then 👍

Bolodya1997 self-assigned this Jul 26, 2021
@Bolodya1997
Calico uses VPP v21.xx, so we probably need to update the VPP version used in the Forwarder first and then try again:
networkservicemesh/cmd-forwarder-vpp#284

@Bolodya1997
@edwarnicke
We are still facing issues with the VPP/govpp versions used in Calico and in the VPP Forwarder. Currently, to make it work I need to:

  1. Build a special Calico version with all our VPP patches applied - https://github.com/Bolodya1997/vpp-dataplane/blob/nsm/vpplink/binapi/vpp_clone_current.sh
  2. Replace github.com/edwarnicke/govpp/binapi with github.com/projectcalico/vpp-dataplane/vpplink/binapi/vppapi in VPP Forwarder.

Step [2] is probably not actually needed and could be avoided by changing the govpp version used in [1] - this needs to be tested.

But it actually looks like if we want to support this integration, we need to provide our own Calico images and k8s configuration files.
Is that OK?

@Bolodya1997 commented Aug 26, 2021

  • Calico/VPP NSM integration testing on kind

There are 2 issues that need to be fixed to make this work:

  1. Use Calico VPP in Client - [integration-k8s-kind#325] Use existing VPP instance if given cmd-nsc-vpp#236 (see the sketch at the end of this comment).
  2. Make memif and memifproxy socket files shared with Calico VPP pod - [Calico/VPP NSM integration] memif and memifproxy files should be shared with VPP pod sdk-vpp#357.

Currently there is another issue: Calico and NSM use different VPP versions with different additional patches, so for testing I am currently using a Calico VPP fork with the NSM VPP patches added:
https://github.com/Bolodya1997/vpp-dataplane/blob/nsm-new/vpplink/binapi/vpp_clone_current.sh.
And a govpp fork with the Calico VPP patches added (only to the generated part, not to the generator):
https://github.com/Bolodya1997/govpp/tree/calico.

Update: Memif2Memif test case is not currently working - networkservicemesh/sdk-vpp#362.
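
For reference, a minimal sketch of what [1] amounts to on the client side: connecting to an already-running VPP over its API socket instead of spawning a private one. Upstream govpp import paths are used for illustration (the real code would use the forks above), and the socket path is an assumption that depends on how the socket is mounted into the pod:

```go
package main

import (
	"log"

	"git.fd.io/govpp.git"
	"git.fd.io/govpp.git/binapi/vpe"
)

func main() {
	// Connect to the existing (Calico) VPP instance over its API socket
	// instead of starting a private VPP. The path is an assumption.
	conn, err := govpp.Connect("/var/run/vpp/vpp-api.sock")
	if err != nil {
		log.Fatalf("failed to connect to VPP: %v", err)
	}
	defer conn.Disconnect()

	ch, err := conn.NewAPIChannel()
	if err != nil {
		log.Fatalf("failed to create API channel: %v", err)
	}
	defer ch.Close()

	// Sanity check that we are really talking to the shared VPP instance.
	reply := &vpe.ShowVersionReply{}
	if err := ch.SendRequest(&vpe.ShowVersion{}).ReceiveReply(reply); err != nil {
		log.Fatalf("ShowVersion failed: %v", err)
	}
	log.Printf("connected to VPP %s", reply.Version)
}
```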

@Bolodya1997 commented Aug 31, 2021

  • Calico/VPP NSM integration testing on Packet

Failing to start a k8s cluster with Calico on Packet, so I created an issue for the Calico team - projectcalico/vpp-dataplane#217.

Update: succeeded in setting up a cluster; currently working on tests.

Update: All basic scenarios except Memif2Memif currently work - networkservicemesh/sdk-vpp#362.

@Bolodya1997 commented Sep 1, 2021

  • Calico/VPP NSM integration testing on GKE
  • Calico/VPP NSM integration testing on AKS
Vladimir Popov Yesterday at 5:38 PM
---
Hi, I am trying to use vpp-calico with different cloud providers: [AKS, GKE, AWS].
On the project wiki I have found a page only for AWS integration. Does that mean that [AKS, GKE] currently can't
be configured to use vpp-calico?

Aloys Augustin  2 hours ago
---
Hi Vladimir, at this point only EKS is officially supported. We're working on AKS support, which may come
in the near future. GKE is less likely to be supported soon because GKE doesn't allow swapping the CNI;
however, there is always the option to deploy a self-managed cluster on Google Cloud as well.

@edwarnicke
It looks like it will hardly be possible to test NSM with Calico VPP on GKE or AKS.

@Bolodya1997
  • Calico/VPP NSM integration testing on Packet with Calico/VPP binding to the interface with vfio.

I used https://docs.projectcalico.org/reference/vpp/uplink-configuration with "Using DPDK -> With available hugepages". @edwarnicke, is this exactly what you meant by binding the interface with vfio?

All basic scenarios except Memif2Memif currently work. Additionally tested the Vfio2Noop scenario to make sure that there is no problem with VFIO - it also works well.
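
For reference, uplink driver selection lives in the calico-vpp-config ConfigMap that the linked page documents. A sketch of the relevant data keys, assumed from the vpp-dataplane release in use (key names can differ between releases):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-vpp-config
  namespace: calico-vpp-dataplane
data:
  # The interface holding the k8s Node address; Calico/VPP takes it over.
  vpp_dataplane_interface: eth0
  # Bind the uplink via DPDK; requires hugepages on the node.
  vpp_uplink_driver: dpdk
```

With the DPDK driver, the uplink is typically bound through vfio-pci, which is where the hugepages requirement comes from.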

@Bolodya1997
  • Calico/VPP NSM integration testing on AWS

Tested with the abstract sockets solution. All basic scenarios except Memif2Memif currently work.

@Bolodya1997
@edwarnicke
Do we want to have any CI running for this issue?

@edwarnicke (Member Author)
Yes

@Bolodya1997
@edwarnicke
Please take a look at the following schemes and algorithms. Are they all OK, or does something need to be implemented in some other way?

Node scheme

  1. VPP Forwarder uses the node VPP instance, aka Calico VPP.
  2. NSC and NSE use their own VPP instances.

memif to xxx

  1. NSC requests Forwarder for a memif connection.
    • netns file
  2. Forwarder requests NSE (possibly remote, over a remote Forwarder) and creates an NSE-side connection.
  3. Forwarder requests VPP to create a memif server socket.
    • abstract socket path
    • netns file
  4. Forwarder creates xconnect with VPP.
  5. Forwarder responds back to NSC.
    • abstract socket path
  6. NSC requests VPP to create a memif client socket.
    • abstract socket path

xxx to memif

  1. NSC (possibly remote, over a remote Forwarder) requests Forwarder for some connection.
  2. Forwarder requests NSE for a memif connection.
  3. NSE requests VPP to create a memif server socket.
    • abstract socket path
  4. NSE responds back to Forwarder.
    • abstract socket path
    • netns file
  5. Forwarder requests VPP to create a memif client socket.
    • abstract socket path
    • netns file
  6. Forwarder creates NSC-side connection.
  7. Forwarder creates xconnect with VPP.
  8. Forwarder responds back to NSC.

memif to memif

  1. NSC requests Forwarder for a memif connection.
    • NSC netns file
  2. Forwarder requests NSE for a memif connection.
  3. NSE requests VPP to create a memif server socket.
    • NSE abstract socket path
  4. NSE responses back to Forwarder.
    • NSE abstract socket path
    • NSE netns file
  5. Forwarder creates a memif proxy socket on the proxy abstract socket path in the NSC netns and starts transferring all data to/from the NSE abstract socket path in the NSE netns.
  6. Forwarder responds back to NSC.
    • proxy abstract socket path
  7. NSC requests VPP to create a memif client socket.
    • proxy abstract socket path
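
To make steps 3-4 of "memif to xxx" concrete, here is a rough sketch of the corresponding VPP binary API calls, using upstream govpp bindings for illustration; the abstract-socket filename syntax comes from the NSM VPP patches and is an assumption of this sketch:

```go
package forwarder

import (
	"fmt"

	"git.fd.io/govpp.git/api"
	"git.fd.io/govpp.git/binapi/interface_types"
	"git.fd.io/govpp.git/binapi/l2"
	"git.fd.io/govpp.git/binapi/memif"
)

// createMemifServer registers a memif socket, creates a memif interface in
// server (master) role on it, and cross-connects it with the NSE-side
// interface. Admin-up of the interfaces is omitted for brevity.
func createMemifServer(ch api.Channel, socketID uint32, socketFilename string,
	nseSideIfIndex interface_types.InterfaceIndex) (interface_types.InterfaceIndex, error) {
	// Step 3: create the memif server socket. An abstract-socket filename
	// (e.g. "abstract:memif.sock,netns_name=...") follows the NSM VPP
	// patches and is an assumption here.
	sockReply := &memif.MemifSocketFilenameAddDelReply{}
	if err := ch.SendRequest(&memif.MemifSocketFilenameAddDel{
		IsAdd:          true,
		SocketID:       socketID,
		SocketFilename: socketFilename,
	}).ReceiveReply(sockReply); err != nil {
		return 0, fmt.Errorf("memif socket add: %w", err)
	}

	createReply := &memif.MemifCreateReply{}
	if err := ch.SendRequest(&memif.MemifCreate{
		Role:     memif.MEMIF_ROLE_API_MASTER, // server side listens on the socket
		SocketID: socketID,
	}).ReceiveReply(createReply); err != nil {
		return 0, fmt.Errorf("memif create: %w", err)
	}

	// Step 4: xconnect the new memif interface with the NSE-side interface.
	// A real implementation sets the xconnect in both directions.
	xcReply := &l2.SwInterfaceSetL2XconnectReply{}
	if err := ch.SendRequest(&l2.SwInterfaceSetL2Xconnect{
		RxSwIfIndex: createReply.SwIfIndex,
		TxSwIfIndex: nseSideIfIndex,
		Enable:      true,
	}).ReceiveReply(xcReply); err != nil {
		return 0, fmt.Errorf("xconnect: %w", err)
	}
	return createReply.SwIfIndex, nil
}
```

The NSC side (step 6) would mirror this with the slave role pointed at the same abstract socket path.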

@edwarnicke (Member Author)
This looks about right yes :)

@Bolodya1997
@edwarnicke
Calico has integrated all the patches needed for NSM into their VPP, but we still have a different VPP version, so cmd-forwarder-vpp cannot be used directly with Calico VPP.
Should we create a new cmd-forwarder-vpp-calico with govpp generated for the Calico VPP version?
Or maybe we should use the latest Calico VPP release as a base for the NSM VPP applications and so just update govpp?

@edwarnicke (Member Author)
@Bolodya1997 I'll spin a new image with their patches, test it, and we can look at upgrading.

Is everything else working well? Is it just a matter of updating our image and landing some PRs from you?

@Bolodya1997
Is everything else working well? Is it just a matter of updating our image and landing some PRs from you?

I am working on the abstract sockets memif implementation; it will become clear after I finish and test it.
Currently it is still not clear whether or not there is an issue with LinkUP events, because it could possibly be caused by the old solution.

@Bolodya1997
Is everything else working well? Is it just a matter of updating our image and landing some PRs from you?

@edwarnicke
We have a problem with the Calico VPP setup on Packet - the internet is not accessible from pods without hostNetwork: true.
It actually looks like I am missing something in the configuration, so I filed an issue for this in the Calico repo - projectcalico/vpp-dataplane#263.

This issue affects the DNS test, but we are planning to rework it, because it would make more sense if the test ran nslookup against something like kubernetes.default instead of google.com, in which case we wouldn't need internet access.
All other basic/feature tests are working, currently I am working on CI.

@edwarnicke (Member Author)
@Mixaster995 You will probably need this: networkservicemesh/cmd-forwarder-vpp#421

@glazychev-art (Contributor)
@edwarnicke

We have tested the Calico integration PR and we have the following suggestions:

1. Cluster on which we will do the integration

1.1 Packet

This PR does the Calico integration on Packet. Problems:

  • setup is not stable - not all runs are successful
  • we use 2 clusters per test run (without Calico and with it). It's wasteful and will make it difficult to integrate other CNI plugins later. It would be better if we used one cluster: tests without Calico - cluster cleanup - tests with Calico

1.2 Kind

A cursory test found problems with the setup - Calico-VPP doesn't start. We need more time to investigate. The problem could be related to Calico or to Kind.

2. Forwarder configuration

Currently, we have 2 versions of the tests - the usual one and one for Calico. We should consider using only one version.
We can try to use an external VppAPISocket as the default (https://github.com/networkservicemesh/cmd-forwarder-vpp/blob/main/internal/config/config.go#L49) and mount this socket from the host into a specific folder on the forwarder (vpp-ext, for example).
The forwarder will check for the default VPP API socket on startup.
So, if there is one - use it (Calico case); if not - create a new VPP instance (current behavior). A sketch of this check follows.
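
A minimal sketch of that startup check, assuming a hypothetical mount point for the external socket (the real default would come from the forwarder's config.go):

```go
package forwarder

import (
	"context"
	"errors"
	"os"

	"git.fd.io/govpp.git"
	"git.fd.io/govpp.git/core"
)

// externalSocket is a hypothetical mount point for the host VPP API socket.
const externalSocket = "/var/run/vpp-ext/vpp-api.sock"

// connectVPP reuses the external (Calico) VPP when its API socket is
// present, and otherwise falls back to the current behavior of creating a
// private VPP instance.
func connectVPP(ctx context.Context) (*core.Connection, error) {
	if _, err := os.Stat(externalSocket); err == nil {
		// Calico case: a node-level VPP is already running.
		return govpp.Connect(externalSocket)
	}
	// Current behavior: spawn our own VPP; stubbed out in this sketch.
	return startOwnVPP(ctx)
}

func startOwnVPP(ctx context.Context) (*core.Connection, error) {
	return nil, errors.New("private VPP startup is outside the scope of this sketch")
}
```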

3. Healing.

As I remember, there are many chain elements that (explicitly or not) assume that forwarder death == VPP death. That is not right for the Calico case. We need to come up with correct VPP cleanup when the forwarder is restarted:

  • check all of the forwarder's chain elements
  • or create a new one, which will clear VPP completely (see the sketch below)
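
A sketch of what such a "clear VPP completely" pass could do on forwarder startup, shown for memif interfaces only; filtering out interfaces that do not belong to NSM (e.g. Calico's own) is left out here and would be required in practice:

```go
package forwarder

import (
	"fmt"

	"git.fd.io/govpp.git/api"
	"git.fd.io/govpp.git/binapi/interface_types"
	"git.fd.io/govpp.git/binapi/memif"
)

// cleanupMemifs removes memif interfaces left over from a previous forwarder
// incarnation (forwarder death != VPP death in the Calico case). Sketch only:
// a complete cleanup must also cover other interface types, the socket
// filename registrations, and must skip interfaces NSM did not create.
func cleanupMemifs(ch api.Channel) error {
	reqCtx := ch.SendMultiRequest(&memif.MemifDump{})
	var leftovers []interface_types.InterfaceIndex
	for {
		details := &memif.MemifDetails{}
		stop, err := reqCtx.ReceiveReply(details)
		if err != nil {
			return fmt.Errorf("memif dump: %w", err)
		}
		if stop {
			break
		}
		leftovers = append(leftovers, details.SwIfIndex)
	}
	for _, swIfIndex := range leftovers {
		reply := &memif.MemifDeleteReply{}
		if err := ch.SendRequest(&memif.MemifDelete{SwIfIndex: swIfIndex}).ReceiveReply(reply); err != nil {
			return fmt.Errorf("memif delete: %w", err)
		}
	}
	return nil
}
```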

Questions

  1. What do you think about Kind integration?
  2. Is it fine to use an external VPP socket as the default for the Forwarder?
  3. Any thoughts about Healing solutions?

@glazychev-art (Contributor)
Description

There is a problem with the forwarder configuration. It is related to network namespaces - Calico-VPP doesn't have grpcfd. For example, when we connect to the Endpoint, the forwarder receives the network namespace fd using grpcfd. But Calico-VPP doesn't have that fd, and therefore knows nothing about the NSE's network namespace. So when we try to create a network interface, we receive an error.

Solutions

  1. The simplest solution - use hostPID:true for the forwarder by default - see comment - [Calico/VPP NSM integration] Forwarder should create named net NS instead of using /proc/1/fd/x sdk-vpp#354 (comment)
  2. Use a shared directory between the Forwarder and Calico, where we can create namespace fds. But here we need to know, at the stage of creating the network interface, whether we are using our own VPP or Calico's.
  3. Create a proxy sidecar for Calico. This sidecar would handle some VPP API calls differently. We could send the inode to the sidecar and create a unix connection between the forwarder and the sidecar to send the fd.

We think that 1 is the preferred solution at the moment; a sketch of the idea follows. We can create an issue to use a different approach in future releases.
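
A small sketch of why hostPID: true is sufficient: with a shared PID namespace, the forwarder can hand VPP a /proc path that the node-level Calico/VPP can also resolve, instead of passing an fd over grpcfd. How the target PID is discovered is outside this sketch:

```go
package forwarder

import (
	"fmt"
	"os"
)

// netNSPath builds a network namespace path that is valid for every process
// sharing the host PID namespace. With hostPID: true, both the forwarder and
// the node-level Calico/VPP see the same /proc, so no fd passing is needed.
// pid is assumed to be obtained elsewhere (e.g. from the pod's container).
func netNSPath(pid int) (string, error) {
	path := fmt.Sprintf("/proc/%d/ns/net", pid)
	if _, err := os.Stat(path); err != nil {
		return "", fmt.Errorf("netns %s not accessible (is hostPID set?): %w", path, err)
	}
	return path, nil
}
```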

@edwarnicke
What do you think? Is networkservicemesh/sdk-vpp#354 (comment) still relevant, and can we use hostPID:true by default?
