Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flow Visibility] Add flow-aggregator with clickhouse related e2e tests #3673

Merged
merged 3 commits into from
May 4, 2022

Conversation

wsquan171
Copy link
Contributor

This PR adds the following changes to flow visibility e2e:

  • add clickhouse related e2e test cases and deployment files
  • add manifest generation with proper config for Antrea agent and flow aggregator
  • update github workflow to only run flow aggregator related test when --flow-visibility flag is passed in

@codecov-commenter
Copy link

codecov-commenter commented Apr 21, 2022

Codecov Report

Merging #3673 (ee93057) into main (f08aecd) will decrease coverage by 7.34%.
The diff coverage is n/a.

❗ Current head ee93057 differs from pull request most recent head 32620eb. Consider uploading reports for the commit 32620eb to get more accurate results

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3673      +/-   ##
==========================================
- Coverage   64.59%   57.25%   -7.35%     
==========================================
  Files         278      392     +114     
  Lines       39513    55091   +15578     
==========================================
+ Hits        25523    31541    +6018     
- Misses      12009    21085    +9076     
- Partials     1981     2465     +484     
Flag Coverage Δ
e2e-tests 46.42% <ø> (?)
integration-tests 38.02% <ø> (?)
kind-e2e-tests 52.42% <ø> (+0.15%) ⬆️
unit-tests 43.84% <ø> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...agent/flowexporter/connections/deny_connections.go 44.82% <0.00%> (-39.09%) ⬇️
pkg/agent/flowexporter/exporter/exporter.go 49.00% <0.00%> (-22.28%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 54.92% <0.00%> (-11.98%) ⬇️
.../flowexporter/connections/conntrack_connections.go 72.54% <0.00%> (-6.38%) ⬇️
pkg/controller/externalippool/controller.go 86.16% <0.00%> (-5.36%) ⬇️
pkg/agent/flowexporter/utils.go 72.34% <0.00%> (-4.26%) ⬇️
pkg/ovs/openflow/ofctrl_nxfields.go 62.06% <0.00%> (-3.45%) ⬇️
pkg/ipam/poolallocator/allocator.go 53.49% <0.00%> (-0.97%) ⬇️
pkg/agent/openflow/pipeline.go 73.83% <0.00%> (-0.93%) ⬇️
pkg/agent/controller/egress/egress_controller.go 74.35% <0.00%> (-0.69%) ⬇️
... and 141 more

ci/kind/test-e2e-kind.sh Outdated Show resolved Hide resolved
ci/jenkins/test.sh Outdated Show resolved Hide resolved
test/e2e/fixtures.go Show resolved Hide resolved
@@ -1147,3 +1376,55 @@ func matchSrcAndDstAddress(srcIP string, dstIP string, isDstService bool, isIPv6
}
return srcField, dstField
}

type ClickHouseFullRow struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not defined anywhere else in the Flow Aggregator code? I am just worried about keeping things consistent between the FA code and the test code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the struct is defined in clickhouseclient.go, but:

  • the Json tag binding is only used for test code. Not sure if it makes sense to include those tags there just for the sake of this e2e testing unmarshalling
  • (nit) attrs are not exported / public in that file
  • certain db side columns are not used as part of client (TimeInserted, Trusted), thus not included under fa.

I don't think there's too much overhead in having two structs here. If anything new is added on the FA side, the test code should be updated to take those in anyways.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The JSON tags would definitely not be an issue, however it sounds like there are other reasons for keeping them separate.

Comment on lines +724 to +736
if err != nil || rc != 0 {
// ClickHouseInstallation CRD from ClickHouse Operator install bundle applied soon before
// applying CR. Sometimes apiserver validation fails to recognize resource of
// kind: ClickHouseInstallation. Retry in such scenario.
if strings.Contains(stderr, "ClickHouseInstallation") || strings.Contains(stdout, "ClickHouseInstallation") {
return false, nil
}
return false, fmt.Errorf("error when deploying the flow visibility YML %s: %s, %s, %v", flowVisibilityYML, stdout, stderr, err)
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a problem for regular users as well (outside of test code)? If yes, is this documented and did we consider having 2 YAML manifests instead to reduce the likelihood of the issue?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's two steps with two yaml files in the user facing documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's one more argument in favor of deploying CH ou-of-band from the test cases.

ac := func(config *agentconfig.AgentConfig) {
config.FeatureGates["FlowExporter"] = false
// deployFlowVisibilityClickHouse deploys ClickHouse operator and DB.
func (data *TestData) deployFlowVisibilityClickHouse() (*PodIPs, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference would have been to refactor the test code so that for these tests we deploy Antrea, the FA, and ClickHouse "out-of-band" (so not as part of the test cases themselves). I think it tends to be a better and simpler approach. However, that doesn't exactly match how we have designed tests so far, so we can think about doing it as a follow-up.

test/e2e/framework.go Outdated Show resolved Hide resolved
test/e2e/framework.go Outdated Show resolved Hide resolved
return nil
}

func (data *TestData) mutateFlowAggregatorConfigMap(ipfixCollector string, faClusterIP string) error {
func (data *TestData) mutateFlowAggregatorConfigMap(ipfixCollector, clickHouse string) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these parameter names are too vague
maybe add the URL suffix to them

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 116 to 129
if [ "$MODE" == "e2e" ]; then
mkdir -p base/provisioning/datasources
cp $KUSTOMIZATION_DIR/base/clickhouse.yml base/clickhouse.yml
cp $KUSTOMIZATION_DIR/base/kustomization-e2e.yml base/kustomization.yml
cp $KUSTOMIZATION_DIR/base/kustomize-config.yml base/kustomize-config.yml
cp $KUSTOMIZATION_DIR/base/provisioning/datasources/create_table.sh base/provisioning/datasources/create_table.sh
cp $KUSTOMIZATION_DIR/../clickhouse-operator-install-bundle.yml clickhouse-operator-install-bundle.yml

$KUSTOMIZE edit add base base
$KUSTOMIZE edit add base clickhouse-operator-install-bundle.yml
$KUSTOMIZE edit add patch --path imagePullPolicyOperator.yml

$KUSTOMIZE edit set image flow-visibility-clickhouse-monitor=projects.registry.vmware.com/antrea/flow-visibility-clickhouse-monitor:latest
$KUSTOMIZE edit add patch --path imagePullPolicyClickhouse.yml --group clickhouse.altinity.com --version v1 --kind ClickHouseInstallation --name clickhouse
else
$KUSTOMIZE edit add base $BASE
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand why we need so much customization compared to dev mode. Maybe it indicates that something should be improved?

Have you given any thought to what this will be like when the visibility solution is moved to the theia repo? Maybe you intend to improve things at that time?

Copy link
Contributor Author

@wsquan171 wsquan171 Apr 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

essentially what we're doing here is:

  • re-create a kustomize folder to use kustomization-e2e.yml as base kustomization.yml, which only includes clickhouse db but not other grafana related manifest components
  • since we don't want to dup other shared files (patches / configs / db init script) in the flow-visibility dir, we're copying them from the base dir instead of building directly in its dedicated base dir. This makes it look like having a lot of customization, but actually half of the steps are just copying existing files.
  • also includes clickhouse-operator-install-bundle.yml in one yaml file for easier manifest copying and deploying with test code. For release yaml it's a separate file.

Once FV manifest is moved out of Antrea we'll have a simplified folder structure dedicated for e2e test. At that time all these copying is not necessary.

@wsquan171 wsquan171 force-pushed the fa-e2e branch 3 times, most recently from cdf8e03 to 5fbb762 Compare April 27, 2022 21:15
@wsquan171
Copy link
Contributor Author

/test-integration

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Someone from the Flow Visibility team should review this as well.

if $FLOW_EXPORTER; then
HELM_VALUES+=("featureGates.FlowExporter=true")
if [ "$MODE" == "dev" ]; then
HELM_VALUES+=("flowCollector.flowPollInterval=1s" "flowCollector.activeFlowExportTimeout=2s" "flowCollector.idleFlowExportTimeout=1s")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized that these values (for flowCollector.activeFlowExportTimeout & flowCollector.idleFlowExportTimeout) don't really make sense outside of tests, yet we are applying them every time we generate the manifest in 'dev' mode.

Maybe you could modify generate-manifest.sh so it can take as parameters extra Helm values, or the path to an Helm values file. Alternatively you could generate the manifest you need using Helm directly, with your own values file, and without using generate-manifest.sh at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a new param --extra-helm-values-file for generate-manifest.sh and defined these test-facing values in a separate file.


statefulSetRestartAnnotationKey = "antrea-e2e/restartedAt"

clickHousePort = "9000"
clickHouseOperatorYMLUrl = "https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.18.2/deploy/operator/clickhouse-operator-install-bundle.yaml"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we also have a version of this manifest checked-in in this repo, maybe it makes more sense to use our version for consistency?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

separated current fv deployment step into 2. first with the install bundle already checked-in in antrea, and then the clickhouse related definitions. we still need the error handling when applying ClickHouseInstallation CR though.

Comment on lines 745 to 751
_, err = data.PodWaitFor(defaultTimeout, podName, kubeNamespace, func(p *corev1.Pod) (bool, error) {
for _, condition := range p.Status.Conditions {
if condition.Type == corev1.PodReady {
return condition.Status == corev1.ConditionTrue, nil
}
}
return false, nil
})
if err != nil {
return nil, err
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can use a more concise style for those:

if _, err := data.PodWaitFor(defaultTimeout, podName, kubeNamespace, func(p *corev1.Pod) (bool, error) {
    ...
}); err != nil {
    return nil, err
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

introduced a helper function to wait for ready status as @dreamtalen 's comment below.

tb.Logf("Applying flow aggregator YAML with ipfix collector address: %s", ipfixCollectorAddr)
if err := testData.deployFlowAggregator(ipfixCollectorAddr); err != nil {
tb.Logf("Deploying ClickHouse")
chPodIPs, err := testData.deployFlowVisibilityClickHouse()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized we are using the Pod IP to configure the Flow Aggregator. Do you know why we don't use the default instead (the Service DNS)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously the dns name may have issue when tested together with other Antrea kind e2e test modes. reverted to the default svc dns name as the case is isolated now.

@@ -323,6 +335,9 @@ func exportLogs(tb testing.TB, data *TestData, logsSubDir string, writeNodeLogs
// dump the logs for flow-aggregator Pods to disk.
data.forAllMatchingPodsInNamespace("", flowAggregatorNamespace, writePodLogs)

// dump the logs for flow-visibility Pods to disk.
data.forAllMatchingPodsInNamespace("", flowVisibilityNamespace, writePodLogs)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we get the logs from the CH operator Pod as well, just in case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

test/e2e/framework.go Outdated Show resolved Hide resolved
test/e2e/framework.go Outdated Show resolved Hide resolved
hack/generate-manifest-flow-visibility.sh Show resolved Hide resolved
config.FeatureGates["FlowExporter"] = false
// deployFlowVisibilityClickHouse deploys ClickHouse operator and DB.
func (data *TestData) deployFlowVisibilityClickHouse() (*PodIPs, error) {
err := data.CreateNamespace(flowVisibilityNamespace, nil)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Flow Visibility NS creation is part of flowVisibilityYML, did you delete that intentionally in the e2e mode of generating manifest script?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ns is defined under grafana yml which is not included as base here. I think maybe we can keep it as-is and revist this after theia migration.

test/e2e/framework.go Outdated Show resolved Hide resolved
test/e2e/fixtures.go Outdated Show resolved Hide resolved
@wsquan171 wsquan171 force-pushed the fa-e2e branch 8 times, most recently from 9911dc0 to a76de68 Compare May 2, 2022 20:32
test/e2e/fixtures.go Outdated Show resolved Hide resolved
@wsquan171 wsquan171 force-pushed the fa-e2e branch 5 times, most recently from a7fefbd to ee93057 Compare May 3, 2022 21:00
@wsquan171
Copy link
Contributor Author

/test-conformance
/test-e2e
/test-networkpolicy

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some small comments, otherwise LGTM

}
if err := wait.Poll(defaultInterval, defaultTimeout, func() (bool, error) {
rc, stdout, stderr, err := testData.RunCommandOnNode(controlPlaneNodeName(),
fmt.Sprintf("curl -Ss %s:%s", chSvc.Spec.ClusterIP, clickHouseHTTPPort))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like assuming that curl is available on the Nodes, but the only other solution is to start a Pod just to curl...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

from kind node image dockerfile https://github.com/kubernetes-sigs/kind/blob/main/images/base/Dockerfile
we can see curl has been there for more than 3 years and actively being used as part of image build process. given its small size and common use case, i think the possibility of them suddenly remove it from the node image is quiet low. it's also used the same way in part of current antctl e2e test. given that fa tests are isolated now I think maybe we can always come back and and switch to the pod approach later.

rc, stdout, stderr, err := testData.RunCommandOnNode(controlPlaneNodeName(),
fmt.Sprintf("curl -Ss %s:%s", chSvc.Spec.ClusterIP, clickHouseHTTPPort))
if rc != 0 || err != nil {
log.Infof("Error curl clickhouse service: %s", strings.Trim(stderr, "\n"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.Infof("Error curl clickhouse service: %s", strings.Trim(stderr, "\n"))
log.Infof("Failed to curl clickhouse Service: %s", strings.Trim(stderr, "\n"))

do we use log.Info in other places?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. it's used in several other places under the same file. for example:
func killProcessAndCollectCovFiles: log.Infof("Sending SIGINT to '%s'", processName)
func DeleteNamespace: log.Infof("Deleting Namespace %s took %v", namespace, time.Since(startTime))

test/e2e/framework.go Outdated Show resolved Hide resolved
test/e2e/framework.go Outdated Show resolved Hide resolved
@@ -168,8 +168,8 @@ FLOW_VIS_YML="/tmp/flow-visibility.yml"
# If a flow collector address is also provided, we update the Antrea
# manifest (to enable all features)
if [[ $FLOW_COLLECTOR != "" ]]; then
echo "Generating manifest with all features enabled along with FlowExporter feature"
$THIS_DIR/../../../../hack/generate-manifest.sh --mode dev --all-features > "${ANTREA_YML}"
echo "Generating manifest with flow exporter enabled"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FlowExporter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

This change introduces the following changes to fa e2e test:
- only run fa e2e test if asserted
- setup antrea agent with proper config as part of manifest generation
- update github workflow for kind tests

Signed-off-by: Shawn Wang <[email protected]>
- deploy ch operator in a separate step from in-repo yml bundle
- define antrea side exporter configs from helm values file
- extract ch operator logs as well in case of failure
- update ch related image with flow-visibility prefix
- test ping ch svc for success connectivity before apply fa

Signed-off-by: Shawn Wang <[email protected]>
@wsquan171
Copy link
Contributor Author

/test-conformance
/test-e2e
/test-networkpolicy

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas antoninbas merged commit 8f451f7 into antrea-io:main May 4, 2022
wsquan171 added a commit to wsquan171/theia that referenced this pull request May 10, 2022
This change ports various improvments for the current flow visibility
e2e test in antrea main repo:
antrea-io/antrea#3673

Other minor changes:
- removed coverage related code and option from the test
- changed busybox image to the correct one

Signed-off-by: Shawn Wang <[email protected]>
wsquan171 added a commit to wsquan171/theia that referenced this pull request May 10, 2022
This change ports various improvments for the current flow visibility
e2e test in antrea main repo:
antrea-io/antrea#3673

Other minor changes:
- removed coverage related code and option from the test
- changed busybox image to the correct one

Signed-off-by: Shawn Wang <[email protected]>
wsquan171 added a commit to wsquan171/theia that referenced this pull request May 10, 2022
This change ports various improvments for the current flow visibility
e2e test in antrea main repo:
antrea-io/antrea#3673

Other minor changes:
- removed coverage related code and option from the test
- changed busybox image to the correct one

Signed-off-by: Shawn Wang <[email protected]>
wsquan171 added a commit to wsquan171/theia that referenced this pull request May 10, 2022
This change ports various improvments for the current flow visibility
e2e test in antrea main repo:
antrea-io/antrea#3673

Other minor changes:
- removed coverage related code and option from the test
- changed busybox image to the correct one

Signed-off-by: Shawn Wang <[email protected]>
wsquan171 added a commit to wsquan171/theia that referenced this pull request May 10, 2022
This change ports various improvments for the current flow visibility
e2e test in antrea main repo:
antrea-io/antrea#3673

Other minor changes:
- removed coverage related code and option from the test
- changed busybox image to the correct one

Signed-off-by: Shawn Wang <[email protected]>
wsquan171 added a commit to wsquan171/theia that referenced this pull request May 10, 2022
This change ports various improvments for the current flow visibility
e2e test in antrea main repo:
antrea-io/antrea#3673

Other minor changes:
- removed coverage related code and option from the test
- changed busybox image to the correct one

Signed-off-by: Shawn Wang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants