Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Flow Visibility] Add CI validation job for flow visibility depolyment files #3312

Merged
merged 1 commit into from
Mar 25, 2022

Conversation

dreamtalen
Copy link
Contributor

@dreamtalen dreamtalen commented Feb 11, 2022

In this commit, we add a CI validation job for Flow Visibility. It will
validate the deployment of Flow Exporter, Flow Aggregator, Clickhouse
Server, and Grafana UI dashboard. It will execute daily on Jenkins like
what we have before for the ELK Flow Collector setup.

@codecov-commenter
Copy link

codecov-commenter commented Feb 11, 2022

Codecov Report

Merging #3312 (e277d77) into main (a539b78) will decrease coverage by 7.27%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3312      +/-   ##
==========================================
- Coverage   65.46%   58.18%   -7.28%     
==========================================
  Files         278      278              
  Lines       27771    27771              
==========================================
- Hits        18179    16158    -2021     
- Misses       7666     9826    +2160     
+ Partials     1926     1787     -139     
Flag Coverage Δ
kind-e2e-tests 39.63% <ø> (-15.56%) ⬇️
unit-tests 43.39% <ø> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pkg/apis/controlplane/types.go 0.00% <0.00%> (-100.00%) ⬇️
pkg/agent/controller/networkpolicy/reject.go 0.00% <0.00%> (-87.91%) ⬇️
...g/agent/apiserver/handlers/featuregates/handler.go 0.00% <0.00%> (-82.36%) ⬇️
pkg/controller/networkpolicy/mutate.go 0.00% <0.00%> (-73.57%) ⬇️
pkg/controller/networkpolicy/store/group.go 5.00% <0.00%> (-70.00%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 0.00% <0.00%> (-68.37%) ⬇️
pkg/apis/controlplane/helper.go 33.33% <0.00%> (-66.67%) ⬇️
pkg/apis/controlplane/v1beta2/helper.go 33.33% <0.00%> (-66.67%) ⬇️
pkg/controller/networkpolicy/convert.go 0.00% <0.00%> (-64.52%) ⬇️
pkg/controller/egress/store/egressgroup.go 1.38% <0.00%> (-58.34%) ⬇️
... and 87 more

@dreamtalen
Copy link
Contributor Author

Hi @heanlan and @yanjunz97, may I have an internal code review from you about this PR? I haven't done the rebase since the two dependent PRs are actively addressing comments. Please look at the last commit only. Thanks!

Copy link
Contributor

@heanlan heanlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @dreamtalen , could you please add one more check for the installation of Grafana plugins? Basically, when the Grafana Pod is ready, check the log like below:

## by grepping antreaflowvisibility-grafana-sankey-plugin, we need to check two logs, one showing it has been downloaded, the other showing it has been registered
➜  kubectl logs grafana-5c6c5b74f7-6k827 -n flow-visibility | grep 'antreaflowvisibility-grafana-sankey-plugin' 
✔ Downloaded antreaflowvisibility-grafana-sankey-plugin v1.0.0 zip successfully
t=2022-02-25T22:56:28+0000 lvl=info msg="Plugin registered" logger=plugin.manager pluginId=antreaflowvisibility-grafana-sankey-plugin

## by grepping grafana-clickhouse-datasource, we need to check one log, showing it has been registered
➜  kubectl logs grafana-5c6c5b74f7-6k827 -n flow-visibility | grep 'grafana-clickhouse-datasource'
t=2022-02-25T22:56:28+0000 lvl=info msg="Plugin registered" logger=plugin.manager pluginId=grafana-clickhouse-datasource

This will also help us verify the S3 url of the sankey-plugin is valid. Thanks

@yanjunz97
Copy link
Contributor

Shall we also check the clickhouse monitor job status in this CI validation job? If this PR also integrates PR #3244, maybe we can add check for clickhouse monitor job by

CLICKHOUSE_MONITOR=$(kubectl get job -l app=clickhouse-monitor -n flow-visibility -o jsonpath="{.items[-1].metadata.name}")
kubectl wait --for=condition=complete job/$CLICKHOUSE_MONITOR -n flow-visibility --timeout=60s

As the cronjob starts the job every minute, the check may need to wait 60s for the first job to start.

@dreamtalen
Copy link
Contributor Author

Hi @dreamtalen , could you please add one more check for the installation of Grafana plugins? Basically, when the Grafana Pod is ready, check the log like below:

## by grepping antreaflowvisibility-grafana-sankey-plugin, we need to check two logs, one showing it has been downloaded, the other showing it has been registered
➜  kubectl logs grafana-5c6c5b74f7-6k827 -n flow-visibility | grep 'antreaflowvisibility-grafana-sankey-plugin' 
✔ Downloaded antreaflowvisibility-grafana-sankey-plugin v1.0.0 zip successfully
t=2022-02-25T22:56:28+0000 lvl=info msg="Plugin registered" logger=plugin.manager pluginId=antreaflowvisibility-grafana-sankey-plugin

## by grepping grafana-clickhouse-datasource, we need to check one log, showing it has been registered
➜  kubectl logs grafana-5c6c5b74f7-6k827 -n flow-visibility | grep 'grafana-clickhouse-datasource'
t=2022-02-25T22:56:28+0000 lvl=info msg="Plugin registered" logger=plugin.manager pluginId=grafana-clickhouse-datasource

This will also help us verify the S3 url of the sankey-plugin is valid. Thanks

Thanks, added.

@dreamtalen
Copy link
Contributor Author

Shall we also check the clickhouse monitor job status in this CI validation job? If this PR also integrates PR #3244, maybe we can add check for clickhouse monitor job by

CLICKHOUSE_MONITOR=$(kubectl get job -l app=clickhouse-monitor -n flow-visibility -o jsonpath="{.items[-1].metadata.name}")
kubectl wait --for=condition=complete job/$CLICKHOUSE_MONITOR -n flow-visibility --timeout=60s

As the cronjob starts the job every minute, the check may need to wait 60s for the first job to start.

Thanks, added.

Copy link
Contributor Author

@dreamtalen dreamtalen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rebase with main branch and cherry-pick latest code of other two PRs

Copy link
Contributor

@heanlan heanlan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

ci/test-flow-visibility.sh Outdated Show resolved Hide resolved
ci/test-flow-visibility.sh Outdated Show resolved Hide resolved
heanlan
heanlan previously approved these changes Mar 21, 2022
Copy link
Contributor

@salv-orlando salv-orlando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some nits inline, no major item to address.

I think the PR can be approved, just need a clarification on the test workflow

ci/test-flow-visibility.sh Outdated Show resolved Hide resolved
setup_flow_visibility() {
echo "=== Starting Flow Visibility ==="
# install ClickHouse operator
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good idea to test master branch. This will allow us to catch early if any change in the clickhouse operator might break Antrea flow visibility.

In case we need to test a specific clickhouse version, is there a way to parameterize this script in a way that it could pull a specific operator version?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, we could add a parameter to test a specific version of the Clickhouse operator. Just wondering when we would like to test a specific version? Since in the flow-visibility doc, we will always pull the yaml file from master branch: https://github.com/antrea-io/antrea/blob/main/docs/network-flow-visibility.md#deployment-steps-1

Copy link
Contributor Author

@dreamtalen dreamtalen Mar 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update:
Flow Aggregator e2e test with Clickhouse has been removed in the latest commits from #3196. So there may be issues testing the yml from the master branch here, because we may not have the latest version of altinity/clickhouse-operator, altinity/metrics-exporter images on Jenkins nodes. Similar problems apply to other external images:grafana/grafana and yandex/clickhouse-server too.
I'm considering updating our clickhouse-operator-install-bundle.yaml to a specific version in doc, changing the external images in flow-visilbility.yml using a specific version inside Antrea docker hub, and adding a workflow to push all these external images to Antrea docker hub.
May I also hear opinions from maintainers @antoninbas @jianjuns @tnqn regarding this? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have any other concern besides image availability? If so, can we address it in a different way?

Regarding your plan, it seems you would like to have an "antrea" version of the clickhouse operator install bundle, and also mirror all the external images we leverage in the antrea dockerhub. I wonder if this last step is really necessary or whether instead we could just use specific tags for those images without pushing them to Antrea's docker hub space.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antoninbas what's your feedback on @dreamtalen's plan?

Copy link
Contributor Author

@dreamtalen dreamtalen Mar 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished switching to Antrea docker hub for external images. @salv-orlando
Involve Shawn @wsquan171 to review it too since he did similar changes in the Flow Aggregator e2e test.

Copy link
Contributor

@antoninbas antoninbas Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm ok with the approach, which is consistent with what we have done for other images. Maybe all of these mirrored images could be prefixed with flow-visibility- though, so that they are "grouped" together and to be consistent with the flow-visibility-clickhouse-monitor image. The metrics-exporter image in particular has a very generic name right now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Antonin for the reviews! Sounds good to me, added flow-visibility- prefix to all related images.


check_clickhouse_monitor() {
echo "=== Check the status of ClickHouse monitor cron job status==="
sleep 60
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanjunz97 - do you think we can set a 120sec timeout at line 102 instead of adding this sleep? I'm thinking maybe it's not worth always waiting 60 seconds if the job actually already completed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cronjob does not launch a job immediately when it is deployed. For example, if the Cronjob is deployed at 12:00:01, it will wait till 12:01:00 to launch its first job. This is why we need to wait for 60s for the first job to be launched. Otherwise kubectl get job -l app=clickhouse-monitor -n flow-visibility may not find the job.

I have just opened PR #3498 to run Clickhouse monitor in a long-running pod. If this one is merged, we can get rid of the waiting, and update the check as the other components with something like
kubectl wait --for=condition=ready pod -l app=clickhouse-monitor -n flow-visibility --timeout=600s

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yanjunz97 so the problem is that maybe we will not have a job at all for which we should verify completion if we don't wait those initial 60 seconds? I'll take a look at the new PR as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It is possible we do not have a job running in the first 60 second.

ci/test-flow-visibility.sh Show resolved Hide resolved
Comment on lines 477 to 492
echo "====== Pulling and delivering images in flow-visbility.yml to all the Nodes======"
FV_IMAGES=($(cat $GIT_CHECKOUT_DIR/build/yamls/flow-visibility.yml | grep image: | awk '{print $(NF)}' | xargs))
IPs=($(kubectl get nodes -o wide --no-headers=true | awk '{print $6}' | xargs))
old_IFS=$IFS
for image in "${FV_IMAGES[@]}"; do
docker pull $image
IFS=:
read repo tag <<< "$image"
IFS=$old_IFS
echo "Saving" $repo:$tag
docker save -o flow-visibility.tar $repo:$tag
for i in "${!IPs[@]}"
do
copy_image flow-visibility.tar $repo ${IPs[$i]} $tag false
done
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this needed? can't we let each Node pull the image here, since we don't need to retag it?

Copy link
Contributor Author

@dreamtalen dreamtalen Mar 24, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing it out! Yes, all changes in test-vmc.sh are no longer needed right now since we have all images inside the VMware harbor and don't need to download&modifty clickhouse-operator-install manifest.

done

echo "====== Pulling and delivering images in clickhouse-operator-install-bundle.yaml to all the Nodes======"
wget https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.18.2/deploy/operator/clickhouse-operator-install-bundle.yaml -O $GIT_CHECKOUT_DIR/build/yamls/clickhouse-operator-install-bundle.yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could consider having our own version of the manifest checked-in, to avoid this...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, checked in our own version clickhouse-operator-install manifest with our internal image names. It also benefits the CI job no need to download the manifest.

@dreamtalen dreamtalen force-pushed the flow-visibility-ci branch 3 times, most recently from a4c879b to ced8c35 Compare March 24, 2022 05:25
- name: Push antrea/flow-visibility-grafana
uses: akhilerm/[email protected]
with:
src: docker.io/grafana/grafana:${{ github.event.inputs.ch-server-tag }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"ch-server-tag" -> "grafana-tag"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! Thanks, Anlan.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for separating my comments. just realized we might also want to add flow-visibility.yml here: https://github.com/antrea-io/antrea/blob/main/ci/check-manifest.sh
could you help add it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will add it soon.

@dreamtalen dreamtalen force-pushed the flow-visibility-ci branch 2 times, most recently from ada7bd9 to 3657392 Compare March 24, 2022 21:18
@antoninbas
Copy link
Contributor

@dreamtalen I will review again after #3196 is merged and you rebase

@yanjunz97
Copy link
Contributor

yanjunz97 commented Mar 25, 2022

Hi @dreamtalen I just have my PR #3498 merged. As the monitor is implemented as a container in the ClickHouse pod, we may not need extra check for the monitor.

@dreamtalen
Copy link
Contributor Author

Hi @dreamtalen I just have my PR #3498 merged. As the monitor is implemented as a container in the ClickHouse pod, we may not need extra check for the monitor.

Thanks Yanjun, removed the clickhouse-monitor check.

@dreamtalen dreamtalen force-pushed the flow-visibility-ci branch 2 times, most recently from f226f97 to 9a3d585 Compare March 25, 2022 04:53
@dreamtalen
Copy link
Contributor Author

@dreamtalen I will review again after #3196 is merged and you rebase

Hi @antoninbas, I finished rebasing and tested the CI job manually on a VM provided by Zhengsheng. Please take a look when you're available.
Changes in build/yamls/flow-aggregator.yml are not related to this PR but are used to fix a regression failure found during testing. Thanks @wsquan171 help to debug it.

In this commit, we add a CI validation job for Flow Visibility. It will
validate the deployment of Flow Exporter, Flow Aggregator, Clickhouse
Server, and Grafana UI dashboard. It will execute daily on Jenkins like
what we have before for the ELK Flow Collector setup.

Signed-off-by: Yongming Ding <[email protected]>
@dreamtalen
Copy link
Contributor Author

Rebase again for PR #3521


setup_flow_aggregator() {
echo "=== Config Antrea Flow Aggregator ==="
perl -i -p0e 's/ # Enable is the switch to enable exporting flow records to ClickHouse.\n #enable: false/ # Enable is the switch to enable exporting flow records to ClickHouse.\n enable: true/' ./build/yamls/flow-aggregator.yml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why we need to use perl here instead of sed
also is this not possible by using hack/generate-manifest-flow-aggregator.sh?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there are two lines of #enable: false in flow-aggregator.yml right now for FlowCollector and Clickhouse respectively, using perl seems better and easier to do this multi-line string replacement than sed.
hack/generate-manifest-flow-aggregator.sh is a good point, looks like it has not been updated with our new config in Flow Aggregator https://github.com/antrea-io/antrea/blob/main/hack/generate-manifest-flow-aggregator.sh#L141. Do you want me to update that .sh in this PR or open another one after?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can look into this in a follow up PR, let me approve this one

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, Antonin! I will trigger an update for Flow Visibility docker images using Github workflow after this PR is merged.

@dreamtalen
Copy link
Contributor Author

/test-all

@dreamtalen
Copy link
Contributor Author

/test-integration

@dreamtalen
Copy link
Contributor Author

dreamtalen commented Mar 25, 2022

All required tests have passed. This PR is ready for merging.

@antoninbas antoninbas merged commit c2a664b into antrea-io:main Mar 25, 2022
@dreamtalen dreamtalen mentioned this pull request Mar 28, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants