-
Notifications
You must be signed in to change notification settings - Fork 370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flow Visibility] Add CI validation job for flow visibility depolyment files #3312
Conversation
Codecov Report
@@ Coverage Diff @@
## main #3312 +/- ##
==========================================
- Coverage 65.46% 58.18% -7.28%
==========================================
Files 278 278
Lines 27771 27771
==========================================
- Hits 18179 16158 -2021
- Misses 7666 9826 +2160
+ Partials 1926 1787 -139
Flags with carried forward coverage won't be shown. Click here to find out more.
|
784acbb
to
2274350
Compare
2274350
to
d5b9730
Compare
Hi @heanlan and @yanjunz97, may I have an internal code review from you about this PR? I haven't done the rebase since the two dependent PRs are actively addressing comments. Please look at the last commit only. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @dreamtalen , could you please add one more check for the installation of Grafana plugins? Basically, when the Grafana Pod is ready, check the log like below:
## by grepping antreaflowvisibility-grafana-sankey-plugin, we need to check two logs, one showing it has been downloaded, the other showing it has been registered
➜ kubectl logs grafana-5c6c5b74f7-6k827 -n flow-visibility | grep 'antreaflowvisibility-grafana-sankey-plugin'
✔ Downloaded antreaflowvisibility-grafana-sankey-plugin v1.0.0 zip successfully
t=2022-02-25T22:56:28+0000 lvl=info msg="Plugin registered" logger=plugin.manager pluginId=antreaflowvisibility-grafana-sankey-plugin
## by grepping grafana-clickhouse-datasource, we need to check one log, showing it has been registered
➜ kubectl logs grafana-5c6c5b74f7-6k827 -n flow-visibility | grep 'grafana-clickhouse-datasource'
t=2022-02-25T22:56:28+0000 lvl=info msg="Plugin registered" logger=plugin.manager pluginId=grafana-clickhouse-datasource
This will also help us verify the S3 url of the sankey-plugin is valid. Thanks
Shall we also check the clickhouse monitor job status in this CI validation job? If this PR also integrates PR #3244, maybe we can add check for clickhouse monitor job by
As the cronjob starts the job every minute, the check may need to wait 60s for the first job to start. |
d5b9730
to
91c59ae
Compare
Thanks, added. |
Thanks, added. |
91c59ae
to
1ad64b7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rebase with main branch and cherry-pick latest code of other two PRs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
1ad64b7
to
1e7d154
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some nits inline, no major item to address.
I think the PR can be approved, just need a clarification on the test workflow
ci/test-flow-visibility.sh
Outdated
setup_flow_visibility() { | ||
echo "=== Starting Flow Visibility ===" | ||
# install ClickHouse operator | ||
kubectl apply -f https://raw.githubusercontent.com/Altinity/clickhouse-operator/master/deploy/operator/clickhouse-operator-install-bundle.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's a good idea to test master branch. This will allow us to catch early if any change in the clickhouse operator might break Antrea flow visibility.
In case we need to test a specific clickhouse version, is there a way to parameterize this script in a way that it could pull a specific operator version?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, we could add a parameter to test a specific version of the Clickhouse operator. Just wondering when we would like to test a specific version? Since in the flow-visibility doc, we will always pull the yaml file from master branch: https://github.com/antrea-io/antrea/blob/main/docs/network-flow-visibility.md#deployment-steps-1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update:
Flow Aggregator e2e test with Clickhouse has been removed in the latest commits from #3196. So there may be issues testing the yml from the master branch here, because we may not have the latest version of altinity/clickhouse-operator
, altinity/metrics-exporter
images on Jenkins nodes. Similar problems apply to other external images:grafana/grafana
and yandex/clickhouse-server
too.
I'm considering updating our clickhouse-operator-install-bundle.yaml
to a specific version in doc, changing the external images in flow-visilbility.yml
using a specific version inside Antrea docker hub, and adding a workflow to push all these external images to Antrea docker hub.
May I also hear opinions from maintainers @antoninbas @jianjuns @tnqn regarding this? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have any other concern besides image availability? If so, can we address it in a different way?
Regarding your plan, it seems you would like to have an "antrea" version of the clickhouse operator install bundle, and also mirror all the external images we leverage in the antrea dockerhub. I wonder if this last step is really necessary or whether instead we could just use specific tags for those images without pushing them to Antrea's docker hub space.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@antoninbas what's your feedback on @dreamtalen's plan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Finished switching to Antrea docker hub for external images. @salv-orlando
Involve Shawn @wsquan171 to review it too since he did similar changes in the Flow Aggregator e2e test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'm ok with the approach, which is consistent with what we have done for other images. Maybe all of these mirrored images could be prefixed with flow-visibility-
though, so that they are "grouped" together and to be consistent with the flow-visibility-clickhouse-monitor
image. The metrics-exporter
image in particular has a very generic name right now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Antonin for the reviews! Sounds good to me, added flow-visibility-
prefix to all related images.
ci/test-flow-visibility.sh
Outdated
|
||
check_clickhouse_monitor() { | ||
echo "=== Check the status of ClickHouse monitor cron job status===" | ||
sleep 60 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanjunz97 - do you think we can set a 120sec timeout at line 102 instead of adding this sleep? I'm thinking maybe it's not worth always waiting 60 seconds if the job actually already completed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cronjob does not launch a job immediately when it is deployed. For example, if the Cronjob is deployed at 12:00:01, it will wait till 12:01:00 to launch its first job. This is why we need to wait for 60s for the first job to be launched. Otherwise kubectl get job -l app=clickhouse-monitor -n flow-visibility
may not find the job.
I have just opened PR #3498 to run Clickhouse monitor in a long-running pod. If this one is merged, we can get rid of the waiting, and update the check as the other components with something like
kubectl wait --for=condition=ready pod -l app=clickhouse-monitor -n flow-visibility --timeout=600s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yanjunz97 so the problem is that maybe we will not have a job at all for which we should verify completion if we don't wait those initial 60 seconds? I'll take a look at the new PR as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. It is possible we do not have a job running in the first 60 second.
0191718
to
054f5f9
Compare
ci/jenkins/test-vmc.sh
Outdated
echo "====== Pulling and delivering images in flow-visbility.yml to all the Nodes======" | ||
FV_IMAGES=($(cat $GIT_CHECKOUT_DIR/build/yamls/flow-visibility.yml | grep image: | awk '{print $(NF)}' | xargs)) | ||
IPs=($(kubectl get nodes -o wide --no-headers=true | awk '{print $6}' | xargs)) | ||
old_IFS=$IFS | ||
for image in "${FV_IMAGES[@]}"; do | ||
docker pull $image | ||
IFS=: | ||
read repo tag <<< "$image" | ||
IFS=$old_IFS | ||
echo "Saving" $repo:$tag | ||
docker save -o flow-visibility.tar $repo:$tag | ||
for i in "${!IPs[@]}" | ||
do | ||
copy_image flow-visibility.tar $repo ${IPs[$i]} $tag false | ||
done | ||
done |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this needed? can't we let each Node pull the image here, since we don't need to retag it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for pointing it out! Yes, all changes in test-vmc.sh
are no longer needed right now since we have all images inside the VMware harbor and don't need to download&modifty clickhouse-operator-install manifest.
ci/jenkins/test-vmc.sh
Outdated
done | ||
|
||
echo "====== Pulling and delivering images in clickhouse-operator-install-bundle.yaml to all the Nodes======" | ||
wget https://raw.githubusercontent.com/Altinity/clickhouse-operator/0.18.2/deploy/operator/clickhouse-operator-install-bundle.yaml -O $GIT_CHECKOUT_DIR/build/yamls/clickhouse-operator-install-bundle.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we could consider having our own version of the manifest checked-in, to avoid this...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, checked in our own version clickhouse-operator-install manifest with our internal image names. It also benefits the CI job no need to download the manifest.
a4c879b
to
ced8c35
Compare
- name: Push antrea/flow-visibility-grafana | ||
uses: akhilerm/[email protected] | ||
with: | ||
src: docker.io/grafana/grafana:${{ github.event.inputs.ch-server-tag }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"ch-server-tag" -> "grafana-tag"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! Thanks, Anlan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for separating my comments. just realized we might also want to add flow-visibility.yml here: https://github.com/antrea-io/antrea/blob/main/ci/check-manifest.sh
could you help add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will add it soon.
ada7bd9
to
3657392
Compare
@dreamtalen I will review again after #3196 is merged and you rebase |
Hi @dreamtalen I just have my PR #3498 merged. As the monitor is implemented as a container in the ClickHouse pod, we may not need extra check for the monitor. |
3657392
to
9d7c0a4
Compare
Thanks Yanjun, removed the clickhouse-monitor check. |
f226f97
to
9a3d585
Compare
Hi @antoninbas, I finished rebasing and tested the CI job manually on a VM provided by Zhengsheng. Please take a look when you're available. |
In this commit, we add a CI validation job for Flow Visibility. It will validate the deployment of Flow Exporter, Flow Aggregator, Clickhouse Server, and Grafana UI dashboard. It will execute daily on Jenkins like what we have before for the ELK Flow Collector setup. Signed-off-by: Yongming Ding <[email protected]>
9a3d585
to
e277d77
Compare
Rebase again for PR #3521 |
|
||
setup_flow_aggregator() { | ||
echo "=== Config Antrea Flow Aggregator ===" | ||
perl -i -p0e 's/ # Enable is the switch to enable exporting flow records to ClickHouse.\n #enable: false/ # Enable is the switch to enable exporting flow records to ClickHouse.\n enable: true/' ./build/yamls/flow-aggregator.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure why we need to use perl here instead of sed
also is this not possible by using hack/generate-manifest-flow-aggregator.sh
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since there are two lines of #enable: false
in flow-aggregator.yml
right now for FlowCollector and Clickhouse respectively, using perl
seems better and easier to do this multi-line string replacement than sed.
hack/generate-manifest-flow-aggregator.sh
is a good point, looks like it has not been updated with our new config in Flow Aggregator https://github.com/antrea-io/antrea/blob/main/hack/generate-manifest-flow-aggregator.sh#L141. Do you want me to update that .sh in this PR or open another one after?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can look into this in a follow up PR, let me approve this one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you, Antonin! I will trigger an update for Flow Visibility docker images using Github workflow after this PR is merged.
/test-all |
/test-integration |
All required tests have passed. This PR is ready for merging. |
In this commit, we add a CI validation job for Flow Visibility. It will
validate the deployment of Flow Exporter, Flow Aggregator, Clickhouse
Server, and Grafana UI dashboard. It will execute daily on Jenkins like
what we have before for the ELK Flow Collector setup.