Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Frequent failures of helm chart tests #24037

Closed
2 tasks done
potiuk opened this issue May 31, 2022 · 10 comments · Fixed by #24395
Closed
2 tasks done

Frequent failures of helm chart tests #24037

potiuk opened this issue May 31, 2022 · 10 comments · Fixed by #24395
Labels
area:core kind:bug This is a clearly a bug

Comments

@potiuk
Copy link
Member

potiuk commented May 31, 2022

Apache Airflow version

main (development)

What happened

We keep on getting very frequent failures of Helm Chart tests and seems that a big number of those errors are because of errors when pulling charts from bitnami for postgres:

Example here (but I saw it happening very often recently):

https://github.com/apache/airflow/runs/6666449965?check_suite_focus=true#step:9:314

  Save error occurred:  could not find : chart postgresql not found in https://charts.bitnami.com/bitnami: looks like "https://charts.bitnami.com/bitnami" is not a valid chart repository or cannot be reached: stream error: stream ID 1; INTERNAL_ERROR
  Deleting newly downloaded charts, restoring pre-update state
  Error: could not find : chart postgresql not found in https://charts.bitnami.com/bitnami: looks like "https://charts.bitnami.com/bitnami" is not a valid chart repository or cannot be reached: stream error: stream ID 1; INTERNAL_ERROR
  Dumping logs from KinD

It is not only a problem for our CI but it might be similar problem for our users who want to install the chart - they might also get the same kinds of error.

I guess we should either make it more resilient to intermittent problems with bitnami charts or use another chart (or maybe even host the chart ourselves somewhere within apache infrastructure. While the postgres chart is not really needed for most "production" users, it is still a dependency of our chart and it makes our chart depend on external and apparently flaky service.

What you think should happen instead

We should find (or host ourselves) more stable dependency or get rid of it.

How to reproduce

Look at some recent CI builds and see that they often fail in K8S tests and more often than not the reason is missing postgresql chart.

Operating System

any

Versions of Apache Airflow Providers

not relevant

Deployment

Other

Deployment details

CI

Anything else

Happy to make the change once we agree what's the best way :).

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@potiuk potiuk added kind:bug This is a clearly a bug area:core labels May 31, 2022
@potiuk
Copy link
Member Author

potiuk commented May 31, 2022

cc: @kaxil @jedcunningham @ephraimbuddy - I think it would be great to find a better way. We do not really need that postgres chart to be frequently updated, so hosting it ourselves (say in a github repo) is I think the best option.

WDYT ?

@jedcunningham
Copy link
Member

We can also vendor it in chart/charts. Have we tried to reach out to the bitnami folks to see if they are aware of the issues and working to resolve them though? That's the easiest for us, at least.

@potiuk
Copy link
Member Author

potiuk commented Jun 1, 2022

Surely - we can ask, but they are part of vmware so I have no high expectations their free offering will be looked at carefully. But I definitely will.

However I think we should seriously consider moving it to Github (or vendoring in) regardless of their answers. What do we do if they answer "we fixed it" ? It happens intermittently so we are not even able to check if they did.

I really do not like situations where WE have to carry the burden of CI errors (and make our contributors unhappy) when someone else screwed up.

I am not sure if this is a serious effort - this is one-time effort really which needs zero maintenance (maybe upgrade from time to time). Comparing with multiple unforeseen errors that are yet another reson for our users to learn that "red" is normal and reaching out for help. I think the key to keep CI "nice" for the users is to eliminate relentlessly any reasons for potential problems - leaving only those that are "real" problems as false negatives undermine your trust in it and the more it happens the more you reach out to maintainers "the tests failed, please help" - we should keep the possibility down to minimum. It can never be eliminated entirely but if we do not have control over something we cannot improve it.

I've heard very similar concerns when I moved the images used during our integration tests.
And this worked beautifully after I did this for all the images we use in our integration tests in ghcr.io :

Do you remember the last time we had a problem with this image: https://github.com/orgs/apache/packages?tab=packages&q=airflow-openldap ?

Me neither.

And we had plenty of similar problems with those images from time time to time when they were pulled from dockerhub. It's literaly 0 overhead - I pushed it once 11 months ago and did not touch it since. I think the one time effort of moving stuff to GitHub (either as separate repo or vendor-in) is certainly worth it over the long period of time.

The nice thing of keeping everything in GitHub is that when it fails, it generally fails in the way that many things do not work (including actions or CI in general). So when something breaks in GitHub people generally are aware of it (and more often than not they simply cannot even submit or run their CI jobs).

And if we see intermittent errors like that we have enterprise level support with them via Apache agreement - they either commented with workarounds or fixed many isues I raised to them already.

@potiuk
Copy link
Member Author

potiuk commented Jun 1, 2022

Someone opened an issue to bitnami charts about it iliterally 15 minutes ago. Commented it there: bitnami/charts#10535 (comment)

Any upvotes, cheering comments are most welcome

@potiuk
Copy link
Member Author

potiuk commented Jun 1, 2022

Speaking of "supporting free use" - they just closed the issue and redirected it to issue they are "working on" bitnami/charts#8433 which has been open 16 Dec 2021.

@jedcunningham - do you still think it's the "easiest route" :P ?

@potiuk
Copy link
Member Author

potiuk commented Jun 1, 2022

BTW. Rough looking at the issue about half of the ~50 comments there is "we have the same issue - here are the details" and the other half (exaggerating a bit for the effect of course) is "Thanks, we are working on it" from bitnami support - some of those "we are working on it" are from February.

@jedcunningham
Copy link
Member

I've opened #24089 to point at github for the index instead. This will stop the bleeding at least and give us a little time to see if they resolve it (sounds like it is now causing them pain, so I'm hopeful) or if we need to do something else.

@potiuk
Copy link
Member Author

potiuk commented Jun 2, 2022

So taking into account what happened here tonight bitnami/charts#10539 - I believe vendoring-in the chart seems to be the best idea.

I REALLY don't like when someone's arbitrary decision can break all our released versions of charts.

@jedcunningham
Copy link
Member

It doesn't break our released versions thankfully (helm vendors it in the tarball already), but it does hose our CI and main for sure. +1 to vendoring it. Should we update to the newest version while we are at it?

@potiuk
Copy link
Member Author

potiuk commented Jun 2, 2022

Yeah I think so. Happy to do it to learn a bit more. That's great it's vendored in in the tarball.

potiuk added a commit to potiuk/airflow that referenced this issue Jun 11, 2022
After the Bitnami fiasco
bitnami/charts#10539

We lost trust in bitnami index being good and reliable source
of charts. That's why we vendored-in the postgres chart needed for
our Helm chart.

Fixes: apache#24037
potiuk added a commit that referenced this issue Jun 14, 2022
After the Bitnami fiasco
bitnami/charts#10539

We lost trust in bitnami index being good and reliable source
of charts. That's why we vendored-in the postgres chart needed for
our Helm chart.

Fixes: #24037
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core kind:bug This is a clearly a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants