Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add deferrable mode to MLEngineStartTrainingJobOperator #27405

Merged
merged 1 commit into from
Jan 25, 2023

Conversation

VladaZakharova
Copy link
Contributor


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@potiuk
Copy link
Member

potiuk commented Nov 11, 2022

conflicts :(

@VladaZakharova VladaZakharova changed the title Add deferrable mode to TestMLEngineStartTrainingJobOperator Add deferrable mode to MLEngineStartTrainingJobOperator Nov 14, 2022
@VladaZakharova
Copy link
Contributor Author

Hello @potiuk
This is not a problem of tests themselves, i suppose, but the problem of new package that i have added to providers.yaml file (gcloud-aio-auth): https://github.com/apache/airflow/actions/runs/3523584902/jobs/5908291978#step:10:1997
It seems that the version of currently used pyarrow is not compatible with version of pyarrow that is used in this new package.
So the solution here is may be to upgrade version of pyarrow in gcloud-aio-auth, but i am not sure how to do it correctly.

@potiuk
Copy link
Member

potiuk commented Dec 4, 2022

For : https://github.com/apache/airflow/actions/runs/3523584902/jobs/5908291978 - the steps to follow are described in the instructions after the error is displayed. Just follow. The problem is really about new "warning" generated and you just need to add the warning as "known" warning to the list of known warnings.

The other failures are - I think - just intermitteent errors - this seems to be a problem when you run the whole suite of tests as a non-commiter, sometimes (and we have not figured out when) the tests are failing intermittently in groups - but when we re-run them they work. those are flaky tests that we need to fix eventually.

Please rebase and fix the warning and if other tests fail next time, ping me and I will re-run them.

@VladaZakharova
Copy link
Contributor Author

@potiuk
Thank you for having a look and helping with this problem!
Unfortunately, fix with adding this warning to "known" category didn't help and some of the tests are still failing. Can you please restart them once again?
Thanks!

@potiuk
Copy link
Member

potiuk commented Dec 5, 2022

Yeah. I will take a look at those failing "public runners" tests shortly (cc: @Taragolis - unless you want to take a look following my description).

@potiuk
Copy link
Member

potiuk commented Dec 5, 2022

Restarted.

@potiuk
Copy link
Member

potiuk commented Dec 11, 2022

Rebased after fixing the "public" tests last week.

@VladaZakharova
Copy link
Contributor Author

@potiuk
Thank you for rebasing :)
Still seems like some errors that are not related to my changes but with Kubernetes executor:
https://github.com/apache/airflow/actions/runs/3668592118/jobs/6201917454#step:6:7783

@Taragolis
Copy link
Contributor

Interesting, the reason of failing tests could be some kind of race condition.

@Taragolis
Copy link
Contributor

Seems like the issue come from this class, which changed by this PR: #28047:

class ResourceVersion:
"""Singleton for tracking resourceVersion from Kubernetes."""
_instance = None
resource_version: dict[str, str] = {}
def __new__(cls):
if cls._instance is None:
cls._instance = super().__new__(cls)
return cls._instance

During the tests time to time additional namespace default come into this singleton.
I do not know is it issue with some tests itself or problem with some internals of k8s executor.

@VladaZakharova
Copy link
Contributor Author

@Taragolis
Thank you for having a look!
Have you faced the same problem?

@potiuk
Copy link
Member

potiuk commented Dec 19, 2022

@Taragolis Thank you for having a look! Have you faced the same problem?

It started to happen recently randomly in a number of PRs and usually goes away afer re-running, typical flakey test

@potiuk
Copy link
Member

potiuk commented Dec 19, 2022

@XD-DENG - maybe you can also take a look to see for a possible race condition it could have caused? It seems like a side-effect - especially that when it fails it fails because ir returns several namespaces (airflow and default) so it is rather appropriate for the multi-namespace change being the culprit.

@XD-DENG
Copy link
Member

XD-DENG commented Dec 19, 2022

Sure I will take a look @potiuk . Please ping me for reminding if you don't hear from me later

@potiuk
Copy link
Member

potiuk commented Dec 19, 2022

OK. I fixed it (and I found that the test missed one assert). PR is coming.

@potiuk
Copy link
Member

potiuk commented Dec 19, 2022

Fix to the flaky exception here: #28475

@VladaZakharova
Copy link
Contributor Author

@potiuk @XD-DENG
Team, thanks a lot for so quick response and helping with this problem :)

@VladaZakharova
Copy link
Contributor Author

@potiuk
Hi! Can you please review this PR?
Many thanks!

@VladaZakharova
Copy link
Contributor Author

@potiuk
Hi Team :)
Can we please review this PR?

@VladaZakharova
Copy link
Contributor Author

@potiuk
Hello Team,
Is it possible to review and merge this changes? Thanks! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants