Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix multiple issues in Microsoft AzureContainerInstancesOperator #15634

Merged

Conversation

BKronenbitter
Copy link
Contributor

After having worked with with Airflow in Azure using the AzureContainerInstancesOperator I encountered some issues.

  • The sleep command seems to be indented wrongly and not run periodically.
  • When log messages are collected, a check for None is missing.
  • One possible state for logging is missing.
  • Container exits are logged as error, should be info.

@boring-cyborg
Copy link

boring-cyborg bot commented May 3, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, pylint and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: [email protected]
    Slack: https://s.apache.org/airflow-slack

@potiuk
Copy link
Member

potiuk commented May 3, 2021

Thanks @BKronenbitter. Would you mind adding/extending unit tests to cover the failure scenarios ?

@BKronenbitter
Copy link
Contributor Author

Thanks for your quick reaction. Let me address the individual fixes separately:

  • It am not aware of a practical way to test the sleep time in the main loop of the _monitor_logging. If you know a good way, of course I can do so.
  • I for sure can test the addition of the state for logging.
  • Testing the different log level does not make much sense, I would say.
  • Testing the check for missing events is also tricky, since the exception is only logged and not raised further. If there is a good way to test, that no exception is logged within the main loop, of course I can do that.

@potiuk
Copy link
Member

potiuk commented May 4, 2021

Thanks for your quick reaction. Let me address the individual fixes separately:

Just to give you a bit of context, we rely on automated testing heavily. And our philosophy is that with every single change we want to add unit tests so that the coverage grows and probability of regression goes down. Adding any change without tests almost by definition goes in the other direction, so almost any change that has no unit tests (and touches the python code) will end up with a request to add them.

  • It am not aware of a practical way to test the sleep time in the main loop of the _monitor_logging. If you know a good way, of course I can do so.

We usually mock any calls like that - example here where sleep command have been mocked:

https://github.com/apache/airflow/blob/master/tests/cli/commands/test_webserver_command.py

In this particular case we would like to add a test where we have at least two loops where state changes and sleep is called in-between. All that can be mocked rather easily.

  • I for sure can test the addition of the state for logging.

Coool

  • Testing the different log level does not make much sense, I would say.

Agree.

  • Testing the check for missing events is also tricky, since the exception is only logged and not raised further. If there is a good way to test, that no exception is logged within the main loop, of course I can do that.

I think mocking here should help as well.

@BKronenbitter
Copy link
Contributor Author

Thanks a lot for the explanation. I added/changed tests according to your comment. They all fail in the current master and pass with fixes.

@BKronenbitter
Copy link
Contributor Author

Sorry for the failing static check. I was not aware, that it was skipped locally. Should be fixed now and all static checks are succeeding locally.

@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@github-actions
Copy link

The Workflow run is cancelling this PR. Building image for the PR has been cancelled

@BKronenbitter
Copy link
Contributor Author

Hi, I am confused by the output of the failing checks:

  • "Build and test provider packages wheel" seems to fail with error An unexpected error occurred: "https://registry.yarnpkg.com/d3-tip/-/d3-tip-0.9.1.tgz: connect ECONNREFUSED 104.16.21.35:443".
  • "Quarantined tests" apparently runs out of memory: Error: Process completed with exit code 137.
  • "Build docs" seems to fail with /opt/airflow/docs/_inventory_cache/jinja2/objects.inv' not fetchable due to <class 'FileNotFoundError'>: [Errno 2] No such file or directory: '/opt/airflow/docs/_inventory_cache/jinja2/objects.inv.

I am not sure, to what degree these errors are caused by the changes in the PR.

@potiuk
Copy link
Member

potiuk commented May 14, 2021

  • "Build and test provider packages wheel" seems to fail with error An unexpected error occurred: "https://registry.yarnpkg.com/d3-tip/-/d3-tip-0.9.1.tgz: connect ECONNREFUSED 104.16.21.35:443".

Apparently temporary error.

  • "Quarantined tests" apparently runs out of memory: Error: Process completed with exit code 137.

Yep. That's one of the reason why those tests are quarantined. They fail occassionally.

  • "Build docs" seems to fail with /opt/airflow/docs/_inventory_cache/jinja2/objects.inv' not fetchable due to <class 'FileNotFoundError'>: [Errno 2] No such file or directory: '/opt/airflow/docs/_inventory_cache/jinja2/objects.inv.

That was also apparently temporary failure.

I am not sure, to what degree these errors are caused by the changes in the PR.

Likely None. I will close/reopen to trigger the build again.

@potiuk potiuk closed this May 14, 2021
@potiuk potiuk reopened this May 14, 2021
@@ -339,8 +340,8 @@ def _monitor_logging(self, resource_group: str, name: str) -> int:
"Exception while getting logs from container instance, retrying..."
)

if state == "Terminated":
self.log.error("Container exited with detail_status %s", detail_status)
if state in ["Terminated"]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is probably just left over from testing. I can change that back of course.

@github-actions
Copy link

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

@BKronenbitter
Copy link
Contributor Author

I am not sure, to what degree these errors are caused by the changes in the PR.

Likely None. I will close/reopen to trigger the build again.

How do we proceed now? Is there anything I can do from my side to resolve the issues.

@BKronenbitter
Copy link
Contributor Author

@potiuk Please feel gently poked. I think, I addressed all the comments and am basically waiting on your feedback.

@potiuk
Copy link
Member

potiuk commented Jun 18, 2021

There are some failures resulting from your changes being behind main (409 commits). can you please rebase to latest main ? (and ping me please then)

@BKronenbitter BKronenbitter force-pushed the FixesInAzureContainerInstancesOperator branch 2 times, most recently from f852f40 to a7088d6 Compare June 18, 2021 09:45
@BKronenbitter
Copy link
Contributor Author

@potiuk Thanks for the quick answer. The rebase is done. The build of some of the containers is now failing. This issue seems to be present also in the main branch.

@potiuk potiuk closed this Jun 18, 2021
@potiuk potiuk reopened this Jun 18, 2021
@potiuk
Copy link
Member

potiuk commented Jun 18, 2021

Should be fixed now

@BKronenbitter
Copy link
Contributor Author

@potiuk I think, everything is as it should be right now.

@potiuk
Copy link
Member

potiuk commented Jun 23, 2021

Hey @BKronenbitter - can you please rebase again (one last time I hope). we had some changes in main that require rebasing (busy time).

@potiuk
Copy link
Member

potiuk commented Jun 23, 2021

Apologies for the hassle.

@BKronenbitter BKronenbitter force-pushed the FixesInAzureContainerInstancesOperator branch from a7088d6 to 284b2a5 Compare June 23, 2021 14:11
@BKronenbitter
Copy link
Contributor Author

No problem, done.

@potiuk
Copy link
Member

potiuk commented Jun 23, 2021

Running build :P

@potiuk
Copy link
Member

potiuk commented Jun 23, 2021

As surprising as it is... pylint started to detect some errors . I think @ashb you were quite right that pylint is not worth it.. @BKronenbitter - can you please add #pylint: disable=no-member in those two files that are flagged (apparently randomly)

@potiuk
Copy link
Member

potiuk commented Jun 23, 2021

BTW. I just started a vote (after @ashb hinting/lobbying for it for quite a while despite my reservations) to remove pylint ... It brings us more harm than good...

@BKronenbitter BKronenbitter force-pushed the FixesInAzureContainerInstancesOperator branch from c749ec7 to 7d9e66b Compare June 24, 2021 09:27
@BKronenbitter
Copy link
Contributor Author

Done, I added the exception to the whole files, since everything else ended up in a mess of individual exceptions.

@potiuk
Copy link
Member

potiuk commented Jun 24, 2021

Done, I added the exception to the whole files, since everything else ended up in a mess of individual exceptions.

Yeah. I hope we will finally get rid of pylint. Been proponent of it for quite some time but I see how much pain it causes and stopped liking it.

@github-actions github-actions bot added the full tests needed We need to run full set of tests for this PR to merge label Jun 24, 2021
@github-actions
Copy link

The PR most likely needs to run full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly - please rebase it to the latest main at your convenience, or amend the last commit of the PR, and push it with --force-with-lease.

@potiuk potiuk merged commit ffb1fca into apache:main Jun 24, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Jun 24, 2021

Awesome work, congrats on your first merged pull request!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers full tests needed We need to run full set of tests for this PR to merge provider:microsoft-azure Azure-related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants