
Celery green threads incompatibility #28380

Closed
1 of 2 tasks
victorjourne opened this issue Dec 15, 2022 · 5 comments
Labels
area:core kind:bug

Comments

@victorjourne

victorjourne commented Dec 15, 2022

Apache Airflow version

2.5.0

What happened

Celery offers the capability to run tasks concurrently on green threads, using gevent or eventlet, which suits IO-bound tasks.
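
For context, the kind of workload that benefits is a task that spends most of its time waiting on the network. A minimal, illustrative sketch (the fetch_status task and its URL parameter are hypothetical, not part of the reproduction below):

# Hypothetical IO-bound task: under a gevent pool, the HTTP wait should yield
# to other greenlets instead of blocking a whole worker process.
import requests
from airflow.decorators import task

@task
def fetch_status(url: str) -> int:
    # Most of the runtime is spent waiting on the response, which is exactly
    # the kind of wait green threads are meant to overlap.
    return requests.get(url, timeout=30).status_code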

I didn't manage to set up a simple Airflow project that follows this basic idea, as I pointed out in this discussion.

  1. Let us say N concurrent tasks are scheduled via dynamic task mapping onto a gevent Celery pool of size C.
  2. The first C tasks are executed correctly and the metadata database is updated, but their status in the result backend (here Postgres) is not updated (why?). In the Flower UI, the tasks are still shown as active.
  3. After 10 minutes, an attempt is made to complete the tasks. As a result (and despite worrying logs), the task status turns to success.
  4. The same scenario repeats for the remaining N - C tasks.

This can be followed in these log files:

What you think should happen instead

Instead, the result backend database should be updated as soon as a task completes, so that other tasks can start running quickly.

At first I suspected the Postgres result backend to be the culprit, since it is not clear that psycopg2 handles concurrent writes under gevent.

But after seeing worker log warnings identical to #8164 about the gevent monkey-patching, I have doubts.
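
For reference, psycopg2 can be made cooperative under gevent by installing a wait callback, for instance via the psycogreen helper. This is only a sketch of that general technique, assuming the psycogreen package is available; it is not something the stock Airflow worker is confirmed to do:

# Illustrative only: make psycopg2 yield to the gevent hub instead of blocking.
# Assumes the psycogreen package is installed; shown to clarify what a
# "gevent-friendly psycopg2" means, not as Airflow's actual behaviour.
from gevent import monkey

monkey.patch_all()  # patch the standard library before sockets are imported elsewhere

from psycogreen.gevent import patch_psycopg

patch_psycopg()  # registers a gevent-based wait callback for psycopg2

import psycopg2

# With the wait callback installed, waiting on the database no longer blocks
# the whole worker process, so other greenlets can keep running.
conn = psycopg2.connect("postgresql://postgres:airflow@postgres/airflow")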

How to reproduce

  • Strictly follow the Airflow 2.5.0 Docker Compose instructions.

  • Add the env var AIRFLOW__CELERY__POOL: 'gevent' to airflow-common-env.

  • Launch this simple DAG from the Airflow UI:

from datetime import datetime

from airflow import DAG
from airflow.decorators import task

with DAG(dag_id="simple_mapping",
        catchup=False,
        start_date=datetime(2022, 3, 4),
        max_active_tasks=200) as dag:

    @task
    def add_one(x: int):
        return x + 1

    @task
    def sum_it(values):
        total = sum(values)
        print(f"Total was {total}")

    xlist = list(range(25))
    added_values = add_one.expand(x=xlist)

    sum_it(added_values)

  • Observe the airflow-worker logs, the task flow in the Airflow UI, and the active tasks in Flower.

Operating System

Ubuntu 22.04.1 LTS

Versions of Apache Airflow Providers

No response

Deployment

Docker-Compose

Deployment details

Docker Engine - Community - Version: 20.10.18
Docker Compose version v2.10.2

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@victorjourne added the area:core and kind:bug labels on Dec 15, 2022
@boring-cyborg

boring-cyborg bot commented Dec 15, 2022

Thanks for opening your first issue here! Be sure to follow the issue template!

@victorjourne changed the title from "Celery green treads incompatibility" to "Celery green threads incompatibility" on Dec 15, 2022
@potiuk
Member

potiuk commented Dec 15, 2022

As I explained in Slack, it's likely fixed by #28283.

Provisionally closing, until you test and confirm whether it is fixed by that change.

@potiuk closed this as completed on Dec 15, 2022
@victorjourne
Author

victorjourne commented Dec 16, 2022

@potiuk, I still have the issue. Nevertheless, the monkey-patch warnings have disappeared.
Tested with Breeze (breeze start-airflow --python 3.7 --backend postgres --postgres-version 13 --integration celery), then launched the airflow worker and flower.

#files/airflow-breeze-config/variables.env
export AIRFLOW__CORE__EXECUTOR=CeleryExecutor
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN='postgresql+psycopg2://postgres:airflow@postgres/airflow'
export AIRFLOW__CELERY__RESULT_BACKEND='db+postgresql://postgres:airflow@postgres/airflow'
export AIRFLOW__CELERY__BROKER_URL='redis://:@redis:6379/0'
export AIRFLOW__CELERY__POOL=gevent
export AIRFLOW__CELERY__WORKER_CONCURRENCY=10
export _AIRFLOW_PATCH_GEVENT=1
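
To double-check that the worker process actually picks up these values, they can be read back through Airflow's configuration API; a small sanity-check sketch, run inside the worker environment:

# Sanity check: print the settings the running process actually resolves.
from airflow.configuration import conf

print(conf.get("core", "executor"))              # expect: CeleryExecutor
print(conf.get("celery", "pool"))                # expect: gevent
print(conf.get("celery", "worker_concurrency"))  # expect: 10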

I am quite sure now that the Celery broker or backend is responsible. The result backend database is not updated when some tasks finish, contrary to the metadata database...
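
One way to observe the discrepancy directly is to query both sets of tables; a rough sketch, assuming the default table name (celery_taskmeta) of Celery's database result backend and the connection strings above, where both backends point at the same Postgres instance:

# Rough check: compare Celery's result-backend rows with Airflow's metadata rows.
# Assumes the default Celery DB-backend table celery_taskmeta and the Postgres
# connection from the config above; adjust to your setup.
import sqlalchemy as sa

engine = sa.create_engine("postgresql+psycopg2://postgres:airflow@postgres/airflow")

with engine.connect() as conn:
    # Task states as recorded by the Celery result backend.
    celery_states = conn.execute(
        sa.text("SELECT status, count(*) FROM celery_taskmeta GROUP BY status")
    ).fetchall()
    # Task instance states as recorded in the Airflow metadata database.
    airflow_states = conn.execute(
        sa.text(
            "SELECT state, count(*) FROM task_instance "
            "WHERE dag_id = 'simple_mapping' GROUP BY state"
        )
    ).fetchall()

print("celery_taskmeta:", celery_states)
print("task_instance: ", airflow_states)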

What do you think?

@potiuk
Member

potiuk commented Dec 16, 2022

@potiuk, I still have the issue. Nevertheless, the monkey-patch warnings have disappeared.

This is good news. At least it allows us to merge that change, because the warnings are clearly removed.

I am quite sure now that the Celery broker or backend is responsible. The result backend database is not updated when some tasks finish, contrary to the metadata database...

What do you think?

There are many things that could have caused this behaviour, but without any evidence I can only guess, and I have no prior similar experience to base it on. As discussed on Slack, the Celery executor is not supported in Breeze as a "working" feature (issue opened: #28412). It might well be some issue in the way things are configured in Breeze. If you would like to explore it further and try to investigate how to implement it, that would be awesome. But this is not a high priority, really, and I would prefer that someone (like you) who is a Celery user does some more investigation into how to make it work in Breeze. This has happened in the past, and I would like more of the contributors to spend their time improving our dev environment; someone who understands what gevent is and wants to use it has enough incentive to do more investigation. Happy to help with it, but I need more evidence and some deeper debugging from your side if we are to make progress there.

I am happy to have ideas bounced off me and to help explain the ins and outs of Breeze to guide such a person. The #breeze channel on Slack is likely the best place for that.

@victorjourne
Author

victorjourne commented Dec 18, 2022

After testing many configurations of the Celery backend, the solution I found is the combination of:

Thus, the whole issue with Celery green threads comes down to the way Airflow calls the result backend. Something there is blocking the Celery workers from finishing. I would dig deeper into the code, but I am quite surprised not to see many users suffering from this issue, since concurrently running IO-bound tasks with green threads is quite a common pattern. To achieve that, do you use the CeleryExecutor, or the LocalExecutor instead?
