[AIRFLOW-6181] Add InProcessExecutor #6740
Conversation
Codecov Report
@@ Coverage Diff @@
## master #6740 +/- ##
=========================================
- Coverage 84.37% 84.3% -0.07%
=========================================
Files 672 673 +1
Lines 38214 38350 +136
=========================================
+ Hits 32242 32330 +88
- Misses 5972 6020 +48
Continue to review full report at Codecov.
Do we expect people to use the production DB together with this executor? That seems quite dangerous. If we expect people to set up a local meta DB, do we want to call that out and maybe somehow force the use of that local meta DB?
How do you want to force a local database? In the world of containers it is very difficult to distinguish between a local and a remote database.
I am against forcing any DB. The executor is meant for local, development purposes, meaning that there should be no production DB to mess with. The way I would expect DAG creators to use it is to go with a local environment + SQLite (or any other DB), or use Breeze / another image as their environment.
I think it's exactly the same case as with SequentialExecutor. I consider the executor choice as actually part of deployment (i.e. some people use LocalExecutor, some Celery, some Kubernetes).
if executor_name in executors:
    executor_module = importlib.import_module(executors[executor_name])
    executor = getattr(executor_module, executor_name)
    return executor()
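The lookup-then-import pattern in the snippet above can be sketched with a stdlib stand-in. The registry below maps a class name to the module that defines it; the names are purely illustrative, not Airflow's real executor table:

```python
import importlib

# Illustrative registry: class name -> module that defines it.
# Airflow's real mapping points executor names at their module paths instead.
REGISTRY = {
    "Counter": "collections",
}

def load_instance(name):
    """Import the module lazily, look up the class by name, instantiate it."""
    module = importlib.import_module(REGISTRY[name])
    cls = getattr(module, name)
    return cls()

counter = load_instance("Counter")
```

The benefit of this shape is that executor modules are only imported when actually requested, so an unused executor's heavy dependencies never load.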
Nice :)
I agree. This is muuuch cleaner.
Hey @kaxil @ashb - you might want to take a look. This is the new InProcessExecutor contributed by Databand.ai and perfected by @nuclearpinguin that we told you about. It's really great for testing/debugging of DAGs.
I'd go for it. There is low risk it will break anything and I think it is super useful for anyone testing DAGs. Maybe we should also announce on the devlist/Slack that there is this new way of running the in-process executor. I would love to cherry-pick all those related changes (pylint & others) to 1.10.7 as soon as possible.
@mik-laj ?
My main thought is: where is this useful?
if __name__ == '__main__':
    dag.clear(reset_dag_runs=True)
    dag.run()
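The two lines above hand the whole DAG run to the current Python process. A toy sketch of what a single-process, one-task-at-a-time runner does (the names `run_dag` and the task tuples are made up for illustration, not Airflow's API):

```python
# Because every task runs in this very process, an IDE breakpoint set inside
# a task function is hit directly -- no subprocess is forked.
def run_dag(tasks):
    """Run (name, callable) pairs sequentially in the current process."""
    results = {}
    for name, fn in tasks:
        results[name] = fn()
    return results

if __name__ == "__main__":
    results = run_dag([("extract", lambda: 1), ("load", lambda: 2)])
```

Running the file directly (or under a debugger) executes the tasks inline, which is the property that makes breakpoints work.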
Isn't there already a dag.cli() or something?
Yes, there is. I would like us to tie in to that command somehow please.
dag.cli() does not work as far as I know (planned to fix). I was thinking about adding a debug method, so only this has to be done in the DAG, with no additional configuration for running a file. WDYT?
In all other executors the main process runs subprocess.Popen. That prevents Python debuggers from debugging the new process forked as a result, so you cannot set a breakpoint and hit "Debug" to get it working. The only way we could find so far was remote debugging, but it requires the paid version of IntelliJ and is rather complex to run. With this setting you go back to a "sane" way of debugging DAGs - you just add those two lines, set the breakpoint and use "Debug". I am not aware of any simple way of doing this when subprocesses are started (unless someone knows one, but I have not heard of it so far).

Moreover, with this we will also be able to debug the code inside Breeze using Docker integration (in the same super-intuitive and easy way). You just add two lines to your DAG, point your environment at your Docker image/container, and you can use all the debugging features of your IDE out of the box, not only to initialize but also to execute your DAG. This is super-powerful. We are going to use it at the workshops we have this Friday, and it is so much easier for users to debug DAGs this way. Even if you use other IDEs with good debugging integration, it's going to be super easy, because you debug it exactly the same way as you debug other Python programs (which means it will just work). I know @feluelle tried it before but could not make it work with the other executors.
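The in-process vs. subprocess distinction can be illustrated with plain stdlib code (nothing here is Airflow's API): a breakpoint set in a function called in-process fires in your debugger, while the same code run via subprocess executes in a fresh interpreter that a plain IDE "Debug" session cannot step into.

```python
import subprocess
import sys

def task_fn():
    # A breakpoint set on this line fires when task_fn() is called in-process.
    return 42

# In-process call: fully visible to the debugger attached to this process.
in_process_result = task_fn()

# What subprocess-based executors effectively do: the child runs in a new
# interpreter, outside the parent's debugger, unless you set up remote debugging.
proc = subprocess.run(
    [sys.executable, "-c", "print(42)"],
    capture_output=True, text=True, check=True,
)
```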
(sorry for short replies -- will expand on this tomorrow)
sqlite since sqlite does not support multiple connections.
It executes one task instance at a time. Additionally, to support working
with sensors, the sensor's ``mode`` will be automatically set to "reschedule".
Why? This requirement/hard-coding doesn't make immediate sense to me.
with sensors, sensor's ``mode`` will be automatically set to "reschedule".
with sensors, all sensors' ``mode`` will be automatically set to "reschedule".
The problem is that if a sensor depends on another task doing its work, it will block the executor and will not let the operator do its job. We had a few examples of those when we worked on the GCP operators. Basically, when you want to spin off both a sensor and an operator at the same time you need this. Example here:
https://github.com/apache/airflow/blob/master/airflow/gcp/example_dags/example_bigtable.py
Here we have two sensors waiting, and they might fire in any order - they are waiting for two creates (which trigger the sensors' replication waits to complete). We have more than one sensor and they can fire in any order, as they do not depend on each other.
Using "reschedule" mode makes it robust no matter how strict the dependencies set in the DAG are.
I think this should be called something like DebugExecutor -- InProcess makes sense to us as developers of Airflow, but might not be immediately obvious to users. Plus Debug makes it more obvious that you shouldn't be using it in production.
'airflow test' only runs one task, but here we want to run the whole DAG - sometimes, when you have a complex DAG, you want to run all the preceding steps, and running them manually is not a good idea: you would need to run them in the right sequence by hand. Being able to run the whole DAG is much more convenient.
[debug]
# Used only with DebugExecutor. If set to True, the DAG will fail on the first
# failed task. Helpful for debugging purposes.
fail_fast = False
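A hypothetical sketch of what such a fail_fast flag could do in a sequential, debug-oriented runner (the `run_tasks` helper and state names are illustrative, not Airflow's implementation): stop executing after the first failure instead of attempting the remaining tasks.

```python
def run_tasks(tasks, fail_fast=False):
    """Run (name, callable) pairs in order; with fail_fast, skip everything
    after the first failure so you can inspect the broken task immediately."""
    states = {}
    failed = False
    for name, fn in tasks:
        if failed and fail_fast:
            states[name] = "skipped"
            continue
        try:
            fn()
            states[name] = "success"
        except Exception:
            states[name] = "failed"
            failed = True
    return states

def boom():
    raise RuntimeError("simulated task failure")

strict = run_tasks([("a", lambda: 1), ("b", boom), ("c", lambda: 3)],
                   fail_fast=True)
```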
+1 on this
Minor nit but overall this LGTM!
I think we should merge it after Travis passes :)
Adds a new executor that is meant to be used mainly for debugging and DAG development purposes. This executor executes a single task instance at a time and is able to work with SQLite and sensors.
@nuclearpinguin I'll merge as soon as tests pass :)
Adds a new executor that is meant to be used mainly for debugging and DAG development purposes. This executor executes a single task instance at a time and is able to work with SQLite and sensors. (cherry picked from commit fe2334f)
Adds a new executor that is meant to be used mainly for debugging and DAG development purposes. This executor executes a single task instance at a time and is able to work with SQLite and sensors.
Make sure you have checked all steps below.
Jira
Description
Together with the folks from Databand we created a new executor that is meant to be used mainly for debugging and DAG development purposes. This executor executes a single task instance at a time and is able to work with SQLite and sensors.
Using this executor you can debug your DAGs from IDE 🚀
Tests
Commits
Documentation